
UNICODE encoding and JDBC-related issues

From: Chris Kratz
Date: Wed, 6 Apr 2005

Hello All,

I posted this on the general mailing list several days ago without a
response, so I am posting here.  Does anyone here have any wisdom or
experience they don't mind sharing?

Our production database was created with the default SQL_ASCII encoding.  It
appears that some of our users have entered characters above 127 (accented
vowels, etc.) into the system.  None of the tools we currently use had a
problem with this until recently; everything just worked.

I was testing some reporting tools this past weekend and have been playing
with JasperReports[1].  JasperReports is a Java-based reporting tool that
reads data from the database via JDBC.  When I initially tried to generate
reports, the JDBC connection would fail with the following message:

org.postgresql.util.PSQLException: Invalid character data was found.

Googling eventually turned up a message on the pgsql-jdbc list detailing the
problem[2].  Basically, Java cannot map these bytes above 127 into Unicode,
which is what Java requires internally.
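
For reference, the failing report query boils down to something like the
following (connection details and the table/column names are changed); the
exception appears as soon as getString() reads a row containing the bad
bytes:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class EncodingRepro {
        public static void main(String[] args) throws Exception {
            Class.forName("org.postgresql.Driver"); // needed for older drivers
            Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/mydb", "user", "secret");
            Statement stmt = conn.createStatement();
            ResultSet rs = stmt.executeQuery("SELECT text_field FROM some_table");
            while (rs.next()) {
                // Throws "org.postgresql.util.PSQLException: Invalid
                // character data was found." on rows the driver cannot
                // decode into Unicode.
                System.out.println(rs.getString(1));
            }
            rs.close();
            stmt.close();
            conn.close();
        }
    }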

After some more googling, I found that if I took a recent database dump, ran
it through iconv[3], and then recreated the database with a Unicode encoding,
everything worked.

1. Is there any way to do an iconv-type translation inline in a SQL statement,
i.e. select translate(text_field, unicode) from table...?  By the way, SET
client_encoding=UNICODE does not work in this situation; in fact, the JDBC
driver for Postgres seems to do this automatically.

2. I'm really not sure I want to change the encoding of our main database to
Unicode.  Is there a performance loss when going to a UNICODE database
encoding?  What about sorts, etc.?  I'm really worried about unintended side
effects of moving from SQL_ASCII to UNICODE.

3. Is there any other way around this issue?  Or are we living dangerously by
trying to store non-ASCII data in a database created with an ASCII encoding?

4. Has anyone else gone through a conversion like this?  Are there any
gotchas we should look out for?

Thanks for your time,

-Chris

We are using Postgres 7.4.5 on Linux.

[1] http://jasperreports.sourceforge.net/
[2] http://archives.postgresql.org/pgsql-jdbc/2004-10/msg00280.php
[3] iconv -f iso8859-1 -t utf-8 < dbsnapshot.dumpall > dump-utf-8.dumpall
--
Chris Kratz

Re: UNICODE encoding and JDBC-related issues

From: Kris Jurka
Date: Wed, 6 Apr 2005

On Wed, 6 Apr 2005, Chris Kratz wrote:

> Our production database was created with the default SQL_ASCII encoding.  It
> appears that some of our users have entered characters above 127 (accented
> vowels, etc.) into the system.  None of the tools we currently use had a
> problem with this until recently; everything just worked.
>
> I was testing some reporting tools this past weekend and have been playing
> with JasperReports[1].  JasperReports is a Java-based reporting tool that
> reads data from the database via JDBC.  When I initially tried to generate
> reports, the JDBC connection would fail with the following message:
>
> org.postgresql.util.PSQLException: Invalid character data was found.
>
> Googling eventually turned up a message on the pgsql-jdbc list detailing the
> problem[2].  Basically, Java cannot map these bytes above 127 into Unicode,
> which is what Java requires internally.
>
> After some more googling, I found that if I took a recent database dump, ran
> it through iconv[3], and then recreated the database with a Unicode
> encoding, everything worked.
>
> 1. Is there any way to do an iconv-type translation inline in a SQL
> statement, i.e. select translate(text_field, unicode) from table...?  By the
> way, SET client_encoding=UNICODE does not work in this situation; in fact,
> the JDBC driver for Postgres seems to do this automatically.

You can't do the translation inline; how would a driver interpret the
results of SELECT translate(field1, unicode), translate(field2, latin1)?

The driver does SET client_encoding, which does work for all real server
encodings.  The problem is that SQL_ASCII is not a real encoding: it
accepts any encoding and cannot do conversions to other encodings.  Your
db right now could easily have a mix of encodings.
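
As a quick sanity check, you can see what the two sides have agreed on from
JDBC itself; SHOW runs as an ordinary query (connection details below are
placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ShowEncodings {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/mydb", "user", "secret");
            Statement stmt = conn.createStatement();
            String[] settings = {"server_encoding", "client_encoding"};
            for (int i = 0; i < settings.length; i++) {
                // Each SHOW returns a one-row, one-column result set.
                ResultSet rs = stmt.executeQuery("SHOW " + settings[i]);
                if (rs.next()) {
                    System.out.println(settings[i] + " = " + rs.getString(1));
                }
                rs.close();
            }
            stmt.close();
            conn.close();
        }
    }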

> 2. I'm really not sure I want to change the encoding of our main database to
> Unicode.  Is there a performance loss when going to a UNICODE database
> encoding?  What about sorts, etc.?  I'm really worried about unintended side
> effects of moving from SQL_ASCII to UNICODE.

You don't need to use Unicode, but you must select another encoding.  If
you'd like to stick with a single-byte encoding, perhaps LATIN1 would be
appropriate for you.


> 3. Is there any other way around this issue?  Or are we living dangerously by
> trying to store non-ASCII data in a database created with an ASCII encoding?

You are living dangerously.

> 4. Has anyone else gone through a conversion like this?  Are there any
> gotchas we should look out for?

The gotchas here are to make sure your other client tools still work
against the new database.

> [3] iconv -f iso8859-1 -t utf-8 < dbsnapshot.dumpall > dump-utf-8.dumpall

I see your data really is LATIN1.  Perhaps you should use that as your db
encoding.  That should keep your existing client tools happy as well as
the JDBC driver.

Kris Jurka

Re: UNICODE encoding and JDBC-related issues

From: "Igor Postelnik"
Date: Wed, 6 Apr 2005

> > 2. I'm really not sure I want to change the encoding of our main database
> > to Unicode.  Is there a performance loss when going to a UNICODE database
> > encoding?  What about sorts, etc.?  I'm really worried about unintended
> > side effects of moving from SQL_ASCII to UNICODE.
>
> You don't need to use Unicode, but you must select another encoding.  If
> you'd like to stick with a single-byte encoding, perhaps LATIN1 would be
> appropriate for you.

I've asked this before on the performance list but didn't get any reply.
Is there a substantial performance difference between using SQL_ASCII,
LATIN1, or UNICODE?

> The driver does SET client_encoding, which does work for all real server
> encodings.  The problem is that SQL_ASCII is not a real encoding: it
> accepts any encoding and cannot do conversions to other encodings.  Your
> db right now could easily have a mix of encodings.

ISTM that when you create a database with SQL_ASCII encoding you decide
to deal with character set issues in the applications. Why is the JDBC
driver dictating how the application handles character set issues?

-Igor


Re: UNICODE encoding and JDBC-related issues

From: Kris Jurka
Date:

On Wed, 6 Apr 2005, Igor Postelnik wrote:

> I've asked this before on the performance list but didn't get any reply.
> Is there a substantial performance difference between using SQL_ASCII,
> LATIN1, or UNICODE?

Performance where?  Backend performance in terms of string comparisons and
sorting is driven off of locale, not encoding.  You may use the C locale with
the UNICODE encoding, for example, so that should not be an issue.  As for
the JDBC driver, it always wants data coming back to it in Unicode.  If
you've got a Unicode db, no conversion is necessary.  If you've got a
SQL_ASCII db, no conversion is possible.  If you've got a LATIN1 db,
conversion will happen, but I don't know what the cost of that is.
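
If you want a number for your own data, a rough timing sketch like the one
below would show the difference; run it once against a LATIN1 copy and once
against a UNICODE copy of the same table (connection details and the
table/column names are placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class DecodeCost {
        public static void main(String[] args) throws Exception {
            // Pass the database name on the command line, e.g. mydb_latin1
            // or mydb_unicode.
            Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/" + args[0], "user", "secret");
            Statement stmt = conn.createStatement();
            long start = System.currentTimeMillis();
            ResultSet rs = stmt.executeQuery("SELECT text_field FROM some_table");
            int rows = 0;
            while (rs.next()) {
                rs.getString(1); // forces the driver's charset conversion
                rows++;
            }
            long elapsed = System.currentTimeMillis() - start;
            System.out.println(rows + " rows in " + elapsed + " ms");
            rs.close();
            stmt.close();
            conn.close();
        }
    }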


> ISTM that when you create a database with SQL_ASCII encoding you decide
> to deal with character set issues in the applications. Why is the JDBC
> driver dictating how the application handles character set issues?

If the only API the JDBC driver provided was ResultSet.getBytes(), then that
would be OK (note this is the only API libpq provides).  To provide
getString(), the driver must know what encoding the returned data is really
in.  A database encoding of SQL_ASCII tells us nothing, so we can do nothing
about it.  It has been suggested in the past to allow the real database
encoding of a SQL_ASCII database to be specified as a URL parameter, but I am
of the opinion that this just masks the problem rather than solving it.  Data
should be in a correctly encoded database.  If you store Unicode data in a
SQL_ASCII database, then things like varchar(N) now limit the number of bytes
instead of the number of characters, as they should.  With SQL_ASCII there is
no restriction on what data can be entered, and you can get yourself into a
real mess with different clients entering data in different encodings.  Do
yourself a favor and pick a real encoding.

Kris Jurka