A JDBC bug or problem relating to string length in Java vs. PG (long)

From: joe user
Subject: A JDBC bug or problem relating to string length in Java vs. PG (long)
Date:
Msg-id: 20030901015316.1284.qmail@web20416.mail.yahoo.com
Replies: Re: A JDBC bug or problem relating to string length in Java
List: pgsql-jdbc
String lengths aren't the same in Java and Postgres.
_All_ Strings in Java are stored in UTF-16 (1 char =
16 bits).  That means that a string of 100 Chinese
characters will have a Java String length of 100.
But, Postgres by default does not store all strings as
UTF-16.  (I couldn't find in the docs what the default
is, except that it takes it from the computer it is
installed on, which in my case is plain-vanilla Redhat
9, which I assume uses a Latin encoding as the
default, which is 1 char = 1 byte, right?)  It has
8-bit chars (is this correct?), so that the string of
100 Chinese chars would have a length greater than 100
in Postgres-world.
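
To make that concrete, here is a tiny sketch, assuming
the server encoding is UTF-8 (what 7.3 calls UNICODE);
the class name is just for illustration:

    import java.io.UnsupportedEncodingException;

    public class LengthDemo {
        public static void main(String[] args)
                throws UnsupportedEncodingException {
            // Two Chinese characters ("ni hao").
            String s = "\u4f60\u597d";
            // Java counts UTF-16 code units: prints 2.
            System.out.println(s.length());
            // UTF-8 uses 3 bytes per CJK character: prints 6.
            System.out.println(s.getBytes("UTF-8").length);
        }
    }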

Where this causes problems is in VARCHAR types.  Let's
say I have a table with a VARCHAR(100).  If I try to
store a string longer than 100 chars into that, I will
get an SQLException, so, being the defensive
programmer that I am, I write a little static method
called truncate that looks like this:

    // Truncate s to at most l chars, where "chars" means
    // Java's UTF-16 code units, not database bytes.
    static String truncate(String s, int l) {
        if (s.length() <= l) return s;
        else return s.substring(0, l);
    }

and then I do this kind of thing:

    preparedStatement.setString(1,
        truncate(theString, 100));

which should never throw an SQLException, because I
know the String length is safe... except that this is
a web application, used by people all over the world,
including people whose languages use non-Latin
characters.

What is the solution to this?  I think the real
solution would be to amend the JDBC spec to require a
string length method somewhere that reports the length
the database is going to see for a given string.

Another solution would be to convert the string to a
byte[] using UTF-8 (I assume this is what Postgres
does) and then look at the length of that byte[].
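
Here is roughly what I mean, as a sketch only.  It
assumes the server encoding is UTF-8 and that the
VARCHAR limit is really a byte limit, which is exactly
the part I am unsure about; it also refuses to cut a
multi-byte character in half:

    // Needs: import java.io.UnsupportedEncodingException;
    static String truncateToUtf8Bytes(String s, int maxBytes)
            throws UnsupportedEncodingException {
        // Fast path: the whole string already fits.
        if (s.getBytes("UTF-8").length <= maxBytes) return s;
        int bytes = 0;
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            int width;
            if (c < 0x80) width = 1;              // ASCII
            else if (c < 0x800) width = 2;        // two-byte range
            else if (c >= 0xD800 && c <= 0xDBFF)
                width = 4;                        // surrogate pair
            else width = 3;                       // rest of the BMP, incl. CJK
            if (bytes + width > maxBytes) break;
            bytes += width;
            i += (width == 4) ? 2 : 1;            // a pair is two Java chars
        }
        return s.substring(0, Math.min(i, s.length()));
    }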

It would be possible to take the be-very-conservative
approach and, for a VARCHAR(100), truncate the Java
String to 50, but even that will not work: every
character from U+0800 upward (which covers all of the
CJK ranges) takes 3 bytes in UTF-8, so the safe
divisor is 3, not 2.

Another approach would be for me to use TEXT instead
of VARCHAR.  I have various reasons for not wanting to
do that if I could find some other solution.

Another option would be to tell Postgres that I want
to use UTF-16 for its encoding.  I don't really care
about the disk space, but the problem is that a given
disk page would then hold only half as much data, so
the OS, which caches disk pages in RAM, could do only
half the effective caching, and I would need to buy
twice as much RAM to get the same performance, which
would be a major hardware investment.

Finally, it would be good if there were a Postgres
option that says, "Silently truncate strings which are
too long."  That's all I really want to do anyway.

It would actually be cool if java.lang.String had a
length(String encoding) method for this purpose.
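
In the meantime the same number is one getBytes() call
away; a throwaway helper (the name is mine, nothing
standard) would be:

    // Hypothetical helper; java.lang.String has no such
    // method, but getBytes() gives the same answer.
    // Needs: import java.io.UnsupportedEncodingException;
    static int encodedLength(String s, String charsetName)
            throws UnsupportedEncodingException {
        return s.getBytes(charsetName).length;
    }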

Any thoughts on this?  All this is with PG 7.3.4, btw.

