RE: PostgreSQL and Unicode
From:    Tatsuo Ishii
Subject: RE: PostgreSQL and Unicode
Date:
Msg-id:  20000516160855E.t-ishii@sra.co.jp
List:    pgsql-hackers
> My understanding of the problem with UTF-8 is this. Functionally, it is
> equivalent to UCS-2; that is, you can encode any Unicode character in
> UTF-8 that you could encode in UCS-2.
>
> The problem we've run into is only related to Postgres. For example, we
> had a field that was fixed at 20 characters. If we put in ASCII, then we
> could put in all 20 characters. If we put in UTF-8 encoded Japanese, then
> (depending on which characters were used) we got about 3 UTF-8 bytes for
> each Japanese character. Aside from going from 20 characters to about 7
> (*problem #1*), we also now have unpredictable behavior. Some characters,
> like Japanese, encode at a 3:1 ratio. UTF-8 can go as high as a 6:1
> encoding ratio for some languages (I don't know which offhand); this is
> *problem #2*. Finally, as a side effect of this, the string was simply
> truncated, so we sometimes got only a partial UTF-8 character in the
> database. This made the decoding either fail or produce weird results
> (*problem #3*).

Yes, I have noticed this problem too. But don't we have the same problem
with UCS-2, with a 2:1 ratio, then?

I think we should fix it this way: char(10) should mean 10 characters,
not 10 bytes, no matter what encoding we use.

I will tackle this problem for 7.1. What do you think, Rainer? Are you
still unhappy with the solution above?
--
Tatsuo Ishii
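[For concreteness, a minimal SQL sketch of the behavior described above.
It assumes a database created with the UNICODE (UTF-8) encoding; the
table name, column name, and sample strings are made up for illustration.]

    -- hypothetical table in a database created with UTF-8 encoding
    CREATE TABLE memo (body CHAR(20));

    -- ASCII text: 1 byte per character, so all 20 characters fit
    INSERT INTO memo VALUES ('abcdefghijklmnopqrst');

    -- Japanese text: 3 bytes per character in UTF-8; if CHAR(20)
    -- counts bytes, only 6 full characters (18 bytes) fit, and a
    -- byte-level cut can leave a partial UTF-8 sequence at the end
    -- (problems #1 and #3 above)
    INSERT INTO memo VALUES ('日本語のテキスト');

Under the fix proposed above, the length limit would count characters
rather than bytes, so both inserts would store their strings in full.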