Re: Unicode support
From | Greg Stark |
---|---|
Subject | Re: Unicode support |
Date | |
Msg-id | 4136ffa0904140849h36bdb5adl8b4e765b1906c4ed@mail.gmail.com |
In response to | Re: Unicode support (Peter Eisentraut <peter_e@gmx.net>) |
Responses |
Re: Unicode support
Re: Unicode support Re: Unicode support |
List | pgsql-hackers |
On Tue, Apr 14, 2009 at 1:32 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On Monday 13 April 2009 22:39:58 Andrew Dunstan wrote:
>> Umm, but isn't that because your encoding is using one code point?
>>
>> See the OP's explanation w.r.t. canonical equivalence.
>>
>> This isn't about the number of bytes, but about whether or not we should
>> count characters encoded as two or more combined code points as a single
>> char or not.
>
> Here is a test case that shows the problem (if your terminal can display
> combining characters (xterm appears to work)):
>
> SELECT U&'\00E9', char_length(U&'\00E9');
>  ?column? | char_length
> ----------+-------------
>  é        |           1
> (1 row)
>
> SELECT U&'\0065\0301', char_length(U&'\0065\0301');
>  ?column? | char_length
> ----------+-------------
>  é        |           2
> (1 row)

What's really at issue is "what is a string?". That is, is it a sequence of
characters or a sequence of code points? If it's the former then we would
also have to prohibit certain strings such as U&'\0301' entirely. And we
would have to make substr() pick out the right number of code points, etc.

--
greg
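
To make the two readings of "string" concrete, here is a rough sketch in Python (not PostgreSQL code, and not a proposal for how the backend should do it). The helper names split_clusters, char_length_code_points, char_length_characters and substr_characters are made up for this illustration. It also assumes a deliberately simplified notion of "character": a base code point plus any combining marks that follow it. Real grapheme cluster rules are more involved than this.

```python
import unicodedata


def split_clusters(s: str) -> list[str]:
    """Group each code point with the combining marks that follow it.

    Simplifying assumption: a "character" is one base code point plus any
    trailing code points whose canonical combining class is non-zero.
    """
    clusters: list[str] = []
    for ch in s:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch   # attach the mark to the preceding cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


def char_length_code_points(s: str) -> int:
    """Length under the "sequence of code points" reading."""
    return len(s)


def char_length_characters(s: str) -> int:
    """Length under the "sequence of characters" reading (simplified)."""
    return len(split_clusters(s))


def substr_characters(s: str, start: int, count: int) -> str:
    """substr() counting whole clusters; start is 1-based, as in SQL."""
    return "".join(split_clusters(s)[start - 1:start - 1 + count])


precomposed = "\u00e9"       # U+00E9, one code point
decomposed = "\u0065\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
lone_mark = "\u0301"         # combining mark with no base character

print(char_length_code_points(precomposed), char_length_characters(precomposed))
print(char_length_code_points(decomposed), char_length_characters(decomposed))
# The lone mark has nothing to attach to, which is the awkward case
# that a "sequence of characters" definition would have to reject.
print(unicodedata.combining(lone_mark) != 0)
```

Under the code point reading the two spellings of é have lengths 1 and 2, as in Peter's test case. Under the character reading both have length 1, and substr_characters(decomposed + "a", 1, 1) returns both code points of the first character instead of splitting it.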