Re: UNICODE characters above 0x10000
From: Dennis Bjorklund
Subject: Re: UNICODE characters above 0x10000
Date:
Msg-id: Pine.LNX.4.44.0408070820300.9559-100000@zigo.dhs.org
In reply to: Re: UNICODE characters above 0x10000 (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: UNICODE characters above 0x10000
List: pgsql-hackers
On Sat, 7 Aug 2004, Tom Lane wrote:

> shy of a load --- for instance I see that pg_utf_mblen thinks there are
> no UTF8 codes longer than 3 bytes whereas your code goes to 4.  I'm not
> an expert on this stuff, so I don't know what the UTF8 spec actually
> says.  But I do think you are fixing the code at the wrong level.

I can give some general info about UTF-8. This is how it is encoded:

    character range      encoding
    -------------------  ---------
    00000000 - 0000007F: 0xxxxxxx
    00000080 - 000007FF: 110xxxxx 10xxxxxx
    00000800 - 0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
    00010000 - 001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    00200000 - 03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    04000000 - 7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

If the first byte starts with a 1, the number of leading ones gives the
length of the UTF-8 sequence, and the rest of the bytes in the sequence
always start with 10 (this makes it possible to look anywhere in the
string and quickly find the start of a character). This also means that
the start byte can never begin with 7 or 8 ones; that is illegal and
should be tested for and rejected. So the longest UTF-8 sequence is 6
bytes (and the largest character value needs 31 bits, which fits in 4
bytes).

-- 
/Dennis Björklund