Re: Unicode problems on IRC
From | Andrew - Supernews |
---|---|
Subject | Re: Unicode problems on IRC |
Date | |
Msg-id | slrnd5iren.2ilg.andrew+nonews@trinity.supernews.net |
In reply to | Re: Unicode problems on IRC ("John Hansen" <john@geeknet.com.au>) |
List | pgsql-hackers |
On 2005-04-10, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andrew - Supernews <andrew+nonews@supernews.com> writes:
>> I think you will find that this impression is actually false. Or that at
>> the very least, _correct_ verification of UTF-8 sequences will still
>> catch essentially all cases of non-utf-8 input mislabelled as utf-8
>> while allowing the full range of Unicode codepoints.
>
> Yeah? Cool. Does John's proposed patch do it "correctly"?
>
> http://candle.pha.pa.us/mhonarc/patches2/msg00076.html

It looks correct to me. The only thing I think that code will let through
incorrectly are encoded surrogates; those could be fixed by adding one line:

     switch (*source) {
         /* no fall-through in this inner switch */
         case 0xE0: if (a < 0xA0) return false; break;
+        case 0xED: if (a > 0x9F) return false; break;
         case 0xF0: if (a < 0x90) return false; break;
         case 0xF4: if (a > 0x8F) return false; break;

(Accepting encoded surrogates in utf-8 was always forbidden by most
specifications that used utf-8, though the Unicode specs originally were
not absolute about it (but forbade generating them). Current Unicode
specifications define those sequences as malformed. Surrogates are the
code points from 0xD800 - 0xDFFF, which are used in UTF-16 to encode
characters 0x10000 - 0x10FFFF as two 16-bit values; UTF-8 requires that
such characters are encoded directly rather than via surrogate pairs.)

--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services
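To make the effect of the extra 0xED case concrete, here is a minimal standalone sketch of a single-sequence check built around the same inner switch. It is not John's patch: the function name utf8_seq_is_valid, its signature, and the lead-byte and trailing-byte range checks around the switch are assumptions added for illustration only.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper (not taken from the patch under discussion):
 * check that the `len` bytes at `s` form exactly one well-formed
 * UTF-8 sequence.
 */
static bool utf8_seq_is_valid(const unsigned char *s, size_t len)
{
    unsigned char lead, a;
    size_t i;

    if (len == 0 || len > 4)
        return false;

    lead = s[0];
    if (len == 1)
        return lead < 0x80;             /* plain ASCII */

    /* The lead byte must announce exactly `len` bytes. */
    if (len == 2 && (lead < 0xC2 || lead > 0xDF))
        return false;                   /* 0xC0/0xC1 would be overlong */
    if (len == 3 && (lead < 0xE0 || lead > 0xEF))
        return false;
    if (len == 4 && (lead < 0xF0 || lead > 0xF4))
        return false;

    /* Every trailing byte must be in 0x80..0xBF. */
    for (i = 1; i < len; i++)
        if (s[i] < 0x80 || s[i] > 0xBF)
            return false;

    /* Tighter bounds on the second byte for four special lead bytes. */
    a = s[1];
    switch (lead) {
        case 0xE0: if (a < 0xA0) return false; break; /* overlong 3-byte form */
        case 0xED: if (a > 0x9F) return false; break; /* surrogates D800..DFFF */
        case 0xF0: if (a < 0x90) return false; break; /* overlong 4-byte form */
        case 0xF4: if (a > 0x8F) return false; break; /* above U+10FFFF */
        default: break;
    }
    return true;
}
```

With this sketch, the bytes ED A0 80 (an encoded U+D800) are rejected, while ED 9F BF (U+D7FF) and F0 90 80 80 (U+10000 encoded directly as four bytes) are accepted.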