Re: client libpq multibyte support
From | Tatsuo Ishii |
---|---|
Subject | Re: client libpq multibyte support |
Date | |
Msg-id | 20000505193618U.t-ishii@sra.co.jp |
In response to | client libpq multibyte support (SAKAIDA Masaaki <sakaida@psn.co.jp>) |
Responses | Re: client libpq multibyte support |
List | pgsql-hackers |
> > That's because non-MB client does not understand how "Shift JIS
> > kanji" consists of letters with different width bytes. The similar
> > problem would happen with the Big5 character set (traditional
> > Chinese), also. Unlike other character sets, these should be treated
> > carefully since they include the same bit patterns as ASCII and that
> > makes non-MB clients confused.
>
> I'm confused though, this would mean that somewhere in the string
> `SJIS_KANJI' a backslash was found. But that's all ASCII characters.
> Aren't the characters 0-127 always identical in any character set?

Not always. In Shift JIS and Big5, byte values in the 0-127 range can
also appear as the second byte of a two-byte character. So "how do you
distinguish them from ASCII?", you might ask. Here are the rules for
Shift JIS:

1. Parse from the beginning byte of the string in question. If it is
   in 0-127, it is ASCII (a single-byte letter).

2. If it is between 0xa1 and 0xdf, it is a "1 byte kana" (a
   single-byte letter).

3. Otherwise it is a "kanji" (a double-byte letter). In this case the
   second byte might be in the range 0-127 (this is the source of the
   problem).

I think Big5 has a similar but slightly different rule (I don't
remember it precisely now).

Other encodings that use bytes in the 0-127 range for something other
than ASCII include:

o UCS-2 and UCS-4 (Unicode)
o any 7-bit encoded ISO 2022 based charset, for example ISO-2022-JP

--
Tatsuo Ishii
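
The three Shift JIS rules above can be sketched in a few lines of Python. This is only an illustration of the rules as stated in the message, not libpq's actual code: the function names `sjis_char_lengths` and `find_real_backslashes` are made up for this example, and no validation of the second byte is attempted. The sample uses the kanji encoded as 0x95 0x5c, whose second byte has the same bit pattern as an ASCII backslash.

```python
def sjis_char_lengths(data: bytes) -> list:
    """Return the byte length (1 or 2) of each character in a Shift JIS
    byte string, walking from the first byte as the rules require."""
    lengths = []
    i = 0
    while i < len(data):
        b = data[i]
        if b <= 0x7F:
            n = 1  # rule 1: ASCII, single byte
        elif 0xA1 <= b <= 0xDF:
            n = 1  # rule 2: "1 byte kana", single byte
        else:
            n = 2  # rule 3: kanji, this byte plus the next one
        lengths.append(n)
        i += n
    return lengths


def find_real_backslashes(data: bytes) -> list:
    """Return offsets of 0x5c bytes that are standalone characters,
    skipping any 0x5c that is the second byte of a kanji."""
    offsets = []
    i = 0
    for n in sjis_char_lengths(data):
        if n == 1 and data[i] == 0x5C:
            offsets.append(i)
        i += n
    return offsets


# 'a', a kanji whose second byte is 0x5c, a real backslash, 'b'
sample = b"a\x95\x5c\\b"

naive = [i for i, b in enumerate(sample) if b == 0x5C]
print(naive)                          # what a byte-by-byte client sees
print(find_real_backslashes(sample))  # what an MB-aware client sees
```

A client that scans byte by byte reports two backslashes in the sample, while the rule-based scan reports only the one that is really a single-byte character. That mismatch is the confusion described in the quoted text.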