Re: Unicode is not UTF-8. was :psqlODBC-Driver Test / text
From | Bart Samwel |
---|---|
Subject | Re: Unicode is not UTF-8. was :psqlODBC-Driver Test / text |
Date | |
Msg-id | 442C4F6C.2000607@samwel.tk |
In reply to | Unicode is not UTF-8. was :psqlODBC-Driver Test / text fields (Johann Zuschlag <zuschlag2@online.de>) |
List | pgsql-odbc |
Johann Zuschlag wrote:
> The problem with UTF-8 is that all ASCII characters are represented by
> one byte and all non ASCII characters, e.g. German Umlauts, are
> represented by two bytes. That's why UTF-8 is called a "variable-length
> multibyte encoding". In a pure Unicode world, e.g. U+xxxx with two
> bytes, every character is represented by two bytes (fixed-length
> multibyte encoding). So Unicode is not equal to UTF-8, even though the
> PostgreSQL documentation is stating that.

Well, it's actually even more complicated, because Unicode is actually a 32-bit character set. There is UTF-8 (variable-length multibyte, 8 bits per unit), UTF-16 (variable-length multibyte, 16 bits per unit) and UTF-32 (fixed-length multibyte). There is also UCS-2 (fixed-length 16-bit), which is limited to the 16 bits of the Basic Multilingual Plane, and UCS-4, which is functionally identical to UTF-32. UTF-8 actually supports up to 4 bytes per character, so it is more complete than the purely 16-bit UCS-2. Any of the variable-length encodings, and the 32-bit UTF-32 and UCS-4 encodings, can represent the whole of the character set.

A pure Unicode world can use any of those encodings, so it's a tradeoff. If you want a direct relationship between the number of characters in a string and the number of bytes taken, use a fixed-length encoding. If you want to be able to encode everything, use a variable-length encoding or a 32-bit encoding. If you want to use little space, use an 8-bit encoding. That's it.

> Windows XP supports ANSI, UTF-8, Unicode and Unicode Big Endian.
> Unfortunately (or fortunately?) Windows seems to use UTF-8 for European
> languages. Hiroshi can you explain that? I guess the Japanese edition of
> Windows XP is using pure 2 byte Unicode.

In fact, the Win32 API is UTF-16 even in European languages (it started out as UCS-2 but became UTF-16 when Unicode went 32-bit :-) ), but it provides an 8-bit compatibility interface. I don't know whether the 8-bit encoding is UTF-8 or plain 8-bit code pages, though.

Reference: http://en.wikipedia.org/wiki/Unicode

Cheers,
Bart
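
The encoding tradeoff described in this message can be checked with a few lines of Python. This is only a minimal sketch: the sample characters and the helper name `encoded_sizes` are chosen here for illustration and do not come from the thread. It prints how many bytes one character takes in UTF-8, UTF-16 and UTF-32, using the little-endian codec variants so that no byte-order mark is counted.

```python
# Sample characters picked for illustration (an assumption of this sketch).
SAMPLES = [
    "A",           # U+0041, ASCII
    "\u00fc",      # U+00FC, German u-umlaut, inside the BMP
    "\u20ac",      # U+20AC, euro sign, inside the BMP
    "\U0001d11e",  # U+1D11E, musical G clef, outside the BMP
]

# The "-le" variants are used so Python does not prepend a byte-order mark.
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_sizes(ch):
    """Return the byte length of one character in UTF-8, UTF-16 and UTF-32."""
    return tuple(len(ch.encode(enc)) for enc in ENCODINGS)


for ch in SAMPLES:
    utf8, utf16, utf32 = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: utf-8={utf8} utf-16={utf16} utf-32={utf32}")
```

In this sketch the umlaut takes two bytes in UTF-8, the euro sign takes three, and the character outside the Basic Multilingual Plane takes four bytes in every one of the three encodings. Only UTF-32 uses the same size for all four samples.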