Re: Unicode support
From | Marko Ristola |
---|---|
Subject | Re: Unicode support |
Date | |
Msg-id | 4320736A.10308@kolumbus.fi |
In reply to | Re: Unicode support ("Dave Page" <dpage@vale-housing.co.uk>) |
List | pgsql-odbc |
Marc Herbert wrote:

>Marko Ristola <Marko.Ristola@kolumbus.fi> writes:
>
>>So I ask you, how you have thought about these things:
>>
>>If I have understood Windows correctly, it uses UCS-2 as it's internal
>>UNICODE
>>character set. Linux prefers into UTF-8.
>
>I am not sure what you mean by "internal UNICODE character set", but I
>understand that Linux does prefer UTF-32, NOT UTF-8 !

If you want to know the details of the UTF-8 encoding, the following Linux manual page is recommended reading :)

man utf-8

It gives a good explanation of the encoding. UTF-8 uses from one to four bytes per character. It supports almost all character sets in the world.

Because the task is so huge, there are variants and bugs in the implementations. That is what I read in the Samba filesystem FAQ. So if you stick with the Windows implementation, you will not find any bugs, but when you move the file to another operating system, the file might look different :(

UCS-2 is a fixed-width 16-bit encoding. On Linux (glibc), wchar_t is a 32-bit type that holds UCS-4 code points, while on Windows wchar_t is 16 bits wide. According to the Linux manuals, wchar_t is not the same on all implementations. The manuals also recommend that C programs store strings in binary files as UTF-8 and convert them to the wchar_t type at runtime.

The Java language is another story. It might have the same problems, though. The code point number remains the same, but if you draw the character into a window with different implementations, you might get different drawings.

>On all platforms I had a look at, variable-length encodings are only
>for disk and network, never used in memory.
>
>Don't you agree?
locale
LANG=fi_FI.UTF-8@euro
LC_CTYPE="fi_FI.UTF-8@euro"
LC_NUMERIC="fi_FI.UTF-8@euro"
LC_TIME="fi_FI.UTF-8@euro"
LC_COLLATE="fi_FI.UTF-8@euro"
LC_MONETARY="fi_FI.UTF-8@euro"
LC_MESSAGES="fi_FI.UTF-8@euro"
LC_PAPER="fi_FI.UTF-8@euro"
LC_NAME="fi_FI.UTF-8@euro"
LC_ADDRESS="fi_FI.UTF-8@euro"
LC_TELEPHONE="fi_FI.UTF-8@euro"
LC_MEASUREMENT="fi_FI.UTF-8@euro"
LC_IDENTIFICATION="fi_FI.UTF-8@euro"
LC_ALL=

So, under Linux nowadays, UTF-8 is used very widely.

Just as Windows recommends that everybody move to native Windows Unicode characters (UCS-2), under Linux it is recommended to move to UTF-8. Both are Unicode character encodings. UCS-2 is just the simpler one: each character is a single fixed-size integer holding its numerical value.

The reason for the popularity of UTF-8 under Linux is that each program needs very little adjustment to move from a LATIN1-style encoding to UTF-8.

Happy studying about Unicode character sets :)

Regards,
Marko Ristola
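
To make the "one to four bytes per character" point concrete, here is a minimal Python sketch. It is only an illustration and not something from the thread: the helper names `encoded_sizes` and `to_code_points` and the sample characters are assumptions chosen for the example. It compares the size of a few characters in UTF-8 with the fixed-width forms, and shows the decode step that turns UTF-8 bytes into plain code point integers, roughly what a C program does when it converts UTF-8 into wide characters at runtime.

```python
# Sample characters (arbitrary choices): A, a-umlaut, euro sign, an emoji.
SAMPLES = ["A", "\u00e4", "\u20ac", "\U0001F600"]


def encoded_sizes(ch):
    """Return (code point, UTF-8 size, UTF-16 size, UTF-32 size) in bytes."""
    return (
        ord(ch),
        len(ch.encode("utf-8")),
        # The "-le" variants avoid counting a byte-order mark.
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )


def to_code_points(data):
    """Decode UTF-8 bytes into a list of integer code points."""
    return [ord(ch) for ch in data.decode("utf-8")]


for ch in SAMPLES:
    cp, u8, u16, u32 = encoded_sizes(ch)
    print(f"U+{cp:04X}: UTF-8={u8}  UTF-16={u16}  UTF-32={u32} bytes")
```

The last sample lies outside the 16-bit range, so UTF-16 needs a surrogate pair (four bytes) for it and plain UCS-2 cannot represent it at all, whereas UTF-32 always uses four bytes per character.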