Re: Patch for collation using ICU
From | Tatsuo Ishii |
---|---|
Subject | Re: Patch for collation using ICU |
Date | |
Msg-id | 20050509.233200.71085686.t-ishii@sra.co.jp |
In reply to | Re: Patch for collation using ICU ("John Hansen" <john@geeknet.com.au>) |
List | pgsql-hackers |
> > -----Original Message-----
> > From: Tatsuo Ishii [mailto:t-ishii@sra.co.jp]
> > Sent: Sunday, May 08, 2005 11:08 PM
> > To: John Hansen
> > Cc: pgman@candle.pha.pa.us; girgen@pingpong.net;
> > pgsql-hackers@postgresql.org
> > Subject: Re: [HACKERS] Patch for collation using ICU
> >
> > > > I don't buy it. If current conversion tables do the right thing,
> > > > why do we need to replace them? Or if conversion tables are not
> > > > correct, why don't you fix them? I think the rules of character
> > > > conversion will not change frequently, especially for LATIN
> > > > languages. Thus the maintenance cost is not too high.
> > >
> > > I never said we need to, but if we're going to implement ICU, then we
> > > might as well go all the way.
> >
> > So you admit there's no benefit in using ICU to replace
> > existing conversions?
> >
> > Besides the fact that ICU does not support all existing conversions,
> > I think ICU has a serious flaw when used for conversion. If I
> > understand correctly, ICU uses UNICODE internally to do the
> > conversion. For example, to implement SJIS->EUC_JP conversion, ICU
> > first converts SJIS to UNICODE, then converts UNICODE to EUC_JP.
> > The problem is that these conversions are not round trip (conversion
> > between SJIS/EUC_JP and UNICODE will lose some information). Thus
> > SJIS->EUC_JP->SJIS conversion using ICU does not preserve the
> > original text.
>
> Just for the record, I fetched a web page encoded in sjis, and converted
> it to euc-jp and back using uconv from ICU 3.2, and the result is that
> the original is identical to the transformed file.
>
> uconv -f Shift_JIS -t EUC-JP -o index.html.euc index.html
> uconv -f EUC-JP -t Shift_JIS -o index.html.sjis index.html.euc
> diff index.html index.html.sjis

Not all SJIS/EUC_JP characters have the problem. You might want to try:
Shift_JIS 0x81e6, 0x879a, 0xfa5b.

BTW, I got this with ICU 3.2:

$ uconv -f EUC_JP -t Shift_JIS /tmp/a.txt -o /tmp/b.txt
Conversion from Unicode to codepage failed at input byte position 0. Unicode: 301c Error: Invalid character found

The content of a.txt is 0xa1c1, which is a valid EUC_JP character.

This makes me nervous about using ICU...
--
Tatsuo Ishii
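
The round-trip concern raised in the message can be checked with a small script. The following is a minimal sketch, not ICU itself: it assumes Python's built-in `cp932` and `euc_jp` codecs as stand-ins for the Shift_JIS and EUC_JP converters discussed in the thread, and their mapping tables may differ from ICU 3.2's. The helper name `round_trip` and the `SAMPLES` list are hypothetical names introduced for illustration. Each input goes through Unicode as a pivot at every step, so any two byte sequences that decode to the same code point cannot both come back unchanged.

```python
# Sketch only: uses Python's built-in codecs, NOT ICU's converters.
# Codec names "cp932" and "euc_jp" are assumptions standing in for the
# Shift_JIS / EUC_JP converters mentioned in the thread.
from typing import Optional


def round_trip(data: bytes, src: str = "cp932", mid: str = "euc_jp") -> Optional[bytes]:
    """Convert src -> Unicode -> mid -> Unicode -> src.

    Returns the resulting bytes, or None if any step cannot be converted.
    """
    try:
        text = data.decode(src)        # source encoding -> Unicode pivot
        mid_bytes = text.encode(mid)   # Unicode pivot -> intermediate encoding
        back = mid_bytes.decode(mid)   # intermediate encoding -> Unicode pivot
        return back.encode(src)        # Unicode pivot -> source encoding
    except UnicodeError:
        return None


# Byte sequences named in the message (0x81e6, 0x879a, 0xfa5b).
SAMPLES = [b"\x81\xe6", b"\x87\x9a", b"\xfa\x5b"]

if __name__ == "__main__":
    for raw in SAMPLES:
        result = round_trip(raw)
        shown = "conversion failed" if result is None else result.hex()
        status = "preserved" if result == raw else "not preserved"
        print(f"{raw.hex()} -> {shown} ({status})")
```

Running the script prints, for each sample, the bytes that come back and whether they match the input. A whole-file `diff` such as the one quoted above only exercises the characters that happen to appear in that file, which is why testing specific byte sequences is more informative.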