Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8
From | Amit Langote |
---|---|
Subject | Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8 |
Date | |
Msg-id | CA+HiwqEAcSaj6XC-DdzJtUdQi0Ds=+G202F3Y2Q-mmAaPkRviw@mail.gmail.com |
In response to | Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8 (Kyotaro Horiguchi <horikyota.ntt@gmail.com>) |
Responses |
Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8
|
List | pgsql-hackers |
On Fri, Oct 30, 2020 at 12:20 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> At Fri, 30 Oct 2020 06:13:53 +0530, Ashutosh Sharma <ashu.coek88@gmail.com> wrote in
> > However, when the same MINUS SIGN in UTF-8 is converted to SJIS
> > encoding, the convert function returns the correct result. See below:
> >
> > postgres=# select convert('\xe28892', 'utf-8', 'sjis');
> >  convert
> > ---------
> >  \x817c
> > (1 row)
>
> It is manually added by UCS_to_SJIS.pl. I'm not sure about the reason
> but maybe because it was used widely.
>
> So ping-pong between Unicode and SJIS behaves like this:
>
> U+2212 => 0x817c@sjis => U+ff0d => 0x817c@sjis ...

Is it the following piece of code in UCS_TO_SJIS.pl that manually adds
the mapping?

    # Add these UTF8->SJIS pairs to the table.
    push @$mapping,
    ...
      {
        direction => FROM_UNICODE,
        ucs       => 0x2212,
        code      => 0x817c,
        comment   => '# MINUS SIGN',
        f         => $this_script,
        l         => __LINE__
      },

Given that U+2212 is encoded by e28892 in utf8, I assume that's how
utf8_to_sjis.map ends up with the following mapping into sjis for that
byte sequence:

    /*** Three byte table, leaf: e288xx - offset 0x004ee ***/

    /* 80 */ 0x81cd, 0x0000, 0x81dd, 0x81ce, 0x0000, 0x0000, 0x0000, 0x81de,
    /* 88 */ 0x81b8, 0x0000, 0x0000, 0x81b9, 0x0000, 0x0000, 0x0000, 0x0000,
    /* 90 */ 0x0000, 0x8794, "0x817c", ...

> > Please note that the byte sequence (81-7c) in SJIS represents MINUS
> > SIGN in SJIS which means the MINUS SIGN in UTF8 got converted to the
> > MINUS SIGN in SJIS and that is what we expect. Isn't it?
>
> I think we don't change authoritative mappings, but maybe can add some
> one-way conversions for the convenience.

Maybe UCS_TO_EUC_JP.pl could do something like the above.

Are there other cases that were fixed like this in the past, either
for euc_jp or sjis?

-- 
Amit Langote
EDB: http://www.enterprisedb.com
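
For readers who want to see the ping-pong concretely, here is a minimal Python sketch. It is only a toy model: the two dictionaries contain just the entries quoted in this thread (they are not the real generated .map tables), and the names UCS_TO_SJIS, SJIS_TO_UCS, ping_pong, utf8_hex and show are hypothetical helpers invented for this illustration.

```python
# Toy model of the mappings discussed in this thread. Only the entries
# quoted in the email are included; the real PostgreSQL tables are
# generated by the UCS_to_*.pl scripts and are far larger.

# Unicode -> SJIS: the round-trip pair plus the manually added one-way pair.
UCS_TO_SJIS = {
    0xFF0D: 0x817C,  # FULLWIDTH HYPHEN-MINUS (round-trip pair)
    0x2212: 0x817C,  # MINUS SIGN (added in the FROM_UNICODE direction only)
}

# SJIS -> Unicode: only one target is possible for 0x817c.
SJIS_TO_UCS = {0x817C: 0xFF0D}


def ping_pong(ucs, hops):
    """Convert back and forth `hops` times, recording every value visited."""
    trail = [("ucs", ucs)]
    current = ucs
    to_sjis = True
    for _ in range(hops):
        if to_sjis:
            current = UCS_TO_SJIS[current]
            trail.append(("sjis", current))
        else:
            current = SJIS_TO_UCS[current]
            trail.append(("ucs", current))
        to_sjis = not to_sjis
    return trail


def utf8_hex(ucs):
    """Hex string of the UTF-8 encoding of a code point."""
    return chr(ucs).encode("utf-8").hex()


def show(trail):
    """Render a trail in the notation used in the thread."""
    parts = []
    for kind, value in trail:
        parts.append(f"U+{value:04X}" if kind == "ucs" else f"0x{value:04x}@sjis")
    return " => ".join(parts)


print(utf8_hex(0x2212))            # e28892
print(show(ping_pong(0x2212, 3)))  # U+2212 => 0x817c@sjis => U+FF0D => 0x817c@sjis
```

The point of the sketch is that a one-way FROM_UNICODE entry lets U+2212 convert successfully without touching the reverse table, so a value that starts as U+2212 settles on U+FF0D after the first round trip.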