Discussion: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8

MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8

From

Ashutosh Sharma

Date:

October 30, 2020, 03:43:53

Hi All,

Today while working on some other task related to database encoding, I
noticed that the MINUS SIGN (with byte sequence a1-dd) in EUC-JP is
mapped to FULLWIDTH HYPHEN-MINUS (with byte sequence ef-bc-8d) in
UTF-8. See below:

postgres=# select convert('\xa1dd', 'euc_jp', 'utf8');
 convert
----------
 \xefbc8d
(1 row)

Isn't this a bug? Shouldn't this have been converted to the MINUS SIGN
(with byte sequence e2-88-92) in UTF-8 instead of FULLWIDTH
HYPHEN-MINUS?

When the MINUS SIGN (with byte sequence e2-88-92) in UTF-8 is
converted to EUC-JP, the convert function fails with an error saying:
"character with byte sequence 0xe2 0x88 0x92 in encoding UTF8 has no
equivalent in encoding EUC_JP". See below:

postgres=# select convert('\xe28892', 'utf-8', 'euc_jp');
ERROR:  character with byte sequence 0xe2 0x88 0x92 in encoding "UTF8"
has no equivalent in encoding "EUC_JP"

However, when the same MINUS SIGN in UTF-8 is converted to SJIS
encoding, the convert function returns the correct result. See below:

postgres=# select convert('\xe28892', 'utf-8', 'sjis');
 convert
---------
 \x817c
(1 row)

Please note that the byte sequence (81-7c) in SJIS represents MINUS
SIGN in SJIS which means the MINUS SIGN in UTF8 got converted to the
MINUS SIGN in SJIS and that is what we expect. Isn't it?
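
For readers who want to check the UTF-8 byte sequences quoted above independently of PostgreSQL, here is a minimal Python sketch. It uses only the standard library; the helper name `describe` is made up for illustration and says nothing about how PostgreSQL's conversion maps are built.

```python
import unicodedata

# The two Unicode characters discussed in this message.
MINUS_SIGN = "\u2212"
FULLWIDTH_HYPHEN_MINUS = "\uff0d"


def describe(ch: str) -> str:
    """Return the code point, Unicode name and UTF-8 bytes (hex) of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {ch.encode('utf-8').hex()}"


for ch in (MINUS_SIGN, FULLWIDTH_HYPHEN_MINUS):
    print(describe(ch))
```

The two lines printed correspond to the byte sequences e2-88-92 and ef-bc-8d used in the queries above.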

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8

From

Amit Langote

Date:

October 30, 2020, 06:08:51

On Fri, Oct 30, 2020 at 9:44 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Hi All,
>
> Today while working on some other task related to database encoding, I
> noticed that the MINUS SIGN (with byte sequence a1-dd) in EUC-JP is
> mapped to FULLWIDTH HYPHEN-MINUS (with byte sequence ef-bc-8d) in
> UTF-8. See below:
>
> postgres=# select convert('\xa1dd', 'euc_jp', 'utf8');
>  convert
> ----------
>  \xefbc8d
> (1 row)
>
> Isn't this a bug? Shouldn't this have been converted to the MINUS SIGN
> (with byte sequence e2-88-92) in UTF-8 instead of FULLWIDTH
> HYPHEN-MINUS SIGN.
>
> When the MINUS SIGN (with byte sequence e2-88-92) in UTF-8 is
> converted to EUC-JP, the convert functions fails with an error saying:
> "character with byte sequence 0xe2 0x88 0x92 in encoding UTF8 has no
> equivalent in encoding EUC_JP". See below:
>
> postgres=# select convert('\xe28892', 'utf-8', 'euc_jp');
> ERROR:  character with byte sequence 0xe2 0x88 0x92 in encoding "UTF8"
> has no equivalent in encoding "EUC_JP"
>
> However, when the same MINUS SIGN in UTF-8 is converted to SJIS
> encoding, the convert function returns the correct result. See below:
>
> postgres=# select convert('\xe28892', 'utf-8', 'sjis');
>  convert
> ---------
>  \x817c
> (1 row)
>
> Please note that the byte sequence (81-7c) in SJIS represents MINUS
> SIGN in SJIS which means the MINUS SIGN in UTF8 got converted to the
> MINUS SIGN in SJIS and that is what we expect. Isn't it?

So we have

a1dd in euc_jp,
817c in sjis,
efbc8d in utf-8

that convert between each other just fine.

But when it comes to

e28892 in utf-8

it currently only converts to sjis and that too just one way:

select convert('\xe28892', 'utf-8', 'sjis');
 convert
---------
 \x817c
(1 row)

select convert('\x817c', 'sjis', 'utf-8');
 convert
----------
 \xefbc8d
(1 row)

I noticed that the commit a8bd7e1c6e02 from ages ago removed
conversions from and to utf-8's e28892, in favor of efbc8d, and that
change has stuck.  (Note though that these maps looked pretty
different back then.)

--- a/src/backend/utils/mb/Unicode/euc_jp_to_utf8.map
+++ b/src/backend/utils/mb/Unicode/euc_jp_to_utf8.map
-  {0xa1dd, 0xe28892},
+  {0xa1dd, 0xefbc8d},

--- a/src/backend/utils/mb/Unicode/utf8_to_euc_jp.map
+++ b/src/backend/utils/mb/Unicode/utf8_to_euc_jp.map
-  {0xe28892, 0xa1dd},
+  {0xefbc8d, 0xa1dd},

Can't tell what reason there was to do that, but there must have been
some.  Maybe the Japanese character sets prefer full-width hyphen
minus (unicode U+FF0D) over mathematical minus sign (U+2212)?

-- 
Amit Langote
EDB: http://www.enterprisedb.com

Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8

From

Kyotaro Horiguchi

Date:

October 30, 2020, 06:19:50

Hello.

At Fri, 30 Oct 2020 06:13:53 +0530, Ashutosh Sharma <ashu.coek88@gmail.com> wrote in 
> Hi All,
> 
> Today while working on some other task related to database encoding, I
> noticed that the MINUS SIGN (with byte sequence a1-dd) in EUC-JP is
> mapped to FULLWIDTH HYPHEN-MINUS (with byte sequence ef-bc-8d) in
> UTF-8. See below:
> 
> postgres=# select convert('\xa1dd', 'euc_jp', 'utf8');
>  convert
> ----------
>  \xefbc8d
> (1 row)
> 
> Isn't this a bug? Shouldn't this have been converted to the MINUS SIGN
> (with byte sequence e2-88-92) in UTF-8 instead of FULLWIDTH
> HYPHEN-MINUS SIGN.

No, it's not a bug, but a well-known "design" :(

The mapping is generated from CP932.TXT and JIS0212.TXT by
UCS_to_EUC_JP.pl.

The CP932.TXT used here is available at:

https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT

CP932.TXT maps 0x817C (SJIS), which corresponds to 0xA1DD (EUC-JP), as follows.

0x817C    0xFF0D    #FULLWIDTH HYPHEN-MINUS
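
The equivalence between SJIS 0x817C and EUC-JP 0xA1DD follows from the arithmetic relation between the two encodings of JIS X 0208. The following Python sketch shows that arithmetic; the function name `sjis_to_euc_jp` is made up for illustration, it handles only the plain two-byte JIS X 0208 range (no vendor extensions), and it is not the code PostgreSQL uses.

```python
def sjis_to_euc_jp(code: int) -> int:
    """Convert a two-byte Shift_JIS code for a JIS X 0208 character to EUC-JP.

    Only lead bytes 0x81-0x9f and 0xe0-0xef are accepted.
    """
    s1, s2 = code >> 8, code & 0xFF
    if not (0x81 <= s1 <= 0x9F or 0xE0 <= s1 <= 0xEF):
        raise ValueError(f"unsupported lead byte 0x{s1:02x}")
    if not 0x40 <= s2 <= 0xFC or s2 == 0x7F:
        raise ValueError(f"invalid trail byte 0x{s2:02x}")

    # Each Shift_JIS lead byte covers two consecutive JIS rows.
    row_pair = s1 - 0x81 if s1 <= 0x9F else s1 - 0xC1
    if s2 >= 0x9F:
        # Upper half of the trail-byte range: the even row of the pair.
        row = row_pair * 2 + 2
        cell = s2 - 0x9E
    else:
        # Lower half: the odd row; 0x7f is skipped in the trail-byte range.
        row = row_pair * 2 + 1
        cell = s2 - 0x3F if s2 <= 0x7E else s2 - 0x40

    # EUC-JP stores row and cell each offset by 0xa0.
    return ((row + 0xA0) << 8) | (cell + 0xA0)


print(hex(sjis_to_euc_jp(0x817C)))
```

With this relation, any Unicode mapping that CP932.TXT assigns to an SJIS code carries over to the corresponding EUC-JP code.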

> When the MINUS SIGN (with byte sequence e2-88-92) in UTF-8 is
> converted to EUC-JP, the convert functions fails with an error saying:
> "character with byte sequence 0xe2 0x88 0x92 in encoding UTF8 has no
> equivalent in encoding EUC_JP". See below:
>
> postgres=# select convert('\xe28892', 'utf-8', 'euc_jp');
> ERROR:  character with byte sequence 0xe2 0x88 0x92 in encoding "UTF8"
> has no equivalent in encoding "EUC_JP"

U+FF0D (ef bc 8d) is mapped to 0xa1dd@euc-jp.
U+2212 (e2 88 92) has no mapping to or from euc-jp.

> However, when the same MINUS SIGN in UTF-8 is converted to SJIS
> encoding, the convert function returns the correct result. See below:
> 
> postgres=# select convert('\xe28892', 'utf-8', 'sjis');
>  convert
> ---------
>  \x817c
> (1 row)

It is manually added by UCS_to_SJIS.pl. I'm not sure about the reason,
but maybe it is because it was widely used.

So ping-pong between Unicode and SJIS behaves like this:

U+2212 => 0x817c@sjis => U+ff0d => 0x817c@sjis ...
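
As a rough illustration of this ping-pong, here is a toy Python model. The two dictionaries are stand-ins that contain only the entries mentioned in this thread, not the real generated maps, and the names are made up.

```python
# Toy stand-ins for the generated maps (only the entries discussed here).
UTF8_TO_SJIS = {0x2212: 0x817C, 0xFF0D: 0x817C}  # U+2212 entry is the manual, one-way addition
SJIS_TO_UTF8 = {0x817C: 0xFF0D}                  # from CP932.TXT


def ping_pong(ucs: int, rounds: int = 2) -> list:
    """Convert Unicode -> SJIS -> Unicode repeatedly and record each step."""
    trail = [f"U+{ucs:04X}"]
    for _ in range(rounds):
        sjis = UTF8_TO_SJIS[ucs]
        trail.append(f"0x{sjis:04x}@sjis")
        ucs = SJIS_TO_UTF8[sjis]
        trail.append(f"U+{ucs:04X}")
    return trail


print(" => ".join(ping_pong(0x2212)))
```

After the first round trip the value settles on U+FF0D and never returns to U+2212.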

> Please note that the byte sequence (81-7c) in SJIS represents MINUS
> SIGN in SJIS which means the MINUS SIGN in UTF8 got converted to the
> MINUS SIGN in SJIS and that is what we expect. Isn't it?

I don't think we should change authoritative mappings, but maybe we can
add some one-way conversions for convenience.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8

From

Tom Lane

Date:

October 30, 2020, 06:24:32

Amit Langote <amitlangote09@gmail.com> writes:
> On Fri, Oct 30, 2020 at 9:44 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>> Today while working on some other task related to database encoding, I
>> noticed that the MINUS SIGN (with byte sequence a1-dd) in EUC-JP is
>> mapped to FULLWIDTH HYPHEN-MINUS (with byte sequence ef-bc-8d) in
>> UTF-8. See below:
>> ...
>> Isn't this a bug?

> Can't tell what reason there was to do that, but there must have been
> some.  Maybe the Japanese character sets prefer full-width hyphen
> minus (unicode U+FF0D) over mathematical minus sign (U+2212)?

The way it's been explained to me in the past is that the conversion
between Unicode and the various Japanese encodings is not as well
defined as one could wish, because there are multiple quasi-standard
versions of the Japanese encodings.  So we shouldn't move too hastily
on changing this.  Maybe it's really a bug, but maybe there are good
reasons.

            regards, tom lane

Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8

From

Kyotaro Horiguchi

Date:

October 30, 2020, 06:28:51

At Fri, 30 Oct 2020 12:08:51 +0900, Amit Langote <amitlangote09@gmail.com> wrote in 
> I noticed that the commit a8bd7e1c6e02 from ages ago removed
> conversions from and to utf-8's e28892, in favor of efbc8d, and that
> change has stuck.  (Note though that these maps looked pretty
> different back then.)
> 
> --- a/src/backend/utils/mb/Unicode/euc_jp_to_utf8.map
> +++ b/src/backend/utils/mb/Unicode/euc_jp_to_utf8.map
> -  {0xa1dd, 0xe28892},
> +  {0xa1dd, 0xefbc8d},
> 
> --- a/src/backend/utils/mb/Unicode/utf8_to_euc_jp.map
> +++ b/src/backend/utils/mb/Unicode/utf8_to_euc_jp.map
> -  {0xe28892, 0xa1dd},
> +  {0xefbc8d, 0xa1dd},
> 
> Can't tell what reason there was to do that, but there must have been
> some.  Maybe the Japanese character sets prefer full-width hyphen
> minus (unicode U+FF0D) over mathematical minus sign (U+2212)?

It's a decision made by Microsoft.  Several other characters have
similar issues. I remember many people complained, but in the end it
wasn't "fixed", which led to the well-known messes in Japanese
character conversion involving Unicode in Java.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8

From

Tatsuo Ishii

Date:

October 30, 2020, 07:06:26

> Hi All,
> 
> Today while working on some other task related to database encoding, I
> noticed that the MINUS SIGN (with byte sequence a1-dd) in EUC-JP is
> mapped to FULLWIDTH HYPHEN-MINUS (with byte sequence ef-bc-8d) in
> UTF-8. See below:
> 
> postgres=# select convert('\xa1dd', 'euc_jp', 'utf8');
>  convert
> ----------
>  \xefbc8d
> (1 row)
> 
> Isn't this a bug? Shouldn't this have been converted to the MINUS SIGN
> (with byte sequence e2-88-92) in UTF-8 instead of FULLWIDTH
> HYPHEN-MINUS SIGN.

Yeah. Originally EUC_JP 0xa1dd was converted to UTF8 0xe28892. At some
point, someone changed the mapping and now you see it.

> When the MINUS SIGN (with byte sequence e2-88-92) in UTF-8 is
> converted to EUC-JP, the convert functions fails with an error saying:
> "character with byte sequence 0xe2 0x88 0x92 in encoding UTF8 has no
> equivalent in encoding EUC_JP". See below:
> 
> postgres=# select convert('\xe28892', 'utf-8', 'euc_jp');
> ERROR:  character with byte sequence 0xe2 0x88 0x92 in encoding "UTF8"
> has no equivalent in encoding "EUC_JP"

Again, originally UTF8 0xe28892 was converted to EUC_JP 0xa1dd . At
some point, someone changed the mapping.

> However, when the same MINUS SIGN in UTF-8 is converted to SJIS
> encoding, the convert function returns the correct result. See below:
> 
> postgres=# select convert('\xe28892', 'utf-8', 'sjis');
>  convert
> ---------
>  \x817c
> (1 row)
> 
> Please note that the byte sequence (81-7c) in SJIS represents MINUS
> SIGN in SJIS which means the MINUS SIGN in UTF8 got converted to the
> MINUS SIGN in SJIS and that is what we expect. Isn't it?

Agreed.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8

From

Tatsuo Ishii

Date:

October 30, 2020, 07:17:08

> The mapping is generated from CP932.TXT and JIS0212.TXT by
> UCS_to_EUC_JP.pl.

I still don't understand why this change has been made. Originally the
conversion was based on JIS0208.txt, JIS0212.txt and JIS0201.txt,
which is the exact definition of EUC-JP. CP932.txt is defined by
Microsoft for their products.

Probably we should call our "EUC-JP" something like "EUC-JP-MS" or
whatever to differentiate from true EUC-JP.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8

From

Ashutosh Sharma

Date:

October 30, 2020, 07:34:22

On Fri, Oct 30, 2020 at 8:49 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> Hello.
>
> At Fri, 30 Oct 2020 06:13:53 +0530, Ashutosh Sharma <ashu.coek88@gmail.com> wrote in
> > Hi All,
> >
> > Today while working on some other task related to database encoding, I
> > noticed that the MINUS SIGN (with byte sequence a1-dd) in EUC-JP is
> > mapped to FULLWIDTH HYPHEN-MINUS (with byte sequence ef-bc-8d) in
> > UTF-8. See below:
> >
> > postgres=# select convert('\xa1dd', 'euc_jp', 'utf8');
> >  convert
> > ----------
> >  \xefbc8d
> > (1 row)
> >
> > Isn't this a bug? Shouldn't this have been converted to the MINUS SIGN
> > (with byte sequence e2-88-92) in UTF-8 instead of FULLWIDTH
> > HYPHEN-MINUS SIGN.
>
> No it's not a bug, but a well-known "design":(
>
> The mapping is generated from CP932.TXT and JIS0212.TXT by
> UCS_to_EUC_JP.pl.
>
> CP932.TXT used here is here.
>
> https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
>
> CP932.TXT maps 0x817C(SJIS) = 0xA1DD(EUC-JP) as follows.
>
> 0x817C  0xFF0D  #FULLWIDTH HYPHEN-MINUS
>

We do have MINUS SIGN (U+2212) defined in both the UTF-8 and EUC-JP
encodings. So I'm not sure why converting MINUS SIGN from UTF-8 to EUC-JP
should throw an error saying: "... in encoding UTF8 has *no*
equivalent in EUC_JP". I mean, this message looks misleading, and
that's the reason I feel it's a bug.

> > When the MINUS SIGN (with byte sequence e2-88-92) in UTF-8 is
> > converted to EUC-JP, the convert functions fails with an error saying:
> > "character with byte sequence 0xe2 0x88 0x92 in encoding UTF8 has no
> > equivalent in encoding EUC_JP". See below:
> >
> > postgres=# select convert('\xe28892', 'utf-8', 'euc_jp');
> > ERROR:  character with byte sequence 0xe2 0x88 0x92 in encoding "UTF8"
> > has no equivalent in encoding "EUC_JP"
>
> U+FF0D(ef bc 8d) is mapped to 0xa1dd@euc-jp
> U+2212(e2 88 92) doesn't have a mapping between euc-jp.
>
> > However, when the same MINUS SIGN in UTF-8 is converted to SJIS
> > encoding, the convert function returns the correct result. See below:
> >
> > postgres=# select convert('\xe28892', 'utf-8', 'sjis');
> >  convert
> > ---------
> >  \x817c
> > (1 row)
>
> It is manually added by UCS_to_SJIS.pl. I'm not sure about the reason
> but maybe because it was used widely.
>
> So ping-pong between Unicode and SJIS behaves like this:
>
> U+2212 => 0x817c@sjis => U+ff0d => 0x817c@sjis ...
>
> > Please note that the byte sequence (81-7c) in SJIS represents MINUS
> > SIGN in SJIS which means the MINUS SIGN in UTF8 got converted to the
> > MINUS SIGN in SJIS and that is what we expect. Isn't it?
>
> I think we don't change authoritative mappings, but maybe can add some
> one-way conversions for the convenience.
>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center

Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8

From

Kyotaro Horiguchi

Date:

October 30, 2020, 07:47:55

At Fri, 30 Oct 2020 13:17:08 +0900 (JST), Tatsuo Ishii <ishii@sraoss.co.jp> wrote in 
> > The mapping is generated from CP932.TXT and JIS0212.TXT by
> > UCS_to_EUC_JP.pl.
> 
> I still don't understand why this change has been made. Originally the
> conversion was based on JIS0208.txt, JIS0212.txt and JIS0201.txt,
> which is the exact definition of EUC-JP. CP932.txt is defined by
> Microsoft for their products.
> 
> Probably we should call our "EUC-JP" something like "EUC-JP-MS" or
> whatever to differentiate from true EUC-JP.

Seems valid.  Things were already this way when aeed17d was introduced
(I believe it made no difference to the conversions), and the
change itself was made by a8bd7e1c6e in 2002.

I'm not sure what the point of the change was, though.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8

From

Amit Langote

Date:

October 30, 2020, 08:38:30

On Fri, Oct 30, 2020 at 12:20 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> At Fri, 30 Oct 2020 06:13:53 +0530, Ashutosh Sharma <ashu.coek88@gmail.com> wrote in
> > However, when the same MINUS SIGN in UTF-8 is converted to SJIS
> > encoding, the convert function returns the correct result. See below:
> >
> > postgres=# select convert('\xe28892', 'utf-8', 'sjis');
> >  convert
> > ---------
> >  \x817c
> > (1 row)
>
> It is manually added by UCS_to_SJIS.pl. I'm not sure about the reason
> but maybe because it was used widely.
>
> So ping-pong between Unicode and SJIS behaves like this:
>
> U+2212 => 0x817c@sjis => U+ff0d => 0x817c@sjis ...

Is it the following piece of code in UCS_to_SJIS.pl that manually adds
the mapping?

# Add these UTF8->SJIS pairs to the table.
push @$mapping,
...
    {
        direction => FROM_UNICODE,
        ucs       => 0x2212,
        code      => 0x817c,
        comment   => '# MINUS SIGN',
        f         => $this_script,
        l         => __LINE__
    },

Given that U+2212 is encoded by e28892 in utf8, I assume that's how
utf8_to_sjis.map ends up with the following mapping into sjis for that
byte sequence:

  /*** Three byte table, leaf: e288xx - offset 0x004ee ***/

  /* 80 */  0x81cd, 0x0000, 0x81dd, 0x81ce, 0x0000, 0x0000, 0x0000, 0x81de,
  /* 88 */  0x81b8, 0x0000, 0x0000, 0x81b9, 0x0000, 0x0000, 0x0000, 0x0000,
  /* 90 */  0x0000, 0x8794, "0x817c", ...
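
To make the indexing in that excerpt concrete, here is a small Python sketch of the leaf lookup for the e2 88 prefix. It is a simplified illustration that assumes the leaf is indexed by the third byte minus 0x80; the values are copied from the excerpt above, which is elided after byte 0x92, and the real map file locates the leaf through offsets that are not modelled here. The names `LEAF_E288` and `lookup_e288` are made up.

```python
# Leaf values for UTF-8 sequences e2 88 xx, copied from the excerpt above.
# The excerpt is elided after byte 0x92, so only 0x80-0x92 are covered.
LEAF_E288 = [
    0x81CD, 0x0000, 0x81DD, 0x81CE, 0x0000, 0x0000, 0x0000, 0x81DE,  # 80-87
    0x81B8, 0x0000, 0x0000, 0x81B9, 0x0000, 0x0000, 0x0000, 0x0000,  # 88-8f
    0x0000, 0x8794, 0x817C,                                          # 90-92
]


def lookup_e288(utf8: bytes) -> int:
    """Return the SJIS code for a UTF-8 sequence e2 88 xx (0 means no mapping)."""
    if len(utf8) != 3 or utf8[:2] != b"\xe2\x88":
        raise ValueError("expected a three-byte sequence starting with e2 88")
    index = utf8[2] - 0x80  # continuation bytes start at 0x80
    if not 0 <= index < len(LEAF_E288):
        raise KeyError("byte outside the excerpted part of the leaf")
    return LEAF_E288[index]


print(hex(lookup_e288(b"\xe2\x88\x92")))
```

The third byte 0x92 selects index 0x12 of the leaf, which is the entry highlighted in the excerpt.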

> > Please note that the byte sequence (81-7c) in SJIS represents MINUS
> > SIGN in SJIS which means the MINUS SIGN in UTF8 got converted to the
> > MINUS SIGN in SJIS and that is what we expect. Isn't it?
>
> I think we don't change authoritative mappings, but maybe can add some
> one-way conversions for the convenience.

Maybe UCS_to_EUC_JP.pl could do something like the above.

Are there other cases that were fixed like this in the past, either
for euc_jp or sjis?

-- 
Amit Langote
EDB: http://www.enterprisedb.com

Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8

From

Kyotaro Horiguchi

Date:

October 30, 2020, 10:33:01

At Fri, 30 Oct 2020 14:38:30 +0900, Amit Langote <amitlangote09@gmail.com> wrote in 
> On Fri, Oct 30, 2020 at 12:20 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > So ping-pong between Unicode and SJIS behaves like this:
> >
> > U+2212 => 0x817c@sjis => U+ff0d => 0x817c@sjis ...
> 
> Is it the following piece of code in UCS_to_SJIS.pl that manually adds
> the mapping?

Yes.

> # Add these UTF8->SJIS pairs to the table.
> push @$mapping,
> ...
>     {
>         direction => FROM_UNICODE,
>         ucs       => 0x2212,
>         code      => 0x817c,
>         comment   => '# MINUS SIGN',
>         f         => $this_script,
>         l         => __LINE__
>     },
> 
> Given that U+2212 is encoded by e28892 in utf8, I assume that's how
> utf8_to_sjis.map ends up with the following mapping into sjis for that
> byte sequence:
> 
>   /*** Three byte table, leaf: e288xx - offset 0x004ee ***/
> 
>   /* 80 */  0x81cd, 0x0000, 0x81dd, 0x81ce, 0x0000, 0x0000, 0x0000, 0x81de,
>   /* 88 */  0x81b8, 0x0000, 0x0000, 0x81b9, 0x0000, 0x0000, 0x0000, 0x0000,
>   /* 90 */  0x0000, 0x8794, "0x817c", ...

I'm not sure how we should construct our own mapping, but the
difference made by we simply moved to JIS0208.TXT based as Ishii-san
suggested the differences in the mapping would be as the follows.

1. The following codes (regions) are not defined in JIS0208.

     8ea1 - 8edf      (up to 64 characters (I didn't actually count them.))
     ada1 - adfc      (up to 92 characters (ditto))
     8ff3f3 - 8ff4a8  (up to 182 characters (ditto))

     a1c0  ff3c: (ff3c: FULLWIDTH REVERSE SOLIDUS)
   8ff4aa  ff07: (ff07: FULLWIDTH APOSTROPHE)

2. some individual differences

   EUC  0208  932
   a1c1 301c ff5e: (301c: WAVE DASH) : (ff5e: FULLWIDTH TILDE)
   a1c2 2016 2225: (2016: DOUBLE VERTICAL LINE) : (2225: PARALLEL TO)
*  a1dd 2212 ff0d: (2212: MINUS SIGN) : (ff0d: FULLWIDTH HYPHEN-MINUS)
   a1f1   a2 ffe0: (00a2: CENT SIGN) : (ffe0: FULLWIDTH CENT SIGN)
   a1f2   a3 ffe1: (00a3: POUND SIGN) : (ffe1: FULLWIDTH POUND SIGN)
   a2cc   ac ffe2: (00ac: NOT SIGN) : (ffe2: FULLWIDTH NOT SIGN)


*1: https://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0208.TXT
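
The character names in the table of individual differences can be cross-checked against the Unicode character database. A minimal Python sketch, using only the standard `unicodedata` module; the list `PAIRS` and the helper `names` are made up for illustration, and the code points are the ones from the 0208 and 932 columns above.

```python
import unicodedata

# (code point per JIS0208.TXT, code point per CP932.TXT) for each differing row.
PAIRS = [
    (0x301C, 0xFF5E),
    (0x2016, 0x2225),
    (0x2212, 0xFF0D),
    (0x00A2, 0xFFE0),
    (0x00A3, 0xFFE1),
    (0x00AC, 0xFFE2),
]


def names(pair):
    """Return the Unicode names of both code points in a pair."""
    return tuple(unicodedata.name(chr(cp)) for cp in pair)


for pair in PAIRS:
    jis_name, ms_name = names(pair)
    print(f"U+{pair[0]:04X} {jis_name}  vs  U+{pair[1]:04X} {ms_name}")
```

In each row the CP932 side is a fullwidth or compatibility counterpart of the JIS0208 side.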

> > > Please note that the byte sequence (81-7c) in SJIS represents MINUS
> > > SIGN in SJIS which means the MINUS SIGN in UTF8 got converted to the
> > > MINUS SIGN in SJIS and that is what we expect. Isn't it?
> >
> > I think we don't change authoritative mappings, but maybe can add some
> > one-way conversions for the convenience.
> 
> Maybe UCS_to_EUC_JP.pl could do something like the above.
> 
> Are there other cases that were fixed like this in the past, either
> for euc_jp or sjis?

Honestly, I don't know how the mapping was decided in 2002, but
removing the regions in 1 would cause confusion.  So what we can do in
this area would be changing some of 2 to the 0208 mapping.  But an arbitrary
mixture of different mappings would cause new problems.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8

From

Kyotaro Horiguchi

Date:

October 30, 2020, 10:56:38

At Fri, 30 Oct 2020 16:33:01 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> At Fri, 30 Oct 2020 14:38:30 +0900, Amit Langote <amitlangote09@gmail.com> wrote in 
> I'm not sure how we should construct our own mapping, but the
> difference made by we simply moved to JIS0208.TXT based as Ishii-san
> suggested the differences in the mapping would be as the follows.

Mmm..

I'm not sure how we should construct our own mapping, but if we
simply moved to a JIS0208.TXT-based mapping as Ishii-san
suggested, the following differences would be seen in the mappings.

> 1. The following codes (regions) are not defined in JIS0208.
> 
>      8ea1 - 8edf      (up to 64 characters (I didn't actually count them.))
>      ada1 - adfc      (up to 92 characters (ditto))
>      8ff3f3 - 8ff4a8  (up to 182 characters (ditto))

  8ea1 - 8edf      (64 chars, U+ff61 - U+ff9f) (hankaku-kana)
  ada1 - adfc      (83 chars, U+2460 - U+33a1) (numbers with circle)
  8ff3f3 - 8ff4a8  (20 chars, U+2160 - U+2179) (roman numerals)

>      a1c0  ff3c: (ff3c: FULLWIDTH REVERSE SOLIDUS)
>    8ff4aa  ff07: (ff07: FULLWIDTH APOSTROPHE)
> 
> 2. some individual differences
> 
>    EUC  0208  932
>    a1c1 301c ff5e: (301c: WAVE DASH) : (ff5e: FULLWIDTH TILDE)
>    a1c2 2016 2225: (2016: DOUBLE VERTICAL LINE) : (2225: PARALLEL TO)
> *  a1dd 2212 ff0d: (2212: MINUS SIGN) : (ff0d: FULLWIDTH HYPHEN-MINUS)
>    a1f1   a2 ffe0: (00a2: CENT SIGN) : (ffe0: FULLWIDTH CENT SIGN)
>    a1f2   a3 ffe1: (00a3: POUND SIGN) : (ffe1: FULLWIDTH POUND SIGN)
>    a2cc   ac ffe2: (00ac: NOT SIGN) : (ffe2: FULLWIDTH NOT SIGN)
> 
> 
> *1: https://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0208.TXT
> 
> > > > Please note that the byte sequence (81-7c) in SJIS represents MINUS
> > > > SIGN in SJIS which means the MINUS SIGN in UTF8 got converted to the
> > > > MINUS SIGN in SJIS and that is what we expect. Isn't it?
> > >
> > > I think we don't change authoritative mappings, but maybe can add some
> > > one-way conversions for the convenience.
> > 
> > Maybe UCS_to_EUC_JP.pl could do something like the above.
> > 
> > Are there other cases that were fixed like this in the past, either
> > for euc_jp or sjis?
> 
> Honestly, I don't know how the mapping was decided in 2002, but
> removing the regions in 1 would cause confusion.  So what we can do in
> this area would be changing some of 2 to the 0208 mapping.  But an arbitrary
> mixture of different mappings would cause new problems.

 I forgot about adding one-way mappings.  I think we can add several
 such mappings, say:

 U+301c->:   EUC:a1c1 <-> U+ff5e
 U+2016->:   EUC:a1c2 <-> U+2225
 U+2212->:   EUC:a1dd <-> U+ff0d
 U+00a2->:   EUC:a1f1 <-> U+ffe0
 U+00a3->:   EUC:a1f2 <-> U+ffe1
 U+00ac->:   EUC:a2cc <-> U+ffe2
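
The effect of such one-way mappings can be sketched with two small lookup tables. The following Python is only a toy model under stated assumptions: it covers just the a1c2 and a1dd rows, the names `BOTH_WAYS`, `FROM_UNICODE_ONLY`, `to_euc_jp` and `from_euc_jp` are made up, and it does not reflect how UCS_to_EUC_JP.pl or the generated maps are actually structured.

```python
# Two-way entries, as in the current CP932-based map (subset).
BOTH_WAYS = {0xFF0D: 0xA1DD, 0x2225: 0xA1C2}

# Hypothetical extra entries that apply only when converting from Unicode.
FROM_UNICODE_ONLY = {0x2212: 0xA1DD, 0x2016: 0xA1C2}


def to_euc_jp(ucs: int) -> int:
    """Unicode -> EUC-JP, honouring both the two-way and the one-way entries."""
    table = {**BOTH_WAYS, **FROM_UNICODE_ONLY}
    if ucs not in table:
        raise KeyError(f"U+{ucs:04X} has no equivalent in this toy EUC_JP map")
    return table[ucs]


def from_euc_jp(code: int) -> int:
    """EUC-JP -> Unicode, using only the two-way entries."""
    reverse = {euc: ucs for ucs, euc in BOTH_WAYS.items()}
    return reverse[code]


print(hex(to_euc_jp(0x2212)), hex(from_euc_jp(0xA1DD)))
```

In this model U+2212 converts to a1dd without an error, while a1dd still converts back to U+FF0D, so the existing two-way mapping is left untouched.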

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
