Re: BUG #12845: The GB18030 encoding doesn't support Unicode characters over 0xFFFF

From Arjen Nienhuis
Subject Re: BUG #12845: The GB18030 encoding doesn't support Unicode characters over 0xFFFF
Date
Msg-id CAG6W84JZ-ZFhAM1GQzpVUOW8YM2gx6_-f4uCKU1j2sdmt+wO6g@mail.gmail.com
In reply to Re: BUG #12845: The GB18030 encoding doesn't support Unicode characters over 0xFFFF  (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses Re: BUG #12845: The GB18030 encoding doesn't support Unicode characters over 0xFFFF  (Heikki Linnakangas <hlinnaka@iki.fi>)
List pgsql-bugs
On 10 Mar 2015 22:33, "Heikki Linnakangas" <hlinnaka@iki.fi> wrote:
>
> On 03/09/2015 10:51 PM, a.g.nienhuis@gmail.com wrote:
>>
>> The following bug has been logged on the website:
>>
>> Bug reference:      12845
>> Logged by:          Arjen Nienhuis
>> Email address:      a.g.nienhuis@gmail.com
>> PostgreSQL version: 9.3.5
>> Operating system:   Ubuntu Linux
>> Description:
>>
>> Steps to reproduce:
>>
>> In psql:
>>
>> arjen=> select convert_to(chr(128512), 'GB18030');
>>
>> Actual output:
>>
>> ERROR:  character with byte sequence 0xf0 0x9f 0x98 0x80 in encoding "UTF8"
>> has no equivalent in encoding "GB18030"
>>
>> Expected output:
>>
>>   convert_to
>> ------------
>>   \x9439fc36
>> (1 row)
>
>
> Hmm, looks like our gb18030 <-> Unicode conversion table only contains
> the Unicode BMP. Code points above 0xffff are not included.
>
> If we added all the missing mappings as one-to-one mappings, like we've
> done for the BMP, that would bloat the table horribly. There are over 1
> million code points that are currently not mapped. Fortunately, the missing
> mappings are in linear ranges that would be fairly simple to handle
> programmatically. See e.g.
> https://ssl.icu-project.org/repos/icu/data/trunk/charset/source/gb18030/gb18030.html.
> Someone needs to write the code (I'm not volunteering myself).
>
> - Heikki

I can write a "uint32 UTF8toGB18030(uint32)" function, but I don't know
where to put it in the code.

(Maybe at line 479 of conv.c:
https://github.com/postgres/postgres/blob/4baaf863eca5412e07a8441b3b7e7482b7a8b21a/src/backend/utils/mb/conv.c#L479)

Alternatively, I could extend the map file. It would double in size if it
only needs to include valid code points.
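
For illustration, here is a minimal sketch of the range arithmetic such a
function might use for code points above 0xFFFF. It is not existing
PostgreSQL code: the name unicode_supplementary_to_gb18030 is hypothetical,
and it assumes the input is a plain Unicode code point rather than UTF-8
bytes packed into a uint32, so a real conv.c version would need to decode
the UTF-8 first. It relies on U+10000..U+10FFFF mapping linearly onto
four-byte GB18030 sequences starting at 0x90 0x30 0x81 0x30.

```c
#include <stdint.h>

/*
 * Hypothetical helper (an assumption, not part of PostgreSQL): map a Unicode
 * code point in U+10000..U+10FFFF to its four-byte GB18030 sequence, packed
 * big-endian into a uint32_t. Returns 0 for code points outside that range.
 */
static uint32_t
unicode_supplementary_to_gb18030(uint32_t cp)
{
    uint32_t idx, b1, b2, b3, b4;

    if (cp < 0x10000 || cp > 0x10FFFF)
        return 0;

    /* Linear offset from U+10000, which maps to 0x90 0x30 0x81 0x30. */
    idx = cp - 0x10000;

    /* Bytes 2 and 4 span 0x30..0x39 (10 values); byte 3 spans 0x81..0xFE (126). */
    b4 = 0x30 + idx % 10;
    idx /= 10;
    b3 = 0x81 + idx % 126;
    idx /= 126;
    b2 = 0x30 + idx % 10;
    idx /= 10;
    b1 = 0x90 + idx;

    return (b1 << 24) | (b2 << 16) | (b3 << 8) | b4;
}
```

For the code point in the bug report, 128512 (U+1F600), this yields
0x9439FC36, which matches the expected output \x9439fc36.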
