Re: UTF8 national character data type support WIP patch and list of open issues.

From	Tom Lane
Subject	Re: UTF8 national character data type support WIP patch and list of open issues.
Date	3 September 2013, 13:58:33
Msg-id	15548.1378216699@sss.pgh.pa.us
In reply to	Re: UTF8 national character data type support WIP patch and list of open issues. (Heikki Linnakangas <hlinnakangas@vmware.com>)
List	pgsql-hackers

Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> On 03.09.2013 05:28, Boguk, Maksym wrote:
>> Target usage:  ability to store UTF8 national characters in some
>> selected fields inside a single-byte encoded database.

> I think we should take a completely different approach to this. Two 
> alternatives spring to mind:

> 1. Implement a new encoding.  The new encoding would be some variant of 
> UTF-8 that encodes languages like Russian more efficiently.

+1.  I'm not sure that SCSU satisfies the requirement (which I read as
that Russian text should be pretty much 1 byte/character).  But surely
we could devise a variant that does.  For instance, it could look like
koi8r (or any other single-byte encoding of your choice) with one byte
value, say 255, reserved as a prefix.  255 means that a UTF8 character
follows.  The main complication here is that you don't want to allow more
than one way to represent a character --- else you break text hashing,
for instance.  So you'd have to take care that you never emit the 255+UTF8
representation for a character that can be represented in the single-byte
encoding.  In particular, you'd never encode ASCII that way, and thus this
would satisfy the all-multibyte-chars-must-have-all-high-bits-set rule.
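[Editor's sketch, not part of the original message.] The scheme described above can be illustrated roughly as follows. This is a minimal Python sketch under stated assumptions, not PostgreSQL code: the names ESCAPE, encode_hybrid and decode_hybrid are hypothetical, and Python's built-in koi8_r codec stands in for the single-byte base encoding. If the base encoding already assigns byte 255 to a character, the sketch assumes that character must go through the escaped form, since 255 is reserved as the prefix. The decoder rejects escaped forms of characters that have a single-byte form, so each character keeps exactly one representation.

```python
# Hypothetical sketch: single-byte base encoding plus a reserved prefix byte.
ESCAPE = 0xFF  # the reserved prefix ("say 255" in the message above)


def _single_byte(ch, base):
    """Return ch's byte in the base encoding, or None if it must be escaped."""
    try:
        b = ch.encode(base)
    except UnicodeEncodeError:
        return None
    # Assumption: a character the base maps to 0xFF loses its single-byte form.
    if len(b) == 1 and b[0] != ESCAPE:
        return b[0]
    return None


def _utf8_len(lead):
    """Length of a multibyte UTF-8 sequence, judged from its lead byte."""
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    # ASCII and continuation bytes are never valid right after the prefix.
    raise ValueError("invalid byte after escape prefix")


def encode_hybrid(text, base="koi8_r"):
    out = bytearray()
    for ch in text:
        b = _single_byte(ch, base)
        if b is not None:
            out.append(b)              # canonical single-byte form
        else:
            out.append(ESCAPE)         # prefix, then the UTF-8 bytes
            out += ch.encode("utf-8")
    return bytes(out)


def decode_hybrid(data, base="koi8_r"):
    chars = []
    i = 0
    while i < len(data):
        if data[i] != ESCAPE:
            chars.append(bytes([data[i]]).decode(base))
            i += 1
            continue
        if i + 1 >= len(data):
            raise ValueError("truncated escape sequence")
        n = _utf8_len(data[i + 1])
        ch = data[i + 1:i + 1 + n].decode("utf-8")  # raises on bad UTF-8
        if _single_byte(ch, base) is not None:
            # A second representation would break hashing and comparison.
            raise ValueError("non-canonical escaped character")
        chars.append(ch)
        i += 1 + n
    return "".join(chars)
```

In this sketch, text that fits the base encoding costs one byte per character, anything else costs one prefix byte plus its UTF-8 length, and every byte of an escaped character has its high bit set. Passing a different base (for example base="latin_1") gives the per-encoding variant under the same assumptions.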

Ideally we could make a variant like this for each supported single-byte
encoding, and thus you could optimize a database for "mostly but not
entirely LATIN1 text", etc.

        regards, tom lane
