Re: UTF8 national character data type support WIP patch and list of open issues.
From | Tom Lane |
---|---|
Subject | Re: UTF8 national character data type support WIP patch and list of open issues. |
Date | |
Msg-id | 15548.1378216699@sss.pgh.pa.us |
In reply to | Re: UTF8 national character data type support WIP patch and list of open issues. (Heikki Linnakangas <hlinnakangas@vmware.com>) |
List | pgsql-hackers |
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> On 03.09.2013 05:28, Boguk, Maksym wrote:
>> Target usage: ability to store UTF8 national characters in some
>> selected fields inside a single-byte encoded database.

> I think we should take a completely different approach to this. Two
> alternatives spring to mind:

> 1. Implement a new encoding. The new encoding would be some variant of
> UTF-8 that encodes languages like Russian more efficiently.

+1.  I'm not sure that SCSU satisfies the requirement (which I read as
that Russian text should be pretty much 1 byte/character).  But surely
we could devise a variant that does.  For instance, it could look like
koi8r (or any other single-byte encoding of your choice) with one byte
value, say 255, reserved as a prefix.  255 means that a UTF8 character
follows.  The main complication here is that you don't want to allow more
than one way to represent a character --- else you break text hashing,
for instance.  So you'd have to take care that you never emit the 255+UTF8
representation for a character that can be represented in the single-byte
encoding.  In particular, you'd never encode ASCII that way, and thus this
would satisfy the all-multibyte-chars-must-have-all-high-bits-set rule.

Ideally we could make a variant like this for each supported single-byte
encoding, and thus you could optimize a database for "mostly but not
entirely LATIN1 text", etc.

			regards, tom lane
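
The prefix-byte scheme described above can be sketched roughly as follows. This is only an illustration in Python, not code from the thread or from PostgreSQL: the names `encode`, `decode`, `PREFIX` and `BASE` are hypothetical, KOI8-R is assumed as the base encoding, and 0xFF is assumed as the reserved prefix. One consequence of reserving 0xFF is that the character KOI8-R normally stores there (Ъ) has to use the escaped form, since its single-byte slot is taken.

```python
# Sketch of a "single-byte encoding plus UTF-8 escape" scheme.
# Assumptions (not from the original message): KOI8-R base, prefix byte 0xFF.
PREFIX = 0xFF
BASE = "koi8_r"


def encode(text: str) -> bytes:
    """Encode text, using one byte per character whenever BASE allows it."""
    out = bytearray()
    for ch in text:
        try:
            single = ch.encode(BASE)
        except UnicodeEncodeError:
            single = None
        if single is not None and single[0] != PREFIX:
            # Canonical form: representable characters are never escaped.
            out += single
        else:
            # Everything else is the prefix followed by the UTF-8 bytes.
            out.append(PREFIX)
            out += ch.encode("utf-8")
    return bytes(out)


def _utf8_len(lead: int) -> int:
    """Length of a multibyte UTF-8 sequence, judged from its lead byte."""
    if lead >= 0xF0:
        return 4
    if lead >= 0xE0:
        return 3
    if lead >= 0xC0:
        return 2
    raise ValueError("ASCII or continuation byte after prefix")


def decode(data: bytes) -> str:
    """Decode, rejecting any non-canonical (needlessly escaped) character."""
    out = []
    i = 0
    while i < len(data):
        if data[i] != PREFIX:
            out.append(bytes([data[i]]).decode(BASE))
            i += 1
            continue
        if i + 1 >= len(data):
            raise ValueError("truncated escape sequence")
        n = _utf8_len(data[i + 1])
        ch = data[i + 1:i + 1 + n].decode("utf-8")  # raises on invalid UTF-8
        if encode(ch) != data[i:i + 1 + n]:
            # A second spelling of the same character would break hashing.
            raise ValueError("non-canonical escape sequence")
        out.append(ch)
        i += 1 + n
    return "".join(out)
```

With these assumptions, ordinary Russian text stays at one byte per character, and every byte of an escaped character has its high bit set, because ASCII is never escaped and non-ASCII UTF-8 sequences consist only of bytes 0x80 and above. The decoder's canonical-form check is one way to enforce the "only one representation per character" requirement on input; a real implementation would presumably do this in the encoding verification routine.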