Re: Implementing full UTF-8 support (aka supporting 0x00)
From | Álvaro Hernández Tortosa |
---|---|
Subject | Re: Implementing full UTF-8 support (aka supporting 0x00) |
Date | |
Msg-id | b2f6204e-a4a1-06e9-f333-1b18477d3504@8kdata.com |
In reply to | Re: Implementing full UTF-8 support (aka supporting 0x00) (Álvaro Hernández Tortosa <aht@8kdata.com>) |
Responses | Re: Implementing full UTF-8 support (aka supporting 0x00); Re: Implementing full UTF-8 support (aka supporting 0x00) |
List | pgsql-hackers |
On 03/08/16 20:14, Álvaro Hernández Tortosa wrote:
>
> On 03/08/16 17:47, Kevin Grittner wrote:
>> On Wed, Aug 3, 2016 at 9:54 AM, Álvaro Hernández Tortosa
>> <aht@8kdata.com> wrote:
>>
>>> What would it take to support it?
>>
>> Would it be of any value to support "Modified UTF-8"?
>>
>> https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8
>
> That's nice, but I don't think so.
>
> The problem is that you cannot predict how people would send you
> data, like when importing from other databases. I guess it may work if
> Postgres would implement such UTF-8 variant and also the drivers, but
> that would still require an encoding conversion (i.e., parsing every
> string) to change the 0x00, which seems like a serious performance hit.
>
> It could be better than nothing, though!
>
> Thanks,
>
> Álvaro

It may indeed work.

According to https://en.wikipedia.org/wiki/UTF-8#Codepage_layout the encoding used in Modified UTF-8 is an (otherwise) invalid UTF-8 code point. In short, the NUL character (\u0000) is represented (as an overlong encoding) by the two-byte, one-character sequence 0xC0 0x80. These two bytes are invalid UTF-8, so they should not appear in an otherwise valid UTF-8 string. Yet they are accepted by Postgres (as if Postgres supported Modified UTF-8 intentionally). The character in psql does not render as a NUL but as this symbol: "삀".

Given that this works, the process would look like this:

- Parse all input data looking for bytes with hex value 0x00. If they appear in the string, they are the null byte.
- Replace that byte with the two bytes 0xC0 0x80.
- Reverse the operation when reading.

This is OK, but of course it is a performance hit (searching for 0x00 and then growing the byte[] or whatever data structure to account for the extra byte). A little bit of a PITA, but I guess better than fixing it all :)

Álvaro

--
Álvaro Hernández Tortosa

8Kdata
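
For illustration only, the three steps above could be sketched in Python roughly as follows. This is a minimal sketch, not anything taken from Postgres or from a driver: the function names `encode_nul`, `decode_nul`, and `is_strict_utf8` are hypothetical, and the input is assumed to be valid UTF-8 that may additionally contain 0x00 bytes. Under that assumption the reverse step is unambiguous, because the byte 0xC0 never occurs in valid UTF-8. Note also that the byte pair 0xC0 0x80 is not the same thing as the code point U+C080, which UTF-8 encodes as the three bytes 0xEC 0x82 0x80.

```python
NUL = b"\x00"
# Overlong two-byte form that Modified UTF-8 uses for U+0000.
MODIFIED_NUL = b"\xc0\x80"


def encode_nul(data: bytes) -> bytes:
    """Write path: replace every 0x00 byte with the two bytes 0xC0 0x80."""
    return data.replace(NUL, MODIFIED_NUL)


def decode_nul(data: bytes) -> bytes:
    """Read path: reverse the substitution, restoring the 0x00 bytes."""
    return data.replace(MODIFIED_NUL, NUL)


def is_strict_utf8(data: bytes) -> bool:
    """Report whether Python's strict UTF-8 decoder accepts these bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    original = "a\x00b".encode("utf-8")
    stored = encode_nul(original)
    print(stored)                           # bytes with 0xC0 0x80 in place of 0x00
    print(decode_nul(stored) == original)   # round trip restores the input
    print(is_strict_utf8(stored))           # Python's decoder rejects the overlong form
```

Both directions scan the whole string, and the write path grows it by one byte per NUL, which is the performance cost mentioned above. Python's own decoder rejects the 0xC0 0x80 pair, so whether a given server or driver accepts it would need to be checked against the raw bytes rather than against a `\uc080` escape.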