Re: chr() is still too loose about UTF8 code points
From | Heikki Linnakangas |
---|---|
Subject | Re: chr() is still too loose about UTF8 code points |
Date | |
Msg-id | 5376404B.8090106@vmware.com |
In reply to | chr() is still too loose about UTF8 code points (Tom Lane <tgl@sss.pgh.pa.us>) |
Replies |
Re: chr() is still too loose about UTF8 code points
Re: chr() is still too loose about UTF8 code points |
List | pgsql-hackers |
On 05/16/2014 06:05 PM, Tom Lane wrote:
> Quite some time ago, we made the chr() function accept Unicode code points
> up to U+1FFFFF, which is the largest value that will fit in a 4-byte UTF8
> string. It was pointed out to me though that RFC3629 restricted the
> original definition of UTF8 to only allow code points up to U+10FFFF (for
> compatibility with UTF16). While that might not be something we feel we
> need to follow exactly, pg_utf8_islegal implements the checking algorithm
> specified by RFC3629, and will therefore reject points above U+10FFFF.
>
> This means you can use chr() to create values that will be rejected on
> dump and reload:
>
> u8=# create table tt (f1 text);
> CREATE TABLE
> u8=# insert into tt values(chr('x001fffff'::bit(32)::int));
> INSERT 0 1
> u8=# select * from tt;
>  f1
> ----
>
> (1 row)
>
> u8=# \copy tt to 'junk'
> COPY 1
> u8=# \copy tt from 'junk'
> ERROR: 22021: invalid byte sequence for encoding "UTF8": 0xf7 0xbf 0xbf 0xbf
> CONTEXT: COPY tt, line 1
> LOCATION: report_invalid_encoding, wchar.c:2011
>
> I think this probably means we need to change chr() to reject code points
> above 10ffff. Should we back-patch that, or just do it in HEAD?

+1 for back-patching. A value that cannot be restored is bad, and I
can't imagine any legitimate use case for producing a Unicode character
larger than U+10FFFF with chr(x), when the rest of the system doesn't
handle it. Fully supporting such values might be useful, but that's a
different story.

- Heikki
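
For readers who want to see where the 0xf7 0xbf 0xbf 0xbf in the error message comes from, here is a rough sketch in Python. It is not PostgreSQL's code: `encode_4byte` and `is_legal_utf8` are hypothetical helper names, and Python's strict UTF-8 decoder is used only as a stand-in for an RFC 3629 style check such as the one pg_utf8_islegal performs. It packs a code point into the 4-byte UTF-8 bit layout without an upper-bound check, then asks the validator whether the result is legal.

```python
# Sketch only: illustrates the 4-byte UTF-8 bit layout, not PostgreSQL's C code.

MAX_RFC3629 = 0x10FFFF  # highest code point RFC 3629 allows
MAX_4BYTE = 0x1FFFFF    # highest value that fits in the 21 payload bits


def encode_4byte(cp: int) -> bytes:
    """Pack cp into 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.

    Hypothetical helper: it deliberately checks only that the value fits
    in 21 bits, which is the loose behaviour described in the thread.
    """
    if not 0x10000 <= cp <= MAX_4BYTE:
        raise ValueError("code point does not map to a 4-byte sequence")
    return bytes([
        0xF0 | (cp >> 18),            # lead byte carries the top 3 bits
        0x80 | ((cp >> 12) & 0x3F),   # continuation bytes carry 6 bits each
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


def is_legal_utf8(data: bytes) -> bool:
    """Assumption: Python's strict decoder stands in for an RFC 3629 check."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    for cp in (MAX_RFC3629, MAX_4BYTE):
        raw = encode_4byte(cp)
        print(f"U+{cp:06X} -> {raw.hex(' ')} legal={is_legal_utf8(raw)}")
```

In this sketch U+10FFFF encodes to f4 8f bf bf and passes validation, while U+1FFFFF encodes to f7 bf bf bf and is rejected, which matches the byte sequence reported by \copy above. Rejecting anything above `MAX_RFC3629` before encoding is the kind of check the proposed chr() change would add.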