Re: JSON and unicode surrogate pairs
From | Noah Misch |
---|---|
Subject | Re: JSON and unicode surrogate pairs |
Date | |
Msg-id | 20130611222652.GA577456@tornado.leadboat.com |
In reply to | Re: JSON and unicode surrogate pairs (Andrew Dunstan <andrew@dunslane.net>) |
Responses | Re: JSON and unicode surrogate pairs |
List | pgsql-hackers |
On Tue, Jun 11, 2013 at 02:10:45PM -0400, Andrew Dunstan wrote:
>
> On 06/10/2013 11:22 PM, Noah Misch wrote:
>> On Mon, Jun 10, 2013 at 11:20:13AM -0400, Andrew Dunstan wrote:
>>> On 06/10/2013 10:18 AM, Tom Lane wrote:
>>>> Andrew Dunstan <andrew@dunslane.net> writes:
>>>>> After thinking about this some more I have come to the conclusion that
>>>>> we should only do any de-escaping of \uxxxx sequences, whether or not
>>>>> they are for BMP characters, when the server encoding is utf8. For any
>>>>> other encoding, which is already a violation of the JSON standard
>>>>> anyway, and should be avoided if you're dealing with JSON, we should
>>>>> just pass them through even in text output. This will be a simple and
>>>>> very localized fix.
>>>> Hmm. I'm not sure that users will like this definition --- it will seem
>>>> pretty arbitrary to them that conversion of \u sequences happens in some
>>>> databases and not others.
>> Yep. Suppose you have a LATIN1 database. Changing it to a UTF8 database
>> where everyone uses client_encoding = LATIN1 should not change the semantics
>> of successful SQL statements. Some statements that fail with one database
>> encoding will succeed in the other, but a user should not witness a changed
>> non-error result. (Except functions like decode() that explicitly expose byte
>> representations.) Having "SELECT '["\u00e4"]'::json ->> 0" emit 'ä' in the
>> UTF8 database and '\u00e4' in the LATIN1 database would move PostgreSQL in the
>> wrong direction relative to that ideal.

> As a final counter example, let me note that Postgres itself handles
> Unicode escapes differently in UTF8 databases - in other databases it
> only accepts Unicode escapes up to U+007f, i.e. ASCII characters.

I don't see a counterexample there; every database that accepts without error
a given Unicode escape produces from it the same text value. The proposal to
which I objected was akin to having non-UTF8 databases silently translate
E'\u0220' to E'\\u0220'.

--
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com
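
The invariant argued for above (an escape either yields the same text value in every database encoding or fails with an error, and is never silently passed through) can be illustrated with a minimal Python sketch. It is not PostgreSQL's implementation; the names `decode_u_escapes` and `to_server_encoding` are hypothetical, and Python codec names stand in for server encodings.

```python
import re

# Matches one JSON-style \uXXXX escape (hypothetical helper, for illustration only).
_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')


def decode_u_escapes(text):
    """Replace \\uXXXX escapes with characters, combining surrogate pairs."""
    out = []
    i = 0
    while i < len(text):
        m = _ESCAPE.match(text, i)
        if m is None:
            if text[i] == '\\':
                # Some other escape (e.g. an escaped backslash): copy it verbatim
                # so that "\\u00e4" is not mistaken for a Unicode escape.
                out.append(text[i:i + 2])
                i += 2
            else:
                out.append(text[i])
                i += 1
            continue
        code = int(m.group(1), 16)
        i = m.end()
        if 0xD800 <= code <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate escape.
            low = _ESCAPE.match(text, i)
            if low is None or not 0xDC00 <= int(low.group(1), 16) <= 0xDFFF:
                raise ValueError("high surrogate not followed by low surrogate")
            code = 0x10000 + ((code - 0xD800) << 10) + (int(low.group(1), 16) - 0xDC00)
            i = low.end()
        elif 0xDC00 <= code <= 0xDFFF:
            raise ValueError("low surrogate without preceding high surrogate")
        out.append(chr(code))
    return "".join(out)


def to_server_encoding(text, encoding):
    """De-escape, then fail loudly if the result does not fit the encoding."""
    value = decode_u_escapes(text)
    value.encode(encoding)  # raises UnicodeEncodeError instead of passing the escape through
    return value
```

Under these assumptions, `to_server_encoding(r'\u00e4', 'latin-1')` and `to_server_encoding(r'\u00e4', 'utf-8')` return the same text value, while `to_server_encoding(r'\u0220', 'latin-1')` raises an error rather than returning the escape unchanged.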