Discussion: JSON and unicode surrogate pairs
In 9.2, the JSON parser didn't check the validity of the use of unicode escapes other than that it required 4 hex digits to follow '\u'. In 9.3, that is still the case. However, the JSON accessor functions and operators also try to turn JSON strings into text in the server encoding, and this includes de-escaping \u sequences. This works fine except when there is a pair of sequences representing a UTF-16 type surrogate pair, something that is explicitly permitted in the JSON spec.

The attached patch is an attempt to remedy that, and a surrogate pair is turned into the correct code point before converting it to whatever the server encoding is.

Note that this would mean we can still put JSON with incorrect use of surrogates into the database, as now (9.2 and later), and they will cause almost all the accessor functions to raise an error, as now (9.3). All this does is allow JSON that uses surrogates correctly not to fail when applying the accessor functions and operators. That's a possible violation of POLA, and at least worthy of a note in the docs, but I'm not sure what else we can do now - adding this check to the input lexer would possibly cause restores to fail, which users might not thank us for.

cheers

andrew
Attachments
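For readers following along, the recombination described above is the standard UTF-16 rule: a high surrogate in the range \ud800-\udbff followed by a low surrogate in \udc00-\udfff encodes a single code point above U+FFFF. A minimal, self-contained C sketch of that step (generic code, not the attached patch):

    #include <stdio.h>

    /*
     * Combine a UTF-16 surrogate pair into a single Unicode code point.
     * hi must be in 0xD800..0xDBFF and lo in 0xDC00..0xDFFF.
     */
    static unsigned int
    combine_surrogates(unsigned int hi, unsigned int lo)
    {
        return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
    }

    int
    main(void)
    {
        /* the pair \ud83d \ude04 encodes U+1F604 (an emoji smiley) */
        unsigned int cp = combine_surrogates(0xD83D, 0xDE04);

        printf("U+%X\n", cp);   /* prints U+1F604 */
        return 0;
    }

Only after this recombination is the resulting code point converted to the server encoding, which is where conversion can still fail if the server encoding cannot represent the character.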
On Wed, Jun 5, 2013 at 10:46 AM, Andrew Dunstan <andrew@dunslane.net> wrote: > In 9.2, the JSON parser didn't check the validity of the use of unicode > escapes other than that it required 4 hex digits to follow '\u'. In 9.3, > that is still the case. However, the JSON accessor functions and operators > also try to turn JSON strings into text in the server encoding, and this > includes de-escaping \u sequences. This works fine except when there is a > pair of sequences representing a UTF-16 type surrogate pair, something that > is explicitly permitted in the JSON spec. > > The attached patch is an attempt to remedy that, and a surrogate pair is > turned into the correct code point before converting it to whatever the > server encoding is. > > Note that this would mean we can still put JSON with incorrect use of > surrogates into the database, as now (9.2 and later), and they will cause > almost all the accessor functions to raise an error, as now (9.3). All this > does is allow JSON that uses surrogates correctly not to fail when applying > the accessor functions and operators. That's a possible violation of POLA, > and at least worth of a note in the docs, but I'm not sure what else we can > do now - adding this check to the input lexer would possibly cause restores > to fail, which users might not thank us for. I think the approach you've proposed here is a good one. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 06/06/2013 12:53 PM, Robert Haas wrote: > On Wed, Jun 5, 2013 at 10:46 AM, Andrew Dunstan <andrew@dunslane.net> wrote: >> In 9.2, the JSON parser didn't check the validity of the use of unicode >> escapes other than that it required 4 hex digits to follow '\u'. In 9.3, >> that is still the case. However, the JSON accessor functions and operators >> also try to turn JSON strings into text in the server encoding, and this >> includes de-escaping \u sequences. This works fine except when there is a >> pair of sequences representing a UTF-16 type surrogate pair, something that >> is explicitly permitted in the JSON spec. >> >> The attached patch is an attempt to remedy that, and a surrogate pair is >> turned into the correct code point before converting it to whatever the >> server encoding is. >> >> Note that this would mean we can still put JSON with incorrect use of >> surrogates into the database, as now (9.2 and later), and they will cause >> almost all the accessor functions to raise an error, as now (9.3). All this >> does is allow JSON that uses surrogates correctly not to fail when applying >> the accessor functions and operators. That's a possible violation of POLA, >> and at least worth of a note in the docs, but I'm not sure what else we can >> do now - adding this check to the input lexer would possibly cause restores >> to fail, which users might not thank us for. > I think the approach you've proposed here is a good one. > I did that, but it's evident from the buildfarm that there's more work to do. The problem is that we do the de-escaping as we lex the json to construct the look ahead token, and at that stage we don't know whether or not it's really going to be needed. That means we can cause errors to be raised in far too many places. It's failing on this line: converted = pg_any_to_server(utf8str, utf8len, PG_UTF8); even though the operator in use ("->") doesn't even use the de-escaped value. The real solution is going to be to delay the de-escaping of the string until it is known to be wanted. That's unfortunately going to be a bit invasive, but I can't see a better solution. I'll work on it ASAP. Getting it to work well without a small API change might be pretty hard, though. cheers andrew
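To illustrate the shape such delayed de-escaping could take (a sketch of the idea only, with made-up names rather than the actual json.c API): keep the raw token text around and perform the conversion only in the code paths that actually need the de-escaped text, so that encoding errors surface only where the value is really wanted.

    #include <stddef.h>

    /*
     * Hypothetical lexeme holder; the field and function names are
     * illustrative, not PostgreSQL's.
     */
    typedef struct JsonLexeme
    {
        char   *raw;        /* token text exactly as found in the JSON */
        char   *deescaped;  /* computed lazily, only when requested */
    } JsonLexeme;

    /* Operators like "->" that return json can use the raw token ... */
    static const char *
    lexeme_as_json(JsonLexeme *lex)
    {
        return lex->raw;
    }

    /*
     * ... while text-returning operators like "->>" ask for the converted
     * form, so any encoding-conversion error is raised only when the
     * de-escaped value is genuinely wanted.
     */
    static const char *
    lexeme_as_text(JsonLexeme *lex, char *(*deescape) (const char *))
    {
        if (lex->deescaped == NULL)
            lex->deescaped = deescape(lex->raw);
        return lex->deescaped;
    }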
Andrew Dunstan <andrew@dunslane.net> writes: > I did that, but it's evident from the buildfarm that there's more work > to do. The problem is that we do the de-escaping as we lex the json to > construct the look ahead token, and at that stage we don't know whether > or not it's really going to be needed. That means we can cause errors to > be raised in far too many places. It's failing on this line: > converted = pg_any_to_server(utf8str, utf8len, PG_UTF8); > even though the operator in use ("->") doesn't even use the de-escaped > value. > The real solution is going to be to delay the de-escaping of the string > until it is known to be wanted. That's unfortunately going to be a bit > invasive, but I can't see a better solution. I'll work on it ASAP. Not sure that this idea isn't a dead end. IIUC, you're proposing to jump through hoops in order to avoid complaining about illegal JSON data, essentially just for backwards compatibility with 9.2's failure to complain about it. If we switch over to a pre-parsed (binary) storage format for JSON values, won't we be forced to throw these errors anyway? If so, maybe we should just take the compatibility hit now while there's still a relatively small amount of stored JSON data in the wild. regards, tom lane
On 06/09/2013 07:47 PM, Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> I did that, but it's evident from the buildfarm that there's more work >> to do. The problem is that we do the de-escaping as we lex the json to >> construct the look ahead token, and at that stage we don't know whether >> or not it's really going to be needed. That means we can cause errors to >> be raised in far too many places. It's failing on this line: >> converted = pg_any_to_server(utf8str, utf8len, PG_UTF8); >> even though the operator in use ("->") doesn't even use the de-escaped >> value. >> The real solution is going to be to delay the de-escaping of the string >> until it is known to be wanted. That's unfortunately going to be a bit >> invasive, but I can't see a better solution. I'll work on it ASAP. > Not sure that this idea isn't a dead end. IIUC, you're proposing to > jump through hoops in order to avoid complaining about illegal JSON > data, essentially just for backwards compatibility with 9.2's failure to > complain about it. If we switch over to a pre-parsed (binary) storage > format for JSON values, won't we be forced to throw these errors anyway? > If so, maybe we should just take the compatibility hit now while there's > still a relatively small amount of stored JSON data in the wild.

No, I probably haven't explained it very well. Here is the regression diff from jacana:

  ERROR: cannot call json_populate_recordset on a nested object
  -- handling of unicode surrogate pairs
  select json '{ "a": "\ud83d\ude04\ud83d\udc36" }' -> 'a' as correct;
!           correct
! ----------------------------
!  "\ud83d\ude04\ud83d\udc36"
! (1 row)
!
  select json '{ "a": "\ud83d\ud83d" }' -> 'a'; -- 2 high surrogates in a row
  ERROR: invalid input syntax for type json
  DETAIL: high order surrogate must not follow a high order surrogate.
--- 922,928 ----
  ERROR: cannot call json_populate_recordset on a nested object
  -- handling of unicode surrogate pairs
  select json '{ "a": "\ud83d\ude04\ud83d\udc36" }' -> 'a' as correct;
! ERROR: character with byte sequence 0xf0 0x9f 0x98 0x84 in encoding "UTF8" has no equivalent in encoding "WIN1252"
  select json '{ "a": "\ud83d\ud83d" }' -> 'a'; -- 2 high surrogates in a row
  ERROR: invalid input syntax for type json
  DETAIL: high order surrogate must not follow a high order surrogate.

The sequence in question is two perfectly valid surrogate pairs.

...

After thinking about this some more I have come to the conclusion that we should only do any de-escaping of \uxxxx sequences, whether or not they are for BMP characters, when the server encoding is utf8. For any other encoding, which is already a violation of the JSON standard anyway, and should be avoided if you're dealing with JSON, we should just pass them through even in text output. This will be a simple and very localized fix.

We'll still have to deal with this issue when we get to binary storage of JSON, but that's not something we need to confront today.

cheers

andrew
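To make the proposal concrete, here is a small self-contained illustration of the suggested behaviour (a sketch only, not the actual patch): given the UTF-8 bytes that a \uXXXX sequence (or surrogate pair) decodes to, plus the original escape text, the decoded bytes are used only when the server encoding is UTF-8; any other server encoding gets the literal escape text passed through.

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    /*
     * Sketch of the proposed rule, not the actual json.c code: copy the
     * decoded UTF-8 bytes into the output only for a UTF-8 server
     * encoding; for every other server encoding keep the "\uXXXX" text
     * exactly as it appeared in the JSON datum.
     */
    static size_t
    emit_unicode_escape(char *out, const char *utf8_bytes, size_t utf8_len,
                        const char *escape_text, bool server_is_utf8)
    {
        size_t esc_len = strlen(escape_text);

        if (server_is_utf8)
        {
            memcpy(out, utf8_bytes, utf8_len);
            return utf8_len;
        }

        /* non-UTF8 server encoding: keep the escape text verbatim */
        memcpy(out, escape_text, esc_len);
        return esc_len;
    }

    int
    main(void)
    {
        char buf[32] = {0};

        /* U+00A9 (copyright sign) is 0xC2 0xA9 in UTF-8 */
        size_t n = emit_unicode_escape(buf, "\xC2\xA9", 2, "\\u00a9", false);

        printf("%.*s\n", (int) n, buf);   /* prints \u00a9 when not UTF-8 */
        return 0;
    }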
Andrew Dunstan <andrew@dunslane.net> writes: > After thinking about this some more I have come to the conclusion that > we should only do any de-escaping of \uxxxx sequences, whether or not > they are for BMP characters, when the server encoding is utf8. For any > other encoding, which is already a violation of the JSON standard > anyway, and should be avoided if you're dealing with JSON, we should > just pass them through even in text output. This will be a simple and > very localized fix. Hmm. I'm not sure that users will like this definition --- it will seem pretty arbitrary to them that conversion of \u sequences happens in some databases and not others. > We'll still have to deal with this issue when we get to binary storage > of JSON, but that's not something we need to confront today. Well, if we have to break backwards compatibility when we try to do binary storage, we're not going to be happy either. So I think we'd better have a plan in mind for what will happen then. regards, tom lane
On 06/10/2013 10:18 AM, Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> After thinking about this some more I have come to the conclusion that >> we should only do any de-escaping of \uxxxx sequences, whether or not >> they are for BMP characters, when the server encoding is utf8. For any >> other encoding, which is already a violation of the JSON standard >> anyway, and should be avoided if you're dealing with JSON, we should >> just pass them through even in text output. This will be a simple and >> very localized fix. > Hmm. I'm not sure that users will like this definition --- it will seem > pretty arbitrary to them that conversion of \u sequences happens in some > databases and not others. Then what should we do when there is no matching codepoint in the database encoding? First we'll have to delay the evaluation so it's not done over-eagerly, and then we'll have to try the conversion and throw an error if it doesn't work. The second part is what's happening now, but the delayed evaluation is not. Or we could abandon the conversion altogether, but that doesn't seem very friendly either. I suspect the biggest case for people to use these sequences is where the database is UTF8 but the client encoding is not. Frankly, if you want to use Unicode escapes, you should really be using a UTF8 encoded database if at all possible. > >> We'll still have to deal with this issue when we get to binary storage >> of JSON, but that's not something we need to confront today. > Well, if we have to break backwards compatibility when we try to do > binary storage, we're not going to be happy either. So I think we'd > better have a plan in mind for what will happen then. > > I don't see any reason why we couldn't store the JSON strings with the Unicode escape sequences intact in the binary format. What the binary format buys us is that it has decomposed the JSON into a tree structure, so instead of parsing the JSON we can just walk the tree, but the leaf nodes of the tree are still (in the case of the nodes under discussion) text-like objects. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Or we could abandon the conversion altogether, but that doesn't seem > very friendly either. I suspect the biggest case for people to use these > sequences is where the database is UTF8 but the client encoding is not. Well, if that's actually the biggest use-case, then maybe we should just say we're *not* in the business of converting those escapes. That would make things nice and consistent regardless of the DB encoding, and it would avoid the problem of being able to input a value and then not being able to output it again. It's legal, is it not, to just write the equivalent Unicode character in the JSON string and not use the escapes? If so I would think that that would be the most common usage. If someone's writing an escape, they probably had a reason for doing it that way, and might not appreciate our overriding their decision. regards, tom lane
On 06/10/2013 11:43 AM, Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> Or we could abandon the conversion altogether, but that doesn't seem >> very friendly either. I suspect the biggest case for people to use these >> sequences is where the database is UTF8 but the client encoding is not. > Well, if that's actually the biggest use-case, then maybe we should just > say we're *not* in the business of converting those escapes. That would > make things nice and consistent regardless of the DB encoding, and it > would avoid the problem of being able to input a value and then not > being able to output it again. > > It's legal, is it not, to just write the equivalent Unicode character in > the JSON string and not use the escapes? If so I would think that that > would be the most common usage. If someone's writing an escape, they > probably had a reason for doing it that way, and might not appreciate > our overriding their decision.

We never store the converted values in the JSON object, nor do we return them from functions that return JSON. But many of the functions and operators that process the JSON have variants that return text instead of JSON, and in those cases, when the value returned is a JSON string, we do the following to it:

* strip the outside quotes
* de-escape the various escaped characters (i.e. everything preceded by a backslash in the railroad diagram for string at <http://www.json.org/>)

Here's an example of the difference:

andrew=# select '{ "a": "\u00a9"}'::json -> 'a';
 ?column?
----------
 "\u00a9"
(1 row)

andrew=# select '{ "a": "\u00a9"}'::json ->> 'a';
 ?column?
----------
 ©
(1 row)

It's the process of producing the latter that is giving us a headache in non-UTF8 databases.

... [ more caffeine is consumed ] ...

I have just realized that the problem is actually quite a lot bigger than that. We also use this value for field name comparison. So, let us suppose that we have a LATIN1 database and a piece of JSON with a field name containing the Euro sign ("\u20ac"), a character that is not in LATIN1. Making that processable so it doesn't blow up would be mighty tricky and error prone. The non-orthogonality I suggested as a solution upthread is, by contrast, very small and easy to manage, and not terribly hard to explain - see attached.

cheers

andrew
Attachments
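For reference, the two steps just described (strip the outer quotes, then de-escape backslash sequences) amount to something like the following self-contained sketch for the simple two-character escapes; \uXXXX escapes additionally go through the surrogate and encoding handling discussed above. This is illustrative code, not the server implementation:

    #include <stdio.h>
    #include <string.h>

    /*
     * De-escape a quoted JSON string token into out (which must be at
     * least as large as in). Handles the two-character escapes; a real
     * implementation would also decode \uXXXX sequences here.
     */
    static void
    json_string_to_text(const char *in, char *out)
    {
        size_t      len = strlen(in);
        const char *p = in + 1;              /* strip the outside quotes */
        const char *end = in + len - 1;

        while (p < end)
        {
            if (*p == '\\' && p + 1 < end)
            {
                switch (p[1])
                {
                    case 'b': *out++ = '\b'; break;
                    case 'f': *out++ = '\f'; break;
                    case 'n': *out++ = '\n'; break;
                    case 'r': *out++ = '\r'; break;
                    case 't': *out++ = '\t'; break;
                    default:  *out++ = p[1]; break;   /* \" \\ \/ */
                }
                p += 2;
            }
            else
                *out++ = *p++;
        }
        *out = '\0';
    }

    int
    main(void)
    {
        char buf[64];

        json_string_to_text("\"a \\\"quoted\\\" word\"", buf);
        printf("%s\n", buf);          /* prints: a "quoted" word */
        return 0;
    }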
On Mon, Jun 10, 2013 at 10:18 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Well, if we have to break backwards compatibility when we try to do > binary storage, we're not going to be happy either. So I think we'd > better have a plan in mind for what will happen then. Who says we're ever going to do any such thing? This was extensively debated when we added the original type, and I thought that it was agreed that we might ultimately need both a type that stored JSON as text and another that stored it as binary. And we might need an XML-binary type as well. But there are also cases where storing the data as text is *better*, and I don't see us ever getting rid of that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 06/11/2013 12:07 AM, Robert Haas wrote: > On Mon, Jun 10, 2013 at 10:18 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Well, if we have to break backwards compatibility when we try to do >> binary storage, we're not going to be happy either. So I think we'd >> better have a plan in mind for what will happen then. > Who says we're ever going to do any such thing? This was extensively > debated when we added the original type, and I thought that it was > agreed that we might ultimately need both a type that stored JSON as > text and another that stored it as binary. This is where the compatibility comes in - we do want both to accept the same textual format. > And we might need an > XML-binary type as well. But there are also cases where storing the > data as text is *better*, Then use text :) > and I don't see us ever getting rid of that. While JSON is a "serialisation format", most things people want to use it for are actually structured types, not their serialisation to text. The serialisation should happen automatically. -- Hannu Krosing PostgreSQL Consultant Performance, Scalability and High Availability 2ndQuadrant Nordic OÜ
On 06/10/2013 06:07 PM, Robert Haas wrote: > On Mon, Jun 10, 2013 at 10:18 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Well, if we have to break backwards compatibility when we try to do >> binary storage, we're not going to be happy either. So I think we'd >> better have a plan in mind for what will happen then. > Who says we're ever going to do any such thing? This was extensively > debated when we added the original type, and I thought that it was > agreed that we might ultimately need both a type that stored JSON as > text and another that stored it as binary. And we might need an > XML-binary type as well. But there are also cases where storing the > data as text is *better*, and I don't see us ever getting rid of that. > It was discussed at Pgcon as a result of Oleg and Teodor's talk, and at the Unconference. But in any case it's moot here. None of what I'm suggesting has anything to do with the storage representation of JSON, only with how we process it in whatever form. And none of it will break backwards compatibility at all. So, please, let's concentrate on the problem that's actually at hand. cheers andrew
On Mon, Jun 10, 2013 at 11:20:13AM -0400, Andrew Dunstan wrote: > > On 06/10/2013 10:18 AM, Tom Lane wrote: >> Andrew Dunstan <andrew@dunslane.net> writes: >>> After thinking about this some more I have come to the conclusion that >>> we should only do any de-escaping of \uxxxx sequences, whether or not >>> they are for BMP characters, when the server encoding is utf8. For any >>> other encoding, which is already a violation of the JSON standard >>> anyway, and should be avoided if you're dealing with JSON, we should >>> just pass them through even in text output. This will be a simple and >>> very localized fix. >> Hmm. I'm not sure that users will like this definition --- it will seem >> pretty arbitrary to them that conversion of \u sequences happens in some >> databases and not others. Yep. Suppose you have a LATIN1 database. Changing it to a UTF8 database where everyone uses client_encoding = LATIN1 should not change the semantics of successful SQL statements. Some statements that fail with one database encoding will succeed in the other, but a user should not witness a changed non-error result. (Except functions like decode() that explicitly expose byte representations.) Having "SELECT '["\u00e4"]'::json ->> 0" emit 'ä' in the UTF8 database and '\u00e4' in the LATIN1 database would move PostgreSQL in the wrong direction relative to that ideal. > Then what should we do when there is no matching codepoint in the > database encoding? First we'll have to delay the evaluation so it's not > done over-eagerly, and then we'll have to try the conversion and throw > an error if it doesn't work. The second part is what's happening now, > but the delayed evaluation is not. +1 for doing it that way. Thanks, nm -- Noah Misch EnterpriseDB http://www.enterprisedb.com
On 2013-06-10 13:01:29 -0400, Andrew Dunstan wrote: > >It's legal, is it not, to just write the equivalent Unicode character in > >the JSON string and not use the escapes? If so I would think that that > >would be the most common usage. If someone's writing an escape, they > >probably had a reason for doing it that way, and might not appreciate > >our overriding their decision. > We never store the converted values in the JSON object, nor do we return > them from functions that return JSON. But many of the functions and > operators that process the JSON have variants that return text instead of > JSON, and in those cases, when the value returned is a JSON string, we do > the following to it: > > I have just realized that the problem is actually quite a lot bigger than > that. We also use this value for field name comparison. So, let us suppose > that we have a LATIN1 database and a piece of JSON with a field name > containing the Euro sign ("\u20ac"), a character that is not in LATIN1. > Making that processable so it doesn't blow up would be mighty tricky and > error prone. The non-orthogonality I suggested as a solution upthread is, by > contrast, very small and easy to manage, and not terribly hard to explain - > see attached. I think this all shows pretty clearly that it was a mistake allowing json data in the database that we cannot entirely display with the database's encoding. All the proposed ugly workarounds are only necessary because we don't throw an error when originally validating the json. Even in an utf-8 database you can get errors due to \u unescaping (at attribute access time, *NOT* at json_in() time) due to invalid surrogate pairs. I think this goes contrary to the normal postgres approach of validating data as strict as necessary. And I think we are going to regret not fixing this while there are still relatively few users out there. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 06/11/2013 10:47 AM, Andres Freund wrote: > On 2013-06-10 13:01:29 -0400, Andrew Dunstan wrote: >>> It's legal, is it not, to just write the equivalent Unicode character in >>> the JSON string and not use the escapes? If so I would think that that >>> would be the most common usage. If someone's writing an escape, they >>> probably had a reason for doing it that way, and might not appreciate >>> our overriding their decision. >> We never store the converted values in the JSON object, nor do we return >> them from functions that return JSON. But many of the functions and >> operators that process the JSON have variants that return text instead of >> JSON, and in those cases, when the value returned is a JSON string, we do >> the following to it: >> >> I have just realized that the problem is actually quite a lot bigger than >> that. We also use this value for field name comparison. So, let us suppose >> that we have a LATIN1 database and a piece of JSON with a field name >> containing the Euro sign ("\u20ac"), a character that is not in LATIN1. >> Making that processable so it doesn't blow up would be mighty tricky and >> error prone. The non-orthogonality I suggested as a solution upthread is, by >> contrast, very small and easy to manage, and not terribly hard to explain - >> see attached. > I think this all shows pretty clearly that it was a mistake allowing > json data in the database that we cannot entirely display with the > database's encoding. All the proposed ugly workarounds are only > necessary because we don't throw an error when originally validating the > json. > Even in an utf-8 database you can get errors due to \u unescaping (at > attribute access time, *NOT* at json_in() time) due to invalidate > surrogate pairs. > > I think this goes countrary to normal postgres approach of validating > data as strict as necessary. And I think we are going to regret not > fixing this while there are still relatively few users out there.

Exactly -

* allow in only valid JSON.
* Validate all utf8 strings for valid unicode.
* have one canonic way of outputting unicode - utf8 for utf8 databases, escaped for all other encodings
* If you need to store anything else, use text.

Requiring preserving "original text" in json data field is Not Good!

I fully expect '{"a":1, "a":none, "a":true, "a":"b"}'::json to come out as '{"a":b"}'

(I know that currently this is not true and will happen only once I read in the json value in client)

For anything else - don't use json, use any text type

If you really need a simple text-validated-as-valid-json-input datatype then add this.

Call it jsontext or somesuch :)

-- Hannu Krosing PostgreSQL Consultant Performance, Scalability and High Availability 2ndQuadrant Nordic OÜ
On 06/11/2013 06:53 AM, Hannu Krosing wrote: > On 06/11/2013 10:47 AM, Andres Freund wrote: >> On 2013-06-10 13:01:29 -0400, Andrew Dunstan wrote: >>>> It's legal, is it not, to just write the equivalent Unicode character in >>>> the JSON string and not use the escapes? If so I would think that that >>>> would be the most common usage. If someone's writing an escape, they >>>> probably had a reason for doing it that way, and might not appreciate >>>> our overriding their decision. >>> We never store the converted values in the JSON object, nor do we return >>> them from functions that return JSON. But many of the functions and >>> operators that process the JSON have variants that return text instead of >>> JSON, and in those cases, when the value returned is a JSON string, we do >>> the following to it: >>> >>> I have just realized that the problem is actually quite a lot bigger than >>> that. We also use this value for field name comparison. So, let us suppose >>> that we have a LATIN1 database and a piece of JSON with a field name >>> containing the Euro sign ("\u20ac"), a character that is not in LATIN1. >>> Making that processable so it doesn't blow up would be mighty tricky and >>> error prone. The non-orthogonality I suggested as a solution upthread is, by >>> contrast, very small and easy to manage, and not terribly hard to explain - >>> see attached. >> I think this all shows pretty clearly that it was a mistake allowing >> json data in the database that we cannot entirely display with the >> database's encoding. All the proposed ugly workarounds are only >> necessary because we don't throw an error when originally validating the >> json. >> Even in an utf-8 database you can get errors due to \u unescaping (at >> attribute access time, *NOT* at json_in() time) due to invalidate >> surrogate pairs. >> >> I think this goes countrary to normal postgres approach of validating >> data as strict as necessary. And I think we are going to regret not >> fixing this while there are still relatively few users out there. > Exactly - > > * allow in only valid JSON. > * Validate all utf8 strings for valid unicode. > * have one canonic way of outputting unicode - utf8 for utf8 databases, > escaped for all other encodings > * If you need to store anything else, use text. > > Requiring preserving "original text" in json data field is Not Good! > > I fully expect '{"a":1, "a":none, "a":true, "a":"b"}'::json to come out > as '{"a":b"}' > > (I know that currently this is noty true and will happen only once I > read in the json value in client) > > For anything else - don't use json, use any text type > > If you really need a simple text-validated-as-valid-json-input datatype > then add this. > > Call it jsontext or somesuch :) > Most of this is irrelevant to solving the current problem. We don't have a greenfields development of JSON - we have an existing type with which we have to work. Note too that your "utf8 for utf8 databases, escaped for all other encodings" is exactly the opposite of what Noah and Andres and Tom are arguing, namely that the database encoding should not affect the behaviour, and is very close to what I proposed yesterday. It's a pity that we don't have a non-error producing conversion function (or if we do that I haven't found it). Then we might adopt a rule for processing unicode escapes that said "convert unicode escapes to the database encoding if possible, and if not then emit them unchanged." which might be a reasonable compromise. cheers andrew
On 2013-06-11 12:53 CEST, Hannu Krosing wrote: > On 06/11/2013 10:47 AM, Andres Freund wrote: >> On 2013-06-10 13:01:29 -0400, Andrew Dunstan wrote: >>>> It's legal, is it not, to just write the equivalent Unicode character in >>>> the JSON string and not use the escapes? If so I would think that that >>>> would be the most common usage. If someone's writing an escape, they >>>> probably had a reason for doing it that way, and might not appreciate >>>> our overriding their decision. >>> We never store the converted values in the JSON object, nor do we return >>> them from functions that return JSON. But many of the functions and >>> operators that process the JSON have variants that return text instead of >>> JSON, and in those cases, when the value returned is a JSON string, we do >>> the following to it: >>> >>> I have just realized that the problem is actually quite a lot bigger than >>> that. We also use this value for field name comparison. So, let us suppose >>> that we have a LATIN1 database and a piece of JSON with a field name >>> containing the Euro sign ("\u20ac"), a character that is not in LATIN1. >>> Making that processable so it doesn't blow up would be mighty tricky and >>> error prone. The non-orthogonality I suggested as a solution upthread is, by >>> contrast, very small and easy to manage, and not terribly hard to explain - >>> see attached. >> I think this all shows pretty clearly that it was a mistake allowing >> json data in the database that we cannot entirely display with the >> database's encoding. All the proposed ugly workarounds are only >> necessary because we don't throw an error when originally validating the >> json. >> Even in an utf-8 database you can get errors due to \u unescaping (at >> attribute access time, *NOT* at json_in() time) due to invalidate >> surrogate pairs. >> >> I think this goes countrary to normal postgres approach of validating >> data as strict as necessary. And I think we are going to regret not >> fixing this while there are still relatively few users out there. > Exactly - > > * allow in only valid JSON. > * Validate all utf8 strings for valid unicode. > * have one canonic way of outputting unicode - utf8 for utf8 databases, > escaped for all other encodings > * If you need to store anything else, use text. > > Requiring preserving "original text" in json data field is Not Good! > > I fully expect '{"a":1, "a":none, "a":true, "a":"b"}'::json to come out > as '{"a":b"}'

ahem, do you mean instead to give (none -> null and missing '"' inserted in "answer"):

=# SELECT '{"a":1, "a":null, "a":true, "a":"b"}'::json;
                 json
--------------------------------------
 {"a":"b"}

or only when "stored" in database and subsequently retrieved? The "original text" in this case was perfectly valid JSON text.

> (I know that currently this is noty true and will happen only once I > read in the json value in client)

Isn't this a good situation and doesn't this also depend on the storage representation in the client?

What about this:

=# SELECT '{"measure":"seconds", "measure":42}'::json;
                 json
--------------------------------------
 {"measure":42}

I presume people being used to store metadata in "preceding" json object members with duplicate names, would want to decide in the client requesting the data what to do with the metadata information and at what point to "drop", wouldn't they :-?)

> For anything else - don't use json, use any text type > > If you really need a simple text-validated-as-valid-json-input datatype > then add this. > > Call it jsontext or somesuch :)

All the best,
Stefan.
On 06/11/2013 02:41 PM, Andrew Dunstan wrote: > > On 06/11/2013 06:53 AM, Hannu Krosing wrote: >> On 06/11/2013 10:47 AM, Andres Freund wrote: >>> On 2013-06-10 13:01:29 -0400, Andrew Dunstan wrote: >>>>> It's legal, is it not, to just write the equivalent Unicode >>>>> character in >>>>> the JSON string and not use the escapes? If so I would think that >>>>> that >>>>> would be the most common usage. If someone's writing an escape, they >>>>> probably had a reason for doing it that way, and might not appreciate >>>>> our overriding their decision. >>>> We never store the converted values in the JSON object, nor do we >>>> return >>>> them from functions that return JSON. But many of the functions and >>>> operators that process the JSON have variants that return text >>>> instead of >>>> JSON, and in those cases, when the value returned is a JSON string, >>>> we do >>>> the following to it: >>>> >>>> I have just realized that the problem is actually quite a lot >>>> bigger than >>>> that. We also use this value for field name comparison. So, let us >>>> suppose >>>> that we have a LATIN1 database and a piece of JSON with a field name >>>> containing the Euro sign ("\u20ac"), a character that is not in >>>> LATIN1. >>>> Making that processable so it doesn't blow up would be mighty >>>> tricky and >>>> error prone. The non-orthogonality I suggested as a solution >>>> upthread is, by >>>> contrast, very small and easy to manage, and not terribly hard to >>>> explain - >>>> see attached. >>> I think this all shows pretty clearly that it was a mistake allowing >>> json data in the database that we cannot entirely display with the >>> database's encoding. All the proposed ugly workarounds are only >>> necessary because we don't throw an error when originally validating >>> the >>> json. >>> Even in an utf-8 database you can get errors due to \u unescaping (at >>> attribute access time, *NOT* at json_in() time) due to invalidate >>> surrogate pairs. >>> >>> I think this goes countrary to normal postgres approach of validating >>> data as strict as necessary. And I think we are going to regret not >>> fixing this while there are still relatively few users out there. >> Exactly - >> >> * allow in only valid JSON. >> * Validate all utf8 strings for valid unicode. >> * have one canonic way of outputting unicode - utf8 for utf8 >> databases, >> escaped for all other encodings >> * If you need to store anything else, use text. >> >> Requiring preserving "original text" in json data field is Not Good! >> >> I fully expect '{"a":1, "a":none, "a":true, "a":"b"}'::json to come out >> as '{"a":b"}' >> >> (I know that currently this is noty true and will happen only once I >> read in the json value in client) >> >> For anything else - don't use json, use any text type >> >> If you really need a simple text-validated-as-valid-json-input datatype >> then add this. >> >> Call it jsontext or somesuch :) >> > > > Most of this is irrelevant to solving the current problem. We don't > have a > greenfields development of JSON - we have an existing type with which > we have to work. We do have current (postgresql v9.2) "json" data type, which is loosely defined as "a string which can be fed to a JSON deserializer without errors". But then we also have a standard. Sadly the standard is really not about a type, but about serialising a subset of JavaScript structures to a utf8 string. 
This kind of implies an underlying type which is a structure consisting of a few basic types and lists and dictionaries/classes based on these but does not really specify any type. > Note too that your "utf8 for utf8 databases, escaped for all other > encodings" > is exactly the opposite of what Noah and Andres and Tom are arguing, > namely that the database encoding should not affect the behaviour, It does not affect behaviour. It makes it possible to encode valid JSON (utf8) in a client encoding which may not have the full set of unicode characters. We can do it here, as we know that by definition JSON _is_ unicode, so clients have to be able to process any unicode character. We cannot do this for plain text, which has no defined charset outside what client_encoding postulates. > and is very close to what I proposed yesterday. > > It's a pity that we don't have a non-error producing conversion function > (or if we do that I haven't found it). Then we might adopt a rule for > processing > unicode escapes that said "convert unicode escapes to the database > encoding only when extracting JSON keys or values to text does it make sense to unescape to the database encoding. strings inside JSON itself are by definition utf8 > if possible, and if not then emit them unchanged." which might be a > reasonable > compromise. I'd opt for "... and if not then emit them quoted". The default should be not losing any data. -- Hannu Krosing PostgreSQL Consultant Performance, Scalability and High Availability 2ndQuadrant Nordic OÜ
On 06/11/2013 03:08 PM, Stefan Drees wrote: >> Requiring preserving "original text" in json data field is Not Good! >> >> I fully expect '{"a":1, "a":none, "a":true, "a":"b"}'::json to come out >> as '{"a":b"}' > > ahem, do you mean instead to give (none -> null and missing '"' > inserted in "answer"): yeah, mixed up none and null > =# SELECT '{"a":1, "a":null, "a":true, "a":"b"}'::json; > json > -------------------------------------- > {"a":"b"} > > or only when "stored" in database and subsequently retrieved? The > "original text" in this case was perfectly valid JSON text. > >> (I know that currently this is noty true and will happen only once I >> read in the json value in client) > > Isn't this a good situation and doesn't this also depend on the > storage representation in the client? > > What about this: > =# SELECT '{"measure":"seconds", "measure":42}'::json; > json > -------------------------------------- > {"measure":42} > > I presume people being used to store metadata in "preceding" json > object members with duplicate names, would want to decide in the > client requesting the data what to do with the metadata information > and at what point to "drop", wouldn't they :-?) Seems like blatant misuse of JSON format :) I assume that as JSON is a _serialisation_ format, it should represent a data structure, not processing instructions. I can see no possible JavaScript structure which could produce a duplicate key when serialised. And I don't think that any standard JSON reader supports this either. If you want to store any JavaScript snippets in the database, use text. Or perhaps pl/v8 :) -- Hannu Krosing PostgreSQL Consultant Performance, Scalability and High Availability 2ndQuadrant Nordic OÜ
On 06/11/2013 09:16 AM, Hannu Krosing wrote: >> >> It's a pity that we don't have a non-error producing conversion function >> (or if we do that I haven't found it). Then we might adopt a rule for >> processing >> unicode escapes that said "convert unicode escapes to the database >> encoding > only when extracting JSON keys or values to text makes it sense to unescape > to database encoding. That's exactly the scenario we are talking about. When emitting JSON the functions have always emitted unicode escapes as they are in the text, and will continue to do so. > > strings inside JSON itself are by definition utf8 We have deliberately extended that to allow JSON strings to be in any database server encoding. That was argued back in the 9.2 timeframe and I am not interested in re-litigating it. The only issue at hand is how to handle unicode escapes (which in their string form are pure ASCII) when emitting text strings. >> if possible, and if not then emit them unchanged." which might be a >> reasonable >> compromise. > I'd opt for "... and if not then emit them quoted". The default should > be not loosing > any data. > > > I don't know what this means at all. Quoted how? Let's say I have a Latin1 database and have the following JSON string: "\u20AC2.00". In a UTF8 database the text representation of this is €2.00 - what are you saying it should be in the Latin1 database? cheers andrew
On 06/11/2013 09:23 AM, Hannu Krosing wrote: > > I can see no possible JavaScript structure which could produce duplicate > key when serialised. > > And I don't think that any standard JSON reader supports this either. You are quite wrong. This was discussed quite recently on -hackers, too. V8 will happily accept a JSON string with duplicate keys via eval() and resolve it in favor of the lexically latest value. I gather most other JSON processors do likewise. Can we PLEASE PLEASE stop sending this discussion off track and concentrate on the actual problem we have at hand? It's BETA and there is not much time. I get that you don't like how we have implemented JSON. But we're not going back over that ground now. It's done and in use and LOTS of people are finding it very useful. cheers andrew
On 06/11/2013 03:42 PM, Andrew Dunstan wrote: > > On 06/11/2013 09:16 AM, Hannu Krosing wrote: > > >>> >>> It's a pity that we don't have a non-error producing conversion >>> function >>> (or if we do that I haven't found it). Then we might adopt a rule for >>> processing >>> unicode escapes that said "convert unicode escapes to the database >>> encoding >> only when extracting JSON keys or values to text makes it sense to >> unescape >> to database encoding. > > That's exactly the scenario we are talking about. When emitting JSON > the functions have always emitted unicode escapes as they are in the > text, and will continue to do so. > >> >> strings inside JSON itself are by definition utf8 > > > We have deliberately extended that to allow JSON strings to be in any > database server encoding. Ugh! Does that imply that we not just "allow" it, but rather "require" it? Why are we arguing about "unicode surrogate pairs" as a "JSON thing" then? Should it not be a "client to server encoding conversion thing" instead? > That was argued back in the 9.2 timeframe and I am not interested in > re-litigating it. > > The only issue at hand is how to handle unicode escapes (which in > their string form are pure ASCII) when emitting text strings. Unicode escapes in non-unicode strings seem like something that is ill-defined by nature ;) That is, you can't come up with a good general answer for this. >>> if possible, and if not then emit them unchanged." which might be a >>> reasonable >>> compromise. >> I'd opt for "... and if not then emit them quoted". The default should >> be not loosing >> any data. >> >> >> > > > I don't know what this means at all. Quoted how? Let's say I have a > Latin1 database and have the following JSON string: "\u20AC2.00". In a > UTF8 database the text representation of this is €2.00 - what are you > saying it should be in the Latin1 database? utf8-quote the '€' - "\u20AC2.00". That is, convert unicode-->Latin1 where there is a correspondence, and utf8-quote anything that does not have one. If we allow unicode escapes in non-unicode strings anyway, then this seems the most logical thing to do. > > cheers > > andrew > > -- Hannu Krosing PostgreSQL Consultant Performance, Scalability and High Availability 2ndQuadrant Nordic OÜ
On 2013-06-11 15:23 CEST, Hannu Krosing wrote: > On 06/11/2013 03:08 PM, Stefan Drees wrote: >> ... >> >> What about this: >> =# SELECT '{"measure":"seconds", "measure":42}'::json; >> json >> -------------------------------------- >> {"measure":42} >> >> I presume people being used to store metadata in "preceding" json >> object members with duplicate names, would want to decide in the >> client requesting the data what to do with the metadata information >> and at what point to "drop", wouldn't they :-?) > Seems like blatant misuse of JSON format :) > > I assume that as JSON is _serialisation_ format, it should represent a > data structure, not processing instructions. > > I can see no possible JavaScript structure which could produce duplicate > key when serialised. ahem, JSON is a notation that allows an object or an array at the top level. If it is an object, this consists of pairs called (name, value). Here value can be any object, array, number, string or the literals null, false or true. The name must be a string. That's it :-) no key **and** also no ordering on these "name"s ;-) and as the RFC does not care where the data came from or how it was represented before it became "JSON text" (the top-level element of a JSON document) how should the parser know ... but delta notation, commenting, or "streaming" needs have created many applications that deliver multibags and rely on some ordering conventions in their data exchanging relations. > And I don't think that any standard JSON reader supports this either. Oh yes. Convention is merely: Keep all ("Streaming") or the last (whatever the last may mean, must be carefully ensured in the interchange relation). All would like these two scenarios, but the RFC as is does not prevent an early-out (like INSERT OR IGNORE) :-)) > Of you want to store any JavaScript snippets in database use text. JSON is language agnostic. I use more JSON from python, php than from js, but others do so differently ... > Or perhaps pl/v8 :) > Do you mean the "V8 Engine Javascript Procedural Language add-on for PostgreSQL" (http://code.google.com/p/plv8js/), I guess so. I did not want to hijack the thread, as this centered more around escaping where and what in which context (DB vs. client encoding). As the freshly created IETF json working group revamps the JSON RFC on its way to the standards track, there are currently also discussions on what to do with unicode surrogate pairs. See e.g. this thread http://www.ietf.org/mail-archive/web/json/current/msg00675.html starting a summarizing effort. Just in case it helps make the fresh JSON feature of PostgreSQL bright, shining and future proof :-) Stefan.
On 06/11/2013 03:54 PM, Andrew Dunstan wrote: > > On 06/11/2013 09:23 AM, Hannu Krosing wrote: > >> >> I can see no possible JavaScript structure which could produce duplicate >> key when serialised. >> >> And I don't think that any standard JSON reader supports this either. > > You are quite wrong. This was discussed quite recently on -hackers, too. > V8 will happily accept a JSON string with duplicate keys via eval() > and resolve it in favor of the lexically latest value. This is what I mean. It is a valid _input_ value, but no existing JavaScript structure serializes to it. In other words - I want us to have in mind some underlying structure, not the text:

hannu=# select '1e0'::float;
 float8
--------
      1
(1 row)

We are not preserving '1e0' in floats, why should we preserve it in json? We do imply an internal structured format in several functions operating on json even though we store it in text. > I gather most other JSON processors do likewise. > > Can we PLEASE PLEASE stop sending this discussion off track and > concentrate on the actual problem we have at hand? It's BETA and there > is not much time. > I get that you don't like how we have implemented JSON. The current implementation is a reasonably good compromise, so I can't say I don't like it :) I am here going from the premise that at some point we might implement a json-like binary structured type. If it is left separate from json, I am ok with all kinds of quirks coming from the current vague definition of what a "json type" is. OTOH, if the idea is to move "json" storage format to this binary structured type, we should resolve possible incompatibilities as early as possible and start thinking of the current "as text" storage in terms of something that is a real structured type, just stored in text format. This could also mean converting to canonical format on input. > But we're not going back over that ground now. It's done and in use > and LOTS of people are finding it very useful. I am one of these people who finds it very useful. I just want to avoid painting us into a corner too early. I would like us to have *one* json type in the future, not separate "json input string" and "json compatible binary structure" or somesuch > > cheers > > andrew > > -- Hannu Krosing PostgreSQL Consultant Performance, Scalability and High Availability 2ndQuadrant Nordic OÜ
On 06/11/2013 04:04 PM, Stefan Drees wrote: > On 2013-06-11 15:23 CEST, Hannu Krosing wrote: >> On 06/11/2013 03:08 PM, Stefan Drees wrote: >>> ... >>> >>> What about this: >>> =# SELECT '{"measure":"seconds", "measure":42}'::json; >>> json >>> -------------------------------------- >>> {"measure":42} >>> >>> I presume people being used to store metadata in "preceding" json >>> object members with duplicate names, would want to decide in the >>> client requesting the data what to do with the metadata information >>> and at what point to "drop", wouldn't they :-?) >> Seems like blatant misuse of JSON format :) >> >> I assume that as JSON is _serialisation_ format, it should represent a >> data structure, not processing instructions. >> >> I can see no possible JavaScript structure which could produce duplicate >> key when serialised. > > ahem, JSON is a notation that allows toplevel an object or an array. > If it is an object, this consists of pairs called (name, value). > Here value can be any object, array, number, string or the literals > null, false or true. > The name must be a string. That's it :-) no key **and** also no > ordering on these "name"s ;-) and as the RFC does not care, where the > data came from or how it was represented before it became "JSON text" > (the top-level element of a JSON document) how should the parser know > ... but delta notaion, commenting, or "streaming" needs created many > applications that deliver multibags and trust on some ordering > conventions in their dataexchanging relations. > >> And I don't think that any standard JSON reader supports this either. > > Oh yes. Convention is merely: Keep all ("Streaming") or the last > (whatever the last may mean, must be carefully ensured in the > interchange relation). > All would like these two scenarios, but the RFC as is does not prevent > an early-out (like INSERT OR IGNORE) :-)) I was kind of assuming that JSON is a (JavaScript) Object Serialization Notation, that is, there is a unique implied "JavaScript Object" which can be "Serialized" int any given JSON string. IOW, that if you serialise an object then this is what JSON should be. The fact that most JSON to Object readers support multiple keys is just an implementation artifact and not something that is required by RFC. > >> Of you want to store any JavaScript snippets in database use text. > > JSON is language agnostic. I use more JSON from python, php than from > js, but others do so differently ... Agreed. Even the fact that you can define any operations on a "JSON" string - like extracting a value for key - is actually non-standard :) Perhaps I should stop thinking of json type as something that implies any underlying structure ... > >> Or perhaps pl/v8 :) >> > > Do you mean the "V8 Engine Javascript Procedural Language add-on for > PostgreSQL" (http://code.google.com/p/plv8js/), I guess so. > > I did not want to hijack the thread, as this centered more around > escaping where and what in which context (DB vs. client encoding). > > As the freshly created IETF json working group revamps the JSON RFC on > its way to the standards track, there are currently also discussions > on what to do with unicode surrogate pairs. See eg. this thread > http://www.ietf.org/mail-archive/web/json/current/msg00675.html > starting a summarizing effort. Wow. The rabbit hole is much deeper than I thought :) -- Hannu Krosing PostgreSQL Consultant Performance, Scalability and High Availability 2ndQuadrant Nordic OÜ
On 06/10/2013 11:22 PM, Noah Misch wrote: > On Mon, Jun 10, 2013 at 11:20:13AM -0400, Andrew Dunstan wrote: >> On 06/10/2013 10:18 AM, Tom Lane wrote: >>> Andrew Dunstan <andrew@dunslane.net> writes: >>>> After thinking about this some more I have come to the conclusion that >>>> we should only do any de-escaping of \uxxxx sequences, whether or not >>>> they are for BMP characters, when the server encoding is utf8. For any >>>> other encoding, which is already a violation of the JSON standard >>>> anyway, and should be avoided if you're dealing with JSON, we should >>>> just pass them through even in text output. This will be a simple and >>>> very localized fix. >>> Hmm. I'm not sure that users will like this definition --- it will seem >>> pretty arbitrary to them that conversion of \u sequences happens in some >>> databases and not others. > Yep. Suppose you have a LATIN1 database. Changing it to a UTF8 database > where everyone uses client_encoding = LATIN1 should not change the semantics > of successful SQL statements. Some statements that fail with one database > encoding will succeed in the other, but a user should not witness a changed > non-error result. (Except functions like decode() that explicitly expose byte > representations.) Having "SELECT '["\u00e4"]'::json ->> 0" emit 'ä' in the > UTF8 database and '\u00e4' in the LATIN1 database would move PostgreSQL in the > wrong direction relative to that ideal. > >> Then what should we do when there is no matching codepoint in the >> database encoding? First we'll have to delay the evaluation so it's not >> done over-eagerly, and then we'll have to try the conversion and throw >> an error if it doesn't work. The second part is what's happening now, >> but the delayed evaluation is not. > +1 for doing it that way. > As a final counter example, let me note that Postgres itself handles Unicode escapes differently in UTF8 databases - in other databases it only accepts Unicode escapes up to U+007f, i.e. ASCII characters. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > As a final counter example, let me note that Postgres itself handles > Unicode escapes differently in UTF8 databases - in other databases it > only accepts Unicode escapes up to U+007f, i.e. ASCII characters. Good point. What if we adopt that same definition for JSON, and get rid of the need to do explicit encoding conversion at all in the JSON code? regards, tom lane
On Tue, Jun 11, 2013 at 02:10:45PM -0400, Andrew Dunstan wrote: > > On 06/10/2013 11:22 PM, Noah Misch wrote: >> On Mon, Jun 10, 2013 at 11:20:13AM -0400, Andrew Dunstan wrote: >>> On 06/10/2013 10:18 AM, Tom Lane wrote: >>>> Andrew Dunstan <andrew@dunslane.net> writes: >>>>> After thinking about this some more I have come to the conclusion that >>>>> we should only do any de-escaping of \uxxxx sequences, whether or not >>>>> they are for BMP characters, when the server encoding is utf8. For any >>>>> other encoding, which is already a violation of the JSON standard >>>>> anyway, and should be avoided if you're dealing with JSON, we should >>>>> just pass them through even in text output. This will be a simple and >>>>> very localized fix. >>>> Hmm. I'm not sure that users will like this definition --- it will seem >>>> pretty arbitrary to them that conversion of \u sequences happens in some >>>> databases and not others. >> Yep. Suppose you have a LATIN1 database. Changing it to a UTF8 database >> where everyone uses client_encoding = LATIN1 should not change the semantics >> of successful SQL statements. Some statements that fail with one database >> encoding will succeed in the other, but a user should not witness a changed >> non-error result. (Except functions like decode() that explicitly expose byte >> representations.) Having "SELECT '["\u00e4"]'::json ->> 0" emit 'ä' in the >> UTF8 database and '\u00e4' in the LATIN1 database would move PostgreSQL in the >> wrong direction relative to that ideal. > As a final counter example, let me note that Postgres itself handles > Unicode escapes differently in UTF8 databases - in other databases it > only accepts Unicode escapes up to U+007f, i.e. ASCII characters. I don't see a counterexample there; every database that accepts without error a given Unicode escape produces from it the same text value. The proposal to which I objected was akin to having non-UTF8 databases silently translate E'\u0220' to E'\\u0220'. -- Noah Misch EnterpriseDB http://www.enterprisedb.com
On 06/11/2013 06:26 PM, Noah Misch wrote: > >> As a final counter example, let me note that Postgres itself handles >> Unicode escapes differently in UTF8 databases - in other databases it >> only accepts Unicode escapes up to U+007f, i.e. ASCII characters. > I don't see a counterexample there; every database that accepts without error > a given Unicode escape produces from it the same text value. The proposal to > which I objected was akin to having non-UTF8 databases silently translate > E'\u0220' to E'\\u0220'. What? There will be no silent translation. The only debate here is about how these databases turn strings values inside a json datum into PostgreSQL text values via the documented operation of certain functions and operators. If the JSON datum doesn't already contain a unicode escape then nothing of what's been discussed would apply. Nothing whatever that's been proposed would cause a unicode escape sequence to be emitted that wasn't already there in the first place, and no patch that I have submitted has contained any escape sequence generation at all. cheers andrew
On Tue, Jun 11, 2013 at 06:58:05PM -0400, Andrew Dunstan wrote: > > On 06/11/2013 06:26 PM, Noah Misch wrote: >> >>> As a final counter example, let me note that Postgres itself handles >>> Unicode escapes differently in UTF8 databases - in other databases it >>> only accepts Unicode escapes up to U+007f, i.e. ASCII characters. >> I don't see a counterexample there; every database that accepts without error >> a given Unicode escape produces from it the same text value. The proposal to >> which I objected was akin to having non-UTF8 databases silently translate >> E'\u0220' to E'\\u0220'. > > What? > > There will be no silent translation. The only debate here is about how > these databases turn strings values inside a json datum into PostgreSQL > text values via the documented operation of certain functions and > operators. If the JSON datum doesn't already contain a unicode escape > then nothing of what's been discussed would apply. Nothing whatever > that's been proposed would cause a unicode escape sequence to be emitted > that wasn't already there in the first place, and no patch that I have > submitted has contained any escape sequence generation at all. Under your proposal to which I was referring, this statement would return true in UTF8 databases and false in databases of other encodings: SELECT '["\u0220"]'::json ->> 0 = E'\u0220' Contrast the next statement, which would return false in UTF8 databases and true in databases of other encodings: SELECT '["\u0220"]'::json ->> 0 = E'\\u0220' Defining ->>(json,int) and ->>(json,text) in this way would be *akin to* having "SELECT E'\u0220' = E'\\u0220'" return true in non-UTF8 databases. I refer to user-visible semantics, not matters of implementation. Does that help to clarify my earlier statement? -- Noah Misch EnterpriseDB http://www.enterprisedb.com
On 06/11/2013 08:18 PM, Noah Misch wrote: > On Tue, Jun 11, 2013 at 06:58:05PM -0400, Andrew Dunstan wrote: >> On 06/11/2013 06:26 PM, Noah Misch wrote: >>>> As a final counter example, let me note that Postgres itself handles >>>> Unicode escapes differently in UTF8 databases - in other databases it >>>> only accepts Unicode escapes up to U+007f, i.e. ASCII characters. >>> I don't see a counterexample there; every database that accepts without error >>> a given Unicode escape produces from it the same text value. The proposal to >>> which I objected was akin to having non-UTF8 databases silently translate >>> E'\u0220' to E'\\u0220'. >> What? >> >> There will be no silent translation. The only debate here is about how >> these databases turn strings values inside a json datum into PostgreSQL >> text values via the documented operation of certain functions and >> operators. If the JSON datum doesn't already contain a unicode escape >> then nothing of what's been discussed would apply. Nothing whatever >> that's been proposed would cause a unicode escape sequence to be emitted >> that wasn't already there in the first place, and no patch that I have >> submitted has contained any escape sequence generation at all. > Under your proposal to which I was referring, this statement would return true > in UTF8 databases and false in databases of other encodings: > > SELECT '["\u0220"]'::json ->> 0 = E'\u0220' > > Contrast the next statement, which would return false in UTF8 databases and > true in databases of other encodings: > > SELECT '["\u0220"]'::json ->> 0 = E'\\u0220' > > Defining ->>(json,int) and ->>(json,text) in this way would be *akin to* > having "SELECT E'\u0220' = E'\\u0220'" return true in non-UTF8 databases. I > refer to user-visible semantics, not matters of implementation. Does that > help to clarify my earlier statement? Well, I think that's drawing a bit of a long bow, but never mind. If we work by analogy to Postgres' own handling of Unicode escapes, we'll raise an error on any Unicode escape beyond ASCII (not on input for legacy reasons, but on trying to process such datums). I gather that would meet your objection. cheers andrew
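To make the analogy concrete, the rule would accept any \uXXXX escape when the server encoding is UTF-8 but reject escapes above U+007F for every other server encoding, mirroring what is already done for string-literal Unicode escapes. A rough self-contained sketch of that check (illustrative only, not the eventual patch):

    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    /*
     * Sketch of the proposed check, not the actual server code: non-ASCII
     * \uXXXX escapes are only usable when the server encoding is UTF-8.
     */
    static void
    check_json_unicode_escape(unsigned int codepoint, bool server_is_utf8)
    {
        if (codepoint > 0x7F && !server_is_utf8)
        {
            fprintf(stderr,
                    "unsupported Unicode escape: code points above 007F "
                    "require a UTF8 server encoding\n");
            exit(1);
        }
    }

    int
    main(void)
    {
        check_json_unicode_escape(0x0041, false);   /* 'A': fine anywhere */
        check_json_unicode_escape(0x20AC, true);    /* euro sign: fine in UTF-8 */
        check_json_unicode_escape(0x20AC, false);   /* rejected otherwise */
        return 0;
    }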
On Tue, Jun 11, 2013 at 08:42:26PM -0400, Andrew Dunstan wrote: > If we work by analogy to Postgres' own handling of Unicode escapes, > we'll raise an error on any Unicode escape beyond ASCII (not on input > for legacy reasons, but on trying to process such datums). I gather that > would meet your objection. No objection here; that approach is free from the problem that caught my attention in the other proposal. -- Noah Misch EnterpriseDB http://www.enterprisedb.com
On 06/12/2013 08:42 AM, Andrew Dunstan wrote: > > If we work by analogy to Postgres' own handling of Unicode escapes, > we'll raise an error on any Unicode escape beyond ASCII (not on input > for legacy reasons, but on trying to process such datums). I gather that > would meet your objection. I could live with that if eager validation on input was the default, but could be disabled by setting (say) compat_lazy_json_validation = on. I don't like the idea of leaving us saddled with weak validation just because that's what we've got. It's been an ongoing source of pain as UTF-8 support has improved and I'd really like a way to avoid semi-valid JSON making it into the DB and causing similar problems. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 06/12/2013 12:13 AM, Craig Ringer wrote: > On 06/12/2013 08:42 AM, Andrew Dunstan wrote: >> If we work by analogy to Postgres' own handling of Unicode escapes, >> we'll raise an error on any Unicode escape beyond ASCII (not on input >> for legacy reasons, but on trying to process such datums). I gather that >> would meet your objection. > I could live with that if eager validation on input was the default, but > could be disabled by setting (say) compat_lazy_json_validation = on . > I don't like the idea of leaving us saddled with weak validation just > that's what we've got. It's been an ongoing source of pain as UTF-8 > support has improved and I'd really like a way to avoid semi-valid JSON > making it into the DB and causing similar problems. > I think it's rather too late in the cycle to be proposing new GUCs. We can revisit this for 9.4 perhaps. cheers andrew