Re: Q: Escapes in jsonpath Idents
От | David E. Wheeler |
---|---|
Тема | Re: Q: Escapes in jsonpath Idents |
Дата | |
Msg-id | D1AA2802-98E2-4638-B9B0-512C40AC4B49@justatheory.com обсуждение исходный текст |
Ответ на | Re: Q: Escapes in jsonpath Idents (Erik Wienhold <ewie@ewie.name>) |
Список | pgsql-hackers |
On Mar 17, 2024, at 20:09, Erik Wienhold <ewie@ewie.name> wrote: > > On 2024-03-17 20:50 +0100, David E. Wheeler wrote: >> On Mar 17, 2024, at 15:12, Erik Wienhold <ewie@ewie.name> wrote: >>> So I think it makes sense to reword the entire backslash part of the >>> paragraph and remove references to JSON entirely. The attached patch >>> does that and also formats the backslash escapes as a bulleted list for >>> readability. >> >> Ah, it’s JavaScript format, not JSON! This does clarify things quite >> nicely, thank you. Happy to add my review once it’s in a commit fest. > > Thanks. https://commitfest.postgresql.org/48/4899/ Applies cleanly, `make -C doc/src/sgml check` runs without error. Doc improvement welcome and much clearer than before. > I had the same reasoning while writing my first reply but scrapped that > part because I found it obvious: That jsonpath is parsed before calling > jsonb_path_exists() and therefore the parser has no context about any > variables, which might not even be hardcoded but may result from a > query. Right, there’s a chicken/egg problem. > Unfortunately, I don't have access to that part of the SQL spec. So I > don't know how the jsonpath grammar is specified. Seems quite logical; I think it should be documented, but I’d also be interested to know what the 2016 and 2023 standardssay, exactly. > Also checked git log src/backend/utils/adt/jsonpath_scan.l for some > insights but haven't found any yet. Everybody’s taking shortcuts relative to the standard, AFAICT. For example, jsonpath_scan.l matches unqouted identifierswith these two regular expressions: <xnq>{other}+ <xnq>\/\* <xnq,xq,xvq>\\. Plus the backslash escapes. {other} is defined as: /* "other" means anything that's not special, blank, or '\' or '"' */ other [^\?\%\$\.\[\]\{\}\(\)\|\&\!\=\<\>\@\#\,\*:\-\+\/\\\" \t\n\r\f] Which is waaaay more liberal than the ECMA standard[1], by my reading, but the MSDN[2] description is quite succinct (thanksfor the links!): > In JavaScript, identifiers are commonly made of alphanumeric characters, underscores (_), and dollar signs ($). Identifiersare not allowed to start with numbers. However, JavaScript identifiers are not only limited to ASCII — many Unicodecode points are allowed as well. Namely, any character in the ID_Start category can start an identifier, while anycharacter in the ID_Continue category can appear after the first character. ID_Start[3] and ID_Continue[4] point to the unicode standard codes lister, nether of which reference Emoji. Sure enough,in Safari: > x = {"🎉": true} < {🎉: true} > x.🎉 < SyntaxError: Invalid character '\ud83c’ But in Postgres jsonpath: david=# select '$.🎉'::jsonpath; jsonpath ---------- $."🎉" If the MSDN references to ID_Start and ID_Continue are correct, then the Postgres path parser is being overly-liberal. Maybethat’s totally fine? Not sure what should be documented and what’s not worth it. Aside: I’m only digging into these details because I’m busy porting the path parser, so trying to figure out where to becompatible and where not to. So far I’m rejecting '$' (but allowing '\$' and '\u0024') but taking advantage of the unicodesupport in Go to specifically validate against ID_Start and ID_Continue. Best, David [1] https://262.ecma-international.org/#sec-identifier-names [2] https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Lexical_grammar#identifiers [3] https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5Cp%7BID_Start%7D [4] https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5Cp%7BID_Continue%7D
В списке pgsql-hackers по дате отправления: