Обсуждение: [PoC] Add CANONICAL option to xmlserialize
Hi, In order to compare pairs of XML documents for equivalence it is necessary to convert them first to their canonical form, as described at W3C Canonical XML 1.1.[1] This spec basically defines a standard physical representation of xml documents that have more then one possible representation, so that it is possible to compare them, e.g. forcing UTF-8 encoding, entity reference replacement, attributes normalization, etc. Although it is not part of the XML/SQL standard, it would be nice to have the option CANONICAL in xmlserialize. Additionally, we could also add the attribute WITH [NO] COMMENTS to keep or remove xml comments from the documents. Something like this: WITH t(col) AS ( VALUES ('<?xml version="1.0" encoding="ISO-8859-1"?> <!DOCTYPE doc SYSTEM "doc.dtd" [ <!ENTITY val "42"> <!ATTLIST xyz attr CDATA "default"> ]> <!-- ordering of attributes --> <foo ns:c = "3" ns:b = "2" ns:a = "1" xmlns:ns="http://postgresql.org"> <!-- Normalization of whitespace in start and end tags --> <!-- Elimination of superfluous namespace declarations, as already declared in <foo> --> <bar xmlns:ns="http://postgresql.org" >&val;</bar > <!-- Empty element conversion to start-end tag pair --> <empty/> <!-- Effect of transcoding from a sample encoding to UTF-8 --> <iso8859>©</iso8859> <!-- Addition of default attribute --> <!-- Whitespace inside tag preserved --> <xyz> 321 </xyz> </foo> <!-- comment outside doc -->'::xml) ) SELECT xmlserialize(DOCUMENT col AS text CANONICAL) FROM t; xmlserialize -------------------------------------------------------------------------------------------------------------------------------------------------------- <foo xmlns:ns="http://postgresql.org" ns:a="1" ns:b="2" ns:c="3"><bar>42</bar><empty></empty><iso8859>©</iso8859><xyz attr="default"> 321 </xyz></foo> (1 row) -- using WITH COMMENTS WITH t(col) AS ( VALUES (' <foo ns:c = "3" ns:b = "2" ns:a = "1" xmlns:ns="http://postgresql.org"> <!-- very important comment --> <xyz> 321 </xyz> </foo>'::xml) ) SELECT xmlserialize(DOCUMENT col AS text CANONICAL WITH COMMENTS) FROM t; xmlserialize ------------------------------------------------------------------------------------------------------------------------ <foo xmlns:ns="http://postgresql.org" ns:a="1" ns:b="2" ns:c="3"><!-- very important comment --><xyz> 321 </xyz></foo> (1 row) Another option would be to simply create a new function, e.g. xmlcanonical(doc xml, keep_comments boolean), but I'm not sure if this would be the right approach. Attached a very short draft. What do you think? Best, Jim 1- https://www.w3.org/TR/xml-c14n11/
Вложения
On 27.02.23 14:16, I wrote: > Hi, > > In order to compare pairs of XML documents for equivalence it is > necessary to convert them first to their canonical form, as described > at W3C Canonical XML 1.1.[1] This spec basically defines a standard > physical representation of xml documents that have more then one > possible representation, so that it is possible to compare them, e.g. > forcing UTF-8 encoding, entity reference replacement, attributes > normalization, etc. > > Although it is not part of the XML/SQL standard, it would be nice to > have the option CANONICAL in xmlserialize. Additionally, we could also > add the attribute WITH [NO] COMMENTS to keep or remove xml comments > from the documents. > > Something like this: > > WITH t(col) AS ( > VALUES > ('<?xml version="1.0" encoding="ISO-8859-1"?> > <!DOCTYPE doc SYSTEM "doc.dtd" [ > <!ENTITY val "42"> > <!ATTLIST xyz attr CDATA "default"> > ]> > > <!-- ordering of attributes --> > <foo ns:c = "3" ns:b = "2" ns:a = "1" > xmlns:ns="http://postgresql.org"> > > <!-- Normalization of whitespace in start and end tags --> > <!-- Elimination of superfluous namespace declarations, > as already declared in <foo> --> > <bar xmlns:ns="http://postgresql.org" >&val;</bar > > > <!-- Empty element conversion to start-end tag pair --> > <empty/> > > <!-- Effect of transcoding from a sample encoding to UTF-8 --> > <iso8859>©</iso8859> > > <!-- Addition of default attribute --> > <!-- Whitespace inside tag preserved --> > <xyz> 321 </xyz> > </foo> > <!-- comment outside doc -->'::xml) > ) > SELECT xmlserialize(DOCUMENT col AS text CANONICAL) FROM t; > xmlserialize > -------------------------------------------------------------------------------------------------------------------------------------------------------- > > <foo xmlns:ns="http://postgresql.org" ns:a="1" ns:b="2" > ns:c="3"><bar>42</bar><empty></empty><iso8859>©</iso8859><xyz > attr="default"> 321 </xyz></foo> > (1 row) > > -- using WITH COMMENTS > > WITH t(col) AS ( > VALUES > (' <foo ns:c = "3" ns:b = "2" ns:a = "1" > xmlns:ns="http://postgresql.org"> > <!-- very important comment --> > <xyz> 321 </xyz> > </foo>'::xml) > ) > SELECT xmlserialize(DOCUMENT col AS text CANONICAL WITH COMMENTS) FROM t; > xmlserialize > ------------------------------------------------------------------------------------------------------------------------ > > <foo xmlns:ns="http://postgresql.org" ns:a="1" ns:b="2" ns:c="3"><!-- > very important comment --><xyz> 321 </xyz></foo> > (1 row) > > > Another option would be to simply create a new function, e.g. > xmlcanonical(doc xml, keep_comments boolean), but I'm not sure if this > would be the right approach. > > Attached a very short draft. What do you think? > > Best, Jim > > 1- https://www.w3.org/TR/xml-c14n11/ The attached version includes documentation and tests to the patch. I hope things are clearer now :) Best, Jim
Вложения
On Mon, Mar 6, 2023 at 7:44 AM Jim Jones <jim.jones@uni-muenster.de> wrote: > The attached version includes documentation and tests to the patch. The CI run for that failed in an interesting way, only on Debian + Meson, 32 bit. The diffs appear to show that psql has a different opinion of the column width, while building its header (the "------" you get at the top of psql's output), even though the actual column contents was the same. regression.diff[2] shows that there is a "£1" in the output, which is how UTF-8 "£1" looks if you view it with Latin1 glasses on. Clearly this patch involves transcoding, Latin1 and UTF-8 and I haven't studied it, but it's pretty weird for the 32 bit build to give a different result... could be something to do with our environment, since .cirrus.yml sets LANG=C in the 32 bit test run -- maybe try that locally? That run also generated a core dump, but I think that's just our open SIGQUIT problem[3] and not relevant here. [1] https://cirrus-ci.com/build/6319462375227392 [2] https://api.cirrus-ci.com/v1/artifact/task/5800598633709568/testrun/build-32/testrun/regress/regress/regression.diffs [3] https://www.postgresql.org/message-id/flat/20230214202927.xgb2w6b7gnhq6tvv%40awork3.anarazel.de
On 05.03.23 22:00, Thomas Munro wrote: > The CI run for that failed in an interesting way, only on Debian + > Meson, 32 bit. The diffs appear to show that psql has a different > opinion of the column width, while building its header (the "------" > you get at the top of psql's output), even though the actual column > contents was the same. regression.diff[2] shows that there is a "£1" > in the output, which is how UTF-8 "£1" looks if you view it with > Latin1 glasses on. Clearly this patch involves transcoding, Latin1 > and UTF-8 One of the use cases of this patch is exactly the transcoding of a non utf-8 document to utf-8 - as described in the XML canonical spec. > and I haven't studied it, but it's pretty weird for the 32 > bit build to give a different result... Yeah, it's pretty weird indeed. I'll try to reproduce this environment in a container to see if I get the same diff. Although I'm not sure that by "fixing" the result set for this environment it won't break all the others. > could be something to do with > our environment, since .cirrus.yml sets LANG=C in the 32 bit test run > -- maybe try that locally? Also using LANGUAGE=C the result is the same for me - all tests pass just fine. > That run also generated a core dump, but I think that's just our open > SIGQUIT problem[3] and not relevant here. > > [1] https://cirrus-ci.com/build/6319462375227392 > [2] https://api.cirrus-ci.com/v1/artifact/task/5800598633709568/testrun/build-32/testrun/regress/regress/regression.diffs > [3] https://www.postgresql.org/message-id/flat/20230214202927.xgb2w6b7gnhq6tvv%40awork3.anarazel.de Thanks for the quick reply. Much appreciated!
On Mon, Mar 6, 2023 at 11:20 AM Jim Jones <jim.jones@uni-muenster.de> wrote: > On 05.03.23 22:00, Thomas Munro wrote: > > could be something to do with > > our environment, since .cirrus.yml sets LANG=C in the 32 bit test run > > -- maybe try that locally? > Also using LANGUAGE=C the result is the same for me - all tests pass > just fine. I couldn't reproduce that locally either, but I just tested on CI with your patch applied saw the failure, and then removed "PYTHONCOERCECLOCALE=0 LANG=C" and it's all green: https://github.com/macdice/postgres/commit/91999f5d13ac2df6f7237a301ed6cf73f2bb5b6d Without looking too closely, my first guess would have been that this just isn't going to work without UTF-8 database encoding, so you might need to skip the test (see for example src/test/regress/expected/unicode_1.out). It's annoying that "xml" already has 3 expected variants... hmm. BTW shouldn't it be failing in a more explicit way somewhere sooner if the database encoding is not UTF-8, rather than getting confused?
On 06.03.23 00:32, Thomas Munro wrote: > I couldn't reproduce that locally either, but I just tested on CI with > your patch applied saw the failure, and then removed > "PYTHONCOERCECLOCALE=0 LANG=C" and it's all green: > > https://github.com/macdice/postgres/commit/91999f5d13ac2df6f7237a301ed6cf73f2bb5b6d > > Without looking too closely, my first guess would have been that this > just isn't going to work without UTF-8 database encoding, so you might > need to skip the test (see for example > src/test/regress/expected/unicode_1.out). It's annoying that "xml" > already has 3 expected variants... hmm. BTW shouldn't it be failing > in a more explicit way somewhere sooner if the database encoding is > not UTF-8, rather than getting confused? I guess this confusion is happening because xml_parse() was being called with the database encoding from GetDatabaseEncoding(). I added a condition before calling xml_parse() to check if the xml document has a different encoding than UTF-8 parse_xml_decl(xml_text2xmlChar(data), NULL, NULL, &encodingStr, NULL); encoding = encodingStr ? xmlChar_to_encoding(encodingStr) : PG_UTF8; doc = xml_parse(data, XMLOPTION_DOCUMENT, false, encoding, NULL); v2 attached. Thanks! Best, Jim
Вложения
On 06.03.23 11:50, I wrote: > I guess this confusion is happening because xml_parse() was being > called with the database encoding from GetDatabaseEncoding(). > > I added a condition before calling xml_parse() to check if the xml > document has a different encoding than UTF-8 > > parse_xml_decl(xml_text2xmlChar(data), NULL, NULL, &encodingStr, NULL); > encoding = encodingStr ? xmlChar_to_encoding(encodingStr) : PG_UTF8; > > doc = xml_parse(data, XMLOPTION_DOCUMENT, false, encoding, NULL); It seems that this bug fix didn't change the output of the CI on Debian + Meson, 32bit. I slightly changed the test case to a character that both encodings can deal with. v3 attached.
Вложения
v4 attached fixes an encoding issue at the xml_parse call. It now uses GetDatabaseEncoding(). Best, Jim
Вложения
v5 attached is a rebase over the latest changes in xmlserialize (INDENT output).
Вложения
After some more testing I realized that v5 was leaking the xmlDocPtr. Now fixed in v6.
Вложения
The cfbot started complaining about this patch on "macOS - Ventura - Meson"
'Persistent worker failed to start the task: tart isolation failed: failed to create VM cloned from "ghcr.io/cirruslabs/macos-ventura-base:latest": tart command returned non-zero exit code: ""'
Is this a problem in my code or in the CI itself?
Thanks!
Jim
On Thu, Sep 14, 2023 at 11:54 PM Jim Jones <jim.jones@uni-muenster.de> wrote: > The cfbot started complaining about this patch on "macOS - Ventura - Meson" > > 'Persistent worker failed to start the task: tart isolation failed: failed to create VM cloned from "ghcr.io/cirruslabs/macos-ventura-base:latest":tart command returned non-zero exit code: ""' > > Is this a problem in my code or in the CI itself? There was a temporary glitch on one of the new Mac CI runner machines that caused a few tests to fail like that, but it was fixed so that should turn red again later today.
On Fri, 17 Mar 2023 at 18:01, Jim Jones <jim.jones@uni-muenster.de> wrote: > > After some more testing I realized that v5 was leaking the xmlDocPtr. > > Now fixed in v6. Few comments: 1) Why the default option was chosen without comments shouldn't it be the other way round? +opt_xml_serialize_format: + INDENT { $$ = XMLSERIALIZE_INDENT; } + | NO INDENT { $$ = XMLSERIALIZE_NO_FORMAT; } + | CANONICAL { $$ = XMLSERIALIZE_CANONICAL; } + | CANONICAL WITH NO COMMENTS { $$ = XMLSERIALIZE_CANONICAL; } + | CANONICAL WITH COMMENTS { $$ = XMLSERIALIZE_CANONICAL_WITH_COMMENTS; } + | /*EMPTY*/ { $$ = XMLSERIALIZE_NO_FORMAT; } 2) This should be added to typedefs.list: +typedef enum XmlSerializeFormat +{ + XMLSERIALIZE_INDENT, /* pretty-printed xml serialization */ + XMLSERIALIZE_CANONICAL, /* canonical form without xml comments */ + XMLSERIALIZE_CANONICAL_WITH_COMMENTS, /* canonical form with xml comments */ + XMLSERIALIZE_NO_FORMAT /* unformatted xml representation */ +} XmlSerializeFormat; 3) This change is not required: return result; + #else NO_XML_SUPPORT(); return NULL; 4) This comment body needs slight reformatting: + /* + * Parse the input according to the xmloption. + * XML canonical expects a well-formed XML input, so here in case of + * XMLSERIALIZE_CANONICAL or XMLSERIALIZE_CANONICAL_WITH_COMMENTS we + * force xml_parse() to parse 'data' as XMLOPTION_DOCUMENT despite + * of the XmlOptionType given in 'xmloption_arg'. This enables the + * canonicalization of CONTENT fragments if they contain a singly-rooted + * XML - xml_parse() will thrown an error otherwise. + */ 5) Similarly here too: - if (newline == NULL || xmlerrcxt->err_occurred) + * Emit declaration only if the input had one. Note: some versions of + * xmlSaveToBuffer leak memory if a non-null encoding argument is + * passed, so don't do that. We don't want any encoding conversion + * anyway. + */ + if (decl_len == 0) 6) Similarly here too: + /* + * Deal with the case where we have non-singly-rooted XML. + * libxml's dump functions don't work well for that without help. + * We build a fake root node that serves as a container for the + * content nodes, and then iterate over the nodes. + */ 7) Similarly here too: + /* + * We use this node to insert newlines in the dump. Note: in at + * least some libxml versions, xmlNewDocText would not attach the + * node to the document even if we passed it. Therefore, manage + * freeing of this node manually, and pass NULL here to make sure + * there's not a dangling link. + */ 8) Should this: + * of the XmlOptionType given in 'xmloption_arg'. This enables the + * canonicalization of CONTENT fragments if they contain a singly-rooted + * XML - xml_parse() will thrown an error otherwise. Be: + * of the XmlOptionType given in 'xmloption_arg'. This enables the + * canonicalization of CONTENT fragments if they contain a singly-rooted + * XML - xml_parse() will throw an error otherwise. Regards, Vignesh
Hi Vignesh Thanks for the thorough review! On 04.10.23 11:39, vignesh C wrote: > Few comments: > 1) Why the default option was chosen without comments shouldn't it be > the other way round? > +opt_xml_serialize_format: > + INDENT > { $$ = XMLSERIALIZE_INDENT; } > + | NO INDENT > { $$ = XMLSERIALIZE_NO_FORMAT; } > + | CANONICAL > { $$ = XMLSERIALIZE_CANONICAL; } > + | CANONICAL WITH NO COMMENTS > { $$ = XMLSERIALIZE_CANONICAL; } > + | CANONICAL WITH COMMENTS > { $$ = XMLSERIALIZE_CANONICAL_WITH_COMMENTS; } > + | /*EMPTY*/ > { $$ = XMLSERIALIZE_NO_FORMAT; } I'm not sure it is the way to go. The main idea is to check if two documents have the same content, and comments might be different even if the contents of two documents are identical. What are your concerns regarding this default behaviour? > 2) This should be added to typedefs.list: > +typedef enum XmlSerializeFormat > +{ > + XMLSERIALIZE_INDENT, /* > pretty-printed xml serialization */ > + XMLSERIALIZE_CANONICAL, /* > canonical form without xml comments */ > + XMLSERIALIZE_CANONICAL_WITH_COMMENTS, /* canonical form with > xml comments */ > + XMLSERIALIZE_NO_FORMAT /* > unformatted xml representation */ > +} XmlSerializeFormat; added. > 3) This change is not required: > return result; > + > #else > NO_XML_SUPPORT(); > return NULL; removed. > > 4) This comment body needs slight reformatting: > + /* > + * Parse the input according to the xmloption. > + * XML canonical expects a well-formed XML input, so here in case of > + * XMLSERIALIZE_CANONICAL or XMLSERIALIZE_CANONICAL_WITH_COMMENTS we > + * force xml_parse() to parse 'data' as XMLOPTION_DOCUMENT despite > + * of the XmlOptionType given in 'xmloption_arg'. This enables the > + * canonicalization of CONTENT fragments if they contain a singly-rooted > + * XML - xml_parse() will thrown an error otherwise. > + */ reformatted. > 5) Similarly here too: > - if (newline == NULL || xmlerrcxt->err_occurred) > + * Emit declaration only if the input had one. > Note: some versions of > + * xmlSaveToBuffer leak memory if a non-null > encoding argument is > + * passed, so don't do that. We don't want any > encoding conversion > + * anyway. > + */ > + if (decl_len == 0) reformatted. > 6) Similarly here too: > + /* > + * Deal with the case where we have > non-singly-rooted XML. > + * libxml's dump functions don't work > well for that without help. > + * We build a fake root node that > serves as a container for the > + * content nodes, and then iterate over > the nodes. > + */ reformatted. > 7) Similarly here too: > + /* > + * We use this node to insert newlines > in the dump. Note: in at > + * least some libxml versions, > xmlNewDocText would not attach the > + * node to the document even if we > passed it. Therefore, manage > + * freeing of this node manually, and > pass NULL here to make sure > + * there's not a dangling link. > + */ reformatted. > 8) Should this: > + * of the XmlOptionType given in 'xmloption_arg'. This enables the > + * canonicalization of CONTENT fragments if they contain a singly-rooted > + * XML - xml_parse() will thrown an error otherwise. > Be: > + * of the XmlOptionType given in 'xmloption_arg'. This enables the > + * canonicalization of CONTENT fragments if they contain a singly-rooted > + * XML - xml_parse() will throw an error otherwise. I didn't understand the suggestion in 8) :) Thanks again for the review. Much appreciated! v7 attached. Best, Jim
Вложения
On 2023-10-04 12:19, Jim Jones wrote: > On 04.10.23 11:39, vignesh C wrote: >> 1) Why the default option was chosen without comments shouldn't it be >> the other way round? > I'm not sure it is the way to go. The main idea is to check if two > documents have the same content, and comments might be different even > if the contents of two documents are identical. What are your concerns > regarding this default behaviour? I hope I'm not butting in, but I too would be leery of any default behavior that's going to say thing1 and thing2 are the same thing but ignoring (name part of thing here). If that's the comparison I mean to make, and it's as easy as CANONICAL WITHOUT COMMENTS to say that's what I mean, I'd be happy to write that. It also means that the next person reading my code will know "oh, he means 'same' in *that* way", without having to think about it. Regards, -Chap
Hi Chap On 04.10.23 23:05, Chapman Flack wrote: > I hope I'm not butting in, but I too would be leery of any default > behavior that's going to say thing1 and thing2 are the same thing > but ignoring (name part of thing here). If that's the comparison > I mean to make, and it's as easy as CANONICAL WITHOUT COMMENTS > to say that's what I mean, I'd be happy to write that. It also means > that the next person reading my code will know "oh, he means > 'same' in *that* way", without having to think about it. That's a very compelling argument. Thanks for that! It is indeed clearer to only remove items from the result set if explicitly said so. v8 attached changes de default behaviour to WITH COMMENTS. Best, Jim
Вложения
On 05.10.23 09:38, Jim Jones wrote: > > v8 attached changes de default behaviour to WITH COMMENTS. v9 attached with rebase due to changes done to primnodes.h in 615f5f6 -- Jim