Обсуждение: xmlserialize bug - extra empty row at the end
Hi
maybe I found a bug in xmlserialize
SELECT xmlserialize(DOCUMENT '<foo><bar><val x="y">42</val></bar></foo>' AS varchar INDENT);
(2023-04-23 07:27:53) postgres=# SELECT xmlserialize(DOCUMENT '<foo><bar><val x="y">42</val></bar></foo>' AS varchar INDENT);
┌─────────────────────────┐
│ xmlserialize │
╞═════════════════════════╡
│ <foo> ↵│
│ <bar> ↵│
│ <val x="y">42</val>↵│
│ </bar> ↵│
│ </foo> ↵│
│ │
└─────────────────────────┘
(1 row)
┌─────────────────────────┐
│ xmlserialize │
╞═════════════════════════╡
│ <foo> ↵│
│ <bar> ↵│
│ <val x="y">42</val>↵│
│ </bar> ↵│
│ </foo> ↵│
│ │
└─────────────────────────┘
(1 row)
Looks so there is an extra empty row.
Regards
Pavel
On 23.04.23 07:31, Pavel Stehule wrote: > Hi > > maybe I found a bug in xmlserialize > > SELECT xmlserialize(DOCUMENT '<foo><bar><val > x="y">42</val></bar></foo>' AS varchar INDENT); > > (2023-04-23 07:27:53) postgres=# SELECT xmlserialize(DOCUMENT > '<foo><bar><val x="y">42</val></bar></foo>' AS varchar INDENT); > ┌─────────────────────────┐ > │ xmlserialize │ > ╞═════════════════════════╡ > │ <foo> ↵│ > │ <bar> ↵│ > │ <val x="y">42</val>↵│ > │ </bar> ↵│ > │ </foo> ↵│ > │ │ > └─────────────────────────┘ > (1 row) > > Looks so there is an extra empty row. > > Regards > > Pavel Hi Pavel, Good catch! It looks like it comes directly from libxml2. xmlDocPtr doc = xmlReadDoc(BAD_CAST "<foo><bar><val x=\"y\">42</val></bar></foo>", NULL, NULL, 0 ); xmlBufferPtr buf = NULL; xmlSaveCtxtPtr ctxt = NULL; buf = xmlBufferCreate(); ctxt = xmlSaveToBuffer(buf, NULL, XML_SAVE_NO_DECL | XML_SAVE_FORMAT); xmlSaveDoc(ctxt, doc); xmlSaveClose(ctxt); printf("'%s'",buf->content); ==> '<foo> <bar> <val x="y">42</val> </bar> </foo> ' I'll do some digging to see if there is a good way to get rid of this newline or if we need to chose a different dump function. Thanks! Best, Jim
> On Sun, Apr 23, 2023 at 02:02:17PM +0200, Jim Jones wrote: > On 23.04.23 07:31, Pavel Stehule wrote: > > Hi > > > > maybe I found a bug in xmlserialize > > > > SELECT xmlserialize(DOCUMENT '<foo><bar><val x="y">42</val></bar></foo>' > > AS varchar INDENT); > > > > (2023-04-23 07:27:53) postgres=# SELECT xmlserialize(DOCUMENT > > '<foo><bar><val x="y">42</val></bar></foo>' AS varchar INDENT); > > ┌─────────────────────────┐ > > │ xmlserialize │ > > ╞═════════════════════════╡ > > │ <foo> ↵│ > > │ <bar> ↵│ > > │ <val x="y">42</val>↵│ > > │ </bar> ↵│ > > │ </foo> ↵│ > > │ │ > > └─────────────────────────┘ > > (1 row) > > > > Looks so there is an extra empty row. > > > > Regards > > > > Pavel > > Hi Pavel, > > Good catch! It looks like it comes directly from libxml2. > > xmlDocPtr doc = xmlReadDoc(BAD_CAST "<foo><bar><val > x=\"y\">42</val></bar></foo>", NULL, NULL, 0 ); > xmlBufferPtr buf = NULL; > xmlSaveCtxtPtr ctxt = NULL; > > buf = xmlBufferCreate(); > ctxt = xmlSaveToBuffer(buf, NULL, XML_SAVE_NO_DECL | XML_SAVE_FORMAT); > > xmlSaveDoc(ctxt, doc); > xmlSaveClose(ctxt); > > printf("'%s'",buf->content); > > ==> > > '<foo> > <bar> > <val x="y">42</val> > </bar> > </foo> > ' It looks like this happens only if xml type is DOCUMENT, and thus xmlSaveDoc is used to save the doc directly. I might be wrong, but after a quick look at the corresponding libxml functionality it seems that a new line is getting added if the element type is not XML_XINCLUDE_{START|END}, which is unfortunate if correct.
On Sun, 23 Apr 2023 at 01:31, Pavel Stehule <pavel.stehule@gmail.com> wrote:
Himaybe I found a bug in xmlserializeSELECT xmlserialize(DOCUMENT '<foo><bar><val x="y">42</val></bar></foo>' AS varchar INDENT);(2023-04-23 07:27:53) postgres=# SELECT xmlserialize(DOCUMENT '<foo><bar><val x="y">42</val></bar></foo>' AS varchar INDENT);
┌─────────────────────────┐
│ xmlserialize │
╞═════════════════════════╡
│ <foo> ↵│
│ <bar> ↵│
│ <val x="y">42</val>↵│
│ </bar> ↵│
│ </foo> ↵│
│ │
└─────────────────────────┘
(1 row)Looks so there is an extra empty row.
I wouldn't necessarily worry about this much. There is not, as such, an extra blank line at the end; rather it is conventional that a text file should end with a newline character. That is, conventionally every single line in a text file ends with a newline character, meaning the only text file that doesn't end with a newline is the empty file. You can see this in tools like diff, which explicitly report "no newline at end of file" if the file ends with a different character.
If you were to save the value to a file you would probably want it the way it is.
That being said, this is a database column result and I agree it would look more elegant if the blank line in the display were not there. I might go so far as to change the psql display routines to not leave a blank line after the content in the event it ends with a newline.
Jim Jones <jim.jones@uni-muenster.de> writes: > On 23.04.23 07:31, Pavel Stehule wrote: >> Looks so there is an extra empty row. > Good catch! It looks like it comes directly from libxml2. Is it really a bug? If libxml2 itself is putting in that newline, I'm not sure we should take it on ourselves to strip it. > I'll do some digging to see if there is a good way to get rid of this > newline or if we need to chose a different dump function. If we do want to strip it, I'd just add a couple lines of code to delete any trailing newlines at the end of the process. Compare, eg, pchomp(). regards, tom lane
ne 23. 4. 2023 v 14:42 odesílatel Isaac Morland <isaac.morland@gmail.com> napsal:
On Sun, 23 Apr 2023 at 01:31, Pavel Stehule <pavel.stehule@gmail.com> wrote:Himaybe I found a bug in xmlserializeSELECT xmlserialize(DOCUMENT '<foo><bar><val x="y">42</val></bar></foo>' AS varchar INDENT);(2023-04-23 07:27:53) postgres=# SELECT xmlserialize(DOCUMENT '<foo><bar><val x="y">42</val></bar></foo>' AS varchar INDENT);
┌─────────────────────────┐
│ xmlserialize │
╞═════════════════════════╡
│ <foo> ↵│
│ <bar> ↵│
│ <val x="y">42</val>↵│
│ </bar> ↵│
│ </foo> ↵│
│ │
└─────────────────────────┘
(1 row)Looks so there is an extra empty row.I wouldn't necessarily worry about this much. There is not, as such, an extra blank line at the end; rather it is conventional that a text file should end with a newline character. That is, conventionally every single line in a text file ends with a newline character, meaning the only text file that doesn't end with a newline is the empty file. You can see this in tools like diff, which explicitly report "no newline at end of file" if the file ends with a different character.If you were to save the value to a file you would probably want it the way it is.That being said, this is a database column result and I agree it would look more elegant if the blank line in the display were not there. I might go so far as to change the psql display routines to not leave a blank line after the content in the event it ends with a newline.
psql shows to display content without changes. I don't think it should be fixed on the client side.
But it can be easily fixed on the server side. I think there is some code that tries to clean the end lines already.
regards
Pavel
ne 23. 4. 2023 v 16:48 odesílatel Tom Lane <tgl@sss.pgh.pa.us> napsal:
Jim Jones <jim.jones@uni-muenster.de> writes:
> On 23.04.23 07:31, Pavel Stehule wrote:
>> Looks so there is an extra empty row.
> Good catch! It looks like it comes directly from libxml2.
Is it really a bug? If libxml2 itself is putting in that newline,
I'm not sure we should take it on ourselves to strip it.
Maybe It is not a bug, but it can be messy. "json_pretty" doesn't do it.
> I'll do some digging to see if there is a good way to get rid of this
> newline or if we need to chose a different dump function.
If we do want to strip it, I'd just add a couple lines of code
to delete any trailing newlines at the end of the process.
Compare, eg, pchomp().
yes
Pavel
regards, tom lane
Isaac Morland <isaac.morland@gmail.com> writes: > That being said, this is a database column result and I agree it would look > more elegant if the blank line in the display were not there. Yeah, that would basically be the argument for editorializing on libxml2's result. It's a weak argument, but not entirely without merit. > I might go so > far as to change the psql display routines to not leave a blank line after > the content in the event it ends with a newline. psql has *no* business changing what it displays: if what came from the server has a trailing newline, that had better be made visible. Even if we thought it was a good idea, it's about 25 years too late to reconsider that. However, xmlserialize()'s new indenting behavior isn't set in stone yet, so we could contemplate chomping newlines within that. regards, tom lane
On Sun, 23 Apr 2023 at 10:52, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Isaac Morland <isaac.morland@gmail.com> writes:
> I might go so
> far as to change the psql display routines to not leave a blank line after
> the content in the event it ends with a newline.
psql has *no* business changing what it displays: if what came from the
server has a trailing newline, that had better be made visible. Even if
we thought it was a good idea, it's about 25 years too late to reconsider
that. However, xmlserialize()'s new indenting behavior isn't set in
stone yet, so we could contemplate chomping newlines within that.
The trailing newline is made visible by the little bent arrow character that appears at the right hand side. So you could still tell whether the value ended with a trailing newline. I agree that simply dropping the trailing newline before displaying the value would be a very bad idea.
Dne ne 23. 4. 2023 18:03 uživatel Isaac Morland <isaac.morland@gmail.com> napsal:
On Sun, 23 Apr 2023 at 10:52, Tom Lane <tgl@sss.pgh.pa.us> wrote:Isaac Morland <isaac.morland@gmail.com> writes:> I might go so
> far as to change the psql display routines to not leave a blank line after
> the content in the event it ends with a newline.
psql has *no* business changing what it displays: if what came from the
server has a trailing newline, that had better be made visible. Even if
we thought it was a good idea, it's about 25 years too late to reconsider
that. However, xmlserialize()'s new indenting behavior isn't set in
stone yet, so we could contemplate chomping newlines within that.The trailing newline is made visible by the little bent arrow character that appears at the right hand side. So you could still tell whether the value ended with a trailing newline. I agree that simply dropping the trailing newline before displaying the value would be a very bad idea.
What is benefit or usage of this trailing newline?
Personally, when I do some formatting I prefer adding newlines on client side before necessity of removing.
On Sun, 23 Apr 2023 at 12:28, Pavel Stehule <pavel.stehule@gmail.com> wrote:
Dne ne 23. 4. 2023 18:03 uživatel Isaac Morland <isaac.morland@gmail.com> napsal:On Sun, 23 Apr 2023 at 10:52, Tom Lane <tgl@sss.pgh.pa.us> wrote:Isaac Morland <isaac.morland@gmail.com> writes:> I might go so
> far as to change the psql display routines to not leave a blank line after
> the content in the event it ends with a newline.
psql has *no* business changing what it displays: if what came from the
server has a trailing newline, that had better be made visible. Even if
we thought it was a good idea, it's about 25 years too late to reconsider
that. However, xmlserialize()'s new indenting behavior isn't set in
stone yet, so we could contemplate chomping newlines within that.The trailing newline is made visible by the little bent arrow character that appears at the right hand side. So you could still tell whether the value ended with a trailing newline. I agree that simply dropping the trailing newline before displaying the value would be a very bad idea.What is benefit or usage of this trailing newline?
When creating a text file, it is conventional to end it with a newline. Every single line of the file is ended with a newline, including the last line of a file. Various tools deal with text files which are missing the newline on the last line in various ways depending on context. If you "cat" a file which is missing its trailing newline, the command prompt naturally ends up on the same line of the display as the characters after the last newline. Tools like “less” often adjust their display so the presence or absence of the trailing newline makes no difference.
So it’s not so much about benefit or usage, it’s about what text files normally contain. Newline characters are used as line terminators, not line separators.
Of course, it's conventional for a database value not to end with a newline. If I store a person's name in the database, it would be weird to append a newline to the end. Here we have serialized XML which we tend to think of storing in a text file — where one would expect it to end with a newline — but we're displaying it in a table cell as part of the output of a database query, where one typically does not expect values to end with a newline (although they can, and psql displays values differently even if they differ only in the presence or absence of a newline at the end).
If you were to load a typical text file into a column of a database row and display it using psql, you would see the same phenomenon.
My inclination would be, if we're just calling to a long-standardized library routine, to just accept its output as is. If a program is saving the output to a text file, that would be the expected behaviour. If not, then we need to document that the output of our function is the output of the library function, minus the trailing newline.
Personally, when I do some formatting I prefer adding newlines on client side before necessity of removing.
My inclination would be, if we're just calling to a long-standardized library routine, to just accept its output as is. If a program is saving the output to a text file, that would be the expected behaviour. If not, then we need to document that the output of our function is the output of the library function, minus the trailing newline.
After some digging on the matter, I'd also tend to leave the output as is, but I also do understand the other arguments - specially the consistency with jsonb_pretty().
If we agree to remove it, the change wouldn't be substantial :) I guess we could just pchomp it in the end of the function, as suggested by Tom. Attached a draft patch.
Best, Jim
Вложения
Jim Jones <jim.jones@uni-muenster.de> writes: > If we agree to remove it, the change wouldn't be substantial :) I guess > we could just pchomp it in the end of the function, as suggested by Tom. > Attached a draft patch. I wouldn't actually *use* pchomp here, because that induces an unnecessary copy of the result string. I had in mind more like copying pchomp's code to count up the trailing newline(s) and then pass a corrected length to cstring_to_text_with_len. You could simplify matters by doing that in all cases, too. It should never find anything to remove in the non-indented case, but the check should be of negligible cost in context. regards, tom lane
On 24.04.23 03:18, Tom Lane wrote: > I wouldn't actually *use* pchomp here, because that induces an > unnecessary > copy of the result string. I had in mind more like copying pchomp's code > to count up the trailing newline(s) and then pass a corrected length > to cstring_to_text_with_len. Changed. > You could simplify matters by doing that in all cases, too. It should > never find anything to remove in the non-indented case, but the check > should be of negligible cost in context. I'm not sure I understood it correctly. The non-indented cases should never find anything and indented cases with CONTENT strings do not add trailing newlines, so this is only applicable with DOCUMENT .. INDENT, right? Something like this would suffice? if(xmloption_arg != XMLOPTION_DOCUMENT) result = (text *) xmlBuffer_to_xmltype(buf); else { int len = xmlBufferLength(buf); const char *xmloutput = (const char *) xmlBufferContent(buf); while (len > 0 && xmloutput[len - 1] == '\n') len--; result = cstring_to_text_with_len(xmloutput, len); } If we really agree on manually removing the trailing newlines I will open a CF entry for this. Best, Jim