Обсуждение: xmlserialize bug - extra empty row at the end

Поиск
Список
Период
Сортировка

xmlserialize bug - extra empty row at the end

От
Pavel Stehule
Дата:
Hi

maybe I found a bug in xmlserialize

SELECT xmlserialize(DOCUMENT '<foo><bar><val x="y">42</val></bar></foo>' AS varchar INDENT);

(2023-04-23 07:27:53) postgres=# SELECT xmlserialize(DOCUMENT '<foo><bar><val x="y">42</val></bar></foo>' AS varchar INDENT);
┌─────────────────────────┐
│      xmlserialize       │
╞═════════════════════════╡
│ <foo>                  ↵│
│   <bar>                ↵│
│     <val x="y">42</val>↵│
│   </bar>               ↵│
│ </foo>                 ↵│
│                         │
└─────────────────────────┘
(1 row)

Looks so there is an extra empty row.

Regards

Pavel

Re: xmlserialize bug - extra empty row at the end

От
Jim Jones
Дата:
On 23.04.23 07:31, Pavel Stehule wrote:
> Hi
>
> maybe I found a bug in xmlserialize
>
> SELECT xmlserialize(DOCUMENT '<foo><bar><val 
> x="y">42</val></bar></foo>' AS varchar INDENT);
>
> (2023-04-23 07:27:53) postgres=# SELECT xmlserialize(DOCUMENT 
> '<foo><bar><val x="y">42</val></bar></foo>' AS varchar INDENT);
> ┌─────────────────────────┐
> │      xmlserialize       │
> ╞═════════════════════════╡
> │ <foo>                  ↵│
> │   <bar>                ↵│
> │     <val x="y">42</val>↵│
> │   </bar>               ↵│
> │ </foo>                 ↵│
> │                         │
> └─────────────────────────┘
> (1 row)
>
> Looks so there is an extra empty row.
>
> Regards
>
> Pavel

Hi Pavel,

Good catch! It looks like it comes directly from libxml2.

xmlDocPtr doc = xmlReadDoc(BAD_CAST "<foo><bar><val 
x=\"y\">42</val></bar></foo>", NULL, NULL, 0 );
xmlBufferPtr buf = NULL;
xmlSaveCtxtPtr ctxt = NULL;

buf = xmlBufferCreate();
ctxt = xmlSaveToBuffer(buf, NULL, XML_SAVE_NO_DECL | XML_SAVE_FORMAT);

xmlSaveDoc(ctxt, doc);
xmlSaveClose(ctxt);

printf("'%s'",buf->content);

==>

'<foo>
   <bar>
     <val x="y">42</val>
   </bar>
</foo>
'

I'll do some digging to see if there is a good way to get rid of this 
newline or if we need to chose a different dump function.

Thanks!

Best, Jim




Re: xmlserialize bug - extra empty row at the end

От
Dmitry Dolgov
Дата:
> On Sun, Apr 23, 2023 at 02:02:17PM +0200, Jim Jones wrote:
> On 23.04.23 07:31, Pavel Stehule wrote:
> > Hi
> >
> > maybe I found a bug in xmlserialize
> >
> > SELECT xmlserialize(DOCUMENT '<foo><bar><val x="y">42</val></bar></foo>'
> > AS varchar INDENT);
> >
> > (2023-04-23 07:27:53) postgres=# SELECT xmlserialize(DOCUMENT
> > '<foo><bar><val x="y">42</val></bar></foo>' AS varchar INDENT);
> > ┌─────────────────────────┐
> > │      xmlserialize       │
> > ╞═════════════════════════╡
> > │ <foo>                  ↵│
> > │   <bar>                ↵│
> > │     <val x="y">42</val>↵│
> > │   </bar>               ↵│
> > │ </foo>                 ↵│
> > │                         │
> > └─────────────────────────┘
> > (1 row)
> >
> > Looks so there is an extra empty row.
> >
> > Regards
> >
> > Pavel
>
> Hi Pavel,
>
> Good catch! It looks like it comes directly from libxml2.
>
> xmlDocPtr doc = xmlReadDoc(BAD_CAST "<foo><bar><val
> x=\"y\">42</val></bar></foo>", NULL, NULL, 0 );
> xmlBufferPtr buf = NULL;
> xmlSaveCtxtPtr ctxt = NULL;
>
> buf = xmlBufferCreate();
> ctxt = xmlSaveToBuffer(buf, NULL, XML_SAVE_NO_DECL | XML_SAVE_FORMAT);
>
> xmlSaveDoc(ctxt, doc);
> xmlSaveClose(ctxt);
>
> printf("'%s'",buf->content);
>
> ==>
>
> '<foo>
>   <bar>
>     <val x="y">42</val>
>   </bar>
> </foo>
> '

It looks like this happens only if xml type is DOCUMENT, and thus
xmlSaveDoc is used to save the doc directly. I might be wrong, but after
a quick look at the corresponding libxml functionality it seems that a
new line is getting added if the element type is not XML_XINCLUDE_{START|END},
which is unfortunate if correct.



Re: xmlserialize bug - extra empty row at the end

От
Isaac Morland
Дата:
On Sun, 23 Apr 2023 at 01:31, Pavel Stehule <pavel.stehule@gmail.com> wrote:
Hi

maybe I found a bug in xmlserialize

SELECT xmlserialize(DOCUMENT '<foo><bar><val x="y">42</val></bar></foo>' AS varchar INDENT);

(2023-04-23 07:27:53) postgres=# SELECT xmlserialize(DOCUMENT '<foo><bar><val x="y">42</val></bar></foo>' AS varchar INDENT);
┌─────────────────────────┐
│      xmlserialize       │
╞═════════════════════════╡
│ <foo>                  ↵│
│   <bar>                ↵│
│     <val x="y">42</val>↵│
│   </bar>               ↵│
│ </foo>                 ↵│
│                         │
└─────────────────────────┘
(1 row)

Looks so there is an extra empty row.

I wouldn't necessarily worry about this much. There is not, as such, an extra blank line at the end; rather it is conventional that a text file should end with a newline character. That is, conventionally every single line in a text file ends with a newline character, meaning the only text file that doesn't end with a newline is the empty file. You can see this in tools like diff, which explicitly report "no newline at end of file" if the file ends with a different character.

If you were to save the value to a file you would probably want it the way it is.

That being said, this is a database column result and I agree it would look more elegant if the blank line in the display were not there. I might go so far as to change the psql display routines to not leave a blank line after the content in the event it ends with a newline.

Re: xmlserialize bug - extra empty row at the end

От
Tom Lane
Дата:
Jim Jones <jim.jones@uni-muenster.de> writes:
> On 23.04.23 07:31, Pavel Stehule wrote:
>> Looks so there is an extra empty row.

> Good catch! It looks like it comes directly from libxml2.

Is it really a bug?  If libxml2 itself is putting in that newline,
I'm not sure we should take it on ourselves to strip it.

> I'll do some digging to see if there is a good way to get rid of this 
> newline or if we need to chose a different dump function.

If we do want to strip it, I'd just add a couple lines of code
to delete any trailing newlines at the end of the process.
Compare, eg, pchomp().

            regards, tom lane



Re: xmlserialize bug - extra empty row at the end

От
Pavel Stehule
Дата:


ne 23. 4. 2023 v 14:42 odesílatel Isaac Morland <isaac.morland@gmail.com> napsal:
On Sun, 23 Apr 2023 at 01:31, Pavel Stehule <pavel.stehule@gmail.com> wrote:
Hi

maybe I found a bug in xmlserialize

SELECT xmlserialize(DOCUMENT '<foo><bar><val x="y">42</val></bar></foo>' AS varchar INDENT);

(2023-04-23 07:27:53) postgres=# SELECT xmlserialize(DOCUMENT '<foo><bar><val x="y">42</val></bar></foo>' AS varchar INDENT);
┌─────────────────────────┐
│      xmlserialize       │
╞═════════════════════════╡
│ <foo>                  ↵│
│   <bar>                ↵│
│     <val x="y">42</val>↵│
│   </bar>               ↵│
│ </foo>                 ↵│
│                         │
└─────────────────────────┘
(1 row)

Looks so there is an extra empty row.

I wouldn't necessarily worry about this much. There is not, as such, an extra blank line at the end; rather it is conventional that a text file should end with a newline character. That is, conventionally every single line in a text file ends with a newline character, meaning the only text file that doesn't end with a newline is the empty file. You can see this in tools like diff, which explicitly report "no newline at end of file" if the file ends with a different character.

If you were to save the value to a file you would probably want it the way it is.

That being said, this is a database column result and I agree it would look more elegant if the blank line in the display were not there. I might go so far as to change the psql display routines to not leave a blank line after the content in the event it ends with a newline.

psql shows to display content without changes. I don't think it should be fixed on the client side.

But it can be easily fixed on the server side.  I think there is some code that tries to clean the end lines already.

regards

Pavel

Re: xmlserialize bug - extra empty row at the end

От
Pavel Stehule
Дата:


ne 23. 4. 2023 v 16:48 odesílatel Tom Lane <tgl@sss.pgh.pa.us> napsal:
Jim Jones <jim.jones@uni-muenster.de> writes:
> On 23.04.23 07:31, Pavel Stehule wrote:
>> Looks so there is an extra empty row.

> Good catch! It looks like it comes directly from libxml2.

Is it really a bug?  If libxml2 itself is putting in that newline,
I'm not sure we should take it on ourselves to strip it.

Maybe It is not a bug, but it can be messy. "json_pretty" doesn't do it.

 

> I'll do some digging to see if there is a good way to get rid of this
> newline or if we need to chose a different dump function.

If we do want to strip it, I'd just add a couple lines of code
to delete any trailing newlines at the end of the process.
Compare, eg, pchomp().

yes

Pavel
 

                        regards, tom lane

Re: xmlserialize bug - extra empty row at the end

От
Tom Lane
Дата:
Isaac Morland <isaac.morland@gmail.com> writes:
> That being said, this is a database column result and I agree it would look
> more elegant if the blank line in the display were not there.

Yeah, that would basically be the argument for editorializing on libxml2's
result.  It's a weak argument, but not entirely without merit.

> I might go so
> far as to change the psql display routines to not leave a blank line after
> the content in the event it ends with a newline.

psql has *no* business changing what it displays: if what came from the
server has a trailing newline, that had better be made visible.  Even if
we thought it was a good idea, it's about 25 years too late to reconsider
that.  However, xmlserialize()'s new indenting behavior isn't set in
stone yet, so we could contemplate chomping newlines within that.

            regards, tom lane



Re: xmlserialize bug - extra empty row at the end

От
Isaac Morland
Дата:
On Sun, 23 Apr 2023 at 10:52, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Isaac Morland <isaac.morland@gmail.com> writes:
 
> I might go so
> far as to change the psql display routines to not leave a blank line after
> the content in the event it ends with a newline.

psql has *no* business changing what it displays: if what came from the
server has a trailing newline, that had better be made visible.  Even if
we thought it was a good idea, it's about 25 years too late to reconsider
that.  However, xmlserialize()'s new indenting behavior isn't set in
stone yet, so we could contemplate chomping newlines within that.

The trailing newline is made visible by the little bent arrow character that appears at the right hand side. So you could still tell whether the value ended with a trailing newline. I agree that simply dropping the trailing newline before displaying the value would be a very bad idea.

Re: xmlserialize bug - extra empty row at the end

От
Pavel Stehule
Дата:


Dne ne 23. 4. 2023 18:03 uživatel Isaac Morland <isaac.morland@gmail.com> napsal:
On Sun, 23 Apr 2023 at 10:52, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Isaac Morland <isaac.morland@gmail.com> writes:
 
> I might go so
> far as to change the psql display routines to not leave a blank line after
> the content in the event it ends with a newline.

psql has *no* business changing what it displays: if what came from the
server has a trailing newline, that had better be made visible.  Even if
we thought it was a good idea, it's about 25 years too late to reconsider
that.  However, xmlserialize()'s new indenting behavior isn't set in
stone yet, so we could contemplate chomping newlines within that.

The trailing newline is made visible by the little bent arrow character that appears at the right hand side. So you could still tell whether the value ended with a trailing newline. I agree that simply dropping the trailing newline before displaying the value would be a very bad idea.

What is benefit or usage of this trailing newline?

Personally, when I do some formatting I prefer adding newlines on client side before necessity of removing.



Re: xmlserialize bug - extra empty row at the end

От
Isaac Morland
Дата:
On Sun, 23 Apr 2023 at 12:28, Pavel Stehule <pavel.stehule@gmail.com> wrote:


Dne ne 23. 4. 2023 18:03 uživatel Isaac Morland <isaac.morland@gmail.com> napsal:
On Sun, 23 Apr 2023 at 10:52, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Isaac Morland <isaac.morland@gmail.com> writes:
 
> I might go so
> far as to change the psql display routines to not leave a blank line after
> the content in the event it ends with a newline.

psql has *no* business changing what it displays: if what came from the
server has a trailing newline, that had better be made visible.  Even if
we thought it was a good idea, it's about 25 years too late to reconsider
that.  However, xmlserialize()'s new indenting behavior isn't set in
stone yet, so we could contemplate chomping newlines within that.

The trailing newline is made visible by the little bent arrow character that appears at the right hand side. So you could still tell whether the value ended with a trailing newline. I agree that simply dropping the trailing newline before displaying the value would be a very bad idea.

What is benefit or usage of this trailing newline?

When creating a text file, it is conventional to end it with a newline. Every single line of the file is ended with a newline, including the last line of a file. Various tools deal with text files which are missing the newline on the last line in various ways depending on context. If you "cat" a file which is missing its trailing newline, the command prompt naturally ends up on the same line of the display as the characters after the last newline. Tools like “less” often adjust their display so the presence or absence of the trailing newline makes no difference.

So it’s not so much about benefit or usage, it’s about what text files normally contain. Newline characters are used as line terminators, not line separators.

Of course, it's conventional for a database value not to end with a newline. If I store a person's name in the database, it would be weird to append a newline to the end. Here we have serialized XML which we tend to think of storing in a text file — where one would expect it to end with a newline — but we're displaying it in a table cell as part of the output of a database query, where one typically does not expect values to end with a newline (although they can, and psql displays values differently even if they differ only in the presence or absence of a newline at the end).

If you were to load a typical text file into a column of a database row and display it using psql, you would see the same phenomenon.

My inclination would be, if we're just calling to a long-standardized library routine, to just accept its output as is. If a program is saving the output to a text file, that would be the expected behaviour. If not, then we need to document that the output of our function is the output of the library function, minus the trailing newline.

Personally, when I do some formatting I prefer adding newlines on client side before necessity of removing.

Re: xmlserialize bug - extra empty row at the end

От
Jim Jones
Дата:

My inclination would be, if we're just calling to a long-standardized library routine, to just accept its output as is. If a program is saving the output to a text file, that would be the expected behaviour. If not, then we need to document that the output of our function is the output of the library function, minus the trailing newline.

After some digging on the matter, I'd also tend to leave the output as is, but I also do understand the other arguments - specially the consistency with jsonb_pretty().

If we agree to remove it, the change wouldn't be substantial :) I guess we could just pchomp it in the end of the function, as suggested by Tom. Attached a draft patch.

Best, Jim

Вложения

Re: xmlserialize bug - extra empty row at the end

От
Tom Lane
Дата:
Jim Jones <jim.jones@uni-muenster.de> writes:
> If we agree to remove it, the change wouldn't be substantial :) I guess 
> we could just pchomp it in the end of the function, as suggested by Tom. 
> Attached a draft patch.

I wouldn't actually *use* pchomp here, because that induces an unnecessary
copy of the result string.  I had in mind more like copying pchomp's code
to count up the trailing newline(s) and then pass a corrected length
to cstring_to_text_with_len.

You could simplify matters by doing that in all cases, too.  It should
never find anything to remove in the non-indented case, but the check
should be of negligible cost in context.

            regards, tom lane



Re: xmlserialize bug - extra empty row at the end

От
Jim Jones
Дата:
On 24.04.23 03:18, Tom Lane wrote:
> I wouldn't actually *use* pchomp here, because that induces an 
> unnecessary
> copy of the result string.  I had in mind more like copying pchomp's code
> to count up the trailing newline(s) and then pass a corrected length
> to cstring_to_text_with_len.
Changed.
> You could simplify matters by doing that in all cases, too.  It should
> never find anything to remove in the non-indented case, but the check
> should be of negligible cost in context.

I'm not sure I understood it correctly.

The non-indented cases should never find anything and indented cases 
with CONTENT strings do not add trailing newlines, so this is only 
applicable with DOCUMENT .. INDENT, right?

Something like this would suffice?

if(xmloption_arg != XMLOPTION_DOCUMENT)
     result = (text *) xmlBuffer_to_xmltype(buf);
else
{
     int    len = xmlBufferLength(buf);
     const char *xmloutput = (const char *) xmlBufferContent(buf);

     while (len > 0 && xmloutput[len - 1] == '\n')
         len--;

     result = cstring_to_text_with_len(xmloutput, len);
}

If we really agree on manually removing the trailing newlines I will 
open a CF entry for this.

Best, Jim


Вложения