Thread: xml type and encodings

xml type and encodings

From
Peter Eisentraut
Date:
We need to decide on how to handle encoding information embedded in xml 
data that is passed through the client/server encoding conversion.

Here is an example:

Client encoding is B, server encoding is A.  Client sends an xml datum 
that looks like this:

INSERT INTO table VALUES (xmlparse(document '<?xml version="1.0" 
encoding="C"?><content>...</content>'));

Assuming that A, B, and C are all distinct, this could fail at a number 
of places.

I suggest that we make the system ignore all encoding declarations in 
xml data.  That is, in the above example, the string would actually 
have to be encoded in client encoding B on the client, would be 
converted to A on the server and stored as such.  As far as I can tell, 
this is easily implemented and allowed by the XML standard.
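
A minimal sketch of this rule in Python, assuming nothing about the eventual server code: the helper name `recode_ignoring_declaration` and the sample encodings are illustrative only. The incoming bytes are interpreted strictly in the session's client encoding and recoded to the server encoding; the embedded declaration is carried along as ordinary characters and never consulted.

```python
def recode_ignoring_declaration(raw: bytes, client_encoding: str,
                                server_encoding: str) -> bytes:
    """Recode an incoming xml datum without looking at its XML declaration."""
    # Interpret the bytes strictly according to the session's client encoding.
    text = raw.decode(client_encoding)
    # The declaration travels along as ordinary characters and is never consulted.
    return text.encode(server_encoding)


# Hypothetical datum: declared as KOI8-R, actually sent in Latin-1.
doc = '<?xml version="1.0" encoding="KOI8-R"?><content>caf\u00e9</content>'
wire = doc.encode("latin-1")
stored = recode_ignoring_declaration(wire, "latin-1", "utf-8")
```

Note that the stored value still carries the original declaration text, even though it no longer describes the bytes.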

The same would be done on the way back.  The datum would arrive in 
encoding B on the client.  It might be implementation-dependent whether 
the datum actually contains an XML declaration specifying an encoding 
and whether that encoding might read A, B, or C -- I haven't figured 
that out yet -- but the client will always be required to consider it 
to be B.

What should be done about the binary send/receive functionality?  
Looking at the send/receive functions for the text type, they 
communicate all data in the server encoding, so it seems reasonable to 
do this here as well.

Comments?

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/


Re: xml type and encodings

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> Looking at the send/receive functions for the text type, they 
> communicate all data in the server encoding, so it seems reasonable to 
> do this here as well.

Uh, no, I'm pretty sure there's a translation to the client encoding.
It's in a subroutine not in text_send proper.
        regards, tom lane


Re: xml type and encodings

From
"Nikolay Samokhvalov"
Date:
On 1/15/07, Peter Eisentraut <peter_e@gmx.net> wrote:
> Client encoding is B, server encoding is A.  Client sends an xml datum
> that looks like this:
>
> INSERT INTO table VALUES (xmlparse(document '<?xml version="1.0"
> encoding="C"?><content>...</content>'));
>
> Assuming that A, B, and C are all distinct, this could fail at a number
> of places.
>
> I suggest that we make the system ignore all encoding declarations in
> xml data.  That is, in the above example, the string would actually
> have to be encoded in client encoding B on the client, would be
> converted to A on the server and stored as such.  As far as I can tell,
> this is easily implemented and allowed by the XML standard.

In other words, if B != C the server must raise an error, right?

--
Best regards,
Nikolay

Re: xml type and encodings

From
Peter Eisentraut
Date:
On Monday, 15 January 2007 12:42, Nikolay Samokhvalov wrote:
> On 1/15/07, Peter Eisentraut <peter_e@gmx.net> wrote:
> > Client encoding is B, server encoding is A.  Client sends an xml datum
> > that looks like this:
> >
> > INSERT INTO table VALUES (xmlparse(document '<?xml version="1.0"
> > encoding="C"?><content>...</content>'));
> >
> > Assuming that A, B, and C are all distinct, this could fail at a number
> > of places.
> >
> > I suggest that we make the system ignore all encoding declarations in
> > xml data.  That is, in the above example, the string would actually
> > have to be encoded in client encoding B on the client, would be
> > converted to A on the server and stored as such.  As far as I can tell,
> > this is easily implemented and allowed by the XML standard.
>
> In other words, in case when B != C server must trigger an error, right?

No, C is ignored in all cases.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/


Re: xml type and encodings

From
"Florian G. Pflug"
Date:
Peter Eisentraut wrote:
> On Monday, 15 January 2007 12:42, Nikolay Samokhvalov wrote:
>> On 1/15/07, Peter Eisentraut <peter_e@gmx.net> wrote:
>>> Client encoding is B, server encoding is A.  Client sends an xml datum
>>> that looks like this:
>>>
>>> INSERT INTO table VALUES (xmlparse(document '<?xml version="1.0"
>>> encoding="C"?><content>...</content>'));
>>>
>>> Assuming that A, B, and C are all distinct, this could fail at a number
>>> of places.
>>>
>>> I suggest that we make the system ignore all encoding declarations in
>>> xml data.  That is, in the above example, the string would actually
>>> have to be encoded in client encoding B on the client, would be
>>> converted to A on the server and stored as such.  As far as I can tell,
>>> this is easily implemented and allowed by the XML standard.
>> In other words, in case when B != C server must trigger an error, right?
> 
> No, C is ignored in all cases.

Would this mean that if the client_encoding is for example latin1, and I
retrieve an xml document uploaded by a client with client_encoding utf-8 
(and thus having encoding="utf-8" in the xml tag), that I would get a 
document with latin1 encoding but saying that it's utf-8 in its xml tag?

greetings, Florian Pflug




Re: xml type and encodings

From
Peter Eisentraut
Date:
On Monday, 15 January 2007 17:33, Florian G. Pflug wrote:
> Would this mean that if the client_encoding is for example latin1, and I
> retrieve an xml document uploaded by a client with client_encoding utf-8
> (and thus having encoding="utf-8" in the xml tag), that I would get a
> document with latin1 encoding but saying that it's utf-8 in its xml tag?

That is likely to be a consequence of this proposed behaviour, but no doubt 
not a nice one.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/


Re: xml type and encodings

From
Martijn van Oosterhout
Date:
On Mon, Jan 15, 2007 at 05:47:37PM +0100, Peter Eisentraut wrote:
> On Monday, 15 January 2007 17:33, Florian G. Pflug wrote:
> > Would this mean that if the client_encoding is for example latin1, and I
> > retrieve an xml document uploaded by a client with client_encoding utf-8
> > (and thus having encoding="utf-8" in the xml tag), that I would get a
> > document with latin1 encoding but saying that it's utf-8 in its xml tag?
>
> That is likely to be a consequence of this proposed behaviour, but no doubt
> not a nice one.

The only real alternative is to treat xml more like bytea than text
(i.e., treat the input as a stream of octets). Whether that's "nice", I
have no idea.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.

Re: xml type and encodings

From
"Florian G. Pflug"
Date:
Peter Eisentraut wrote:
> On Monday, 15 January 2007 17:33, Florian G. Pflug wrote:
>> Would this mean that if the client_encoding is for example latin1, and I
>> retrieve an xml document uploaded by a client with client_encoding utf-8
>> (and thus having encoding="utf-8" in the xml tag), that I would get a
>> document with latin1 encoding but saying that it's utf-8 in its xml tag?
> 
> That is likely to be a consequence of this proposed behaviour, but no doubt 
> not a nice one.

Couldn't the server change the encoding declaration inside the xml to
the correct one (the same as client_encoding) before returning the result?
Otherwise, parsing the xml on the client with some xml library becomes
difficult, because the library is likely to get confused by the wrong
encoding tag - and you can't even fix that by using the correct client
encoding, because you don't know what the encoding tag says until you
have retrieved the document...
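
The confusion described here is easy to reproduce with a stock parser. The sketch below uses Python's `xml.etree.ElementTree` purely as a stand-in for "some xml library" (an assumption; no particular client library is named in this thread), and the helper `parses` is hypothetical. The same document is serialized twice, once in the encoding its declaration names and once in latin1 while still claiming utf-8.

```python
import xml.etree.ElementTree as ET


def parses(raw: bytes) -> bool:
    """Return True if the parser accepts the byte string as well-formed XML."""
    try:
        ET.fromstring(raw)
        return True
    except ET.ParseError:
        return False


text = '<?xml version="1.0" encoding="utf-8"?><content>caf\u00e9</content>'
as_utf8 = text.encode("utf-8")      # bytes agree with the declaration
as_latin1 = text.encode("latin-1")  # bytes contradict the declaration

utf8_ok = parses(as_utf8)
latin1_ok = parses(as_latin1)
```

Here the mismatched copy is rejected because the latin1 byte for the accented character is not valid UTF-8. With pure-ASCII content the mismatch would go unnoticed.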

greetings, Florian Pflug



Re: xml type and encodings

From
Peter Eisentraut
Date:
Martijn van Oosterhout wrote:
> The only real alternative is to treat xml more like bytea than text
> (ie, treat the input as a stream of octets).

bytea isn't "treated" any differently than other data types.  You just 
have to take care in the client that you escape every byte greater than 
127.  The same option is available to you in xml, if you escape all 
suspicious characters using entities.  Then, the encoding declaration 
is immaterial anyway.  (Unless you allow UTF-16 into the picture, but 
let's say we exclude that implicitly.)
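
As a rough illustration of the entity approach in Python (the helper name `escape_non_ascii` is hypothetical): every non-ASCII character is replaced by a numeric character reference, so the resulting bytes are identical in any ASCII-compatible encoding. One caveat, stated as an assumption of the sketch: character references are only allowed in character data and attribute values, so element and attribute names must already be ASCII.

```python
import xml.etree.ElementTree as ET


def escape_non_ascii(xml_text: str) -> bytes:
    """Replace every non-ASCII character with a numeric character reference."""
    return xml_text.encode("ascii", errors="xmlcharrefreplace")


escaped = escape_non_ascii("<content>caf\u00e9</content>")
# The parser resolves the reference back to the original character.
roundtrip = ET.fromstring(escaped).text
```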

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/


Re: xml type and encodings

From
Peter Eisentraut
Date:
Florian G. Pflug wrote:
> Couldn't the server change the encoding declaration inside the xml to
> the correct one (the same as client_encoding) before returning the result?

The data type output function doesn't know what the client encoding is 
or whether the data will be shipped to the client at all.  But what I'm 
thinking is that we should remove the encoding declaration if possible.  
At least that would be less confusing, albeit still potentially 
incorrect if the client continues to process the document without care.
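
A sketch of what removing the declaration could look like, written in Python for brevity rather than as server code. The function name `strip_encoding_decl` and the regular expression are illustrative assumptions; a real implementation would more likely go through the XML parser than through pattern matching.

```python
import re

# Captures '<?xml version="..."' and matches the encoding pseudo-attribute
# that may follow it. Anything after that (e.g. standalone) is left alone.
_DECL = re.compile(
    r"""^(<\?xml\s+version\s*=\s*(?:"[^"]*"|'[^']*'))\s+encoding\s*=\s*(?:"[^"]*"|'[^']*')"""
)


def strip_encoding_decl(xml_text: str) -> str:
    """Drop the encoding pseudo-attribute from a leading XML declaration."""
    return _DECL.sub(r"\1", xml_text, count=1)


with_decl = '<?xml version="1.0" encoding="C"?><content>...</content>'
stripped = strip_encoding_decl(with_decl)
```

A document without a declaration passes through unchanged.
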
-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/


Re: xml type and encodings

From
"Florian G. Pflug"
Date:
Peter Eisentraut wrote:
> Florian G. Pflug wrote:
>> Couldn't the server change the encoding declaration inside the xml to
>> the correct one (the same as client_encoding) before returning the result?
> 
> The data type output function doesn't know what the client encoding is 
> or whether the data will be shipped to the client at all.  But what I'm 
> thinking is that we should remove the encoding declaration if possible.  
> At least that would be less confusing, albeit still potentially 
> incorrect if the client continues to process the document without care.

Sorry, I don't get it - how does this work for text, then? It works there
to dynamically recode the data from the database encoding to the client
encoding before sending it off to the client, no?

greetings, Florian Pflug


Re: xml type and encodings

From
Peter Eisentraut
Date:
Florian G. Pflug wrote:
> Sorry, I don't get it - how does this work for text, then? It works
> there to dynamically recode the data from the database encoding to
> the client encoding before sending it off to the client, no?

Sure, but it doesn't change the text inside the datum.
-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/


Re: xml type and encodings

From
"Andrew Dunstan"
Date:
Peter Eisentraut wrote:
> Florian G. Pflug wrote:
>> Couldn't the server change the encoding declaration inside the xml to
>> the correct one (the same as client_encoding) before returning the result?
>
> The data type output function doesn't know what the client encoding is
> or whether the data will be shipped to the client at all.  But what I'm
> thinking is that we should remove the encoding declaration if possible.
> At least that would be less confusing, albeit still potentially
> incorrect if the client continues to process the document without care.

The XML spec says:

"In the absence of information provided by an external transport protocol
(e.g. HTTP or MIME), it is a fatal error for an entity including an
encoding declaration to be presented to the XML processor in an encoding
other than that named in the declaration, or for an entity which begins
with neither a Byte Order Mark nor an encoding declaration to use an
encoding other than UTF-8. Note that since ASCII is a subset of UTF-8,
ordinary ASCII entities do not strictly need an encoding declaration."

ISTM we are reasonably entitled to require the client to pass in an xml
document that uses the client encoding, re-encoding it if necessary (and
adjusting the encoding decl if any in the process).

We should error out on any explicit encoding that conflicts with the
client encoding. I don't like the idea of just ignoring an explicit
encoding decl.
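
A sketch of such a check in Python, under loud assumptions: the function names are hypothetical, the regular expression only handles ASCII-compatible input (so not UTF-16), and deciding whether two encoding names mean the same thing is delegated to Python's codec alias table, which is only an approximation of any real equivalence between encoding names.

```python
import codecs
import re

# Matches the encoding pseudo-attribute of a leading XML declaration.
_ENCODING = re.compile(
    rb"^<\?xml[^>]*?\sencoding\s*=\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']"
)


def declared_encoding(raw: bytes):
    """Return the declared encoding name, or None if there is no declaration."""
    m = _ENCODING.match(raw)
    return m.group(1).decode("ascii") if m else None


def check_declaration(raw: bytes, client_encoding: str) -> None:
    """Raise ValueError if the declaration conflicts with the client encoding."""
    declared = declared_encoding(raw)
    if declared is None:
        return  # nothing declared, nothing to conflict with
    try:
        # codecs.lookup() maps known aliases to one canonical codec name.
        same = codecs.lookup(declared).name == codecs.lookup(client_encoding).name
    except LookupError as exc:
        raise ValueError(f"cannot compare encodings: {exc}") from exc
    if not same:
        raise ValueError(
            f"declared encoding {declared!r} conflicts with "
            f"client encoding {client_encoding!r}"
        )


sample = b'<?xml version="1.0" encoding="ISO-8859-1"?><content/>'
check_declaration(sample, "latin1")  # accepted: both names resolve to one codec
```

Names that the alias table does not know are reported as errors rather than guessed at, which is one of several possible policies.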

Are we going to ensure that what we hand back to another client has an
appropriate encoding decl? Or will we just remove it in all cases?

cheers

andrew



Re: xml type and encodings

From
Peter Eisentraut
Date:
Andrew Dunstan wrote:
> We should error out on any explicit encoding that conflicts with the
> client encoding. I don't like the idea of just ignoring an explicit
> encoding decl.

That is an instance of the problem of figuring out which encoding names 
are equivalent, which I believe we have settled on finding impossible.

> Are we going to ensure that what we hand back to another client has
> an appropriate encoding decl? Or will we just remove it in all cases?

We can't do the former, but the latter might be doable.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/


Re: xml type and encodings

From
Peter Eisentraut
Date:
I wrote:
> We need to decide on how to handle encoding information embedded in
> xml data that is passed through the client/server encoding
> conversion.

Tangentially related, I'm currently experimenting with a setup that 
stores all xml data in UTF-8 on the server, converting it back to the 
server encoding on output.  This doesn't do anything to solve the 
problem above, but it makes the internal processing much simpler, since 
all of libxml uses UTF-8 internally anyway.  Is anyone opposed to that 
setup on principle?
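
A toy model of that setup in Python, for illustration only. The class `XmlValue` and its methods are hypothetical and say nothing about how the server-side type is actually laid out: input arriving in the server encoding is normalised to UTF-8 for storage and converted back on output.

```python
class XmlValue:
    """Toy model of an xml datum kept as UTF-8 internally (hypothetical)."""

    def __init__(self, data: bytes, server_encoding: str):
        # Input arrives in the server encoding; normalise to UTF-8 for storage.
        self._utf8 = data.decode(server_encoding).encode("utf-8")

    def output(self, server_encoding: str) -> bytes:
        # Convert back to the server encoding when the value leaves the type.
        return self._utf8.decode("utf-8").encode(server_encoding)


original = "<content>caf\u00e9</content>".encode("latin-1")
value = XmlValue(original, "latin-1")
restored = value.output("latin-1")
```

Because every character that arrived in the server encoding can be represented in UTF-8, the round trip is lossless in this model.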

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/


Re: xml type and encodings

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> Andrew Dunstan wrote:
>> Are we going to ensure that what we hand back to another client has
>> an appropriate encoding decl? Or will we just remove it in all cases?

> We can't do the former, but the latter might be doable.

I think that in the case of binary output, it'd be possible for xml_send
to include an encoding decl safely, because it could be sure that that's
where the data is going forthwith.  Not sure if that's worth anything
though.  The idea of text and binary output behaving differently on this
point doesn't seem all that attractive ...
        regards, tom lane


Re: xml type and encodings

From
"Florian G. Pflug"
Date:
Peter Eisentraut wrote:
> I wrote:
>> We need to decide on how to handle encoding information embedded in
>> xml data that is passed through the client/server encoding
>> conversion.
> 
> Tangentially related, I'm currently experimenting with a setup that 
> stores all xml data in UTF-8 on the server, converting it back to the 
> server encoding on output.  This doesn't do anything to solve the 
> problem above, but it makes the internal processing much simpler, since 
> all of libxml uses UTF-8 internally anyway.  Is anyone opposed to that 
> setup on principle?

If you do that, maybe it would be the easiest and least confusing thing
to just _always_ represent an xml document in utf-8, ignoring the client_encoding
entirely for xml. The only good reason for not using utf-8 that comes to
my mind is the increased storage size, especially for eastern scripts where
nearly all characters need 2 or more bytes. But if you store it in utf-8
internally anyway, then I don't think this argument carries a lot of weight
anymore...

You could warn the user about that fact whenever he sends or receives an
xml document, and the client_encoding is not set to utf-8.

Not that I'm entirely convinced about this being a good idea myself - but I
think I'd prefer a clear rule like that over surprises like "text and binary
output have different semantics" or "the encoding information is totally misleading
and must be ignored". And most software that uses xml probably uses utf-8...

greetings, Florian Pflug


Re: xml type and encodings

From
Martijn van Oosterhout
Date:
On Tue, Jan 16, 2007 at 06:41:56PM +0100, Florian G. Pflug wrote:
> If you do that, maybe it would be the easiest and least confusing thing
> to just _always_ represent an xml document in utf-8, ignoring the
> client_encoding
> entirely for xml.

You can't do that. The server needs to parse the incoming string before
it knows it's dealing with XML. The string from the client must be
interpreted in the client encoding...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.