Thread: BUG #4622: xpath only work in utf-8 server encoding

BUG #4622: xpath only work in utf-8 server encoding

From
"Sergey Burladyan"
Date:
The following bug has been logged online:

Bug reference:      4622
Logged by:          Sergey Burladyan
Email address:      eshkinkot@gmail.com
PostgreSQL version: 8.3.5
Operating system:   Debian testing
Description:        xpath only work in utf-8 server encoding
Details:

Hello, all!

As a test, I am trying to parse an XML string in a server encoding other
than UTF-8. It loads correctly, but xpath(text, xml) can't handle it:

seb@seb:~/tmp/pg$ echo $LANG
ru_RU.CP1251
seb@seb:~/tmp/pg$ /usr/lib/postgresql/8.3/bin/postgres -p 5433 -k s -s -D .
LOG:  система была отключена: 2009-01-22 16:30:07 MSK
LOG:  autovacuum launcher started
LOG:  database system is ready to accept connections

seb@seb:~$ echo $LANG
ru_RU.CP1251
seb@seb:~$ psql -h localhost -p 5433
Welcome to psql 8.3.5, the PostgreSQL interactive terminal.

Type:  \copyright for distribution terms
       \h for help with SQL commands
       \? for help with psql commands
       \g or terminate with semicolon to execute query
       \q to quit

seb=# select * from (select
xml('<русский>язык</русский>')) as x(v);
            v
-------------------------
 <русский>язык</русский>
(1 запись)

seb=# select xpath('/русский/text()', v::xml) from (select
xml('<русский>язык</русский>')) as x(v);
ERROR:  could not parse XML data
DETAIL:  Entity: line 1: parser error : Input is not proper UTF-8, indicate
encoding !
Bytes: 0xF0 0xF3 0xF1 0xF1
<x><русский>язык</русский></x>
    ^
seb=# select name, setting from pg_settings where name like 'lc_%' or name
like '%enco%';
      name       |   setting
-----------------+--------------
 client_encoding | WIN1251
 lc_collate      | ru_RU.CP1251
 lc_ctype        | ru_RU.CP1251
 lc_messages     | ru_RU.CP1251
 lc_monetary     | ru_RU.CP1251
 lc_numeric      | ru_RU.CP1251
 lc_time         | ru_RU.CP1251
 server_encoding | WIN1251
(8 rows)

In UTF-8 server encoding it works correctly:

seb=> select xpath('/русский/text()', v::xml) from (select
xml('<русский>язык</русский>')) as x(v);
 xpath
--------
 {язык}
(1 запись)

seb=> select name, setting from pg_settings where name like 'lc_%' or name
like '%enco%';
      name       |   setting
-----------------+-------------
 client_encoding | UTF8
 lc_collate      | ru_RU.UTF-8
 lc_ctype        | ru_RU.UTF-8
 lc_messages     | ru_RU.UTF-8
 lc_monetary     | ru_RU.UTF-8
 lc_numeric      | ru_RU.UTF-8
 lc_time         | ru_RU.UTF-8
 server_encoding | UTF8
(8 rows)

I think something is wrong here: the string is parsed correctly by xml(text),
but its result can't be passed to the xpath function...
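
The bytes named in the error message can be reproduced outside the server. The following is a minimal Python sketch, not PostgreSQL code: it assumes that Python's `cp1251` codec corresponds to the WIN1251 server encoding, and the helper name `is_valid_utf8` is made up for illustration. It shows that the WIN1251 form of the wrapped document starts with 0xF0 0xF3 0xF1 0xF1 after `<x><`, and that this byte sequence is not valid UTF-8.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (illustrative helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The wrapped document exactly as it appears in the error message.
text = "<x><русский>язык</русский></x>"

win1251_bytes = text.encode("cp1251")  # assumption: what a WIN1251 server passes on
utf8_bytes = text.encode("utf-8")      # the form a UTF-8-only parser expects

# The four bytes right after "<x><" are the ones listed in the error.
print(" ".join(f"0x{b:02X}" for b in win1251_bytes[4:8]))
print(is_valid_utf8(win1251_bytes), is_valid_utf8(utf8_bytes))
```

0xF0 is a valid lead byte for a four-byte UTF-8 sequence, but 0xF3 is not a continuation byte, so a parser that assumes UTF-8 rejects the input at that position.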

Re: BUG #4622: xpath only work in utf-8 server encoding

From
Peter Eisentraut
Date:
On Thursday 22 January 2009 15:39:00 Sergey Burladyan wrote:
> seb=# select xpath('/русский/text()', v::xml) from (select
> xml('<русский>язык</русский>')) as x(v);
> ERROR:  could not parse XML data
> DETAIL:  Entity: line 1: parser error : Input is not proper UTF-8, indicate
> encoding !
> Bytes: 0xF0 0xF3 0xF1 0xF1
> <x><русский>язык</русский></x>
>     ^

This raises the question: What are the rules about encoding the characters in
XPath expressions themselves?  I haven't found anything about that in the
standard.  Anyone know?

Re: BUG #4622: xpath only work in utf-8 server encoding

From
eshkinkot
Date:
On 23 January 2009 at 0:58, Peter Eisentraut <peter_e@gmx.net> wrote:
> On Thursday 22 January 2009 15:39:00 Sergey Burladyan wrote:
>> seb=# select xpath('/русский/text()', v::xml) from (select
>> xml('<русский>язык</русский>')) as x(v);
>> ERROR:  could not parse XML data
>> DETAIL:  Entity: line 1: parser error : Input is not proper UTF-8, indicate
>> encoding !
>> Bytes: 0xF0 0xF3 0xF1 0xF1
>> <x><русский>язык</русский></x>
>>     ^

> This raises the question: What are the rules about encoding the characters in
> XPath expressions themselves?  I haven't found anything about that in the
> standard.  Anyone know?

PostgreSQL does not use libxml2's internal encoding support and strips the
XML encoding declaration from the XML body, so I think there is no choice:
by default libxml2 needs the data in its internal encoding, UTF-8, anyway.

I am not sure about the XML standard, but maybe the libxml2 documentation
can help to solve this issue? See http://xmlsoft.org/encoding.html

"What does this mean in practice for the libxml2 user:
* xmlChar, the libxml2 data type is a byte, those bytes must be
assembled as UTF-8 valid strings. The proper way to terminate an
xmlChar * string is simply to append 0 byte, as usual.
* One just need to make sure that when using chars outside the ASCII
set, the values has been properly converted to UTF-8"

I understand this as: all xmlChar strings must be in UTF-8 encoding,
no matter what the encoding of the XML body is.
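
That reading can be sketched as follows. This is a hedged Python illustration of the idea only, not the attached patch and not libxml2 or PostgreSQL code: it uses Python's standard `xml.etree.ElementTree` parser as a stand-in, assumes `cp1251` is Python's name for WIN1251, and the function names `to_utf8` and `child_text` are hypothetical. Both the document and the path expression are converted to UTF-8 before parsing, and the result is converted back to the server encoding.

```python
import xml.etree.ElementTree as ET

SERVER_ENCODING = "cp1251"  # assumption: Python's codec name for WIN1251


def to_utf8(data: bytes, encoding: str = SERVER_ENCODING) -> bytes:
    """Re-encode bytes from the server encoding to UTF-8."""
    return data.decode(encoding).encode("utf-8")


def child_text(xml_data: bytes, child_name: bytes,
               encoding: str = SERVER_ENCODING) -> bytes:
    """Return the text of a child element; inputs and result use `encoding`."""
    # Convert BOTH the document and the path expression before parsing,
    # and wrap the document in <x>...</x> as seen in the error message.
    root = ET.fromstring(b"<x>" + to_utf8(xml_data, encoding) + b"</x>")
    node = root.find(to_utf8(child_name, encoding).decode("utf-8"))
    if node is None or node.text is None:
        return b""
    # Convert the result back to the server encoding for the caller.
    return node.text.encode(encoding)


doc = "<русский>язык</русский>".encode(SERVER_ENCODING)
name = "русский".encode(SERVER_ENCODING)
print(child_text(doc, name).decode(SERVER_ENCODING))  # prints "язык"
```

The point of the sketch is that the conversion has to cover every string handed to the parser library, including the path expression, not only the XML body.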

I tried to fix this issue for the xpath function; see the attached patch.

By the way, contrib/xml2 also has this issue...

Attachments