Обсуждение: inserts bypass encoding conversion
Hi,
With client_encoding=UTF8 and server_encoding=LATIN1, looks like insert into value with chr(codepoint) bypass encoding conversion , is it expected ? test as below ,
jamet=# delete from testutf8;
DELETE 1
jamet=# show client_encoding;
client_encoding
-----------------
UTF8
(1 row)
jamet=# show server_encoding;
server_encoding
-----------------
LATIN1
(1 row)
jamet=# \d testutf8
Table "public.testutf8"
Column | Type | Collation | Nullable | Default
--------+------------------------+-----------+----------+---------
test | character varying(128) | | |
jamet=# insert into testutf8 values('…');
ERROR: character with byte sequence 0xe2 0x80 0xa6 in encoding "UTF8" has no equivalent in encoding "LATIN1" <<< here it’s expected to see encoding conversion error
jamet=# insert into testutf8 values(chr(226)||chr(128)||chr(166)); <<< here, looks like using chr(codepoint) works, it bypass encoding_conversion ?
INSERT 0 1
jamet=# set client_encoding='LATIN1';
SET
jamet=# show client_encoding;
client_encoding
-----------------
LATIN1
(1 row)
jamet=# show server_encoding;
server_encoding
-----------------
LATIN1
(1 row)
jamet=# select * from testutf8;
test
------
…
(1 row)
jamet=# insert into testutf8 values('…'); <<< here, with client and server same LATIN1, no any encoding conversion , and the data got inserted.
INSERT 0 1
jamet=# select * from testutf8;
test
------
…
…
(2 rows)
jamet=# select encode(test::bytea,'hex') from testutf8; <<< both show same value
encode
--------
e280a6
e280a6
(2 rows)
"James Pang (chaolpan)" <chaolpan@cisco.com> writes: > With client_encoding=UTF8 and server_encoding=LATIN1, looks like insert into value with chr(codepoint) bypass encodingconversion , is it expected ? test as below , The chr() function is a server-side operation that has nothing to do with the client encoding. regards, tom lane
In this case, the real value stored in database is UTF8 byte sequence instead of LATIN1 encoding text, right? When Irun "select * from table" with client_encoding=LATIN1, automatically encoding conversion from UTF8 byte sequence to text? Thanks, James -----Original Message----- From: Tom Lane <tgl@sss.pgh.pa.us> Sent: Wednesday, August 16, 2023 10:03 PM To: James Pang (chaolpan) <chaolpan@cisco.com> Cc: pgsql-admin@lists.postgresql.org Subject: Re: inserts bypass encoding conversion "James Pang (chaolpan)" <chaolpan@cisco.com> writes: > With client_encoding=UTF8 and server_encoding=LATIN1, looks like > insert into value with chr(codepoint) bypass encoding conversion , is > it expected ? test as below , The chr() function is a server-side operation that has nothing to do with the client encoding. regards, tom lane
"James Pang (chaolpan)" <chaolpan@cisco.com> writes: > In this case, the real value stored in database is UTF8 byte sequence > instead of LATIN1 encoding text, right? Not if you have server_encoding = LATIN1, as you stated earlier. In that case, the data in the database is in LATIN1, and chr() interprets its argument as a LATIN1 code value --- which happens to look enough like a Unicode code point to be possibly confusing, until you try to use code points that aren't within LATIN1. regards, tom lane
So, insert into values(chr(226)||chr(128)||chr(166)) actually got stored in database with LATIN1 with single byte sequence,but when query select * from testutf8, it got converted to UTF8 three byte sequence first ? jamet=# select chr(226)||chr(128)||chr(166); ?column? ---------- ... (1 row) jamet=# select * from testutf8; test -------------------------------------------------------------------------------- ... jamet=# select encode(test::bytea,'hex') from testutf8; encode ------------------------------------------------------------------------------------------------------------------------------------------------------------- - e280a6 Thanks, James -----Original Message----- From: Tom Lane <tgl@sss.pgh.pa.us> Sent: Thursday, August 17, 2023 9:33 AM To: James Pang (chaolpan) <chaolpan@cisco.com> Cc: pgsql-admin@lists.postgresql.org Subject: Re: inserts bypass encoding conversion "James Pang (chaolpan)" <chaolpan@cisco.com> writes: > In this case, the real value stored in database is UTF8 byte sequence > instead of LATIN1 encoding text, right? Not if you have server_encoding = LATIN1, as you stated earlier. In that case, the data in the database is in LATIN1, and chr() interprets its argument as a LATIN1 code value --- which happensto look enough like a Unicode code point to be possibly confusing, until you try to use code points that aren't withinLATIN1. regards, tom lane
"James Pang (chaolpan)" <chaolpan@cisco.com> writes: > So, insert into values(chr(226)||chr(128)||chr(166)) actually got stored in database with LATIN1 with single byte sequence,but when query select * from testutf8, it got converted to UTF8 three byte sequence first ? There are no LATIN1 characters that have longer than 2-byte UTF8 representations, so no. I think your fundamental misunderstanding is supposing that this: chr(226)||chr(128)||chr(166) produces something equivalent to the UTF8 sequence 0xe2 0x80 0xa6. It will not, no matter which server encoding you are dealing with. It will produce something that is three separate characters according to the server encoding. In LATIN1, that could well be the byte sequence 0xe2 0x80 0xa6, but *that byte sequence does not mean the same thing that it would mean in UTF8 encoding*. You also seem not to grasp the fact that an encoding conversion will happen between your client and the server if client_encoding is different from server_encoding. Because of that, the output of a SELECT command doesn't prove much of anything here. regards, tom lane