Обсуждение: inserts bypass encoding conversion

Поиск

Список

Период

Сортировка

inserts bypass encoding conversion

От

"James Pang (chaolpan)"

Дата:

16 августа 2023 г., 07:06:51

Hi,

With client_encoding=UTF8 and server_encoding=LATIN1, looks like insert into value with chr(codepoint) bypass encoding conversion , is it expected ? test as below ,

jamet=# delete from testutf8;

DELETE 1

jamet=# show client_encoding;

client_encoding

-----------------

UTF8

(1 row)

jamet=# show server_encoding;

server_encoding

-----------------

LATIN1

(1 row)

jamet=# \d testutf8

Table "public.testutf8"

Column | Type | Collation | Nullable | Default

--------+------------------------+-----------+----------+---------

test | character varying(128) | | |

jamet=# insert into testutf8 values('…');

ERROR: character with byte sequence 0xe2 0x80 0xa6 in encoding "UTF8" has no equivalent in encoding "LATIN1" <<< here it’s expected to see encoding conversion error

jamet=# insert into testutf8 values(chr(226)||chr(128)||chr(166)); <<< here, looks like using chr(codepoint) works, it bypass encoding_conversion ?

INSERT 0 1

jamet=# set client_encoding='LATIN1';

SET

jamet=# show client_encoding;

client_encoding

-----------------

LATIN1

(1 row)

jamet=# show server_encoding;

server_encoding

-----------------

LATIN1

(1 row)

jamet=# select * from testutf8;

test

------

…

(1 row)

jamet=# insert into testutf8 values('…'); <<< here, with client and server same LATIN1, no any encoding conversion , and the data got inserted.

INSERT 0 1

jamet=# select * from testutf8;

test

------

…

(2 rows)

jamet=# select encode(test::bytea,'hex') from testutf8; <<< both show same value

encode

--------

e280a6

(2 rows)

Re: inserts bypass encoding conversion

От

Tom Lane

Дата:

16 августа 2023 г., 14:02:55

"James Pang (chaolpan)" <chaolpan@cisco.com> writes:
>    With client_encoding=UTF8 and server_encoding=LATIN1, looks like insert into value with chr(codepoint) bypass
encodingconversion , is it expected ?  test as below , 

The chr() function is a server-side operation that has nothing to do
with the client encoding.

            regards, tom lane

RE: inserts bypass encoding conversion

От

"James Pang (chaolpan)"

Дата:

17 августа 2023 г., 01:08:21

In this case, the real value stored in database is UTF8 byte sequence instead of LATIN1 encoding text, right?   When
Irun "select * from table" with client_encoding=LATIN1, automatically encoding conversion from UTF8 byte sequence to
text? 

Thanks,

James

-----Original Message-----
From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: Wednesday, August 16, 2023 10:03 PM
To: James Pang (chaolpan) <chaolpan@cisco.com>
Cc: pgsql-admin@lists.postgresql.org
Subject: Re: inserts bypass encoding conversion

"James Pang (chaolpan)" <chaolpan@cisco.com> writes:
>    With client_encoding=UTF8 and server_encoding=LATIN1, looks like
> insert into value with chr(codepoint) bypass encoding conversion , is
> it expected ?  test as below ,

The chr() function is a server-side operation that has nothing to do with the client encoding.

            regards, tom lane

Re: inserts bypass encoding conversion

От

Tom Lane

Дата:

17 августа 2023 г., 01:33:00

"James Pang (chaolpan)" <chaolpan@cisco.com> writes:
>   In this case, the real value stored in database is UTF8 byte sequence
>   instead of LATIN1 encoding text, right?

Not if you have server_encoding = LATIN1, as you stated earlier.
In that case, the data in the database is in LATIN1, and chr()
interprets its argument as a LATIN1 code value --- which happens
to look enough like a Unicode code point to be possibly confusing,
until you try to use code points that aren't within LATIN1.

            regards, tom lane

RE: inserts bypass encoding conversion

От

"James Pang (chaolpan)"

Дата:

17 августа 2023 г., 02:25:57

So,  insert into values(chr(226)||chr(128)||chr(166)) actually got stored in database with LATIN1 with single byte
sequence,but when query select * from testutf8, it got converted to UTF8 three byte sequence first ?  

jamet=# select chr(226)||chr(128)||chr(166);
 ?column?
----------
 ...
(1 row)

jamet=# select * from testutf8;
                                      test
--------------------------------------------------------------------------------
 ...

jamet=# select encode(test::bytea,'hex') from testutf8;
                                                                            encode

-------------------------------------------------------------------------------------------------------------------------------------------------------------
-
 e280a6

Thanks,

James

-----Original Message-----
From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: Thursday, August 17, 2023 9:33 AM
To: James Pang (chaolpan) <chaolpan@cisco.com>
Cc: pgsql-admin@lists.postgresql.org
Subject: Re: inserts bypass encoding conversion

"James Pang (chaolpan)" <chaolpan@cisco.com> writes:
>   In this case, the real value stored in database is UTF8 byte sequence
>   instead of LATIN1 encoding text, right?

Not if you have server_encoding = LATIN1, as you stated earlier.
In that case, the data in the database is in LATIN1, and chr() interprets its argument as a LATIN1 code value --- which
happensto look enough like a Unicode code point to be possibly confusing, until you try to use code points that aren't
withinLATIN1. 

            regards, tom lane

Re: inserts bypass encoding conversion

От

Tom Lane

Дата:

17 августа 2023 г., 02:40:32

"James Pang (chaolpan)" <chaolpan@cisco.com> writes:
> So,  insert into values(chr(226)||chr(128)||chr(166)) actually got stored in database with LATIN1 with single byte
sequence,but when query select * from testutf8, it got converted to UTF8 three byte sequence first ?  

There are no LATIN1 characters that have longer than 2-byte UTF8
representations, so no.

I think your fundamental misunderstanding is supposing that this:

    chr(226)||chr(128)||chr(166)

produces something equivalent to the UTF8 sequence 0xe2 0x80 0xa6.
It will not, no matter which server encoding you are dealing with.
It will produce something that is three separate characters
according to the server encoding.  In LATIN1, that could well be
the byte sequence 0xe2 0x80 0xa6, but *that byte sequence does not
mean the same thing that it would mean in UTF8 encoding*.

You also seem not to grasp the fact that an encoding conversion
will happen between your client and the server if client_encoding
is different from server_encoding.  Because of that, the output of
a SELECT command doesn't prove much of anything here.

            regards, tom lane

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: inserts bypass encoding conversion

inserts bypass encoding conversion

Re: inserts bypass encoding conversion

RE: inserts bypass encoding conversion

Re: inserts bypass encoding conversion

RE: inserts bypass encoding conversion

Re: inserts bypass encoding conversion