Re: ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1 possiblesolution
From | Guillaume Cottenceau |
---|---|
Subject | Re: ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1 possiblesolution |
Date | |
Msg-id | 873btccqgw.fsf@meuh.mnc.ch |
In reply to | Re: ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1 possiblesolution (Anders Hermansen <anders@yoyo.no>) |
Responses |
Re: ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1 possiblesolution
Re: ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1 possiblesolution |
List | pgsql-jdbc |
Anders Hermansen <anders 'at' yoyo.no> writes:

> * Guillaume Cottenceau (gc@mnc.ch) wrote:
> > Anders Hermansen <anders 'at' yoyo.no> writes:
> > > * Guillaume Cottenceau (gc@mnc.ch) wrote:
> > > > Isn't there a problem with your UTF-8 data containing 0x00EF?
> > >
> > > E0 to EF hex (224 to 239): first byte of a three-byte sequence.
> >
> > Well 00 is first byte here, isn't it?
>
> UTF-8 is a byte sequence, so it's not about the first byte in the whole
> sequence. But about the first byte in a three-byte sequence.

Yes. I forgot that you assumed a little-endian reading. So the
UTF-8 character here is probably first byte 0xEF, second byte 0x00?

I did my test with first byte 0x00 and second byte 0xEF, hence the
confusion with your initial comment. My reasoning was that if the
first byte of this two-byte sequence is 0x00, then the rule that
0xEF is the first byte of a three-byte sequence doesn't apply,
since 0xEF is the second byte in the sequence.

> There should be no nul (0) bytes when encoding UTF-8. I believe
> this is in the specification to allow it to be compatible with
> C nul-terminated strings.
>
> I believe that the byte sequence 0x00EF is illegal UTF-8 because:
> 1) It contains nul (0x00) byte
> 2) 0xEF is not followed by two more bytes
>
> On the other hand U+00EF is a valid Unicode code point. Which points to:

I think this is meant to be read big-endian, i.e. first byte 0x00
and second byte 0xEF (especially because UTF-8 is just a series of
bytes without any endianness aspects, so it makes good sense to
read this left to right, i.e. byte 0x00 first).

> LATIN SMALL LETTER I WITH DIAERESIS
> It is encoded as 0xC3AF in UTF-8
> As 0x00EF in UTF-16 (and UCS-2 ?)

Yes to "and UCS-2". Two-byte sequences in UCS-2 and UTF-16 are
the same[1].

> As 0xEF in ISO-8859-1

Hum, I think I may understand what's going on here.

It's possible that in the message:

    ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1

when they say "0x00ef" they are not talking about the UTF-8 bytes
per se, but are printing the Unicode code point (which is
error-prone).

Ref:
[1] UCS-2 covers only the characters that UTF-16 encodes as a
    single 2-byte unit; it has none of UTF-16's 4-byte
    (surrogate pair) sequences.

-- 
Guillaume Cottenceau
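
The byte values quoted in this message can be checked with a few lines of Java. This is only a minimal sketch and not part of the original exchange: the class name `CodePointEncodings` and its helpers `hex`, `encode` and `isValidUtf8` are made up for illustration, and it uses nothing beyond the standard `java.nio.charset` classes. It encodes U+00EF in the charsets mentioned above and runs a strict UTF-8 decoder over the two candidate byte sequences.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class CodePointEncodings {

    // Hex dump of a byte array, two upper-case digits per byte.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encode a single Unicode code point in the given charset.
    static String encode(int codePoint, Charset charset) {
        String s = new String(Character.toChars(codePoint));
        return hex(s.getBytes(charset));
    }

    // True if the bytes are well-formed UTF-8. A freshly created
    // decoder reports malformed input by throwing an exception.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        int cp = 0x00EF; // U+00EF LATIN SMALL LETTER I WITH DIAERESIS
        System.out.println("UTF-8:      " + encode(cp, StandardCharsets.UTF_8));
        System.out.println("UTF-16BE:   " + encode(cp, StandardCharsets.UTF_16BE));
        System.out.println("UTF-16LE:   " + encode(cp, StandardCharsets.UTF_16LE));
        System.out.println("ISO-8859-1: " + encode(cp, StandardCharsets.ISO_8859_1));

        // The raw bytes 0x00 0xEF read as UTF-8: 0xEF starts a
        // three-byte sequence that is cut short.
        System.out.println("00 EF valid UTF-8? "
                + isValidUtf8(new byte[] {0x00, (byte) 0xEF}));
        System.out.println("C3 AF valid UTF-8? "
                + isValidUtf8(new byte[] {(byte) 0xC3, (byte) 0xAF}));
    }
}
```

If the "0x00ef" in the error message is indeed the code point rather than raw bytes, the UTF-8 data involved would be the two bytes C3 AF, and the bytes 00 EF never need to appear on the wire at all.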