Re: ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1 possible solution
From | Anders Hermansen |
---|---|
Subject | Re: ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1 possible solution |
Date | |
Msg-id | 20050427140534.GC582@online.no |
In reply to | Re: ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1 possible solution (Guillaume Cottenceau <gc@mnc.ch>) |
List | pgsql-jdbc |
* Guillaume Cottenceau (gc@mnc.ch) wrote:
> Anders Hermansen <anders 'at' yoyo.no> writes:
> > UTF-8 is a byte sequence, so it's not about the first byte in the whole
> > sequence. But about the first byte in a three-byte sequence.
>
> Yes. I forgot that you assumed the machine was big-endian. So the
> UTF-8 character is here probably first byte 0xEF, second byte
> 0x00?
>
> I did my test with first byte 0x00 and second byte 0xEF, hence
> confusion with your initial comment.
>
> My reasoning was that if the first byte of this two-byte
> sequence is 0x00 then the rule that 0xEF is first byte of a
> three-byte sequence doesn't apply, since 0xEF is second byte in
> the sequence.

Endianness is not a problem when working with a sequence of 8-bit bytes, as in UTF-8. It only becomes a problem when more than one byte represents a single value. So it is an issue in UTF-16, which I think is big-endian by default.

So I interpreted the message "ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1" as a byte sequence with 0x00 first, and then 0xef. Maybe that's a wrong assumption.

> > There should be no nul (0) bytes when encoding UTF-8. I believe
> > this is in the specification to allow it to be compatible with
> > C nul-terminated strings.
> >
> > I believe that the byte sequence 0x00EF is illegal UTF-8 because:
> > 1) It contains nul (0x00) byte
> > 2) 0xEF is not followed by two more bytes
> >
> > On the other hand U+00EF is a valid unicode code point. Which points to:
>
> I think this is assumed little-endian, e.g. first byte 0x00 and
> second byte 0xEF (especially because UTF-8 is just a series of
> bytes without any endianness aspects, so it makes good sense to
> actually read this left-to-right, e.g. byte 0x00 first).

As I said above, endianness is not an issue for UTF-8. The byte _sequence_ is always read from start to end.

> > LATIN SMALL LETTER I WITH DIAERESIS
> > It is encoded as 0xC3AF in UTF-8
> > As 0x00EF in UTF-16 (and UCS-2 ?)
>
> Yes to "and UCS-2". Two-byte sequences in UCS-2 and UTF-16 are
> the same[1].

Yes.

> > As 0xEF in ISO-8859-1
>
> Hum I think I may understand what's going on here. It's possible
> that in the message:
>
> ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1
>
> when they say "0x00ef" they don't talk about UTF-8 per se but
> they use the unicode representation (which is error prone).

If 0x00ef refers to a Unicode code point, it should not have been a problem to convert it to ISO-8859-1 (0xef).

If 0x00ef refers to a byte sequence, then the error message is a bit misleading, because it is not a character but a byte sequence. And the error is in decoding the UTF-8, not in encoding the ISO-8859-1.

Anders Hermansen
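The two readings of "0x00ef" discussed above can be checked with a small Java sketch. It is only an illustration using the standard `java.nio.charset` classes, not anything taken from the PostgreSQL server or the JDBC driver; the class name `Utf8Demo` and the helpers `hex` and `decodeStrict` are made up for this example. It encodes the code point U+00EF in the three encodings mentioned, then tries to decode the byte sequence 0x00 0xEF as strict UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of PostgreSQL or the JDBC driver.
class Utf8Demo {

    // Format a byte array as upper-case hex, e.g. {0xC3, 0xAF} -> "C3AF".
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Decode bytes as UTF-8 and throw on malformed input instead of
    // silently substituting U+FFFD (which new String(bytes, UTF_8) does).
    public static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        // Reading 1: "0x00ef" is the code point U+00EF
        // (LATIN SMALL LETTER I WITH DIAERESIS).
        String s = "\u00EF";
        System.out.println("UTF-8:      " + hex(s.getBytes(StandardCharsets.UTF_8)));      // C3AF
        System.out.println("UTF-16BE:   " + hex(s.getBytes(StandardCharsets.UTF_16BE)));   // 00EF
        System.out.println("ISO-8859-1: " + hex(s.getBytes(StandardCharsets.ISO_8859_1))); // EF

        // Reading 2: "0x00ef" is the raw byte sequence 0x00, 0xEF.
        byte[] suspect = {0x00, (byte) 0xEF};
        try {
            decodeStrict(suspect);
            System.out.println("bytes 00 EF decoded as UTF-8");
        } catch (CharacterCodingException e) {
            // 0xEF starts a three-byte sequence but no continuation bytes follow.
            System.out.println("bytes 00 EF are not valid UTF-8: " + e);
        }
    }
}
```

In this sketch the decoding failure comes from the dangling 0xEF lead byte; Java's own UTF-8 decoder accepts a lone 0x00 byte as U+0000, so point 1) in the quoted list describes a restriction of the consumer rather than something this decoder enforces.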