Discussion: Bug with UTF-8 character
Good morning,

I got a bug request for the following Unicode character in PostgreSQL 8.1.4: 0xedaeb8

ERROR: invalid byte sequence for encoding "UTF8": 0xedaeb8

This one seemed to work properly in PostgreSQL 8.0.3.

I think the following code in PostgreSQL 8.1.4 has a bug in it.

File: postgresql-8.1.4/src/backend/utils/mb/wchar.c

The entry values to the function are:

source = ed ae b8 20 20 20 20 20 20 20 20 20 20 20 20
length = 3 (length is the length of the current UTF-8 character)

But the code does a check where the second character should not be greater than 0x9F when the first character is 0xED. This is not according to the UTF-8 standard in RFC 3629. I believe that is not a valid test. This test fails on our string when it shouldn't.

I believe this is a bug; could you please confirm, or let me know what I am doing wrong?

Many thanks,

Hans

--
Cybertec Geschwinde & Schönig GmbH
Schöngrabern 134; A-2020 Hollabrunn
Tel: +43/1/205 10 35 / 340
www.postgresql.at, www.cybertec.at
On Fri, May 26, 2006 at 08:21:56AM +0200, Hans-Jürgen Schönig wrote:
> Good morning,
>
> I got a bug request for the following Unicode character in PostgreSQL
> 8.1.4: 0xedaeb8
>
> ERROR: invalid byte sequence for encoding "UTF8": 0xedaeb8
>
> This one seemed to work properly in PostgreSQL 8.0.3.
>
> I think the following code in PostgreSQL 8.1.4 has a bug in it.
>
> File: postgresql-8.1.4/src/backend/utils/mb/wchar.c

Your character converts to char DBB8. According to the standard, characters in the range D800-DFFF are not characters but surrogates. They don't mean anything by themselves and are thus rejected by postgres.

http://www.unicode.org/faq/utf_bom.html#30

This character is a high surrogate (D800-DBFF) and should be followed by a low surrogate (DC00-DFFF). You should combine the two into a single 4-byte UTF-8 character.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
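As a rough illustration of the conversion described in the message above, the following Python sketch applies the plain 3-byte UTF-8 bit layout to the bytes ed ae b8 and checks whether the result falls in the surrogate range. The function names are made up for this example and are not taken from PostgreSQL's wchar.c.

```python
def decode_utf8_3byte(data: bytes) -> int:
    """Apply the 3-byte UTF-8 bit layout (1110xxxx 10xxxxxx 10xxxxxx).

    Only the bit shape is checked here, not RFC 3629 validity; the
    point is to see which code point the bytes would map to.
    """
    if len(data) != 3:
        raise ValueError("expected exactly three bytes")
    b0, b1, b2 = data
    if b0 & 0xF0 != 0xE0 or b1 & 0xC0 != 0x80 or b2 & 0xC0 != 0x80:
        raise ValueError("not shaped like a 3-byte UTF-8 sequence")
    # 4 payload bits from the lead byte, 6 from each continuation byte.
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def is_surrogate(code_point: int) -> bool:
    """True for U+D800..U+DFFF, the range reserved for UTF-16 surrogates."""
    return 0xD800 <= code_point <= 0xDFFF


cp = decode_utf8_3byte(bytes([0xED, 0xAE, 0xB8]))
print(f"U+{cp:04X}", is_surrogate(cp))  # prints: U+DBB8 True
```

The payload bits are 1101, 101110 and 111000, which concatenate to 0xDBB8, matching the value given in the reply.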
On 5/26/06, Martijn van Oosterhout <kleptog@svana.org> wrote:
> On Fri, May 26, 2006 at 08:21:56AM +0200, Hans-Jürgen Schönig wrote:
> > I got a bug request for the following Unicode character in PostgreSQL
> > 8.1.4: 0xedaeb8
> >
> > ERROR: invalid byte sequence for encoding "UTF8": 0xedaeb8
>
> Your character converts to char DBB8. According to the standard,
> characters in the range D800-DFFF are not characters but surrogates.
> They don't mean anything by themselves and are thus rejected by
> postgres.
>
> http://www.unicode.org/faq/utf_bom.html#30
>
> This character is a high surrogate (D800-DBFF) and should be followed
> by a low surrogate (DC00-DFFF). You should combine the two into a
> single 4-byte UTF-8 character.

You are talking about UTF16, not UTF8.

--
marko
Hans-Jürgen Schönig <postgres@cybertec.at> writes:
> But the code does a check where the second character should not be
> greater than 0x9F, when first character is 0xED. This is not according
> to UTF-8 standard in RFC 3629.

Better read the RFC again: it says

   UTF8-3      = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
                 %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
                 ------------

The reason for the prohibition is explained as

   The definition of UTF-8 prohibits encoding character numbers
   between U+D800 and U+DFFF, which are reserved for use with the
   UTF-16 encoding form (as surrogate pairs) and do not directly
   represent characters.

I don't know anything about "surrogate pairs", but I am not about to decide that we know more about this than the RFC authors do. If they say it's invalid, it's invalid.

			regards, tom lane
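The UTF8-3 rule quoted above can be read almost directly as code. The following is a minimal Python sketch of that one grammar rule, assuming a hypothetical helper name `is_valid_utf8_3`. It is not the check from PostgreSQL's wchar.c, only a transcription of the ABNF for three-byte sequences.

```python
def is_valid_utf8_3(data: bytes) -> bool:
    """Check a 3-byte sequence against the UTF8-3 rule of RFC 3629."""
    if len(data) != 3:
        return False
    b0, b1, b2 = data
    if not 0x80 <= b2 <= 0xBF:            # last byte is always a UTF8-tail
        return False
    if b0 == 0xE0:
        return 0xA0 <= b1 <= 0xBF         # %xE0 %xA0-BF: rejects overlong forms
    if 0xE1 <= b0 <= 0xEC or 0xEE <= b0 <= 0xEF:
        return 0x80 <= b1 <= 0xBF         # %xE1-EC / %xEE-EF: any tail byte
    if b0 == 0xED:
        return 0x80 <= b1 <= 0x9F         # %xED %x80-9F: rejects U+D800..U+DFFF
    return False                          # not a 3-byte lead byte


print(is_valid_utf8_3(b"\xed\xae\xb8"))   # prints: False
print(is_valid_utf8_3(b"\xed\x9f\xbf"))   # prints: True  (U+D7FF)
```

With a lead byte of 0xED, a second byte above 0x9F would set the bits that push the code point to U+D800 or higher, which is why the rule caps it there.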
On Fri, May 26, 2006 at 05:16:59PM +0300, Marko Kreen wrote:
> On 5/26/06, Martijn van Oosterhout <kleptog@svana.org> wrote:
> >On Fri, May 26, 2006 at 08:21:56AM +0200, Hans-Jürgen Schönig wrote:
> >> I got a bug request for the following Unicode character in PostgreSQL
> >> 8.1.4: 0xedaeb8
> >>
> >> ERROR: invalid byte sequence for encoding "UTF8": 0xedaeb8
> >
> >Your character converts to char DBB8. According to the standard,
> >characters in the range D800-DFFF are not characters but surrogates.
> >They don't mean anything by themselves and are thus rejected by
> >postgres.
> >
> >http://www.unicode.org/faq/utf_bom.html#30
> >
> >This character is a high surrogate (D800-DBFF) and should be followed
> >by a low surrogate (DC00-DFFF). You should combine the two into a
> >single 4-byte UTF-8 character.
>
> You are talking about UTF16, not UTF8.

UTF-8 and UTF-16 use the same character set as their base; only the encoding is different. As that page says, to convert a surrogate pair in UTF-16 (D800 DC00) to UTF-8, you have to combine the two halves into a single 4-byte UTF-8 character. The direct encoding of D800 into UTF-8 is invalid because no such character exists.

The OP apparently has some broken UTF-16 to UTF-8 conversion software and thus produced invalid UTF-8, which postgres is rejecting. Given he didn't post the other half of the surrogate, we don't actually know what character he's trying to represent, so we can't help him with the encoding.

However, supplementary characters (which require surrogates in UTF-16) are all in the range 0x10000 to 0x10FFFF. If you don't believe me, check the Unicode database yourself (warning large: 944KB).

http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

DBB8 is a private use surrogate; maybe he should be using something in the range E000-F8FF, which are normal private use characters.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
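The combining step mentioned above can be sketched as follows in Python, using the pair D800 DC00 from the message. The helper name `combine_surrogates` is an assumption for this example. The original poster's second surrogate is unknown, so his actual character is not reconstructed here.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into one supplementary code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("first code unit is not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("second code unit is not a low surrogate")
    # Each half contributes 10 bits; the result starts at U+10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


cp = combine_surrogates(0xD800, 0xDC00)
encoded = chr(cp).encode("utf-8")         # one 4-byte UTF-8 sequence
print(f"U+{cp:X}", encoded.hex())         # prints: U+10000 f0908080
```

A converter that instead encodes each surrogate separately produces two 3-byte sequences starting with 0xED, which is the kind of input rejected in this thread.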