Re: BUG #19354: JOHAB rejects valid byte sequences
| From | Robert Haas |
|---|---|
| Subject | Re: BUG #19354: JOHAB rejects valid byte sequences |
| Date | |
| Msg-id | CA+TgmoZaoc37ohnhF5inoPxWzfoznV483xQw8Fmw+ELFScv47g@mail.gmail.com |
| In reply to | Re: BUG #19354: JOHAB rejects valid byte sequences (Jeroen Vermeulen <jtvjtv@gmail.com>) |
| Responses | Re: BUG #19354: JOHAB rejects valid byte sequences |
| List | pgsql-bugs |
On Tue, Dec 16, 2025 at 2:42 AM Jeroen Vermeulen <jtvjtv@gmail.com> wrote:
> My one worry is perhaps Johab is on the list because one important user needed it.
>
> But even then that requirement may have gone away?

Well, that was over 20 years ago. There's a very good chance that even if somebody was using JOHAB back then, they're not still using it now.

What's mystifying to me is that, presumably, somebody had a reason at the time for thinking that this was correct. I know that our quality standards were a whole lot looser back then, but I still don't quite understand why someone would have spent time and effort writing code based on a purely fictitious encoding scheme.

So I went looking for where we got the mapping tables from. UCS_to_JOHAB.pl expects to read from a file JOHAB.TXT, of which the latest version seems to be found here:

https://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/JOHAB.TXT

And indeed, if I run UCS_to_JOHAB.pl on that JOHAB.TXT file, it regenerates the current mapping files. Playing with it a bit:

rhaas=# select convert_from(e'\\x8a5c'::bytea, 'johab');
ERROR:  invalid byte sequence for encoding "JOHAB": 0x8a 0x5c
rhaas=# select convert_from(e'\\x8444'::bytea, 'johab');
ERROR:  invalid byte sequence for encoding "JOHAB": 0x84 0x44
rhaas=# select convert_from(e'\\x89ef'::bytea, 'johab');
 convert_from
--------------
 괦
(1 row)

So, \x8a5c is the original example, which does appear in JOHAB.TXT, and \x8444 is the first multi-byte character in that file, and both of them fail. But 89ef, which also appears in that file, doesn't fail, and from what I can tell the mapping is correct. So apparently we've got the "right" mappings, but you can only actually use the ones that match the code's rules for something to be a valid multi-byte character, which aren't actually in sync with the mapping table.

I'm left with the conclusions that (1) nobody ever actually tried using this encoding for anything real until 3 days ago and (2) we don't have any testing infrastructure that verifies that the characters in the mapping tables are actually accepted by pg_verifymbstr(). I wonder how many other encodings we have that don't actually work?

--
Robert Haas
EDB: http://www.enterprisedb.com