Discussion: BUG #19354: JOHAB rejects valid byte sequences
The following bug has been logged on the website:
Bug reference: 19354
Logged by: Jeroen Vermeulen
Email address: jtvjtv@gmail.com
PostgreSQL version: 18.1
Operating system: Debian unstable x86-64, macOS, Windows, etc.
Description:
Calling libpq, connecting to a UTF8 database and successfully setting client
encoding to JOHAB, this statement:
PQexec(connection, "SELECT '\x8a\x5c'");
Returned an empty result with this error message:
ERROR: invalid byte sequence for encoding "JOHAB": 0x8a 0x5c
AFAICT, 0x8a 0x5c is a valid JOHAB sequence making up Hangul character "굎".
Easily verified in Python:
print(b'\x8a\x5c'.decode('johab'))
It's the same story for some other valid sequences I tried, including this
character's "neighbours" 0x8a 0x5b and 0x8a 0x5d.
My test code did work with similar two-byte characters in BIG5, GB18030,
UTF-8, SJIS, and UHC. It just breaks with these JOHAB characters on all of
these x86-64 docker images: "archlinux", "debian", "debian:unstable",
"fedora", and "ubuntu". And I got the same results on macOS+homebrew,
Windows+MinGW with pacman-installed postgres, and a native Windows VM with
whatever-postgres-they-preinstall.
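For reference, a minimal standalone version of the reproduction described
above, sketched in C against libpq. The connection string and the error
handling are illustrative assumptions and not part of the report; only the
PQexec() call is taken from it.

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    /* Placeholder connection string; any UTF8 database should do. */
    PGconn     *conn = PQconnectdb("dbname=postgres");
    PGresult   *res;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    /* Set the client encoding to JOHAB; PQsetClientEncoding() returns 0 on success. */
    if (PQsetClientEncoding(conn, "JOHAB") != 0)
    {
        fprintf(stderr, "could not set client_encoding: %s", PQerrorMessage(conn));
        return 1;
    }

    /* 0x8a 0x5c should be Hangul "굎" in JOHAB, but the server rejects it. */
    res = PQexec(conn, "SELECT '\x8a\x5c'");
    if (PQresultStatus(res) != PGRES_TUPLES_OK)
        fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
    else
        printf("result: %s\n", PQgetvalue(res, 0, 0));

    PQclear(res);
    PQfinish(conn);
    return 0;
}

Building it is just a matter of linking against libpq, for example
cc repro.c -lpq.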
On Sat, Dec 13, 2025 at 2:12 PM PG Bug reporting form
<noreply@postgresql.org> wrote:
> Calling libpq, connecting to a UTF8 database and successfully setting client
> encoding to JOHAB, this statement:
>
> PQexec(connection, "SELECT '\x8a\x5c'");
>
> Returned an empty result with this error message:
>
> ERROR: invalid byte sequence for encoding "JOHAB": 0x8a 0x5c
>
> AFAICT, 0x8a 0x5c is a valid JOHAB sequence making up Hangul character "굎".
> Easily verified in Python:
>
> print(b'\x8a\x5c'.decode('johab'))
>
> It's the same story for some other valid sequences I tried, including this
> character's "neighbours" 0x8a 0x5b and 0x8a 0x5d.
My reading of pg_johab_verifystr() is that it accepts any character
without the high bit set as a single-byte character. Otherwise, it
calls pg_johab_mblen() to determine the length of the character, and
that in turn calls pg_euc_mblen(), which returns 3 if the first byte
is 0x8f and otherwise 2. Whatever the answer, it then wants each byte
to pass IS_EUC_RANGE_VALID() which allows for bytes from 0xa1 to 0xfe.
Your byte string doesn't match that rule, so it makes sense that it
fails.
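A standalone paraphrase of that rule may make it easier to see why the
reported sequence is rejected. This is only a reading of the description
above, not the verbatim code from src/common/wchar.c; the macro names are
modeled on the real ones.

#include <stdbool.h>

#define IS_HIGHBIT_SET(c)       (((unsigned char) (c)) & 0x80)
#define IS_EUC_RANGE_VALID(c)   ((c) >= 0xa1 && (c) <= 0xfe)

/* EUC-style length rule: 3 bytes if the lead byte is 0x8f, otherwise 2. */
static int
johab_mblen_as_described(const unsigned char *s)
{
    if (!IS_HIGHBIT_SET(*s))
        return 1;
    return (*s == 0x8f) ? 3 : 2;
}

/* Validity rule: every byte after the lead byte must be in 0xa1..0xfe. */
static bool
johab_char_valid_as_described(const unsigned char *s)
{
    int         len = johab_mblen_as_described(s);

    for (int i = 1; i < len; i++)
    {
        if (!IS_EUC_RANGE_VALID(s[i]))
            return false;       /* 0x8a 0x5c fails here: 0x5c < 0xa1 */
    }
    return true;
}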
What confuses me is that
https://en.wikipedia.org/wiki/KS_X_1001#Johab_encoding seems to say
that the encoding is always a 2-byte encoding and that any 2-byte
sequence with the high bit set on the first character is a valid
character. So the rules we're implementing don't seem to match that at
all. But unfortunately the intent behind the current code is not
clear. It was introduced by Bruce in 2002 in commit
a8bd7e1c6e026678019b2f25cffc0a94ce62b24b, but I don't see comments
there or elsewhere explaining what the thought was behind the way the
code works, so I don't know if this is some weird variant of JOHAB
that intentionally works differently or if this was just never
correct.
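For contrast, here is the rule the Wikipedia page appears to describe, again
only as a sketch of that reading rather than a statement of what the real
encoding (or PostgreSQL) should accept.

/* Two bytes per Hangul character; only the lead byte needs its high bit set. */
static int
johab_mblen_per_wikipedia(const unsigned char *s)
{
    /* Under this reading, 0x8a 0x5c is one valid two-byte character. */
    return (*s & 0x80) ? 2 : 1;
}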
--
Robert Haas
EDB: http://www.enterprisedb.com
Jeroen Vermeulen <jtvjtv@gmail.com> writes:
> This bit worries me: "Other, vendor-defined, Johab variants also exist" —
> such as an EBCDIC-based one and a stateful one!
Yeah. So what we have here is:
1. Our JOHAB implementation has apparently been wrong since day one.
2. Wrongness may be in the eye of the beholder, since there are
multiple versions of JOHAB.
3. Your complaint is the first, AFAIR.
4. That wikipedia page says "Following the introduction of Unified
Hangul Code by Microsoft in Windows 95, and Hangul Word Processor
abandoning Johab in favour of Unicode in 2000, Johab ceased to be
commonly used."
Given these things, I wonder if we shouldn't desupport JOHAB
rather than attempt to fix it. Fixing would likely be a significant
amount of work: if we don't even have the character lengths right,
how likely is it that our conversions to other character sets are
correct? I also worry that if different PG versions have different
ideas of the mapping, there could be room for dump/reload problems,
and maybe even security problems related to the backslash issue.
regards, tom lane
Thanks all. That analysis makes a lot of sense.
Given the lack of a clear spec, the existence of multiple JOHAB variants, and
how long this has apparently been "working" without anyone noticing, IMHO
desupporting it does seem like the least risky option. At this point, trying
to fix the JOHAB variants feels like opening a pretty big can of worms,
especially with the potential for dump/reload surprises or subtle
parsing/security issues.
I don't have additional data to add, but +1 on removal or deprecation being a
reasonable outcome here, given how obscure and effectively dead the encoding
is nowadays.
Thanks for digging into this.
Cheers,
Vasuki M
On Tue, Dec 16, 2025 at 2:42 AM Jeroen Vermeulen <jtvjtv@gmail.com> wrote:
> My one worry is perhaps Johab is on the list because one important
> user needed it.
>
> But even then that requirement may have gone away?

Well, that was over 20 years ago. There's a very good chance that even
if somebody was using JOHAB back then, they're not still using it now.

What's mystifying to me is that, presumably, somebody had a reason at
the time for thinking that this was correct. I know that our quality
standards were a whole lot looser back then, but I still don't quite
understand why someone would have spent time and effort writing code
based on a purely fictitious encoding scheme. So I went looking for
where we got the mapping tables from. UCS_to_JOHAB.pl expects to read
from a file JOHAB.TXT, of which the latest version seems to be found
here:

https://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/JOHAB.TXT

And indeed, if I run UCS_to_JOHAB.pl on that JOHAB.TXT file, it
regenerates the current mapping files. Playing with it a bit:

rhaas=# select convert_from(e'\\x8a5c'::bytea, 'johab');
ERROR: invalid byte sequence for encoding "JOHAB": 0x8a 0x5c
rhaas=# select convert_from(e'\\x8444'::bytea, 'johab');
ERROR: invalid byte sequence for encoding "JOHAB": 0x84 0x44
rhaas=# select convert_from(e'\\x89ef'::bytea, 'johab');
 convert_from
--------------
 괦
(1 row)

So, \x8a5c is the original example, which does appear in JOHAB.TXT, and
\x8444 is the first multi-byte character in that file, and both of them
fail. But 89ef, which also appears in that file, doesn't fail, and from
what I can tell the mapping is correct. So apparently we've got the
"right" mappings, but you can only actually use the ones that match the
code's rules for something to be a valid multi-byte character, which
aren't actually in sync with the mapping table.

I'm left with the conclusions that (1) nobody ever actually tried using
this encoding for anything real until 3 days ago and (2) we don't have
any testing infrastructure that verifies that the characters in the
mapping tables are actually accepted by pg_verifymbstr(). I wonder how
many other encodings we have that don't actually work?

--
Robert Haas
EDB: http://www.enterprisedb.com
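The missing test described in point (2) could look roughly like the
following: walk the unicode.org JOHAB.TXT mapping file (two hex columns per
line, JOHAB code then Unicode code point) and ask the server to decode every
mapped multi-byte code with convert_from(), counting rejections. The file
path and connection string are placeholders; this is only a sketch of the
idea, not existing test infrastructure.

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn     *conn = PQconnectdb("dbname=postgres");
    FILE       *f = fopen("JOHAB.TXT", "r");
    char        line[256];
    unsigned int code,
                ucs;
    long        tested = 0,
                rejected = 0;

    if (PQstatus(conn) != CONNECTION_OK || f == NULL)
        return 1;

    while (fgets(line, sizeof(line), f))
    {
        char        sql[128];
        PGresult   *res;

        /* Skip comment lines and any single-byte codes. */
        if (sscanf(line, "0x%x 0x%x", &code, &ucs) != 2 || code <= 0xFF)
            continue;

        /* Same convert_from() probe as in the psql session above. */
        snprintf(sql, sizeof(sql),
                 "SELECT convert_from(e'\\\\x%04x'::bytea, 'johab')", code);
        res = PQexec(conn, sql);
        tested++;
        if (PQresultStatus(res) != PGRES_TUPLES_OK)
            rejected++;
        PQclear(res);
    }
    printf("%ld of %ld mapped codes rejected\n", rejected, tested);

    fclose(f);
    PQfinish(conn);
    return 0;
}

Pointing the same loop at the other mapping files (with the corresponding
encoding names) would be one way to answer the broader question at the end.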
Robert Haas <robertmhaas@gmail.com> writes:
> ... So I went looking for where we got the mapping tables from.
> UCS_to_JOHAB.pl expects to read from a file JOHAB.TXT, of which the
> latest version seems to be found here:
> https://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/JOHAB.TXT
> And indeed, if I run UCS_to_JOHAB.pl on that JOHAB.TXT file, it
> regenerates the current mapping files.

Thanks for doing that research!

> So apparently we've got the "right" mappings, but you can only
> actually use the ones that match the code's rules for something to be
> a valid multi-byte character, which aren't actually in sync with the
> mapping table.

Yeah. Looking at the code in wchar.c, it's clear that it thinks that
JOHAB has the same character-length rules as EUC_KR, which is something
that one might guess based on available documentation that says it's
related to that encoding. So I can see how we got here.

However, that doesn't mean we can fix pg_johab_mblen() and we're done.
I'm still quite afraid that we'd be introducing security-grade
inconsistencies of interpretation between different PG versions.

> I'm left with the conclusions that (1) nobody ever actually tried
> using this encoding for anything real until 3 days ago and (2) we
> don't have any testing infrastructure that verifies that the
> characters in the mapping tables are actually accepted by
> pg_verifymbstr(). I wonder how many other encodings we have that don't
> actually work?

Indeed. Anyone want to do some testing?

regards, tom lane
On Tue, Dec 16, 2025 at 10:41 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> However, that doesn't mean we can fix pg_johab_mblen() and we're done.
> I'm still quite afraid that we'd be introducing security-grade
> inconsistencies of interpretation between different PG versions.

I understand that fear, but I do not have an opinion either way on
whether there would be an actual vulnerability.

I think there is a good chance that the right going-forward fix is to
deprecate the encoding, because according to
https://www.unicode.org/Public/MAPPINGS/EASTASIA/ReadMe.txt this and
everything else that's now under
https://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/ were
deprecated in 2001. By the time v19 is released, the deprecation will
be a quarter-century old, and the fact that it doesn't work is good
evidence that few people will miss it, though perhaps the original
poster will want to put forward an argument for why we should still
care about this.

What to do in the back branches is a more difficult question. Since
this is a client-only encoding, there's no issue of what is already
stored in the database, and we would not be proposing to change any of
the mappings, just allow the ones that don't currently work to do so. I
*think* that fixing pg_johab_mblen() would be "forward compatible": the
subset of the encoding that already works would continue to behave in
the same way, and the rest of it would begin working as well. And, I
don't really like throwing up our hands and deciding that
already-released features are free to continue not working. That's what
bug-fix releases are for. On the other hand, fixing this bug, which
apparently affects very few users, and in the process creating a
scarier, CVE-worthy bug, would not win us many friends, especially in
view of the apparently-low uptake of this encoding.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Dec 16, 2025 at 10:41:46AM -0500, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I'm left with the conclusions that (1) nobody ever actually tried
>> using this encoding for anything real until 3 days ago and (2) we
>> don't have any testing infrastructure that verifies that the
>> characters in the mapping tables are actually accepted by
>> pg_verifymbstr(). I wonder how many other encodings we have that
>> don't actually work?
>
> Indeed. Anyone want to do some testing?

FWIW, I was made aware a couple of weeks ago by a colleague that SJIS
and SHIFT_JIS_2004 are used by some customers, and that we are many
years behind on updating the conversion mappings in the tree, with
Postgres not understanding some of the characters. These two are
marginal in the mostly-UTF8 world we live in these days, but it's
annoying for byte sequences that should not change across the years and
should only need to be refreshed with new data.
--
Michael