Discussion: BUG #19354: JOHAB rejects valid byte sequences

BUG #19354: JOHAB rejects valid byte sequences

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      19354
Logged by:          Jeroen Vermeulen
Email address:      jtvjtv@gmail.com
PostgreSQL version: 18.1
Operating system:   Debian unstable x86-64, macOS, Windows, etc.
Description:

Calling libpq, connecting to a UTF8 database and successfully setting client
encoding to JOHAB, this statement:

    PQexec(connection, "SELECT '\x8a\x5c'");

Returned an empty result with this error message:

    ERROR:  invalid byte sequence for encoding "JOHAB": 0x8a 0x5c

AFAICT, 0x8a 0x5c is a valid JOHAB sequence making up Hangul character "굎".
Easily verified in Python:

    print(b'\x8a\x5c'.decode('johab'))

It's the same story for some other valid sequences I tried, including this
character's "neighbours" 0x8a 0x5b and 0x8a 0x5d.

My test code did work with similar two-byte characters in BIG5, GB18030,
UTF-8, SJIS, and UHC.  It just breaks with these JOHAB characters on all of
these x86-64 docker images: "archlinux", "debian", "debian:unstable",
"fedora", and "ubuntu".  And I got the same results on macOS+homebrew,
Windows+MinGW with pacman-installed postgres, and a native Windows VM with
whatever-postgres-they-preinstall.


Re: BUG #19354: JOHAB rejects valid byte sequences

From
Robert Haas
Date:
On Sat, Dec 13, 2025 at 2:12 PM PG Bug reporting form
<noreply@postgresql.org> wrote:
> Calling libpq, connecting to a UTF8 database and successfully setting client
> encoding to JOHAB, this statement:
>
>     PQexec(connection, "SELECT '\x8a\x5c'");
>
> Returned an empty result with this error message:
>
>     ERROR:  invalid byte sequence for encoding "JOHAB": 0x8a 0x5c
>
> AFAICT, 0x8a 0x5c is a valid JOHAB sequence making up Hangul character "굎".
> Easily verified in Python:
>
>     print(b'\x8a\x5c'.decode('johab'))
>
> It's the same story for some other valid sequences I tried, including this
> character's "neighbours" 0x8a 0x5b and 0x8a 0x5d.

My reading of pg_johab_verifystr() is that it accepts any character
without the high bit set as a single-byte character. Otherwise, it
calls pg_johab_mblen() to determine the length of the character, and
that in turn calls pg_euc_mblen(), which returns 3 if the first byte
is 0x8f and otherwise 2. Whatever the answer, it then wants each byte
to pass IS_EUC_RANGE_VALID() which allows for bytes from 0xa1 to 0xfe.
Your byte string doesn't match that rule, so it makes sense that it
fails.
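
The rules described above can be sketched in Python as follows. This is an illustration only, not the code from wchar.c: the function names are hypothetical, and it assumes the range check applies only to the bytes after the lead byte, which is consistent with 0x89 0xef being accepted later in this thread.

```python
def euc_style_mblen(first: int) -> int:
    """Length rule as described: 1 for ASCII, 3 after 0x8f, otherwise 2."""
    if first < 0x80:
        return 1
    return 3 if first == 0x8F else 2


def euc_style_verify(data: bytes) -> bool:
    """Return True if every character in data passes the EUC-style check."""
    i = 0
    while i < len(data):
        length = euc_style_mblen(data[i])
        if i + length > len(data):
            return False  # truncated multi-byte character
        # Assumption: only the bytes after the lead byte are range-checked.
        if any(not 0xA1 <= b <= 0xFE for b in data[i + 1 : i + length]):
            return False
        i += length
    return True


print(euc_style_verify(b"\x8a\x5c"))  # trail byte 0x5c is below 0xa1
print(euc_style_verify(b"\x89\xef"))  # trail byte 0xef is inside the range
```

Under these assumptions the reported sequence is rejected purely because of its second byte, regardless of what the mapping tables contain.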

What confuses me is that
https://en.wikipedia.org/wiki/KS_X_1001#Johab_encoding seems to say
that the encoding is always a 2-byte encoding and that any 2-byte
sequence with the high bit set on the first character is a valid
character. So the rules we're implementing don't seem to match that at
all. But unfortunately the intent behind the current code is not
clear. It was introduced by Bruce in 2002 in commit
a8bd7e1c6e026678019b2f25cffc0a94ce62b24b, but I don't see comments
there or elsewhere explaining what the thought was behind the way the
code works, so I don't know if this is some weird variant of JOHAB
that intentionally works differently or if this was just never
correct.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: BUG #19354: JOHAB rejects valid byte sequences

From
Jeroen Vermeulen
Date:
Hi Robert.  Thanks for following up.

The original author of the support code in libpqxx also noted that there was a discrepancy.  Python does accept these 2-byte sequences, and decodes them to Hangul characters.

The way I read the Wikipedia section, Johab isn't like the EUC encodings in that it adds characters that contain ASCII-like values in the second byte.  I guess that was needed to support Chinese characters in addition to Hangul.  Unit-testing for the embedded-backslash hazard was what led me to find the problem.
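
A small Python sketch of the embedded-backslash hazard mentioned above. It relies on Python's built-in "johab" codec, which may not match every Johab variant; the two helper functions are hypothetical and assume the standard layout in which a lead byte with the high bit set is always followed by exactly one trail byte.

```python
raw = b"\x8a\x5c"

# The trail byte has the same value as an ASCII backslash.
print(raw[1] == ord("\\"))

# Python's johab codec treats the pair as a single character.
text = raw.decode("johab")
print(len(text), hex(ord(text)))


def naive_backslash_offsets(data: bytes) -> list[int]:
    """Byte-wise scan that ignores character boundaries (the hazard)."""
    return [i for i, b in enumerate(data) if b == 0x5C]


def johab_backslash_offsets(data: bytes) -> list[int]:
    """Scan that skips the trail byte of each two-byte character."""
    offsets = []
    i = 0
    while i < len(data):
        if data[i] >= 0x80:
            i += 2  # lead byte plus one trail byte (assumed layout)
            continue
        if data[i] == 0x5C:
            offsets.append(i)
        i += 1
    return offsets


print(naive_backslash_offsets(raw))   # sees a backslash inside the character
print(johab_backslash_offsets(raw))   # sees none
```

Any code that scans Johab text byte by byte for escape characters would need the second kind of scan to avoid misreading the trail byte.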

This bit worries me: "Other, vendor-defined, Johab variants also exist", such as an EBCDIC-based one and a stateful one!


Jeroen

Re: BUG #19354: JOHAB rejects valid byte sequences

From
Tom Lane
Date:
Jeroen Vermeulen <jtvjtv@gmail.com> writes:
> This bit worries me: "Other, vendor-defined, Johab variants also exist",
> such as an EBCDIC-based one and a stateful one!

Yeah.  So what we have here is:

1. Our JOHAB implementation has apparently been wrong since day one.

2. Wrongness may be in the eye of the beholder, since there are
multiple versions of JOHAB.

3. Your complaint is the first, AFAIR.

4. That wikipedia page says "Following the introduction of Unified
Hangul Code by Microsoft in Windows 95, and Hangul Word Processor
abandoning Johab in favour of Unicode in 2000, Johab ceased to be
commonly used."

Given these things, I wonder if we shouldn't desupport JOHAB
rather than attempt to fix it.  Fixing would likely be a significant
amount of work: if we don't even have the character lengths right,
how likely is it that our conversions to other character sets are
correct?  I also worry that if different PG versions have different
ideas of the mapping, there could be room for dump/reload problems,
and maybe even security problems related to the backslash issue.

            regards, tom lane



Re: BUG #19354: JOHAB rejects valid byte sequences

From
VASUKI M
Date:

Thanks all. That analysis makes a lot of sense.

Given the lack of a clear spec, the existence of multiple JOHAB variants, and how long this has apparently been "working" without anyone noticing, IMHO desupporting it does seem like the least risky option. At this point, trying to fix JOHAB variants feels like opening a pretty big can of worms, especially with the potential for dump/reload surprises or subtle parsing/security issues.

I don't have additional data to add, but +1 on removal or deprecation being a reasonable outcome here, given how obscure and effectively dead the encoding is nowadays.

Thanks for digging into this.

Cheers,
Vasuki M


Re: BUG #19354: JOHAB rejects valid byte sequences

From
Jeroen Vermeulen
Date:
My one worry is that perhaps Johab is on the list because one important user needed it.

But even then that requirement may have gone away?


Jeroen


Re: BUG #19354: JOHAB rejects valid byte sequences

From
Robert Haas
Date:
On Tue, Dec 16, 2025 at 2:42 AM Jeroen Vermeulen <jtvjtv@gmail.com> wrote:
> My one worry is perhaps Johab is on the list because one important user needed it.
>
> But even then that requirement may have gone away?

Well, that was over 20 years ago. There's a very good chance that even
if somebody was using JOHAB back then, they're not still using it now.

What's mystifying to me is that, presumably, somebody had a reason at
the time for thinking that this was correct. I know that our quality
standards were a whole lot looser back then, but I still don't quite
understand why someone would have spent time and effort writing code
based on a purely fictitious encoding scheme. So I went looking for
where we got the mapping tables from. UCS_to_JOHAB.pl expects to read
from a file JOHAB.TXT, of which the latest version seems to be found
here:

https://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/JOHAB.TXT

And indeed, if I run UCS_to_JOHAB.pl on that JOHAB.TXT file, it
regenerates the current mapping files. Playing with it a bit:

rhaas=# select convert_from(e'\\x8a5c'::bytea, 'johab');
ERROR:  invalid byte sequence for encoding "JOHAB": 0x8a 0x5c
rhaas=# select convert_from(e'\\x8444'::bytea, 'johab');
ERROR:  invalid byte sequence for encoding "JOHAB": 0x84 0x44
rhaas=# select convert_from(e'\\x89ef'::bytea, 'johab');
 convert_from
--------------
 괦
(1 row)

So, \x8a5c is the original example, which does appear in JOHAB.TXT,
and \x8444 is the first multi-byte character in that file, and both of
them fail. But 89ef, which also appears in that file, doesn't fail,
and from what I can tell the mapping is correct. So apparently we've
got the "right" mappings, but you can only actually use the ones that
match the code's rules for something to be a valid multi-byte
character, which aren't actually in sync with the mapping table. I'm
left with the conclusions that (1) nobody ever actually tried using
this encoding for anything real until 3 days ago and (2) we don't have
any testing infrastructure that verifies that the characters in the
mapping tables are actually accepted by pg_verifymbstr(). I wonder how
many other encodings we have that don't actually work?
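
A rough sketch of the kind of check described in (2), written in Python rather than against the PostgreSQL tree. It uses Python's built-in "johab" codec as a stand-in for the JOHAB.TXT mapping table (the two may differ), and the helper names and the EUC-style acceptance rule are assumptions based on the behaviour discussed in this thread, not PostgreSQL's actual code.

```python
def euc_rule_accepts_pair(pair: bytes) -> bool:
    """Assumed EUC-style rule for a two-byte character.

    A lead byte of 0x8f would be treated as starting a three-byte
    character, so a two-byte sequence beginning with it cannot pass.
    """
    return pair[0] != 0x8F and 0xA1 <= pair[1] <= 0xFE


def decodable_pairs():
    """Yield every two-byte sequence the codec decodes to one character."""
    for lead in range(0x80, 0x100):
        for trail in range(0x100):
            pair = bytes((lead, trail))
            try:
                text = pair.decode("johab")
            except UnicodeDecodeError:
                continue
            if len(text) == 1:
                yield pair


accepted, rejected = [], []
for pair in decodable_pairs():
    (accepted if euc_rule_accepts_pair(pair) else rejected).append(pair)

print(f"{len(accepted)} mapped pairs pass the rule, {len(rejected)} do not")
```

The same idea, driven by the real mapping files and the real verifier for each encoding, would show which other encodings have mappings that the validation code can never reach.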

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: BUG #19354: JOHAB rejects valid byte sequences

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> ... So I went looking for
> where we got the mapping tables from. UCS_to_JOHAB.pl expects to read
> from a file JOHAB.TXT, of which the latest version seems to be found
> here:
> https://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/JOHAB.TXT
> And indeed, if I run UCS_to_JOHAB.pl on that JOHAB.TXT file, it
> regenerates the current mapping files.

Thanks for doing that research!

> So apparently we've
> got the "right" mappings, but you can only actually use the ones that
> match the code's rules for something to be a valid multi-byte
> character, which aren't actually in sync with the mapping table.

Yeah.  Looking at the code in wchar.c, it's clear that it thinks
that JOHAB has the same character-length rules as EUC_KR, which is
something that one might guess based on available documentation that
says it's related to that encoding.  So I can see how we got here.

However, that doesn't mean we can fix pg_johab_mblen() and we're done.
I'm still quite afraid that we'd be introducing security-grade
inconsistencies of interpretation between different PG versions.

> I'm
> left with the conclusions that (1) nobody ever actually tried using
> this encoding for anything real until 3 days ago and (2) we don't have
> any testing infrastructure that verifies that the characters in the
> mapping tables are actually accepted by pg_verifymbstr(). I wonder how
> many other encodings we have that don't actually work?

Indeed.  Anyone want to do some testing?

            regards, tom lane



Re: BUG #19354: JOHAB rejects valid byte sequences

From
Robert Haas
Date:
On Tue, Dec 16, 2025 at 10:41 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> However, that doesn't mean we can fix pg_johab_mblen() and we're done.
> I'm still quite afraid that we'd be introducing security-grade
> inconsistencies of interpretation between different PG versions.

I understand that fear, but I do not have an opinion either way on
whether there would be an actual vulnerability.

I think there is a good chance that the right going-forward fix is to
deprecate the encoding, because according to
https://www.unicode.org/Public/MAPPINGS/EASTASIA/ReadMe.txt this and
everything else that's now under
https://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/ were
deprecated in 2001. By the time v19 is released, the deprecation will
be a quarter-century old, and the fact that it doesn't work is good
evidence that few people will miss it, though perhaps the original
poster will want to put forward an argument for why we should still
care about this.

What to do in the back branches is a more difficult question. Since
this is a client-only encoding, there's no issue of what is already
stored in the database, and we would not be proposing to change any of
the mappings, just allow the ones that don't currently work to do so.
I *think* that fixing pg_johab_mblen() would be "forward compatible":
the subset of the encoding that already works would continue to behave
in the same way, and the rest of it would begin working as well.

And, I don't really like throwing up our hands and deciding that
already-released features are free to continue not working. That's
what bug-fix releases are for.

On the other hand, fixing this bug which apparently affects very few
users, and in the process creating a scarier, CVE-worthy bug would not
win us many friends, especially in view of the apparently-low uptake
of this encoding.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: BUG #19354: JOHAB rejects valid byte sequences

From
Michael Paquier
Date:
On Tue, Dec 16, 2025 at 10:41:46AM -0500, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I'm
>> left with the conclusions that (1) nobody ever actually tried using
>> this encoding for anything real until 3 days ago and (2) we don't have
>> any testing infrastructure that verifies that the characters in the
>> mapping tables are actually accepted by pg_verifymbstr(). I wonder how
>> many other encodings we have that don't actually work?
>
> Indeed.  Anyone want to do some testing?

FWIW, I have been made aware a couple of weeks ago by a colleague that
SJIS and SHIFT_JIS_2004 are used by some customers, and that we are
many years behind an update of the conversion mappings in the tree
with Postgres not understanding some of the characters.  These two are
marginal in the mostly-UTF8 world we live in these days, but it's
annoying for byte sequences that should not change across the years,
just be refreshed with new data.
--
Michael
