Discussion: C11: should we use char32_t for unicode code points?
Now that we're using C11, should we use char32_t for unicode code
points?
Right now, we use pg_wchar for two purposes:
1. to abstract away some problems with wchar_t on platforms where
it's 16 bits; and
2. to hold unicode code point values
In UTF8, they are equivalent and can be freely cast back and forth,
but not necessarily in other encodings. That can be confusing in some
contexts. Attached is a patch to use char32_t for the second purpose.
Both are equivalent to uint32, so there's no functional change and no
actual typechecking, it's just for readability.
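
For concreteness, here's the flavor of the change (an illustrative
sketch, not a hunk from the patch; pg_wchar itself is unchanged):

#include <uchar.h>              /* C11: declares char32_t */

typedef unsigned int pg_wchar;  /* as in pg_wchar.h: encoding-dependent */

/* before: the same type served both roles */
pg_wchar    cp1 = 0x00E9;       /* U+00E9, but only if the encoding is UTF-8 */

/* after: spell out when a value is definitely a Unicode code point */
char32_t    cp2 = 0x00E9;       /* U+00E9, unconditionally */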
Is this helpful, or needless code churn?
Regards,
Jeff Davis
Attachments
> Now that we're using C11, should we use char32_t for unicode code
> points?
>
> Right now, we use pg_wchar for two purposes:
>
> 1. to abstract away some problems with wchar_t on platforms where
> it's 16 bits; and
> 2. to hold unicode code point values
>
> In UTF8, they are equivalent and can be freely cast back and forth,
> but not necessarily in other encodings. That can be confusing in some
> contexts. Attached is a patch to use char32_t for the second purpose.
>
> Both are equivalent to uint32, so there's no functional change and no
> actual typechecking, it's just for readability.
>
> Is this helpful, or needless code churn?

Unless char32_t is solely used for the Unicode code point data, I
think it would be better to define something like "pg_unicode" and use
it instead of directly using char32_t because it would be cleaner for
code readers.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese: http://www.sraoss.co.jp
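(For concreteness, the suggestion amounts to something like the
following sketch; the name pg_unicode is the one proposed above, the
rest is illustrative:)

#include <uchar.h>

/* a named alias, so readers see "Unicode code point" at each use
 * site rather than a bare char32_t */
typedef char32_t pg_unicode;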
On Fri, 2025-10-24 at 18:43 +0900, Tatsuo Ishii wrote:
> Unless char32_t is solely used for the Unicode code point data, I
> think it would be better to define something like "pg_unicode" and
> use it instead of directly using char32_t because it would be cleaner
> for code readers.

That was my original idea, but then I saw that apparently char32_t is
intended for Unicode code points:

https://www.gnu.org/software/gnulib/manual/html_node/The-char32_005ft-type.html

But I am also OK with a new type if others find it more readable.

Regards,
Jeff Davis
On Sat, Oct 25, 2025 at 4:25 AM Jeff Davis <pgsql@j-davis.com> wrote:
> On Fri, 2025-10-24 at 18:43 +0900, Tatsuo Ishii wrote:
> > Unless char32_t is solely used for the Unicode code point data, I
> > think it would be better to define something like "pg_unicode" and
> > use it instead of directly using char32_t because it would be
> > cleaner for code readers.
>
> That was my original idea, but then I saw that apparently char32_t is
> intended for Unicode code points:
>
> https://www.gnu.org/software/gnulib/manual/html_node/The-char32_005ft-type.html

It's definitely a codepoint but C11 only promised UTF-32 encoding if
__STDC_UTF_32__ is defined to 1, and otherwise the encoding is unknown.
The C23 standard resolved that insanity and required UTF-32, and there
are no known systems[1] that didn't already conform, but I guess you
could static_assert(__STDC_UTF_32__, "char32_t must use UTF-32
encoding"). It's also defined as at least, not exactly, 32 bits but we
already require the machine to have uint32_t so it must be exactly 32
bits for us, and we could static_assert(sizeof(char32_t) == 4) for good
measure. So all up, the standard type matches our existing assumptions
about pg_wchar *if* the database encoding is UTF8.

IIUC you're proposing that all the stuff that only works when database
encoding is UTF8 should be flipped over to the new type, and that seems
like a really good idea to me: remaining uses of pg_wchar would be
warnings that the encoding is only conditionally known. It'd be
documentation without new type safety though: for example I think you
missed a spot, the return type of the definition of utf8_to_unicode()
(I didn't search exhaustively). Only in C++ is it a distinct type that
would catch that and a few other mistakes.

Do you consider explicit casts between eg pg_wchar and char32_t to be
useful documentation for humans, when coercion should just work? I
kinda thought we were trying to cut down on useless casts, they might
signal something but can also hide bugs. Should the few places that
deal in surrogates be using char16_t instead?

I wonder if the XXX_libc_mb() functions that contain our hard-coded
assumptions that libc's wchar_t values are in UTF-16 or UTF-32 should
use your to_char32_t() too (probably with a longer name
pg_wchar_to_char32_t() if it's in a header for wider use). That'd
highlight the exact points at which we make that assumption and
centralise the assertion about database encoding, and then the code
that compares with various known cut-off values would be clearly in
the char32_t world.

> But I am also OK with a new type if others find it more readable.

Adding yet another name to this soup doesn't immediately sound like it
would make anything more readable to me. ISO has standardised this for
the industry, so I'd vote for adopting it without indirection that
makes the reader work harder to understand what it is. The churn
doesn't seem excessive either, it's fairly well contained stuff already
moving around a lot in recent releases with all your recent and ongoing
revamping work.

There is one small practical problem though: Apple hasn't got around to
supplying <uchar.h> in its C SDK yet. It's there for C++ only, and
isn't needed for the type in C++ anyway. I don't think that alone
warrants a new name wart, as the standard tells us it must match
uint_least32_t so we can just define it ourselves if
!defined(__cplusplus) && !defined(HAVE_UCHAR_H), until Apple gets
around to that.
Since it confused me briefly: Apple does provide <unicode/uchar.h> but
that's a coincidentally named ICU header, and on that subject I see
that ICU hasn't adopted these types yet but there are some hints that
they're thinking about it; meanwhile their C++ interfaces have begun to
document that they are acceptable in a few template functions.

All other target systems have it AFAICS. Windows: tested by CI, MinGW:
found discussion, *BSD, Solaris, Illumos: found man pages.

As for the conversion functions in <uchar.h>, they're of course missing
on macOS but they also depend on the current locale, so it's almost
like C, POSIX and NetBSD have conspired to make them as useless to us
as possible. They solve the "size and encoding of wchar_t is undefined"
problem, but there are no _l() variants and we can't depend on
uselocale() being available. Probably wouldn't be much use to us anyway
considering our more complex and general transcoding requirements, I
just thought about this while contemplating hypothetical pre-C23
systems that don't use UTF-32, specifically what would break if such a
system existed: probably nothing as long as you don't use these. I
guess another way you could tell would be if you used the fancy new
U-prefixed character/string literal syntax, but I can't see much need
for that.

In passing, we seem to have a couple of mentions of "pg_wchar_t" (bogus
_t) in existing comments.

[1] https://thephd.dev/c-the-improvements-june-september-virtual-c-meeting
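(To be concrete about those conversion functions, this is the textbook
shape of the API we'd be declining to use — a sketch only; note that
the result depends on the ambient locale and there is no mbrtoc32_l():)

#include <string.h>
#include <uchar.h>

/* decode one multibyte character, in the CURRENT LOCALE's encoding,
 * to a char32_t; returns bytes consumed, or a (size_t)-1/-2 error code */
static size_t
decode_one(const char *s, size_t n, char32_t *out)
{
    mbstate_t   state;

    memset(&state, 0, sizeof(state));
    return mbrtoc32(out, s, n, &state);
}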
On Sat, 2025-10-25 at 16:21 +1300, Thomas Munro wrote:
> I guess you could static_assert(__STDC_UTF_32__, "char32_t must use
> UTF-32 encoding").
Done.
> It's also defined as at least, not exactly, 32 bits but we already
> require the machine to have uint32_t so it must be exactly 32 bits
> for us, and we could static_assert(sizeof(char32_t) == 4) for good
> measure.
What would be the problem if it were larger than 32 bits?
I don't mind adding the asserts, but it's slightly awkward because
StaticAssertDecl() isn't defined yet at the point we are including
uchar.h.
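
(What the patch does instead is plain preprocessor checks right after
the include, roughly:)

#include <uchar.h>
#ifndef __STDC_UTF_16__
#error "char16_t must use UTF-16 encoding"
#endif
#ifndef __STDC_UTF_32__
#error "char32_t must use UTF-32 encoding"
#endif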
> IIUC you're proposing that all the stuff that only works when
> database encoding is UTF8 should be flipped over to the new type, and
> that seems like a really good idea to me: remaining uses of pg_wchar
> would be warnings that the encoding is only conditionally known.
Exactly. The idea is to make pg_wchar stand out more as a platform-
dependent (or encoding-dependent) representation, and remove the doubt
when someone sees char32_t.
> It'd be
> documentation without new type safety though: for example I think you
> missed a spot, the return type of the definition of utf8_to_unicode()
> (I didn't search exhaustively).
Right, it's not offering type safety. Fixed the omission.
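
To spell out why C lets that slip through (a contrived but compilable
sketch; on typical platforms char32_t is a typedef of a 32-bit unsigned
type in C, and a distinct built-in type only in C++):

#include <uchar.h>

typedef unsigned int pg_wchar;

/* the header says char32_t... */
extern char32_t utf8_to_unicode(const unsigned char *c);

/* ...but a definition still returning pg_wchar compiles silently in C,
 * because the two types are compatible; a C++ compiler would reject it */
pg_wchar
utf8_to_unicode(const unsigned char *c)
{
    return *c;                  /* stub body, just to make this compile */
}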
> Do you consider explicit casts between eg pg_wchar and char32_t to be
> useful documentation for humans, when coercion should just work? I
> kinda thought we were trying to cut down on useless casts, they might
> signal something but can also hide bugs.
The patch doesn't add any explicit casts, except in to_char32() and
to_pg_wchar(), so I assume that the callsites of those functions are
what you meant by "explicit casts"?
We can get rid of those functions if you want. The main reason they
exist is for a place to comment on the safety of converting pg_wchar to
char32_t. I can put that somewhere else, though.
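
(For reference, they're roughly just annotated casts, something like:)

/*
 * Convert pg_wchar to char32_t. Only meaningful when the database
 * encoding is UTF-8, where pg_wchar values are already Unicode code
 * points; the assertion documents that precondition.
 */
static inline char32_t
to_char32(pg_wchar wc)
{
    Assert(GetDatabaseEncoding() == PG_UTF8);
    return (char32_t) wc;
}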
> Should the few places that
> deal in surrogates be using char16_t instead?
Yes, done.
> I wonder if the XXX_libc_mb() functions that contain our hard-coded
> assumptions that libc's wchar_t values are in UTF-16 or UTF-32 should
> use your to_char32_t() too (probably with a longer name
> pg_wchar_to_char32_t() if it's in a header for wider use).
I don't think those functions do depend on UTF-32. iswalpha(), etc.,
take a wint_t, which is just a wchar_t that can also be WEOF.
And if we don't use to_char32/to_pg_wchar in there, I don't see much
need for it outside of pg_locale_builtin.c, but if the need arises we
can move it to a header file and give it a longer name.
> That'd
> highlight the exact points at which we make that assumption and
> centralise the assertion about database encoding, and then the code
> that compares with various known cut-off values would be clearly in
> the char32_t world.
The asserts about UTF-8 in pg_locale_libc.c are there because the
previous code only took those code paths for UTF-8, and I preserved
that. Also there is some code that depends on UTF-8 for decoding, but I
don't think anything in there depends on UTF-32 specifically.
> There is one small practical problem though: Apple hasn't got around
> to supplying <uchar.h> in its C SDK yet. It's there for C++ only, and
> isn't needed for the type in C++ anyway. I don't think that alone
> warrants a new name wart, as the standard tells us it must match
> uint_least32_t so we can just define it ourselves if
> !defined(__cplusplus) && !defined(HAVE_UCHAR_H), until Apple gets
> around to that.
Thank you, I added a configure test for uchar.h and some more
preprocessor logic in c.h.
> Since it confused me briefly: Apple does provide <unicode/uchar.h>
> but that's a coincidentally named ICU header, and on that subject I
> see
> that ICU hasn't adopted these types yet but there are some hints that
> they're thinking about it; meanwhile their C++ interfaces have begun
> to document that they are acceptable in a few template functions.
Even when they fully move to char32_t, we will still have to support
the older ICU versions for a long time.
> All other target systems have it AFAICS. Windows: tested by CI,
> MinGW: found discussion, *BSD, Solaris, Illumos: found man pages.
Great, thank you!
> They solve the "size and encoding of wchar_t is
> undefined" problem
One thing I never understood about this is that it's our code that
converts from the server encoding to pg_wchar (e.g.
pg_latin12wchar_with_len()), so we must understand the representation
of pg_wchar. And we cast directly from pg_wchar to wchar_t, so we
understand the encoding of wchar_t, too, right?
> In passing, we seem to have a couple of mentions of "pg_wchar_t"
> (bogus _t) in existing comments.
Thank you. I'll fix that separately.
Regards,
Jeff Davis
Attachments
On Mon, Oct 27, 2025 at 8:43 AM Jeff Davis <pgsql@j-davis.com> wrote:
> What would be the problem if it were larger than 32 bits?

Hmm, OK fair question, I can't think of any, I was just working through
the standard and thinking myopically about the exact definition, but I
think it's actually already covered by other things we assume/require
(ie the existence of uint32_t forces the size of char32_t if you follow
the chain of definitions backwards), and as you say it probably doesn't
even matter. I suppose you could also skip the __STDC_UTF_32__
assertion given that we already make a larger assumption about wchar_t
encoding, and it seems to be exhaustively established that no
implementation fails to conform to C23 for char32_t (see earlier link
to Meneide's blog). I don't personally understand what C11 was smoking
when it left that unspecified for another 12 years.

> > I wonder if the XXX_libc_mb() functions that contain our hard-coded
> > assumptions that libc's wchar_t values are in UTF-16 or UTF-32 should
> > use your to_char32_t() too (probably with a longer name
> > pg_wchar_to_char32_t() if it's in a header for wider use).
>
> I don't think those functions do depend on UTF-32. iswalpha(), etc.,
> take a wint_t, which is just a wchar_t that can also be WEOF.

I was noticing that toupper_libc_mb() directly tests if a pg_wchar
value is in the ASCII range, which only makes sense given knowledge of
pg_wchar's encoding, so perhaps that should trigger this new coding
rule. But I agree that's pretty obscure... feel free to ignore that
suggestion. Hmm, the comment at the top explains that we apply that
special ASCII treatment for default locales and not non-default
locales, but it doesn't explain *why* we make that distinction. Do you
know?

> One thing I never understood about this is that it's our code that
> converts from the server encoding to pg_wchar (e.g.
> pg_latin12wchar_with_len()), so we must understand the representation
> of pg_wchar. And we cast directly from pg_wchar to wchar_t, so we
> understand the encoding of wchar_t, too, right?

Right, we do know the encoding of pg_wchar in every case (assuming that
all pg_wchar values come from our transcoding routines). We just don't
know if that encoding is also the one used by libc's locale-sensitive
functions that deal in wchar_t, except when the locale is one that uses
UTF-8 for char encoding, in which case we assume that every libc must
surely use Unicode codepoints in wchar_t. That probably covers the vast
majority of real world databases in the UTF-8 age, and no known system
fails to meet this expectation. Of course the encoding used by every
libc for non-UTF-8 locales is theoretically knowable too, but since
they vary and in some cases are not even documented, it would be too
painful to contemplate any dependency on that.

Let me try to work through this in more detail... corrections welcome,
but this is what I have managed to understand about this module so far,
in my quest to grok PostgreSQL's overall character encoding model (and
holes therein):

For locales that use UTF-8 for char, we expect libc to understand
pg_wchar/wchar_t/wint_t values as UTF-32 or at a stretch UTF-16. The
expected source of these pg_wchar values is our various regexp code
paths that will use our mbutils pg_wchar conversion to UTF-32, with a
reasonable copying strategy for sizeof(wchar_t) == 2 (that's Windows
and I think otherwise only AIX in 32 bit builds, if it comes back).
If any libc didn't use Unicode codepoints in its locale-sensitive
wchar_t functions for UTF-8 locales we'd get garbage results, but we
don't know of any such system. It's a bit of a shame that C11 didn't
introduce the obvious isualpha(char32_t) variants for a
standard-supported version of that realpolitik we depend on, but
perhaps one day...

There is one minor quirk here that it might be nice to document in top
comment section 2: on Windows we also expect wchar_t to be understood
by system wctype functions as UTF-16 for locales that *don't* use UTF-8
for char (an assumption that definitely doesn't hold on many Unixen).
That is important because on Windows we allow non-UTF-8 locales to be
used in UTF-8 databases for historical reasons.

For single-byte encodings: pg_latin12wchar_with_len() just zero-extends
the bytes to pg_wchar (sketched below), so when the pg_locale_libc.c
functions truncate them and call 8-bit ctype stuff eg isalpha_l(), it
completes a perfect round trip inside our code. (BTW
pg_latin12wchar_with_len() has the same definition as
pg_ascii2wchar_with_len(), and is used for many single-byte encodings
other than LATIN1, which makes me wonder why we don't just have a
single function pg_char2wchar_with_len() that is used by all "simple
widening" cases.) We never know or care which encoding libc would
itself use for these locales' wchar_t, as we don't ever pass it a
wchar_t.

Assuming I understood that correctly, I think it would be nice if the
"100% correct for LATINn" comment stated the reason for that certainty
explicitly, ie that it closes an information-preserving round-trip
beginning with the coercion in pg_latin12wchar_with_len() and that libc
never receives a wchar_t/wint_t that we fabricated.

A bit of a digression, which I *think* is out-of-scope for this module,
but just while I'm working through all the implications: This could
produce unspecified results if a wchar_t from another source ever
arrived into these functions eg wchar_t made by libc or L"literal" made
by the compiler, both unspecified. In practice, a wchar_t of
non-PostgreSQL origin that is truncated to 8 bits would probably still
give a sensible result for codepoints 0-127 (= 7 bit subset of Unicode,
and we require all server encodings to be supersets of ASCII), and
0-255 for LATIN1 (= 8 bit subset of Unicode), because: the two main
approaches to single-byte char -> wchar_t conversion in libc
implementations seem to be conversion to Unicode (Windows, glibc?), and
simply casting char to wchar_t (I think this is probably what *BSD and
Solaris do for single-byte non-UTF-8 locales, leading to the complaint
that wchar_t encoding is locale-dependent on those systems, though I
haven't checked in detail, and that's of course also exactly what our
own conversion does). So I think that means that 128-255 would give
nonsense results for non-LATIN1 single byte encodings on Windows or
glibc (?) but perhaps not other Unixen. For example, take ISO 8859-7,
the legacy single byte encoding for Greek: it encodes α as 0xe1, and
Windows and glibc (?) would presumably encode that as (wchar_t) 0x03b1
(the Unicode codepoint), and then wc_isalpha_libc_sb() would truncate
that to 0xb1, which is ± in ISO 8859-7, so isalpha_l() would return
false, despite α being the OG alpha (not tested, just a thought
experiment looking at tables).
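
(Here's the zero-extension loop mentioned above, paraphrased from
src/common/wchar.c; the function name is illustrative. The byte IS the
pg_wchar value, so truncating it back down later loses nothing:)

/* paraphrase of pg_latin12wchar_with_len(): zero-extend each byte */
static int
latin1_to_pg_wchar(const unsigned char *from, pg_wchar *to, int len)
{
    int         cnt = 0;

    while (len > 0 && *from)
    {
        *to++ = *from++;        /* pg_wchar value == original byte */
        len--;
        cnt++;
    }
    *to = 0;
    return cnt;
}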
But since handling pg_wchar of non-PostgreSQL origin doesn't seem to be
one of our goals, there is no problem to fix here, it might just be
worthy of a note in that commentary: we don't try to deal with wchar_t
values not made by PostgreSQL, except where noted (non-escaping uses of
char2wchar() in controlled scopes).

For multi-byte encodings other than UTF-8, pg_locale_libc.c is
basically giving up almost completely, but could probably be tightened
up. I can't imagine we'll ever add another multibyte encoding, and I
believe we can ignore MULE internal, as no libc supports it (so you
could only get here with the C locale where you'll get the garbage
results you asked for... in fact I wonder why we need MULE internal at
all... it seems to be a sort of double-encoding for multiplexing other
encodings, so we can't exactly say it's not blessed by a standard, it's
indirectly defined by "all the standards" in a sense, but it's also
entirely obsoleted by Unicode's unification so I don't know what
problem it solves for anyone, or if anyone ever needed it in any
reasonable pg_upgrade window of history...).

Of server-supported encodings, that leaves only EUC_* to think about.
The EUC family has direct encoding of 7-bit ASCII and then 3 selectable
character sets represented by sequences with the high bit set, with
details varying between the Chinese (simplified Chinese), Taiwanese
(traditional Chinese), Japanese (2 kinds) and Korean variants. I don't
know if the pg_wchar encoding we're producing in
pg_euc*2wchar_with_len() has a name, but it doesn't appear to match the
description of the standard "fixed" representation on the Wikipedia
page for Extended Unix Code (it's too wide for starters, looking at the
shift distances). The main thing seems to be that we simply zero-extend
the ASCII range into a pg_wchar directly, so when we cast it down to
call 8-bit ctype functions, I expect we produce correct results for
ASCII characters... and then I don't know what but I guess nothing good
for 128-255, and then surely hot garbage for everything else, cycling
through the 0-255 answers repeatedly as we climb the pg_wchar value
range. The key point being that it's *not* a perfect
information-preserving round-trip, as we achieve for single-byte
encodings.

Some ideas for improvements:

1. Cheap but incomplete: use a different ctype method table that
short-circuits the results (false for isalpha et al, pass-through for
upper/lower) for pg_wchar >= 128 and uses the existing 8-bit ctype
functions for ASCII.

2. More expensive but complete: handle ASCII range with existing 8-bit
ctype functions, and otherwise convert our pg_wchar back to MB char
format and then use libc's mbstowcs_l() to make a wchar_t that libc's
wchar_t-based functions should understand. To avoid doing hard work for
nothing (ideogram-based languages generally don't care about ctype
stuff so that'd be the vast majority of characters appearing in
Chinese/Japanese/Korean text) at the cost of having to do a bunch of
research, we could short-circuit the core CJK character ranges, and do
the extra CPU cycles for the rest, to catch the Latin + accents, Greek,
Cyrillic characters that are also supported in these encodings for
foreign names, variables in scientific language etc. I guess that
implies a classifier that would be associated with ... the encoding?
That would of course break if wchar_t values of non-PostgreSQL origin
arrive here, but see above note about nailing down a contract that
formally excludes that outside narrow non-escaping sites.
3. I assume there are some good reasons we don't do this but... if we
used char2wchar() in the first place (= libc native wchar_t) for the
regexp stuff that calls this stuff (as we do already inside
whole-string upper/lower, just not character upper/lower or character
classification), then we could simply call the wchar_t libc functions
directly and unconditionally in the libc provider for all cases,
instead of the 8-bit variants with broken edge cases for non-UTF-8
databases. I didn't try to find the historical discussions, but I can
imagine already that we might not have done that because it has to copy
to cope with non-NULL-terminated strings, might perhaps have weird
incompatibilities with our own multibyte sequence detection, might be
slower (and/or might have been unusably broken on ancient libcs?), and
it would only be appropriate for libc locales anyway and yet now we
have other locale providers that certainly don't want some unspecified
wchar_t encoding or libc involved. It's also likely that non-UTF-8
systems are of dwindling interest to anyone outside perhaps client
encodings (hence my attempt to ram home some simplifying assumptions
about that in that project to nail down some rules where the encoding
is fuzzy that I mentioned in a thread from a few months ago). So I'm
not seriously suggesting this, just thinking out loud about the corner
we've painted ourselves into where idea #2's multiple transcoding steps
would be necessary to get the "right" answer for any character in these
encodings. Hnngh.

In passing, I wonder why _libc.c has that comment about ICU in
parentheses. Not relevant here. I haven't thought much about whether
it's relevant in the ICU provider code (it may come back to that
do-we-accept-pg_wchar-we-didn't-make? question), but if it is then it
also applies to Windows and probably glibc in the libc provider and I
don't immediately see any problem (assuming no-we-don't! answer).
> The EUC family has direct encoding of 7-bit ASCII and then 3
> selectable character sets represented by sequences with the high bit
> set, with details varying between the Chinese (simplified Chinese),
> Taiwanese (traditional Chinese), Japanese (2 kinds) and Korean
> variants. I don't know if the pg_wchar encoding we're producing in
> pg_euc*2wchar_with_len() has a name, but it doesn't appear to match
> the description of the standard "fixed" representation on the
> Wikipedia page for Extended Unix Code (it's too wide for starters,
> looking at the shift distances).

Yes. pg_euc*2wchar_with_len() creates a "variable length"
representation of EUC, with 1 to 4 bytes per character, then expands
each character into a pg_wchar. It can also be converted back to the
multibyte representation easily. Note that the standard "fixed"
representation of EUC includes ASCII-range bytes in *non*-ASCII
characters, thus I think it is not easy to use as a backend-safe
encoding.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese: http://www.sraoss.co.jp
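(Concretely, the expansion described above looks roughly like this,
paraphrasing the EUC_JP case of pg_euc2wchar_with_len() in
src/common/wchar.c; the other EUC variants differ in details:)

/* paraphrase: one EUC_JP character -> one pg_wchar. Bytes are packed
 * so the multibyte form can be recovered, and plain ASCII stays
 * zero-extended. */
if (*from == SS2 && len >= 2)                /* JIS X 0201 kana */
    wc = (SS2 << 8) | from[1];
else if (*from == SS3 && len >= 3)           /* JIS X 0212 */
    wc = (SS3 << 16) | (from[1] << 8) | from[2];
else if (IS_HIGHBIT_SET(*from) && len >= 2)  /* JIS X 0208 */
    wc = (from[0] << 8) | from[1];
else                                         /* ASCII */
    wc = from[0];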
On Tue, 2025-10-28 at 15:40 +1300, Thomas Munro wrote:
> I was noticing that toupper_libc_mb() directly tests if a pg_wchar
> value is in the ASCII range, which only makes sense given knowledge
> of pg_wchar's encoding, so perhaps that should trigger this new
> coding rule. But I agree that's pretty obscure... feel free to
> ignore that suggestion.
I'm not sure that casting it to char32_t would be an improvement there.
Perhaps if we can find some ways to generally clarify things (some of
which you suggest below), that could be part of a follow-up.
It looks like the current patch is a step in the right direction, so
I'll commit that soon and see what the buildfarm says.
> Hmm, the comment at the top explains that we apply that special ASCII
> treatment for default locales and not non-default locales, but it
> doesn't explain *why* we make that distinction. Do you know?
It makes some sense: I suppose someone thought that non-ASCII behavior
in the default locale is just too likely to cause problems. But the
non-ASCII behavior is allowed if you use a COLLATE clause.
But the pattern wasn't followed quite the same way with ICU, which uses
the given locale for UPPER()/LOWER() regardless of whether it's the
default locale or not. And for regexes, ICU doesn't use the locale at
all, it just uses u_isalpha(), etc., even if you use a COLLATE clause.
And there are still some places that call plain tolower()/toupper(),
such as fuzzystrmatch and ltree.
>
> Right, we do know the encoding of pg_wchar in every case (assuming
> that all pg_wchar values come from our transcoding routines). We just
> don't know if that encoding is also the one used by libc's
> locale-sensitive functions that deal in wchar_t, except when the
> locale is one that uses UTF-8 for char encoding, in which case we
> assume that every libc must surely use Unicode codepoints in wchar_t.
Ah, right. We create pg_wchars for any encoding, but we only pass a
pg_wchar to a libc multibyte function in the UTF-8 encoding.
(Aside: we do pass pg_wchars directly to ICU as UTF-32 codepoints,
regardless of encoding, which is a bug.)
> For locales that use UTF-8 for char, we expect libc to understand
> pg_wchar/wchar_t/wint_t values as UTF-32 or at a stretch UTF-16. The
> expected source of these pg_wchar values is our various regexp code
> paths that will use our mbutils pg_wchar conversion to UTF-32, with a
> reasonable copying strategy for sizeof(wchar_t) == 2 (that's Windows
> and I think otherwise only AIX in 32 bit builds, if it comes back).
> If any libc didn't use Unicode codepoints in its locale-sensitive
> wchar_t functions for UTF-8 locales we'd get garbage results, but we
> don't know of any such system.
Check.
> It's a bit of a shame that C11 didn't
> introduce the obvious isualpha(char32_t) variants for a
> standard-supported version of that realpolitik we depend on, but
> perhaps one day...
Yeah...
> There is one minor quirk here that it might be nice to document in
> top comment section 2: on Windows we also expect wchar_t to be understood
> by system wctype functions as UTF-16 for locales that *don't* use
> UTF-8 for char (an assumption that definitely doesn't hold on many
> Unixen). That is important because on Windows we allow non-UTF-8
> locales to be used in UTF-8 databases for historical reasons.
Interesting.
> For single-byte encodings: pg_latin12wchar_with_len() just
> zero-extends the bytes to pg_wchar, so when the pg_locale_libc.c
> functions truncate them and call 8-bit ctype stuff eg isalpha_l(), it
> completes a perfect round trip inside our code.
So you're saying that pg_wchar is more like a union type?
typedef union
{
    char     ch;     /* single-byte encodings or non-UTF8 encodings on unix */
    char16_t utf16;  /* windows non-UTF8 encodings */
    char32_t utf32;  /* UTF-8 encoding */
} pg_wchar;
(we'd have to be careful about the memory layout if we're casting,
though)
> (BTW
> pg_latin12wchar_with_len() has the same definition as
> pg_ascii2wchar_with_len(), and is used for many single-byte encodings
> other than LATIN1 which makes me wonder why we don't just have a
> single function pg_char2wchar_with_len() that is used by all "simple
> widening" cases.)
Sounds like a nice simplification.
> We never know or care which encoding libc would
> itself use for these locales' wchar_t, as we don't ever pass it a
> wchar_t.
Ah, that makes sense.
> Assuming I understood that correctly, I think it would be
> nice if the "100% correct for LATINn" comment stated the reason for
> that certainty explicitly, ie that it closes an information-preserving
> round-trip beginning with the coercion in pg_latin12wchar_with_len()
> and that libc never receives a wchar_t/wint_t that we fabricated.
Agreed, though I think some refactoring would be helpful to accompany
the comment. I've worked with this stuff a lot and I still find it hard
to keep everything in mind at once.
> A bit of a digression, which I *think* is out-of-scope for this
> module, but just while I'm working through all the implications: This
> could produce unspecified results if a wchar_t from another source
> ever arrived into these functions
Ugh.
When I first started dealing with pg_wchar, I assumed it was just a
wider wchar_t to abstract away some of the complexity when
sizeof(wchar_t) == 2 (e.g. get rid of surrogate pairs). It's clearly
more complicated than that.
> For multi-byte encodings other than UTF-8, pg_locale_libc.c is
> basically giving up almost completely
Right.
> I believe we can ignore MULE internal, as no libc supports it (so you
> could only get here with the C locale where you'll get the garbage
> results you asked for... in fact I wonder why we need MULE internal
> at all... it seems to be a sort of double-encoding for multiplexing
> other encodings, so we can't exactly say it's not blessed by a
> standard, it's indirectly defined by "all the standards" in a sense,
> but it's also entirely obsoleted by Unicode's unification so I don't
> know what problem it solves for anyone, or if anyone ever needed it
> in any reasonable pg_upgrade window of history...).
I have never heard of someone using it in production, and I wouldn't
object if someone wants to deprecate it.
> 2. More expensive but complete: handle ASCII range with existing
> 8-bit ctype functions, and otherwise convert our pg_wchar back to MB
> char format and then use libc's mbstowcs_l() to make a wchar_t that
> libc's wchar_t-based functions should understand.
Correct. Sounds painful, but perhaps we could just do it and measure
the performance.
> To avoid doing hard work for nothing (ideogram-based languages
> generally don't care about ctype stuff so that'd be the vast majority
> of characters appearing in Chinese/Japanese/Korean text) at the cost
> of having to do a bunch of research, we could short-circuit the core
> CJK character ranges, and do the extra CPU cycles for the rest,
I don't think we should start making a bunch of assumptions like that.
> 3. I assume there are some good reasons we don't do this but... if we
> used char2wchar() in the first place (= libc native wchar_t) for the
> regexp stuff that calls this stuff (as we do already inside
> whole-string upper/lower, just not character upper/lower or character
> classification), then we could simply call the wchar_t libc functions
> directly and unconditionally in the libc provider for all cases,
> instead of the 8-bit variants with broken edge cases for non-UTF-8
> databases.
I'm not sure about that either, but I think it's because you can end up
with surrogate pairs, which can't be represented in UTF-8.
> I didn't try to find the historical discussions, but I can
> imagine already that we might not have done that because it has to
> copy to cope with non-NULL-terminated strings,
That's probably another reason.
> and it would only be appropriate for libc locales anyway and
> yet now we have other locale providers that certainly don't want some
> unspecified wchar_t encoding or libc involved.
We could fix that by making some of these APIs take a char pointer
instead. That would allow libc to decode to wchar_t, and other
providers to decode to UTF-32. Or, we could say that pg_wchar is an
opaque type that can only be created by the provider, and passed back
to the same provider.
> It's also likely that
> non-UTF-8 systems are of dwindling interest to anyone outside perhaps
> client encodings
That's been my experience -- haven't run into many non-UTF8 server
encodings.
> In passing, I wonder why _libc.c has that comment about ICU in
> parentheses. Not relevant here.
I moved it in 4da12e9e2e.
> I haven't thought much about whether
> it's relevant in the ICU provider code (it may come back to that
> do-we-accept-pg_wchar-we-didn't-make? question), but if it is then it
> also applies to Windows and probably glibc in the libc provider and I
> don't immediately see any problem (assuming no-we-don't! answer).
It's relevant for the regc_wc_isalpha(), etc. functions:
https://www.postgresql.org/message-id/e7b67d24288f811aebada7c33f9ae629dde0def5.camel@j-davis.com
Regards,
Jeff Davis
This patch looks good to me overall, it's a nice improvement in
clarity.

On 26.10.25 20:43, Jeff Davis wrote:
> +/*
> + * char16_t and char32_t
> + *		Unicode code points.
> + */
> +#ifndef __cplusplus
> +#ifdef HAVE_UCHAR_H
> +#include <uchar.h>
> +#ifndef __STDC_UTF_16__
> +#error "char16_t must use UTF-16 encoding"
> +#endif
> +#ifndef __STDC_UTF_32__
> +#error "char32_t must use UTF-32 encoding"
> +#endif
> +#else
> +typedef uint16_t char16_t;
> +typedef uint32_t char32_t;
> +#endif
> +#endif

This could be improved a bit. The reason for some of these
conditionals is not clear. Like, what does __cplusplus have to do with
this? I think it would be more correct to write a configure/meson
check for the actual types rather than depend indirectly on a header
check.

The checks for __STDC_UTF_16__ and __STDC_UTF_32__ can be removed, as
was discussed elsewhere, since we don't use any standard library
functions that make use of these facts, and the need goes away with
C23 anyway.
On Tue, 2025-10-28 at 19:45 +0100, Peter Eisentraut wrote:
> This could be improved a bit. The reason for some of these
> conditionals is not clear. Like, what does __cplusplus have to do
> with this? I think it would be more correct to write a
> configure/meson check for the actual types rather than depend
> indirectly on a header check.
Fixed, thank you.
> The checks for __STDC_UTF_16__ and __STDC_UTF_32__ can be removed, as
> was discussed elsewhere, since we don't use any standard library
> functions that make use of these facts, and the need goes away with
> C23 anyway.
Removed.
I also made the pg_config.h.in changes and ran autoconf.
Regards,
Jeff Davis
Attachments
On Wed, Oct 29, 2025 at 7:45 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> On 26.10.25 20:43, Jeff Davis wrote:
> > +/*
> > + * char16_t and char32_t
> > + * Unicode code points.
> > + */
> > +#ifndef __cplusplus
> > +#ifdef HAVE_UCHAR_H
> > +#include <uchar.h>
> > +#ifndef __STDC_UTF_16__
> > +#error "char16_t must use UTF-16 encoding"
> > +#endif
> > +#ifndef __STDC_UTF_32__
> > +#error "char32_t must use UTF-32 encoding"
> > +#endif
> > +#else
> > +typedef uint16_t char16_t;
> > +typedef uint32_t char32_t;
> > +#endif
> > +#endif
>
> This could be improved a bit. The reason for some of these conditionals
> is not clear. Like, what does __cplusplus have to do with this? I
> think it would be more correct to write a configure/meson check for the
> actual types rather than depend indirectly on a header check.
I suggested testing __cplusplus because I predicted that that typedef
would fail on a C++ compiler (since C++11), where char32_t is a
language keyword identifying a distinct type requiring no #include.
This is an Apple-only problem, without which we could just include
<uchar.h> unconditionally, and presumably will eventually when Apple
supplies this non-optional-per-C11 header. On a Mac, #include
<uchar.h> fails for C (there is no $SDK/usr/include/uchar.h) but works
for C++ (it finds $SDK/usr/include/c++/v1/uchar.h), and since we'd
probe for HAVE_UCHAR_H with the C compiler, we'd not find it and thus
also need to exclude __cplusplus at compile time. Otherwise, let's
see what the error looks like...
test.cpp:2:22: error: cannot combine with previous 'int' declaration specifier
2 | typedef unsigned int char32_t;
| ^
test.cpp:2:1: warning: typedef requires a name [-Wmissing-declarations]
2 | typedef unsigned int char32_t;
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 warning and 1 error generated.
GCC has a clearer message:
test.cpp:2:22: error: redeclaration of C++ built-in type 'char32_t'
[-fpermissive]
2 | typedef unsigned int char32_t;
| ^~~~~~~~
If you try to test for the existence of the type rather than the
header in meson/configure, won't you still have the configure-with-C
compile-with-C++ problem, with no way to resolve it except by keeping
the test for __cplusplus that you're trying to get rid of? So what do
you gain other than more lines of configure stuff?
Out of curiosity, even with -std=c++03 (old C++ standard that might
not work for PostgreSQL for other reasons, but I wanted to see what
would happen with a standard before char32_t became a fundamental
language type) I was surprised to see that the standard library
supplied char32_t. It incorrectly(?) imports a typename from the
future standards using an internal type, so our typedef still fails,
just with a different Clang error:
test.cpp:2:22: error: typedef redefinition with different types
('unsigned int' vs 'char32_t')
2 | typedef unsigned int char32_t;
| ^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/c++/v1/__config:320:20:
note: previous definition is here
320 | typedef __char32_t char32_t;
| ^
> The checks for __STDC_UTF_16__ and __STDC_UTF_32__ can be removed, as
> was discussed elsewhere, since we don't use any standard library
> functions that make use of these facts, and the need goes away with C23
> anyway.
+1
On Wed, 2025-10-29 at 09:03 +1300, Thomas Munro wrote:
> If you try to test for the existence of the type rather than the
> header in meson/configure, won't you still have the configure-with-C
> compile-with-C++ problem
I must have misunderstood the first time. If we depend on
HAVE_CHAR32_T, then it will be set in stone in pg_config.h, and if C++
tries to include the file then it will try the typedef again and fail.
I tried with headerscheck --cplusplus before posting it, but because my
machine has uchar.h, then it didn't fail.
I went back to using the check for __cplusplus, and added a comment
that hopefully clarifies things.
I also reordered the checks so that it prefers to include uchar.h if
available, even when using C++, because that seems like the cleaner end
goal. However, that caused another problem in CI (mingw_cross_warning),
apparently due to a conflict between uchar.h and win32_port.h on that
platform:
[21:48:21.794] ../../src/include/port/win32_port.h: At top level:
[21:48:21.794] ../../src/include/port/win32_port.h:254:8: error:
redefinition of ‘struct stat’
[21:48:21.794] 254 | struct stat
/* This should match struct __stat64 */
[21:48:21.794] | ^~~~
[21:48:21.794] In file included from /usr/share/mingw-
w64/include/wchar.h:413,
[21:48:21.794] from /usr/share/mingw-
w64/include/uchar.h:28,
[21:48:21.794] from ../../src/include/c.h:526:
[21:48:21.794] /usr/share/mingw-w64/include/_mingw_stat64.h:40:10:
note: originally defined here
[21:48:21.794] 40 | struct stat {
[21:48:21.794] | ^~~~
https://cirrus-ci.com/task/4849300577976320
I could reverse the checks again and I think it will work, but let me
know if you have an idea for a better fix.
I never thought it would be so much trouble just to get a suitable type
for a UTF-32 code point...
Regards,
Jeff Davis
Attachments
On Wed, Oct 29, 2025 at 6:59 AM Jeff Davis <pgsql@j-davis.com> wrote:
> So you're saying that pg_wchar is more like a union type?
>
> typedef union
> {
>     char     ch;     /* single-byte encodings or non-UTF8 encodings on unix */
>     char16_t utf16;  /* windows non-UTF8 encodings */
>     char32_t utf32;  /* UTF-8 encoding */
> } pg_wchar;
>
> (we'd have to be careful about the memory layout if we're casting,
> though)
Interesting idea. I think it'd have to be something like:
typedef union
{
    unsigned char ch;              /* (1) single-byte encoding databases */
    char32_t      utf32;           /* (2) UTF-8 databases */
    uint32_t      ascii_or_custom; /* (3) MULE, EUC_XX databases */
} pg_wchar;
Dunno if it's worth actually doing, but it's a good illustration and a
better way to explain all this than the wall of text I wrote
yesterday. The collusion between common/wchar.c and pg_locale_libc.c
is made more explicit.
I wonder if the logic to select the member/semantics could be turned
into an enum in the encoding table, to make it even clearer, and then
that could be used as an index into a table of ctype method objects
in _libc.c. The encoding module would be declaring which pg_wchar
semantics it uses, instead of having the _libc.c module infer it from
other properties, for a more explicit contract. Or since they are
inferrable, perhaps a function in the mb module could do that and
return the enum. Hmm, perhaps that alone would be clarifying enough,
without the union type. I'm picturing something like PG_WCHAR_CHAR
(directly usable with ctype.h), PG_WCHAR_UTF32 (self-explanatory, also
assumed to be compatible with UTF-8 locales' wchar_t), PG_WCHAR_CUSTOM
(we only know that ASCII range is sane as Ishii-san explained, and for
anything else you'd need to re-encode via libc or give up, but
preferably not go nuts and return junk). The enum would create a new
central place to document the cross-module semantics.
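
Something like this, name-wise (purely a sketch; PgWcharEncodingScheme
and the PG_WCHAR_* names are as imagined above):

/* sketch: how a given database encoding's pg_wchar values are to be
 * interpreted by consumers such as pg_locale_libc.c */
typedef enum PgWcharEncodingScheme
{
    PG_WCHAR_CHAR,      /* zero-extended bytes; 8-bit ctype.h applies */
    PG_WCHAR_UTF32,     /* Unicode code points (UTF-8 databases) */
    PG_WCHAR_CUSTOM     /* encoding-specific; only ASCII is known-sane */
} PgWcharEncodingScheme;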
You showed char16_t for Windows, but we don't ever get char16_t out of
wchar.c, it's always char32_t for UTF-8 input. It's just that _libc.c
truncates to UTF-16 or short-circuits to avoid overflow on that
platform (and in the past AIX 32-bit and maybe more), so it wouldn't
belong in a hypothetical union or enum.
> > To avoid doing hard work for nothing (ideogram-based languages
> > generally don't care about ctype stuff so that'd be the vast
> > majority of characters appearing in Chinese/Japanese/Korean text)
> > at the cost of having to do a bunch of research, we could
> > short-circuit the core CJK character ranges, and do the extra CPU
> > cycles for the rest,
>
> I don't think we should start making a bunch of assumptions like that.
Yeah, maybe not. Thought process: I had noticed that EUC was the only
relevant encoding family, and it has a character set selector, CS0 =
ASCII, and CS1, CS2, CS3 defined appropriately by the national
variants. I had noticed that at least the Japanese one can encode
Latin with accents, Greek etc (non-ASCII stuff that has a meaningful
isalpha() etc) and I took a wild guess that it might be easy to
distinguish them if they'd chosen to put those under a different CS
number. But I see now that they actually stuffed them all into CS1
along with kanji and kana, making it slightly more difficult: they're
still in different assigned "rows" though. At a guess, you can
probably identify extra punctuation (huh, that's surely relevant even
for pure Japanese text if we want ispunct to work?) and foreign
alphabets with some bitmasks. There might be something similar for
the other EUCs.
It's true that it's really not nice to carry special knowledge like
that (it's not just "assumptions", it's a set of black and white
published standards), and we should probably try hard to avoid that.
Perhaps we could at least put the conversion in a new encoding table
function pointer "pg_wchar_custom_to_wchar_t", so we could reserve a
place to put that sort of optimisation in (as opposed to making
_libc.c call char2wchar() with no hope of fast path)... that is, if
we want to do any of this at all and not just make new ctype functions
that return false for PG_WCHAR_CUSTOM with value >= 128 and call it a
day...
If we do develop this idea though, one issue to contemplate is that
EUC code points might generate more than one wchar_t, looking at
EUC_JIS_2004[1]. We'd need a pg_wchar_custom_to_wchar_t() signature
that takes a single pg_wchar and writes to an output array and returns
the count, and then we'd have to decide what to do if we get more than
one. Surrogates are trivial under the existing "punt" doctrine:
Windows went big on Unicode before it grew, C doesn't do wctype for
multi-wchar_t sequences, and we can't fix any of that. If it's a
(rare?) combining character sequence then uhh... same problem one
level up, I think, even on Unix? I'm not sure if we could do much
better than the "punt" path in both cases: return either false or the
input character as appropriate.
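
(i.e. a hypothetical shape along these lines:)

/* hypothetical: convert one pg_wchar to zero or more native wchar_t
 * values; returns the count written to out, up to outlen */
extern int pg_wchar_custom_to_wchar_t(pg_wchar c, wchar_t *out, int outlen);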
> > 3. I assume there are some good reasons we don't do this but... if
> > we used char2wchar() in the first place (= libc native wchar_t) for the
> > regexp stuff that calls this stuff (as we do already inside
> > whole-string upper/lower, just not character upper/lower or character
> > classification), then we could simply call the wchar_t libc functions
> > directly and unconditionally in the libc provider for all cases,
> > instead of the 8-bit variants with broken edge cases for non-UTF-8
> > databases.
>
> I'm not sure about that either, but I think it's because you can end up
> with surrogate pairs, which can't be represented in UTF-8.
Yeah, I think that alone is a good reason. We need PG_WCHAR_UTF32 (in
the sketch terminology above).
I wondered about PG_WCHAR_SYSTEM_WCHAR_T, that could potentially
replace PG_WCHAR_CUSTOM, in other words using system wchar_t but only
for EUC_*. The point of this would be for eg regexes to be able to
convert whole strings up-front with one libc call, rather than calling
for each character. The problem seems to be that you'd lose any
ability to deal with surrogates and combining characters as discussed
above, as you'd lose character synchronisation for want of a better
word. So I just can't see how to make this work. Which leads back to
the do-it-one-by-one idea, which then leads back to the
maybe-try-to-make-a-fast-path-for-kanji-etc idea 'cos otherwise it
sounds too expensive...
[1] https://en.wikipedia.org/wiki/JIS_X_0213
On Wed, 2025-10-29 at 14:00 +1300, Thomas Munro wrote:
> I wonder if the logic to select the member/semantics could be turned
> into an enum in the encoding table, to make it even clearer, and then
> that could be used as an index into a table of ctype method objects
> in _libc.c.
As long as we're able to isolate that logic in the libc provider,
that's reasonable. The other providers don't need that complexity, they
just need to decode straight to UTF-32.
> You showed char16_t for Windows, but we don't ever get char16_t out
> of wchar.c, it's always char32_t for UTF-8 input. It's just that
> _libc.c truncates to UTF-16 or short-circuits to avoid overflow on
> that platform (and in the past AIX 32-bit and maybe more), so it
> wouldn't belong in a hypothetical union or enum.
Oh, I see.
> >
> Perhaps we could at least put the conversion in a new encoding table
> function pointer "pg_wchar_custom_to_wchar_t", so we could reserve a
> place to put that sort of optimisation in
That sounds like a good step forward. And maybe one to convert to
UTF-32 for ICU, also?
> If we do develop this idea though, one issue to contemplate is that
> EUC code points might generate more than one wchar_t, looking at
> EUC_JIS_2004[1].
Wow, that's unfortunate.
Regards,
Jeff Davis
On Wed, Oct 29, 2025 at 2:00 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> I'm picturing something like PG_WCHAR_CHAR
> (directly usable with ctype.h), PG_WCHAR_UTF32 (self-explanatory, also
> assumed to be compatible with UTF-8 locales' wchar_t), PG_WCHAR_CUSTOM
> (we only know that ASCII range is sane as Ishii-san explained, and for
> anything else you'd need to re-encode via libc or give up, but
> preferably not go nuts and return junk). The enum would create a new
> central place to document the cross-module semantics.
Here are some sketch-quality patches to try out some of these ideas,
for discussion. I gave them .txt endings so as not to hijack your
thread's CI.
* Fixing a different but related bug spotted in passing: we truncate
codepoints passed to Windows' iswalpha_l() et al, instead of detecting
overflow like some other places do. Not tested on Windows, but it
seemed pretty obviously wrong?
* Classifying all pg_wchar encodings as producing PG_WCHAR_CHAR,
PG_WCHAR_UTF32 or PG_WCHAR_CUSTOM, and dispatching to libc ctype
methods based on that.
* Easy EUC change: filtering out non-ASCII for _CUSTOM. I can't seem
to convince SQL-level regexes to expose bogus results on master
though... maybe the pg_wchar encoding actively avoids that by shifting
values up so you often or always cast to a harmless value? Still
better to formalise that I think, if we don't move ahead with the more
ambitious plan...
* More ambitious re-encoding strategy, replacing previous change, with
apparently plausible results.
* Various refactorings with helper macros to avoid making mistakes in
all that repetitive wrapper stuff.
Here's what my ja_JP.eucJP database shows, on FreeBSD. BTW in my
earlier emails I was confused and thought that kanji would not be in
class [[:alpha:]], but that's wrong: Unicode calls it "other letter",
and it looks like that makes all modern libcs return true for
iswalpha():
postgres=# select regexp_replace('1234 Постгрес 5678', '[[:alpha:]]+', '象');
regexp_replace
----------------
1234 象 5678
(1 row)
postgres=# select regexp_replace('1234 ポスグレ 5678', '[[:alpha:]]+', '象');
regexp_replace
----------------
1234 象 5678
(1 row)
postgres=# select regexp_replace('1234 ポスグレ? 5678', '[[:punct:]]+', '。');
regexp_replace
----------------------
1234 ポスグレ。 5678
(1 row)
(That's not an ASCII question mark, it's one of the kanji-box sized
punctuation characters.)
I had to hack regc_pg_locale.c slightly to teach it that just because
I set max_chr to 127 it doesn't mean I want it to turn locale support
off. Haven't looked into that code to figure out what it should do
instead, but it definitely shouldn't be allowed to probe made up
pg_wchar values, because EUC's pg_wchar encoding is sparse and
transcoding can error out.
A mystery that blocked me for too long: regexp_match('café', 'CAFÉ',
'i') and regexp_match('Αθήνα', 'ΑΘΉΝΑ', 'i') match with Apple's
ja_JP.eucJP as do the examples above, but mysteriously didn't on
FreeBSD's where this code started, could be a bug in its ja_JP.eucJP
locale affecting toupper/tolower... Wish I could get that time back.
I imagine that for the ICU + non-UTF-8 locale bug you mentioned, we
might need a very similar set of re-encoding wrappers: something like
pg_wchar -> mb -> UTF-8 -> UTF-32. All this re-encoding sounds
pretty bad, but I can't see any way around the re-encoding with these
edge-case configurations, and we're still supposed to spit out the
right answers...
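
(An untested sketch of what one character's trip through that pipeline
could look like, using existing conversion routines; the wrapper name
is made up and error handling is omitted:)

/* hypothetical: pg_wchar (database encoding) -> multibyte -> UTF-8 ->
 * code point, for one character; assumes c came from our own
 * transcoding routines and that the conversion succeeds */
static char32_t
pg_wchar_to_utf32_via_conversion(pg_wchar c)
{
    char        mb[MAX_MULTIBYTE_CHAR_LEN + 1];
    pg_wchar    in[2] = {c, 0};
    unsigned char *u8;

    pg_wchar2mb_with_len(in, mb, 1);
    u8 = pg_do_encoding_conversion((unsigned char *) mb, strlen(mb),
                                   GetDatabaseEncoding(), PG_UTF8);
    return utf8_to_unicode(u8);
}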
Attachments:
- 0001-Fix-Windows-wctype.h-usage-for-codepoints-outside-Unic.txt
- 0002-Formalize-pg_wchar-encoding-schemes.txt
- 0003-Fix-corrupted-ctype.h-handling-for-non-ASCII-in-EUC-en.txt
- 0004-Support-wctype.h-classification-for-EUC-encodings.txt
- 0005-XXX-work-around-regc_pg_locale.c-s-probing-logic.txt
- 0006-Improve-naming-of-libc-collation-functions.txt
- 0007-Use-compact-notation-for-isXXX_l-wrappers.txt
- 0008-Use-compact-notation-for-toupper-tolower-wrappers.txt
On 28.10.25 22:54, Jeff Davis wrote:
> I went back to using the check for __cplusplus, and added a comment
> that hopefully clarifies things.

Yes, that looks more helpful now.
On Tue, 2025-10-28 at 14:54 -0700, Jeff Davis wrote:
> [21:48:21.794] ../../src/include/port/win32_port.h: At top level:
> [21:48:21.794] ../../src/include/port/win32_port.h:254:8: error:
> redefinition of ‘struct stat’
> [21:48:21.794] 254 | struct stat
> /* This should match struct __stat64 */
> [21:48:21.794] | ^~~~
> [21:48:21.794] In file included from /usr/share/mingw-
> w64/include/wchar.h:413,
> [21:48:21.794] from /usr/share/mingw-
> w64/include/uchar.h:28,
> [21:48:21.794] from ../../src/include/c.h:526:
> [21:48:21.794] /usr/share/mingw-w64/include/_mingw_stat64.h:40:10:
> note: originally defined here
> [21:48:21.794] 40 | struct stat {
> [21:48:21.794] | ^~~~
>
> https://cirrus-ci.com/task/4849300577976320
It seems to work on the two windows CI instances just fine, but fails
mingw_cross_warning.
Apparently, <uchar.h> somehow includes (some portion of?) <sys/stat.h>
on that platform, which then conflicts with the hackery done in
<win32_port.h> (which expects to include <sys/stat.h> itself after some
special #defines).
The attached patch moves the inclusion of <uchar.h> after "port.h",
which solves the problem. It's a bit out of place, but I added a note
in the comment explaining why. I'll go ahead and commit.
Regards,
Jeff Davis
Attachments
On Thu, 2025-10-30 at 04:25 +1300, Thomas Munro wrote:
> Here are some sketch-quality patches to try out some of these ideas,
> for discussion. I gave them .txt endings so as not to hijack your
> thread's CI.

I like the direction this is going. I will commit the char32_t work
anyway, so afterward feel free to hijack the thread (there's a lot of
good information here so continuing here might be more productive than
starting a new thread).

Regarding 0002, IIUC, for PG_WCHAR_UTF32, surrogates are forbidden, but
the comment about UTF-16 is a bit vague. I think we should add some
asserts to make it clear.

The basic communication mechanism between the modules is the database
encoding: it determines PgWcharEncodingScheme in both wchar.c and
pg_locale_libc.c. That seems reasonable to me, and doesn't interfere
with the other providers. I'm still not quite sure how this fits with
ICU in a single-byte encoding, but it doesn't seem worse than what we
do currently.

Also, tangentially, I'm a bit anxious to do a permanent
setlocale(LC_CTYPE, "C"), and we are very close. If these two threads
are successful, I believe we can do it:

https://www.postgresql.org/message-id/90f176c5b85b9da26a3265b2630ece3552068566.camel%40j-davis.com
https://www.postgresql.org/message-id/d9657a6e51aa20702447bb2386b32fea6218670f.camel@j-davis.com

That would be a big simplification because it would isolate libc ctype
behavior to pg_locale_libc.c. That would make me feel generally more
comfortable with additional work in this area.

Regards,
Jeff Davis