Thread: C11: should we use char32_t for unicode code points?


C11: should we use char32_t for unicode code points?

From: Jeff Davis
Date:
Now that we're using C11, should we use char32_t for unicode code
points?

Right now, we use pg_wchar for two purposes: 

  1. to abstract away some problems with wchar_t on platforms where
it's 16 bits; and
  2. hold unicode code point values

In UTF8, they are equivalent and can be freely cast back and forth,
but not necessarily in other encodings. That can be confusing in some
contexts. Attached is a patch to use char32_t for the second purpose.

Both are equivalent to uint32, so there's no functional change and no
actual typechecking, it's just for readability.

Is this helpful, or needless code churn?

Regards,
    Jeff Davis


Attachments

Re: C11: should we use char32_t for unicode code points?

From: Tatsuo Ishii
Date:
> Now that we're using C11, should we use char32_t for unicode code
> points?
>
> Right now, we use pg_wchar for two purposes: 
>
>   1. to abstract away some problems with wchar_t on platforms where
> it's 16 bits; and
>   2. hold unicode code point values
>
> In UTF8, they are equivalent and can be freely cast back and forth,
> but not necessarily in other encodings. That can be confusing in some
> contexts. Attached is a patch to use char32_t for the second purpose.
>
> Both are equivalent to uint32, so there's no functional change and no
> actual typechecking, it's just for readability.
>
> Is this helpful, or needless code churn?

Unless char32_t is solely used for the Unicode code point data, I
think it would be better to define something like "pg_unicode" and use
it instead of directly using char32_t because it would be cleaner for
code readers.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese: http://www.sraoss.co.jp



Re: C11: should we use char32_t for unicode code points?

From: Jeff Davis
Date:
On Fri, 2025-10-24 at 18:43 +0900, Tatsuo Ishii wrote:
>
> Unless char32_t is solely used for the Unicode code point data, I
> think it would be better to define something like "pg_unicode" and
> use
> it instead of directly using char32_t because it would be cleaner for
> code readers.

That was my original idea, but then I saw that apparently char32_t is
intended for Unicode code points:

https://www.gnu.org/software/gnulib/manual/html_node/The-char32_005ft-type.html

But I am also OK with a new type if others find it more readable.

Regards,
    Jeff Davis




Re: C11: should we use char32_t for unicode code points?

From: Thomas Munro
Date:
On Sat, Oct 25, 2025 at 4:25 AM Jeff Davis <pgsql@j-davis.com> wrote:
> On Fri, 2025-10-24 at 18:43 +0900, Tatsuo Ishii wrote:
> > Unless char32_t is solely used for the Unicode code point data, I
> > think it would be better to define something like "pg_unicode" and
> > use
> > it instead of directly using char32_t because it would be cleaner for
> > code readers.
>
> That was my original idea, but then I saw that apparently char32_t is
> intended for Unicode code points:
>
> https://www.gnu.org/software/gnulib/manual/html_node/The-char32_005ft-type.html

It's definitely a codepoint but C11 only promised UTF-32 encoding if
__STDC_UTF_32__ is defined to 1, and otherwise the encoding is
unknown.  The C23 standard resolved that insanity and required UTF-32,
and there are no known systems[1] that didn't already conform, but I
guess you could static_assert(__STDC_UTF_32__, "char32_t must use
UTF-32 encoding").  It's also defined as at least, not exactly, 32
bits but we already require the machine to have uint32_t so it must be
exactly 32 bits for us, and we could static_assert(sizeof(char32_t) ==
4) for good measure.  So all up, the standard type matches our
existing assumptions about pg_wchar *if* the database encoding is
UTF8.

IIUC you're proposing that all the stuff that only works when database
encoding is UTF8 should be flipped over to the new type, and that
seems like a really good idea to me: remaining uses of pg_wchar would
be warnings that the encoding is only conditionally known.  It'd be
documentation without new type safety though: for example I think you
missed a spot, the return type of the definition of utf8_to_unicode()
(I didn't search exhaustively).  Only in C++ is it a distinct type
that would catch that and a few other mistakes.

Do you consider explicit casts between eg pg_wchar and char32_t to be
useful documentation for humans, when coercion should just work?  I
kinda thought we were trying to cut down on useless casts, they might
signal something but can also hide bugs.  Should the few places that
deal in surrogates be using char16_t instead?

I wonder if the XXX_libc_mb() functions that contain our hard-coded
assumptions that libc's wchar_t values are in UTF-16 or UTF-32 should
use your to_char32_t() too (probably with a longer name
pg_wchar_to_char32_t() if it's in a header for wider use).  That'd
highlight the exact points at which we make that assumption and
centralise the assertion about database encoding, and then the code
that compares with various known cut-off values would be clearly in
the char32_t world.

> But I am also OK with a new type if others find it more readable.

Adding yet another name to this soup doesn't immediately sound like it
would make anything more readable to me.  ISO has standardised this
for the industry, so I'd vote for adopting it without indirection that
makes the reader work harder to understand what it is.  The churn
doesn't seem excessive either, it's fairly well contained stuff
already moving around a lot in recent releases with all your recent
and ongoing revamping work.

There is one small practical problem though: Apple hasn't got around
to supplying <uchar.h> in its C SDK yet.  It's there for C++ only, and
isn't needed for the type in C++ anyway.  I don't think that alone
warrants a new name wart, as the standard tells us it must match
uint_least32_t so we can just define it ourselves if
!defined(__cplusplus) && !defined(HAVE_UCHAR_H), until Apple gets
around to that.

Since it confused me briefly: Apple does provide <unicode/uchar.h> but
that's a coincidentally named ICU header, and on that subject I see
that ICU hasn't adopted these types yet but there are some hints that
they're thinking about it; meanwhile their C++ interfaces have begun
to document that they are acceptable in a few template functions.

All other target systems have it AFAICS.  Windows: tested by CI,
MinGW: found discussion, *BSD, Solaris, Illumos: found man pages.

As for the conversion functions in <uchar.h>, they're of course
missing on macOS but they also depend on the current locale, so it's
almost like C, POSIX and NetBSD have conspired to make them as useless
to us as possible.  They solve the "size and encoding of wchar_t is
undefined" problem, but there are no _l() variants and we can't depend
on uselocale() being available.  Probably wouldn't be much use to us
anyway considering our more complex and general transcoding
requirements, I just thought about this while contemplating
hypothetical pre-C23 systems that don't use UTF-32, specifically what
would break if such a system existed: probably nothing as long as you
don't use these.  I guess another way you could tell would be if you
used the fancy new U-prefixed character/string literal syntax, but I
can't see much need for that.

In passing, we seem to have a couple of mentions of "pg_wchar_t"
(bogus _t) in existing comments.

[1] https://thephd.dev/c-the-improvements-june-september-virtual-c-meeting



Re: C11: should we use char32_t for unicode code points?

From: Jeff Davis
Date:
On Sat, 2025-10-25 at 16:21 +1300, Thomas Munro wrote:
> I
> guess you could static_assert(__STDC_UTF_32__, "char32_t must use
> UTF-32 encoding").

Done.

>   It's also defined as at least, not exactly, 32
> bits but we already require the machine to have uint32_t so it must
> be
> exactly 32 bits for us, and we could static_assert(sizeof(char32_t)
> ==
> 4) for good measure.

What would be the problem if it were larger than 32 bits?

I don't mind adding the asserts, but it's slightly awkward because
StaticAssertDecl() isn't defined yet at the point we are including
uchar.h.

> IIUC you're proposing that all the stuff that only works when
> database
> encoding is UTF8 should be flipped over to the new type, and that
> seems like a really good idea to me: remaining uses of pg_wchar would
> be warnings that the encoding is only conditionally known.

Exactly. The idea is to make pg_wchar stand out more as a platform-
dependent (or encoding-dependent) representation, and remove the doubt
when someone sees char32_t.

>   It'd be
> documentation without new type safety though: for example I think you
> missed a spot, the return type of the definition of utf8_to_unicode()
> (I didn't search exhaustively).

Right, it's not offering type safety. Fixed the omission.

> Do you consider explicit casts between eg pg_wchar and char32_t to be
> useful documentation for humans, when coercion should just work?  I
> kinda thought we were trying to cut down on useless casts, they might
> signal something but can also hide bugs.

The patch doesn't add any explicit casts, except in to_char32() and
to_pg_wchar(), so I assume that the callsites of those functions are
what you meant by "explicit casts"?

We can get rid of those functions if you want. The main reason they
exist is for a place to comment on the safety of converting pg_wchar to
char32_t. I can put that somewhere else, though.

>   Should the few places that
> deal in surrogates be using char16_t instead?

Yes, done.

> I wonder if the XXX_libc_mb() functions that contain our hard-coded
> assumptions that libc's wchar_t values are in UTF-16 or UTF-32 should
> use your to_char32_t() too (probably with a longer name
> pg_wchar_to_char32_t() if it's in a header for wider use).

I don't think those functions do depend on UTF-32. iswalpha(), etc.,
take a wint_t, which is just a wchar_t that can also be WEOF.

And if we don't use to_char32/to_pg_wchar in there, I don't see much
need for it outside of pg_locale_builtin.c, but if the need arises we
can move it to a header file and give it a longer name.

>   That'd
> highlight the exact points at which we make that assumption and
> centralise the assertion about database encoding, and then the code
> that compares with various known cut-off values would be clearly in
> the char32_t world.

The asserts about UTF-8 in pg_locale_libc.c are there because the
previous code only took those code paths for UTF-8, and I preserved
that. Also there is some code that depends on UTF-8 for decoding, but I
don't think anything in there depends on UTF-32 specifically.

> There is one small practical problem though: Apple hasn't got around
> to supplying <uchar.h> in its C SDK yet.  It's there for C++ only,
> and
>  isn't needed for the type in C++ anyway.  I don't think that alone
> warrants a new name wart, as the standard tells us it must match
> uint_least32_t so we can just define it ourselves if
> !defined(__cplusplus) && !defined(HAVE_UCHAR_H), until Apple gets
> around to that.

Thank you, I added a configure test for uchar.h and some more
preprocessor logic in c.h.

> Since it confused me briefly: Apple does provide <unicode/uchar.h>
> but
> that's a coincidentally named ICU header, and on that subject I see
> that ICU hasn't adopted these types yet but there are some hints that
> they're thinking about it; meanwhile their C++ interfaces have begun
> to document that they are acceptable in a few template functions.

Even when they fully move to char32_t, we will still have to support
the older ICU versions for a long time.

> All other target systems have it AFAICS.  Windows: tested by CI,
> MinGW: found discussion, *BSD, Solaris, Illumos: found man pages.

Great, thank you!

> They solve the "size and encoding of wchar_t is
> undefined" problem

One thing I never understood about this is that it's our code that
converts from the server encoding to pg_wchar (e.g.
pg_latin12wchar_with_len()), so we must understand the representation
of pg_wchar. And we cast directly from pg_wchar to wchar_t, so we
understand the encoding of wchar_t, too, right?


>  In passing, we seem to have a couple of mentions of "pg_wchar_t"
> (bogus _t) in existing comments.

Thank you. I'll fix that separately.

Regards,
    Jeff Davis


Attachments

Re: C11: should we use char32_t for unicode code points?

From: Thomas Munro
Date:
On Mon, Oct 27, 2025 at 8:43 AM Jeff Davis <pgsql@j-davis.com> wrote:
> What would be the problem if it were larger than 32 bits?

Hmm, OK fair question, I can't think of any, I was just working
through the standard and thinking myopically about the exact
definition, but I think it's actually already covered by other things
we assume/require (ie the existence of uint32_t forces the size of
char32_t if you follow the chain of definitions backwards), and as you
say it probably doesn't even matter.  I suppose you could also skip
the __STDC_UTF_32__ assertion given that we already make a larger
assumption about wchar_t encoding, and it seems to be exhaustively
established that no implementation fails to conform to C23 for
char32_t (see earlier link to Meneide's blog).  I don't personally
understand what C11 was smoking when it left that unspecified for
another 12 years.

> > I wonder if the XXX_libc_mb() functions that contain our hard-coded
> > assumptions that libc's wchar_t values are in UTF-16 or UTF-32 should
> > use your to_char32_t() too (probably with a longer name
> > pg_wchar_to_char32_t() if it's in a header for wider use).
>
> I don't think those functions do depend on UTF-32. iswalpha(), etc.,
> take a wint_t, which is just a wchar_t that can also be WEOF.

I was noticing that toupper_libc_mb() directly tests if a pg_wchar
value is in the ASCII range, which only makes sense given knowledge of
pg_wchar's encoding, so perhaps that should trigger this new coding
rule.  But I agree that's pretty obscure...  feel free to ignore that
suggestion.

Hmm, the comment at the top explains that we apply that special ASCII
treatment for default locales and not non-default locales, but it
doesn't explain *why* we make that distinction.  Do you know?

> One thing I never understood about this is that it's our code that
> converts from the server encoding to pg_wchar (e.g.
> pg_latin12wchar_with_len()), so we must understand the representation
> of pg_wchar. And we cast directly from pg_wchar to wchar_t, so we
> understand the encoding of wchar_t, too, right?

Right, we do know the encoding of pg_wchar in every case (assuming
that all pg_wchar values come from our transcoding routines).  We just
don't know if that encoding is also the one used by libc's
locale-sensitive functions that deal in wchar_t, except when the
locale is one that uses UTF-8 for char encoding, in which case we
assume that every libc must surely use Unicode codepoints in wchar_t.
That probably covers the vast majority of real world databases in the
UTF-8 age, and no known system fails to meet this expectation.  Of
course the encoding used by every libc for non-UTF-8 locales is
theoretically knowable too, but since they vary and in some cases are
not even documented, it would be too painful to contemplate any
dependency on that.

Let me try to work through this in more detail...  corrections
welcome, but this is what I have managed to understand about this
module so far, in my quest to grok PostgreSQL's overall character
encoding model (and holes therein):

For locales that use UTF-8 for char, we expect libc to understand
pg_wchar/wchar_t/wint_t values as UTF-32 or at a stretch UTF-16.  The
expected source of these pg_wchar values is our various regexp code
paths that will use our mbutils pg_wchar conversion to UTF-32, with a
reasonable copying strategy for sizeof(wchar_t) == 2 (that's Windows
and I think otherwise only AIX in 32 bit builds, if it comes back).
If any libc didn't use Unicode codepoints in its locale-sensitive
wchar_t functions for UTF-8 locales we'd get garbage results, but we
don't know of any such system.  It's a bit of a shame that C11 didn't
introduce the obvious isualpha(char32_t) variants for a
standard-supported version of that realpolitik we depend on, but
perhaps one day...

There is one minor quirk here that it might be nice to document in top
comment section 2: on Windows we also expect wchar_t to be understood
by system wctype functions as UTF-16 for locales that *don't* use
UTF-8 for char (an assumption that definitely doesn't hold on many
Unixen).  That is important because on Windows we allow non-UTF-8
locales to be used in UTF-8 databases for historical reasons.

For single-byte encodings: pg_latin12wchar_with_len() just
zero-extends the bytes to pg_wchar, so when the pg_locale_libc.c
functions truncate them and call 8-bit ctype stuff eg isalpha_l(), it
completes a perfect round trip inside our code.  (BTW
pg_latin12wchar_with_len() has the same definition as
pg_ascii2wchar_with_len(), and is used for many single-byte encodings
other than LATIN1 which makes me wonder why we don't just have a
single function pg_char2wchar_with_len() that is used by all "simple
widening" cases.)  We never know or care which encoding libc would
itself use for these locales' wchar_t, as we don't ever pass it a
wchar_t.  Assuming I understood that correctly, I think it would be
nice if the "100% correct for LATINn" comment stated the reason for
that certainty explicitly, ie that it closes an information-preserving
round-trip beginning with the coercion in pg_latin12wchar_with_len()
and that libc never receives a wchar_t/wint_t that we fabricated.

A bit of a digression, which I *think* is out-of-scope for this
module, but just while I'm working through all the implications:  This
could produce unspecified results if a wchar_t from another source
ever arrived into these functions eg wchar_t made by libc or
L"literal" made by the compiler, both unspecified.  In practice, a
wchar_t of non-PostgreSQL origin that is truncated to 8 bits would
probably still give a sensible result for codepoints 0-127 (= 7 bit
subset of Unicode, and we require all server encodings to be supersets
of ASCII), and 0-255 for LATIN1 (= 8 bit subset of Unicode), because:
the two main approaches to single-byte char -> wchar_t conversion in
libc implementations seem to be conversion to Unicode (Windows,
glibc?), and simply casting char to wchar_t (I think this is probably
what *BSD and Solaris do for single-byte non-UTF-8 locales leading to
the complaint that wchar_t encoding is locale-dependent on those
systems, though I haven't checked in detail, and that's of course also
exactly what our own conversion does), so I think that means that
128-255 would give nonsense results for non-LATIN1 single byte
encodings on Windows or glibc (?) but perhaps not other Unixen.  For
example, take ISO 8859-7, the legacy single byte encoding for Greek:
it encodes α as 0xe1, and Windows and glibc (?) would presumably
encode that as (wchar_t) 0x03b1 (the Unicode codepoint), and then
wc_isalpha_libc_sb() would truncate that to 0xb1 which is ± in ISO
8859-7, so isalpha_l() would return false, despite α being the OG
alpha (not tested, just a thought experiment looking at tables).  But
since handling pg_wchar of non-PostgreSQL origin doesn't seem to be
one of our goals, there is no problem to fix here, it might just be
worthy of a note in that commentary: we don't try to deal with wchar_t
values not made by PostgreSQL, except where noted (non-escaping uses
of char2wchar() in controlled scopes).

For multi-byte encodings other than UTF-8, pg_locale_libc.c is
basically giving up almost completely, but could probably be tightened
up.  I can't imagine we'll ever add another multibyte encoding, and I
believe we can ignore MULE internal, as no libc supports it (so you
could only get here with the C locale where you'll get the garbage
results you asked for...  in fact I wonder why we need MULE internal at
all... it seems to be a sort of double-encoding for multiplexing other
encodings, so we can't exactly say it's not blessed by a standard,
it's indirectly defined by "all the standards" in a sense, but it's
also entirely obsoleted by Unicode's unification so I don't know what
problem it solves for anyone, or if anyone ever needed it in any
reasonable pg_upgrade window of history...).  Of server-supported
encodings, that leaves only EUC_* to think about.

The EUC family has direct encoding of 7-bit ASCII and then 3
selectable character sets represented by sequences with the high bit
set, with details varying between the Chinese (simplified Chinese),
Taiwanese (traditional Chinese), Japanese (2 kinds) and Korean
variants.  I don't know if the pg_wchar encoding we're producing in
pg_euc*2wchar_with_len() has a name, but it doesn't appear to match
the description of the standard "fixed" representation on the
Wikipedia page for Extended Unix Code (it's too wide for starters,
looking at the shift distances).  The main thing seems to be that we
simply zero-extend the ASCII range into a pg_wchar directly, so when
we cast it down to call 8-bit ctype functions, I expect we produce
correct results for ASCII characters... and then I don't know what but
I guess nothing good for 128-255, and then surely hot garbage for
everything else, cycling through the 0-255 answers repeatedly as we
climb the pg_wchar value range.  The key point being that it's *not* a
perfect information-preserving round-trip, as we achieve for
single-byte encodings.  Some ideas for improvements:

1.  Cheap but incomplete: use a different ctype method table that
short-circuits the results (false for isalpha et al, pass-through for
upper/lower) for pg_wchar >= 128 and uses the existing 8-bit ctype
functions for ASCII.

2.  More expensive but complete: handle ASCII range with existing
8-bit ctype functions, and otherwise convert our pg_wchar back to MB
char format and then use libc's mbstowcs_l() to make a wchar_t that
libc's wchar_t-based functions should understand.  To avoid doing hard
work for nothing (ideogram-based languages generally don't care about
ctype stuff so that'd be the vast majority of characters appearing in
Chinese/Japanese/Korean text) at the cost of having to do a bunch of
research, we could short-circuit the core CJK character ranges,
and do the extra CPU cycles for the rest, to catch the Latin +
accents, Greek, Cyrillic characters that are also supported in these
encodings for foreign names, variables in scientific language etc.  I
guess that implies a classifier that would be associated with ... the
encoding?  That would of course break if wchar_t values of
non-PostgreSQL origin arrive here, but see above note about nailing
down a contract that formally excludes that outside narrow
non-escaping sites.

3.  I assume there are some good reasons we don't do this but... if we
used char2wchar() in the first place (= libc native wchar_t) for the
regexp stuff that calls this stuff (as we do already inside
whole-string upper/lower, just not character upper/lower or character
classification), then we could simply call the wchar_t libc functions
directly and unconditionally in the libc provider for all cases,
instead of the 8-bit variants with broken edge cases for non-UTF-8
databases.  I didn't try to find the historical discussions, but I can
imagine already that we might not have done that because it has to
copy to cope with non-NULL-terminated strings, might perhaps have
weird incompatibilities with our own multibyte sequence detection,
might be slower (and/or might have been unusably broken ancient
libcs?), and it would only be appropriate for libc locales anyway and
yet now we have other locale providers that certainly don't want some
unspecified wchar_t encoding or libc involved.  It's also likely that
non-UTF-8 systems are of dwindling interest to anyone outside perhaps
client encodings (hence my attempt to ram home some simplifying
assumptions about that in that project to nail down some rules where
the encoding is fuzzy that I mentioned in a thread from a few months
ago).  So I'm not seriously suggesting this, just thinking out loud
about the corner we've painted ourselves into where idea #2's multiple
transcoding steps would be necessary to get the "right" answer for any
character in these encodings.  Hnngh.

In passing, I wonder why _libc.c has that comment about ICU in
parentheses.  Not relevant here.  I haven't thought much about whether
it's relevant in the ICU provider code (it may come back to that
do-we-accept-pg_wchar-we-didn't-make? question), but if it is then it
also applies to Windows and probably glibc in the libc provider and I
don't immediately see any problem (assuming no-we-don't! answer).



Re: C11: should we use char32_t for unicode code points?

From: Tatsuo Ishii
Date:
> The EUC family has direct encoding of 7-bit ASCII and then 3
> selectable character sets represented by sequences with the high bit
> set, with details varying between the Chinese (simplified Chinese),
> Taiwanese (traditional Chinese), Japanese (2 kinds) and Korean
> variants.  I don't know if the pg_wchar encoding we're producing in
> pg_euc*2wchar_with_len() has a name, but it doesn't appear to match
> the description of the standard "fixed" representation on the
> Wikipedia page for Extended Unix Code (it's too wide for starters,
> looking at the shift distances).

Yes. pg_euc*2wchar_with_len() starts from the "variable length"
representation of EUC, 1 to 4 bytes per character, and expands each
character into a pg_wchar. It can also easily be converted back to the
multibyte representation.

Note that the standard "fixed" representation of EUC includes ASCII
range bytes in *non*-ASCII characters, so I think it is not easy to
use as a backend-safe encoding.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese: http://www.sraoss.co.jp



Re: C11: should we use char32_t for unicode code points?

From: Jeff Davis
Date:
On Tue, 2025-10-28 at 15:40 +1300, Thomas Munro wrote:
> I was noticing that toupper_libc_mb() directly tests if a pg_wchar
> value is in the ASCII range, which only makes sense given knowledge
> of
> pg_wchar's encoding, so perhaps that should trigger this new coding
> rule.  But I agree that's pretty obscure...  feel free to ignore that
> suggestion.

I'm not sure that casting it to char32_t would be an improvement there.
Perhaps if we can find some ways to generally clarify things (some of
which you suggest below), that could be part of a follow-up.

It looks like the current patch is a step in the right direction, so
I'll commit that soon and see what the buildfarm says.

> Hmm, the comment at the top explains that we apply that special ASCII
> treatment for default locales and not non-default locales, but it
> doesn't explain *why* we make that distinction.  Do you know?

It makes some sense: I suppose someone thought that non-ASCII behavior
in the default locale is just too likely to cause problems. But the
non-ASCII behavior is allowed if you use a COLLATE clause.

But the pattern wasn't followed quite the same way with ICU, which uses
the given locale for UPPER()/LOWER() regardless of whether it's the
default locale or not. And for regexes, ICU doesn't use the locale at
all, it just uses u_isalpha(), etc., even if you use a COLLATE clause.

And there are still some places that call plain tolower()/toupper(),
such as fuzzystrmatch and ltree.

>
> Right, we do know the encoding of pg_wchar in every case (assuming
> that all pg_wchar values come from our transcoding routines).  We
> just
> don't know if that encoding is also the one used by libc's
> locale-sensitive functions that deal in wchar_t, except when the
> locale is one that uses UTF-8 for char encoding, in which case we
> assume that every libc must surely use Unicode codepoints in wchar_t.

Ah, right. We create pg_wchars for any encoding, but we only pass a
pg_wchar to a libc multibyte function in the UTF-8 encoding.

(Aside: we do pass pg_wchars directly to ICU as UTF-32 codepoints,
regardless of encoding, which is a bug.)


> For locales that use UTF-8 for char, we expect libc to understand
> pg_wchar/wchar_t/wint_t values as UTF-32 or at a stretch UTF-16.  The
> expected source of these pg_wchar values is our various regexp code
> paths that will use our mbutils pg_wchar conversion to UTF-32, with a
> reasonable copying strategy for sizeof(wchar_t) == 2 (that's Windows
> and I think otherwise only AIX in 32 bit builds, if it comes back).
> If any libc didn't use Unicode codepoints in its locale-sensitive
> wchar_t functions for UTF-8 locales we'd get garbage results, but we
> don't know of any such system.

Check.

>   It's a bit of a shame that C11 didn't
> introduce the obvious isualpha(char32_t) variants for a
> standard-supported version of that realpolitik we depend on, but
> perhaps one day...

Yeah...

> There is one minor quirk here that it might be nice to document in
> top
> comment section 2: on Windows we also expect wchar_t to be understood
> by system wctype functions as UTF-16 for locales that *don't* use
> UTF-8 for char (an assumption that definitely doesn't hold on many
> Unixen).  That is important because on Windows we allow non-UTF-8
> locales to be used in UTF-8 databases for historical reasons.

Interesting.

> For single-byte encodings: pg_latin12wchar_with_len() just
> zero-extends the bytes to pg_wchar, so when the pg_locale_libc.c
> functions truncate them and call 8-bit ctype stuff eg isalpha_l(), it
> completes a perfect round trip inside our code.

So you're saying that pg_wchar is more like a union type?

    typedef union pg_wchar
    {
       char ch;        /* single-byte encodings or
                          non-UTF8 encodings on unix */
       char16_t utf16; /* windows non-UTF8 encodings */
       char32_t utf32; /* UTF-8 encoding */
    } pg_wchar;

(we'd have to be careful about the memory layout if we're casting,
though)

>   (BTW
> pg_latin12wchar_with_len() has the same definition as
> pg_ascii2wchar_with_len(), and is used for many single-byte encodings
> other than LATIN1 which makes me wonder why we don't just have a
> single function pg_char2wchar_with_len() that is used by all "simple
> widening" cases.)

Sounds like a nice simplification.

>   We never know or care which encoding libc would
> itself use for these locales' wchar_t, as we don't ever pass it a
> wchar_t.

Ah, that makes sense.

>   Assuming I understood that correctly, I think it would be
> nice if the "100% correct for LATINn" comment stated the reason for
> that certainty explicitly, ie that it closes an information-
> preserving
> round-trip beginning with the coercion in pg_latin12wchar_with_len()
> and that libc never receives a wchar_t/wint_t that we fabricated.

Agreed, though I think some refactoring would be helpful to accompany
the comment. I've worked with this stuff a lot and I still find it hard
to keep everything in mind at once.

> A bit of a digression, which I *think* is out-of-scope for this
> module, but just while I'm working through all the implications: 
> This
> could produce unspecified results if a wchar_t from another source
> ever arrived into these functions

Ugh.

When I first started dealing with pg_wchar, I assumed it was just a
wider wchar_t to abstract away some of the complexity when
sizeof(wchar_t) == 2 (e.g. get rid of surrogate pairs). It's clearly
more complicated than that.

> For multi-byte encodings other than UTF-8, pg_locale_libc.c is
> basically giving up almost completely

Right.

> I
> believe we can ignore MULE internal, as no libc supports it (so you
> could only get here with the C locale where you'll get the garbage
> results you asked for...  in fact I wonder why we need MULE internal at
> all... it seems to be a sort of double-encoding for multiplexing
> other
> encodings, so we can't exactly say it's not blessed by a standard,
> it's indirectly defined by "all the standards" in a sense, but it's
> also entirely obsoleted by Unicode's unification so I don't know what
> problem it solves for anyone, or if anyone ever needed it in any
> reasonable pg_upgrade window of history...).

I have never heard of someone using it in production, and I wouldn't
object if someone wants to deprecate it.

> 2.  More expensive but complete: handle ASCII range with existing
> 8-bit ctype functions, and otherwise convert our pg_wchar back to MB
> char format and then use libc's mbstowcs_l() to make a wchar_t that
> libc's wchar_t-based functions should understand.

Correct. Sounds painful, but perhaps we could just do it and measure
the performance.
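The shape of option 2 might look like the sketch below. It is hypothetical: the stub stands in for the real pg_wchar2mb_with_len() (modeling only the single-byte case), a real version would use the _l() classification variants with the collation's locale_t rather than the global locale, and error handling is minimal:

```c
#include <ctype.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

typedef uint32_t pg_wchar;

/*
 * Stand-in for the real pg_wchar2mb_with_len(): for this sketch we only
 * model the single-byte case, where the pg_wchar is just the byte value.
 */
static int
pg_wchar2mb_with_len(const pg_wchar *from, unsigned char *to, int len)
{
	int			i;

	for (i = 0; i < len; i++)
		*to++ = (unsigned char) *from++;
	*to = '\0';
	return len;
}

/*
 * Option 2's structure: handle the ASCII range with the cheap 8-bit
 * classifier, otherwise re-encode the pg_wchar back to the server's
 * multibyte form and let libc produce its native wchar_t.
 */
static bool
pg_wchar_isalpha_via_libc(pg_wchar c)
{
	if (c <= 127)
		return isalpha((unsigned char) c) != 0;
	else
	{
		unsigned char mb[MB_LEN_MAX + 1];
		wchar_t		wc;
		int			mblen = pg_wchar2mb_with_len(&c, mb, 1);

		if (mbtowc(&wc, (const char *) mb, mblen) <= 0)
			return false;		/* can't convert: punt, don't guess */
		return iswalpha((wint_t) wc) != 0;
	}
}
```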

>   To avoid doing hard
> work for nothing (ideogram-based languages generally don't care about
> ctype stuff so that'd be the vast majority of characters appearing in
> Chinese/Japanese/Korean text) at the cost of having to do a bunch of
> research, we should short-circuit the core CJK character
> ranges,
> and do the extra CPU cycles for the rest,

I don't think we should start making a bunch of assumptions like that.

> 3.  I assume there are some good reasons we don't do this but... if
> we
> used char2wchar() in the first place (= libc native wchar_t) for the
> regexp stuff that calls this stuff (as we do already inside
> whole-string upper/lower, just not character upper/lower or character
> classification), then we could simply call the wchar_t libc functions
> directly and unconditionally in the libc provider for all cases,
> instead of the 8-bit variants with broken edge cases for non-UTF-8
> databases.

I'm not sure about that either, but I think it's because you can end up
with surrogate pairs, which can't be represented in UTF-8.

>   I didn't try to find the historical discussions, but I can
> imagine already that we might not have done that because it has to
> copy to cope with non-NULL-terminated strings,

That's probably another reason.

> and it would only be appropriate for libc locales anyway and
> yet now we have other locale providers that certainly don't want some
> unspecified wchar_t encoding or libc involved.

We could fix that by making some of these APIs take a char pointer
instead. That would allow libc to decode to wchar_t, and other
providers to decode to UTF-32. Or, we could say that pg_wchar is an
opaque type that can only be created by the provider, and passed back
to the same provider.

>   It's also likely that
> non-UTF-8 systems are of dwindling interest to anyone outside perhaps
> client encodings

That's been my experience -- haven't run into many non-UTF8 server
encodings.

> In passing, I wonder why _libc.c has that comment about ICU in
> parentheses.  Not relevant here.

I moved it in 4da12e9e2e.

>   I haven't thought much about whether
> it's relevant in the ICU provider code (it may come back to that
> do-we-accept-pg_wchar-we-didn't-make? question), but if it is then it
> also applies to Windows and probably glibc in the libc provider and I
> don't immediately see any problem (assuming no-we-don't! answer).

It's relevant for the regc_wc_isalpha(), etc. functions:

https://www.postgresql.org/message-id/e7b67d24288f811aebada7c33f9ae629dde0def5.camel@j-davis.com

Regards,
    Jeff Davis




Re: C11: should we use char32_t for unicode code points?

From:
Peter Eisentraut
Date:
This patch looks good to me overall; it's a nice improvement in clarity.

On 26.10.25 20:43, Jeff Davis wrote:
> +/*
> + * char16_t and char32_t
> + *      Unicode code points.
> + */
> +#ifndef __cplusplus
> +#ifdef HAVE_UCHAR_H
> +#include <uchar.h>
> +#ifndef __STDC_UTF_16__
> +#error "char16_t must use UTF-16 encoding"
> +#endif
> +#ifndef __STDC_UTF_32__
> +#error "char32_t must use UTF-32 encoding"
> +#endif
> +#else
> +typedef uint16_t char16_t;
> +typedef uint32_t char32_t;
> +#endif
> +#endif

This could be improved a bit. The reason for some of these conditionals 
is not clear.  Like, what does __cplusplus have to do with this?  I 
think it would be more correct to write a configure/meson check for the 
actual types rather than depend indirectly on a header check.

The checks for __STDC_UTF_16__ and __STDC_UTF_32__ can be removed, as 
was discussed elsewhere, since we don't use any standard library 
functions that make use of these facts, and the need goes away with C23 
anyway.



Re: C11: should we use char32_t for unicode code points?

From:
Jeff Davis
Date:
On Tue, 2025-10-28 at 19:45 +0100, Peter Eisentraut wrote:
> This could be improved a bit. The reason for some of these
> conditionals
> is not clear.  Like, what does __cplusplus have to do with this?  I
> think it would be more correct to write a configure/meson check for
> the
> actual types rather than depend indirectly on a header check.

Fixed, thank you.

> The checks for __STDC_UTF_16__ and __STDC_UTF_32__ can be removed, as
> was discussed elsewhere, since we don't use any standard library
> functions that make use of these facts, and the need goes away with
> C23
> anyway.

Removed.

I also made the pg_config.h.in changes and ran autoconf.

Regards,
    Jeff Davis


Attachments

Re: C11: should we use char32_t for unicode code points?

From:
Thomas Munro
Date:
On Wed, Oct 29, 2025 at 7:45 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> On 26.10.25 20:43, Jeff Davis wrote:
> > +/*
> > + * char16_t and char32_t
> > + *      Unicode code points.
> > + */
> > +#ifndef __cplusplus
> > +#ifdef HAVE_UCHAR_H
> > +#include <uchar.h>
> > +#ifndef __STDC_UTF_16__
> > +#error "char16_t must use UTF-16 encoding"
> > +#endif
> > +#ifndef __STDC_UTF_32__
> > +#error "char32_t must use UTF-32 encoding"
> > +#endif
> > +#else
> > +typedef uint16_t char16_t;
> > +typedef uint32_t char32_t;
> > +#endif
> > +#endif
>
> This could be improved a bit. The reason for some of these conditionals
> is not clear.  Like, what does __cplusplus have to do with this?  I
> think it would be more correct to write a configure/meson check for the
> actual types rather than depend indirectly on a header check.

I suggested testing __cplusplus because I predicted that that typedef
would fail on a C++ compiler (since C++11), where char32_t is a
language keyword identifying a distinct type requiring no #include.
This is an Apple-only problem, without which we could just include
<uchar.h> unconditionally, and presumably will eventually when Apple
supplies this non-optional-per-C11 header.  On a Mac, #include
<uchar.h> fails for C (there is no $SDK/usr/include/uchar.h) but works
for C++ (it finds $SDK/usr/include/c++/v1/uchar.h), and since we'd
probe for HAVE_UCHAR_H with the C compiler, we'd not find it and thus
also need to exclude __cplusplus at compile time.  Otherwise, let's
see what the error looks like...

test.cpp:2:22: error: cannot combine with previous 'int' declaration specifier
    2 | typedef unsigned int char32_t;
      |                      ^
test.cpp:2:1: warning: typedef requires a name [-Wmissing-declarations]
    2 | typedef unsigned int char32_t;
      | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 warning and 1 error generated.

GCC has a clearer message:

test.cpp:2:22: error: redeclaration of C++ built-in type 'char32_t'
[-fpermissive]
    2 | typedef unsigned int char32_t;
      |                      ^~~~~~~~

If you try to test for the existence of the type rather than the
header in meson/configure, won't you still have the configure-with-C
compile-with-C++ problem, with no way to resolve it except by keeping
the test for __cplusplus that you're trying to get rid of?  So what do
you gain other than more lines of configure stuff?

Out of curiosity, even with -std=C++03 (old C++ standard that might
not work for PostgreSQL for other reasons, but I wanted to see what
would happen with a standard before char32_t became a fundamental
language type) I was surprised to see that the standard library
supplied char32_t.  It incorrectly(?) imports a typename from the
future standards using an internal type, so our typedef still fails,
just with a different Clang error:

test.cpp:2:22: error: typedef redefinition with different types
('unsigned int' vs 'char32_t')
    2 | typedef unsigned int char32_t;
      |                      ^

/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/c++/v1/__config:320:20:
note: previous definition is here
  320 | typedef __char32_t char32_t;
      |                    ^

> The checks for __STDC_UTF_16__ and __STDC_UTF_32__ can be removed, as
> was discussed elsewhere, since we don't use any standard library
> functions that make use of these facts, and the need goes away with C23
> anyway.

+1



Re: C11: should we use char32_t for unicode code points?

From:
Jeff Davis
Date:
On Wed, 2025-10-29 at 09:03 +1300, Thomas Munro wrote:
> If you try to test for the existence of the type rather than the
> header in meson/configure, won't you still have the configure-with-C
> compile-with-C++ problem

I must have misunderstood the first time. If we depend on
HAVE_CHAR32_T, then it will be set in stone in pg_config.h, and if C++
tries to include the file then it will try the typedef again and fail.

I tried with headerscheck --cplusplus before posting it, but because my
machine has uchar.h, then it didn't fail.

I went back to using the check for __cplusplus, and added a comment
that hopefully clarifies things.

I also reordered the checks so that it prefers to include uchar.h if
available, even when using C++, because that seems like the cleaner end
goal. However, that caused another problem in CI (mingw_cross_warning),
apparently due to a conflict between uchar.h and win32_port.h on that
platform:

[21:48:21.794] ../../src/include/port/win32_port.h: At top level:
[21:48:21.794] ../../src/include/port/win32_port.h:254:8: error:
redefinition of ‘struct stat’
[21:48:21.794]   254 | struct stat
/* This should match struct __stat64 */
[21:48:21.794]       |        ^~~~
[21:48:21.794] In file included from /usr/share/mingw-
w64/include/wchar.h:413,
[21:48:21.794]                  from /usr/share/mingw-
w64/include/uchar.h:28,
[21:48:21.794]                  from ../../src/include/c.h:526:
[21:48:21.794] /usr/share/mingw-w64/include/_mingw_stat64.h:40:10:
note: originally defined here
[21:48:21.794]    40 |   struct stat {
[21:48:21.794]       |          ^~~~

https://cirrus-ci.com/task/4849300577976320

I could reverse the checks again and I think it will work, but let me
know if you have an idea for a better fix.

I never thought it would be so much trouble just to get a suitable type
for a UTF-32 code point...

Regards,
    Jeff Davis


Attachments

Re: C11: should we use char32_t for unicode code points?

From:
Thomas Munro
Date:
On Wed, Oct 29, 2025 at 6:59 AM Jeff Davis <pgsql@j-davis.com> wrote:
> So you're saying that pg_wchar is more like a union type?
>
>     typedef pg_wchar
>     {
>        char ch; /* single-byte encodings or
>                    non-UTF8 encodings on unix */
>        char16_t utf16; /* windows non-UTF8 encodings */
>        char32_t utf32; /* UTF-8 encoding */
>     } pg_wchar;
>
> (we'd have to be careful about the memory layout if we're casting,
> though)

Interesting idea.  I think it'd have to be something like:

typedef union
{
  unsigned char ch; /* (1) single-byte encoding databases */
  char32_t utf32; /* (2) UTF-8 databases */
  uint32_t ascii_or_custom; /* (3) MULE, EUC_XX databases */
} pg_wchar;

Dunno if it's worth actually doing, but it's a good illustration and a
better way to explain all this than the wall of text I wrote
yesterday.  The collusion between common/wchar.c and pg_locale_libc.c
is made more explicit.

I wonder if the logic to select the member/semantics could be turned
into an enum in the encoding table, to make it even clearer, and then
that could be used as an index into a table of ctype methods obejcts
in _libc.c.  The encoding module would be declaring which pg_wchar
semantics it uses, instead of having the _libc.c module infer it from
other properties, for a more explicit contract.  Or since they are
inferrable, perhaps a function in the mb module could do that and
return the enum.  Hmm, perhaps that alone would be clarifying enough,
without the union type.  I'm picturing something like PG_WCHAR_CHAR
(direclty usable with ctype.h), PG_WCHAR_UTF32 (self-explanatory, also
assumed be compatible with UTF-8 locales' wchar_t), PG_WCHAR_CUSTOM
(we only know that ASCII range is sane as Ishii-san explained, and for
anything else you'd need to re-encode via libc or give up, but
preferably not go nuts and return junk).  The enum would create a new
central place to document the cross-module semantics.
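To make the sketch concrete, the enum and dispatch might look something like the following. The scheme names are the ones floated above; everything else (the dispatcher, its assumption of a 32-bit UTF-32 wchar_t for the UTF32 case, i.e. not Windows) is hypothetical, and the CUSTOM case simply punts outside ASCII rather than returning junk:

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdint.h>
#include <wctype.h>

typedef uint32_t pg_wchar;

/* Hypothetical declaration of each encoding's pg_wchar semantics */
typedef enum PgWcharEncodingScheme
{
	PG_WCHAR_CHAR,				/* single-byte: directly usable with ctype.h */
	PG_WCHAR_UTF32,				/* UTF-8 databases: a Unicode code point */
	PG_WCHAR_CUSTOM				/* MULE/EUC: only the ASCII range is sane */
} PgWcharEncodingScheme;

/*
 * Example dispatcher: _libc.c picks behavior from the declared scheme
 * instead of inferring it from other encoding properties.
 */
static bool
pg_wchar_isalpha(pg_wchar c, PgWcharEncodingScheme scheme)
{
	switch (scheme)
	{
		case PG_WCHAR_CHAR:
			return isalpha((unsigned char) c) != 0;
		case PG_WCHAR_UTF32:
			return iswalpha((wint_t) c) != 0;	/* assumes UTF-32 wchar_t */
		case PG_WCHAR_CUSTOM:
			return c <= 127 && isalpha((unsigned char) c) != 0;
	}
	return false;
}
```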

You showed char16_t for Windows, but we don't ever get char16_t out of
wchar.c, it's always char32_t for UTF-8 input.  It's just that _libc.c
truncates to UTF-16 or short-circuits to avoid overflow on that
platform (and in the past AIX 32-bit and maybe more), so it wouldn't
belong in a hypothetical union or enum.

> >   To avoid doing hard
> > work for nothing (ideogram-based languages generally don't care about
> > ctype stuff so that'd be the vast majority of characters appearing in
> > Chinese/Japanese/Korean text) at the cost of having to do a bunch of
> research, we should short-circuit the core CJK character
> > ranges,
> > and do the extra CPU cycles for the rest,
>
> I don't think we should start making a bunch of assumptions like that.

Yeah, maybe not.  Thought process: I had noticed that EUC was the only
relevant encoding family, and it has a character set selector, CS0 =
ASCII, and CS1, CS2, CS3 defined appropriately by the national
variants.  I had noticed that at least the Japanese one can encode
Latin with accents, Greek etc (non-ASCII stuff that has a meaningful
isalpha() etc) and I took a wild guess that it might be easy to
distinguish them if they'd chosen to put those under a different CS
number.  But I see now that they actually stuffed them all into CS1
along with kanji and kana, making it slightly more difficult: they're
still in different assigned "rows" though.  At a guess, you can
probably identify extra punctuation (huh, that's surely relevant even
for pure Japanese text if we want ispunct to work?) and foreign
alphabets with some bitmasks.  There might be something similar for
the other EUCs.

It's true that it's really not nice to carry special knowledge like
that (it's not just "assumptions", it's a set of black and white
published standards), and we should probably try hard to avoid that.
Perhaps we could at least put the conversion in a new encoding table
function pointer "pg_wchar_custom_to_wchar_t", so we could reserve a
place to put that sort of optimisation in (as opposed to making
_libc.c call char2wchar() with no hope of fast path)...  that is, if
we want to do any of this at all and not just make new ctype functions
that return false for PG_WCHAR_CUSTOM with value >= 128 and call it a
day...

If we do develop this idea though, one issue to contemplate is that
EUC code points might generate more than one wchar_t, looking at
EUC_JIS_2004[1].  We'd need a pg_wchar_custom_to_wchar_t() signature
that takes a single pg_wchar and writes to an output array and returns
the count, and then we'd have to decide what to do if we get more than
one.  Surrogates are trivial under the existing "punt" doctrine:
Windows went big on Unicode before it grew, C doesn't do wctype for
multi-wchar_t sequences, and we can't fix any of that.  If it's a
(rare?) combining character sequence then uhh... same problem one
level up, I think, even on Unix?  I'm not sure if we could do much
better than the "punt" path in both cases: return either false or the
input character as appropriate.
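A concrete (again hypothetical) shape for that callback, with a trivial implementation covering only the ASCII range to show the contract:

```c
#include <stdint.h>
#include <wchar.h>

typedef uint32_t pg_wchar;

/*
 * Hypothetical encoding-table slot: convert one pg_wchar into libc's
 * native wchar_t representation.  Returns the number of wchar_t values
 * written (possibly > 1, e.g. for EUC_JIS_2004 code points that decode
 * to a base character plus combining mark), or -1 if the value can't
 * be represented.
 */
typedef int (*pg_wchar_custom_to_wchar_t_fn) (pg_wchar c,
											  wchar_t *out,
											  int outlen);

/* Trivial example implementation: identity mapping for ASCII only. */
static int
ascii_to_wchar_t(pg_wchar c, wchar_t *out, int outlen)
{
	if (c > 127 || outlen < 1)
		return -1;
	out[0] = (wchar_t) c;
	return 1;
}
```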

> > 3.  I assume there are some good reasons we don't do this but... if
> > we
> > used char2wchar() in the first place (= libc native wchar_t) for the
> > regexp stuff that calls this stuff (as we do already inside
> > whole-string upper/lower, just not character upper/lower or character
> > classification), then we could simply call the wchar_t libc functions
> > directly and unconditionally in the libc provider for all cases,
> > instead of the 8-bit variants with broken edge cases for non-UTF-8
> > databases.
>
> I'm not sure about that either, but I think it's because you can end up
> with surrogate pairs, which can't be represented in UTF-8.

Yeah, I think that alone is a good reason.  We need PG_WCHAR_UTF32 (in
the sketch terminology above).

I wondered about PG_WCHAR_SYSTEM_WCHAR_T, which could potentially
replace PG_WCHAR_CUSTOM, in other words using system wchar_t but only
for EUC_*.  The point of this would be for eg regexes to be able to
convert whole strings up-front with one libc call, rather than calling
for each character.  The problem seems to be that you'd lose any
ability to deal with surrogates and combining characters as discussed
above, as you'd lose character synchronisation for want of a better
word.  So I just can't see how to make this work.  Which leads back to
the do-it-one-by-one idea, which then leads back to the
maybe-try-to-make-a-fast-path-for-kanji-etc idea 'cos otherwise it
sounds too expensive...

[1] https://en.wikipedia.org/wiki/JIS_X_0213



Re: C11: should we use char32_t for unicode code points?

From:
Jeff Davis
Date:
On Wed, 2025-10-29 at 14:00 +1300, Thomas Munro wrote:
> I wonder if the logic to select the member/semantics could be turned
> into an enum in the encoding table, to make it even clearer, and then
> that could be used as an index into a table of ctype method objects
> in _libc.c.

As long as we're able to isolate that logic in the libc provider,
that's reasonable. The other providers don't need that complexity, they
just need to decode straight to UTF-32.

> You showed char16_t for Windows, but we don't ever get char16_t out
> of
> wchar.c, it's always char32_t for UTF-8 input.  It's just that
> _libc.c
> truncates to UTF-16 or short-circuits to avoid overflow on that
> platform (and in the past AIX 32-bit and maybe more), so it wouldn't
> belong in a hypothetical union or enum.

Oh, I see.

> >
> Perhaps we could at least put the conversion in a new encoding table
> function pointer "pg_wchar_custom_to_wchar_t", so we could reserve a
> place to put that sort of optimisation in

That sounds like a good step forward. And maybe one to convert to
UTF-32 for ICU, also?

> If we do develop this idea though, one issue to contemplate is that
> EUC code points might generate more than one wchar_t, looking at
> EUC_JIS_2004[1].

Wow, that's unfortunate.


Regards,
    Jeff Davis




Re: C11: should we use char32_t for unicode code points?

From:
Thomas Munro
Date:
On Wed, Oct 29, 2025 at 2:00 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> I'm picturing something like PG_WCHAR_CHAR
> (directly usable with ctype.h), PG_WCHAR_UTF32 (self-explanatory, also
> assumed to be compatible with UTF-8 locales' wchar_t), PG_WCHAR_CUSTOM
> (we only know that ASCII range is sane as Ishii-san explained, and for
> anything else you'd need to re-encode via libc or give up, but
> preferably not go nuts and return junk).  The enum would create a new
> central place to document the cross-module semantics.

Here are some sketch-quality patches to try out some of these ideas,
for discussion.  I gave them .txt endings so as not to hijack your
thread's CI.

* Fixing a different but related bug spotted in passing: we truncate
codepoints passed to Windows' iswalpha_l() et al, instead of detecting
overflow like some other places do.  Not tested on Windows, but it
seemed pretty obviously wrong?
* Classifying all pg_wchar encodings as producing PG_WCHAR_CHAR,
PG_WCHAR_UTF32 or PG_WCHAR_CUSTOM, and dispatching to libc ctype
methods based on that.
* Easy EUC change: filtering out non-ASCII for _CUSTOM.  I can't seem
to convince SQL-level regexes to expose bogus results on master
though... maybe the pg_wchar encoding actively avoids that by shifting
values up so you often or always cast to a harmless value?  Still
better to formalise that I think, if we don't move ahead with the more
ambitious plan...
* More ambitious re-encoding strategy, replacing previous change, with
apparently plausible results.
* Various refactorings with helper macros to avoid making mistakes in
all that repetitive wrapper stuff.

Here's what my ja_JP.eucJP database shows, on FreeBSD.  BTW in my
earlier emails I was confused and thought that kanji would not be in
class [[:alpha:]], but that's wrong: Unicode calls it "other letter",
and it looks like that makes all modern libcs return true for
iswalpha():

postgres=# select regexp_replace('1234 Постгрес 5678', '[[:alpha:]]+', '象');
 regexp_replace
----------------
 1234 象 5678
(1 row)

postgres=# select regexp_replace('1234 ポスグレ 5678', '[[:alpha:]]+', '象');
 regexp_replace
----------------
 1234 象 5678
(1 row)

postgres=# select regexp_replace('1234 ポスグレ? 5678', '[[:punct:]]+', '。');
    regexp_replace
----------------------
 1234 ポスグレ。 5678
(1 row)

(That's not an ASCII question mark; it's one of the kanji-box-sized
punctuation characters.)

I had to hack regc_pg_locale.c slightly to teach it that just because
I set max_chr to 127 it doesn't mean I want it to turn locale support
off.   Haven't looked into that code to figure out what it should do
instead, but it definitely shouldn't be allowed to probe made up
pg_wchar values, because EUC's pg_wchar encoding is sparse and
transcoding can error out.

A mystery that blocked me for too long: regexp_match('café', 'CAFÉ',
'i') and regexp_match('Αθήνα', 'ΑΘΉΝΑ', 'i') match with Apple's
ja_JP.eucJP as do the examples above, but mysteriously didn't on
FreeBSD's where this code started, could be a bug in its ja_JP.eucJP
locale affecting toupper/tolower...  Wish I could get that time back.

I imagine that for the ICU + non-UTF-8 locale bug you mentioned, we
might need a very similar set of re-encoding wrappers: something like
pg_wchar ->  mb -> UTF-8 -> UTF-32.  All this re-encoding sounds
pretty bad, but I can't see any way around the re-encoding with these
edge-case configurations, and we're still supposed to spit out correct
right answers...

Attachments

Re: C11: should we use char32_t for unicode code points?

From:
Peter Eisentraut
Date:
On 28.10.25 22:54, Jeff Davis wrote:
> I went back to using the check for __cplusplus, and added a comment
> that hopefully clarifies things.

Yes, that looks more helpful now.



Re: C11: should we use char32_t for unicode code points?

From:
Jeff Davis
Date:
On Tue, 2025-10-28 at 14:54 -0700, Jeff Davis wrote:
> [21:48:21.794] ../../src/include/port/win32_port.h: At top level:
> [21:48:21.794] ../../src/include/port/win32_port.h:254:8: error:
> redefinition of ‘struct stat’
> [21:48:21.794]   254 | struct stat                                   
> /* This should match struct __stat64 */
> [21:48:21.794]       |        ^~~~
> [21:48:21.794] In file included from /usr/share/mingw-
> w64/include/wchar.h:413,
> [21:48:21.794]                  from /usr/share/mingw-
> w64/include/uchar.h:28,
> [21:48:21.794]                  from ../../src/include/c.h:526:
> [21:48:21.794] /usr/share/mingw-w64/include/_mingw_stat64.h:40:10:
> note: originally defined here
> [21:48:21.794]    40 |   struct stat {
> [21:48:21.794]       |          ^~~~
>
> https://cirrus-ci.com/task/4849300577976320

It seems to work on the two windows CI instances just fine, but fails
mingw_cross_warning.

Apparently, <uchar.h> somehow includes (some portion of?) <sys/stat.h>
on that platform, which then conflicts with the hackery done in
<win32_port.h> (which expects to include <sys/stat.h> itself after some
special #defines).

The attached patch moves the inclusion of <uchar.h> after "port.h",
which solves the problem. It's a bit out of place, but I added a note
in the comment explaining why. I'll go ahead and commit.

Regards,
    Jeff Davis


Attachments

Re: C11: should we use char32_t for unicode code points?

From:
Jeff Davis
Date:
On Thu, 2025-10-30 at 04:25 +1300, Thomas Munro wrote:
> Here are some sketch-quality patches to try out some of these ideas,
> for discussion.  I gave them .txt endings so as not to hijack your
> thread's CI.

I like the direction this is going. I will commit the char32_t work
anyway, so afterward feel free to hijack the thread (there's a lot of
good information here so continuing here might be more productive than
starting a new thread).

Regarding 0002, IIUC, for PG_WCHAR_UTF32, surrogates are forbidden, but
the comment about UTF-16 is a bit vague. I think we should add some
asserts to make it clear.
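For instance, the asserted invariant would be that a PG_WCHAR_UTF32 value is a Unicode scalar value: in range and not a UTF-16 surrogate. A hypothetical helper for such assertions:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t pg_wchar;

/*
 * A PG_WCHAR_UTF32 value must be a Unicode scalar value: at most
 * U+10FFFF and not in the UTF-16 surrogate range U+D800..U+DFFF.
 * Something like Assert(is_unicode_scalar_value(c)) could then guard
 * wherever such values are produced or consumed.
 */
static bool
is_unicode_scalar_value(pg_wchar c)
{
	return c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF);
}
```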

The basic communication mechanism between the modules is the database
encoding: it determines PgWcharEncodingScheme in both wchar.c and
pg_locale_libc.c. That seems reasonable to me, and doesn't interfere
with the other providers.

I'm still not quite sure how this fits with ICU in a single-byte
encoding, but doesn't seem worse than what we do currently.

Also, tangentially, I'm a bit anxious to do a permanent
setlocale(LC_CTYPE, "C"), and we are very close. If these two threads
are successful, I believe we can do it:

https://www.postgresql.org/message-id/90f176c5b85b9da26a3265b2630ece3552068566.camel%40j-davis.com

https://www.postgresql.org/message-id/d9657a6e51aa20702447bb2386b32fea6218670f.camel@j-davis.com

That would be a big simplification because it would isolate libc ctype
behavior to pg_locale_libc.c. That would make me feel generally more
comfortable with additional work in this area.

Regards,
    Jeff Davis