Обсуждение: Remaining dependency on setlocale()
After some previous work here: https://postgr.es/m/89475ee5487d795124f4e25118ea8f1853edb8cb.camel@j-davis.com we are less dependent on setlocale(), but it's still not completely gone. setlocale() counts as thread-unsafe, so it would be nice to eliminate it completely. The obvious answer is uselocale(), which sets the locale only for the calling thread, and takes precedence over whatever is set with setlocale(). But there are a couple problems: 1. I don't think it's supported on Windows. 2. I don't see a good way to canonicalize a locale name, like in check_locale(), which uses the result of setlocale(). Thoughts? Regards, Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes:
> But there are a couple problems:
> 1. I don't think it's supported on Windows.
Can't help with that, but surely Windows has some thread-safe way.
> 2. I don't see a good way to canonicalize a locale name, like in
> check_locale(), which uses the result of setlocale().
What I can tell you about that is that check_locale's expectation
that setlocale does any useful canonicalization is mostly wishful
thinking [1]. On a lot of platforms you just get the input string
back again. If that's the only thing keeping us on setlocale,
I think we could drop it. (Perhaps we should do some canonicalization
of our own instead?)
regards, tom lane
[1] https://www.postgresql.org/message-id/14856.1348497531@sss.pgh.pa.us
On Wed, Aug 7, 2024 at 10:23 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Jeff Davis <pgsql@j-davis.com> writes: > > But there are a couple problems: > > > 1. I don't think it's supported on Windows. > > Can't help with that, but surely Windows has some thread-safe way. It does. It's not exactly the same, instead there is a thing you can call that puts setlocale() itself into a thread-local mode, but last time I checked that mode was missing on MinGW so that's a bit of an obstacle. How far can we get by using more _l() functions? For example, [1] shows a use of strftime() that I think can be converted to strftime_l() so that it doesn't depend on setlocale(). Since POSIX doesn't specify every obvious _l function, we might need to provide any missing wrappers that save/restore thread-locally with uselocale(). Windows doesn't have uselocale(), but it generally doesn't need such wrappers because it does have most of the obvious _l() functions. > > 2. I don't see a good way to canonicalize a locale name, like in > > check_locale(), which uses the result of setlocale(). > > What I can tell you about that is that check_locale's expectation > that setlocale does any useful canonicalization is mostly wishful > thinking [1]. On a lot of platforms you just get the input string > back again. If that's the only thing keeping us on setlocale, > I think we could drop it. (Perhaps we should do some canonicalization > of our own instead?) +1 I know it does something on Windows (we know the EDB installer gives it strings like "Language,Country" and it converts them to "Language_Country.Encoding", see various threads about it all going wrong), but I'm not sure it does anything we actually want to encourage. I'm hoping we can gradually screw it down so that we only have sane BCP 47 in the system on that OS, and I don't see why we wouldn't just use them verbatim. [1] https://www.postgresql.org/message-id/CA%2BhUKGJ%3Dca39Cg%3Dy%3DS89EaCYvvCF8NrZRO%3Duog-cnz0VzC6Kfg%40mail.gmail.com
On 8/7/24 03:07, Thomas Munro wrote:
> How far can we get by using more _l() functions? For example, [1]
> shows a use of strftime() that I think can be converted to
> strftime_l() so that it doesn't depend on setlocale(). Since POSIX
> doesn't specify every obvious _l function, we might need to provide
> any missing wrappers that save/restore thread-locally with
> uselocale(). Windows doesn't have uselocale(), but it generally
> doesn't need such wrappers because it does have most of the obvious
> _l() functions.
Most of the strtoX functions have an _l variant, but one to watch is
atoi, which is defined with a hardcoded call to strtol, at least with glibc:
8<----------
/* Convert a string to an int. */
int
atoi (const char *nptr)
{
return (int) strtol (nptr, (char **) NULL, 10);
}
8<----------
I guess in many/most places we use atoi we don't care, but maybe it
matters for some?
--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Wed, Aug 7, 2024 at 9:42 AM Joe Conway <mail@joeconway.com> wrote:
> I guess in many/most places we use atoi we don't care, but maybe it
> matters for some?
I think we should move in the direction of replacing atoi() calls with
strtol() and actually checking for errors. In many places where use
atoi(), it's unlikely that the string would be anything but an
integer, so error checks are arguably unnecessary. A backup label file
isn't likely to say "START TIMELINE: potaytoes". On the other hand, if
it did say that, I'd prefer to get an error about potaytoes than have
it be treated as if it said "START TIMELINE: 0". And I've definitely
found missing error-checks over the years. For example, on pg14,
"pg_basebackup -Ft -Zmaximum -Dx" works as if you specified "-Z0"
because atoi("maximum") == 0. If we make a practice of checking
integer conversions for errors everywhere, we might avoid some such
silliness.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, 2024-08-07 at 19:07 +1200, Thomas Munro wrote:
> How far can we get by using more _l() functions?
There are a ton of calls to, for example, isspace(), used mostly for
parsing.
I wouldn't expect a lot of differences in behavior from locale to
locale, like might be the case with iswspace(), but behavior can be
different at least in theory.
So I guess we're stuck with setlocale()/uselocale() for a while, unless
we're able to move most of those call sites over to an ascii-only
variant.
Regards,
Jeff Davis
On 8/7/24 13:16, Jeff Davis wrote: > On Wed, 2024-08-07 at 19:07 +1200, Thomas Munro wrote: >> How far can we get by using more _l() functions? > > There are a ton of calls to, for example, isspace(), used mostly for > parsing. > > I wouldn't expect a lot of differences in behavior from locale to > locale, like might be the case with iswspace(), but behavior can be > different at least in theory. > > So I guess we're stuck with setlocale()/uselocale() for a while, unless > we're able to move most of those call sites over to an ascii-only > variant. FWIW I see all of these in glibc: isalnum_l, isalpha_l, isascii_l, isblank_l, iscntrl_l, isdigit_l, isgraph_l, islower_l, isprint_l, ispunct_l, isspace_l, isupper_l, isxdigit_l -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Wed, Aug 7, 2024 at 1:29 PM Joe Conway <mail@joeconway.com> wrote: > FWIW I see all of these in glibc: > > isalnum_l, isalpha_l, isascii_l, isblank_l, iscntrl_l, isdigit_l, > isgraph_l, islower_l, isprint_l, ispunct_l, isspace_l, isupper_l, > isxdigit_l On my MacBook (Ventura, 13.6.7), I see all of these except for isascii_l. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Aug 8, 2024 at 5:16 AM Jeff Davis <pgsql@j-davis.com> wrote:
> There are a ton of calls to, for example, isspace(), used mostly for
> parsing.
>
> I wouldn't expect a lot of differences in behavior from locale to
> locale, like might be the case with iswspace(), but behavior can be
> different at least in theory.
>
> So I guess we're stuck with setlocale()/uselocale() for a while, unless
> we're able to move most of those call sites over to an ascii-only
> variant.
We do know of a few isspace() calls that are already questionable[1]
(should be scanner_isspace(), or something like that). It's not only
weird that SELECT ROW('libertà!') is displayed with or without double
quote depending (in theory) on your locale, it's also undefined
behaviour because we feed individual bytes of a multi-byte sequence to
isspace(), so OSes disagree, and in practice we know that macOS and
Windows think that the byte 0xa inside 'à' is a space while glibc and
FreeBSD don't. Looking at the languages with many sequences
containing 0xa0, I guess you'd probably need to be processing CJK text
and cross-platform for the difference to become obvious (that was the
case for the problem report I analysed):
for i in range(1, 0xffff):
if (i < 0xd800 or i > 0xdfff) and 0xa0 in chr(i).encode('UTF-8'):
print("%04x: %s" % (i, chr(i)))
[1] https://www.postgresql.org/message-id/flat/CA%2BHWA9awUW0%2BRV_gO9r1ABZwGoZxPztcJxPy8vMFSTbTfi4jig%40mail.gmail.com
On Thu, Aug 8, 2024 at 6:18 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Aug 7, 2024 at 1:29 PM Joe Conway <mail@joeconway.com> wrote:
> > FWIW I see all of these in glibc:
> >
> > isalnum_l, isalpha_l, isascii_l, isblank_l, iscntrl_l, isdigit_l,
> > isgraph_l, islower_l, isprint_l, ispunct_l, isspace_l, isupper_l,
> > isxdigit_l
>
> On my MacBook (Ventura, 13.6.7), I see all of these except for isascii_l.
Those (except isascii_l) are from POSIX 2008[1]. They were absorbed
from "Extended API Set Part 4"[2], along with locale_t (that's why
there is a header <xlocale.h> on a couple of systems even though after
absorption they are supposed to be in <locale.h>). We already
decided that all computers have that stuff (commit 8d9a9f03), but the
reality is a little messier than that... NetBSD hasn't implemented
uselocale() yet[3], though it has a good set of _l functions. As
discussed in [3], ECPG code is therefore currently broken in
multithreaded clients because it's falling back to a setlocale() path,
and I think Windows+MinGW must be too (it lacks
HAVE__CONFIGTHREADLOCALE), but those both have a good set of _l
functions. In that thread I tried to figure out how to use _l
functions to fix that problem, but ...
The issue there is that we have our own snprintf.c, that implicitly
requires LC_NUMERIC to be "C" (it is documented as always printing
floats a certain way ignoring locale and that's what the callers there
want in frontend and backend code, but in reality it punts to system
snprintf for floats, assuming that LC_NUMERIC is "C", which we
configure early in backend startup, but frontend code has to do it for
itself!). So we could use snprintf_l or strtod_l instead, but POSIX
hasn't got those yet. Or we could use own own Ryu code (fairly
specific), but integrating Ryu into our snprintf.c (and correctly
implementing all the %... stuff?) sounds like quite a hard,
devil-in-the-details kind of an undertaking to me. Or maybe it's
easy, I dunno. As for the _l functions, you could probably get away
with "every computer has either uselocale() or snprintf_() (or
strtod_()?)" and have two code paths in our snprintf.c. But then we'd
also need a place to track a locale_t for a long-lived newlocale("C"),
which was too messy in my latest attempt...
[1] https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/functions/isspace.html
[2] https://pubs.opengroup.org/onlinepubs/9699939499/toc.pdf
[3] https://www.postgresql.org/message-id/flat/CWZBBRR6YA8D.8EHMDRGLCKCD%40neon.tech
On Wed, 2024-08-07 at 13:28 -0400, Joe Conway wrote:
> FWIW I see all of these in glibc:
>
> isalnum_l, isalpha_l, isascii_l, isblank_l, iscntrl_l, isdigit_l,
> isgraph_l, islower_l, isprint_l, ispunct_l, isspace_l, isupper_l,
> isxdigit_l
My point was just that there are a lot of those call sites (especially
for isspace()) in various parsers. It feels like a lot of code churn to
change all of them, when a lot of them seem to be intended for ascii
anyway.
And where do we get the locale_t structure from? We can create our own
at database connection time, and supply it to each of those call sites,
but I'm not sure that's a huge advantage over just using uselocale().
Regards,
Jeff Davis
On 8/8/24 12:45 AM, Jeff Davis wrote: > My point was just that there are a lot of those call sites (especially > for isspace()) in various parsers. It feels like a lot of code churn to > change all of them, when a lot of them seem to be intended for ascii > anyway. > > And where do we get the locale_t structure from? We can create our own > at database connection time, and supply it to each of those call sites, > but I'm not sure that's a huge advantage over just using uselocale(). I am leaning towards that we should write our own pure ascii functions for this. Since we do not support any non-ascii compatible encodings anyway I do not see the point in having locale support in most of these call-sites. Andewas
On Fri, 2024-08-09 at 13:41 +0200, Andreas Karlsson wrote:
> I am leaning towards that we should write our own pure ascii
> functions
> for this.
That makes sense for a lot of call sites, but it could cause breakage
if we aren't careful.
> Since we do not support any non-ascii compatible encodings
> anyway I do not see the point in having locale support in most of
> these
> call-sites.
An ascii-compatible encoding just means that the code points in the
ascii range are represented as ascii. I'm not clear on whether code
points in the ascii range can return different results for things like
isspace(), but it sounds plausible -- toupper() can return different
results for 'i' in tr_TR.
Also, what about the values outside 128-255, which are still valid
input to isspace()?
Regards,
Jeff Davis
On Tue Aug 6, 2024 at 5:00 PM CDT, Jeff Davis wrote: > After some previous work here: > > https://postgr.es/m/89475ee5487d795124f4e25118ea8f1853edb8cb.camel@j-davis.com > > we are less dependent on setlocale(), but it's still not completely > gone. > > setlocale() counts as thread-unsafe, so it would be nice to eliminate > it completely. > > The obvious answer is uselocale(), which sets the locale only for the > calling thread, and takes precedence over whatever is set with > setlocale(). > > But there are a couple problems: > > 1. I don't think it's supported on Windows. > > 2. I don't see a good way to canonicalize a locale name, like in > check_locale(), which uses the result of setlocale(). > > Thoughts? Hey Jeff, See this thread[0] for some work that I had previously done. Feel free to take it over, or we could collaborate. [0]: https://www.postgresql.org/message-id/CWMW5OZBWJ10.1YFLQWSUE5RE9@neon.tech -- Tristan Partin Neon (https://neon.tech)
Hi, On Fri, 2024-08-09 at 15:16 -0500, Tristan Partin wrote: > Hey Jeff, > > See this thread[0] for some work that I had previously done. Feel > free > to take it over, or we could collaborate. > > [0]: > https://www.postgresql.org/message-id/CWMW5OZBWJ10.1YFLQWSUE5RE9@neon.tech Sounds good, sorry I missed that. Can you please rebase and we can discuss in that thread? Regards, Jeff Davis
On Wed, Aug 7, 2024 at 7:07 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Wed, Aug 7, 2024 at 10:23 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Jeff Davis <pgsql@j-davis.com> writes: > > > But there are a couple problems: > > > > > 1. I don't think it's supported on Windows. > > > > Can't help with that, but surely Windows has some thread-safe way. > > It does. It's not exactly the same, instead there is a thing you can > call that puts setlocale() itself into a thread-local mode, but last > time I checked that mode was missing on MinGW so that's a bit of an > obstacle. Actually the MinGW situation might be better than that these days. I know of three environments where we currently have to keep code working on MinGW: build farm animal fairywren (msys2 compiler toochain), CI's optional "Windows - Server 2019, MinGW64 - Meson" task, and CI's "CompilerWarnings" task, in the "mingw_cross_warning" step (which actually runs on Linux, and uses configure rather than meson). All three environments show that they have _configthreadlocale. So could we could simply require it on Windows? Then it might be possible to write a replacement implementation of uselocale() that does a two-step dance with _configthreadlocale() and setlocale(), restoring both afterwards if they changed. That's what ECPG open-codes already. The NetBSD situation is more vexing. I was trying to find out if someone is working on it and unfortunately it looks like there is a principled stand against adding it: https://mail-index.netbsd.org/tech-userlevel/2015/12/28/msg009546.html https://mail-index.netbsd.org/netbsd-users/2017/02/14/msg019352.html They're right that we really just want to use "C" in some places, and their LC_C_LOCALE is a very useful system-provided value to be able to pass into _l functions. It's a shame it's non-standard, because without it you have to allocate a locale_t for "C" and keep it somewhere to feed to _l functions...
I've posted a new attempt at ripping those ECPG setlocales() out on the other thread that had the earlier version and discussion: https://www.postgresql.org/message-id/CA%2BhUKG%2BYv%2Bps%3DnS2T8SS1UDU%3DiySHSr4sGHYiYGkPTpZx6Ooww%40mail.gmail.com
On Thu, Aug 8, 2024 at 5:16 AM Jeff Davis <pgsql@j-davis.com> wrote: > On Wed, 2024-08-07 at 19:07 +1200, Thomas Munro wrote: > > How far can we get by using more _l() functions? > > There are a ton of calls to, for example, isspace(), used mostly for > parsing. > > I wouldn't expect a lot of differences in behavior from locale to > locale, like might be the case with iswspace(), but behavior can be > different at least in theory. > > So I guess we're stuck with setlocale()/uselocale() for a while, unless > we're able to move most of those call sites over to an ascii-only > variant. Here are two more cases that I don't think I've seen discussed. 1. The nl_langinfo() call in pg_get_encoding_from_locale(), can probably be changed to nl_langinfo_l() (it is everywhere we currently care about except Windows, which has a different already-thread-safe alternative; AIX seems to lack the _l version, but someone writing a patch to re-add support for that OS could supply the configure goo for a uselocale() safe/restore implementation). One problem is that it has callers that pass it NULL meaning the backend default, but we'd perhaps use LC_C_GLOBAL for now and have to think about where we get the database default locale_t in the future. 2. localeconv() is *doubly* non-thread-safe: it depends on the current locale, and it also returns an object whose storage might be clobbered by any other call to localeconv(), setlocale, or even, according to POSIX, uselocale() (!!!). I think that effectively closes off that escape hatch. On some OSes (macOS, BSDs) you find localeconv_l() and then I think they give you a more workable lifetime: as long as the locale_t lives, which makes perfect sense. I am surprised that no one has invented localeconv_r() where you supply the output storage, and you could wrap that in uselocale() save/restore to deal with the other problem, or localeconv_r_l() or something. I can't understand why this is so bad. The glibc documentation calls it "a masterpiece of poor design". Ahh, so it seems like we need to delete our use of localeconf() completely, because we should be able to get all the information we need from nl_langinfo_l() instead: https://www.gnu.org/software/libc/manual/html_node/Locale-Information.html
On Mon, Aug 12, 2024 at 3:24 PM Thomas Munro <thomas.munro@gmail.com> wrote: > 1. The nl_langinfo() call in pg_get_encoding_from_locale(), can > probably be changed to nl_langinfo_l() (it is everywhere we currently > care about except Windows, which has a different already-thread-safe > alternative ... ... though if we wanted to replace all use of localeconv and struct lconv with nl_langinfo_l() calls, it's not totally obvious how to do that on Windows. Its closest thing is GetLocaleInfoEx(), but that has complications: it takes wchar_t locale names, which we don't even have and can't access when we only have a locale_t, and it must look them up in some data structure every time, and it copies data out to the caller as wchar_t so now you have two conversion problems and a storage problem. If I understand correctly, the whole point of nl_langinfo_l(item, loc) is that it is supposed to be fast, it's really just an array lookup, and item is just an index, and the result is supposed to be stable as long as loc hasn't been freed (and the thread hasn't exited). So you can use it without putting your own caching in front of it. One idea I came up with which I haven't tried and it might turn out to be terrible, is that we could change our definition of locale_t on Windows. Currently it's a typedef to Windows' _locale_t, and we use it with a bunch of _XXX functions that we access by macro to remove the underscore. Instead, we could make locale_t a pointer to a struct of our own design in WIN32 builds, holding the native _locale_t and also an array full of all the values that nl_langinfo_l() can return. We'd provide the standard enums, indexes into that array, in a fake POSIX-oid header <langinfo.h>. Then nl_langinfo_l(item, loc) could be implemented as loc->private_langinfo[item], and strcoll_l(.., loc) could be a static inline function that does _strcol_l(..., loc->private_windows_locale_t). These structs would be allocated and freed with standard-looking newlocale() and freelocale(), so we could finally stop using #ifdef WIN32-wrapped _create_locale() directly. Then everything would look more POSIX-y, nl_langinfo_l() could be used directly wherever we need fast access to that info, and we could, I think, banish the awkward localeconv, right? I don't know if this all makes total sense and haven't tried it, just spitballing here...
On Mon, Aug 12, 2024 at 4:53 PM Thomas Munro <thomas.munro@gmail.com> wrote: > ... though if we wanted to replace all use of localeconv and struct > lconv with nl_langinfo_l() calls, Gah. I realised while trying the above that you can't really replace localeconv() with nl_langinfo_l() as the GNU documentation recommends, because some of the lconv fields we're using are missing from langinfo.h in POSIX (only GNU added them all, that was a good idea). :-( Next idea: Windows: its localeconv() returns pointer to thread-local storage, good, but it still needs setlocale(). So the options are: make our own lconv-populator function for Windows, using GetLocaleInfoEx(), or do that _configthreadlocale() dance (possibly excluding some MinGW configurations from working) Systems that have localeconv_l(): use that POSIX: use uselocale() and also put a big global lock around localeconv() call + accessing result (optionally skipping that on an OS-by-OS basis after confirming that its implementation doesn't really need it) The reason the uselocale() + localeconv() seems to require a Big Lock (by default at least) is that the uselocale() deals only with the "current locale" aspect, not the output buffer aspect. Clearly the standard allows for it to be thread-local storage (that's why since 2008 it says that after thread-exit you can't access the result, and I guess that's how it works on real systems (?)), but it also seems to allow for a single static buffer (that's why it says that it's not re-entrant, and any call to localeconv() might clobber it). That might be OK in practice because we tend to cache that stuff, eg when assigning GUC lc_monetary (that cache would presumably become thread-local in the first phase of the multithreading plan), so the locking shouldn't really hurt. The reason we'd have to have three ways, and not just two, is again that NetBSD declined to implement uselocale(). I'll try this in a bit unless someone else has better ideas or plans for this part... sorry for the drip-feeding.
On Tue, Aug 13, 2024 at 10:43 AM Thomas Munro <thomas.munro@gmail.com> wrote: > I'll try this in a bit unless someone else has better ideas or plans > for this part... sorry for the drip-feeding. And done, see commitfest entry #5170.
On Sat, 2024-08-10 at 09:42 +1200, Thomas Munro wrote: > The NetBSD situation is more vexing. I was trying to find out if > someone is working on it and unfortunately it looks like there is a > principled stand against adding it: > > https://mail-index.netbsd.org/tech-userlevel/2015/12/28/msg009546.html > https://mail-index.netbsd.org/netbsd-users/2017/02/14/msg019352.html The objection seems to be very general: that uselocale() modifies the thread state and affects calls a long distance from uselocale(). I don't disagree with the general sentiment. But in effect, that just prevents people migrating away from setlocale(), to which the same argument applies, and is additionally thread-unsafe. The only alternative is to essentially ban the use of non-_l variants, which is fine I suppose, but causes a fair amount of code churn. > They're right that we really just want to use "C" in some places, and > their LC_C_LOCALE is a very useful system-provided value to be able > to > pass into _l functions. It's a shame it's non-standard, because > without it you have to allocate a locale_t for "C" and keep it > somewhere to feed to _l functions... If we're going to do that, why not just have ascii-only variants of our own? pg_ascii_isspace(...) is at least as readable as isspace_l(..., LC_C_LOCALE). Regards, Jeff Davis
On Wed, Aug 14, 2024 at 1:05 PM Jeff Davis <pgsql@j-davis.com> wrote: > The only alternative is to essentially ban the use of non-_l variants, > which is fine I suppose, but causes a fair amount of code churn. Let's zoom out a bit and consider some ways we could set up the process, threads and individual calls: 1. The process global locale is always "C". If you ever call uselocale(), it can only be for short stretches, and you have to restore it straight after; perhaps it is only ever used in replacement _l() functions for systems that lack them. You need to use _l() functions for all non-"C" locales. The current database default needs to be available as a variable (in future: thread-local variable, or reachable from one), so you can use it in _l() functions. The "C" locale can be accessed implicitly with non-l() functions, or you could ban those to reduce confusion and use foo_l(..., LC_GLOBAL_LOCALE) for "C". Or a name like PG_C_LOCALE, which, in backend code could be just LC_GLOBAL_LOCALE, while in frontend/library code it could be the singleton mechanism I showed in CF#5166. XXX Note that nailing LC_ALL to "C" in the backend would extend our existing policy for LC_NUMERIC to all categories. That's why we can use strtod() in the backend and expect the radix character to be '.'. It's interesting to contemplate the strtod() calls in CF#5166: they are code copied-and-pasted from backend and frontend; in the backend we can use strtod() currently but in the frontend code I showed a change to strtod_l(..., PG_C_LOCALE), in order to be able to delete some ugly deeply nested uselocale()/setlocale() stuff of the exact sort that NetBSD hackers (and I) hate. It's obviously a bit of a code smell that it's copied-and-pasted in the first place, and should really share code. Supposing some of that stuff finishes up in src/common, then I think you'd want a strtod_l(..., PG_C_LOCALE) that could be allowed to take advantage of the knowledge that the global locale is "C" in the backend. Just thoughts... 2. The process global locale is always "C". Each backend process (in future: thread) calls uselocale() to set the thread-local locale to the database default, so it can keep using the non-_l() functions as a way to access the database default, and otherwise uses _l() functions if it wants something else (as we do already). The "C" locale can be accessed with foo_l(..., LC_GLOBAL_LOCALE) or PG_C_LOCALE etc. XXX This option is blocked by NetBSD's rejection of uselocale(). I guess if you really wanted to work around NetBSD's policy you could make your own wrapper for all affected functions, translating foo() to foo_l(..., pg_thread_current_locale), so you could write uselocale(), which is pretty much what every other libc does... But eughhh 3. The process global locale is inherited from the system or can be set by the user however they want for the benefit of extensions, but we never change it after startup or refer to it. Then we do the same as 1 or 2, except if we ever want "C" we'll need a locale_t for that, again perhaps using the PC_C_LOCALE mechanism. Non-_l() functions are effectively useless except in cases where you really want to use the system's settings inherited from startup, eg for messages, so they'd mostly be banned. What else? > > They're right that we really just want to use "C" in some places, and > > their LC_C_LOCALE is a very useful system-provided value to be able > > to > > pass into _l functions. It's a shame it's non-standard, because > > without it you have to allocate a locale_t for "C" and keep it > > somewhere to feed to _l functions... > > If we're going to do that, why not just have ascii-only variants of our > own? pg_ascii_isspace(...) is at least as readable as isspace_l(..., > LC_C_LOCALE). Yeah, I agree there are some easy things we should do that way. In fact we've already established that scanner_isspace() needs to be used in lots more places for that, even aside from thread-safety.
On Wed, 2024-08-14 at 14:31 +1200, Thomas Munro wrote:
> 1. The process global locale is always "C". If you ever call
> uselocale(), it can only be for short stretches, and you have to
> restore it straight after; perhaps it is only ever used in
> replacement
> _l() functions for systems that lack them. You need to use _l()
> functions for all non-"C" locales. The current database default
> needs
> to be available as a variable (in future: thread-local variable, or
> reachable from one), so you can use it in _l() functions. The "C"
> locale can be accessed implicitly with non-l() functions, or you
> could
> ban those to reduce confusion and use foo_l(..., LC_GLOBAL_LOCALE)
> for
> "C". Or a name like PG_C_LOCALE, which, in backend code could be
> just
> LC_GLOBAL_LOCALE, while in frontend/library code it could be the
> singleton mechanism I showed in CF#5166.
+1 to this approach. It makes things more consistent across platforms
and avoids surprising dependencies on the global setting.
We'll have to be careful that each call site is either OK with C, or
that it gets changed to an _l() variant. We also have to be careful
about extensions.
Regards,
Jeff Davis
On Wed, Aug 7, 2024 at 7:07 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Aug 7, 2024 at 10:23 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Jeff Davis <pgsql@j-davis.com> writes:
> > > 2. I don't see a good way to canonicalize a locale name, like in
> > > check_locale(), which uses the result of setlocale().
> >
> > What I can tell you about that is that check_locale's expectation
> > that setlocale does any useful canonicalization is mostly wishful
> > thinking [1]. On a lot of platforms you just get the input string
> > back again. If that's the only thing keeping us on setlocale,
> > I think we could drop it. (Perhaps we should do some canonicalization
> > of our own instead?)
>
> +1
>
> I know it does something on Windows (we know the EDB installer gives
> it strings like "Language,Country" and it converts them to
> "Language_Country.Encoding", see various threads about it all going
> wrong), but I'm not sure it does anything we actually want to
> encourage. I'm hoping we can gradually screw it down so that we only
> have sane BCP 47 in the system on that OS, and I don't see why we
> wouldn't just use them verbatim.
Some more thoughts on check_locale() and canonicalisation:
I doubt the canonicalisation does anything useful on any Unix system,
as they're basically just file names. In the case of glibc, the
encoding part is munged before opening the file so it tolerates .utf8
or .UTF-8 or .u---T----f------8 on input, but it still returns
whatever you gave it so the return value isn't cleaning the input or
anything.
"" is a problem however... the special value for "native environment"
is returned as a real locale name, which we probably still need in
places. We could change that to newlocale("") + query instead, but
there is a portability pipeline problem getting the name out of it:
1. POSIX only just added getlocalename_l() in 2024[1][2].
2. Glibc has non-standard nl_langinfo_l(NL_LOCALE_NAME(category), loc).
3. The <xlocale.h> systems (macOS/*BSD) have non-standard
querylocale(mask, loc).
4. AFAIK there is no way to do it on pure POSIX 2008 systems.
5. For Windows, there is a completely different thing to get the
user's default locale, see CF#3772.
The systems in category 4 would in practice be Solaris and (if it
comes back) AIX. Given that, we probably just can't go that way soon.
So I think the solution could perhaps be something like: in some early
startup phase before there are any threads, we nail down all the
locale categories to "C" (or whatever we decide on for the permanent
global locale), and also query the "" categories and make a copy of
them in case anyone wants them later, and then never call setlocale()
again.
[1] https://pubs.opengroup.org/onlinepubs/9799919799/functions/getlocalename_l.html
[2] https://www.austingroupbugs.net/view.php?id=1220
On Thu, 2024-08-15 at 10:43 +1200, Thomas Munro wrote:
> So I think the solution could perhaps be something like: in some
> early
> startup phase before there are any threads, we nail down all the
> locale categories to "C" (or whatever we decide on for the permanent
> global locale), and also query the "" categories and make a copy of
> them in case anyone wants them later, and then never call setlocale()
> again.
+1.
Regards,
Jeff Davis
On Thu, Aug 15, 2024 at 11:00 AM Jeff Davis <pgsql@j-davis.com> wrote:
> On Thu, 2024-08-15 at 10:43 +1200, Thomas Munro wrote:
> > So I think the solution could perhaps be something like: in some
> > early
> > startup phase before there are any threads, we nail down all the
> > locale categories to "C" (or whatever we decide on for the permanent
> > global locale), and also query the "" categories and make a copy of
> > them in case anyone wants them later, and then never call setlocale()
> > again.
>
> +1.
We currently nail down these categories:
/* We keep these set to "C" always. See pg_locale.c for explanation. */
init_locale("LC_MONETARY", LC_MONETARY, "C");
init_locale("LC_NUMERIC", LC_NUMERIC, "C");
init_locale("LC_TIME", LC_TIME, "C");
CF #5170 has patches to make it so that we stop changing them even
transiently, using locale_t interfaces to feed our caches of stuff
needed to work with those categories, so they really stay truly nailed
down.
It sounds like someone needs to investigate doing the same thing for
these two, from CheckMyDatabase():
if (pg_perm_setlocale(LC_COLLATE, collate) == NULL)
ereport(FATAL,
(errmsg("database locale is incompatible with
operating system"),
errdetail("The database was initialized with
LC_COLLATE \"%s\", "
" which is not recognized by setlocale().", collate),
errhint("Recreate the database with another locale or
install the missing locale.")));
if (pg_perm_setlocale(LC_CTYPE, ctype) == NULL)
ereport(FATAL,
(errmsg("database locale is incompatible with
operating system"),
errdetail("The database was initialized with LC_CTYPE \"%s\", "
" which is not recognized by setlocale().", ctype),
errhint("Recreate the database with another locale or
install the missing locale.")));
How should that work? Maybe we could imagine something like
MyDatabaseLocale, a locale_t with LC_COLLATE and LC_CTYPE categories
set appropriately. Or should it be a pg_locale_t instead (if your
database default provider is ICU, then you don't even need a locale_t,
right?).
Then I think there is one quite gnarly category, from
assign_locale_messages() (a GUC assignment function):
(void) pg_perm_setlocale(LC_MESSAGES, newval);
I have never really studied gettext(), but I know it was just
standardised in POSIX 2024, and the standardised interface has _l()
variants of all functions. Current implementations don't have them
yet. Clearly we absolutely couldn't call pg_perm_setlocale() after
early startup -- but if gettext() is relying on the current locale to
affect far away code, then maybe this is one place where we'd just
have to use uselocale(). Perhaps we could plan some transitional
strategy where NetBSD users lose the ability to change the GUC without
restarting the server and it has to be the same for all sessions, or
something like that, until they produce either gettext_l() or
uselocale(), but I haven't thought hard about this part at all yet...
On 15.08.24 00:43, Thomas Munro wrote:
> "" is a problem however... the special value for "native environment"
> is returned as a real locale name, which we probably still need in
> places. We could change that to newlocale("") + query instead, but
Where do we need that in the server?
It should just be initdb doing that and then initializing the server
with concrete values based on that.
I guess technically some of these GUC settings default to the
environment? But I think we could consider getting rid of that.
On Fri, Aug 16, 2024 at 1:25 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> On 15.08.24 00:43, Thomas Munro wrote:
> > "" is a problem however... the special value for "native environment"
> > is returned as a real locale name, which we probably still need in
> > places. We could change that to newlocale("") + query instead, but
>
> Where do we need that in the server?
Hmm. Yeah, right, the only way I've found so far to even reach that
code and that captures that result is:
create database db2 locale = '';
Thats puts 'en_NZ.UTF-8' or whatever in pg_database. In contrast,
create collation will accept '' but just store it verbatim, and the
GUCs for changing time, monetary, numeric accept it too and keep it
verbatim. We could simply ban '' in all user commands. I doubt
they're documented as acceptable values, once you get past initdb and
have a running system. Looking into that...
> It should just be initdb doing that and then initializing the server
> with concrete values based on that.
Right.
> I guess technically some of these GUC settings default to the
> environment? But I think we could consider getting rid of that.
Yeah.
On Fri, Aug 16, 2024 at 9:09 AM Thomas Munro <thomas.munro@gmail.com> wrote: > On Fri, Aug 16, 2024 at 1:25 AM Peter Eisentraut <peter@eisentraut.org> wrote: > > It should just be initdb doing that and then initializing the server > > with concrete values based on that. > > Right. > > > I guess technically some of these GUC settings default to the > > environment? But I think we could consider getting rid of that. > > Yeah. Seems to make a lot of sense. I tried that out over in CF #5170. (In case it's not clear why I'm splitting discussion between threads: I was thinking of this thread as the one where we're discussing what needs to be done, with other threads being spun off to become CF entry with concrete patches. I realised re-reading some discussion that that might not be obvious...)
On 8/9/24 8:24 PM, Jeff Davis wrote: > On Fri, 2024-08-09 at 13:41 +0200, Andreas Karlsson wrote: >> I am leaning towards that we should write our own pure ascii >> functions >> for this. > > That makes sense for a lot of call sites, but it could cause breakage > if we aren't careful. > >> Since we do not support any non-ascii compatible encodings >> anyway I do not see the point in having locale support in most of >> these >> call-sites. > > An ascii-compatible encoding just means that the code points in the > ascii range are represented as ascii. I'm not clear on whether code > points in the ascii range can return different results for things like > isspace(), but it sounds plausible -- toupper() can return different > results for 'i' in tr_TR. > > Also, what about the values outside 128-255, which are still valid > input to isspace()? My idea was that in a lot of those cases we only try to parse e.g. 0-9 as digits and always only . as the decimal separator so we should make just make that obvious by either using locale C or writing our own ascii only functions. These strings are meant to be read by machines, not humans, primarily. Andreas
On Wed, 2024-08-14 at 12:00 -0700, Jeff Davis wrote:
> On Wed, 2024-08-14 at 14:31 +1200, Thomas Munro wrote:
> > 1. The process global locale is always "C". If you ever call
> > uselocale(), it can only be for short stretches, and you have to
> > restore it straight after; perhaps it is only ever used in
> > replacement
> > _l() functions for systems that lack them. You need to use _l()
> > functions for all non-"C" locales. The current database default
> > needs
> > to be available as a variable (in future: thread-local variable, or
> > reachable from one), so you can use it in _l() functions. The "C"
> > locale can be accessed implicitly with non-l() functions, or you
> > could
> > ban those to reduce confusion and use foo_l(..., LC_GLOBAL_LOCALE)
> > for
> > "C". Or a name like PG_C_LOCALE, which, in backend code could be
> > just
> > LC_GLOBAL_LOCALE, while in frontend/library code it could be the
> > singleton mechanism I showed in CF#5166.
>
> +1 to this approach. It makes things more consistent across platforms
> and avoids surprising dependencies on the global setting.
>
> We'll have to be careful that each call site is either OK with C, or
> that it gets changed to an _l() variant. We also have to be careful
> about extensions.
Did we reach a conclusion here? Any thoughts on moving in this
direction, and whether 18 is the right time to do it?
Regards,
Jeff Davis
On Fri, Dec 13, 2024 at 8:22 AM Jeff Davis <pgsql@j-davis.com> wrote: > On Wed, 2024-08-14 at 12:00 -0700, Jeff Davis wrote: > > On Wed, 2024-08-14 at 14:31 +1200, Thomas Munro wrote: > > > 1. The process global locale is always "C". If you ever call > > > uselocale(), it can only be for short stretches, and you have to > > > restore it straight after; perhaps it is only ever used in > > > replacement > > > _l() functions for systems that lack them. You need to use _l() > > > functions for all non-"C" locales. The current database default > > > needs > > > to be available as a variable (in future: thread-local variable, or > > > reachable from one), so you can use it in _l() functions. The "C" > > > locale can be accessed implicitly with non-l() functions, or you > > > could > > > ban those to reduce confusion and use foo_l(..., LC_GLOBAL_LOCALE) > > > for > > > "C". Or a name like PG_C_LOCALE, which, in backend code could be > > > just > > > LC_GLOBAL_LOCALE, while in frontend/library code it could be the > > > singleton mechanism I showed in CF#5166. > > > > +1 to this approach. It makes things more consistent across platforms > > and avoids surprising dependencies on the global setting. > > > > We'll have to be careful that each call site is either OK with C, or > > that it gets changed to an _l() variant. We also have to be careful > > about extensions. > > Did we reach a conclusion here? Any thoughts on moving in this > direction, and whether 18 is the right time to do it? I think this is the best way, and I haven't seen anyone supporting any other idea. (I'm working on those setlocale()-removal patches I mentioned, more very soon...)
On 13.12.24 10:44, Thomas Munro wrote: > On Fri, Dec 13, 2024 at 8:22 AM Jeff Davis <pgsql@j-davis.com> wrote: >> On Wed, 2024-08-14 at 12:00 -0700, Jeff Davis wrote: >>> On Wed, 2024-08-14 at 14:31 +1200, Thomas Munro wrote: >>>> 1. The process global locale is always "C". If you ever call >>>> uselocale(), it can only be for short stretches, and you have to >>>> restore it straight after; perhaps it is only ever used in >>>> replacement >>>> _l() functions for systems that lack them. You need to use _l() >>>> functions for all non-"C" locales. The current database default >>>> needs >>>> to be available as a variable (in future: thread-local variable, or >>>> reachable from one), so you can use it in _l() functions. The "C" >>>> locale can be accessed implicitly with non-l() functions, or you >>>> could >>>> ban those to reduce confusion and use foo_l(..., LC_GLOBAL_LOCALE) >>>> for >>>> "C". Or a name like PG_C_LOCALE, which, in backend code could be >>>> just >>>> LC_GLOBAL_LOCALE, while in frontend/library code it could be the >>>> singleton mechanism I showed in CF#5166. >>> >>> +1 to this approach. It makes things more consistent across platforms >>> and avoids surprising dependencies on the global setting. >>> >>> We'll have to be careful that each call site is either OK with C, or >>> that it gets changed to an _l() variant. We also have to be careful >>> about extensions. >> >> Did we reach a conclusion here? Any thoughts on moving in this >> direction, and whether 18 is the right time to do it? > > I think this is the best way, and I haven't seen anyone supporting any > other idea. (I'm working on those setlocale()-removal patches I > mentioned, more very soon...) I also think this is the right direction, and we'll get closer with the remaining patches that Thomas has lined up. I think at this point, we could already remove all locale settings related to LC_COLLATE. Nothing uses that anymore. I think we will need to keep the global LC_CTYPE setting set to something useful, for example so that system error messages come out in the right encoding. But I'm concerned about the the Perl_setlocale() dance in plperl.c. Perl apparently does a setlocale(LC_ALL, "") during startup, and that code is a workaround to reset everything back afterwards. We need to be careful not to break that. (Perl has fixed that in 5.19, but the fix requires that you set another environment variable before launching Perl, which you can't do in a threaded system, so we'd probably need another fix eventually. See <https://github.com/Perl/perl5/issues/8274>.)
On Tue, 2024-12-17 at 13:14 +0100, Peter Eisentraut wrote: > I think we will need to keep the global LC_CTYPE setting set to > something useful, for example so that system error messages come out > in > the right encoding. Do we need to rely on the global LC_CTYPE setting? We already use bind_textdomain_codeset(). > But I'm concerned about the the Perl_setlocale() dance in plperl.c. > Perl apparently does a setlocale(LC_ALL, "") during startup, and that > code is a workaround to reset everything back afterwards. We need to > be > careful not to break that. > > (Perl has fixed that in 5.19, but the fix requires that you set > another > environment variable before launching Perl, which you can't do in a > threaded system, so we'd probably need another fix eventually. See > <https://github.com/Perl/perl5/issues/8274>.) I don't fully understand that issue, but I would think the direction we are going (keeping the global LC_CTYPE more consistent and relying on it less) would make the problem better. Regards, Jeff Davis
On 17.12.24 19:10, Jeff Davis wrote: > On Tue, 2024-12-17 at 13:14 +0100, Peter Eisentraut wrote: >> I think we will need to keep the global LC_CTYPE setting set to >> something useful, for example so that system error messages come out >> in >> the right encoding. > > Do we need to rely on the global LC_CTYPE setting? We already use > bind_textdomain_codeset(). I don't think that would cover messages from the C library (strerror, dlerror, etc.). >> But I'm concerned about the the Perl_setlocale() dance in plperl.c. >> Perl apparently does a setlocale(LC_ALL, "") during startup, and that >> code is a workaround to reset everything back afterwards. We need to >> be >> careful not to break that. >> >> (Perl has fixed that in 5.19, but the fix requires that you set >> another >> environment variable before launching Perl, which you can't do in a >> threaded system, so we'd probably need another fix eventually. See >> <https://github.com/Perl/perl5/issues/8274>.) > > I don't fully understand that issue, but I would think the direction we > are going (keeping the global LC_CTYPE more consistent and relying on > it less) would make the problem better. Yes, I think it's the right direction, but we need to figure this issue out eventually.
On Tue, 2024-12-17 at 13:14 +0100, Peter Eisentraut wrote: > > > > +1 to this approach. It makes things more consistent across > > > > platforms > > > > and avoids surprising dependencies on the global setting. > > > > > > > > I think this is the best way, and I haven't seen anyone supporting > > any > > other idea. (I'm working on those setlocale()-removal patches I > > mentioned, more very soon...) > > I also think this is the right direction, and we'll get closer with > the > remaining patches that Thomas has lined up. > > I think at this point, we could already remove all locale settings > related to LC_COLLATE. Nothing uses that anymore. > > I think we will need to keep the global LC_CTYPE setting set to > something useful, for example so that system error messages come out > in > the right encoding. > > But I'm concerned about the the Perl_setlocale() dance in plperl.c. > Perl apparently does a setlocale(LC_ALL, "") during startup, and that > code is a workaround to reset everything back afterwards. We need to > be > careful not to break that. > > (Perl has fixed that in 5.19, but the fix requires that you set > another > environment variable before launching Perl, which you can't do in a > threaded system, so we'd probably need another fix eventually. See > <https://github.com/Perl/perl5/issues/8274>.) To continue this thread, I did a symbol search in the meson build directory like (patterns.txt attached): for f in `find . -name *.o`; do if ( nm --format=just-symbols $f | \ grep -xE -f /tmp/patterns.txt > /dev/null ); then echo $f; fi; done and it output: ./contrib/fuzzystrmatch/fuzzystrmatch.so.p/dmetaphone.c.o ./contrib/fuzzystrmatch/fuzzystrmatch.so.p/fuzzystrmatch.c.o ./contrib/isn/isn.so.p/isn.c.o ./contrib/spi/refint.so.p/refint.c.o ./contrib/ltree/ltree.so.p/crc32.c.o ./src/backend/postgres_lib.a.p/commands_copyfromparse.c.o ./src/backend/postgres_lib.a.p/utils_adt_pg_locale_libc.c.o ./src/backend/postgres_lib.a.p/tsearch_wparser_def.c.o ./src/backend/postgres_lib.a.p/parser_scansup.c.o ./src/backend/postgres_lib.a.p/utils_adt_inet_net_pton.c.o ./src/backend/postgres_lib.a.p/tsearch_ts_locale.c.o ./src/bin/psql/psql.p/meson-generated_.._tab-complete.c.o ./src/interfaces/ecpg/preproc/ecpg.p/meson-generated_.._preproc.c.o ./src/interfaces/ecpg/compatlib/libecpg_compat.so.3.18.p/informix.c.o ./src/interfaces/ecpg/compatlib/libecpg_compat.a.p/informix.c.o ./src/port/libpgport_srv.a.p/pgstrcasecmp.c.o ./src/port/libpgport_shlib.a.p/pgstrcasecmp.c.o ./src/port/libpgport.a.p/pgstrcasecmp.c.o Not a short list, but not a long list, either, so seems tractable. Note that this misses things like isdigit() which is inlined. A few observations while spot-checking these files: --------------------- pgstrcasecmp.c - has code like: else if (IS_HIGHBIT_SET(ch) && islower(ch)) ch = toupper(ch); and comments like "Note however that the whole thing is a bit bogus for multibyte character sets." Most of the callers are directly comparing with ascii literals, so I'm not sure what the point is. There are probably some more interesting callers hidden in there. ---------------------- pg_locale_libc.c - char2wchar and wchar2char use mbstowcs and wcstombs when the input locale is NULL. The main culprit seems to be full text search, which has a bunch of /* TODO */ comments. Another caller is get_iso_localename(). There are also a couple false positives where mbstowcs_l/wcstombs_l are emulated with uselocale() and mbstowcs/wcstombs. In that case, it's not actually sensitive to the global setting. ----------------------- copyfromparse.c - the input is ASCII so it can use pg_ascii_tolower() instead of tolower() ----------------------- Regards, Jeff Davis
Вложения
On Thu, 2025-06-05 at 22:15 -0700, Jeff Davis wrote:
> To continue this thread, I did a symbol search in the meson build
> directory like (patterns.txt attached):
Attached a rough patch series which does what everyone seemed to agree
on:
* Change some trivial ASCII cases to use pg_ascii_* variants
* Set LC_COLLATE and LC_CTYPE to C with pg_perm_setlocale
* Introduce a new global_lc_ctype for callers that still need to use
operations that depend on datctype
There should be no behavior changes in this series.
Benefits:
* a tiny step toward multithreading, because connections to different
databases no longer require different setlocale() settings
* easier to identify dependence on datctype, because callers will
need to refer to global_lc_ctype.
* harder to accidentally depend on datctype or datcollate
Ideally, when the database locale provider is not libc, the user
shouldn't need to even think about a valid LC_CTYPE locale at all. But
that requires more work, and potentially risk of breakage.
Regards,
Jeff Davis
`
Вложения
- v1-0001-copyfromparse.c-use-pg_ascii_tolower-rather-than-.patch
- v1-0002-contrib-spi-refint.c-use-pg_ascii_tolower-instead.patch
- v1-0003-isn.c-use-pg_ascii_toupper-instead-of-toupper.patch
- v1-0004-inet_net_pton.c-use-pg_ascii_tolower-rather-than-.patch
- v1-0005-Add-global_lc_ctype-to-hold-locale_t-for-datctype.patch
- v1-0006-Use-global_lc_ctype-for-callers-of-locale-aware-f.patch
- v1-0007-Fix-the-last-remaining-callers-relying-on-setloca.patch
- v1-0008-Set-process-LC_COLLATE-C-and-LC_CTYPE-C.patch
On 07.06.25 00:23, Jeff Davis wrote: > On Thu, 2025-06-05 at 22:15 -0700, Jeff Davis wrote: >> To continue this thread, I did a symbol search in the meson build >> directory like (patterns.txt attached): > > Attached a rough patch series which does what everyone seemed to agree > on: > > * Change some trivial ASCII cases to use pg_ascii_* variants > * Set LC_COLLATE and LC_CTYPE to C with pg_perm_setlocale > * Introduce a new global_lc_ctype for callers that still need to use > operations that depend on datctype v1-0001-copyfromparse.c-use-pg_ascii_tolower-rather-than-.patch v1-0002-contrib-spi-refint.c-use-pg_ascii_tolower-instead.patch v1-0003-isn.c-use-pg_ascii_toupper-instead-of-toupper.patch v1-0004-inet_net_pton.c-use-pg_ascii_tolower-rather-than-.patch These look good to me. v1-0005-Add-global_lc_ctype-to-hold-locale_t-for-datctype.patch This looks ok (but might depend on how patch 0006 turns out). v1-0006-Use-global_lc_ctype-for-callers-of-locale-aware-f.patch I think these need further individual analysis and explanation why these should use the global lc_ctype setting. For example, you could argue that the SQL-callable soundex(text) function should use the collation object of its input value, not the global locale. But furthermore, soundex_code() could actually just use pg_ascii_toupper() instead. And in ts_locale.c, the isalnum_l() call should use mylocale that already exists in that function. The problem to solve it getting a good value into mylocale. Using the global setting confuses the issue a bit, I think. v1-0007-Fix-the-last-remaining-callers-relying-on-setloca.patch Do we have any data what platforms we'd need these checks for? Also, if you look into wparser_def.c what p_isxdigit is used for, it's used for parsing XML (presumably HTML) files, so we just need ASCII-only behavior and no locale dependency. v1-0008-Set-process-LC_COLLATE-C-and-LC_CTYPE-C.patch As I mentioned earlier in the thread, I don't think we can do this for LC_CTYPE, because otherwise system error messages would not come out in the right encoding. For the LC_COLLATE settings, I think we could just do the setting in main(), where the other non-database-specific locale categories are set.
On Tue, 2025-06-10 at 17:32 +0200, Peter Eisentraut wrote:
> As I mentioned earlier in the thread, I don't think we can do this
> for
> LC_CTYPE, because otherwise system error messages would not come out
> in
> the right encoding.
Is there any way around that? If all we need is the right encoding, do
we need a proper locale?
Regards,
Jeff Davis
On Mon, 2025-07-07 at 17:56 -0700, Jeff Davis wrote: > I looked into this a bit, and if I understand correctly, the only > problem is with strerror() and strerror_r(), which depend on > LC_MESSAGES for the language but LC_CTYPE to find the right encoding. ... > Windows would be a different story, though: strerror() doesn't seem > to > have a variant that accepts a _locale_t object, and even if it did, I > don't see a way to create a _locale_t object with LC_MESSAGES and > LC_CTYPE set to different values. I think I have an answer to the second part here: "For information about the format of the locale argument, see Locale names, Languages, and Country/Region strings." https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/create-locale-wcreate-locale?view=msvc-170 and when I follow that link, I see: "You can specify multiple category types, separated by semicolons. Category types that aren't specified use the current locale setting. For example, this code snippet sets the current locale for all categories to de-DE, and then sets the categories LC_MONETARY to en-GB and LC_TIME to es-ES: _wsetlocale(LC_ALL, L"de-DE"); _wsetlocale(LC_ALL, L"LC_MONETARY=en-GB;LC_TIME=es-ES");" https://learn.microsoft.com/en-us/cpp/c-runtime-library/locale-names-languages-and-country-region-strings?view=msvc-170 So we just need to construct a string of the right form, and we can have a _locale_t object representing the global locale for all categories. I'm not sure exactly how we escape the individual locale names, but it might be enough to just reject ';' in the locale name (at least for windows). The first problem -- how to affect the encoding of strings returned by strerror() on windows -- may be solvable as well. It looks like LC_MESSAGES is not supported at all on windows, so the only thing to be concerned about is the encoding, which is affected by LC_CTYPE. But windows doesn't offer uselocale() or strerror_l(). The only way seems to be to call _configthreadlocale(_ENABLE_PER_THREAD_LOCALE) and then setlocale(LC_CTYPE, datctype) right before strerror(), and switch it back to "C" right afterward. Comments welcome. Regards, Jeff Davis
On Thu, Jul 10, 2025 at 10:52 AM Jeff Davis <pgsql@j-davis.com> wrote: > The first problem -- how to affect the encoding of strings returned by > strerror() on windows -- may be solvable as well. It looks like > LC_MESSAGES is not supported at all on windows, so the only thing to be > concerned about is the encoding, which is affected by LC_CTYPE. But > windows doesn't offer uselocale() or strerror_l(). The only way seems > to be to call _configthreadlocale(_ENABLE_PER_THREAD_LOCALE) and then > setlocale(LC_CTYPE, datctype) right before strerror(), and switch it > back to "C" right afterward. Comments welcome. FWIW there is an example of that in src/port/pg_localeconv_r.c.
On Tue, Jul 8, 2025 at 1:14 PM Jeff Davis <pgsql@j-davis.com> wrote: > v4-0008 uses LC_C_LOCALE, and I'm not sure if that's portable, but if > the buildfarm complains then I'll fix it or revert it. (Catching up with this thread...) LC_C_LOCALE is definitely not portable: I've only seen it on macOS and NetBSD. It would be a good thing to propose to POSIX, since no other approach can be retrofitted quite so cromulently... I tried to make a portable PG_C_LOCALE mechanism like that, but it was reverted for reasons needing more investigation... see 8e993bff5326b00ced137c837fce7cd1e0ecae14 (reverted by 3c8e463b0d885e0d976f6a13a1fb78187b25c86f).
On Thu, 2025-07-10 at 11:53 +1200, Thomas Munro wrote:
> On Thu, Jul 10, 2025 at 10:52 AM Jeff Davis <pgsql@j-davis.com>
> wrote:
> > The first problem -- how to affect the encoding of strings returned
> > by
> > strerror() on windows -- may be solvable as well. It looks like
> > LC_MESSAGES is not supported at all on windows, so the only thing
> > to be
> > concerned about is the encoding, which is affected by LC_CTYPE. But
> > windows doesn't offer uselocale() or strerror_l(). The only way
> > seems
> > to be to call _configthreadlocale(_ENABLE_PER_THREAD_LOCALE) and
> > then
> > setlocale(LC_CTYPE, datctype) right before strerror(), and switch
> > it
> > back to "C" right afterward. Comments welcome.
>
> FWIW there is an example of that in src/port/pg_localeconv_r.c.
OK, so it seems we have a path forward here:
1. Have a global_libc_locale that represents all of the categories, and
keep it up to date with GUC changes. On windows, it requires keeping
the textual locale names handy (e.g. copies of datcollate and
datctype), and building the special locale string and doing
_create_locale(LC_ALL, "LC_ABC=somelocale;LC_XYZ=otherlocale").
2. When there's no _l() variant of a function, like strerror_r(), wrap
with uselocale(). On windows, this means using the trick above with
_configthreadlocale(_ENABLE_PER_THREAD_LOCALE).
I don't have a great windows development environment, and it appears CI
and the buildfarm don't offer great coverage either. Can I ask for a
volunteer to do the windows side of this work?
Regards,
Jeff Davis
On Thu, 2025-07-10 at 12:01 +1200, Thomas Munro wrote:
> I tried to make a portable PG_C_LOCALE mechanism like that, but it
> was
> reverted for reasons needing more investigation... see
> 8e993bff5326b00ced137c837fce7cd1e0ecae14 (reverted by
> 3c8e463b0d885e0d976f6a13a1fb78187b25c86f).
The revert seems to be related to pgport_shlib. At least for my current
work, I'm focused on removing setlocale() dependencies in the backend,
and a PG_C_LOCALE should work fine there.
Regards,
Jeff Davis
On Fri, Jul 11, 2025 at 6:33 AM Jeff Davis <pgsql@j-davis.com> wrote: > On Thu, 2025-07-10 at 12:01 +1200, Thomas Munro wrote: > > I tried to make a portable PG_C_LOCALE mechanism like that, but it > > was > > reverted for reasons needing more investigation... see > > 8e993bff5326b00ced137c837fce7cd1e0ecae14 (reverted by > > 3c8e463b0d885e0d976f6a13a1fb78187b25c86f). > > The revert seems to be related to pgport_shlib. At least for my current > work, I'm focused on removing setlocale() dependencies in the backend, > and a PG_C_LOCALE should work fine there. OK, I'll figure out what happened with that and try to post a new version over on that other thread soon. (FWIW I learned a couple of encouraging things about that topic: glibc's newlocale(LC_ALL, NULL, 0) seems to give you a static singleton anyway, no allocation happens, so it can't fail in practice and it's cheap, and FreeBSD also supports LC_C_LOCALE like NetBSD and macOS, it just doesn't have a name, you pass NULL instead (which is the value that LC_C_LOCALE has on those other systems). I actually rather like the macro, because how else are you supposed to test whether this system can accept NULL there?)
On Fri, Jul 11, 2025 at 6:22 AM Jeff Davis <pgsql@j-davis.com> wrote: > I don't have a great windows development environment, and it appears CI > and the buildfarm don't offer great coverage either. Can I ask for a > volunteer to do the windows side of this work? Me neither but I'm willing to help with that, and have done lots of closely related things through trial-by-CI...
On Fri, 2025-07-11 at 11:48 +1200, Thomas Munro wrote:
> On Fri, Jul 11, 2025 at 6:22 AM Jeff Davis <pgsql@j-davis.com> wrote:
> > I don't have a great windows development environment, and it
> > appears CI
> > and the buildfarm don't offer great coverage either. Can I ask for
> > a
> > volunteer to do the windows side of this work?
>
> Me neither but I'm willing to help with that, and have done lots of
> closely related things through trial-by-CI...
Attached a patch to separate the message translation (both gettext and
strerror translations) from setlocale(). That's a step towards thread
safety, and also a step toward setting LC_CTYPE=C permanently (more
work still required there).
The patch feels a bit over-engineered, but I'd like to know what you
think. It would be great if you could test/debug the windows NLS-
enabled paths.
I'm also not sure what to do about the NetBSD path. NetBSD has no
uselocale(), so I have to fall bad to temporary setlocale(), which is
not thread safe. And I'm getting a mysterious error in test_aio for
NetBSD, which I haven't investigated yet.
Regards,
Jeff Davis
Вложения
On Wed, 2025-07-23 at 19:11 -0700, Jeff Davis wrote:
> The patch feels a bit over-engineered, but I'd like to know what you
> think. It would be great if you could test/debug the windows NLS-
> enabled paths.
Let me explain how it ended up looking over-engineered, and perhaps
someone has a simpler solution.
For gettext, we already configure the encoding with
bind_textdomain_codeset(). All it needs is LC_MESSAGES set properly,
which can be done with uselocale(), as a semi-permanent setting until
the next GUC change, just like setlocale() today. There are a couple
minor problems for platforms without uselocale(). For windows, we could
just permanently do:
_configthreadlocale(_ENABLE_PER_THREAD_LOCALE)
and then use _wsetlocale. For NetBSD, I don't have a solution, but
perhaps we can just reject new lc_messages settings after startup, or
just defer the problem until threading actually becomes a pressing
issue.
The main problem is with strerror_r(). To get the right LC_MESSAGES
setting, we need the separate path for windows (which has neither
uselocale() nor strerror_l()). Because we need to keep track of that
path anyway, I used it for gettext as well to have a cleaner separation
for the entire message translation locale. That means we can avoid
permanent locale settings, and reduce the chances that we accidentally
depend on the global locale.
Regards,
Jeff Davis
On Thu, 2025-07-24 at 11:10 -0700, Jeff Davis wrote:
> The main problem is with strerror_r()...
Postgres messages, like "division by zero" are translated just fine
without LC_CTYPE; gettext() only needs LC_MESSAGES and the server
encoding. So these are fine.
We use strerror_r() to translate the system errno into a readable
message, like "No such file or directory", i.e. the %m replacements.
That needs LC_CTYPE set (just for the encoding, not the
language/region) as well as LC_MESSAGES (for the language/region).
When using a locale provider other than libc, it's unfortunate to
require LC_CTYPE to be set for just this one single purpose. The locale
itself, e.g. the "en_US" part, is not used at all; only the encoding
part of the setting is relevant. And there is no value other than "C"
that works on all platforms. It's fairly confusing to explain why the
LC_CTYPE setting is required for the builtin or ICU providers at all.
Also, while it's far from the biggest challenge when it comes to
multithreading, it does cause thread-safety headaches on platforms
without uselocale().
Perhaps we could get the ASCII message and run it through gettext()?
That would be extra work for translators, but perhaps not a lot, given
that it's a small and static set of messages in practice. That would
also have the benefit that either NLS is enabled or not -- right now,
since the translation happens in two different ways you can end up with
partially-translated messages. It would also result in consistent
translations across platforms.
Regards,
Jeff Davis
On Wed, 2025-07-23 at 19:11 -0700, Jeff Davis wrote:
> On Fri, 2025-07-11 at 11:48 +1200, Thomas Munro wrote:
> > On Fri, Jul 11, 2025 at 6:22 AM Jeff Davis <pgsql@j-davis.com>
> > wrote:
> > > I don't have a great windows development environment, and it
> > > appears CI
> > > and the buildfarm don't offer great coverage either. Can I ask
> > > for
> > > a
> > > volunteer to do the windows side of this work?
> >
> > Me neither but I'm willing to help with that, and have done lots of
> > closely related things through trial-by-CI...
Attached a new patch series, v6.
Rather than creating new global locale_t objects, this series (along
with a separate patch for NLS[1]) removes the dependency on the global
LC_CTYPE entirely. It's a bunch of small patches that replace direct
calls to tolower()/toupper() with calls into the provider.
An assumption of these patches is that, in the UTF-8 encoding, the
logic in pg_tolower()/pg_toupper() is equivalent to
pg_ascii_tolower()/pg_ascii_toupper().
Generally these preserve existing behavior, but there are a couple
differences:
* If using the builtin C locale (not C.UTF-8) along with a datctype
that's a non-C locale with single-byte encoding, it could affect the
results of downcase_identifier(), ltree, and fuzzystrmatch on
characters > 127. For ICU, I went to a bit of extra effort to preserve
the existing behavior here, because it's more likely to be used for
single-byte encodings.
* When using ICU or builtin C.UTF-8, along with a datctype of
"tr_TR.UTF-8", then it will affect ltree's and fuzzystrmatch's
treatment of i/I.
If these are a concern we can fix them with some hacks, but those
behaviors seem fairly obscure to me.
Regards,
Jeff Davis
[1]
https://www.postgresql.org/message-id/90f176c5b85b9da26a3265b2630ece3552068566.camel@j-davis.com
Вложения
- v6-0001-Avoid-global-LC_CTYPE-dependency-in-pg_locale_lib.patch
- v6-0002-Define-char_tolower-char_toupper-for-all-locale-p.patch
- v6-0003-Avoid-global-LC_CTYPE-dependency-in-like.c.patch
- v6-0004-Avoid-global-LC_CTYPE-dependency-in-scansup.c.patch
- v6-0005-Avoid-global-LC_CTYPE-dependency-in-pg_locale_icu.patch
- v6-0006-Avoid-global-LC_CTYPE-dependency-in-ltree-crc32.c.patch
- v6-0007-Avoid-global-LC_CTYPE-dependency-in-fuzzystrmatch.patch
- v6-0008-Don-t-include-ICU-headers-in-pg_locale.h.patch
- v6-0009-Avoid-global-LC_CTYPE-dependency-in-strcasecmp.c-.patch
On Tue, 2025-10-28 at 17:19 -0700, Jeff Davis wrote: > Attached a new patch series, v6. I'm eager to start committing this series so that we have plenty of time to sort out any problems. I welcome feedback before or after commit, and I can revert if necessary. The goal here is to do a permanent: setlocale(LC_CTYPE, "C") in the postmaster, and instead use _l() variants where necessary. Forcing the global LC_CTYPE to C will avoid platform-specific nuances spread throughout the code, and prevent new code from accidentally depending on platform-specific libc behavior. Instead, libc ctype behavior will only happen through a pg_locale_t object. It also takes us a step closer to thread safety. LC_COLLATE was already permenently set to "C" (5e6e42e4), and most of LC_CTYPE behavior already uses a pg_locale_t object. This series is about removing the last few places that rely on raw calls to tolower()/toupper() (sometimes through pg_tolower()). Where there isn't a pg_locale_t immediately available it uses the database default locale (which might or might not be libc). There's another thread for what to do about strerror_r[1], which depends on LC_CTYPE for the encoding: https://www.postgresql.org/message-id/90f176c5b85b9da26a3265b2630ece3552068566.camel@j-davis.com pg_localeconv_r() does depend on the LC_CTYPE for the encoding, but it already sets it from lc_monetary and lc_numeric, without using datctype or the global setting. Then PGLC_localeconv() converts to the database encoding, if necessary. So it's an exception to the rule that all ctype behavior goes through a pg_locale_t, but it's not a problem. (Aside: we could consider this approach as a narrower fix for strerror_r(), as well.) There may be a loose end around plperl, as well, but not sure if this will make it any worse. Some other LC_* settings still rely on setlocale(), which can be considered separately unless there's some interaction that I missed. Note that the datcollate and datctype fields are already mostly irrelevant for non-libc providers. We could set those to NULL, but for now I don't intend to do that. Regards, Jeff Davis
Jeff Davis wrote: > The goal here is to do a permanent: > > setlocale(LC_CTYPE, "C") > > in the postmaster, and instead use _l() variants where necessary. What about code in extensions? AFAIU a user can control the locale in effect by setting the LC_CTYPE argument of CREATE DATABASE, which ends up in the environment of backends serving that database. If it's forced to "C", how can an extension use locale-aware libc functions? In theory it's the same problem with LC_COLLATE, except that functions like tolower()/toupper() are much more likely to be used in extensions than strcoll(). Best regards, -- Daniel Vérité https://postgresql.verite.pro/
On Thu, 2025-10-30 at 21:41 +0100, Daniel Verite wrote:
> What about code in extensions? AFAIU a user can control the
> locale in effect by setting the LC_CTYPE argument of
> CREATE DATABASE, which ends up in the environment
> of backends serving that database.
> If it's forced to "C", how can an extension use locale-aware
> libc functions?
Extensions often need to be updated for a new major version.
The extension should call pg_database_locale(), and pass that to a
function exposed in pg_locale.h. A recent commit exposed pg_iswalpha(),
etc., so there's a reasonable set of functions that should be suitable
for most purposes.
If it's not available in pg_locale.h, or the extension really needs to
use a different LC_CTYPE for some reason, it can use an _l() variant of
the function.
Regards,
Jeff Davis
On 30.10.25 18:17, Jeff Davis wrote: > On Tue, 2025-10-28 at 17:19 -0700, Jeff Davis wrote: >> Attached a new patch series, v6. > > I'm eager to start committing this series so that we have plenty of > time to sort out any problems. I welcome feedback before or after > commit, and I can revert if necessary. What is one supposed to do with this statement? You post a series of 9 patches and the next day you say you are eager to commit it? Do you not want to give others the time to properly review this? The patches say they are "v6", but AFAICT the previous patches "v5" and "v4" in this thread are substantially different from these. > The goal here is to do a permanent: > > setlocale(LC_CTYPE, "C") > > in the postmaster, and instead use _l() variants where necessary. > > Forcing the global LC_CTYPE to C will avoid platform-specific nuances > spread throughout the code, and prevent new code from accidentally > depending on platform-specific libc behavior. Instead, libc ctype > behavior will only happen through a pg_locale_t object. > > It also takes us a step closer to thread safety. At first glance, these patches seem reasonable steps into that direction. But I'm not sure that we actually want to make that switch. It would be good if our code is independent of the global locale settings, but that doesn't mean that there couldn't be code in extensions, other libraries, or other corners of the operating system that relies on this. In general, and I haven't looked this up in the applicable standards, it seems like a good idea to accurately declare what encoding you operate in.
Jeff Davis wrote:
> On Thu, 2025-10-30 at 21:41 +0100, Daniel Verite wrote:
> > What about code in extensions? AFAIU a user can control the
> > locale in effect by setting the LC_CTYPE argument of
> > CREATE DATABASE, which ends up in the environment
> > of backends serving that database.
> > If it's forced to "C", how can an extension use locale-aware
> > libc functions?
>
> Extensions often need to be updated for a new major version.
I think forcing the C locale is not comparable to API changes,
and the consequences are not even necessarily fixable for extensions.
For instance, consider the following function, when run in a database
with en_US.utf8 as locale.
CREATE FUNCTION lt_test(text,text) RETURNS boolean as $$
use locale; return ($_[0] lt $_[1])?1:0;
$$ LANGUAGE plperlu;
select lt_test('a', 'B');
With PG 18 it returns true
With 19devel it returns false.
This is since commit 5e6e42e4 doing that:
+ * Collation is handled by pg_locale.c, and the behavior is dependent
on
+ * the provider. strcoll(), etc., should not be called directly.
+ */
+ init_locale("LC_COLLATE", LC_COLLATE, "C");
+
+ /*
Obviously libperl is not going to be updated to call Postgres
string comparisons functions instead of strcoll().
The same is probably true for other languages available as
extensions that expose POSIX locale-aware functions.
Extending this logic to LC_CTYPE will extend the breakage.
While I agree with the goal of not depending on setlocale()
in the core code for anything that should be locale-provider
dependent, making this goal leak into extensions seems
unnecessarily user-hostile. What it's saying to users is,
before v19 you could choose your locale, and starting
with v19 you'll have "C" whether you want it or not.
Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
On Fri, 2025-10-31 at 15:01 +0100, Daniel Verite wrote:
> > Extensions often need to be updated for a new major version.
>
> I think forcing the C locale is not comparable to API changes,
> and the consequences are not even necessarily fixable for extensions.
Are we in agreement that it's fine for C extensions?
> For instance, consider the following function, when run in a database
> with en_US.utf8 as locale.
>
> CREATE FUNCTION lt_test(text,text) RETURNS boolean as $$
> use locale; return ($_[0] lt $_[1])?1:0;
> $$ LANGUAGE plperlu;
>
> select lt_test('a', 'B');
Are you aware of PL code that does things like that? If the database
locale is ICU, that would be at least a little bit confusing.
Regards,
Jeff Davis
On Fri, 2025-10-31 at 10:40 +0100, Peter Eisentraut wrote:
> But I'm not sure that we actually want to make that switch. It would
> be
> good if our code is independent of the global locale settings, but
> that
> doesn't mean that there couldn't be code in extensions, other
> libraries,
> or other corners of the operating system that relies on this.
This question has been brewing for a while. How should we make this
decision?
> In
> general, and I haven't looked this up in the applicable standards, it
> seems like a good idea to accurately declare what encoding you
> operate in.
One frustration (for me, at least) is that there is no way to set the
encoding without specifying the locale. LC_CTYPE=C.UTF-8 is close, but
the libc version is not available on all platforms and has some quirks.
That makes any changes to the initdb default logic difficult to sort
out. Some combinations which seem simple -- like ICU/UTF8 -- need to
handle the case when LC_CTYPE is not compatible with UTF-8, even though
the LC_CTYPE has no effect in that case.
Regards,
Jeff Davis
Jeff Davis wrote:
> > > Extensions often need to be updated for a new major version.
> >
> > I think forcing the C locale is not comparable to API changes,
> > and the consequences are not even necessarily fixable for extensions.
>
> Are we in agreement that it's fine for C extensions?
No, I think we should put the database's lc_ctype
into LC_CTYPE and the database's lc_collate into
LC_COLLATE, independently of anything else,
like it was done until commit 5e6e42e.
I believe that's the purpose of these database
properties, whether the provider is libc or ICU or builtin.
Forcing "C" is a disruptive change, that IMO does
not seem compensated by substantial advantages
that would justify the disruption.
> > CREATE FUNCTION lt_test(text,text) RETURNS boolean as $$
> > use locale; return ($_[0] lt $_[1])?1:0;
> > $$ LANGUAGE plperlu;
> >
> > select lt_test('a', 'B');
>
> Are you aware of PL code that does things like that? If the database
> locale is ICU, that would be at least a little bit confusing.
plperl users writing "use locale" should understand that
it's the libc locale, like when this code is run outside Postgres.
Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
On Mon, 2025-11-03 at 20:14 +0100, Daniel Verite wrote: > No, I think we should put the database's lc_ctype > into LC_CTYPE and the database's lc_collate into > LC_COLLATE, independently of anything else, > like it was done until commit 5e6e42e. > I believe that's the purpose of these database > properties, whether the provider is libc or ICU or builtin. Is there a clean way to document this behavior? I have tried to improve the documentation in this area before, but it's not easy because the behavior is so nuanced. The CREATE DATABASE command needs LOCALE (or LC_CTYPE/LC_COLLATE). But it's easy to get LOCALE and ICU_LOCALE or BUILTIN_LOCALE confused. And similarly for initdb. We could clean those up a lot if LC_CTYPE/LC_COLLATE didn't even need to be set for non-libc providers. Users can end up in a situation where they need to define LC_CTYPE/LC_COLLATE, even though it has almost (but not entirely) no effect: https://www.postgresql.org/message-id/f794e177b0b1ed8917e75258726ae315cf8fbbef.camel%40j-davis.com Reverting commit 5e6e42e may be the right thing, but I'd like to hear what others have to say on this point first. In particualr, I'd like to know whether such a revert is based on principle, a practical problem, or just an abundance of caution. Another option we have here is LC_CTYPE=LC_COLLATE=C for non-libc providers, but leave it as in v17 for libc providers. If someone explicitly wants libc behavior (by specifying something like "use locale" in plperl), they probably want to be using the libc provider for the database. Regards, Jeff Davis
On Mon, 2025-11-03 at 11:59 -0800, Jeff Davis wrote:
> Reverting commit 5e6e42e may be the right thing, but I'd like to hear
> what others have to say on this point first. In particualr, I'd like
> to
> know whether such a revert is based on principle, a practical
> problem,
> or just an abundance of caution.
>
> Another option we have here is LC_CTYPE=LC_COLLATE=C for non-libc
> providers, but leave it as in v17 for libc providers. If someone
> explicitly wants libc behavior (by specifying something like "use
> locale" in plperl), they probably want to be using the libc provider
> for the database.
Actually, there's yet another option: lc_ctype and lc_collate could be
GUCs again. If they don't affect any backend behavior, then why not?
The user would have complete control.
(Probably PGC_POSTMASTER, because one of the goals is to not rely on
setlocale() in the backends.)
Of course, we need to be sure that they are compatible with the
database encoding. We have code to do that, but not sure how reliable
it is across platforms.
Regards,
Jeff Davis