Обсуждение: Remaining dependency on setlocale()

Поиск

Список

Период

Сортировка

Remaining dependency on setlocale()

От

Jeff Davis

Дата:

06 августа 2024 г., 21:59:55

After some previous work here:

https://postgr.es/m/89475ee5487d795124f4e25118ea8f1853edb8cb.camel@j-davis.com

we are less dependent on setlocale(), but it's still not completely
gone.

setlocale() counts as thread-unsafe, so it would be nice to eliminate
it completely.

The obvious answer is uselocale(), which sets the locale only for the
calling thread, and takes precedence over whatever is set with
setlocale().

But there are a couple problems:

1. I don't think it's supported on Windows.

2. I don't see a good way to canonicalize a locale name, like in
check_locale(), which uses the result of setlocale().

Thoughts?

Regards,
    Jeff Davis

Re: Remaining dependency on setlocale()

От

Tom Lane

Дата:

06 августа 2024 г., 22:23:09

Jeff Davis <pgsql@j-davis.com> writes:
> But there are a couple problems:

> 1. I don't think it's supported on Windows.

Can't help with that, but surely Windows has some thread-safe way.

> 2. I don't see a good way to canonicalize a locale name, like in
> check_locale(), which uses the result of setlocale().

What I can tell you about that is that check_locale's expectation
that setlocale does any useful canonicalization is mostly wishful
thinking [1].  On a lot of platforms you just get the input string
back again.  If that's the only thing keeping us on setlocale,
I think we could drop it.  (Perhaps we should do some canonicalization
of our own instead?)

            regards, tom lane

[1] https://www.postgresql.org/message-id/14856.1348497531@sss.pgh.pa.us

Re: Remaining dependency on setlocale()

От

Thomas Munro

Дата:

07 августа 2024 г., 07:07:40

On Wed, Aug 7, 2024 at 10:23 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > But there are a couple problems:
>
> > 1. I don't think it's supported on Windows.
>
> Can't help with that, but surely Windows has some thread-safe way.

It does.  It's not exactly the same, instead there is a thing you can
call that puts setlocale() itself into a thread-local mode, but last
time I checked that mode was missing on MinGW so that's a bit of an
obstacle.

How far can we get by using more _l() functions?  For example, [1]
shows a use of strftime() that I think can be converted to
strftime_l() so that it doesn't depend on setlocale().  Since POSIX
doesn't specify every obvious _l function, we might need to provide
any missing wrappers that save/restore thread-locally with
uselocale().  Windows doesn't have uselocale(), but it generally
doesn't need such wrappers because it does have most of the obvious
_l() functions.

> > 2. I don't see a good way to canonicalize a locale name, like in
> > check_locale(), which uses the result of setlocale().
>
> What I can tell you about that is that check_locale's expectation
> that setlocale does any useful canonicalization is mostly wishful
> thinking [1].  On a lot of platforms you just get the input string
> back again.  If that's the only thing keeping us on setlocale,
> I think we could drop it.  (Perhaps we should do some canonicalization
> of our own instead?)

+1

I know it does something on Windows (we know the EDB installer gives
it strings like "Language,Country" and it converts them to
"Language_Country.Encoding", see various threads about it all going
wrong), but I'm not sure it does anything we actually want to
encourage.  I'm hoping we can gradually screw it down so that we only
have sane BCP 47 in the system on that OS, and I don't see why we
wouldn't just use them verbatim.

[1]
https://www.postgresql.org/message-id/CA%2BhUKGJ%3Dca39Cg%3Dy%3DS89EaCYvvCF8NrZRO%3Duog-cnz0VzC6Kfg%40mail.gmail.com

Re: Remaining dependency on setlocale()

От

Joe Conway

Дата:

07 августа 2024 г., 13:42:32

On 8/7/24 03:07, Thomas Munro wrote:
> How far can we get by using more _l() functions?  For example, [1]
> shows a use of strftime() that I think can be converted to
> strftime_l() so that it doesn't depend on setlocale().  Since POSIX
> doesn't specify every obvious _l function, we might need to provide
> any missing wrappers that save/restore thread-locally with
> uselocale().  Windows doesn't have uselocale(), but it generally
> doesn't need such wrappers because it does have most of the obvious
> _l() functions.

Most of the strtoX functions have an _l variant, but one to watch is 
atoi, which is defined with a hardcoded call to strtol, at least with glibc:

8<----------
/* Convert a string to an int.  */
int
atoi (const char *nptr)
{
   return (int) strtol (nptr, (char **) NULL, 10);
}
8<----------

I guess in many/most places we use atoi we don't care, but maybe it 
matters for some?

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Remaining dependency on setlocale()

От

Robert Haas

Дата:

07 августа 2024 г., 15:28:06

On Wed, Aug 7, 2024 at 9:42 AM Joe Conway <mail@joeconway.com> wrote:
> I guess in many/most places we use atoi we don't care, but maybe it
> matters for some?

I think we should move in the direction of replacing atoi() calls with
strtol() and actually checking for errors. In many places where use
atoi(), it's unlikely that the string would be anything but an
integer, so error checks are arguably unnecessary. A backup label file
isn't likely to say "START TIMELINE: potaytoes". On the other hand, if
it did say that, I'd prefer to get an error about potaytoes than have
it be treated as if it said "START TIMELINE: 0". And I've definitely
found missing error-checks over the years. For example, on pg14,
"pg_basebackup -Ft -Zmaximum -Dx" works as if you specified "-Z0"
because atoi("maximum") == 0. If we make a practice of checking
integer conversions for errors everywhere, we might avoid some such
silliness.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: Remaining dependency on setlocale()

От

Jeff Davis

Дата:

07 августа 2024 г., 17:16:09

On Wed, 2024-08-07 at 19:07 +1200, Thomas Munro wrote:
> How far can we get by using more _l() functions?

There are a ton of calls to, for example, isspace(), used mostly for
parsing.

I wouldn't expect a lot of differences in behavior from locale to
locale, like might be the case with iswspace(), but behavior can be
different at least in theory.

So I guess we're stuck with setlocale()/uselocale() for a while, unless
we're able to move most of those call sites over to an ascii-only
variant.

Regards,
    Jeff Davis

Re: Remaining dependency on setlocale()

От

Joe Conway

Дата:

07 августа 2024 г., 17:28:45

On 8/7/24 13:16, Jeff Davis wrote:
> On Wed, 2024-08-07 at 19:07 +1200, Thomas Munro wrote:
>> How far can we get by using more _l() functions?
> 
> There are a ton of calls to, for example, isspace(), used mostly for
> parsing.
> 
> I wouldn't expect a lot of differences in behavior from locale to
> locale, like might be the case with iswspace(), but behavior can be
> different at least in theory.
> 
> So I guess we're stuck with setlocale()/uselocale() for a while, unless
> we're able to move most of those call sites over to an ascii-only
> variant.

FWIW I see all of these in glibc:

isalnum_l, isalpha_l, isascii_l, isblank_l, iscntrl_l, isdigit_l, 
isgraph_l,  islower_l, isprint_l, ispunct_l, isspace_l, isupper_l, 
isxdigit_l


-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Remaining dependency on setlocale()

От

Robert Haas

Дата:

07 августа 2024 г., 18:18:00

On Wed, Aug 7, 2024 at 1:29 PM Joe Conway <mail@joeconway.com> wrote:
> FWIW I see all of these in glibc:
>
> isalnum_l, isalpha_l, isascii_l, isblank_l, iscntrl_l, isdigit_l,
> isgraph_l,  islower_l, isprint_l, ispunct_l, isspace_l, isupper_l,
> isxdigit_l

On my MacBook (Ventura, 13.6.7), I see all of these except for isascii_l.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: Remaining dependency on setlocale()

От

Thomas Munro

Дата:

07 августа 2024 г., 20:52:41

On Thu, Aug 8, 2024 at 5:16 AM Jeff Davis <pgsql@j-davis.com> wrote:
> There are a ton of calls to, for example, isspace(), used mostly for
> parsing.
>
> I wouldn't expect a lot of differences in behavior from locale to
> locale, like might be the case with iswspace(), but behavior can be
> different at least in theory.
>
> So I guess we're stuck with setlocale()/uselocale() for a while, unless
> we're able to move most of those call sites over to an ascii-only
> variant.

We do know of a few isspace() calls that are already questionable[1]
(should be scanner_isspace(), or something like that).  It's not only
weird that SELECT ROW('libertà!') is displayed with or without double
quote depending (in theory) on your locale, it's also undefined
behaviour because we feed individual bytes of a multi-byte sequence to
isspace(), so OSes disagree, and in practice we know that macOS and
Windows think that the byte 0xa inside 'à' is a space while glibc and
FreeBSD don't.  Looking at the languages with many sequences
containing 0xa0, I guess you'd probably need to be processing CJK text
and cross-platform for the difference to become obvious (that was the
case for the problem report I analysed):

for i in range(1, 0xffff):
  if (i < 0xd800 or i > 0xdfff) and 0xa0 in chr(i).encode('UTF-8'):
    print("%04x: %s" % (i, chr(i)))

[1] https://www.postgresql.org/message-id/flat/CA%2BHWA9awUW0%2BRV_gO9r1ABZwGoZxPztcJxPy8vMFSTbTfi4jig%40mail.gmail.com

Re: Remaining dependency on setlocale()

От

Thomas Munro

Дата:

07 августа 2024 г., 21:40:56

On Thu, Aug 8, 2024 at 6:18 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Aug 7, 2024 at 1:29 PM Joe Conway <mail@joeconway.com> wrote:
> > FWIW I see all of these in glibc:
> >
> > isalnum_l, isalpha_l, isascii_l, isblank_l, iscntrl_l, isdigit_l,
> > isgraph_l,  islower_l, isprint_l, ispunct_l, isspace_l, isupper_l,
> > isxdigit_l
>
> On my MacBook (Ventura, 13.6.7), I see all of these except for isascii_l.

Those (except isascii_l) are from POSIX 2008[1].  They were absorbed
from "Extended API Set Part 4"[2], along with locale_t (that's why
there is a header <xlocale.h> on a couple of systems even though after
absorption they are supposed to be in <locale.h>).  We already
decided that all computers have that stuff (commit 8d9a9f03), but the
reality is a little messier than that... NetBSD hasn't implemented
uselocale() yet[3], though it has a good set of _l functions.  As
discussed in [3], ECPG code is therefore currently broken in
multithreaded clients because it's falling back to a setlocale() path,
and I think Windows+MinGW must be too (it lacks
HAVE__CONFIGTHREADLOCALE), but those both have a good set of _l
functions.  In that thread I tried to figure out how to use _l
functions to fix that problem, but ...

The issue there is that we have our own snprintf.c, that implicitly
requires LC_NUMERIC to be "C" (it is documented as always printing
floats a certain way ignoring locale and that's what the callers there
want in frontend and backend code, but in reality it punts to system
snprintf for floats, assuming that LC_NUMERIC is "C", which we
configure early in backend startup, but frontend code has to do it for
itself!).  So we could use snprintf_l or strtod_l instead, but POSIX
hasn't got those yet.  Or we could use own own Ryu code (fairly
specific), but integrating Ryu into our snprintf.c (and correctly
implementing all the %... stuff?) sounds like quite a hard,
devil-in-the-details kind of an undertaking to me.  Or maybe it's
easy, I dunno.  As for the _l functions, you could probably get away
with "every computer has either uselocale() or snprintf_() (or
strtod_()?)" and have two code paths in our snprintf.c.  But then we'd
also need a place to track a locale_t for a long-lived newlocale("C"),
which was too messy in my latest attempt...

[1] https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/functions/isspace.html
[2] https://pubs.opengroup.org/onlinepubs/9699939499/toc.pdf
[3] https://www.postgresql.org/message-id/flat/CWZBBRR6YA8D.8EHMDRGLCKCD%40neon.tech

Re: Remaining dependency on setlocale()

От

Jeff Davis

Дата:

07 августа 2024 г., 22:45:19

On Wed, 2024-08-07 at 13:28 -0400, Joe Conway wrote:
> FWIW I see all of these in glibc:
>
> isalnum_l, isalpha_l, isascii_l, isblank_l, iscntrl_l, isdigit_l,
> isgraph_l,  islower_l, isprint_l, ispunct_l, isspace_l, isupper_l,
> isxdigit_l

My point was just that there are a lot of those call sites (especially
for isspace()) in various parsers. It feels like a lot of code churn to
change all of them, when a lot of them seem to be intended for ascii
anyway.

And where do we get the locale_t structure from? We can create our own
at database connection time, and supply it to each of those call sites,
but I'm not sure that's a huge advantage over just using uselocale().

Regards,
    Jeff Davis

Re: Remaining dependency on setlocale()

От

Andreas Karlsson

Дата:

09 августа 2024 г., 11:41:26

On 8/8/24 12:45 AM, Jeff Davis wrote:
> My point was just that there are a lot of those call sites (especially
> for isspace()) in various parsers. It feels like a lot of code churn to
> change all of them, when a lot of them seem to be intended for ascii
> anyway.
> 
> And where do we get the locale_t structure from? We can create our own
> at database connection time, and supply it to each of those call sites,
> but I'm not sure that's a huge advantage over just using uselocale().

I am leaning towards that we should write our own pure ascii functions 
for this. Since we do not support any non-ascii compatible encodings 
anyway I do not see the point in having locale support in most of these 
call-sites.

Andewas

Re: Remaining dependency on setlocale()

От

Jeff Davis

Дата:

09 августа 2024 г., 18:24:03

On Fri, 2024-08-09 at 13:41 +0200, Andreas Karlsson wrote:
> I am leaning towards that we should write our own pure ascii
> functions
> for this.

That makes sense for a lot of call sites, but it could cause breakage
if we aren't careful.

>  Since we do not support any non-ascii compatible encodings
> anyway I do not see the point in having locale support in most of
> these
> call-sites.

An ascii-compatible encoding just means that the code points in the
ascii range are represented as ascii. I'm not clear on whether code
points in the ascii range can return different results for things like
isspace(), but it sounds plausible -- toupper() can return different
results for 'i' in tr_TR.

Also, what about the values outside 128-255, which are still valid
input to isspace()?

Regards,
    Jeff Davis

Re: Remaining dependency on setlocale()

От

"Tristan Partin"

Дата:

09 августа 2024 г., 20:16:58

On Tue Aug 6, 2024 at 5:00 PM CDT, Jeff Davis wrote:
> After some previous work here:
>
> https://postgr.es/m/89475ee5487d795124f4e25118ea8f1853edb8cb.camel@j-davis.com
>
> we are less dependent on setlocale(), but it's still not completely
> gone.
>
> setlocale() counts as thread-unsafe, so it would be nice to eliminate
> it completely.
>
> The obvious answer is uselocale(), which sets the locale only for the
> calling thread, and takes precedence over whatever is set with
> setlocale().
>
> But there are a couple problems:
>
> 1. I don't think it's supported on Windows.
>
> 2. I don't see a good way to canonicalize a locale name, like in
> check_locale(), which uses the result of setlocale().
>
> Thoughts?

Hey Jeff,

See this thread[0] for some work that I had previously done. Feel free
to take it over, or we could collaborate.

[0]: https://www.postgresql.org/message-id/CWMW5OZBWJ10.1YFLQWSUE5RE9@neon.tech

--
Tristan Partin
Neon (https://neon.tech)

Re: Remaining dependency on setlocale()

От

Jeff Davis

Дата:

09 августа 2024 г., 20:48:20

Hi,

On Fri, 2024-08-09 at 15:16 -0500, Tristan Partin wrote:
> Hey Jeff,
>
> See this thread[0] for some work that I had previously done. Feel
> free
> to take it over, or we could collaborate.
>
> [0]:
> https://www.postgresql.org/message-id/CWMW5OZBWJ10.1YFLQWSUE5RE9@neon.tech

Sounds good, sorry I missed that.

Can you please rebase and we can discuss in that thread?

Regards,
    Jeff Davis

Re: Remaining dependency on setlocale()

От

Thomas Munro

Дата:

09 августа 2024 г., 21:42:29

On Wed, Aug 7, 2024 at 7:07 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Aug 7, 2024 at 10:23 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Jeff Davis <pgsql@j-davis.com> writes:
> > > But there are a couple problems:
> >
> > > 1. I don't think it's supported on Windows.
> >
> > Can't help with that, but surely Windows has some thread-safe way.
>
> It does.  It's not exactly the same, instead there is a thing you can
> call that puts setlocale() itself into a thread-local mode, but last
> time I checked that mode was missing on MinGW so that's a bit of an
> obstacle.

Actually the MinGW situation might be better than that these days.  I
know of three environments where we currently have to keep code
working on MinGW: build farm animal fairywren (msys2 compiler
toochain), CI's optional "Windows - Server 2019, MinGW64 - Meson"
task, and CI's "CompilerWarnings" task, in the "mingw_cross_warning"
step (which actually runs on Linux, and uses configure rather than
meson).  All three environments show that they have
_configthreadlocale.  So could we could simply require it on Windows?
Then it might be possible to write a replacement implementation of
uselocale() that does a two-step dance with _configthreadlocale() and
setlocale(), restoring both afterwards if they changed.  That's what
ECPG open-codes already.

The NetBSD situation is more vexing.  I was trying to find out if
someone is working on it and unfortunately it looks like there is a
principled stand against adding it:

https://mail-index.netbsd.org/tech-userlevel/2015/12/28/msg009546.html
https://mail-index.netbsd.org/netbsd-users/2017/02/14/msg019352.html

They're right that we really just want to use "C" in some places, and
their LC_C_LOCALE is a very useful system-provided value to be able to
pass into _l functions.  It's a shame it's non-standard, because
without it you have to allocate a locale_t for "C" and keep it
somewhere to feed to _l functions...

Re: Remaining dependency on setlocale()

От

Thomas Munro

Дата:

10 августа 2024 г., 01:36:39

I've posted a new attempt at ripping those ECPG setlocales() out on
the other thread that had the earlier version and discussion:

https://www.postgresql.org/message-id/CA%2BhUKG%2BYv%2Bps%3DnS2T8SS1UDU%3DiySHSr4sGHYiYGkPTpZx6Ooww%40mail.gmail.com

Re: Remaining dependency on setlocale()

От

Thomas Munro

Дата:

12 августа 2024 г., 03:24:01

On Thu, Aug 8, 2024 at 5:16 AM Jeff Davis <pgsql@j-davis.com> wrote:
> On Wed, 2024-08-07 at 19:07 +1200, Thomas Munro wrote:
> > How far can we get by using more _l() functions?
>
> There are a ton of calls to, for example, isspace(), used mostly for
> parsing.
>
> I wouldn't expect a lot of differences in behavior from locale to
> locale, like might be the case with iswspace(), but behavior can be
> different at least in theory.
>
> So I guess we're stuck with setlocale()/uselocale() for a while, unless
> we're able to move most of those call sites over to an ascii-only
> variant.

Here are two more cases that I don't think I've seen discussed.

1.  The nl_langinfo() call in pg_get_encoding_from_locale(), can
probably be changed to nl_langinfo_l() (it is everywhere we currently
care about except Windows, which has a different already-thread-safe
alternative; AIX seems to lack the _l version, but someone writing a
patch to re-add support for that OS could supply the configure goo for
a uselocale() safe/restore implementation).  One problem is that it
has callers that pass it NULL meaning the backend default, but we'd
perhaps use LC_C_GLOBAL for now and have to think about where we get
the database default locale_t in the future.

2.  localeconv() is *doubly* non-thread-safe: it depends on the
current locale, and it also returns an object whose storage might be
clobbered by any other call to localeconv(), setlocale, or even,
according to POSIX, uselocale() (!!!).  I think that effectively
closes off that escape hatch.  On some OSes (macOS, BSDs) you find
localeconv_l() and then I think they give you a more workable
lifetime: as long as the locale_t lives, which makes perfect sense.  I
am surprised that no one has invented localeconv_r() where you supply
the output storage, and you could wrap that in uselocale()
save/restore to deal with the other problem, or localeconv_r_l() or
something.  I can't understand why this is so bad.  The glibc
documentation calls it "a masterpiece of poor design".  Ahh, so it
seems like we need to delete our use of localeconf() completely,
because we should be able to get all the information we need from
nl_langinfo_l() instead:

https://www.gnu.org/software/libc/manual/html_node/Locale-Information.html

Re: Remaining dependency on setlocale()

От

Thomas Munro

Дата:

12 августа 2024 г., 04:53:17

On Mon, Aug 12, 2024 at 3:24 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> 1.  The nl_langinfo() call in pg_get_encoding_from_locale(), can
> probably be changed to nl_langinfo_l() (it is everywhere we currently
> care about except Windows, which has a different already-thread-safe
> alternative ...

... though if we wanted to replace all use of localeconv and struct
lconv with nl_langinfo_l() calls, it's not totally obvious how to do
that on Windows.  Its closest thing is GetLocaleInfoEx(), but that has
complications: it takes wchar_t locale names, which we don't even have
and can't access when we only have a locale_t, and it must look them
up in some data structure every time, and it copies data out to the
caller as wchar_t so now you have two conversion problems and a
storage problem.  If I understand correctly, the whole point of
nl_langinfo_l(item, loc) is that it is supposed to be fast, it's
really just an array lookup, and item is just an index, and the result
is supposed to be stable as long as loc hasn't been freed (and the
thread hasn't exited).  So you can use it without putting your own
caching in front of it.  One idea I came up with which I haven't tried
and it might turn out to be terrible, is that we could change our
definition of locale_t on Windows.  Currently it's a typedef to
Windows' _locale_t, and we use it with a bunch of _XXX functions that
we access by macro to remove the underscore.  Instead, we could make
locale_t a pointer to a struct of our own design in WIN32 builds,
holding the native _locale_t and also an array full of all the values
that nl_langinfo_l() can return.  We'd provide the standard enums,
indexes into that array, in a fake POSIX-oid header <langinfo.h>.
Then nl_langinfo_l(item, loc) could be implemented as
loc->private_langinfo[item], and strcoll_l(.., loc) could be a static
inline function that does _strcol_l(...,
loc->private_windows_locale_t).  These structs would be allocated and
freed with standard-looking newlocale() and freelocale(), so we could
finally stop using #ifdef WIN32-wrapped _create_locale() directly.
Then everything would look more POSIX-y, nl_langinfo_l() could be used
directly wherever we need fast access to that info, and we could, I
think, banish the awkward localeconv, right?  I don't know if this all
makes total sense and haven't tried it, just spitballing here...

Re: Remaining dependency on setlocale()

От

Thomas Munro

Дата:

12 августа 2024 г., 22:43:14

On Mon, Aug 12, 2024 at 4:53 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> ... though if we wanted to replace all use of localeconv and struct
> lconv with nl_langinfo_l() calls,

Gah.  I realised while trying the above that you can't really replace
localeconv() with nl_langinfo_l() as the GNU documentation recommends,
because some of the lconv fields we're using are missing from
langinfo.h in POSIX (only GNU added them all, that was a good idea).
:-(

Next idea:

Windows: its localeconv() returns pointer to thread-local storage,
good, but it still needs setlocale().  So the options are: make our
own lconv-populator function for Windows, using GetLocaleInfoEx(), or
do that _configthreadlocale() dance (possibly excluding some MinGW
configurations from working)
Systems that have localeconv_l(): use that
POSIX: use uselocale() and also put a big global lock around
localeconv() call + accessing result (optionally skipping that on an
OS-by-OS basis after confirming that its implementation doesn't really
need it)

The reason the uselocale() + localeconv() seems to require a Big Lock
(by default at least) is that the uselocale() deals only with the
"current locale" aspect, not the output buffer aspect.  Clearly the
standard allows for it to be thread-local storage (that's why since
2008 it says that after thread-exit you can't access the result, and I
guess that's how it works on real systems (?)), but it also seems to
allow for a single static buffer (that's why it says that it's not
re-entrant, and any call to localeconv() might clobber it).  That
might be OK in practice because we tend to cache that stuff, eg when
assigning GUC lc_monetary (that cache would presumably become
thread-local in the first phase of the multithreading plan), so the
locking shouldn't really hurt.

The reason we'd have to have three ways, and not just two, is again
that NetBSD declined to implement uselocale().

I'll try this in a bit unless someone else has better ideas or plans
for this part... sorry for the drip-feeding.

Re: Remaining dependency on setlocale()

От

Thomas Munro

Дата:

13 августа 2024 г., 05:47:46

On Tue, Aug 13, 2024 at 10:43 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> I'll try this in a bit unless someone else has better ideas or plans
> for this part... sorry for the drip-feeding.

And done, see commitfest entry #5170.

Re: Remaining dependency on setlocale()

От

Jeff Davis

Дата:

14 августа 2024 г., 01:05:23

On Sat, 2024-08-10 at 09:42 +1200, Thomas Munro wrote:
> The NetBSD situation is more vexing.  I was trying to find out if
> someone is working on it and unfortunately it looks like there is a
> principled stand against adding it:
>
> https://mail-index.netbsd.org/tech-userlevel/2015/12/28/msg009546.html
> https://mail-index.netbsd.org/netbsd-users/2017/02/14/msg019352.html

The objection seems to be very general: that uselocale() modifies the
thread state and affects calls a long distance from uselocale(). I
don't disagree with the general sentiment. But in effect, that just
prevents people migrating away from setlocale(), to which the same
argument applies, and is additionally thread-unsafe.

The only alternative is to essentially ban the use of non-_l variants,
which is fine I suppose, but causes a fair amount of code churn.

> They're right that we really just want to use "C" in some places, and
> their LC_C_LOCALE is a very useful system-provided value to be able
> to
> pass into _l functions.  It's a shame it's non-standard, because
> without it you have to allocate a locale_t for "C" and keep it
> somewhere to feed to _l functions...

If we're going to do that, why not just have ascii-only variants of our
own? pg_ascii_isspace(...) is at least as readable as isspace_l(...,
LC_C_LOCALE).

Regards,
    Jeff Davis

Re: Remaining dependency on setlocale()

От

Thomas Munro

Дата:

14 августа 2024 г., 02:31:20

On Wed, Aug 14, 2024 at 1:05 PM Jeff Davis <pgsql@j-davis.com> wrote:
> The only alternative is to essentially ban the use of non-_l variants,
> which is fine I suppose, but causes a fair amount of code churn.

Let's zoom out a bit and consider some ways we could set up the
process, threads and individual calls:

1.  The process global locale is always "C".  If you ever call
uselocale(), it can only be for short stretches, and you have to
restore it straight after; perhaps it is only ever used in replacement
_l() functions for systems that lack them.  You need to use _l()
functions for all non-"C" locales.  The current database default needs
to be available as a variable (in future: thread-local variable, or
reachable from one), so you can use it in _l() functions.  The "C"
locale can be accessed implicitly with non-l() functions, or you could
ban those to reduce confusion and use foo_l(..., LC_GLOBAL_LOCALE) for
"C".  Or a name like PG_C_LOCALE, which, in backend code could be just
LC_GLOBAL_LOCALE, while in frontend/library code it could be the
singleton mechanism I showed in CF#5166.

XXX Note that nailing LC_ALL to "C" in the backend would extend our
existing policy for LC_NUMERIC to all categories.  That's why we can
use strtod() in the backend and expect the radix character to be '.'.
It's interesting to contemplate the strtod() calls in CF#5166: they
are code copied-and-pasted from backend and frontend; in the backend
we can use strtod() currently but in the frontend code I showed a
change to strtod_l(..., PG_C_LOCALE), in order to be able to delete
some ugly deeply nested uselocale()/setlocale() stuff of the exact
sort that NetBSD hackers (and I) hate.  It's obviously a bit of a code
smell that it's copied-and-pasted in the first place, and should
really share code.  Supposing some of that stuff finishes up in
src/common, then I think you'd want a strtod_l(..., PG_C_LOCALE) that
could be allowed to take advantage of the knowledge that the global
locale is "C" in the backend.  Just thoughts...

2.  The process global locale is always "C".  Each backend process (in
future: thread) calls uselocale() to set the thread-local locale to
the database default, so it can keep using the non-_l() functions as a
way to access the database default, and otherwise uses _l() functions
if it wants something else (as we do already).  The "C" locale can be
accessed with foo_l(..., LC_GLOBAL_LOCALE) or PG_C_LOCALE etc.

XXX This option is blocked by NetBSD's rejection of uselocale().  I
guess if you really wanted to work around NetBSD's policy you could
make your own wrapper for all affected functions, translating foo() to
foo_l(..., pg_thread_current_locale), so you could write uselocale(),
which is pretty much what every other libc does...  But eughhh

3.  The process global locale is inherited from the system or can be
set by the user however they want for the benefit of extensions, but
we never change it after startup or refer to it.  Then we do the same
as 1 or 2, except if we ever want "C" we'll need a locale_t for that,
again perhaps using the PC_C_LOCALE mechanism.  Non-_l() functions are
effectively useless except in cases where you really want to use the
system's settings inherited from startup, eg for messages, so they'd
mostly be banned.

What else?

> > They're right that we really just want to use "C" in some places, and
> > their LC_C_LOCALE is a very useful system-provided value to be able
> > to
> > pass into _l functions.  It's a shame it's non-standard, because
> > without it you have to allocate a locale_t for "C" and keep it
> > somewhere to feed to _l functions...
>
> If we're going to do that, why not just have ascii-only variants of our
> own? pg_ascii_isspace(...) is at least as readable as isspace_l(...,
> LC_C_LOCALE).

Yeah, I agree there are some easy things we should do that way.  In
fact we've already established that scanner_isspace() needs to be used
in lots more places for that, even aside from thread-safety.

Re: Remaining dependency on setlocale()

От

Jeff Davis

Дата:

14 августа 2024 г., 22:00:52

On Wed, 2024-08-14 at 14:31 +1200, Thomas Munro wrote:
> 1.  The process global locale is always "C".  If you ever call
> uselocale(), it can only be for short stretches, and you have to
> restore it straight after; perhaps it is only ever used in
> replacement
> _l() functions for systems that lack them.  You need to use _l()
> functions for all non-"C" locales.  The current database default
> needs
> to be available as a variable (in future: thread-local variable, or
> reachable from one), so you can use it in _l() functions.  The "C"
> locale can be accessed implicitly with non-l() functions, or you
> could
> ban those to reduce confusion and use foo_l(..., LC_GLOBAL_LOCALE)
> for
> "C".  Or a name like PG_C_LOCALE, which, in backend code could be
> just
> LC_GLOBAL_LOCALE, while in frontend/library code it could be the
> singleton mechanism I showed in CF#5166.

+1 to this approach. It makes things more consistent across platforms
and avoids surprising dependencies on the global setting.

We'll have to be careful that each call site is either OK with C, or
that it gets changed to an _l() variant. We also have to be careful
about extensions.

Regards,
    Jeff Davis

Re: Remaining dependency on setlocale()

От

Thomas Munro

Дата:

15 августа 2024 г., 01:43:50

On Wed, Aug 7, 2024 at 7:07 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Aug 7, 2024 at 10:23 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Jeff Davis <pgsql@j-davis.com> writes:
> > > 2. I don't see a good way to canonicalize a locale name, like in
> > > check_locale(), which uses the result of setlocale().
> >
> > What I can tell you about that is that check_locale's expectation
> > that setlocale does any useful canonicalization is mostly wishful
> > thinking [1].  On a lot of platforms you just get the input string
> > back again.  If that's the only thing keeping us on setlocale,
> > I think we could drop it.  (Perhaps we should do some canonicalization
> > of our own instead?)
>
> +1
>
> I know it does something on Windows (we know the EDB installer gives
> it strings like "Language,Country" and it converts them to
> "Language_Country.Encoding", see various threads about it all going
> wrong), but I'm not sure it does anything we actually want to
> encourage.  I'm hoping we can gradually screw it down so that we only
> have sane BCP 47 in the system on that OS, and I don't see why we
> wouldn't just use them verbatim.

Some more thoughts on check_locale() and canonicalisation:

I doubt the canonicalisation does anything useful on any Unix system,
as they're basically just file names.  In the case of glibc, the
encoding part is munged before opening the file so it tolerates .utf8
or .UTF-8 or .u---T----f------8 on input, but it still returns
whatever you gave it so the return value isn't cleaning the input or
anything.

"" is a problem however... the special value for "native environment"
is returned as a real locale name, which we probably still need in
places.  We could change that to newlocale("") + query instead, but
there is a portability pipeline problem getting the name out of it:

1. POSIX only just added getlocalename_l() in 2024[1][2].
2. Glibc has non-standard nl_langinfo_l(NL_LOCALE_NAME(category), loc).
3. The <xlocale.h> systems (macOS/*BSD) have non-standard
querylocale(mask, loc).
4. AFAIK there is no way to do it on pure POSIX 2008 systems.
5. For Windows, there is a completely different thing to get the
user's default locale, see CF#3772.

The systems in category 4 would in practice be Solaris and (if it
comes back) AIX.  Given that, we probably just can't go that way soon.

So I think the solution could perhaps be something like: in some early
startup phase before there are any threads, we nail down all the
locale categories to "C" (or whatever we decide on for the permanent
global locale), and also query the "" categories and make a copy of
them in case anyone wants them later, and then never call setlocale()
again.

[1] https://pubs.opengroup.org/onlinepubs/9799919799/functions/getlocalename_l.html
[2] https://www.austingroupbugs.net/view.php?id=1220

Re: Remaining dependency on setlocale()

От

Jeff Davis

Дата:

15 августа 2024 г., 02:00:39

On Thu, 2024-08-15 at 10:43 +1200, Thomas Munro wrote:
> So I think the solution could perhaps be something like: in some
> early
> startup phase before there are any threads, we nail down all the
> locale categories to "C" (or whatever we decide on for the permanent
> global locale), and also query the "" categories and make a copy of
> them in case anyone wants them later, and then never call setlocale()
> again.

+1.

Regards,
    Jeff Davis

Re: Remaining dependency on setlocale()

От

Thomas Munro

Дата:

15 августа 2024 г., 11:46:15

On Thu, Aug 15, 2024 at 11:00 AM Jeff Davis <pgsql@j-davis.com> wrote:
> On Thu, 2024-08-15 at 10:43 +1200, Thomas Munro wrote:
> > So I think the solution could perhaps be something like: in some
> > early
> > startup phase before there are any threads, we nail down all the
> > locale categories to "C" (or whatever we decide on for the permanent
> > global locale), and also query the "" categories and make a copy of
> > them in case anyone wants them later, and then never call setlocale()
> > again.
>
> +1.

We currently nail down these categories:

    /* We keep these set to "C" always.  See pg_locale.c for explanation. */
    init_locale("LC_MONETARY", LC_MONETARY, "C");
    init_locale("LC_NUMERIC", LC_NUMERIC, "C");
    init_locale("LC_TIME", LC_TIME, "C");

CF #5170 has patches to make it so that we stop changing them even
transiently, using locale_t interfaces to feed our caches of stuff
needed to work with those categories, so they really stay truly nailed
down.

It sounds like someone needs to investigate doing the same thing for
these two, from CheckMyDatabase():

    if (pg_perm_setlocale(LC_COLLATE, collate) == NULL)
        ereport(FATAL,
                (errmsg("database locale is incompatible with
operating system"),
                 errdetail("The database was initialized with
LC_COLLATE \"%s\", "
                           " which is not recognized by setlocale().", collate),
                 errhint("Recreate the database with another locale or
install the missing locale.")));

    if (pg_perm_setlocale(LC_CTYPE, ctype) == NULL)
        ereport(FATAL,
                (errmsg("database locale is incompatible with
operating system"),
                 errdetail("The database was initialized with LC_CTYPE \"%s\", "
                           " which is not recognized by setlocale().", ctype),
                 errhint("Recreate the database with another locale or
install the missing locale.")));

How should that work?  Maybe we could imagine something like
MyDatabaseLocale, a locale_t with LC_COLLATE and LC_CTYPE categories
set appropriately.  Or should it be a pg_locale_t instead (if your
database default provider is ICU, then you don't even need a locale_t,
right?).

Then I think there is one quite gnarly category, from
assign_locale_messages() (a GUC assignment function):

    (void) pg_perm_setlocale(LC_MESSAGES, newval);

I have never really studied gettext(), but I know it was just
standardised in POSIX 2024, and the standardised interface has _l()
variants of all functions.  Current implementations don't have them
yet.  Clearly we absolutely couldn't call pg_perm_setlocale() after
early startup --  but if gettext() is relying on the current locale to
affect far away code, then maybe this is one place where we'd just
have to use uselocale().  Perhaps we could plan some transitional
strategy where NetBSD users lose the ability to change the GUC without
restarting the server and it has to be the same for all sessions, or
something like that, until they produce either gettext_l() or
uselocale(), but I haven't thought hard about this part at all yet...

Re: Remaining dependency on setlocale()

От

Peter Eisentraut

Дата:

15 августа 2024 г., 16:25:35

On 15.08.24 00:43, Thomas Munro wrote:
> "" is a problem however... the special value for "native environment"
> is returned as a real locale name, which we probably still need in
> places.  We could change that to newlocale("") + query instead, but

Where do we need that in the server?

It should just be initdb doing that and then initializing the server 
with concrete values based on that.

I guess technically some of these GUC settings default to the 
environment?  But I think we could consider getting rid of that.

Re: Remaining dependency on setlocale()

От

Thomas Munro

Дата:

16 августа 2024 г., 00:09:40

On Fri, Aug 16, 2024 at 1:25 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> On 15.08.24 00:43, Thomas Munro wrote:
> > "" is a problem however... the special value for "native environment"
> > is returned as a real locale name, which we probably still need in
> > places.  We could change that to newlocale("") + query instead, but
>
> Where do we need that in the server?

Hmm.  Yeah, right, the only way I've found so far to even reach that
code and that captures that result is:

create database db2 locale = '';

Thats puts 'en_NZ.UTF-8' or whatever in pg_database.  In contrast,
create collation will accept '' but just store it verbatim, and the
GUCs for changing time, monetary, numeric accept it too and keep it
verbatim.  We could simply ban '' in all user commands.  I doubt
they're documented as acceptable values, once you get past initdb and
have a running system.  Looking into that...

> It should just be initdb doing that and then initializing the server
> with concrete values based on that.

Right.

> I guess technically some of these GUC settings default to the
> environment?  But I think we could consider getting rid of that.

Yeah.

Re: Remaining dependency on setlocale()

От

Thomas Munro

Дата:

16 августа 2024 г., 04:12:33

On Fri, Aug 16, 2024 at 9:09 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Fri, Aug 16, 2024 at 1:25 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> > It should just be initdb doing that and then initializing the server
> > with concrete values based on that.
>
> Right.
>
> > I guess technically some of these GUC settings default to the
> > environment?  But I think we could consider getting rid of that.
>
> Yeah.

Seems to make a lot of sense.  I tried that out over in CF #5170.

(In case it's not clear why I'm splitting discussion between threads:
I was thinking of this thread as the one where we're discussing what
needs to be done, with other threads being spun off to become CF entry
with concrete patches.  I realised re-reading some discussion that
that might not be obvious...)

Re: Remaining dependency on setlocale()

От

Andreas Karlsson

Дата:

28 августа 2024 г., 19:26:04

On 8/9/24 8:24 PM, Jeff Davis wrote:
> On Fri, 2024-08-09 at 13:41 +0200, Andreas Karlsson wrote:
>> I am leaning towards that we should write our own pure ascii
>> functions
>> for this.
> 
> That makes sense for a lot of call sites, but it could cause breakage
> if we aren't careful.
> 
>>   Since we do not support any non-ascii compatible encodings
>> anyway I do not see the point in having locale support in most of
>> these
>> call-sites.
> 
> An ascii-compatible encoding just means that the code points in the
> ascii range are represented as ascii. I'm not clear on whether code
> points in the ascii range can return different results for things like
> isspace(), but it sounds plausible -- toupper() can return different
> results for 'i' in tr_TR.
> 
> Also, what about the values outside 128-255, which are still valid
> input to isspace()?

My idea was that in a lot of those cases we only try to parse e.g. 0-9 
as digits and always only . as the decimal separator so we should make 
just make that obvious by either using locale C or writing our own ascii 
only functions. These strings are meant to be read by machines, not 
humans, primarily.

Andreas

Re: Remaining dependency on setlocale()

От

Jeff Davis

Дата:

12 декабря 2024 г., 22:22:09

On Wed, 2024-08-14 at 12:00 -0700, Jeff Davis wrote:
> On Wed, 2024-08-14 at 14:31 +1200, Thomas Munro wrote:
> > 1.  The process global locale is always "C".  If you ever call
> > uselocale(), it can only be for short stretches, and you have to
> > restore it straight after; perhaps it is only ever used in
> > replacement
> > _l() functions for systems that lack them.  You need to use _l()
> > functions for all non-"C" locales.  The current database default
> > needs
> > to be available as a variable (in future: thread-local variable, or
> > reachable from one), so you can use it in _l() functions.  The "C"
> > locale can be accessed implicitly with non-l() functions, or you
> > could
> > ban those to reduce confusion and use foo_l(..., LC_GLOBAL_LOCALE)
> > for
> > "C".  Or a name like PG_C_LOCALE, which, in backend code could be
> > just
> > LC_GLOBAL_LOCALE, while in frontend/library code it could be the
> > singleton mechanism I showed in CF#5166.
>
> +1 to this approach. It makes things more consistent across platforms
> and avoids surprising dependencies on the global setting.
>
> We'll have to be careful that each call site is either OK with C, or
> that it gets changed to an _l() variant. We also have to be careful
> about extensions.

Did we reach a conclusion here? Any thoughts on moving in this
direction, and whether 18 is the right time to do it?

Regards,
    Jeff Davis

Re: Remaining dependency on setlocale()

От

Thomas Munro

Дата:

13 декабря 2024 г., 12:44:06

On Fri, Dec 13, 2024 at 8:22 AM Jeff Davis <pgsql@j-davis.com> wrote:
> On Wed, 2024-08-14 at 12:00 -0700, Jeff Davis wrote:
> > On Wed, 2024-08-14 at 14:31 +1200, Thomas Munro wrote:
> > > 1.  The process global locale is always "C".  If you ever call
> > > uselocale(), it can only be for short stretches, and you have to
> > > restore it straight after; perhaps it is only ever used in
> > > replacement
> > > _l() functions for systems that lack them.  You need to use _l()
> > > functions for all non-"C" locales.  The current database default
> > > needs
> > > to be available as a variable (in future: thread-local variable, or
> > > reachable from one), so you can use it in _l() functions.  The "C"
> > > locale can be accessed implicitly with non-l() functions, or you
> > > could
> > > ban those to reduce confusion and use foo_l(..., LC_GLOBAL_LOCALE)
> > > for
> > > "C".  Or a name like PG_C_LOCALE, which, in backend code could be
> > > just
> > > LC_GLOBAL_LOCALE, while in frontend/library code it could be the
> > > singleton mechanism I showed in CF#5166.
> >
> > +1 to this approach. It makes things more consistent across platforms
> > and avoids surprising dependencies on the global setting.
> >
> > We'll have to be careful that each call site is either OK with C, or
> > that it gets changed to an _l() variant. We also have to be careful
> > about extensions.
>
> Did we reach a conclusion here? Any thoughts on moving in this
> direction, and whether 18 is the right time to do it?

I think this is the best way, and I haven't seen anyone supporting any
other idea.  (I'm working on those setlocale()-removal patches I
mentioned, more very soon...)

Re: Remaining dependency on setlocale()

От

Peter Eisentraut

Дата:

17 декабря 2024 г., 15:14:21

On 13.12.24 10:44, Thomas Munro wrote:
> On Fri, Dec 13, 2024 at 8:22 AM Jeff Davis <pgsql@j-davis.com> wrote:
>> On Wed, 2024-08-14 at 12:00 -0700, Jeff Davis wrote:
>>> On Wed, 2024-08-14 at 14:31 +1200, Thomas Munro wrote:
>>>> 1.  The process global locale is always "C".  If you ever call
>>>> uselocale(), it can only be for short stretches, and you have to
>>>> restore it straight after; perhaps it is only ever used in
>>>> replacement
>>>> _l() functions for systems that lack them.  You need to use _l()
>>>> functions for all non-"C" locales.  The current database default
>>>> needs
>>>> to be available as a variable (in future: thread-local variable, or
>>>> reachable from one), so you can use it in _l() functions.  The "C"
>>>> locale can be accessed implicitly with non-l() functions, or you
>>>> could
>>>> ban those to reduce confusion and use foo_l(..., LC_GLOBAL_LOCALE)
>>>> for
>>>> "C".  Or a name like PG_C_LOCALE, which, in backend code could be
>>>> just
>>>> LC_GLOBAL_LOCALE, while in frontend/library code it could be the
>>>> singleton mechanism I showed in CF#5166.
>>>
>>> +1 to this approach. It makes things more consistent across platforms
>>> and avoids surprising dependencies on the global setting.
>>>
>>> We'll have to be careful that each call site is either OK with C, or
>>> that it gets changed to an _l() variant. We also have to be careful
>>> about extensions.
>>
>> Did we reach a conclusion here? Any thoughts on moving in this
>> direction, and whether 18 is the right time to do it?
> 
> I think this is the best way, and I haven't seen anyone supporting any
> other idea.  (I'm working on those setlocale()-removal patches I
> mentioned, more very soon...)

I also think this is the right direction, and we'll get closer with the 
remaining patches that Thomas has lined up.

I think at this point, we could already remove all locale settings 
related to LC_COLLATE.  Nothing uses that anymore.

I think we will need to keep the global LC_CTYPE setting set to 
something useful, for example so that system error messages come out in 
the right encoding.

But I'm concerned about the the Perl_setlocale() dance in plperl.c. 
Perl apparently does a setlocale(LC_ALL, "") during startup, and that 
code is a workaround to reset everything back afterwards.  We need to be 
careful not to break that.

(Perl has fixed that in 5.19, but the fix requires that you set another 
environment variable before launching Perl, which you can't do in a 
threaded system, so we'd probably need another fix eventually.  See 
<https://github.com/Perl/perl5/issues/8274>.)

Re: Remaining dependency on setlocale()

От

Jeff Davis

Дата:

17 декабря 2024 г., 21:10:50

On Tue, 2024-12-17 at 13:14 +0100, Peter Eisentraut wrote:
> I think we will need to keep the global LC_CTYPE setting set to
> something useful, for example so that system error messages come out
> in
> the right encoding.

Do we need to rely on the global LC_CTYPE setting? We already use
bind_textdomain_codeset().

> But I'm concerned about the the Perl_setlocale() dance in plperl.c.
> Perl apparently does a setlocale(LC_ALL, "") during startup, and that
> code is a workaround to reset everything back afterwards.  We need to
> be
> careful not to break that.
>
> (Perl has fixed that in 5.19, but the fix requires that you set
> another
> environment variable before launching Perl, which you can't do in a
> threaded system, so we'd probably need another fix eventually.  See
> <https://github.com/Perl/perl5/issues/8274>.)

I don't fully understand that issue, but I would think the direction we
are going (keeping the global LC_CTYPE more consistent and relying on
it less) would make the problem better.

Regards,
    Jeff Davis

Re: Remaining dependency on setlocale()

От

Peter Eisentraut

Дата:

19 декабря 2024 г., 19:23:11

On 17.12.24 19:10, Jeff Davis wrote:
> On Tue, 2024-12-17 at 13:14 +0100, Peter Eisentraut wrote:
>> I think we will need to keep the global LC_CTYPE setting set to
>> something useful, for example so that system error messages come out
>> in
>> the right encoding.
> 
> Do we need to rely on the global LC_CTYPE setting? We already use
> bind_textdomain_codeset().

I don't think that would cover messages from the C library (strerror, 
dlerror, etc.).

>> But I'm concerned about the the Perl_setlocale() dance in plperl.c.
>> Perl apparently does a setlocale(LC_ALL, "") during startup, and that
>> code is a workaround to reset everything back afterwards.  We need to
>> be
>> careful not to break that.
>>
>> (Perl has fixed that in 5.19, but the fix requires that you set
>> another
>> environment variable before launching Perl, which you can't do in a
>> threaded system, so we'd probably need another fix eventually.  See
>> <https://github.com/Perl/perl5/issues/8274>.)
> 
> I don't fully understand that issue, but I would think the direction we
> are going (keeping the global LC_CTYPE more consistent and relying on
> it less) would make the problem better.

Yes, I think it's the right direction, but we need to figure this issue 
out eventually.

Re: Remaining dependency on setlocale()

От

Jeff Davis

Дата:

06 июня, 08:15:34

On Tue, 2024-12-17 at 13:14 +0100, Peter Eisentraut wrote:
> > > > +1 to this approach. It makes things more consistent across
> > > > platforms
> > > > and avoids surprising dependencies on the global setting.
> > > >
> >
> > I think this is the best way, and I haven't seen anyone supporting
> > any
> > other idea.  (I'm working on those setlocale()-removal patches I
> > mentioned, more very soon...)
>
> I also think this is the right direction, and we'll get closer with
> the
> remaining patches that Thomas has lined up.
>
> I think at this point, we could already remove all locale settings
> related to LC_COLLATE.  Nothing uses that anymore.
>
> I think we will need to keep the global LC_CTYPE setting set to
> something useful, for example so that system error messages come out
> in
> the right encoding.
>
> But I'm concerned about the the Perl_setlocale() dance in plperl.c.
> Perl apparently does a setlocale(LC_ALL, "") during startup, and that
> code is a workaround to reset everything back afterwards.  We need to
> be
> careful not to break that.
>
> (Perl has fixed that in 5.19, but the fix requires that you set
> another
> environment variable before launching Perl, which you can't do in a
> threaded system, so we'd probably need another fix eventually.  See
> <https://github.com/Perl/perl5/issues/8274>.)

To continue this thread, I did a symbol search in the meson build
directory like (patterns.txt attached):

  for f in `find . -name *.o`; do
    if ( nm --format=just-symbols $f | \
         grep -xE -f /tmp/patterns.txt > /dev/null ); then
      echo $f; fi; done

and it output:

./contrib/fuzzystrmatch/fuzzystrmatch.so.p/dmetaphone.c.o
./contrib/fuzzystrmatch/fuzzystrmatch.so.p/fuzzystrmatch.c.o
./contrib/isn/isn.so.p/isn.c.o
./contrib/spi/refint.so.p/refint.c.o
./contrib/ltree/ltree.so.p/crc32.c.o
./src/backend/postgres_lib.a.p/commands_copyfromparse.c.o
./src/backend/postgres_lib.a.p/utils_adt_pg_locale_libc.c.o
./src/backend/postgres_lib.a.p/tsearch_wparser_def.c.o
./src/backend/postgres_lib.a.p/parser_scansup.c.o
./src/backend/postgres_lib.a.p/utils_adt_inet_net_pton.c.o
./src/backend/postgres_lib.a.p/tsearch_ts_locale.c.o
./src/bin/psql/psql.p/meson-generated_.._tab-complete.c.o
./src/interfaces/ecpg/preproc/ecpg.p/meson-generated_.._preproc.c.o
./src/interfaces/ecpg/compatlib/libecpg_compat.so.3.18.p/informix.c.o
./src/interfaces/ecpg/compatlib/libecpg_compat.a.p/informix.c.o
./src/port/libpgport_srv.a.p/pgstrcasecmp.c.o
./src/port/libpgport_shlib.a.p/pgstrcasecmp.c.o
./src/port/libpgport.a.p/pgstrcasecmp.c.o

Not a short list, but not a long list, either, so seems tractable. Note
that this misses things like isdigit() which is inlined.

A few observations while spot-checking these files:

---------------------
pgstrcasecmp.c - has code like:

       else if (IS_HIGHBIT_SET(ch) && islower(ch))
        ch = toupper(ch);

and comments like "Note however that the whole thing is a bit bogus for
multibyte character sets."

Most of the callers are directly comparing with ascii literals, so I'm
not sure what the point is. There are probably some more interesting
callers hidden in there.
----------------------
pg_locale_libc.c -

char2wchar and wchar2char use mbstowcs and wcstombs when the input
locale is NULL. The main culprit seems to be full text search, which
has a bunch of /* TODO */ comments. Another caller is
get_iso_localename().

There are also a couple false positives where mbstowcs_l/wcstombs_l are
emulated with uselocale() and mbstowcs/wcstombs. In that case, it's not
actually sensitive to the global setting.
-----------------------
copyfromparse.c - the input is ASCII so it can use pg_ascii_tolower()
instead of tolower()
-----------------------

Regards,
    Jeff Davis

Вложения

patterns.txt

Re: Remaining dependency on setlocale()

От

Jeff Davis

Дата:

07 июня, 01:23:41

On Thu, 2025-06-05 at 22:15 -0700, Jeff Davis wrote:
> To continue this thread, I did a symbol search in the meson build
> directory like (patterns.txt attached):

Attached a rough patch series which does what everyone seemed to agree
on:

  * Change some trivial ASCII cases to use pg_ascii_* variants
  * Set LC_COLLATE and LC_CTYPE to C with pg_perm_setlocale
  * Introduce a new global_lc_ctype for callers that still need to use
operations that depend on datctype

There should be no behavior changes in this series.

Benefits:

  * a tiny step toward multithreading, because connections to different
databases no longer require different setlocale() settings
  * easier to identify dependence on datctype, because callers will
need to refer to global_lc_ctype.
  * harder to accidentally depend on datctype or datcollate

Ideally, when the database locale provider is not libc, the user
shouldn't need to even think about a valid LC_CTYPE locale at all. But
that requires more work, and potentially risk of breakage.

Regards,
    Jeff Davis

`

Вложения

Re: Remaining dependency on setlocale()

От

Peter Eisentraut

Дата:

10 июня, 18:32:01

On 07.06.25 00:23, Jeff Davis wrote:
> On Thu, 2025-06-05 at 22:15 -0700, Jeff Davis wrote:
>> To continue this thread, I did a symbol search in the meson build
>> directory like (patterns.txt attached):
> 
> Attached a rough patch series which does what everyone seemed to agree
> on:
> 
>    * Change some trivial ASCII cases to use pg_ascii_* variants
>    * Set LC_COLLATE and LC_CTYPE to C with pg_perm_setlocale
>    * Introduce a new global_lc_ctype for callers that still need to use
> operations that depend on datctype

v1-0001-copyfromparse.c-use-pg_ascii_tolower-rather-than-.patch
v1-0002-contrib-spi-refint.c-use-pg_ascii_tolower-instead.patch
v1-0003-isn.c-use-pg_ascii_toupper-instead-of-toupper.patch
v1-0004-inet_net_pton.c-use-pg_ascii_tolower-rather-than-.patch

These look good to me.

v1-0005-Add-global_lc_ctype-to-hold-locale_t-for-datctype.patch

This looks ok (but might depend on how patch 0006 turns out).

v1-0006-Use-global_lc_ctype-for-callers-of-locale-aware-f.patch

I think these need further individual analysis and explanation why these 
should use the global lc_ctype setting.  For example, you could argue 
that the SQL-callable soundex(text) function should use the collation 
object of its input value, not the global locale.  But furthermore, 
soundex_code() could actually just use pg_ascii_toupper() instead.  And 
in ts_locale.c, the isalnum_l() call should use mylocale that already 
exists in that function.  The problem to solve it getting a good value 
into mylocale.  Using the global setting confuses the issue a bit, I think.

v1-0007-Fix-the-last-remaining-callers-relying-on-setloca.patch

Do we have any data what platforms we'd need these checks for?

Also, if you look into wparser_def.c what p_isxdigit is used for, it's 
used for parsing XML (presumably HTML) files, so we just need ASCII-only 
behavior and no locale dependency.

v1-0008-Set-process-LC_COLLATE-C-and-LC_CTYPE-C.patch

As I mentioned earlier in the thread, I don't think we can do this for 
LC_CTYPE, because otherwise system error messages would not come out in 
the right encoding.  For the LC_COLLATE settings, I think we could just 
do the setting in main(), where the other non-database-specific locale 
categories are set.

Re: Remaining dependency on setlocale()

От

Jeff Davis

Дата:

10 июня, 20:48:26

On Tue, 2025-06-10 at 17:32 +0200, Peter Eisentraut wrote:
> As I mentioned earlier in the thread, I don't think we can do this
> for
> LC_CTYPE, because otherwise system error messages would not come out
> in
> the right encoding.

Is there any way around that? If all we need is the right encoding, do
we need a proper locale?

Regards,
    Jeff Davis

Re: Remaining dependency on setlocale()

От

Jeff Davis

Дата:

10 июля, 01:52:34

On Mon, 2025-07-07 at 17:56 -0700, Jeff Davis wrote:
> I looked into this a bit, and if I understand correctly, the only
> problem is with strerror() and strerror_r(), which depend on
> LC_MESSAGES for the language but LC_CTYPE to find the right encoding.

...

> Windows would be a different story, though: strerror() doesn't seem
> to
> have a variant that accepts a _locale_t object, and even if it did, I
> don't see a way to create a _locale_t object with LC_MESSAGES and
> LC_CTYPE set to different values.

I think I have an answer to the second part here:

"For information about the format of the locale argument, see Locale
names, Languages, and Country/Region strings."

https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/create-locale-wcreate-locale?view=msvc-170

and when I follow that link, I see:

  "You can specify multiple category types, separated by semicolons.
Category types that aren't specified use the current locale setting.
For example, this code snippet sets the current locale for all
categories to de-DE, and then sets the categories LC_MONETARY to en-GB
and LC_TIME to es-ES:

  _wsetlocale(LC_ALL, L"de-DE");
  _wsetlocale(LC_ALL, L"LC_MONETARY=en-GB;LC_TIME=es-ES");"

https://learn.microsoft.com/en-us/cpp/c-runtime-library/locale-names-languages-and-country-region-strings?view=msvc-170

So we just need to construct a string of the right form, and we can
have a _locale_t object representing the global locale for all
categories. I'm not sure exactly how we escape the individual locale
names, but it might be enough to just reject ';' in the locale name (at
least for windows).

The first problem -- how to affect the encoding of strings returned by
strerror() on windows -- may be solvable as well. It looks like
LC_MESSAGES is not supported at all on windows, so the only thing to be
concerned about is the encoding, which is affected by LC_CTYPE. But
windows doesn't offer uselocale() or strerror_l(). The only way seems
to be to call _configthreadlocale(_ENABLE_PER_THREAD_LOCALE) and then
setlocale(LC_CTYPE, datctype) right before strerror(), and switch it
back to "C" right afterward. Comments welcome.

Regards,
    Jeff Davis

Re: Remaining dependency on setlocale()

От

Thomas Munro

Дата:

10 июля, 02:53:02

On Thu, Jul 10, 2025 at 10:52 AM Jeff Davis <pgsql@j-davis.com> wrote:
> The first problem -- how to affect the encoding of strings returned by
> strerror() on windows -- may be solvable as well. It looks like
> LC_MESSAGES is not supported at all on windows, so the only thing to be
> concerned about is the encoding, which is affected by LC_CTYPE. But
> windows doesn't offer uselocale() or strerror_l(). The only way seems
> to be to call _configthreadlocale(_ENABLE_PER_THREAD_LOCALE) and then
> setlocale(LC_CTYPE, datctype) right before strerror(), and switch it
> back to "C" right afterward. Comments welcome.

FWIW there is an example of that in src/port/pg_localeconv_r.c.

Re: Remaining dependency on setlocale()

От

Thomas Munro

Дата:

10 июля, 03:01:12

On Tue, Jul 8, 2025 at 1:14 PM Jeff Davis <pgsql@j-davis.com> wrote:
> v4-0008 uses LC_C_LOCALE, and I'm not sure if that's portable, but if
> the buildfarm complains then I'll fix it or revert it.

(Catching up with this thread...)

LC_C_LOCALE is definitely not portable: I've only seen it on macOS and
NetBSD.  It would be a good thing to propose to POSIX, since no other
approach can be retrofitted quite so cromulently...

I tried to make a portable PG_C_LOCALE mechanism like that, but it was
reverted for reasons needing more investigation... see
8e993bff5326b00ced137c837fce7cd1e0ecae14 (reverted by
3c8e463b0d885e0d976f6a13a1fb78187b25c86f).

Re: Remaining dependency on setlocale()

От

Jeff Davis

Дата:

10 июля, 21:22:50

On Thu, 2025-07-10 at 11:53 +1200, Thomas Munro wrote:
> On Thu, Jul 10, 2025 at 10:52 AM Jeff Davis <pgsql@j-davis.com>
> wrote:
> > The first problem -- how to affect the encoding of strings returned
> > by
> > strerror() on windows -- may be solvable as well. It looks like
> > LC_MESSAGES is not supported at all on windows, so the only thing
> > to be
> > concerned about is the encoding, which is affected by LC_CTYPE. But
> > windows doesn't offer uselocale() or strerror_l(). The only way
> > seems
> > to be to call _configthreadlocale(_ENABLE_PER_THREAD_LOCALE) and
> > then
> > setlocale(LC_CTYPE, datctype) right before strerror(), and switch
> > it
> > back to "C" right afterward. Comments welcome.
>
> FWIW there is an example of that in src/port/pg_localeconv_r.c.

OK, so it seems we have a path forward here:

1. Have a global_libc_locale that represents all of the categories, and
keep it up to date with GUC changes. On windows, it requires keeping
the textual locale names handy (e.g. copies of datcollate and
datctype), and building the special locale string and doing
_create_locale(LC_ALL, "LC_ABC=somelocale;LC_XYZ=otherlocale").

2. When there's no _l() variant of a function, like strerror_r(), wrap
with uselocale(). On windows, this means using the trick above with
_configthreadlocale(_ENABLE_PER_THREAD_LOCALE).

I don't have a great windows development environment, and it appears CI
and the buildfarm don't offer great coverage either. Can I ask for a
volunteer to do the windows side of this work?

Regards,
    Jeff Davis

Re: Remaining dependency on setlocale()

От

Jeff Davis

Дата:

10 июля, 21:33:41

On Thu, 2025-07-10 at 12:01 +1200, Thomas Munro wrote:
> I tried to make a portable PG_C_LOCALE mechanism like that, but it
> was
> reverted for reasons needing more investigation... see
> 8e993bff5326b00ced137c837fce7cd1e0ecae14 (reverted by
> 3c8e463b0d885e0d976f6a13a1fb78187b25c86f).

The revert seems to be related to pgport_shlib. At least for my current
work, I'm focused on removing setlocale() dependencies in the backend,
and a PG_C_LOCALE should work fine there.

Regards,
    Jeff Davis

Re: Remaining dependency on setlocale()

От

Thomas Munro

Дата:

11 июля, 02:47:25

On Fri, Jul 11, 2025 at 6:33 AM Jeff Davis <pgsql@j-davis.com> wrote:
> On Thu, 2025-07-10 at 12:01 +1200, Thomas Munro wrote:
> > I tried to make a portable PG_C_LOCALE mechanism like that, but it
> > was
> > reverted for reasons needing more investigation... see
> > 8e993bff5326b00ced137c837fce7cd1e0ecae14 (reverted by
> > 3c8e463b0d885e0d976f6a13a1fb78187b25c86f).
>
> The revert seems to be related to pgport_shlib. At least for my current
> work, I'm focused on removing setlocale() dependencies in the backend,
> and a PG_C_LOCALE should work fine there.

OK, I'll figure out what happened with that and try to post a new
version over on that other thread soon.

(FWIW I learned a couple of encouraging things about that topic:
glibc's newlocale(LC_ALL, NULL, 0) seems to give you a static
singleton anyway, no allocation happens, so it can't fail in practice
and it's cheap, and FreeBSD also supports LC_C_LOCALE like NetBSD and
macOS, it just doesn't have a name, you pass NULL instead (which is
the value that LC_C_LOCALE has on those other systems).  I actually
rather like the macro, because how else are you supposed to test
whether this system can accept NULL there?)

Re: Remaining dependency on setlocale()

От

Thomas Munro

Дата:

11 июля, 02:48:50

On Fri, Jul 11, 2025 at 6:22 AM Jeff Davis <pgsql@j-davis.com> wrote:
> I don't have a great windows development environment, and it appears CI
> and the buildfarm don't offer great coverage either. Can I ask for a
> volunteer to do the windows side of this work?

Me neither but I'm willing to help with that, and have done lots of
closely related things through trial-by-CI...

Re: Remaining dependency on setlocale()

От

Jeff Davis

Дата:

24 июля, 05:11:34

On Fri, 2025-07-11 at 11:48 +1200, Thomas Munro wrote:
> On Fri, Jul 11, 2025 at 6:22 AM Jeff Davis <pgsql@j-davis.com> wrote:
> > I don't have a great windows development environment, and it
> > appears CI
> > and the buildfarm don't offer great coverage either. Can I ask for
> > a
> > volunteer to do the windows side of this work?
>
> Me neither but I'm willing to help with that, and have done lots of
> closely related things through trial-by-CI...

Attached a patch to separate the message translation (both gettext and
strerror translations) from setlocale(). That's a step towards thread
safety, and also a step toward setting LC_CTYPE=C permanently (more
work still required there).

The patch feels a bit over-engineered, but I'd like to know what you
think. It would be great if you could test/debug the windows NLS-
enabled paths.

I'm also not sure what to do about the NetBSD path. NetBSD has no
uselocale(), so I have to fall bad to temporary setlocale(), which is
not thread safe. And I'm getting a mysterious error in test_aio for
NetBSD, which I haven't investigated yet.

Regards,
    Jeff Davis

Вложения

v5-0001-Create-wrapper-for-managing-NLS-locale.patch

Re: Remaining dependency on setlocale()

От

Jeff Davis

Дата:

24 июля, 21:10:40

On Wed, 2025-07-23 at 19:11 -0700, Jeff Davis wrote:
> The patch feels a bit over-engineered, but I'd like to know what you
> think. It would be great if you could test/debug the windows NLS-
> enabled paths.

Let me explain how it ended up looking over-engineered, and perhaps
someone has a simpler solution.

For gettext, we already configure the encoding with
bind_textdomain_codeset(). All it needs is LC_MESSAGES set properly,
which can be done with uselocale(), as a semi-permanent setting until
the next GUC change, just like setlocale() today. There are a couple
minor problems for platforms without uselocale(). For windows, we could
just permanently do:

  _configthreadlocale(_ENABLE_PER_THREAD_LOCALE)

and then use _wsetlocale. For NetBSD, I don't have a solution, but
perhaps we can just reject new lc_messages settings after startup, or
just defer the problem until threading actually becomes a pressing
issue.

The main problem is with strerror_r(). To get the right LC_MESSAGES
setting, we need the separate path for windows (which has neither
uselocale() nor strerror_l()). Because we need to keep track of that
path anyway, I used it for gettext as well to have a cleaner separation
for the entire message translation locale. That means we can avoid
permanent locale settings, and reduce the chances that we accidentally
depend on the global locale.

Regards,
    Jeff Davis

Re: Remaining dependency on setlocale()

От

Jeff Davis

Дата:

15 октября, 02:26:47

On Thu, 2025-07-24 at 11:10 -0700, Jeff Davis wrote:
> The main problem is with strerror_r()...

Postgres messages, like "division by zero" are translated just fine
without LC_CTYPE; gettext() only needs LC_MESSAGES and the server
encoding. So these are fine.

We use strerror_r() to translate the system errno into a readable
message, like "No such file or directory", i.e. the %m replacements.
That needs LC_CTYPE set (just for the encoding, not the
language/region) as well as LC_MESSAGES (for the language/region).

When using a locale provider other than libc, it's unfortunate to
require LC_CTYPE to be set for just this one single purpose. The locale
itself, e.g. the "en_US" part, is not used at all; only the encoding
part of the setting is relevant. And there is no value other than "C"
that works on all platforms. It's fairly confusing to explain why the
LC_CTYPE setting is required for the builtin or ICU providers at all.
Also, while it's far from the biggest challenge when it comes to
multithreading, it does cause thread-safety headaches on platforms
without uselocale().

Perhaps we could get the ASCII message and run it through gettext()?
That would be extra work for translators, but perhaps not a lot, given
that it's a small and static set of messages in practice. That would
also have the benefit that either NLS is enabled or not -- right now,
since the translation happens in two different ways you can end up with
partially-translated messages. It would also result in consistent
translations across platforms.

Regards,
    Jeff Davis

Re: Remaining dependency on setlocale()

От

Jeff Davis

Дата:

29 октября, 03:19:50

On Wed, 2025-07-23 at 19:11 -0700, Jeff Davis wrote:
> On Fri, 2025-07-11 at 11:48 +1200, Thomas Munro wrote:
> > On Fri, Jul 11, 2025 at 6:22 AM Jeff Davis <pgsql@j-davis.com>
> > wrote:
> > > I don't have a great windows development environment, and it
> > > appears CI
> > > and the buildfarm don't offer great coverage either. Can I ask
> > > for
> > > a
> > > volunteer to do the windows side of this work?
> >
> > Me neither but I'm willing to help with that, and have done lots of
> > closely related things through trial-by-CI...

Attached a new patch series, v6.

Rather than creating new global locale_t objects, this series (along
with a separate patch for NLS[1]) removes the dependency on the global
LC_CTYPE entirely. It's a bunch of small patches that replace direct
calls to tolower()/toupper() with calls into the provider.

An assumption of these patches is that, in the UTF-8 encoding, the
logic in pg_tolower()/pg_toupper() is equivalent to
pg_ascii_tolower()/pg_ascii_toupper().

Generally these preserve existing behavior, but there are a couple
differences:

  * If using the builtin C locale (not C.UTF-8) along with a datctype
that's a non-C locale with single-byte encoding, it could affect the
results of downcase_identifier(), ltree, and fuzzystrmatch on
characters > 127. For ICU, I went to a bit of extra effort to preserve
the existing behavior here, because it's more likely to be used for
single-byte encodings.

  * When using ICU or builtin C.UTF-8, along with a datctype of
"tr_TR.UTF-8", then it will affect ltree's and fuzzystrmatch's
treatment of i/I.

If these are a concern we can fix them with some hacks, but those
behaviors seem fairly obscure to me.

Regards,
    Jeff Davis

[1]
https://www.postgresql.org/message-id/90f176c5b85b9da26a3265b2630ece3552068566.camel@j-davis.com

Вложения

Re: Remaining dependency on setlocale()

От

Jeff Davis

Дата:

30 октября, 20:17:03

On Tue, 2025-10-28 at 17:19 -0700, Jeff Davis wrote:
> Attached a new patch series, v6.

I'm eager to start committing this series so that we have plenty of
time to sort out any problems. I welcome feedback before or after
commit, and I can revert if necessary.

The goal here is to do a permanent:

setlocale(LC_CTYPE, "C")

in the postmaster, and instead use _l() variants where necessary.

Forcing the global LC_CTYPE to C will avoid platform-specific nuances
spread throughout the code, and prevent new code from accidentally
depending on platform-specific libc behavior. Instead, libc ctype
behavior will only happen through a pg_locale_t object.

It also takes us a step closer to thread safety.

LC_COLLATE was already permenently set to "C" (5e6e42e4), and most of
LC_CTYPE behavior already uses a pg_locale_t object. This series is
about removing the last few places that rely on raw calls to
tolower()/toupper() (sometimes through pg_tolower()). Where there isn't
a pg_locale_t immediately available it uses the database default locale
(which might or might not be libc).

There's another thread for what to do about strerror_r[1], which
depends on LC_CTYPE for the encoding:

https://www.postgresql.org/message-id/90f176c5b85b9da26a3265b2630ece3552068566.camel@j-davis.com

pg_localeconv_r() does depend on the LC_CTYPE for the encoding, but it
already sets it from lc_monetary and lc_numeric, without using datctype
or the global setting. Then PGLC_localeconv() converts to the database
encoding, if necessary. So it's an exception to the rule that all ctype
behavior goes through a pg_locale_t, but it's not a problem. (Aside: we
could consider this approach as a narrower fix for strerror_r(), as
well.)

There may be a loose end around plperl, as well, but not sure if this
will make it any worse.

Some other LC_* settings still rely on setlocale(), which can be
considered separately unless there's some interaction that I missed.

Note that the datcollate and datctype fields are already mostly
irrelevant for non-libc providers. We could set those to NULL, but for
now I don't intend to do that.

Regards,
Jeff Davis

Re: Remaining dependency on setlocale()

От

"Daniel Verite"

Дата:

30 октября, 23:41:50

    Jeff Davis wrote:

> The goal here is to do a permanent:
>
>   setlocale(LC_CTYPE, "C")
>
> in the postmaster, and instead use _l() variants where necessary.

What about code in extensions? AFAIU a user can control the
locale in effect by setting the LC_CTYPE argument of
CREATE DATABASE, which ends up in the environment
of backends serving that database.
If it's forced to "C", how can an extension use locale-aware
libc functions?

In theory it's the same problem with LC_COLLATE, except
that functions like tolower()/toupper() are much more likely
to be used in extensions than strcoll().

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/

Re: Remaining dependency on setlocale()

От

Jeff Davis

Дата:

30 октября, 23:53:43

On Thu, 2025-10-30 at 21:41 +0100, Daniel Verite wrote:
> What about code in extensions? AFAIU a user can control the
> locale in effect by setting the LC_CTYPE argument of
> CREATE DATABASE, which ends up in the environment
> of backends serving that database.
> If it's forced to "C", how can an extension use locale-aware
> libc functions?

Extensions often need to be updated for a new major version.

The extension should call pg_database_locale(), and pass that to a
function exposed in pg_locale.h. A recent commit exposed pg_iswalpha(),
etc., so there's a reasonable set of functions that should be suitable
for most purposes.

If it's not available in pg_locale.h, or the extension really needs to
use a different LC_CTYPE for some reason, it can use an _l() variant of
the function.

Regards,
    Jeff Davis

Re: Remaining dependency on setlocale()

От

Peter Eisentraut

Дата:

31 октября, 12:40:41

On 30.10.25 18:17, Jeff Davis wrote:
> On Tue, 2025-10-28 at 17:19 -0700, Jeff Davis wrote:
>> Attached a new patch series, v6.
> 
> I'm eager to start committing this series so that we have plenty of
> time to sort out any problems. I welcome feedback before or after
> commit, and I can revert if necessary.

What is one supposed to do with this statement?  You post a series of 9 
patches and the next day you say you are eager to commit it?  Do you not 
want to give others the time to properly review this?  The patches say 
they are "v6", but AFAICT the previous patches "v5" and "v4" in this 
thread are substantially different from these.

> The goal here is to do a permanent:
> 
>     setlocale(LC_CTYPE, "C")
> 
> in the postmaster, and instead use _l() variants where necessary.
> 
> Forcing the global LC_CTYPE to C will avoid platform-specific nuances
> spread throughout the code, and prevent new code from accidentally
> depending on platform-specific libc behavior. Instead, libc ctype
> behavior will only happen through a pg_locale_t object.
> 
> It also takes us a step closer to thread safety.

At first glance, these patches seem reasonable steps into that direction.

But I'm not sure that we actually want to make that switch.  It would be 
good if our code is independent of the global locale settings, but that 
doesn't mean that there couldn't be code in extensions, other libraries, 
or other corners of the operating system that relies on this.  In 
general, and I haven't looked this up in the applicable standards, it 
seems like a good idea to accurately declare what encoding you operate in.

Re: Remaining dependency on setlocale()

От

"Daniel Verite"

Дата:

31 октября, 17:01:39

    Jeff Davis wrote:

> On Thu, 2025-10-30 at 21:41 +0100, Daniel Verite wrote:
> > What about code in extensions? AFAIU a user can control the
> > locale in effect by setting the LC_CTYPE argument of
> > CREATE DATABASE, which ends up in the environment
> > of backends serving that database.
> > If it's forced to "C", how can an extension use locale-aware
> > libc functions?
>
> Extensions often need to be updated for a new major version.

I think forcing the C locale is not comparable to API changes,
and the consequences are not even necessarily fixable for extensions.

For instance, consider the following function, when run in a database
with en_US.utf8 as locale.

CREATE FUNCTION lt_test(text,text) RETURNS boolean as $$
 use locale; return ($_[0] lt $_[1])?1:0;
$$ LANGUAGE plperlu;

select lt_test('a', 'B');

With PG 18 it returns true
With 19devel it returns false.

This is since commit 5e6e42e4 doing that:

+     * Collation is handled by pg_locale.c, and the behavior is dependent
on
+     * the provider. strcoll(), etc., should not be called directly.
+     */
+    init_locale("LC_COLLATE", LC_COLLATE, "C");
+
+    /*

Obviously libperl is not going to be updated to call Postgres
string comparisons functions instead of strcoll().
The same is probably true for other languages available as
extensions that expose POSIX locale-aware functions.

Extending this logic to LC_CTYPE will extend the breakage.

While I agree with the goal of not depending on setlocale()
in the core code for anything that should be locale-provider
dependent, making this goal leak into extensions seems
unnecessarily user-hostile. What it's saying to users is,
before v19 you could choose your locale, and starting
with v19 you'll have "C" whether you want it or not.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/

Re: Remaining dependency on setlocale()

От

Jeff Davis

Дата:

31 октября, 18:25:46

On Fri, 2025-10-31 at 15:01 +0100, Daniel Verite wrote:
> > Extensions often need to be updated for a new major version.
>
> I think forcing the C locale is not comparable to API changes,
> and the consequences are not even necessarily fixable for extensions.

Are we in agreement that it's fine for C extensions?

> For instance, consider the following function, when run in a database
> with en_US.utf8 as locale.
>
> CREATE FUNCTION lt_test(text,text) RETURNS boolean as $$
>  use locale; return ($_[0] lt $_[1])?1:0;
> $$ LANGUAGE plperlu;
>
> select lt_test('a', 'B');

Are you aware of PL code that does things like that? If the database
locale is ICU, that would be at least a little bit confusing.

Regards,
    Jeff Davis

Re: Remaining dependency on setlocale()

От

Jeff Davis

Дата:

31 октября, 22:59:20

On Fri, 2025-10-31 at 10:40 +0100, Peter Eisentraut wrote:

> But I'm not sure that we actually want to make that switch.  It would
> be
> good if our code is independent of the global locale settings, but
> that
> doesn't mean that there couldn't be code in extensions, other
> libraries,
> or other corners of the operating system that relies on this.

This question has been brewing for a while. How should we make this
decision?

>   In
> general, and I haven't looked this up in the applicable standards, it
> seems like a good idea to accurately declare what encoding you
> operate in.

One frustration (for me, at least) is that there is no way to set the
encoding without specifying the locale. LC_CTYPE=C.UTF-8 is close, but
the libc version is not available on all platforms and has some quirks.

That makes any changes to the initdb default logic difficult to sort
out. Some combinations which seem simple -- like ICU/UTF8 -- need to
handle the case when LC_CTYPE is not compatible with UTF-8, even though
the LC_CTYPE has no effect in that case.

Regards,
    Jeff Davis

Re: Remaining dependency on setlocale()

От

"Daniel Verite"

Дата:

03 ноября, 22:14:03

    Jeff Davis wrote:

> > > Extensions often need to be updated for a new major version.
> >
> > I think forcing the C locale is not comparable to API changes,
> > and the consequences are not even necessarily fixable for extensions.
>
> Are we in agreement that it's fine for C extensions?

No, I think we should put the database's lc_ctype
into LC_CTYPE and the database's lc_collate into
LC_COLLATE, independently of anything else,
like it was done until commit 5e6e42e.
I believe that's the purpose of these database
properties, whether the provider is libc or ICU or builtin.

Forcing "C" is a disruptive change, that IMO does
not seem compensated by substantial advantages
that would justify the disruption.

> > CREATE FUNCTION lt_test(text,text) RETURNS boolean as $$
> >  use locale; return ($_[0] lt $_[1])?1:0;
> > $$ LANGUAGE plperlu;
> >
> > select lt_test('a', 'B');
>
> Are you aware of PL code that does things like that? If the database
> locale is ICU, that would be at least a little bit confusing.

plperl users writing "use locale" should understand that
it's the libc locale, like when this code is run outside Postgres.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/

Re: Remaining dependency on setlocale()

От

Jeff Davis

Дата:

03 ноября, 22:59:57

On Mon, 2025-11-03 at 20:14 +0100, Daniel Verite wrote:
> No, I think we should put the database's lc_ctype
> into LC_CTYPE and the database's lc_collate into
> LC_COLLATE, independently of anything else,
> like it was done until commit 5e6e42e.
> I believe that's the purpose of these database
> properties, whether the provider is libc or ICU or builtin.

Is there a clean way to document this behavior? I have tried to improve
the documentation in this area before, but it's not easy because the
behavior is so nuanced.

The CREATE DATABASE command needs LOCALE (or LC_CTYPE/LC_COLLATE). But
it's easy to get LOCALE and ICU_LOCALE or BUILTIN_LOCALE confused. And
similarly for initdb. We could clean those up a lot if
LC_CTYPE/LC_COLLATE didn't even need to be set for non-libc providers.

Users can end up in a situation where they need to define
LC_CTYPE/LC_COLLATE, even though it has almost (but not entirely) no
effect:

https://www.postgresql.org/message-id/f794e177b0b1ed8917e75258726ae315cf8fbbef.camel%40j-davis.com

Reverting commit 5e6e42e may be the right thing, but I'd like to hear
what others have to say on this point first. In particualr, I'd like to
know whether such a revert is based on principle, a practical problem,
or just an abundance of caution.

Another option we have here is LC_CTYPE=LC_COLLATE=C for non-libc
providers, but leave it as in v17 for libc providers. If someone
explicitly wants libc behavior (by specifying something like "use
locale" in plperl), they probably want to be using the libc provider
for the database.

Regards,
    Jeff Davis

Re: Remaining dependency on setlocale()

От

Jeff Davis

Дата:

04 ноября, 00:54:13

On Mon, 2025-11-03 at 11:59 -0800, Jeff Davis wrote:
> Reverting commit 5e6e42e may be the right thing, but I'd like to hear
> what others have to say on this point first. In particualr, I'd like
> to
> know whether such a revert is based on principle, a practical
> problem,
> or just an abundance of caution.
>
> Another option we have here is LC_CTYPE=LC_COLLATE=C for non-libc
> providers, but leave it as in v17 for libc providers. If someone
> explicitly wants libc behavior (by specifying something like "use
> locale" in plperl), they probably want to be using the libc provider
> for the database.

Actually, there's yet another option: lc_ctype and lc_collate could be
GUCs again. If they don't affect any backend behavior, then why not?
The user would have complete control.

(Probably PGC_POSTMASTER, because one of the goals is to not rely on
setlocale() in the backends.)

Of course, we need to be sure that they are compatible with the
database encoding. We have code to do that, but not sure how reliable
it is across platforms.

Regards,
    Jeff Davis

Re: Remaining dependency on setlocale()

От

Jeff Davis

Дата:

10 ноября, 23:39:28

On Mon, 2025-11-03 at 20:14 +0100, Daniel Verite wrote:
> No, I think we should put the database's lc_ctype
> into LC_CTYPE and the database's lc_collate into
> LC_COLLATE, independently of anything else,
> like it was done until commit 5e6e42e.
> I believe that's the purpose of these database
> properties, whether the provider is libc or ICU or builtin.

As phrased, that appears to be a promise that we will never support
thread-per-connection. setlocale() is not thread-safe, and uselocale()
is not available on NetBSD.

Regards,
    Jeff Davis

Re: Remaining dependency on setlocale()

От

Peter Eisentraut

Дата:

12 ноября, 21:41:58

On 03.11.25 20:14, Daniel Verite wrote:
> No, I think we should put the database's lc_ctype
> into LC_CTYPE and the database's lc_collate into
> LC_COLLATE, independently of anything else,
> like it was done until commit 5e6e42e.
> I believe that's the purpose of these database
> properties, whether the provider is libc or ICU or builtin.
> 
> Forcing "C" is a disruptive change, that IMO does
> not seem compensated by substantial advantages
> that would justify the disruption.

 From my perspective, the difference between LC_COLLATE and LC_CTYPE is 
that LC_COLLATE has a quite limited impact area.  Either your code uses 
strcoll() (or strxfrm()) or it does not.  And if it does, you can find 
all the places and adjust them, and it probably won't be that many 
places.  The impact area of LC_CTYPE is much larger and more complicated 
and possibly interacts with other settings and third-party libraries in 
ways that we don't understand yet and might not be able to change. 
That's why I'm more hesitant about it.  But I don't see any reason to 
keep LC_COLLATE set going forward.

Re: Remaining dependency on setlocale()

От

Peter Eisentraut

Дата:

12 ноября, 21:59:33

On 29.10.25 01:19, Jeff Davis wrote:
> On Wed, 2025-07-23 at 19:11 -0700, Jeff Davis wrote:
>> On Fri, 2025-07-11 at 11:48 +1200, Thomas Munro wrote:
>>> On Fri, Jul 11, 2025 at 6:22 AM Jeff Davis <pgsql@j-davis.com>
>>> wrote:
>>>> I don't have a great windows development environment, and it
>>>> appears CI
>>>> and the buildfarm don't offer great coverage either. Can I ask
>>>> for
>>>> a
>>>> volunteer to do the windows side of this work?
>>>
>>> Me neither but I'm willing to help with that, and have done lots of
>>> closely related things through trial-by-CI...
> 
> Attached a new patch series, v6.
> 
> Rather than creating new global locale_t objects, this series (along
> with a separate patch for NLS[1]) removes the dependency on the global
> LC_CTYPE entirely. It's a bunch of small patches that replace direct
> calls to tolower()/toupper() with calls into the provider.
> 
> An assumption of these patches is that, in the UTF-8 encoding, the
> logic in pg_tolower()/pg_toupper() is equivalent to
> pg_ascii_tolower()/pg_ascii_toupper().

I'm getting a bit confused by all these different variant function 
names.  Like we have now

tolower
TOLOWER
char_tolower
pg_tolower
pg_strlower
pg_ascii_tolower
downcase_identifier

and maybe more, and upper versions.

This patch set makes changes like

-           else if (IS_HIGHBIT_SET(ch2) && isupper(ch2))
-               ch2 = tolower(ch2);
+           else if (IS_HIGHBIT_SET(ch2))
+               ch2 = TOLOWER(ch2);

So there is apparently some semantic difference between tolower() and 
TOLOWER(), which is represented by the fact that the function name is 
all upper case?  Actually, it's a macro and could mean different things 
in different contexts.

And there is very little documentation accompanying all these different 
functions.  For example, struct collate_methods and struct ctype_methods 
contain barely any documentation at all.

Many of these issues are pre-existing, but I just figured it has reached 
a point where we need to do something about it.

Re: Remaining dependency on setlocale()

От

Jeff Davis

Дата:

12 ноября, 21:59:35

On Wed, 2025-11-12 at 19:41 +0100, Peter Eisentraut wrote:
> The impact area of LC_CTYPE is much larger and more complicated
> and possibly interacts with other settings and third-party libraries
> in
> ways that we don't understand yet and might not be able to change.
> That's why I'm more hesitant about it.

What do you think about making lc_ctype and/or lc_collate into GUCs
(like lc_messages), assuming we remove all known effects in the backend
first?

If we make the setting PGC_POSTMASTER, then it eliminates potential
problems with threading, because setlocale() would be called only once
before allowing connections. For platforms that support uselocale(), we
could allow it to be set more freely, for those who need it set to
different values in different backends.

It would also be easier to document, which would be nice. There could
be some confusion if various settings are inconsistent with each other,
but that's true currently. And we'd still enforce a ctype that's
consistent with the encoding, at least.

Regards,
    Jeff Davis

Re: Remaining dependency on setlocale()

От

Jeff Davis

Дата:

15 ноября, 03:23:11

On Wed, 2025-11-12 at 19:59 +0100, Peter Eisentraut wrote:
> I'm getting a bit confused by all these different variant function
> names.

One way of looking at it is that the functions in this patch series
mostly affect how identifiers are treated, whereas earlier collation-
related work affects how text data is treated. Ideally, they should be
similar, but for historical reasons they're not.

There are a lot of subtle behaviors for identifiers, which individually
make some sense, but over time have just become edge cases and sources
of inconsistency:

downcase_identifier() is a server function to casefold unquoted
identifiers during parsing (used by other callers, too). For non-ascii
characters in a single-byte encoding, it uses tolower(); otherwise the
lowercasing is ascii-only. Note: if an application is reliant on the
casefolding of non-ascii identifiers, such as SELECT * FROM É finding
the table named "é", that application would not work in UTF-8 even with
a dump/restore.

pg_strcasecmp() and pg_tolower() are used from the server and the
client to do case-insensitive comparison of option names. They're
supposed to use the same casing semantics as downcase_identifier(), but
they don't differentiate between single-byte and multi-byte encodings;
they just call tolower() on any non-ascii byte. That difference
probably doesn't matter for UTF8, because tolower() on a single byte in
a multibyte sequence should be a no-op, but perhaps it can matter in
non-UTF-8 multibyte encodings.

It's hard to avoid some confusion unless we're able to simplify some of
these behaviors. Let me know if you think we can tolerate some
simplifications in these edge cases without breaking anything too
badly.

> Many of these issues are pre-existing, but I just figured it has
> reached
> a point where we need to do something about it.

Starting from first principles, individual character operations should
be mostly for parsing (e.g. tsearch) or pattern matching. Case folding
and caseless matching should be done with string operations. And
obviously all of this should be multibyte aware and work consistently
in different encodings (to the extent possible given the
representational constraints).

Our APIs in pg_locale.c do a good job of offering that, and do not
depend on the global LC_CTYPE. (There are a few things I'd like to add
or clean up, but it offers most of what we need.)

The problem, of course, is migrating the callers to use pg_locale.c
APIs without breaking things. This patch series is intended to make
everything locale-sensitive in the backend go through pg_locale_t
without any behavior changes. The benefit is that it would at least
remove the global LC_CTYPE dependency, but it ends up with hacky
compatibility methods like char_tolower(), which piles on to the
already-confusing set of tolower-like functions.

In an earlier approach:

https://www.postgresql.org/message-id/5f95b81af1e81b28b8a9ac5929f199b2f4091fdf.camel@j-davis.com

I added a strfold_ident() method. That's easier to understand for
downcase_identifier(), but didn't solve the problems for other
callsites that depend on tolower(), and so I'd need to add more methods
for those places, and started to look unpleasant.

And earlier in this thread, I had tried the approach of using a global
variable to hold a locale representing datctype. That felt a bit weird,
though, because it mostly only matters when datlocprovider='c', and in
that case, there's already a locale_t initialized along with the
default collation. So why not find a way to go through the default
collation?

I still favor the approach used in the current patch series to remove
the dependency on the global LC_CTYPE, but I'm open to suggestion.
Whatever we do will probably require some additional hacking later
anyway.

I tried to improve the comments in pgstrcasecmp.c, and I rebased.

Regards,
Jeff Davis

Вложения

Re: Remaining dependency on setlocale()

От

Jeff Davis

Дата:

21 ноября, 03:58:16

On Wed, 2025-11-12 at 19:59 +0100, Peter Eisentraut wrote:
> Many of these issues are pre-existing, but I just figured it has
> reached
> a point where we need to do something about it.

I tried to simplify things in this patch series, assuming that we have
some tolerance for small behavior changes.

0001: No behavior change here, same patch as before. Uncontroversial
simplification, so I plan to commit this soon.

0002: change fuzzystrmatch to use ASCII semantics. As far as I can
tell, this only affects the results of soundex(). Before the patch, in
en_US.iso885915, soundex('réd') was 'RÉ30', after the patch it's
'Ré30'. I'm not sure whether the current behavior is intentional or
not. Other functions (daitch_mokotoff, levenshtein, and metaphone) are
unaffected as far as I can tell.

0003+0005: change ltree to use case folding instead of tolower(). I
believe this is a bug fix, because the current code is inconsistent
between ltree_strncasecmp() and ltree_crc32_sz().

0006-0007: Remove char_tolower() API. This also removes the
optimization for single-byte encodings with the libc provider and a
non-C locale, but simplifies the code (the optimization is retained for
the C locale). It's possible to make the lazy-folding optimization work
for all locales without the char_tolower() API by doing something
simlar to what 0004 does for ltree. But to make this work efficiently
for Generic_Text_IC_like() would be a bit more complex: we'd need to
adjust MatchText() to be able to fold the arguments lazily, and perhaps
introduce some kind of casemapping iterator. That's already a pretty
complex function, so I'm hesitant to do that work unless the
optimization is important.

These patches don't get us quite to the point of eliminating the
LC_CTYPE dependency (there's still downcase_identifier() and
pg_strcasecmp() to worry about, and some assorted isxyz() calls to
examine), but they simplify things enough that the path forward will be
easier.

Regards,
    Jeff Davis

Вложения

Re: Remaining dependency on setlocale()

От

Jeff Davis

Дата:

25 ноября, 02:57:43

On Thu, 2025-11-20 at 16:58 -0800, Jeff Davis wrote:
> On Wed, 2025-11-12 at 19:59 +0100, Peter Eisentraut wrote:
> > Many of these issues are pre-existing, but I just figured it has
> > reached
> > a point where we need to do something about it.
>
> I tried to simplify things in this patch series, assuming that we
> have
> some tolerance for small behavior changes.
>
> 0001: No behavior change here, same patch as before. Uncontroversial
> simplification, so I plan to commit this soon.

Committed.

New series attached, which I tried to put in an order that would be
reasonable for commit.

0001-0004: Pure refactoring patches. I intend to commit a couple of
these soon.

0005: No behavioral change, and not much change at all. Computes the
"max_chr" for regexes (a performance optimization for low codepoints)
more consistently and simply based on the encoding.

0006: fixes longstanding ltree bug due to inconsistency between the
database locale and the global LC_CTYPE setting when using a non-libc
provider. The end result is also cleaner: use the database locale
consistently, like tsearch. I don't intend to backport this, unless
someone thinks it should be, but it should come with a release note to
reindex ltree indexes if using a non-libc provider.

0007: remove the char_tolower() API completely. We'd lose a pattern
matching optimization for single-byte encodings with libc and a non-C
locale, but it's a significant simplification. We could go even further
and change this to use casefolding rather than lower(), but that seems
like a separate change.

0008: Multibyte-aware extraction of pattern prefixes. The previous code
gave up on any byte that it didn't understand, which made prefixes
unnecessarily short. This patch is also cleaner.

0009: Changes fuzzystrmatch to use pg_ascii_toupper(). Most functions
in the extension are unaffected, but soundex() can be affected, and I'm
not sure what exactly it's supposed to do with non-ASCII.

0010: For downcase_identifier(), use a new provider-specific
pg_strfold_ident() method. The ICU version of this method is a work-in-
progress, because right now it depends on libc. I suppose it should
decode to UTF-32, then go through u_tolower(), then re-encode -- but
can the re-encoding fail? In any case, it would be a behavior change
for identifier casefolding with ICU and a single-byte encoding, which
is probably OK but the risk is non-zero.

0011: POC patch to introduce lc_collate GUC. It would only affect
extensions, PLs, libraries, or other non-core code that happens to call
strcoll() or strxfrm(). This would address Daniel's complaint, but it's
more flexible. And by being a GUC, it's clear that we shouldn't depend
on it for any stored data. We can do something similar for LC_CTYPE
after we eliminate dependencies in core code.

Regards,
    Jeff Davis

Вложения

Re: Remaining dependency on setlocale()

От

Chao Li

Дата:

25 ноября, 04:44:51

Hi Jeff,

I have reviewed 0001-0004 and got a few comments:

> On Nov 25, 2025, at 07:57, Jeff Davis <pgsql@j-davis.com> wrote:
>
>
> 0001-0004: Pure refactoring patches. I intend to commit a couple of
> these soon.
>

1 - 0001
```
+/*
+ * Fold a character to upper case, following C/POSIX locale rules.
+ */
+static inline unsigned char
+pg_ascii_toupper(unsigned char ch)
```

I was curious why “inline” is needed, then I figured out when I tried to build. Without “inline”, compile will raise
warningsof “unused function”. So I guess it’s better to explain why “inline” is used in the function comment, otherwise
otherreaders may get the same confusion. 

2 - 0002
```
+ * three output codepoints. See Unicode 5.18.2, "Change in Length".
```

With “change in length”, I confirmed “Unicode 5.18.2” means the Unicode Standard Section 5.18.2 “Complications for Case
Mapping”.Why don’t we just give the URL in the comment.
https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-5/#G29675

3 - 0004
```
@@ -1282,10 +1327,18 @@ size_t
 pg_strfold(char *dst, size_t dstsize, const char *src, ssize_t srclen,
            pg_locale_t locale)
 {
-    if (locale->ctype->strfold)
-        return locale->ctype->strfold(dst, dstsize, src, srclen, locale);
-    else
-        return locale->ctype->strlower(dst, dstsize, src, srclen, locale);
+    if (locale->ctype == NULL)
+    {
+        int            i;
+
+        srclen = (srclen >= 0) ? srclen : strlen(src);
+        for (i = 0; i < srclen && i < dstsize; i++)
+            dst[i] = pg_ascii_tolower(src[i]);
+        if (i < dstsize)
+            dst[i] = '\0';
+        return srclen;
+    }
+    return locale->ctype->strfold(dst, dstsize, src, srclen, locale);
 }
```

I don’t get this change. In old code, depending on locale->ctype->strfold, it calls strfold or strlower. But in this
patch,it only calls strfold. Why? If that’s intentional, maybe better to add a comment to explain that. 

4 - 0004

In pg_strfold, the ctype==NULL fallback code is exactly the same as pg_strlower. I guess you intentionally to not call
pg_strlowerhere for performance consideration, is that true? 

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

Re: Remaining dependency on setlocale()

От

Jeff Davis

Дата:

25 ноября, 20:20:01

On Tue, 2025-11-25 at 09:44 +0800, Chao Li wrote:
> I was curious why “inline” is needed, then I figured out when I tried
> to build. Without “inline”, compile will raise warnings of “unused
> function”. So I guess it’s better to explain why “inline” is used in
> the function comment, otherwise other readers may get the same
> confusion.

That's a typical pattern: to make it "inline", move it to a .h file and
declare it as "static inline". For common patterns like that, I don't
think we should explain them in comments, because it would mean we
would start adding comments in zillions of places.

> With “change in length”, I confirmed “Unicode 5.18.2” means the
> Unicode Standard Section 5.18.2 “Complications for Case Mapping”. Why
> don’t we just give the URL in the comment.
> https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-5/#G29675

Done, thank you. (Though since we haven't actually moved to 17 yet, I
linked to 16 instead.)

> I don’t get this change. In old code, depending on locale->ctype-
> >strfold, it calls strfold or strlower. But in this patch, it only
> calls strfold. Why? If that’s intentional, maybe better to add a
> comment to explain that.

I thought it would be slightly cleaner to just define the strfold
method in the libc provider as the same as strlower. I agree it's worth
a comment, so I added some in pg_locale_libc.c.

> In pg_strfold, the ctype==NULL fallback code is exactly the same as
> pg_strlower. I guess you intentionally to not call pg_strlower here
> for performance consideration, is that true?

I made some static functions to clean that up, and added some comments.
Thank you.

New series attached with only these changes and a rebase.

Regards,
    Jeff Davis

Вложения

Re: Remaining dependency on setlocale()

От

Chao Li

Дата:

26 ноября, 04:50:54

> On Nov 26, 2025, at 01:20, Jeff Davis <pgsql@j-davis.com> wrote:
>
>
> New series attached with only these changes and a rebase.
>
> Regards,
> Jeff Davis
>
>
<v10-0001-Inline-pg_ascii_tolower-and-pg_ascii_toupper.patch><v10-0002-Add-define-for-UNICODE_CASEMAP_BUFSZ.patch><v10-0003-Change-some-callers-to-use-pg_ascii_toupper.patch><v10-0004-Allow-pg_locale_t-APIs-to-work-when-ctype_is_c.patch><v10-0005-Make-regex-max_chr-depend-on-encoding-not-provid.patch><v10-0006-Fix-inconsistency-between-ltree_strncasecmp-and-.patch><v10-0007-Remove-char_tolower-API.patch><v10-0008-Use-multibyte-aware-extraction-of-pattern-prefix.patch><v10-0009-fuzzystrmatch-use-pg_ascii_toupper.patch><v10-0010-downcase_identifier-use-method-table-from-locale.patch><v10-0011-Control-LC_COLLATE-with-GUC.patch>

I continued reviewing 0005-0008.

0005 - no comment. The change looks correct to me. Before the patch, pg_regex_locale->ctype->max_chr <= MAX_SIMPLE_CHR
shouldbe the more common cases, with the patch, MAX_SIMPLE_CHR >= UCHAR_MAX should be the more common cases, thus there
shouldbe not a behavior change. 

5 - 0006
```
@@ -77,10 +77,37 @@ compare_subnode(ltree_level *t, char *qn, int len, int (*cmpptr) (const char *,
 int
 ltree_strncasecmp(const char *a, const char *b, size_t s)
 {
-    char       *al = str_tolower(a, s, DEFAULT_COLLATION_OID);
-    char       *bl = str_tolower(b, s, DEFAULT_COLLATION_OID);
+    static pg_locale_t locale = NULL;
+    size_t        al_sz = s + 1;
+    char       *al = palloc(al_sz);
+    size_t        bl_sz = s + 1;
+    char       *bl = palloc(bl_sz);
+    size_t        needed;
     int            res;

+    if (!locale)
+        locale = pg_database_locale();
+
+    needed = pg_strfold(al, al_sz, a, s, locale);
+    if (needed + 1 > al_sz)
+    {
+        /* grow buffer if needed and retry */
+        al_sz = needed + 1;
+        al = repalloc(al, al_sz);
+        needed = pg_strfold(al, al_sz, a, s, locale);
+        Assert(needed + 1 <= al_sz);
+    }
+
+    needed = pg_strfold(bl, bl_sz, b, s, locale);
+    if (needed + 1 > bl_sz)
+    {
+        /* grow buffer if needed and retry */
+        bl_sz = needed + 1;
+        bl = repalloc(bl, bl_sz);
+        needed = pg_strfold(bl, bl_sz, b, s, locale);
+        Assert(needed + 1 <= bl_sz);
+    }
+
     res = strncmp(al, bl, s);

     pfree(al);
```

I do think the new implementation has some problem.

* The retry logic implies that a single-byte char may become multiple bytes after folding, otherwise retry is not
neededbecause you have allocated s+1 bytes for dest buffers. From this perspective, we should use two needed variables:
neededAand neededB, if neededA != neededB, then the two strings are different; if neededA == neededB, then we should be
performstrncmp, but here we should pass neededA (or neededB as they are identical) to strncmp(al, bl, neededA). 

* Based on the logic you implemented in 0004, first pg_strfold() has copied as many chars as possible to dest buffer,
sowhen retry, ideally we can should resume instead of start over. However, if single-byte->multi-byte folding happens,
wehave no information to decide from where to resume. From this perspective, in 0004, do we really need to take the
try-the-beststrategy for strlower_c()? If there are some other use cases that require data to be placed in dest buffer
evenif dest buffer doesn’t have enough space, then my patch [1] of changing strlower_libc_sb() should be considered. 

6 - 0007
```
     /*
-     * For efficiency reasons, in the single byte case we don't call lower()
-     * on the pattern and text, but instead call SB_lower_char on each
-     * character.  In the multi-byte case we don't have much choice :-(. Also,
-     * ICU does not support single-character case folding, so we go the long
-     * way.
+     * For efficiency reasons, in the C locale we don't call lower() on the
+     * pattern and text, but instead call SB_lower_char on each character.
      */
```

SB_lower_char should be changed to C_IMatchText.

7 - 0007
```
/* setup to compile like_match.c for single byte case insensitive matches */
-#define MATCH_LOWER(t, locale) SB_lower_char((unsigned char) (t), locale)
+#define MATCH_LOWER
 ```

I think the comment should be updated accordingly, like “for ILIKE in the C locale”.

8 - 0008
```
+    /* for text types, use pg_wchar; for BYTEA, use char */
     if (typeid != BYTEAOID)
     {
-        patt = TextDatumGetCString(patt_const->constvalue);
-        pattlen = strlen(patt);
+        text       *val = DatumGetTextPP(patt_const->constvalue);
+        pg_wchar   *wpatt;
+        pg_wchar   *wmatch;
+        char       *match;
+
+        patt = VARDATA_ANY(val);
+        pattlen = VARSIZE_ANY_EXHDR(val);
+        wpatt = palloc((pattlen + 1) * sizeof(pg_wchar));
+        wmatch = palloc((pattlen + 1) * sizeof(pg_wchar));
+        pg_mb2wchar_with_len(patt, wpatt, pattlen);
+
+        match = palloc(pattlen + 1);
```

* match is allocated with pattlen+1 bytes, is it long enough to hold pattlen multiple-byte chars?

* match is not freed, but looks like it should be:

*prefix_const = string_to_const(match, typeid);
 -> string_to_datum(str, datatype);
 -> CStringGetTextDatum(str);
 -> cstring_to_text(s)
 -> cstring_to_text_with_len(s, strlen(s));
 -> *result = (text *) palloc(len + VARHDRSZ);

So, match has been copied, it should be freed.

9 - 0008
```
-    }
+            /* Backslash escapes the next character */
+            if (patt[pos] == '\\')
+            {
+                pos++;
+                if (pos >= pattlen)
+                    break;
+            }

-    match[match_pos] = '\0';
+            match[match_pos++] = pos;
+        }
```

Should “pos” be “part[pos]” assigning to match[match_pos++]?

I will review the rest 3 commits tomorrow.

[1] https://postgr.es/m/CAEoWx2m9mUN397neL=p9x0vaVcj5EGiKD53F1MNTwTDXizxiaA@mail.gmail.com

--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

Re: Remaining dependency on setlocale()

От

Chao Li

Дата:

27 ноября, 04:08:16


> On Nov 26, 2025, at 09:50, Chao Li <li.evan.chao@gmail.com> wrote:
>
> I will review the rest 3 commits tomorrow.

10 - 0009
```
 {
     if (isalpha((unsigned char) c))
     {
-        c = toupper((unsigned char) c);
+        c = pg_ascii_toupper((unsigned char) c);
```

Just curious. As isaplha() and toupper() come from the same header file ctype.h, if we replace toupper with
pg_ascii_toupper,does isapha also need to be handled? 

11 - 0010
```
-    for (i = 0; i < len; i++)
-    {
-        unsigned char ch = (unsigned char) ident[i];
+    dstsize = len + 1;
+    result = palloc(dstsize);

-        if (ch >= 'A' && ch <= 'Z')
-            ch += 'a' - 'A';
-        else if (enc_is_single_byte && IS_HIGHBIT_SET(ch) && isupper(ch))
-            ch = tolower(ch);
-        result[i] = (char) ch;
-    }
-    result[i] = '\0';
+    needed = pg_strfold_ident(result, dstsize, ident, len);
+    Assert(needed + 1 == dstsize);
+    Assert(needed == len);
```

I think assert both dstsize and len are redundant, because dstsize=len+1, and no place to change their values.

12 - 0010
```
+/*
+ * Fold an identifier using the database default locale.
+ *
+ * For historical reasons, does not use ordinary locale behavior. Should only
+ * be used for identifier folding. XXX: can we make this equivalent to
+ * pg_strfold(..., default_locale)?
+ */
+size_t
+pg_strfold_ident(char *dest, size_t destsize, const char *src, ssize_t srclen)
+{
+    if (default_locale == NULL || default_locale->ctype == NULL)
+    {
+        int            i;
+
+        for (i = 0; i < srclen && i < destsize; i++)
+        {
+            unsigned char ch = (unsigned char) src[i];
+
+            if (ch >= 'A' && ch <= 'Z')
+                ch += 'a' - 'A';
+            dest[i] = (char) ch;
+        }
+
+        if (i < destsize)
+            dest[i] = '\0';
+
+        return srclen;
+    }
+    return default_locale->ctype->strfold_ident(dest, destsize, src, srclen,
+                                                default_locale);
+}
```

Given default_local can be NULL only at some specific moment, can we do something like

Local = default_local;
If (local == NULL || local->ctype == NULL)
  Local = libc or other fallback;
Return default_locale->ctype->strfold_ident(dest, destsize, src, srclen, local);

This way avoids the duplicate code.

13 - 0011
```
+{ name => 'lc_collate', type => 'string', context => 'PGC_SUSET', group => 'CLIENT_CONN_LOCALE',
+  short_desc => 'Sets the locale for text ordering in extensions.',
```

I just feel the GUC name is very misleading. Without carefully reading the doc, users may very easy to consider
lc_collatethe system’s locale. If it only affects extensions, then let’s name it accordingly, for example,
“extension_lc_collate”,or “legacy_lc_collate”. 

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

Re: Remaining dependency on setlocale()

От

Jeff Davis

Дата:

29 ноября, 23:50:56

On Wed, 2025-11-26 at 09:50 +0800, Chao Li wrote:
> * The retry logic implies that a single-byte char may become multiple
> bytes after folding, otherwise retry is not needed because you have
> allocated s+1 bytes for dest buffers. From this perspective, we
> should use two needed variables: neededA and neededB, if neededA !=
> neededB, then the two strings are different; if neededA == neededB,
> then we should be perform strncmp, but here we should pass neededA
> (or neededB as they are identical) to strncmp(al, bl, neededA).

Thank you.

It's actually worse than that: having a single 's' argument is just
completely wrong. Consider:

  a: U&'x\0394\0394\0394'
  b: U&'\0394\0394\0394'

There is no value for byte length 's' for which both 'a' and 'b' are
properly-encoded strings. So, the current code passes invalid byte
sequences to LOWER(), which is another pre-existing bug.

ltree_strncasecmp() is only used for checking equality of the first s
bytes of the query, so let's make it a safer API that just checks
prefix equality. Attached.

> * Based on the logic you implemented in 0004, first pg_strfold() has
> copied as many chars as possible to dest buffer, so when retry,
> ideally we can should resume instead of start over. However, if
> single-byte->multi-byte folding happens, we have no information to
> decide from where to resume.

Right.

That suggests that we might want some kind of lazy or iterator-based
API for string folding. We'd need to find the right way to do that with
ICU. If we find that it's a performance problem somewhere, we can look
into that. Do you think we need that now?

>  From this perspective, in 0004, do we really need to take the try-
> the-best strategy for strlower_c()? If there are some other use cases
> that require data to be placed in dest buffer even if dest buffer
> doesn’t have enough space, then my patch [1] of changing
> strlower_libc_sb() should be considered.

I will look into that.

> SB_lower_char should be changed to C_IMatchText.

Updated comment.

> I think the comment should be updated accordingly, like “for ILIKE in
> the C locale”.

Done, thank you.

> * match is allocated with pattlen+1 bytes, is it long enough to hold
> pattlen multiple-byte chars?
>
> * match is not freed, but looks like it should be:

...

> Should “pos” be “part[pos]” assigning to match[match_pos++]?

All fixed, thank you! (I apologize for posting a patch in that state to
begin with...)

I also reorganized slightly to separate out the pg_iswcased() API into
its own patch, and moved the like_support.c changes from the ctype_is_c
patch (already committed: 1476028225) into the pattern prefixes patch.

Regards,
    Jeff Davis

Вложения

Re: Remaining dependency on setlocale()

От

Jeff Davis

Дата:

30 ноября, 00:04:00

On Thu, 2025-11-27 at 09:08 +0800, Chao Li wrote:
> > On Nov 26, 2025, at 09:50, Chao Li <li.evan.chao@gmail.com> wrote:
> >
> > I will review the rest 3 commits tomorrow.
>
> 10 - 0009
>
> Just curious. As isaplha() and toupper() come from the same header
> file ctype.h, if we replace toupper with pg_ascii_toupper, does
> isapha also need to be handled?

OK.

What do you think about the change overall? Is fuzzystrmatch inherently
ASCII-based? Does it cause behavior changes aside from soundex()? Does
the behavior change in soundex() matter?

> 11 - 0010
>
> I think assert both dstsize and len are redundant, because
> dstsize=len+1, and no place to change their values.

OK.

What do you think of the change overall?

>
> If (local == NULL || local->ctype == NULL)
>   Local = libc or other fallback;
> Return default_locale->ctype->strfold_ident(dest, destsize, src,
> srclen, local);
>
> This way avoids the duplicate code.

OK. The fallback would still be ASCII though, right?

> I just feel the GUC name is very misleading. Without carefully
> reading the doc, users may very easy to consider lc_collate the
> system’s locale. If it only affects extensions, then let’s name it
> accordingly, for example, “extension_lc_collate”, or
> “legacy_lc_collate”.

It is the system locale, it's just that we won't be using the system
locale for most purposes, so it has very little effect: PLs,
extensions, and libraries used by extensions that happen to rely on the
system locale. That is a bit confusing, which is why I previously just
set LC_COLLATE=C. This patch addresses Daniel's concern that people
might still want lc_collate set to something other than C. I'm not sure
we want this patch, it's just a POC.

I didn't attach a new series here yet, but will after some of the
earlier patches get committed.

Regards,
    Jeff Davis

Re: Remaining dependency on setlocale()

От

Peter Eisentraut

Дата:

05 декабря, 18:01:35

On 29.11.25 21:50, Jeff Davis wrote:
> All fixed, thank you! (I apologize for posting a patch in that state to
> begin with...)
> 
> I also reorganized slightly to separate out the pg_iswcased() API into
> its own patch, and moved the like_support.c changes from the ctype_is_c
> patch (already committed: 1476028225) into the pattern prefixes patch.

I reviewed the v11 patches.  But I wasn't able to apply them locally 
(couldn't find a starting commit where they applied cleanly), so I 
haven't tested them.

Patches 0001 through 0006 seem generally ok, with some small comments:

v11-0003-Fix-inconsistency-between-ltree_strncasecmp-and-.patch

The function comment reads "Check if b has a prefix of a." -- Is that 
the same as "Check if a is a prefix of b."?  The latter might be clearer.

v11-0004-Remove-char_tolower-API.patch

The updated comment reads

+        * For efficiency reasons, in the C locale we don't call lower() 
on the
+        * pattern and text, but instead call SB_lower_char on each 
character.

but the patch removes SB_lower_char().

v11-0006-Use-multibyte-aware-extraction-of-pattern-prefix.patch

Might have a small typo in the commit message:

; and preserve and char-at-a-time logic for bytea.

For the remaining patches I have some more substantial questions.

v11-0007-fuzzystrmatch-use-pg_ascii_toupper.patch

dmetaphone.c has a comment

case '\xc7':        /* C with cedilla */

so the premise that "fuzzystrmatch is designed for ASCII" does not
appear to be correct.  Needs more analysis.

(But apparently it's not multibyte aware at all, so I don't know what to 
do about that.)

v11-0008-downcase_identifier-use-method-table-from-locale.patch

I'm confused here about the name of the function pg_strfold_ident().  In 
general, case "folding" results in an opaque string that is really only 
useful for comparing against other case-folded strings.  But for 
identifiers we are actually interested lower-casing.  I think this 
should be corrected in the API naming.

v11-0009-Control-LC_COLLATE-with-GUC.patch

I know there were some complaints about compatibility with extensions, 
but I don't think anything concrete was presented.  I would like to see 
more evidence that we need this.

Also, recall that we used to have a lc_collate GUC, and in the end 
people got confused that it didn't actually show a meaningful value when 
you used ICU.  So we removed that.  It seems adding this back in would 
create a similar kind of confusion.  So to avoid that, maybe this should 
be called fallback_lc_collate or something like that.

If we were to proceed with this patch, it should have some documentation 
and tests.

Re: Remaining dependency on setlocale()

От

Jeff Davis

Дата:

12 декабря, 23:11:40

On Fri, 2025-12-05 at 16:01 +0100, Peter Eisentraut wrote:
> v11-0003-Fix-inconsistency-between-ltree_strncasecmp-and-.patch
>
> The function comment reads "Check if b has a prefix of a." -- Is that
> the same as "Check if a is a prefix of b."?  The latter might be
> clearer.

Yes, fixed.

Note: I separated this into two patches. 0003 fixes the multibyte
mishandling issue, and 0004 consistently performs case folding. 0003 is
backpatchable, I believe.

> but the patch removes SB_lower_char().

Fixed and committed.

> v11-0006-Use-multibyte-aware-extraction-of-pattern-prefix.patch
>
> Might have a small typo in the commit message:
>
> ; and preserve and char-at-a-time logic for bytea.

Fixed.

I also changed it into two functions: like_fixed_prefix(), which is
almost unchanged from the original; and like_fixed_prefix_ci(), which
is multibyte and locale-aware. It was too confusing to have single-byte
and multi-byte logic in the same function, and they didn't share much
code anyway.

> case '\xc7':        /* C with cedilla */
>
> so the premise that "fuzzystrmatch is designed for ASCII" does not
> appear to be correct.  Needs more analysis.
>
> (But apparently it's not multibyte aware at all, so I don't know what
> to
> do about that.)

I didn't notice that, thank you. Agreed, we need a bit more discussion
around this case as well as soundex().

> v11-0008-downcase_identifier-use-method-table-from-locale.patch
>
> I'm confused here about the name of the function pg_strfold_ident(). 
> In
> general, case "folding" results in an opaque string that is really
> only
> useful for comparing against other case-folded strings.  But for
> identifiers we are actually interested lower-casing.  I think this
> should be corrected in the API naming.

Agreed and fixed.

Also, I added 0006, which saves a locale_t object for ICU in this one
case where it's required. Surely that's not what we want in the long
term, but we don't have the infrastructure for decoding pg_wchar into
code points yet, and 0006 avoids the dependency on the global LC_CTYPE
setting.

> v11-0009-Control-LC_COLLATE-with-GUC.patch
>
> I know there were some complaints about compatibility with
> extensions,
> but I don't think anything concrete was presented.  I would like to
> see
> more evidence that we need this.
>
> Also, recall that we used to have a lc_collate GUC, and in the end
> people got confused that it didn't actually show a meaningful value
> when
> you used ICU.  So we removed that.  It seems adding this back in
> would
> create a similar kind of confusion.  So to avoid that, maybe this
> should
> be called fallback_lc_collate or something like that.

Yes, this is a POC patch and needs more discussion.

What are your thoughts about a similar lc_ctype GUC, though? That has
slightly different trade-offs.

I believe v12 0001-0005 are about ready for commit, and 0003 should be
backported.

Regards,
    Jeff Davis

Вложения

Re: Remaining dependency on setlocale()

От

Chao Li

Дата:

13 декабря, 12:48:13


> On Dec 13, 2025, at 04:11, Jeff Davis <pgsql@j-davis.com> wrote:
>
> I believe v12 0001-0005 are about ready for commit, and 0003 should be
> backported.

I quickly went through 0001-0005, and got a few nitpicks:

1 - 0001
```
+    int            match_mblen pg_attribute_unused();
```

Why this variable is marked unused? It’s actually used.

2 - 0002

I did a search and found one place that you missed at line 181 in pg_locale.h
```
extern bool char_is_cased(char ch, pg_locale_t locale);
```

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

Re: Remaining dependency on setlocale()

От

Jeff Davis

Дата:

15 декабря, 22:34:01

On Sat, 2025-12-13 at 17:48 +0800, Chao Li wrote:
> 1 - 0001
> ```
> +    int            match_mblen pg_attribute_unused();
> ```
>
> Why this variable is marked unused? It’s actually used.

Fixed and committed.

I originally marked it unused because I just had an:

   Assert(match[match_mblen] == '\0');

but I changed it to just set it:

   match[match_mblen] = '\0';

for clarity, even though I think the underlying API guarantees NUL-
termination.

> 2 - 0002
>
> I did a search and found one place that you missed at line 181 in
> pg_locale.h
> ```
> extern bool char_is_cased(char ch, pg_locale_t locale);
> ```

Thank you, fixed and committed.

I'll continue committing v12 0003-0005. Note that I'm planning to
backport 0003.

I'm also inclined to move forward with 0006. It's not a long-term
solution, but it solves an instance of the problem under discussion in
this thread (removes setlocale() dependency) and is a narrow fix.

After that, remaining loose ends:

* 0007: Can we just rip out the non-ASCII support? If it's gone all
this time without UTF8 support, and there's no suggestion about what
the non-ASCII behavior should be (in docs or source code), then it's
probably not terribly important.

* Use of pg_strncasecmp() in the backend, which uses tolower() to do
case-insensitive matching of command options, e.g. "if
(pg_strcasecmp(a, "plain") == 0)" to check for PLAIN storage in CREATE
TYPE.

* strerror(): either consider an lc_ctype GUC, or the approach over
here[1].

* Daniel suggests that we need some way to set LC_COLLATE for
extensions/dependencies.

* address calls to pg_tolower in datetime.c and tzparser.c -- can those
just be pg_ascii_tolower()?

* examine remaining isalpha(), etc., calls in the backend

Regards,
    Jeff Davis

[1]
https://www.postgresql.org/message-id/flat/90f176c5b85b9da26a3265b2630ece3552068566.camel@j-davis.com

Re: Remaining dependency on setlocale()

От

Chao Li

Дата:

16 декабря, 04:32:09


> On Dec 16, 2025, at 03:34, Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Sat, 2025-12-13 at 17:48 +0800, Chao Li wrote:
>> 1 - 0001
>> ```
>> + int match_mblen pg_attribute_unused();
>> ```
>>
>> Why this variable is marked unused? It’s actually used.
>
> Fixed and committed.
>
> I originally marked it unused because I just had an:
>
>   Assert(match[match_mblen] == '\0');
>
> but I changed it to just set it:
>
>   match[match_mblen] = '\0';
>
> for clarity, even though I think the underlying API guarantees NUL-
> termination.
>
>> 2 - 0002
>>
>> I did a search and found one place that you missed at line 181 in
>> pg_locale.h
>> ```
>> extern bool char_is_cased(char ch, pg_locale_t locale);
>> ```
>
> Thank you, fixed and committed.
>
>
> I'll continue committing v12 0003-0005. Note that I'm planning to
> backport 0003.

I have re-reviewed 0003-0005 last week, they all look good to me.

I have no comment on backport 0003.

>
> I'm also inclined to move forward with 0006. It's not a long-term
> solution, but it solves an instance of the problem under discussion in
> this thread (removes setlocale() dependency) and is a narrow fix.
>
>
> After that, remaining loose ends:
>
> * 0007: Can we just rip out the non-ASCII support? If it's gone all
> this time without UTF8 support, and there's no suggestion about what
> the non-ASCII behavior should be (in docs or source code), then it's
> probably not terribly important.
>
> * Use of pg_strncasecmp() in the backend, which uses tolower() to do
> case-insensitive matching of command options, e.g. "if
> (pg_strcasecmp(a, "plain") == 0)" to check for PLAIN storage in CREATE
> TYPE.
>
> * strerror(): either consider an lc_ctype GUC, or the approach over
> here[1].
>
> * Daniel suggests that we need some way to set LC_COLLATE for
> extensions/dependencies.
>
> * address calls to pg_tolower in datetime.c and tzparser.c -- can those
> just be pg_ascii_tolower()?
>
> * examine remaining isalpha(), etc., calls in the backend
>
> Regards,
> Jeff Davis
>
> [1]
> https://www.postgresql.org/message-id/flat/90f176c5b85b9da26a3265b2630ece3552068566.camel@j-davis.com
>

I just reviewed 0006-0007. Only got one comment on 0006:

```
@@ -306,6 +342,7 @@ create_pg_locale_icu(Oid collid, MemoryContext context)
     result = MemoryContextAllocZero(context, sizeof(struct pg_locale_struct));
     result->icu.locale = MemoryContextStrdup(context, iculocstr);
     result->icu.ucol = collator;
+    result->icu.lt = loc;
```

The old code didn’t create a locale object and store in result, thus it didn’t have a logic to free the created locale.
Thispatch now dose that, but I don’t see where the created locale object is free-ed. I suppose newlocale() will
allocatememory from the OS, so I guess the memory should be free-ed somewhere. 

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

Re: Remaining dependency on setlocale()

От

Jeff Davis

Дата:

16 декабря, 23:04:54

On Tue, 2025-12-16 at 09:32 +0800, Chao Li wrote:
> I have re-reviewed 0003-0005 last week, they all look good to me.
>
> I have no comment on backport 0003.

Committed 0003 and backported to 14.

Committing 0004 also. For the archives, the bug in that case is:

  -- generate some randomly-cased non-ASCII data
  CREATE DATABASE i TEMPLATE template0 LOCALE 'C'
    LOCALE_PROVIDER 'icu' ICU_LOCALE 'en';
  \c i
  CREATE EXTENSION ltree;
  CREATE TABLE test(path ltree);
  CREATE FUNCTION gen() RETURNS TEXT LANGUAGE plpgsql AS $$
    declare
      s TEXT;
    begin
      s := '';
      for i in 1..5 loop
        s := s || case when random() > 0.5 then lower(U&'\00C1') else
U&'\00C1' end;
        s := s || case when random() > 0.5 then lower(U&'\00C9') else
U&'\00C9' end;
        s := s || case when random() > 0.5 then lower(U&'\00CD') else
U&'\00CD' end;
        s := s || case when random() > 0.5 then lower(U&'\00D3') else
U&'\00D3' end;
        s := s || case when random() > 0.5 then lower(U&'\00DA') else
U&'\00DA' end;
      end loop;
      return s;
    end;
  $$;
  INSERT INTO test select ('a.'||gen()||'.z')::ltree
    FROM generate_series(1,10000);
  CREATE INDEX test_idx ON test USING gist (path);
  -- returns 10000
  SET enable_seqscan = true;
  SET enable_indexscan = false;
  SET enable_bitmapscan = false;
  SELECT COUNT(*) FROM test
    WHERE path ~ U&'a.áéíóúáéíóúáéíóúáéíóúáéíóú@.z'::lquery;
  -- returns fewer tuples when using index scan
  SET enable_seqscan = false;
  SET enable_indexscan = true;
  SET enable_bitmapscan = true;
  SELECT COUNT(*) FROM test
    WHERE path ~ U&'a.áéíóúáéíóúáéíóúáéíóúáéíóú@.z'::lquery;

Probably a smaller case would do, but I think it requires page splits
to hit the bug. 0004 fixes the bug.

> The old code didn’t create a locale object and store in result, thus
> it didn’t have a logic to free the created locale. This patch now
> dose that, but I don’t see where the created locale object is free-
> ed. I suppose newlocale() will allocate memory from the OS, so I
> guess the memory should be free-ed somewhere.

The pg_locale_t objects are cached for the life of the backend, and
never freed. We may want to change that eventually, but in practice
it's not much of a problem.

Regards,
    Jeff Davis

Re: Remaining dependency on setlocale()

От

Jeff Davis

Дата:

17 декабря, 00:19:12

On Tue, 2025-12-16 at 12:04 -0800, Jeff Davis wrote:
> Probably a smaller case would do, but I think it requires page splits
> to hit the bug. 0004 fixes the bug.

Because it's a clear bug, I elected to backport 0004 to v18, where the
casefolding APIs were introduced. It's a bug before that as well, but
backporting further would be more complex and/or invasive.

Regards,
    Jeff Davis

Re: Remaining dependency on setlocale()

От

Peter Eisentraut

Дата:

17 декабря, 13:39:05

On 12.12.25 21:11, Jeff Davis wrote:
>> case '\xc7':        /* C with cedilla */
>>
>> so the premise that "fuzzystrmatch is designed for ASCII" does not
>> appear to be correct.  Needs more analysis.
>>
>> (But apparently it's not multibyte aware at all, so I don't know what
>> to
>> do about that.)
> I didn't notice that, thank you. Agreed, we need a bit more discussion
> around this case as well as soundex().

Soundex is an ASCII-only algorithm, there is no expectation that the 
algorithm does anything useful with non-ASCII characters, and it doesn't 
do so now.  So I think using pg_ascii_toupper() is ok.  (Users could for 
example use unaccent to preprocess text.)

One might wonder if the presence of non-ASCII characters should be an 
error, but that doesn't have to be the subject of this thread.  I 
noticed that the Wikipedia page for Soundex even calls out PostgreSQL 
for doing things slightly different than everyone else, but I haven't 
studied the details.

For Metaphone, I found the reference implementation linked from its 
Wikipedia page, and it looks like our implementation is pretty closely 
aligned to that.  That reference implementation also contains the 
C-with-cedilla case explicitly.  The correct fix here would probably be 
to change the implementation to work on wide characters.  But I think 
for the moment you could try a shortcut like, use pg_ascii_toupper(), 
but if the encoding is LATIN1 (or LATIN9 or whichever other encodings 
also contain C-with-cedilla at that code point), then explicitly 
uppercase that one as well.  This would preserve the existing behavior.

Note that the documentation calls out: "At present, the soundex, 
metaphone, dmetaphone, and dmetaphone_alt functions do not work well 
with multibyte encodings (such as UTF-8)."

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Remaining dependency on setlocale()

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения