Обсуждение: Change initdb default to the builtin collation provider

Поиск
Список
Период
Сортировка

Change initdb default to the builtin collation provider

От
Jeff Davis
Дата:
-------
Summary
-------

The libc collation provider is a bad default[1]. The builtin collation
provider is a good default, so let's use that.

----------
Motivation
----------

The initdb default is what we choose for new postgres instances when
we have little information about the user's collation needs. The
default has no effect on upgrades, which always use the previous
instance's locale settings.

There are a number of factors to consider when choosing a default:

* Risk of primary key inconsistencies due to libc or ICU updates
* Performance
* Quality of query semantics for a variety of scripts, languages and
  regions (excluding the final result order)
* Final result ordering / display

In the absence of specific user requirements, these factors weigh
heavily in favor of the builtin collation provider, and heavily
against libc.

With the builtin provider, there's no risk of primary key or plain
index inconsistencies, the performance is great (ordering with
memcmp()), and the query semantics are based on Unicode.

-------------------
Why does it matter?
-------------------

Arguably, we could just not worry and let various service providers,
tools, scripts, packages, and wrappers make the choice independently.

But that just moves the problem -- someone still needs to make that
choice. Collectively, we have built up some knowledge about collation
here on -hackers, and it would be good to offer it as guidance. And an
initdb default is a good way to offer that guidance.

Unifying around one default also creates a more consistent, tested,
and documented Postgres experience that benefits hackers and users
alike.

--------------------------------
What's the catch? Display order.
--------------------------------

The builtin provider uses code point order, i.e. memcmp(), so the
final result display order is less human-friendly. For instance, 'Z'
comes before 'a'.

That problem is annoying, but *much* easier to fix than the other
factors. The user might add a COLLATE clause to the final ORDER BY, or
perform the sort in the application layer or presentation layer.

Other providers offer a better final display order, but it comes at a
heavy price: index inconsistencies and poor performance. Those
problems are hard to address in an existing system. Some users may be
willing to pay that price, but it should be opt-in.

Furthermore, in the default case, we don't even really know which
language and region to use. We infer it from the environment variable
LC_COLLATE at initdb time, but that's a weak signal: there's little
reason to think that the OS admin, DBA, and end user are all in the
same locale.

In general, there's little reason to think that a single locale for
display order is enough for a whole database. Often, databases are
used (directly or indirectly) by people from dozens of locales. When
per-locale display order becomes an issue, it will be necessary to add
COLLATE clauses or application logic to tailor to the end user
regardless, so the database default locale won't be useful.

For all of these reasons, display order is the wrong thing to optimize
for when the user doesn't specify anything. We should prioritize the
other factors, and for those other factors, the builtin provider is
the best.

------------
Why not ICU?
------------

ICU is better than libc in a lot of ways:

* Better performance
* Platform-independent
* Easier to manage it as a separate library

But fundamentally, I don't think it's a great default, because it
favors final result display order at the risk of primary key
inconsistencies.

------------------
Other Alternatives
------------------

In a previous thread[1], I laid out some alternatives. If someone
disagrees with this proposal, please choose one of those or suggest a
new one.

The most interesting alternative, in my opinion, is #4, but that was
soundly rejected.

---------------------
Which builtin locale?
---------------------

All builtin locales use the exact same ordering: they sort by code
point. Code point order is stable, so primary keys and plain indexes
remain consistent across upgrades forever.

The difference is in case conversion and character classification
semantics:

1. C: Only basic ASCII semantics which never change.

2. PG_C_UTF8: Provides "simple" Unicode semantics. In spirit, this
   is similar to the libc "C.UTF-8" locale available on Linux. It's
   also similar to the default semantics of at least one big
   commercial database, making migrations easier.

3. PG_UNICODE_FAST: Provides "full" Unicode semantics. It's more
   aligned with the SQL standard, which specifies in an example the
   uppercase of 'ß' is 'SS'.

For the latter two locales, expression and partial indexes depending
on these semantics may be subject to inconsistencies after a Unicode
update.

I propose changing the default to PG_C_UTF8 because it seems simple
and practical. However, I'm also fine with PG_UNICODE_FAST if those
affected by the "full" case mapping find it helpful. "C" is also a
possibility, but the query semantics suffer. All are better than libc.

-------
Details
-------

The mechanics of the default itself are being worked out here[2]. The
concrete proposal here is to commit those patches, and then
change DEFAULT_LOCALE_PROVIDER to be COLLPROVIDER_BUILTIN and
DEFAULT_BUILTIN_LOCALE to whatever we choose here.

Note: the builtin provider requires UTF-8, which can potentially
conflict with the LC_CTYPE. Fortunately, when the builtin provider is
being used, LC_CTPE has little effect. To further reduce the
consequences of LC_CTYPE when using the builtin provider, another
patch[3] fixes tsearch to parse based on the database default locale
rather than depending on LC_CTYPE.


Comments welcome.

Regards,
    Jeff Davis

[1]
https://www.postgresql.org/message-id/3e84e861362e971cf8c7d5e4770207d0235947e1.camel@j-davis.com
[2]
https://www.postgresql.org/message-id/7d424dc0b032b30a22220634d12377bf59524bdb.camel@j-davis.com
[3]
https://www.postgresql.org/message-id/0151ad01239e2cc7b3139644358cf8f7b9622ff7.camel@j-davis.com




Re: Change initdb default to the builtin collation provider

От
Peter Eisentraut
Дата:
On 11.10.25 02:48, Jeff Davis wrote:
> The builtin provider uses code point order, i.e. memcmp(), so the
> final result display order is less human-friendly. For instance, 'Z'
> comes before 'a'.
> 
> That problem is annoying, but*much* easier to fix than the other
> factors. The user might add a COLLATE clause to the final ORDER BY, or
> perform the sort in the application layer or presentation layer.

I remain violently opposed to this idea.  I don't understand how it 
could be acceptable to just not provide a good display order by default 
and have everyone rewrite their queries.

> ICU is better than libc in a lot of ways:
> 
> * Better performance
> * Platform-independent
> * Easier to manage it as a separate library
> 
> But fundamentally, I don't think it's a great default, because it
> favors final result display order at the risk of primary key
> inconsistencies.

I don't understand.  We have a versioning system for ICU collations? 
Does it not work?



Re: Change initdb default to the builtin collation provider

От
Jeff Davis
Дата:
On Fri, 2025-10-17 at 17:23 +0200, Peter Eisentraut wrote:
> I remain violently opposed to this idea.  I don't understand how it
> could be acceptable to just not provide a good display order by
> default
> and have everyone rewrite their queries.

I assume that you favor alternative 3 listed here[1], which is to use
ICU "und" as the default. Is that correct? Or do you prefer to get the
locale from the environment at initdb time?

One thing you may not have considered is that if the provider is
builtin, a lot more users are likely to learn about and use ICU,
because they will see an unfriendly display order and try to figure out
why. Then they'll be more prepared for upgrades and more likely to see
and respond to a version mismatch.

> I don't understand.  We have a versioning system for ICU collations?
> Does it not work?

I have 27 versions of ICU installed by compiling them from source, and
I compile Postgres in my sleep, so it's fine for me.

But for the default user, who's never really considered collation until
after they are already in trouble, having inconsistent primary keys all
over the place is not a great experience. ICU is certainly better than
libc, but I still think people should approach it with non-zero
knowledge.

Regards,
    Jeff Davis

[1]
https://www.postgresql.org/message-id/3e84e861362e971cf8c7d5e4770207d0235947e1.camel@j-davis.com




Re: Change initdb default to the builtin collation provider

От
Jeff Davis
Дата:
On Fri, 2025-10-17 at 15:02 -0700, Jeff Davis wrote:
> On Fri, 2025-10-17 at 17:23 +0200, Peter Eisentraut wrote:
> > I remain violently opposed to this idea.  I don't understand how it
> > could be acceptable to just not provide a good display order by
> > default
> > and have everyone rewrite their queries.
>
> I assume that you favor alternative 3 listed here[1], which is to use
> ICU "und" as the default. Is that correct? Or do you prefer to get
> the
> locale from the environment at initdb time?

Right now we're still stuck with the worst possible default: libc. Can
you make a more concrete counter-proposal here that sorts through some
of the details?

* Should we base the ICU locale on the environment, or just default
everyone to the "und" locale?

* If ICU support is disabled, how does that affect the defaults?

* If using the environment, what happens if the locale is not supported
by ICU (in particular "C" or "C.UTF-8")?

* What would be the default encoding, or should that come from the
environment?

* The ICU provider has some weaknesses around non-UTF8 encodings
because of casts from wchar_t and the use of tolower() in
downcase_identifier(). Are those potential blockers, and if so, are
they fixable?

* Can we try harder to find an acceptable way to use memcmp() for the
indexes by default, at least primary keys, even if the database
collation is ICU? I know that I've argued for this in the past and it's
been soundly rejected[1], but some variation on this idea could be
worthy of consideration.

Regards,
    Jeff Davis

[1]
https://www.postgresql.org/message-id/b7a9f32eee8d24518f791168bc6fb653d1f95f4d.camel@j-davis.com



Re: Change initdb default to the builtin collation provider

От
Jeff Davis
Дата:
On Fri, 2025-10-10 at 17:48 -0700, Jeff Davis wrote:
> -------
> Summary
> -------
>
> The libc collation provider is a bad default[1]. The builtin
> collation
> provider is a good default, so let's use that.

The attached patches implement a more modest proposal which does not
conflict with Peter's objection about the display order:

0001: If the encoding is unspecified, and cannot be determined from the
locale (i.e. the locale is C), then use UTF-8 rather than SQL_ASCII.

0002: If the provider is unspecified, and the locale is C or C.UTF-8,
then use the builtin provider.

Motivation:

* UTF-8 seems safer than SQL_ASCII when the locale is compatible with
either.

* Whether the "C" locale uses the builtin provider or the libc provider
is mostly about the catalog representation, because the implementation
is the same. I don't have a strong motivation for this change, it just
clarifies that libc is not actually being used when the locale is "C".

* I think most users of the "C.UTF-8" locale would be better off with
the builtin provider, which benefits from important optimizations.

Note:

This would mean that "initdb --no-locale" would select UTF-8 and the
builtin provider with locale "C", whereas previously it would have
selected SQL_ASCII and the libc provider (though it didn't ever really
use libc internally). I'm not sure if others want this behavior or if
it would be surprising.

Regards,
    Jeff Davis


Вложения