Discussion: Re: Update Unicode data to Unicode 16.0.0

Re: Update Unicode data to Unicode 16.0.0

From: Joe Conway
On 11/11/24 01:27, Peter Eisentraut wrote:
> Here is the patch to update the Unicode data to version 16.0.0.
> 
> Normally, this would have been routine, but a few months ago there was
> some debate about how this should be handled. [0]  AFAICT, the consensus
> was to go ahead with it, but I just wanted to notify it here to be clear.
> 
> [0]:
> https://www.postgresql.org/message-id/flat/d75d2d0d1d2bd45b2c332c47e3e0a67f0640b49c.camel%40j-davis.com

I ran a check and found that this patch causes changes in upper casing 
of some characters. Repro:

setup
8<-------------
wget https://joeconway.com/presentations/formated-unicode.txt
initdb
psql
CREATE DATABASE builtincoll
  LOCALE_PROVIDER builtin
  BUILTIN_LOCALE 'C.UTF-8'
  TEMPLATE template0;
\c builtincoll
CREATE TABLE unsorted_table(strings text);
\copy unsorted_table from formated-unicode.txt (format csv)
VACUUM FREEZE ANALYZE unsorted_table;
8<-------------


8<-------------
-- on master
builtincoll=# WITH t AS (SELECT lower(strings) AS s FROM unsorted_table 
ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
                md5
----------------------------------
  7ec7f5c2d8729ec960942942bb82aedd
(1 row)

builtincoll=# WITH t AS (SELECT upper(strings) AS s FROM unsorted_table 
ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
                md5
----------------------------------
  97f83a4d1937aa65bcf8be134bf7b0c4
(1 row)

builtincoll=# WITH t AS (SELECT initcap(strings) AS s FROM 
unsorted_table ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
                md5
----------------------------------
  8cf65a43affc221f3a20645ef402085e
(1 row)
8<-------------


8<-------------
-- master+patch
builtincoll=# WITH t AS (SELECT lower(strings) AS s FROM unsorted_table 
ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
                md5
----------------------------------
  7ec7f5c2d8729ec960942942bb82aedd
(1 row)

Time: 19858.981 ms (00:19.859)
builtincoll=# WITH t AS (SELECT upper(strings) AS s FROM unsorted_table 
ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
                md5
----------------------------------
  3055b3d5dff76c8c1250ef500c6ec13f
(1 row)

Time: 19774.467 ms (00:19.774)
builtincoll=# WITH t AS (SELECT initcap(strings) AS s FROM 
unsorted_table ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
                md5
----------------------------------
  9985acddf7902ea603897cdaccd02114
(1 row)
8<-------------

So both UPPER and INITCAP produce different results unless I am missing 
something.
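
To see which strings actually changed, rather than only that the checksums differ, the per-row results can be exported on each build and compared. This is a minimal sketch that reuses unsorted_table from the setup above; the output file names are illustrative assumptions, not part of the original repro.

```sql
-- Run on master:
\copy (SELECT strings, upper(strings), initcap(strings) FROM unsorted_table ORDER BY 1) TO 'casing-master.csv' (format csv)

-- Run again on master+patch, writing to a second file:
\copy (SELECT strings, upper(strings), initcap(strings) FROM unsorted_table ORDER BY 1) TO 'casing-patched.csv' (format csv)
```

Comparing the two files with an ordinary diff tool should then list the affected rows.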

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Update Unicode data to Unicode 16.0.0

From: Laurenz Albe
On Mon, 2024-11-11 at 14:52 -0500, Joe Conway wrote:
> On 11/11/24 01:27, Peter Eisentraut wrote:
> > Here is the patch to update the Unicode data to version 16.0.0.
> >
> > Normally, this would have been routine, but a few months ago there was
> > some debate about how this should be handled. [0]  AFAICT, the consensus
> > was to go ahead with it, but I just wanted to notify it here to be clear.
> >
> > [0]:
> > https://www.postgresql.org/message-id/flat/d75d2d0d1d2bd45b2c332c47e3e0a67f0640b49c.camel%40j-davis.com
>
> I ran a check and found that this patch causes changes in upper casing
> of some characters.

I want to reiterate what I said in the above thread:
If that means that indexes on strings using the "builtin" collation
provider need to be reindexed after an upgrade, I am very much against it.

From my experiences in the field, I consider this need to rebuild indexes
one of the greatest current problems for the usability of PostgreSQL.
I dare say that most people would prefer living with an outdated Unicode version.

Yours,
Laurenz Albe



Re: Update Unicode data to Unicode 16.0.0

From: Peter Eisentraut
On 12.11.24 10:40, Laurenz Albe wrote:
> On Mon, 2024-11-11 at 14:52 -0500, Joe Conway wrote:
>> On 11/11/24 01:27, Peter Eisentraut wrote:
>>> Here is the patch to update the Unicode data to version 16.0.0.
>>>
>>> Normally, this would have been routine, but a few months ago there was
>>> some debate about how this should be handled. [0]  AFAICT, the consensus
>>> was to go ahead with it, but I just wanted to notify it here to be clear.
>>>
>>> [0]:
>>> https://www.postgresql.org/message-id/flat/d75d2d0d1d2bd45b2c332c47e3e0a67f0640b49c.camel%40j-davis.com
>>
>> I ran a check and found that this patch causes changes in upper casing
>> of some characters.
> 
> I want to reiterate what I said in the above thread:
> If that means that indexes on strings using the "builtin" collation
> provider need to be reindexed after an upgrade, I am very much against it.

The practice of regularly updating the Unicode files is older than the 
builtin collation provider.  It is similar to updating the time zone 
files, the encoding conversion files, the snowball files, etc.  We need 
to move all of these things forward to keep up with the aspects of the 
real world that this data reflects.  New features are required to live 
in that environment.  If a new feature were proposed that would then 
require us to stop updating any of these files, we would likely not 
accept that, or at least need a very deliberate discussion about that 
before the feature is introduced.  This was not done here at all.  If 
this new feature has this hidden requirement, then that feature is not 
complete yet, and work should probably continue to make that feature 
complete.  But that can't take progress in other areas hostage.




Re: Update Unicode data to Unicode 16.0.0

From: Jeff Davis
On Tue, 2024-11-12 at 10:40 +0100, Laurenz Albe wrote:
> I want to reiterate what I said in the above thread:
> If that means that indexes on strings using the "builtin" collation
> provider need to be reindexed after an upgrade, I am very much
> against it.

How would you feel if there was a better way to "lock down" the
behavior using an extension?

I have a patchset here:

https://www.postgresql.org/message-id/78a1b434ff40510dc5aaabe986299a09f4da90cf.camel%40j-davis.com

that changes the implementation of collation and ctype to use method
tables rather than branching, and it also introduces some hooks that
can be used to replace the method tables with whatever you want.

Regards,
    Jeff Davis




Re: Update Unicode data to Unicode 16.0.0

From: Laurenz Albe
On Tue, 2024-11-19 at 13:42 -0800, Jeff Davis wrote:
> On Tue, 2024-11-12 at 10:40 +0100, Laurenz Albe wrote:
> > I want to reiterate what I said in the above thread:
> > If that means that indexes on strings using the "builtin" collation
> > provider need to be reindexed after an upgrade, I am very much
> > against it.
>
> How would you feel if there was a better way to "lock down" the
> behavior using an extension?

Better.

> I have a patchset here:
>
> https://www.postgresql.org/message-id/78a1b434ff40510dc5aaabe986299a09f4da90cf.camel%40j-davis.com
>
> that changes the implementation of collation and ctype to use method
> tables rather than branching, and it also introduces some hooks that
> can be used to replace the method tables with whatever you want.

That looks like a nice idea, since it obviates the need to build
PostgreSQL yourself if you want to use a non-standard copy of - say -
the ICU library.  You still have to build your own ICU library, though.

I had hoped that the builtin provider would remove the need to REINDEX,
but I have given up that hope.  Peter's argument is sound from a
conceptual point of view, even though I doubt that the average user
will be able to appreciate it.

Yours,
Laurenz Albe



Re: Update Unicode data to Unicode 16.0.0

From: Jeff Davis
On Wed, 2024-11-20 at 06:41 +0100, Laurenz Albe wrote:
> That looks like a nice idea, since it obviates the need to build
> PostgreSQL yourself if you want to use a non-standard copy of - say -
> the ICU library.  You still have to build your own ICU library, though.

It would work with the builtin provider, too, which would not require
ICU at all.

The idea is that you could build an extension that copies the same
logic for building the Unicode tables that we have in Postgres now,
except that it uses whatever version of the Unicode data files you
want.

If we want it to be targeted more specifically at the builtin provider,
we can make it even simpler by allowing you to just replace the unicode
tables with an extension (rather than the method tables). I'm not 100%
sure what people actually want here, so I'm open to suggestion.

> I had hoped that the builtin provider would remove the need to REINDEX,
> but I have given up that hope.  Peter's argument is sound from a
> conceptual point of view, even though I doubt that the average user
> will be able to appreciate it.

I'd like to provide options for all kinds of users and packagers.

Regards,
    Jeff Davis




Re: Update Unicode data to Unicode 16.0.0

From: Jeff Davis
On Fri, 2024-11-15 at 17:09 +0100, Peter Eisentraut wrote:
> The practice of regularly updating the Unicode files is older than the
> builtin collation provider.  It is similar to updating the time zone
> files, the encoding conversion files, the snowball files, etc.  We need
> to move all of these things forward to keep up with the aspects of the
> real world that this data reflects.

Should we consider bundling multiple versions of the generated tables
(header files) along with Postgres?

That would enable a compile-time option to build with an older version
of Unicode if you want, solving the packager concern that Noah raised.
It would also make it easier for people to coordinate the Postgres
version of Unicode and the ICU version of Unicode.

Regards,
    Jeff Davis




Re: Update Unicode data to Unicode 16.0.0

From: Jeremy Schneider
On Mon, 20 Jan 2025 13:39:35 -0800
Jeff Davis <pgsql@j-davis.com> wrote:

> On Fri, 2024-11-15 at 17:09 +0100, Peter Eisentraut wrote:
> > The practice of regularly updating the Unicode files is older than the
> > builtin collation provider.  It is similar to updating the time zone
> > files, the encoding conversion files, the snowball files, etc.  We need
> > to move all of these things forward to keep up with the aspects of the
> > real world that this data reflects.
>
> Should we consider bundling multiple versions of the generated tables
> (header files) along with Postgres?
>
> That would enable a compile-time option to build with an older version
> of Unicode if you want, solving the packager concern that Noah raised.
> It would also make it easier for people to coordinate the Postgres
> version of Unicode and the ICU version of Unicode.

FWIW, after adding ICU support I personally don't think there's a
pressing need to continue updating the tables anymore. I think ICU is
the best solution for people who need the latest linguistic collation
rules.

On the user side, my main concerns are the same as they've always
been: 100% confidence that Postgres updates will not corrupt any data
or cause incorrect query results, and not being forced to rebuild
everything (or logically copy data to avoid pg_upgrade). I'm at a large
company with many internal devs using Postgres in ways I don't know
about, and many users storing lots of unicode data I don't know about.

I'm working a fair bit with Docker and Kubernetes and CloudNativePG
now, so our builds come through the debian PGDG repo. Bundling multiple
tables doesn't bother me, as long as it's not a precursor to removing
current tables from the debian PGDG builds we consume in the future.

Ironically it's not really an issue yet for us on docker because
support for pg_upgrade is pretty limited at the moment.  :)  But I
think pg_upgrade support will rapidly improve in docker, and will
become common on large databases.

If Postgres does go the path of multiple tables, does the community
want to accumulate a new set of tables every year? That could add up
quickly. Maybe we don't add new tables every year, but follow the
examples of Oracle and DB2 in accumulating them on a less frequent
basis?

-Jeremy



Re: Update Unicode data to Unicode 16.0.0

From: Jeff Davis
On Mon, 2025-01-20 at 17:06 -0800, Jeremy Schneider wrote:
> FWIW, after adding ICU support I personally don't think there's a
> pressing need to continue updating the tables anymore.

I agree that it's not a pressing concern.

> If Postgres does go the path of multiple tables, does the community
> want to accumulate a new set of tables every year? That could add up
> quickly. Maybe we don't add new tables every year, but follow the
> examples of Oracle and DB2 in accumulating them on a less frequent
> basis?

Yeah, it would probably be every-other-release or something. By the
time we built up enough versions for someone to worry about, hopefully
we'd have some better systems in place to track versions and migrate
forward.

Regards,
    Jeff Davis




Re: Update Unicode data to Unicode 16.0.0

From: Jeff Davis
On Mon, 2025-01-20 at 17:06 -0800, Jeremy Schneider wrote:
> On the user side, my main concerns are the same as they've always
> been: 100% confidence that Postgres updates will not corrupt any data
> or cause incorrect query results

I'll add that, while 100% may be a good goal, it hasn't been the
standard in the past. You're talking about a new standard of
immutability starting in 18, and as Peter pointed out, I don't think
Unicode updates are the only thing we need to consider.

My personal opinion is that both positions -- to upgrade Unicode or not
-- are a bit exaggerated. On the one hand, there's no urgency to
updating Unicode; but on the other hand, there's not a huge danger, at
least compared with our historical standards.

Regards,
    Jeff Davis




Re: Update Unicode data to Unicode 16.0.0

From: Peter Eisentraut
On 21.01.25 02:06, Jeremy Schneider wrote:
> FWIW, after adding ICU support I personally don't think there's a
> pressing need to continue updating the tables anymore.

That appears to ignore what these tables are actually used for.  They 
are used for Unicode normalization, which is used by SCRAM.  So in a 
slightly hyperbolic sense, keeping these tables updated is 
security-relevant.  They are also used by psql to determine character 
width and format output correctly.

Building a collation provider on this came much later.  It was possibly 
a mistake how that was done.
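
The normalization that these tables drive can be observed directly from SQL. This is a small sketch, assuming a UTF8 server encoding, which normalize() and the IS NORMALIZED predicate require. The sample code points are chosen for illustration only. NFKC is shown because it is the form that SASLprep, and therefore SCRAM, applies to passwords.

```sql
-- U+FB01 (LATIN SMALL LIGATURE FI) has a compatibility decomposition to "fi",
-- so NFKC folds it to the two ASCII letters.
SELECT normalize(U&'\FB01', NFKC) = 'fi' AS nfkc_folds_ligature;

-- U+0041 U+030A (A followed by COMBINING RING ABOVE) composes to U+00C5
-- under NFC.
SELECT normalize(U&'\0041\030A', NFC) = U&'\00C5' AS nfc_composes;

-- The decomposed sequence is therefore not already in NFC.
SELECT U&'\0041\030A' IS NFC NORMALIZED AS already_nfc;
```

If the normalization tables lag behind the Unicode version of the text being processed, newer characters with decompositions would pass through unchanged.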




Re: Update Unicode data to Unicode 16.0.0

From: Peter Eisentraut
On 20.01.25 22:39, Jeff Davis wrote:
> On Fri, 2024-11-15 at 17:09 +0100, Peter Eisentraut wrote:
>> The practice of regularly updating the Unicode files is older than the
>> builtin collation provider.  It is similar to updating the time zone
>> files, the encoding conversion files, the snowball files, etc.  We need
>> to move all of these things forward to keep up with the aspects of the
>> real world that this data reflects.
> 
> Should we consider bundling multiple versions of the generated tables
> (header files) along with Postgres?

I wouldn't have a problem with that.

> That would enable a compile-time option to build with an older version
> of Unicode if you want, solving the packager concern that Noah raised.
> It would also make it easier for people to coordinate the Postgres
> version of Unicode and the ICU version of Unicode.

But I don't think it would be a compile-time decision.  I think it would 
be a run-time selection, similar to the theorized multiple-ICU-versions 
feature.  (Those two features might even go together, since a given ICU 
version also sort of assumes a given Unicode version.)




Re: Update Unicode data to Unicode 16.0.0

From: Jeff Davis
On Wed, 2025-01-22 at 19:08 +0100, Peter Eisentraut wrote:
> But I don't think it would be a compile-time decision.  I think it would
> be a run-time selection, similar to the theorized multiple-ICU-versions
> feature.  (Those two features might even go together, since a given ICU
> version also sort of assumes a given Unicode version.)

I am trying to get there, and the ctype methods patch is a step in that
direction, but I don't think we will have the full multi-library-versions
work in v18.

A compile-time option does have a chance for v18, and if that satisfies
the immediate concerns of packagers, then we can still update Unicode
in the default build.

Regards,
    Jeff Davis




Re: Update Unicode data to Unicode 16.0.0

From: Jeff Davis
On Wed, 2025-01-22 at 19:03 +0100, Peter Eisentraut wrote:
> Building a collation provider on this came much later.  It was possibly
> a mistake how that was done.

It wasn't a mistake. "Stability within a PG major version" was called a
*benefit* near the top of the first email on the subject[1]. It was
considered a benefit because it offered a level of stability that
neither libc nor ICU could offer. As far as I know, it's still
considered to be a benefit today by more people than not (e.g. [2]).

The concerns about Unicode updates come from a misunderstanding of the
level of stability offered in the past:

* IMMUTABLE was initially a planner concept[3], which is why it didn't
care much about dependence on GUCs for instance.

* Expression / predicate indexes rely on immutability to mean something
more strict, and for that, dependence on GUCs creates a problem[4].
(Also, partitioning.)

* It's hard to make an immutable UDF without a SET search_path clause,
but until version 17, that was such a huge performance hit that it was
not usable in an expression index. There will be a lot of not-truly-
immutable UDFs used in expression indexes for a long time.

* Ordinary text indexes rely on the collation libraries to be stable,
which is hard to control because they could be updated by the OS. Only
recently has it become barely possible to freeze the version of libc[5]
without freezing the whole OS version. And if you do manage to freeze
both libc and ICU, you risk missing security fixes.

* pg_upgrade implicitly relies on IMMUTABLE to mean something even more
strict: stability across major versions. That's a problem for
expression indexes on functions like NORMALIZE(). And, if using the
optional built-in provider, also a problem for expression indexes on
LOWER(), etc.
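
One way to gauge exposure to the last point is to list expression or partial indexes whose definitions mention case-mapping or normalization functions. This is a heuristic sketch only: it matches on the text of the index definition, and the list of function names is an assumption that may need extending for a given schema.

```sql
-- Expression or partial indexes whose definition calls a function that
-- depends on the Unicode tables (case mapping or normalization).
SELECT i.indexrelid::regclass AS index_name,
       pg_get_indexdef(i.indexrelid) AS definition
FROM pg_index AS i
WHERE (i.indexprs IS NOT NULL OR i.indpred IS NOT NULL)
  AND pg_get_indexdef(i.indexrelid)
      ~* '\m(lower|upper|initcap|normalize)\s*\(';
```

Indexes reported here are candidates for a REINDEX after a major-version upgrade that changes the Unicode data. Whether a particular index is actually affected depends on the data it stores.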

At each moment we took steps that made sense at the time and in context,
and I am not criticizing any of those steps. The biggest practical
problem was unforeseen dramatic changes in glibc that broke a lot of
text indexes. The rest of the problems are a mix of design issues,
feature interactions, and implementation details that were not resolved
before the builtin provider existed and are still not resolved today.

I do not accept the premise that there is a problem with the built-in
provider. I didn't throw caution to the wind and neither did the
reviewers: you, Daniel, Jeremy, and I did a ton of work to understand,
mitigate, and document the risks (along with a lot of help from
Thomas's earlier work). Users who opt in to the built-in provider opt
in to occasional controlled changes according to the rather strict
Unicode stability policies[6]. These policies mitigate risks
dramatically, especially for those using only assigned code points,
which can be checked with the SQL function unicode_assigned().
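
As a sketch of that check, assuming PostgreSQL 17 or later (where unicode_version(), icu_unicode_version() and unicode_assigned() are available) and a UTF8 database, and reusing unsorted_table from Joe's repro upthread purely as an example:

```sql
-- Unicode version of the built-in tables, and of ICU if the server was
-- built with ICU support (NULL otherwise).
SELECT unicode_version() AS postgres_unicode,
       icu_unicode_version() AS icu_unicode;

-- Count rows containing code points that are unassigned in the
-- built-in Unicode version.
SELECT count(*) AS rows_with_unassigned
FROM unsorted_table
WHERE NOT unicode_assigned(strings);
```

A count of zero would mean that every stored code point is assigned in the server's current Unicode version. That is the case in which the stability policies give the strongest guarantees.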

Regards,
    Jeff Davis

[1] https://www.postgresql.org/message-id/ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.camel@j-davis.com

[2] https://www.postgresql.org/message-id/3729436.1721322211%40sss.pgh.pa.us

[3] https://www.postgresql.org/message-id/3428810.1721160969%40sss.pgh.pa.us

[4]

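   -- The text form of a float4 depends on the extra_float_digits GUC,
   -- so the indexed expression is not truly immutable.
   CREATE TABLE t(f float4);
   CREATE UNIQUE INDEX t_idx ON t((f::text));
   SET extra_float_digits = 0;
   INSERT INTO t VALUES (1.23456789);
   INSERT INTO t VALUES (1.23456789); -- error: duplicate key
   SET extra_float_digits = 1;
   INSERT INTO t VALUES (1.23456789); -- success: same value, different text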

[5] https://github.com/awslabs/compat-collation-for-glibc

[6] https://www.unicode.org/policies/stability_policy.html