Discussion: Re: Update Unicode data to Unicode 16.0.0
On 11/11/24 01:27, Peter Eisentraut wrote:
> Here is the patch to update the Unicode data to version 16.0.0.
>
> Normally, this would have been routine, but a few months ago there was
> some debate about how this should be handled. [0] AFAICT, the consensus
> was to go ahead with it, but I just wanted to notify it here to be clear.
>
> [0]:
> https://www.postgresql.org/message-id/flat/d75d2d0d1d2bd45b2c332c47e3e0a67f0640b49c.camel%40j-davis.com

I ran a check and found that this patch causes changes in upper casing
of some characters. Repro:

setup
8<-------------
wget https://joeconway.com/presentations/formated-unicode.txt
initdb
psql
CREATE DATABASE builtincoll LOCALE_PROVIDER builtin BUILTIN_LOCALE 'C.UTF-8' TEMPLATE template0;
\c builtincoll
CREATE TABLE unsorted_table(strings text);
\copy unsorted_table from formated-unicode.txt (format csv)
VACUUM FREEZE ANALYZE unsorted_table;
8<-------------

8<-------------
-- on master
builtincoll=# WITH t AS (SELECT lower(strings) AS s FROM unsorted_table ORDER BY 1) SELECT md5(string_agg(t.s,NULL)) FROM t;
               md5
----------------------------------
 7ec7f5c2d8729ec960942942bb82aedd
(1 row)

builtincoll=# WITH t AS (SELECT upper(strings) AS s FROM unsorted_table ORDER BY 1) SELECT md5(string_agg(t.s,NULL)) FROM t;
               md5
----------------------------------
 97f83a4d1937aa65bcf8be134bf7b0c4
(1 row)

builtincoll=# WITH t AS (SELECT initcap(strings) AS s FROM unsorted_table ORDER BY 1) SELECT md5(string_agg(t.s,NULL)) FROM t;
               md5
----------------------------------
 8cf65a43affc221f3a20645ef402085e
(1 row)
8<-------------

8<-------------
-- master+patch
builtincoll=# WITH t AS (SELECT lower(strings) AS s FROM unsorted_table ORDER BY 1) SELECT md5(string_agg(t.s,NULL)) FROM t;
               md5
----------------------------------
 7ec7f5c2d8729ec960942942bb82aedd
(1 row)

Time: 19858.981 ms (00:19.859)

builtincoll=# WITH t AS (SELECT upper(strings) AS s FROM unsorted_table ORDER BY 1) SELECT md5(string_agg(t.s,NULL)) FROM t;
               md5
----------------------------------
 3055b3d5dff76c8c1250ef500c6ec13f
(1 row)

Time: 19774.467 ms (00:19.774)

builtincoll=# WITH t AS (SELECT initcap(strings) AS s FROM unsorted_table ORDER BY 1) SELECT md5(string_agg(t.s,NULL)) FROM t;
               md5
----------------------------------
 9985acddf7902ea603897cdaccd02114
(1 row)
8<-------------

So both UPPER and INITCAP produce different results unless I am missing
something.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

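The check above fingerprints the output of lower(), upper() and initcap() so that two builds can be compared by hash. A rough analogue outside the database is sketched below in Python. It is an illustration under assumptions, not a reproduction of the test: Python's case methods apply full case mappings from whatever Unicode version the interpreter was built with, while the builtin C.UTF-8 locale may behave differently, and SHA-256 is swapped in for md5. The name case_fingerprint is made up for this sketch.

```python
import hashlib
import unicodedata


def case_fingerprint(method_name: str) -> str:
    """Hash the result of a str case method applied to every code point.

    method_name is "upper", "lower" or "title". Surrogates are skipped
    because they cannot be encoded as UTF-8.
    """
    digest = hashlib.sha256()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue
        mapped = getattr(chr(cp), method_name)()
        digest.update(mapped.encode("utf-8"))
    return digest.hexdigest()


if __name__ == "__main__":
    # The Unicode version is a property of the interpreter build.
    print("Unicode version:", unicodedata.unidata_version)
    for name in ("lower", "upper", "title"):
        print(name, case_fingerprint(name))
```

Running this under two interpreters built against different Unicode versions and comparing the printed hashes gives the same kind of signal as the md5 comparison: a changed hash means at least one mapping changed, without saying which one.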
On Mon, 2024-11-11 at 14:52 -0500, Joe Conway wrote:
> On 11/11/24 01:27, Peter Eisentraut wrote:
> > Here is the patch to update the Unicode data to version 16.0.0.
> >
> > Normally, this would have been routine, but a few months ago there was
> > some debate about how this should be handled. [0] AFAICT, the consensus
> > was to go ahead with it, but I just wanted to notify it here to be clear.
> >
> > [0]:
> > https://www.postgresql.org/message-id/flat/d75d2d0d1d2bd45b2c332c47e3e0a67f0640b49c.camel%40j-davis.com
>
> I ran a check and found that this patch causes changes in upper casing
> of some characters.

I want to reiterate what I said in the above thread:
If that means that indexes on strings using the "builtin" collation
provider need to be reindexed after an upgrade, I am very much against it.

From my experiences in the field, I consider this need to rebuild indexes
one of the greatest current problems for the usability of PostgreSQL.
I dare say that most people would prefer living with an outdated Unicode
version.

Yours,
Laurenz Albe

On 12.11.24 10:40, Laurenz Albe wrote:
> On Mon, 2024-11-11 at 14:52 -0500, Joe Conway wrote:
>> On 11/11/24 01:27, Peter Eisentraut wrote:
>>> Here is the patch to update the Unicode data to version 16.0.0.
>>>
>>> Normally, this would have been routine, but a few months ago there was
>>> some debate about how this should be handled. [0] AFAICT, the consensus
>>> was to go ahead with it, but I just wanted to notify it here to be clear.
>>>
>>> [0]:
>>> https://www.postgresql.org/message-id/flat/d75d2d0d1d2bd45b2c332c47e3e0a67f0640b49c.camel%40j-davis.com
>>
>> I ran a check and found that this patch causes changes in upper casing
>> of some characters.
>
> I want to reiterate what I said in the above thread:
> If that means that indexes on strings using the "builtin" collation
> provider need to be reindexed after an upgrade, I am very much against it.

The practice of regularly updating the Unicode files is older than the
builtin collation provider. It is similar to updating the time zone
files, the encoding conversion files, the snowball files, etc. We need
to move all of these things forward to keep up with the aspects of the
real world that this data reflects. New features are required to live
in that environment.

If a new feature were proposed that would then require us to stop
updating any of these files, we would likely not accept that, or at
least need a very deliberate discussion about that before the feature
is introduced. This was not done here at all. If this new feature has
this hidden requirement, then that feature is not complete yet, and work
should probably continue to make that feature complete. But that can't
take progress in other areas hostage.

On Tue, 2024-11-12 at 10:40 +0100, Laurenz Albe wrote:
> I want to reiterate what I said in the above thread:
> If that means that indexes on strings using the "builtin" collation
> provider need to be reindexed after an upgrade, I am very much
> against it.

How would you feel if there was a better way to "lock down" the
behavior using an extension?

I have a patchset here:

https://www.postgresql.org/message-id/78a1b434ff40510dc5aaabe986299a09f4da90cf.camel%40j-davis.com

that changes the implementation of collation and ctype to use method
tables rather than branching, and it also introduces some hooks that
can be used to replace the method tables with whatever you want.

Regards,
Jeff Davis

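To make "method tables rather than branching" and "hooks that replace the method tables" concrete, here is a minimal Python sketch of the general pattern. It is not the patchset's code or API: CtypeMethods, install_ctype_methods_hook, get_ctype_methods and ascii_only_upper are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class CtypeMethods:
    """A method table: one entry per operation a provider must supply."""
    upper: Callable[[str], str]
    lower: Callable[[str], str]


# Default table, standing in for whatever the server ships with.
DEFAULT_METHODS = CtypeMethods(upper=str.upper, lower=str.lower)

HookType = Callable[[str], Optional[CtypeMethods]]
_ctype_methods_hook: Optional[HookType] = None


def install_ctype_methods_hook(hook: Optional[HookType]) -> None:
    """Register (or clear) a hook that may supply a replacement table."""
    global _ctype_methods_hook
    _ctype_methods_hook = hook


def get_ctype_methods(locale: str) -> CtypeMethods:
    """Resolve the table once; callers then dispatch through it."""
    if _ctype_methods_hook is not None:
        replacement = _ctype_methods_hook(locale)
        if replacement is not None:
            return replacement
    return DEFAULT_METHODS


def ascii_only_upper(s: str) -> str:
    """A 'locked down' mapping that only touches a-z."""
    return "".join(chr(ord(c) - 32) if "a" <= c <= "z" else c for c in s)
```

With no hook installed, get_ctype_methods("any").upper("straße") goes through the default table. After install_ctype_methods_hook(lambda locale: CtypeMethods(upper=ascii_only_upper, lower=str.lower)), the same call dispatches to the replacement, and callers never branch on which provider is active.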
On Tue, 2024-11-19 at 13:42 -0800, Jeff Davis wrote:
> On Tue, 2024-11-12 at 10:40 +0100, Laurenz Albe wrote:
> > I want to reiterate what I said in the above thread:
> > If that means that indexes on strings using the "builtin" collation
> > provider need to be reindexed after an upgrade, I am very much
> > against it.
>
> How would you feel if there was a better way to "lock down" the
> behavior using an extension?

Better.

> I have a patchset here:
>
> https://www.postgresql.org/message-id/78a1b434ff40510dc5aaabe986299a09f4da90cf.camel%40j-davis.com
>
> that changes the implementation of collation and ctype to use method
> tables rather than branching, and it also introduces some hooks that
> can be used to replace the method tables with whatever you want.

That looks like a nice idea, since it obviates the need to build
PostgreSQL yourself if you want to use a non-standard copy of - say -
the ICU library. You still have to build your own ICU library, though.

I had hoped that the builtin provider would remove the need to REINDEX,
but I have given up that hope. Peter's argument is sound from a
conceptual point of view, even though I doubt that the average user
will be able to appreciate it.

Yours,
Laurenz Albe

On Wed, 2024-11-20 at 06:41 +0100, Laurenz Albe wrote:
> That looks like a nice idea, since it obviates the need to build
> PostgreSQL yourself if you want to use a non-standard copy of - say -
> the ICU library. You still have to build your own ICU library,
> though.

It would work with the builtin provider, too, which would not require
ICU at all. The idea is that you could build an extension that copies
the same logic for building the Unicode tables that we have in Postgres
now, except that it uses whatever version of the Unicode data files you
want.

If we want it to be targeted more specifically at the builtin provider,
we can make it even simpler by allowing you to just replace the unicode
tables with an extension (rather than the method tables).

I'm not 100% sure what people actually want here, so I'm open to
suggestion.

> I had hoped that the builtin provider would remove the need to
> REINDEX, but I have given up that hope. Peter's argument is sound
> from a conceptual point of view, even though I doubt that the average
> user will be able to appreciate it.

I'd like to provide options for all kinds of users and packagers.

Regards,
Jeff Davis

On Fri, 2024-11-15 at 17:09 +0100, Peter Eisentraut wrote:
> The practice of regularly updating the Unicode files is older than
> the builtin collation provider. It is similar to updating the time
> zone files, the encoding conversion files, the snowball files, etc.
> We need to move all of these things forward to keep up with the
> aspects of the real world that this data reflects.

Should we consider bundling multiple versions of the generated tables
(header files) along with Postgres?

That would enable a compile-time option to build with an older version
of Unicode if you want, solving the packager concern that Noah raised.
It would also make it easier for people to coordinate the Postgres
version of Unicode and the ICU version of Unicode.

Regards,
Jeff Davis

On Mon, 20 Jan 2025 13:39:35 -0800 Jeff Davis <pgsql@j-davis.com> wrote:
> On Fri, 2024-11-15 at 17:09 +0100, Peter Eisentraut wrote:
> > The practice of regularly updating the Unicode files is older than
> > the builtin collation provider. It is similar to updating the time
> > zone files, the encoding conversion files, the snowball files, etc.
> > We need to move all of these things forward to keep up with the
> > aspects of the real world that this data reflects.
>
> Should we consider bundling multiple versions of the generated tables
> (header files) along with Postgres?
>
> That would enable a compile-time option to build with an older version
> of Unicode if you want, solving the packager concern that Noah raised.
> It would also make it easier for people to coordinate the Postgres
> version of Unicode and the ICU version of Unicode.

FWIW, after adding ICU support I personally don't think there's a
pressing need to continue updating the tables anymore. I think ICU is
the best solution for people who need the latest linguistic collation
rules.

On the user side, my main concerns are the same as they've always been:
100% confidence that Postgres updates will not corrupt any data or cause
incorrect query results, and not being forced to rebuild everything (or
logically copy data to avoid pg_upgrade). I'm at a large company with
many internal devs using Postgres in ways I don't know about, and many
users storing lots of unicode data I don't know about.

I'm working a fair bit with Docker and Kubernetes and CloudNativePG now,
so our builds come through the debian PGDG repo. Bundling multiple
tables doesn't bother me, as long as it's not a precursor to removing
current tables from the debian PGDG builds we consume in the future.

Ironically it's not really an issue yet for us on docker because support
for pg_upgrade is pretty limited at the moment. :) But I think
pg_upgrade support will rapidly improve in docker, and will become
common on large databases.

If Postgres does go the path of multiple tables, does the community want
to accumulate a new set of tables every year? That could add up quickly.
Maybe we don't add new tables every year, but follow the examples of
Oracle and DB2 in accumulating them on a less frequent basis?

-Jeremy

On Mon, 2025-01-20 at 17:06 -0800, Jeremy Schneider wrote:
> FWIW, after adding ICU support I personally don't think there's a
> pressing need to continue updating the tables anymore.

I agree that it's not a pressing concern.

> If Postgres does go the path of multiple tables, does the community
> want to accumulate a new set of tables every year? That could add up
> quickly. Maybe we don't add new tables every year, but follow the
> examples of Oracle and DB2 in accumulating them on a less frequent
> basis?

Yeah, it would probably be every-other-release or something. By the
time we built up enough versions for someone to worry about, hopefully
we'd have some better systems in place to track versions and migrate
forward.

Regards,
Jeff Davis

On Mon, 2025-01-20 at 17:06 -0800, Jeremy Schneider wrote:
> On the user side, my main concerns are the same as they've always
> been: 100% confidence that Postgres updates will not corrupt any data
> or cause incorrect query results

I'll add that, while 100% may be a good goal, it hasn't been the
standard in the past. You're talking about a new standard of
immutability starting in 18, and as Peter pointed out, I don't think
Unicode updates are the only thing we need to consider.

My personal opinion is that both positions -- to upgrade Unicode or
not -- are a bit exaggerated. On the one hand, there's no urgency to
updating Unicode; but on the other hand, there's not a huge danger, at
least compared with our historical standards.

Regards,
Jeff Davis

On 21.01.25 02:06, Jeremy Schneider wrote:
> FWIW, after adding ICU support I personally don't think there's a
> pressing need to continue updating the tables anymore.

That appears to ignore what these tables are actually used for. They
are used for Unicode normalization, which is used by SCRAM. So in a
slightly hyperbolic sense, keeping these tables updated is
security-relevant. They are also used by psql to determine character
width and format output correctly.

Building a collation provider on this came much later. It was possibly
a mistake how that was done.

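As an illustration of why normalization matters for password handling: SASLprep (RFC 4013), which SCRAM uses for passwords, applies Unicode normalization form KC. The Python sketch below shows only that normalization step, using the standard unicodedata module; it is not PostgreSQL's implementation, it omits the mapping and prohibited-character checks of SASLprep, and normalize_password is a name invented for this example.

```python
import unicodedata


def normalize_password(password: str) -> str:
    """Apply NFKC, the normalization form that SASLprep specifies.

    Only the normalization step is shown here.
    """
    return unicodedata.normalize("NFKC", password)


# U+2168 ROMAN NUMERAL NINE has a compatibility decomposition to "IX".
assert normalize_password("\u2168") == "IX"

# "é" as one code point, or as "e" plus a combining acute accent,
# normalizes to the same string, so both spellings authenticate alike.
assert normalize_password("caf\u00e9") == normalize_password("cafe\u0301")
```

Since the result depends on the normalization data in use, two sides working from different Unicode versions could in principle disagree about a string containing recently assigned code points.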
On 20.01.25 22:39, Jeff Davis wrote:
> On Fri, 2024-11-15 at 17:09 +0100, Peter Eisentraut wrote:
>> The practice of regularly updating the Unicode files is older than
>> the builtin collation provider. It is similar to updating the time
>> zone files, the encoding conversion files, the snowball files, etc.
>> We need to move all of these things forward to keep up with the
>> aspects of the real world that this data reflects.
>
> Should we consider bundling multiple versions of the generated tables
> (header files) along with Postgres?

I wouldn't have a problem with that.

> That would enable a compile-time option to build with an older version
> of Unicode if you want, solving the packager concern that Noah raised.
> It would also make it easier for people to coordinate the Postgres
> version of Unicode and the ICU version of Unicode.

But I don't think it would be a compile-time decision. I think it would
be a run-time selection, similar to the theorized multiple-ICU-versions
feature. (Those two features might even go together, since a given ICU
version also sort of assumes a given Unicode version.)

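A run-time selection between bundled tables could look roughly like the following Python sketch. Everything in it is hypothetical: the version labels "old" and "new", the table contents, and the function upper_with_version are toy stand-ins, not real Unicode data and not a proposed interface.

```python
from typing import Dict

# Toy stand-ins for generated case-mapping tables. The "new" table adds
# one mapping that the "old" table lacks, as a Unicode update might.
SIMPLE_UPPER_TABLES: Dict[str, Dict[str, str]] = {
    "old": {"a": "A", "b": "B"},
    "new": {"a": "A", "b": "B", "c": "C"},
}


def upper_with_version(s: str, version: str) -> str:
    """Uppercase s using the table selected at run time by version."""
    table = SIMPLE_UPPER_TABLES[version]
    # Characters without an entry map to themselves.
    return "".join(table.get(ch, ch) for ch in s)


# The same input yields different results under the two tables, which is
# why an expression index built with one table cannot be trusted with
# the other.
assert upper_with_version("abc", "old") == "ABc"
assert upper_with_version("abc", "new") == "ABC"
```

Pinning the version per database or per collation would let existing indexes keep the table they were built with while new objects pick up the newer one.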
On Wed, 2025-01-22 at 19:08 +0100, Peter Eisentraut wrote:
> But I don't think it would be a compile-time decision. I think it
> would be a run-time selection, similar to the theorized
> multiple-ICU-versions feature. (Those two features might even go
> together, since a given ICU version also sort of assumes a given
> Unicode version.)

I am trying to get there, and the ctype methods patch is a step in that
direction, but I don't think we will have the full
multi-library-versions work in v18.

A compile-time option does have a chance for v18, and if that satisfies
the immediate concerns of packagers, then we can still update Unicode in
the default build.

Regards,
Jeff Davis

On Wed, 2025-01-22 at 19:03 +0100, Peter Eisentraut wrote:
> Building a collation provider on this came much later. It was
> possibly a mistake how that was done.

It wasn't a mistake. "Stability within a PG major version" was called a
*benefit* near the top of the first email on the subject[1]. It was
considered a benefit because it offered a level of stability that
neither libc nor ICU could offer. As far as I know, it's still
considered to be a benefit today by more people than not (e.g. [2]).

The concerns about Unicode updates come from a misunderstanding of the
level of stability offered in the past:

* IMMUTABLE was initially a planner concept[3], which is why it didn't
  care much about dependence on GUCs for instance.

* Expression / predicate indexes rely on immutability to mean something
  more strict, and for that, dependence on GUCs creates a problem[4].
  (Also, partitioning.)

* It's hard to make an immutable UDF without a SET search_path clause,
  but until version 17, that was such a huge performance hit that it
  was not usable in an expression index. There will be a lot of
  not-truly-immutable UDFs used in expression indexes for a long time.

* Ordinary text indexes rely on the collation libraries to be stable,
  which is hard to control because they could be updated by the OS.
  It's barely possible recently to freeze the version of libc[5]
  without freezing the whole OS version. And if you do manage to freeze
  both libc and ICU, you are risking missed security fixes.

* pg_upgrade implicitly relies on IMMUTABLE to mean something even more
  strict: stability across major versions. That's a problem for
  expression indexes on functions like NORMALIZE(). And, if using the
  optional built-in provider, also a problem for expression indexes on
  LOWER(), etc.

At each moment we took steps that made sense at the time and in context
and I am not criticizing any of those steps. The biggest practical
problem was unforeseen dramatic changes in glibc that broke a lot of
text indexes. The rest of the problems are a mix of design issues,
feature interactions, and implementation details that were not resolved
before the builtin provider existed and are still not resolved today.

I do not accept the premise that there is a problem with the built-in
provider. I didn't throw caution to the wind and neither did the
reviewers: you, Daniel, Jeremy, and I did a ton of work to understand,
mitigate, and document the risks (along with a lot of help from
Thomas's earlier work). Users who opt in to the built-in provider opt in
to occasional controlled changes according to the rather strict Unicode
stability policies[6]. These policies mitigate risks dramatically,
especially for those using only assigned code points, which can be
checked with the SQL function unicode_assigned().

Regards,
Jeff Davis

[1] https://www.postgresql.org/message-id/ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.camel@j-davis.com
[2] https://www.postgresql.org/message-id/3729436.1721322211%40sss.pgh.pa.us
[3] https://www.postgresql.org/message-id/3428810.1721160969%40sss.pgh.pa.us
[4]
    CREATE TABLE t(f float4);
    CREATE UNIQUE INDEX t_idx ON t((f::text));
    SET extra_float_digits = 0;
    INSERT INTO t VALUES (1.23456789);
    INSERT INTO t VALUES (1.23456789); -- error
    SET extra_float_digits = 1;
    INSERT INTO t VALUES (1.23456789); -- success
[5] https://github.com/awslabs/compat-collation-for-glibc
[6] https://www.unicode.org/policies/stability_policy.html
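
The "only assigned code points" condition mentioned above can be approximated outside the database as well. The Python sketch below treats a code point as unassigned when its general category is Cn; this is an analogue for illustration, not the implementation of unicode_assigned(), and the answer depends on the Unicode version of the Python interpreter rather than the server's. The name all_assigned is made up here.

```python
import unicodedata


def all_assigned(s: str) -> bool:
    """True if no code point in s has general category Cn (unassigned)."""
    return all(unicodedata.category(ch) != "Cn" for ch in s)


# Plain ASCII consists of assigned code points.
assert all_assigned("abc")

# U+FFFF is a noncharacter and carries general category Cn.
assert not all_assigned("a\uffff")
```

Text that passes such a check contains only code points whose properties are covered by the Unicode stability policies, which is the case where a later Unicode update is least likely to change results.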