Обсуждение: pgsql: Add standard collation UNICODE
Add standard collation UNICODE This adds a new predefined collation named UNICODE, which sorts by the default Unicode collation algorithm specifications, per SQL standard. This only works if ICU support is built. Reviewed-by: Jeff Davis <pgsql@j-davis.com> Discussion: https://www.postgresql.org/message-id/flat/1293e382-2093-a2bf-a397-c04e8f83d3c2@enterprisedb.com Branch ------ master Details ------- https://git.postgresql.org/pg/commitdiff/0d21d4b9bc1f9da9dda29e5c4db0c6dd45408aaa Modified Files -------------- doc/src/sgml/charset.sgml | 31 +++++++++++++++++++++++--- src/bin/initdb/initdb.c | 10 ++++++--- src/include/catalog/catversion.h | 2 +- src/test/regress/expected/collate.icu.utf8.out | 9 ++++++++ src/test/regress/sql/collate.icu.utf8.sql | 1 + 5 files changed, 46 insertions(+), 7 deletions(-)
On Fri, 2023-03-10 at 12:42 +0000, Peter Eisentraut wrote: > > Add standard collation UNICODE https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2023-03-10%2018%3A58%3A04 Looks like there's still a failure, even after subsequent fix 3e623ebc7a. I'm wondering how that diff happens, because I tested all versions of ICU between 50 and 72 (except 58 and 59, where I had build problems) and 'abc' is always less than 'ABC' in the root locale. Must be something simple I'm missing. Regards, Jeff Davis
> On Mar 10, 2023, at 11:44 AM, Jeff Davis <pgsql@j-davis.com> wrote: > > On Fri, 2023-03-10 at 12:42 +0000, Peter Eisentraut wrote: >>> Add standard collation UNICODE > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2023-03-10%2018%3A58%3A04 > > Looks like there's still a failure, even after subsequent fix > 3e623ebc7a. I'm wondering how that diff happens, because I tested all > versions of ICU between 50 and 72 (except 58 and 59, where I had build > problems) and 'abc' is always less than 'ABC' in the root locale. Do you have any customizations or other tests in place for adjusting the collation ordering that could be affecting the local build? The capitals should be first. I can try my own local test in a bit. Jonathan
On Fri, 2023-03-10 at 11:52 -0800, Jonathan S. Katz wrote: > > The capitals should be first. That is not true in a lot of natural language locales, whether libc or ICU. The following return true for me (collencoding UTF-8): select 'abc' collate "en_US" < 'ABC' collate "en_US"; select 'abc' collate "fr_FR" < 'ABC' collate "fr_FR"; select 'abc' collate "de_DE" < 'ABC' collate "de_DE"; select 'abc' collate "de_AT" < 'ABC' collate "de_AT"; select 'abc' collate "es_ES" < 'ABC' collate "es_ES"; select 'abc' collate "en-US-x-icu" < 'ABC' collate "en-US-x-icu"; select 'abc' collate "fr-CA-x-icu" < 'ABC' collate "fr-CA-x-icu"; select 'abc' collate "ja-JP-x-icu" < 'ABC' collate "ja-JP-x-icu"; select 'abc' collate "tr-TR-x-icu" < 'ABC' collate "tr-TR-x-icu"; There are some cases that return false, as well: select 'abc' collate "ja_JP" < 'ABC' collate "ja_JP"; select 'abc' collate "fr_CA" < 'ABC' collate "fr_CA"; select 'abc' collate "en-US-u-va-posix-x-icu" < 'ABC' collate "en-US-u-va-posix-x-icu"; The cases where it's false appear to be more common in libc locales, but most libc locales that I tested still sort lowercase first. Regards, Jeff Davis
On 2023-03-10 Fr 14:44, Jeff Davis wrote:
On Fri, 2023-03-10 at 12:42 +0000, Peter Eisentraut wrote:Add standard collation UNICODEhttps://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2023-03-10%2018%3A58%3A04 Looks like there's still a failure, even after subsequent fix 3e623ebc7a. I'm wondering how that diff happens, because I tested all versions of ICU between 50 and 72 (except 58 and 59, where I had build problems) and 'abc' is always less than 'ABC' in the root locale. Must be something simple I'm missing.
It's fairly old:
$ rpm -q -a | grep icu
libicu-50.2-4.0.amzn1.x86_64
libicu-devel-50.2-4.0.amzn1.x86_64
I'll see about updating the machine (probably by recreating it with a modern AMI)
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
On Sat, 2023-03-11 at 08:31 -0500, Andrew Dunstan wrote: > > > It's fairly old: > > $ rpm -q -a | grep icu > libicu-50.2-4.0.amzn1.x86_64 > libicu-devel-50.2-4.0.amzn1.x86_64 FWIW I tried ICU 50.2 (built from sources) and the root locale still sorts lowercase first. Regards, Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > On Sat, 2023-03-11 at 08:31 -0500, Andrew Dunstan wrote: >> It's fairly old: >> $ rpm -q -a | grep icu >> libicu-50.2-4.0.amzn1.x86_64 >> libicu-devel-50.2-4.0.amzn1.x86_64 > FWIW I tried ICU 50.2 (built from sources) and the root locale still > sorts lowercase first. snapper is showing this failure too [1], which makes me wonder if it's a locally-customizable option. In general, I see no good reason for our regression tests to be making assumptions about exactly how ICU's root locale behaves, so I'd suggest just lobotomizing this test case so it doesn't depend on upper/lower sort order. regards, tom lane [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snapper&dt=2023-03-11%2016%3A50%3A13
Hi, On 2023-03-11 10:20:33 -0800, Jeff Davis wrote: > On Sat, 2023-03-11 at 08:31 -0500, Andrew Dunstan wrote: > > > > > It's fairly old: > > > > $ rpm -q -a | grep icu > > libicu-50.2-4.0.amzn1.x86_64 > > libicu-devel-50.2-4.0.amzn1.x86_64 > > FWIW I tried ICU 50.2 (built from sources) and the root locale still > sorts lowercase first. On one my hacking-on-meson-build trees ([1]) I saw the problem in CI, on centos-7. In case you want a way to reproduce the issue. You can merge that tree into a throway tree, none of the changes should affect your work. You can even log into the machine if it's on your repo ("re-run with terminal access"), if you want to figure out the details. https://cirrus-ci.com/task/6015098225950720 https://api.cirrus-ci.com/v1/artifact/task/6015098225950720/testrun/build/testrun/regress/regress/regression.diffs Centos 8, fedora rawhide, suse tumbleweed are all OK though. Greetings, Andres Freund [1] https://github.com/anarazel/postgres/tree/meson-ci-debug
> On Mar 11, 2023, at 5:05 PM, Andres Freund <andres@anarazel.de> wrote: > > Hi, > >> On 2023-03-11 10:20:33 -0800, Jeff Davis wrote: >> On Sat, 2023-03-11 at 08:31 -0500, Andrew Dunstan wrote: >>>> >>> It's fairly old: >>> >>> $ rpm -q -a | grep icu >>> libicu-50.2-4.0.amzn1.x86_64 >>> libicu-devel-50.2-4.0.amzn1.x86_64 >> >> FWIW I tried ICU 50.2 (built from sources) and the root locale still >> sorts lowercase first. > > On one my hacking-on-meson-build trees ([1]) I saw the problem in CI, on > centos-7. In case you want a way to reproduce the issue. You can merge that > tree into a throway tree, none of the changes should affect your work. You can > even log into the machine if it's on your repo ("re-run with terminal > access"), if you want to figure out the details. > > https://cirrus-ci.com/task/6015098225950720 > https://api.cirrus-ci.com/v1/artifact/task/6015098225950720/testrun/build/testrun/regress/regress/regression.diffs > > Centos 8, fedora rawhide, suse tumbleweed are all OK though. I have a Centos 7 setup I can play with tomorrow. Cheers Andrew
On 2023-03-11 Sa 17:05, Andres Freund wrote:
Hi, On 2023-03-11 10:20:33 -0800, Jeff Davis wrote:On Sat, 2023-03-11 at 08:31 -0500, Andrew Dunstan wrote:It's fairly old: $ rpm -q -a | grep icu libicu-50.2-4.0.amzn1.x86_64 libicu-devel-50.2-4.0.amzn1.x86_64FWIW I tried ICU 50.2 (built from sources) and the root locale still sorts lowercase first.On one my hacking-on-meson-build trees ([1]) I saw the problem in CI, on centos-7.
Yeah,
I have a Centos 7 VM sitting here which I fired up and reproduced it just now. I have the source package downloaded from the Centos archives, but given that it's also happening on Debian 7, perhaps it's not something packaging-specific.
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
On 2023-03-11 Sa 15:47, Tom Lane wrote:
Jeff Davis <pgsql@j-davis.com> writes:On Sat, 2023-03-11 at 08:31 -0500, Andrew Dunstan wrote:It's fairly old: $ rpm -q -a | grep icu libicu-50.2-4.0.amzn1.x86_64 libicu-devel-50.2-4.0.amzn1.x86_64FWIW I tried ICU 50.2 (built from sources) and the root locale still sorts lowercase first.snapper is showing this failure too [1], which makes me wonder if it's a locally-customizable option. In general, I see no good reason for our regression tests to be making assumptions about exactly how ICU's root locale behaves, so I'd suggest just lobotomizing this test case so it doesn't depend on upper/lower sort order.
Yeah, I think we've found enough cases in old still not dead distros to make this reasonable.
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes: > On 2023-03-11 Sa 15:47, Tom Lane wrote: >> In general, I see no good reason for our regression tests to be making >> assumptions about exactly how ICU's root locale behaves, so I'd suggest >> just lobotomizing this test case so it doesn't depend on upper/lower >> sort order. > Yeah, I think we've found enough cases in old still not dead distros to > make this reasonable. I see topminnow is showing the symptom too, so it's now pretty clear that this was once a common behavior. I dug in ICU's release notes and couldn't find anything about it, but they must've changed something somewhen. regards, tom lane
On Mon, Mar 13, 2023 at 9:10 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > > On 2023-03-11 Sa 15:47, Tom Lane wrote: > >> In general, I see no good reason for our regression tests to be making > >> assumptions about exactly how ICU's root locale behaves, so I'd suggest > >> just lobotomizing this test case so it doesn't depend on upper/lower > >> sort order. > > > Yeah, I think we've found enough cases in old still not dead distros to > > make this reasonable. > > I see topminnow is showing the symptom too, so it's now pretty clear > that this was once a common behavior. I dug in ICU's release notes > and couldn't find anything about it, but they must've changed something > somewhen. Most odd, and a bit worrying for the multi-lib concept we were discussing (now for 17), though not fatal if it's something like "there was one historical screwup back in 50.whatever, but now we are more careful", or "it was a packaging error that was fixed but old systems that aren't getting updated don't have the fix", so not even ICU's fault (which would fit with Jeff's observation that he can't repro the issue by building from ICU source at the 50.2 tag, as if what is in the package is not what was tagged?!). It'd be nice to understand what happened here... What do these disagreeing systems report for the various version meta-data (Unicode version, CLDR version, the opaque collator version, etc)?
On 2023-03-12 Su 18:47, Thomas Munro wrote:
On Mon, Mar 13, 2023 at 9:10 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:Andrew Dunstan <andrew@dunslane.net> writes:On 2023-03-11 Sa 15:47, Tom Lane wrote:In general, I see no good reason for our regression tests to be making assumptions about exactly how ICU's root locale behaves, so I'd suggest just lobotomizing this test case so it doesn't depend on upper/lower sort order.Yeah, I think we've found enough cases in old still not dead distros to make this reasonable.I see topminnow is showing the symptom too, so it's now pretty clear that this was once a common behavior. I dug in ICU's release notes and couldn't find anything about it, but they must've changed something somewhen.Most odd, and a bit worrying for the multi-lib concept we were discussing (now for 17), though not fatal if it's something like "there was one historical screwup back in 50.whatever, but now we are more careful", or "it was a packaging error that was fixed but old systems that aren't getting updated don't have the fix", so not even ICU's fault (which would fit with Jeff's observation that he can't repro the issue by building from ICU source at the 50.2 tag, as if what is in the package is not what was tagged?!). It'd be nice to understand what happened here... What do these disagreeing systems report for the various version meta-data (Unicode version, CLDR version, the opaque collator version, etc)?
How would I discover that? Or if it's quicker I can give you access to one or two machines which exhibit the issue.
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes: > How would I discover that? Or if it's quicker I can give you access to > one or two machines which exhibit the issue. Looks like Peter already found the answer -- several of the unhappy animals have gone green since d72900bde. regards, tom lane
On Sat, 2023-03-11 at 15:47 -0500, Tom Lane wrote: > In general, I see no good reason for our regression tests to be > making > assumptions about exactly how ICU's root locale behaves, so I'd > suggest > just lobotomizing this test case so it doesn't depend on upper/lower > sort order. [ Looks like Peter already found the answer, but I had some additional commentary so I'll send this anyway. ] Mystery solved: "und" was not a valid spelling of "root" until version 55. So it fell back to the default locale, which is sensitive to the environment. With this in mind, I think we should take the opposite lesson and keep this test as-is. We don't have to promise the behavior in the docs, but it did catch a real issue. In fact, it caught a dangerous issue, because a change to lc_messages could cause a corrupt index. I'd like to see if there's a way to reliably detect that a fallback to the environment-sensitive default locale has happened, and throw an error. I don't think that happens in versions 55 and later, but it would be nice to have a test because earlier versions of ICU seem dangerous to me. Regards, Jeff Davis