Обсуждение: pgsql: Add standard collation UNICODE

Поиск
Список
Период
Сортировка

pgsql: Add standard collation UNICODE

От
Peter Eisentraut
Дата:
Add standard collation UNICODE

This adds a new predefined collation named UNICODE, which sorts by the
default Unicode collation algorithm specifications, per SQL standard.

This only works if ICU support is built.

Reviewed-by: Jeff Davis <pgsql@j-davis.com>
Discussion: https://www.postgresql.org/message-id/flat/1293e382-2093-a2bf-a397-c04e8f83d3c2@enterprisedb.com

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/0d21d4b9bc1f9da9dda29e5c4db0c6dd45408aaa

Modified Files
--------------
doc/src/sgml/charset.sgml                      | 31 +++++++++++++++++++++++---
src/bin/initdb/initdb.c                        | 10 ++++++---
src/include/catalog/catversion.h               |  2 +-
src/test/regress/expected/collate.icu.utf8.out |  9 ++++++++
src/test/regress/sql/collate.icu.utf8.sql      |  1 +
5 files changed, 46 insertions(+), 7 deletions(-)


Re: pgsql: Add standard collation UNICODE

От
Jeff Davis
Дата:
On Fri, 2023-03-10 at 12:42 +0000, Peter Eisentraut wrote:
> > Add standard collation UNICODE

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2023-03-10%2018%3A58%3A04

Looks like there's still a failure, even after subsequent fix
3e623ebc7a. I'm wondering how that diff happens, because I tested all
versions of ICU between 50 and 72 (except 58 and 59, where I had build
problems) and 'abc' is always less than 'ABC' in the root locale.

Must be something simple I'm missing.

Regards,
    Jeff Davis




Re: pgsql: Add standard collation UNICODE

От
"Jonathan S. Katz"
Дата:
> On Mar 10, 2023, at 11:44 AM, Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Fri, 2023-03-10 at 12:42 +0000, Peter Eisentraut wrote:
>>> Add standard collation UNICODE
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2023-03-10%2018%3A58%3A04
>
> Looks like there's still a failure, even after subsequent fix
> 3e623ebc7a. I'm wondering how that diff happens, because I tested all
> versions of ICU between 50 and 72 (except 58 and 59, where I had build
> problems) and 'abc' is always less than 'ABC' in the root locale.

Do you have any customizations or other tests in
place for adjusting the collation ordering that
could be affecting the local build? The capitals
should be first. I can try my own local test in a bit.

Jonathan


Re: pgsql: Add standard collation UNICODE

От
Jeff Davis
Дата:
On Fri, 2023-03-10 at 11:52 -0800, Jonathan S. Katz wrote:
>
> The capitals should be first.

That is not true in a lot of natural language locales, whether libc or
ICU. The following return true for me (collencoding UTF-8):

  select 'abc' collate "en_US" < 'ABC' collate "en_US";
  select 'abc' collate "fr_FR" < 'ABC' collate "fr_FR";
  select 'abc' collate "de_DE" < 'ABC' collate "de_DE";
  select 'abc' collate "de_AT" < 'ABC' collate "de_AT";
  select 'abc' collate "es_ES" < 'ABC' collate "es_ES";
  select 'abc' collate "en-US-x-icu" < 'ABC' collate "en-US-x-icu";
  select 'abc' collate "fr-CA-x-icu" < 'ABC' collate "fr-CA-x-icu";
  select 'abc' collate "ja-JP-x-icu" < 'ABC' collate "ja-JP-x-icu";
  select 'abc' collate "tr-TR-x-icu" < 'ABC' collate "tr-TR-x-icu";

There are some cases that return false, as well:

  select 'abc' collate "ja_JP" < 'ABC' collate "ja_JP";
  select 'abc' collate "fr_CA" < 'ABC' collate "fr_CA";
  select 'abc' collate "en-US-u-va-posix-x-icu" < 
    'ABC' collate "en-US-u-va-posix-x-icu";

The cases where it's false appear to be more common in libc locales,
but most libc locales that I tested still sort lowercase first.

Regards,
    Jeff Davis




Re: pgsql: Add standard collation UNICODE

От
Andrew Dunstan
Дата:


On 2023-03-10 Fr 14:44, Jeff Davis wrote:
On Fri, 2023-03-10 at 12:42 +0000, Peter Eisentraut wrote:
Add standard collation UNICODE
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2023-03-10%2018%3A58%3A04

Looks like there's still a failure, even after subsequent fix
3e623ebc7a. I'm wondering how that diff happens, because I tested all
versions of ICU between 50 and 72 (except 58 and 59, where I had build
problems) and 'abc' is always less than 'ABC' in the root locale.

Must be something simple I'm missing.



It's fairly old:


$ rpm -q -a | grep icu
libicu-50.2-4.0.amzn1.x86_64
libicu-devel-50.2-4.0.amzn1.x86_64


I'll see about updating the machine (probably by recreating it with a modern AMI)


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: pgsql: Add standard collation UNICODE

От
Jeff Davis
Дата:
On Sat, 2023-03-11 at 08:31 -0500, Andrew Dunstan wrote:
> >
> It's fairly old:
>
> $ rpm -q -a | grep icu
>  libicu-50.2-4.0.amzn1.x86_64
>  libicu-devel-50.2-4.0.amzn1.x86_64

FWIW I tried ICU 50.2 (built from sources) and the root locale still
sorts lowercase first.

Regards,
    Jeff Davis





Re: pgsql: Add standard collation UNICODE

От
Tom Lane
Дата:
Jeff Davis <pgsql@j-davis.com> writes:
> On Sat, 2023-03-11 at 08:31 -0500, Andrew Dunstan wrote:
>> It's fairly old:
>> $ rpm -q -a | grep icu
>>  libicu-50.2-4.0.amzn1.x86_64
>>  libicu-devel-50.2-4.0.amzn1.x86_64

> FWIW I tried ICU 50.2 (built from sources) and the root locale still
> sorts lowercase first.

snapper is showing this failure too [1], which makes me wonder if it's
a locally-customizable option.

In general, I see no good reason for our regression tests to be making
assumptions about exactly how ICU's root locale behaves, so I'd suggest
just lobotomizing this test case so it doesn't depend on upper/lower
sort order.

            regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snapper&dt=2023-03-11%2016%3A50%3A13



Re: pgsql: Add standard collation UNICODE

От
Andres Freund
Дата:
Hi,

On 2023-03-11 10:20:33 -0800, Jeff Davis wrote:
> On Sat, 2023-03-11 at 08:31 -0500, Andrew Dunstan wrote:
> > > 
> > It's fairly old:
> > 
> > $ rpm -q -a | grep icu
> >  libicu-50.2-4.0.amzn1.x86_64
> >  libicu-devel-50.2-4.0.amzn1.x86_64
> 
> FWIW I tried ICU 50.2 (built from sources) and the root locale still
> sorts lowercase first.

On one my hacking-on-meson-build trees ([1]) I saw the problem in CI, on
centos-7. In case you want a way to reproduce the issue. You can merge that
tree into a throway tree, none of the changes should affect your work. You can
even log into the machine if it's on your repo ("re-run with terminal
access"), if you want to figure out the details.

https://cirrus-ci.com/task/6015098225950720
https://api.cirrus-ci.com/v1/artifact/task/6015098225950720/testrun/build/testrun/regress/regress/regression.diffs

Centos 8, fedora rawhide, suse tumbleweed are all OK though.

Greetings,

Andres Freund

[1] https://github.com/anarazel/postgres/tree/meson-ci-debug



Re: pgsql: Add standard collation UNICODE

От
Andrew Dunstan
Дата:

> On Mar 11, 2023, at 5:05 PM, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
>> On 2023-03-11 10:20:33 -0800, Jeff Davis wrote:
>> On Sat, 2023-03-11 at 08:31 -0500, Andrew Dunstan wrote:
>>>>
>>> It's fairly old:
>>>
>>> $ rpm -q -a | grep icu
>>>  libicu-50.2-4.0.amzn1.x86_64
>>>  libicu-devel-50.2-4.0.amzn1.x86_64
>>
>> FWIW I tried ICU 50.2 (built from sources) and the root locale still
>> sorts lowercase first.
>
> On one my hacking-on-meson-build trees ([1]) I saw the problem in CI, on
> centos-7. In case you want a way to reproduce the issue. You can merge that
> tree into a throway tree, none of the changes should affect your work. You can
> even log into the machine if it's on your repo ("re-run with terminal
> access"), if you want to figure out the details.
>
> https://cirrus-ci.com/task/6015098225950720
> https://api.cirrus-ci.com/v1/artifact/task/6015098225950720/testrun/build/testrun/regress/regress/regression.diffs
>
> Centos 8, fedora rawhide, suse tumbleweed are all OK though.


I have a Centos 7 setup I can play with tomorrow.

Cheers

Andrew



Re: pgsql: Add standard collation UNICODE

От
Andrew Dunstan
Дата:


On 2023-03-11 Sa 17:05, Andres Freund wrote:
Hi,

On 2023-03-11 10:20:33 -0800, Jeff Davis wrote:
On Sat, 2023-03-11 at 08:31 -0500, Andrew Dunstan wrote:
It's fairly old:

$ rpm -q -a | grep icu
 libicu-50.2-4.0.amzn1.x86_64
 libicu-devel-50.2-4.0.amzn1.x86_64
FWIW I tried ICU 50.2 (built from sources) and the root locale still
sorts lowercase first.
On one my hacking-on-meson-build trees ([1]) I saw the problem in CI, on
centos-7. 


Yeah,


I have a Centos 7 VM sitting here which I fired up and reproduced it just now. I have the source package downloaded from the Centos archives, but given that it's also happening on Debian 7, perhaps it's not something packaging-specific.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: pgsql: Add standard collation UNICODE

От
Andrew Dunstan
Дата:


On 2023-03-11 Sa 15:47, Tom Lane wrote:
Jeff Davis <pgsql@j-davis.com> writes:
On Sat, 2023-03-11 at 08:31 -0500, Andrew Dunstan wrote:
It's fairly old:
$ rpm -q -a | grep icu
 libicu-50.2-4.0.amzn1.x86_64
 libicu-devel-50.2-4.0.amzn1.x86_64
FWIW I tried ICU 50.2 (built from sources) and the root locale still
sorts lowercase first.
snapper is showing this failure too [1], which makes me wonder if it's
a locally-customizable option.

In general, I see no good reason for our regression tests to be making
assumptions about exactly how ICU's root locale behaves, so I'd suggest
just lobotomizing this test case so it doesn't depend on upper/lower
sort order.
	


Yeah, I think we've found enough cases in old still not dead distros to make this reasonable.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: pgsql: Add standard collation UNICODE

От
Tom Lane
Дата:
Andrew Dunstan <andrew@dunslane.net> writes:
> On 2023-03-11 Sa 15:47, Tom Lane wrote:
>> In general, I see no good reason for our regression tests to be making
>> assumptions about exactly how ICU's root locale behaves, so I'd suggest
>> just lobotomizing this test case so it doesn't depend on upper/lower
>> sort order.

> Yeah, I think we've found enough cases in old still not dead distros to 
> make this reasonable.

I see topminnow is showing the symptom too, so it's now pretty clear
that this was once a common behavior.  I dug in ICU's release notes
and couldn't find anything about it, but they must've changed something
somewhen.

            regards, tom lane



Re: pgsql: Add standard collation UNICODE

От
Thomas Munro
Дата:
On Mon, Mar 13, 2023 at 9:10 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
> > On 2023-03-11 Sa 15:47, Tom Lane wrote:
> >> In general, I see no good reason for our regression tests to be making
> >> assumptions about exactly how ICU's root locale behaves, so I'd suggest
> >> just lobotomizing this test case so it doesn't depend on upper/lower
> >> sort order.
>
> > Yeah, I think we've found enough cases in old still not dead distros to
> > make this reasonable.
>
> I see topminnow is showing the symptom too, so it's now pretty clear
> that this was once a common behavior.  I dug in ICU's release notes
> and couldn't find anything about it, but they must've changed something
> somewhen.

Most odd, and a bit worrying for the multi-lib concept we were
discussing (now for 17), though not fatal if it's something like
"there was one historical screwup back in 50.whatever, but now we are
more careful", or "it was a packaging error that was fixed but old
systems that aren't getting updated don't have the fix", so not even
ICU's fault (which would fit with Jeff's observation that he can't
repro the issue by building from ICU source at the 50.2 tag, as if
what is in the package is not what was tagged?!).  It'd be nice to
understand what happened here...  What do these disagreeing systems
report for the various version meta-data (Unicode version, CLDR
version, the opaque collator version, etc)?



Re: pgsql: Add standard collation UNICODE

От
Andrew Dunstan
Дата:


On 2023-03-12 Su 18:47, Thomas Munro wrote:
On Mon, Mar 13, 2023 at 9:10 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
On 2023-03-11 Sa 15:47, Tom Lane wrote:
In general, I see no good reason for our regression tests to be making
assumptions about exactly how ICU's root locale behaves, so I'd suggest
just lobotomizing this test case so it doesn't depend on upper/lower
sort order.
Yeah, I think we've found enough cases in old still not dead distros to
make this reasonable.
I see topminnow is showing the symptom too, so it's now pretty clear
that this was once a common behavior.  I dug in ICU's release notes
and couldn't find anything about it, but they must've changed something
somewhen.
Most odd, and a bit worrying for the multi-lib concept we were
discussing (now for 17), though not fatal if it's something like
"there was one historical screwup back in 50.whatever, but now we are
more careful", or "it was a packaging error that was fixed but old
systems that aren't getting updated don't have the fix", so not even
ICU's fault (which would fit with Jeff's observation that he can't
repro the issue by building from ICU source at the 50.2 tag, as if
what is in the package is not what was tagged?!).  It'd be nice to
understand what happened here...  What do these disagreeing systems
report for the various version meta-data (Unicode version, CLDR
version, the opaque collator version, etc)?


How would I discover that? Or if it's quicker I can give you access to one or two machines which exhibit the issue.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: pgsql: Add standard collation UNICODE

От
Tom Lane
Дата:
Andrew Dunstan <andrew@dunslane.net> writes:
> How would I discover that? Or if it's quicker I can give you access to 
> one or two machines which exhibit the issue.

Looks like Peter already found the answer -- several of the unhappy
animals have gone green since d72900bde.

            regards, tom lane



Re: pgsql: Add standard collation UNICODE

От
Jeff Davis
Дата:
On Sat, 2023-03-11 at 15:47 -0500, Tom Lane wrote:
> In general, I see no good reason for our regression tests to be
> making
> assumptions about exactly how ICU's root locale behaves, so I'd
> suggest
> just lobotomizing this test case so it doesn't depend on upper/lower
> sort order.

[ Looks like Peter already found the answer, but I had some additional
commentary so I'll send this anyway. ]

Mystery solved: "und" was not a valid spelling of "root" until version
55. So it fell back to the default locale, which is sensitive to the
environment.

With this in mind, I think we should take the opposite lesson and keep
this test as-is. We don't have to promise the behavior in the docs, but
it did catch a real issue. In fact, it caught a dangerous issue,
because a change to lc_messages could cause a corrupt index.

I'd like to see if there's a way to reliably detect that a fallback to
the environment-sensitive default locale has happened, and throw an
error. I don't think that happens in versions 55 and later, but it
would be nice to have a test because earlier versions of ICU seem
dangerous to me.

Regards,
    Jeff Davis