Обсуждение: pgsql: Add standard collation UNICODE

Поиск

Список

Период

Сортировка

pgsql: Add standard collation UNICODE

От

Peter Eisentraut

Дата:

10 марта 2023 г., 15:42:35

Add standard collation UNICODE

This adds a new predefined collation named UNICODE, which sorts by the
default Unicode collation algorithm specifications, per SQL standard.

This only works if ICU support is built.

Reviewed-by: Jeff Davis <pgsql@j-davis.com>
Discussion: https://www.postgresql.org/message-id/flat/1293e382-2093-a2bf-a397-c04e8f83d3c2@enterprisedb.com

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/0d21d4b9bc1f9da9dda29e5c4db0c6dd45408aaa

Modified Files
--------------
doc/src/sgml/charset.sgml                      | 31 +++++++++++++++++++++++---
src/bin/initdb/initdb.c                        | 10 ++++++---
src/include/catalog/catversion.h               |  2 +-
src/test/regress/expected/collate.icu.utf8.out |  9 ++++++++
src/test/regress/sql/collate.icu.utf8.sql      |  1 +
5 files changed, 46 insertions(+), 7 deletions(-)

Re: pgsql: Add standard collation UNICODE

От

Jeff Davis

Дата:

10 марта 2023 г., 22:44:08

On Fri, 2023-03-10 at 12:42 +0000, Peter Eisentraut wrote:
> > Add standard collation UNICODE

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2023-03-10%2018%3A58%3A04

Looks like there's still a failure, even after subsequent fix
3e623ebc7a. I'm wondering how that diff happens, because I tested all
versions of ICU between 50 and 72 (except 58 and 59, where I had build
problems) and 'abc' is always less than 'ABC' in the root locale.

Must be something simple I'm missing.

Regards,
    Jeff Davis

Re: pgsql: Add standard collation UNICODE

От

"Jonathan S. Katz"

Дата:

10 марта 2023 г., 22:52:22

> On Mar 10, 2023, at 11:44 AM, Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Fri, 2023-03-10 at 12:42 +0000, Peter Eisentraut wrote:
>>> Add standard collation UNICODE
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2023-03-10%2018%3A58%3A04
>
> Looks like there's still a failure, even after subsequent fix
> 3e623ebc7a. I'm wondering how that diff happens, because I tested all
> versions of ICU between 50 and 72 (except 58 and 59, where I had build
> problems) and 'abc' is always less than 'ABC' in the root locale.

Do you have any customizations or other tests in
place for adjusting the collation ordering that
could be affecting the local build? The capitals
should be first. I can try my own local test in a bit.

Jonathan

Re: pgsql: Add standard collation UNICODE

От

Jeff Davis

Дата:

11 марта 2023 г., 00:24:27

On Fri, 2023-03-10 at 11:52 -0800, Jonathan S. Katz wrote:
>
> The capitals should be first.

That is not true in a lot of natural language locales, whether libc or
ICU. The following return true for me (collencoding UTF-8):

  select 'abc' collate "en_US" < 'ABC' collate "en_US";
  select 'abc' collate "fr_FR" < 'ABC' collate "fr_FR";
  select 'abc' collate "de_DE" < 'ABC' collate "de_DE";
  select 'abc' collate "de_AT" < 'ABC' collate "de_AT";
  select 'abc' collate "es_ES" < 'ABC' collate "es_ES";
  select 'abc' collate "en-US-x-icu" < 'ABC' collate "en-US-x-icu";
  select 'abc' collate "fr-CA-x-icu" < 'ABC' collate "fr-CA-x-icu";
  select 'abc' collate "ja-JP-x-icu" < 'ABC' collate "ja-JP-x-icu";
  select 'abc' collate "tr-TR-x-icu" < 'ABC' collate "tr-TR-x-icu";

There are some cases that return false, as well:

  select 'abc' collate "ja_JP" < 'ABC' collate "ja_JP";
  select 'abc' collate "fr_CA" < 'ABC' collate "fr_CA";
  select 'abc' collate "en-US-u-va-posix-x-icu" < 
    'ABC' collate "en-US-u-va-posix-x-icu";

The cases where it's false appear to be more common in libc locales,
but most libc locales that I tested still sort lowercase first.

Regards,
    Jeff Davis

Re: pgsql: Add standard collation UNICODE

От

Andrew Dunstan

Дата:

11 марта 2023 г., 16:31:00

On 2023-03-10 Fr 14:44, Jeff Davis wrote:

On Fri, 2023-03-10 at 12:42 +0000, Peter Eisentraut wrote:

Add standard collation UNICODE

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2023-03-10%2018%3A58%3A04

Looks like there's still a failure, even after subsequent fix
3e623ebc7a. I'm wondering how that diff happens, because I tested all
versions of ICU between 50 and 72 (except 58 and 59, where I had build
problems) and 'abc' is always less than 'ABC' in the root locale.

Must be something simple I'm missing.

It's fairly old:

$ rpm -q -a | grep icu
libicu-50.2-4.0.amzn1.x86_64
libicu-devel-50.2-4.0.amzn1.x86_64

I'll see about updating the machine (probably by recreating it with a modern AMI)

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: pgsql: Add standard collation UNICODE

От

Jeff Davis

Дата:

11 марта 2023 г., 21:20:33

On Sat, 2023-03-11 at 08:31 -0500, Andrew Dunstan wrote:
> >
> It's fairly old:
>
> $ rpm -q -a | grep icu
>  libicu-50.2-4.0.amzn1.x86_64
>  libicu-devel-50.2-4.0.amzn1.x86_64

FWIW I tried ICU 50.2 (built from sources) and the root locale still
sorts lowercase first.

Regards,
    Jeff Davis

Re: pgsql: Add standard collation UNICODE

От

Tom Lane

Дата:

11 марта 2023 г., 23:47:41

Jeff Davis <pgsql@j-davis.com> writes:
> On Sat, 2023-03-11 at 08:31 -0500, Andrew Dunstan wrote:
>> It's fairly old:
>> $ rpm -q -a | grep icu
>>  libicu-50.2-4.0.amzn1.x86_64
>>  libicu-devel-50.2-4.0.amzn1.x86_64

> FWIW I tried ICU 50.2 (built from sources) and the root locale still
> sorts lowercase first.

snapper is showing this failure too [1], which makes me wonder if it's
a locally-customizable option.

In general, I see no good reason for our regression tests to be making
assumptions about exactly how ICU's root locale behaves, so I'd suggest
just lobotomizing this test case so it doesn't depend on upper/lower
sort order.

            regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snapper&dt=2023-03-11%2016%3A50%3A13

Re: pgsql: Add standard collation UNICODE

От

Andres Freund

Дата:

12 марта 2023 г., 01:05:15

Hi,

On 2023-03-11 10:20:33 -0800, Jeff Davis wrote:
> On Sat, 2023-03-11 at 08:31 -0500, Andrew Dunstan wrote:
> > > 
> > It's fairly old:
> > 
> > $ rpm -q -a | grep icu
> >  libicu-50.2-4.0.amzn1.x86_64
> >  libicu-devel-50.2-4.0.amzn1.x86_64
> 
> FWIW I tried ICU 50.2 (built from sources) and the root locale still
> sorts lowercase first.

On one my hacking-on-meson-build trees ([1]) I saw the problem in CI, on
centos-7. In case you want a way to reproduce the issue. You can merge that
tree into a throway tree, none of the changes should affect your work. You can
even log into the machine if it's on your repo ("re-run with terminal
access"), if you want to figure out the details.

https://cirrus-ci.com/task/6015098225950720
https://api.cirrus-ci.com/v1/artifact/task/6015098225950720/testrun/build/testrun/regress/regress/regression.diffs

Centos 8, fedora rawhide, suse tumbleweed are all OK though.

Greetings,

Andres Freund

[1] https://github.com/anarazel/postgres/tree/meson-ci-debug

Re: pgsql: Add standard collation UNICODE

От

Andrew Dunstan

Дата:

12 марта 2023 г., 01:42:59


> On Mar 11, 2023, at 5:05 PM, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
>> On 2023-03-11 10:20:33 -0800, Jeff Davis wrote:
>> On Sat, 2023-03-11 at 08:31 -0500, Andrew Dunstan wrote:
>>>>
>>> It's fairly old:
>>>
>>> $ rpm -q -a | grep icu
>>>  libicu-50.2-4.0.amzn1.x86_64
>>>  libicu-devel-50.2-4.0.amzn1.x86_64
>>
>> FWIW I tried ICU 50.2 (built from sources) and the root locale still
>> sorts lowercase first.
>
> On one my hacking-on-meson-build trees ([1]) I saw the problem in CI, on
> centos-7. In case you want a way to reproduce the issue. You can merge that
> tree into a throway tree, none of the changes should affect your work. You can
> even log into the machine if it's on your repo ("re-run with terminal
> access"), if you want to figure out the details.
>
> https://cirrus-ci.com/task/6015098225950720
> https://api.cirrus-ci.com/v1/artifact/task/6015098225950720/testrun/build/testrun/regress/regress/regression.diffs
>
> Centos 8, fedora rawhide, suse tumbleweed are all OK though.


I have a Centos 7 setup I can play with tomorrow.

Cheers

Andrew

Re: pgsql: Add standard collation UNICODE

От

Andrew Dunstan

Дата:

12 марта 2023 г., 18:17:47

On 2023-03-11 Sa 17:05, Andres Freund wrote:

Hi,

On 2023-03-11 10:20:33 -0800, Jeff Davis wrote:

On Sat, 2023-03-11 at 08:31 -0500, Andrew Dunstan wrote:

It's fairly old:

$ rpm -q -a | grep icu
 libicu-50.2-4.0.amzn1.x86_64
 libicu-devel-50.2-4.0.amzn1.x86_64

FWIW I tried ICU 50.2 (built from sources) and the root locale still
sorts lowercase first.

On one my hacking-on-meson-build trees ([1]) I saw the problem in CI, on
centos-7.

Yeah,

I have a Centos 7 VM sitting here which I fired up and reproduced it just now. I have the source package downloaded from the Centos archives, but given that it's also happening on Debian 7, perhaps it's not something packaging-specific.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: pgsql: Add standard collation UNICODE

От

Andrew Dunstan

Дата:

12 марта 2023 г., 23:04:55

On 2023-03-11 Sa 15:47, Tom Lane wrote:

Jeff Davis <pgsql@j-davis.com> writes:

On Sat, 2023-03-11 at 08:31 -0500, Andrew Dunstan wrote:

It's fairly old:
$ rpm -q -a | grep icu
 libicu-50.2-4.0.amzn1.x86_64
 libicu-devel-50.2-4.0.amzn1.x86_64

FWIW I tried ICU 50.2 (built from sources) and the root locale still
sorts lowercase first.

snapper is showing this failure too [1], which makes me wonder if it's
a locally-customizable option.

In general, I see no good reason for our regression tests to be making
assumptions about exactly how ICU's root locale behaves, so I'd suggest
just lobotomizing this test case so it doesn't depend on upper/lower
sort order.

Yeah, I think we've found enough cases in old still not dead distros to make this reasonable.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: pgsql: Add standard collation UNICODE

От

Tom Lane

Дата:

12 марта 2023 г., 23:10:02

Andrew Dunstan <andrew@dunslane.net> writes:
> On 2023-03-11 Sa 15:47, Tom Lane wrote:
>> In general, I see no good reason for our regression tests to be making
>> assumptions about exactly how ICU's root locale behaves, so I'd suggest
>> just lobotomizing this test case so it doesn't depend on upper/lower
>> sort order.

> Yeah, I think we've found enough cases in old still not dead distros to 
> make this reasonable.

I see topminnow is showing the symptom too, so it's now pretty clear
that this was once a common behavior.  I dug in ICU's release notes
and couldn't find anything about it, but they must've changed something
somewhen.

            regards, tom lane

Re: pgsql: Add standard collation UNICODE

От

Thomas Munro

Дата:

13 марта 2023 г., 01:47:32

On Mon, Mar 13, 2023 at 9:10 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
> > On 2023-03-11 Sa 15:47, Tom Lane wrote:
> >> In general, I see no good reason for our regression tests to be making
> >> assumptions about exactly how ICU's root locale behaves, so I'd suggest
> >> just lobotomizing this test case so it doesn't depend on upper/lower
> >> sort order.
>
> > Yeah, I think we've found enough cases in old still not dead distros to
> > make this reasonable.
>
> I see topminnow is showing the symptom too, so it's now pretty clear
> that this was once a common behavior.  I dug in ICU's release notes
> and couldn't find anything about it, but they must've changed something
> somewhen.

Most odd, and a bit worrying for the multi-lib concept we were
discussing (now for 17), though not fatal if it's something like
"there was one historical screwup back in 50.whatever, but now we are
more careful", or "it was a packaging error that was fixed but old
systems that aren't getting updated don't have the fix", so not even
ICU's fault (which would fit with Jeff's observation that he can't
repro the issue by building from ICU source at the 50.2 tag, as if
what is in the package is not what was tagged?!).  It'd be nice to
understand what happened here...  What do these disagreeing systems
report for the various version meta-data (Unicode version, CLDR
version, the opaque collator version, etc)?

Re: pgsql: Add standard collation UNICODE

От

Andrew Dunstan

Дата:

13 марта 2023 г., 21:58:34

On 2023-03-12 Su 18:47, Thomas Munro wrote:

On Mon, Mar 13, 2023 at 9:10 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

On 2023-03-11 Sa 15:47, Tom Lane wrote:

In general, I see no good reason for our regression tests to be making
assumptions about exactly how ICU's root locale behaves, so I'd suggest
just lobotomizing this test case so it doesn't depend on upper/lower
sort order.

Yeah, I think we've found enough cases in old still not dead distros to
make this reasonable.

I see topminnow is showing the symptom too, so it's now pretty clear
that this was once a common behavior.  I dug in ICU's release notes
and couldn't find anything about it, but they must've changed something
somewhen.

Most odd, and a bit worrying for the multi-lib concept we were
discussing (now for 17), though not fatal if it's something like
"there was one historical screwup back in 50.whatever, but now we are
more careful", or "it was a packaging error that was fixed but old
systems that aren't getting updated don't have the fix", so not even
ICU's fault (which would fit with Jeff's observation that he can't
repro the issue by building from ICU source at the 50.2 tag, as if
what is in the package is not what was tagged?!).  It'd be nice to
understand what happened here...  What do these disagreeing systems
report for the various version meta-data (Unicode version, CLDR
version, the opaque collator version, etc)?

How would I discover that? Or if it's quicker I can give you access to one or two machines which exhibit the issue.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: pgsql: Add standard collation UNICODE

От

Tom Lane

Дата:

13 марта 2023 г., 22:05:15

Andrew Dunstan <andrew@dunslane.net> writes:
> How would I discover that? Or if it's quicker I can give you access to 
> one or two machines which exhibit the issue.

Looks like Peter already found the answer -- several of the unhappy
animals have gone green since d72900bde.

            regards, tom lane

Re: pgsql: Add standard collation UNICODE

От

Jeff Davis

Дата:

13 марта 2023 г., 22:40:28

On Sat, 2023-03-11 at 15:47 -0500, Tom Lane wrote:
> In general, I see no good reason for our regression tests to be
> making
> assumptions about exactly how ICU's root locale behaves, so I'd
> suggest
> just lobotomizing this test case so it doesn't depend on upper/lower
> sort order.

[ Looks like Peter already found the answer, but I had some additional
commentary so I'll send this anyway. ]

Mystery solved: "und" was not a valid spelling of "root" until version
55. So it fell back to the default locale, which is sensitive to the
environment.

With this in mind, I think we should take the opposite lesson and keep
this test as-is. We don't have to promise the behavior in the docs, but
it did catch a real issue. In fact, it caught a dangerous issue,
because a change to lc_messages could cause a corrupt index.

I'd like to see if there's a way to reliably detect that a fallback to
the environment-sensitive default locale has happened, and throw an
error. I don't think that happens in versions 55 and later, but it
would be nice to have a test because earlier versions of ICU seem
dangerous to me.

Regards,
    Jeff Davis

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: pgsql: Add standard collation UNICODE