Обсуждение: Add standard collation UNICODE

Поиск
Список
Период
Сортировка

Add standard collation UNICODE

От
Peter Eisentraut
Дата:
The SQL standard defines several standard collations.  Most of them are 
only of legacy interest (IMO), but two are currently relevant: UNICODE 
and UCS_BASIC.  UNICODE sorts by the default Unicode collation algorithm 
specifications and UCS_BASIC sorts by codepoint.

When collation support was added to PostgreSQL, we added UCS_BASIC, 
since that could easily be mapped to the C locale.  But there was no 
straightforward way to provide the UNICODE collation.  (Recall that 
collation support came several releases before ICU support.)

With ICU support, we can provide the UNICODE collation, since it's just 
the root locale.  I suppose one hesitation was that ICU was not a 
standard feature, so this would create variations in the default catalog 
contents, or something like that.  But I think now that we are drifting 
to make ICU more prominent, we can just add that anyway.  I think being 
able to say

     COLLATE UNICODE

instead of

     COLLATE "und-x-icu"

or whatever it is, is pretty useful.

So, attached is a small patch to add this.
Вложения

Re: Add standard collation UNICODE

От
Vik Fearing
Дата:
On 3/1/23 11:09, Peter Eisentraut wrote:
> The SQL standard defines several standard collations.  Most of them are 
> only of legacy interest (IMO), but two are currently relevant: UNICODE 
> and UCS_BASIC.  UNICODE sorts by the default Unicode collation algorithm 
> specifications and UCS_BASIC sorts by codepoint.
> 
> When collation support was added to PostgreSQL, we added UCS_BASIC, 
> since that could easily be mapped to the C locale.  But there was no 
> straightforward way to provide the UNICODE collation.  (Recall that 
> collation support came several releases before ICU support.)
> 
> With ICU support, we can provide the UNICODE collation, since it's just 
> the root locale.  I suppose one hesitation was that ICU was not a 
> standard feature, so this would create variations in the default catalog 
> contents, or something like that.  But I think now that we are drifting 
> to make ICU more prominent, we can just add that anyway.  I think being 
> able to say
> 
>      COLLATE UNICODE
> 
> instead of
> 
>      COLLATE "und-x-icu"
> 
> or whatever it is, is pretty useful.
> 
> So, attached is a small patch to add this.

I don't feel competent to review the patch (simple as it is), but +1 on 
the principle.
-- 
Vik Fearing




Re: Add standard collation UNICODE

От
Jeff Davis
Дата:
On Wed, 2023-03-01 at 11:09 +0100, Peter Eisentraut wrote:

> When collation support was added to PostgreSQL, we added UCS_BASIC,
> since that could easily be mapped to the C locale.

Sorting by codepoint should be encoding-independent (i.e. decode to
codepoint first); but the C collation is just strcmp, which is
encoding-dependent. So is UCS_BASIC wrong today?

(Aside: I wonder whether we should differentiate between the libc
provider, which uses strcoll(), and the provider of non-localized
comparisons that just use strcmp(). That would be a better reflection
of what the code actually does.)

> With ICU support, we can provide the UNICODE collation, since it's
> just
> the root locale.

+1

>   I suppose one hesitation was that ICU was not a
> standard feature, so this would create variations in the default
> catalog
> contents, or something like that.

It looks like the way you've handled this is by inserting the collation
with collprovider=icu even if built without ICU support. I think that's
a new case, so we need to make sure it throws reasonable user-facing
errors.

I do like your approach though because, if someone is using a standard
collation, I think "not built with ICU" (feature not supported) is a
better error than "collation doesn't exist". It also effectively
reserves the name "unicode".


--
Jeff Davis
PostgreSQL Contributor Team - AWS





Re: Add standard collation UNICODE

От
Thomas Munro
Дата:
On Sun, Mar 5, 2023 at 7:30 AM Jeff Davis <pgsql@j-davis.com> wrote:
> Sorting by codepoint should be encoding-independent (i.e. decode to
> codepoint first); but the C collation is just strcmp, which is
> encoding-dependent. So is UCS_BASIC wrong today?

It's created for UTF-8 only, and UTF-8 sorts the same way as the
encoded code points, when interpreted as a sequence of unsigned char
by memcmp(), strcmp() etc.  Seems right?



Re: Add standard collation UNICODE

От
Jeff Davis
Дата:
On Sun, 2023-03-05 at 08:27 +1300, Thomas Munro wrote:
> It's created for UTF-8 only, and UTF-8 sorts the same way as the
> encoded code points, when interpreted as a sequence of unsigned char
> by memcmp(), strcmp() etc.  Seems right?

Right, makes sense.

Though in principle, shouldn't someone using another encoding also be
able to use ucs_basic? I'm not sure if that's a practical problem or
not; I'm just curious. Does ICU provide a locale for sorting by code
point?

Regards,
    Jeff Davis




Re: Add standard collation UNICODE

От
Tom Lane
Дата:
Jeff Davis <pgsql@j-davis.com> writes:
> On Sun, 2023-03-05 at 08:27 +1300, Thomas Munro wrote:
>> It's created for UTF-8 only, and UTF-8 sorts the same way as the
>> encoded code points, when interpreted as a sequence of unsigned char
>> by memcmp(), strcmp() etc.  Seems right?

> Right, makes sense.

> Though in principle, shouldn't someone using another encoding also be
> able to use ucs_basic? I'm not sure if that's a practical problem or
> not; I'm just curious. Does ICU provide a locale for sorting by code
> point?

ISTM we could trivially allow it in LATIN1 encoding as well;
strcmp would still have the effect of sorting by unicode code points.

Given the complete lack of field demand for making it work in
other encodings, I'm unexcited about spending more effort than that.

            regards, tom lane



Re: Add standard collation UNICODE

От
Peter Eisentraut
Дата:
On 04.03.23 19:29, Jeff Davis wrote:
> It looks like the way you've handled this is by inserting the collation
> with collprovider=icu even if built without ICU support. I think that's
> a new case, so we need to make sure it throws reasonable user-facing
> errors.

It would look like this:

=> select * from t1 order by b collate unicode;
ERROR:  0A000: ICU is not supported in this build

> I do like your approach though because, if someone is using a standard
> collation, I think "not built with ICU" (feature not supported) is a
> better error than "collation doesn't exist". It also effectively
> reserves the name "unicode".

right




Re: Add standard collation UNICODE

От
Peter Eisentraut
Дата:
On 04.03.23 19:29, Jeff Davis wrote:
> I do like your approach though because, if someone is using a standard
> collation, I think "not built with ICU" (feature not supported) is a
> better error than "collation doesn't exist". It also effectively
> reserves the name "unicode".

By the way, speaking of reserving names, I don't remember the reason for 
this bit in initdb.c:

/*
  * Add SQL-standard names.  We don't want to pin these, so they don't go
  * in pg_collation.h.  But add them before reading system collations, so
  * that they win if libc defines a locale with the same name.
  */

Why don't we want them pinned?

If we add them instead as entries into pg_collation.dat, it seems to 
work for me.

Another question: What is our current thinking on using BCP 47 names? 
The documentation says for example

"""
The first example selects the ICU locale using a “language tag” per BCP 
47. The second example uses the traditional ICU-specific locale syntax. 
The first style is preferred going forward, but it is not supported by 
older ICU versions.
"""

My patch uses 'und' [BCP 47 style], which appears to be in conflict with 
that statement.

But we have had some discussions on how correct that statement is, but I 
don't remember the outcome.




Re: Add standard collation UNICODE

От
Jeff Davis
Дата:
On Wed, 2023-03-08 at 07:21 +0100, Peter Eisentraut wrote:
> On 04.03.23 19:29, Jeff Davis wrote:
> > It looks like the way you've handled this is by inserting the
> > collation
> > with collprovider=icu even if built without ICU support. I think
> > that's
> > a new case, so we need to make sure it throws reasonable user-
> > facing
> > errors.
>
> It would look like this:
>
> => select * from t1 order by b collate unicode;
> ERROR:  0A000: ICU is not supported in this build

Right, the error looks good. I'm just pointing out that before this
patch, having provider='i' in a build without ICU was a configuration
mistake; whereas afterward every database will have a collation with
provider='i' whether it has ICU support or not. I think that's fine,
I'm just double-checking.

Why is "unicode" only provided for the UTF-8 encoding? For "ucs_basic"
that makes some sense, because the implementation only works in UTF-8.
But here we are using ICU, and the "und" locale should work for any
ICU-supported encoding. I suggest that we use collencoding=-1 for
"unicode", and the docs can just add a note next to "ucs_basic" that it
only works for UTF-8, because that's the weird case.

For the docs, I suggest that you clarify that "ucs_basic" has the same
behavior as the C locale does *in the UTF-8 encoding*. Not all users
might pick up on the subtlety that the C locale has different behaviors
in different encodings.

Other than that, it looks good.

--
Jeff Davis
PostgreSQL Contributor Team - AWS





Re: Add standard collation UNICODE

От
Peter Eisentraut
Дата:
On 08.03.23 19:25, Jeff Davis wrote:
> Why is "unicode" only provided for the UTF-8 encoding? For "ucs_basic"
> that makes some sense, because the implementation only works in UTF-8.
> But here we are using ICU, and the "und" locale should work for any
> ICU-supported encoding. I suggest that we use collencoding=-1 for
> "unicode", and the docs can just add a note next to "ucs_basic" that it
> only works for UTF-8, because that's the weird case.

make sense

> For the docs, I suggest that you clarify that "ucs_basic" has the same
> behavior as the C locale does *in the UTF-8 encoding*. Not all users
> might pick up on the subtlety that the C locale has different behaviors
> in different encodings.

Ok, word-smithed a bit more.

How about this patch version?


Вложения

Re: Add standard collation UNICODE

От
Jeff Davis
Дата:
On Thu, 2023-03-09 at 11:21 +0100, Peter Eisentraut wrote:
> How about this patch version?

Looks good to me.

Regards,
    Jeff Davis





Re: Add standard collation UNICODE

От
Peter Eisentraut
Дата:
On 09.03.23 20:23, Jeff Davis wrote:
> On Thu, 2023-03-09 at 11:21 +0100, Peter Eisentraut wrote:
>> How about this patch version?
> 
> Looks good to me.

Committed, after adding a test.




Re: Add standard collation UNICODE

От
Jeff Davis
Дата:
On Thu, 2023-03-09 at 11:23 -0800, Jeff Davis wrote:
> Looks good to me.

Another thought: for ICU, do we want the default collation to be
UNICODE (root collation)? What we have now gets the default from the
environment, which is consistent with the libc provider.

But now that we have the UNICODE collation, it makes me wonder if we
should just default to that. The server's environment doesn't
necessarily say much about the locale of the data stored in it or the
locale of the applications accessing it.

I don't have a strong opinion here, but I thought I'd raise the issue.

By my count, >50% of locales are actually just the root locale. I'm not
sure if that should matter or not -- we don't want to weigh some
locales over others -- but I found it interesting.

Regards,
    Jeff Davis




Re: Add standard collation UNICODE

От
Laurenz Albe
Дата:
On Thu, 2023-03-23 at 13:16 -0700, Jeff Davis wrote:
> Another thought: for ICU, do we want the default collation to be
> UNICODE (root collation)? What we have now gets the default from the
> environment, which is consistent with the libc provider.
>
> But now that we have the UNICODE collation, it makes me wonder if we
> should just default to that. The server's environment doesn't
> necessarily say much about the locale of the data stored in it or the
> locale of the applications accessing it.
>
> I don't have a strong opinion here, but I thought I'd raise the issue.
>
> By my count, >50% of locales are actually just the root locale. I'm not
> sure if that should matter or not -- we don't want to weigh some
> locales over others -- but I found it interesting.

I second that.  Most people don't pay attention to that when creating a
cluster, so having a locale-agnostic collation is often better than
inheriting whatever default happened to be set in your shell.
For example, the Debian/Ubuntu binary packages create a cluster when
you install the server package, and most people just go on using that.

Yours,
Laurenz Albe



Re: Add standard collation UNICODE

От
Peter Eisentraut
Дата:
On 23.03.23 21:16, Jeff Davis wrote:
> Another thought: for ICU, do we want the default collation to be
> UNICODE (root collation)? What we have now gets the default from the
> environment, which is consistent with the libc provider.
> 
> But now that we have the UNICODE collation, it makes me wonder if we
> should just default to that. The server's environment doesn't
> necessarily say much about the locale of the data stored in it or the
> locale of the applications accessing it.

As long as we still have to initialize the libc locale fields to some 
language, I think it would be less confusing to keep the ICU locale on 
the same language.

If we ever manage to get rid of that, then I would also support making 
the ICU locale the root collation by default.





Re: Add standard collation UNICODE

От
Joe Conway
Дата:
On 3/28/23 06:07, Peter Eisentraut wrote:
> On 23.03.23 21:16, Jeff Davis wrote:
>> Another thought: for ICU, do we want the default collation to be
>> UNICODE (root collation)? What we have now gets the default from the
>> environment, which is consistent with the libc provider.
>> 
>> But now that we have the UNICODE collation, it makes me wonder if we
>> should just default to that. The server's environment doesn't
>> necessarily say much about the locale of the data stored in it or the
>> locale of the applications accessing it.
> 
> As long as we still have to initialize the libc locale fields to some
> language, I think it would be less confusing to keep the ICU locale on
> the same language.

I definitely agree with that.

> If we ever manage to get rid of that, then I would also support making
> the ICU locale the root collation by default.

+1

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Add standard collation UNICODE

От
Jeff Davis
Дата:
On Tue, 2023-03-28 at 08:46 -0400, Joe Conway wrote:
> > As long as we still have to initialize the libc locale fields to
> > some
> > language, I think it would be less confusing to keep the ICU locale
> > on
> > the same language.
>
> I definitely agree with that.

Sounds good -- no changes then.

Regards,
    Jeff Davis

>



Re: Add standard collation UNICODE

От
"Daniel Verite"
Дата:
    Peter Eisentraut wrote:

>      COLLATE UNICODE
>
> instead of
>
>      COLLATE "und-x-icu"
>
> or whatever it is, is pretty useful.
>
> So, attached is a small patch to add this.

This collation has an empty pg_collation.collversion column, instead
of being set to the same value as "und-x-icu" to track its version.


postgres=# select * from pg_collation  where collname='unicode' \gx
-[ RECORD 1 ]-------+--------
oid            | 963
collname        | unicode
collnamespace        | 11
collowner        | 10
collprovider        | i
collisdeterministic | t
collencoding        | -1
collcollate        |
collctype        |
colliculocale        | und
collicurules        |
collversion        |

The original patch implements this as an INSERT in which it would be easy to
fix I guess, but in current HEAD it comes as an entry in
include/catalog/pg_collation.dat:

{ oid => '963',
  descr => 'sorts using the Unicode Collation Algorithm with default
settings',
  collname => 'unicode', collprovider => 'i', collencoding => '-1',
  colliculocale => 'und' },

Should it be converted back into an INSERT or better left
in this file and collversion being updated afterwards?


Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite



Re: Add standard collation UNICODE

От
Peter Eisentraut
Дата:
On 27.04.23 13:44, Daniel Verite wrote:
> This collation has an empty pg_collation.collversion column, instead
> of being set to the same value as "und-x-icu" to track its version.

> The original patch implements this as an INSERT in which it would be easy to
> fix I guess, but in current HEAD it comes as an entry in
> include/catalog/pg_collation.dat:
> 
> { oid => '963',
>    descr => 'sorts using the Unicode Collation Algorithm with default
> settings',
>    collname => 'unicode', collprovider => 'i', collencoding => '-1',
>    colliculocale => 'und' },
> 
> Should it be converted back into an INSERT or better left
> in this file and collversion being updated afterwards?

How about we do it with an UPDATE command.  We already do this for 
pg_database in a similar way.  See attached patch.

Вложения

Re: Add standard collation UNICODE

От
Peter Eisentraut
Дата:
On 08.05.23 17:48, Peter Eisentraut wrote:
> On 27.04.23 13:44, Daniel Verite wrote:
>> This collation has an empty pg_collation.collversion column, instead
>> of being set to the same value as "und-x-icu" to track its version.
> 
>> The original patch implements this as an INSERT in which it would be 
>> easy to
>> fix I guess, but in current HEAD it comes as an entry in
>> include/catalog/pg_collation.dat:
>>
>> { oid => '963',
>>    descr => 'sorts using the Unicode Collation Algorithm with default
>> settings',
>>    collname => 'unicode', collprovider => 'i', collencoding => '-1',
>>    colliculocale => 'und' },
>>
>> Should it be converted back into an INSERT or better left
>> in this file and collversion being updated afterwards?
> 
> How about we do it with an UPDATE command.  We already do this for 
> pg_database in a similar way.  See attached patch.

This has been committed.