Обсуждение: [HACKERS] Patch to address concerns about ICU collcollate stability in v10(Was: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?)

Поиск
Список
Период
Сортировка
On Fri, Sep 22, 2017 at 11:34 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> After reviewing this thread and testing around a bit, I think the code
> is mostly fine as it is, but the documentation is lacking.  Hence,
> attached is a patch to expand the documentation quite a bit, especially
> to document in more detail what ICU locale strings are accepted.

Attached is my counterproposal. This patch has us always make sure
that collcollate is in BCP 47 format, even on ICU versions prior to
ICU 54.

Other improvements/bug fixes:

* Sanitizes locale names within CREATE COLLATION. This has been tested
on ICU 42 (earliest supported version) and ICU 55.

* Enforces a NAMEDATALEN restriction for collcollate during CREATE
COLLATION, forbidding collcollate truncation. This is useful because
truncating can allow subtly wrong answers later on.

* Adds DEBUG1 message with ICU display name, so we can satisfy
ourselves that we're getting the expected behavior, to some degree.

I used this to confirm that we get consistent behavior between ICU 42
and 55 for CREATE COLLATION. On ICU 42, keyword collation attributes
(e.g., the emoji keyword, numeric ordering for natural sort order)
still don't work, just as before, but the locale string is still
considered valid. (This is because ucol_setAttribute() is supposed to
be used there).

* Documents the aforementioned keyword collation attribute restriction
on ICU versions before ICU 54. This was needed anyway. We only claim
for Postgres collations what the ICU docs claim for ICU collators,
even though there is reason to believe that some ICU versions before
ICU 54 actually can do better.

* When using ICU 4.2, the examples in the docs (variant collations
like German Phonebook order) now actually work.

The examples are completely broken right now IMV, because the user has
to know that they are on ICU 4.2, which only accepts the old style
locale strings as things stand. And, they'd have no obvious indication
that things were broken without this patch, because there would have
been no sanitization or other feedback.

* Creates root collation as root-x-icu (collcollate "root"), not
und-x-icu. "und" means undefined language.

* Moves all knowledge of ICU version issues to within a few
pg_locale.c routines, leaving the code reasonably well encapsulated.

* Does an encoding conversion when getting a display name for the
initdb collation comment. This needs to be ascii-safe due to the
initdb/template0 DB encoding restriction, but I suspect that the way
we do it now is subtly wrong. This does imply that SQL_ASCII databases
will never get ICU pg_collation entries that come with comments added
by initdb, but such databases were already unable to use the
collations anyway, so this is no loss at all. (SQL_ASCII is mentioned
here as an example of a database collation that ICU doesn't support,
all of which are affected in this way.)

I decided to implement CREATE COLLATION sanitization based on whether
or not a display string can be generated (if not, or if it's empty, we
reject). This seems like about the right level of strictness to me,
because we're still very forgiving. I admit that that's a little bit
arbitrary, but it does seem to be a good match for Postgres; it is
forgiving of things that could plausibly make sense on another ICU
version to some user at some time, but rejects most things that are
inherently wrong, IMHO. You can still ask for Japanese as spoken in
Madagascar, or even specify a language that ICU has never heard of,
and there is no error. It catches syntax errors only. See the slightly
expanded tests for details. I'm very open to negotiating the exact
details of how we sanitize, but any level of sanitization will be at
least a little bit arbitrary (including what we have right now, which
is no sanitization).

Aside from the specific problems for Postgres that I've noted that the
patch prevents or fixes, there is another reason to do this. The old
style locale name format is officially deprecated by ICU, which makes
it seem like we should never expose it to users in the first place.
Per ICU docs:

"Starting with ICU 54, the following naming scheme and its API
functions are deprecated. Use ucol_open() with language tag collation
keywords instead (see Collation API Details)" [1]

[1] http://userguide.icu-project.org/collation/concepts#TOC-Collator-naming-scheme
-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Вложения
On 9/25/17 00:24, Peter Geoghegan wrote:
> * Creates root collation as root-x-icu (collcollate "root"), not
> und-x-icu. "und" means undefined language.

I'm curious about this point.  "und" is defined in BCP 47.  I don't see
"root" defined anywhere.  ICU converts the root collation to "und",
AFAIK, so it seems to agree with the current naming.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

On Mon, Sep 25, 2017 at 11:42 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> On 9/25/17 00:24, Peter Geoghegan wrote:
>> * Creates root collation as root-x-icu (collcollate "root"), not
>> und-x-icu. "und" means undefined language.
>
> I'm curious about this point.  "und" is defined in BCP 47.  I don't see
> "root" defined anywhere.  ICU converts the root collation to "und",
> AFAIK, so it seems to agree with the current naming.

In my patch, "root" is a string that is passed to get a language tag.
That's technically in the old format.

I think that this is another ICU vs. UCA/CLDR thing (this causes much
confusion). Note that "root" is mentioned in the ICU locale explorer,
for example: https://ssl.icu-project.org/icu-bin/locexp

Note also that ucol_open() comments/docs say this:
* @param loc The locale containing the required collation rules.*            Special values for locales can be passed
in-*            if NULL is passed for the locale, the default locale*            collation rules will be used. If empty
string("") or*            "root" are passed, UCA rules will be used.
 

I went with "root" because that produces a meaningful/useful display
name for pg_collation, and seems to be widely used elsewhere.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

On Sun, Sep 24, 2017 at 9:24 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> * Documents the aforementioned keyword collation attribute restriction
> on ICU versions before ICU 54. This was needed anyway. We only claim
> for Postgres collations what the ICU docs claim for ICU collators,
> even though there is reason to believe that some ICU versions before
> ICU 54 actually can do better.

Following some digging, it looks like the restriction actually only
needs to apply on versions prior to ICU 4.6. The internal function
_canonicalize() contains the functionality need for Postgres to avoid
converting to BCP 47 itself on the fly (this was previously believed
to be the case only in ICU 54):

https://ssl.icu-project.org/trac/browser/tags/release-50-1-2/icu4c/source/common/uloc.cpp#L1614

(This is uloc.c on earlier ICU versions.)

(The call stack here is that from ucol_open(), ucol_open_internal() is
called, which calls uloc_getBaseName(), which calls _canonicalize().)

ICU gained that _hasBCP47Extension() branch in ICU 46. The
_canonicalize function has hardly changed since, despite the rewrite
from C to C++ (I checked up until ICU 55, the reasonably current
version I've done most testing on). This is totally consistent with
what both Peter and I have observed, even though the ucol_open() docs
suggest that that stuff only became available within ICU 54. I think
that ucol_open() was updated to mention BCP 47 at the same time as it
mentioned the lifting of the restrictions on extended keywords (when
you no longer had to use ucol_setAttribute()), confusing two basically
unrelated issues.

I suggest that as a compromise, my proposal can be changed in either
of the following ways:

* Do the same thing as I propose, except only introduce the mapping
from/to language tags when ICU version < 46 is in use. This would mean
that we'd be doing that far less in practice.

Or:

* Never do *any* mapping/language tag conversion when opening a
collator, but still consistently canonicalize as BCP 47 on all ICU
versions. This can be done by only supporting versions that have the
required _hasBCP47Extension() branch (ICU >= 46). We'd desupport the
old ICU 42 and ICU 44 versions.

Support for ICU 42 and ICU 44 came fairly late, in commit eccead9 on
August 5th. Both Tom and I expressed reservations about that at the
time. It doesn't seem too late to desupport those 2 versions, leaving
us with a fix that is even less invasive. We actually wouldn't
technically be changing anything about canonicalization at all, this
way. We'd be changing our understanding of when a restriction applies
(not < 54, < 46), at which point the "collcollate =
U_ICU_VERSION_MAJOR_NUM >= 54 ? langtag : name" thing that I take
particular issue with becomes "collcollate = U_ICU_VERSION_MAJOR_NUM
>= 46 ? langtag : name".

This is equivalent to "collcollate = langtag" (which is what it would
actually look like), since we're desupporting earlier versions.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers