Обсуждение: [HACKERS] Patch to address concerns about ICU collcollate stability in v10(Was: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?)
On Fri, Sep 22, 2017 at 11:34 AM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > After reviewing this thread and testing around a bit, I think the code > is mostly fine as it is, but the documentation is lacking. Hence, > attached is a patch to expand the documentation quite a bit, especially > to document in more detail what ICU locale strings are accepted. Attached is my counterproposal. This patch has us always make sure that collcollate is in BCP 47 format, even on ICU versions prior to ICU 54. Other improvements/bug fixes: * Sanitizes locale names within CREATE COLLATION. This has been tested on ICU 42 (earliest supported version) and ICU 55. * Enforces a NAMEDATALEN restriction for collcollate during CREATE COLLATION, forbidding collcollate truncation. This is useful because truncating can allow subtly wrong answers later on. * Adds DEBUG1 message with ICU display name, so we can satisfy ourselves that we're getting the expected behavior, to some degree. I used this to confirm that we get consistent behavior between ICU 42 and 55 for CREATE COLLATION. On ICU 42, keyword collation attributes (e.g., the emoji keyword, numeric ordering for natural sort order) still don't work, just as before, but the locale string is still considered valid. (This is because ucol_setAttribute() is supposed to be used there). * Documents the aforementioned keyword collation attribute restriction on ICU versions before ICU 54. This was needed anyway. We only claim for Postgres collations what the ICU docs claim for ICU collators, even though there is reason to believe that some ICU versions before ICU 54 actually can do better. * When using ICU 4.2, the examples in the docs (variant collations like German Phonebook order) now actually work. The examples are completely broken right now IMV, because the user has to know that they are on ICU 4.2, which only accepts the old style locale strings as things stand. And, they'd have no obvious indication that things were broken without this patch, because there would have been no sanitization or other feedback. * Creates root collation as root-x-icu (collcollate "root"), not und-x-icu. "und" means undefined language. * Moves all knowledge of ICU version issues to within a few pg_locale.c routines, leaving the code reasonably well encapsulated. * Does an encoding conversion when getting a display name for the initdb collation comment. This needs to be ascii-safe due to the initdb/template0 DB encoding restriction, but I suspect that the way we do it now is subtly wrong. This does imply that SQL_ASCII databases will never get ICU pg_collation entries that come with comments added by initdb, but such databases were already unable to use the collations anyway, so this is no loss at all. (SQL_ASCII is mentioned here as an example of a database collation that ICU doesn't support, all of which are affected in this way.) I decided to implement CREATE COLLATION sanitization based on whether or not a display string can be generated (if not, or if it's empty, we reject). This seems like about the right level of strictness to me, because we're still very forgiving. I admit that that's a little bit arbitrary, but it does seem to be a good match for Postgres; it is forgiving of things that could plausibly make sense on another ICU version to some user at some time, but rejects most things that are inherently wrong, IMHO. You can still ask for Japanese as spoken in Madagascar, or even specify a language that ICU has never heard of, and there is no error. It catches syntax errors only. See the slightly expanded tests for details. I'm very open to negotiating the exact details of how we sanitize, but any level of sanitization will be at least a little bit arbitrary (including what we have right now, which is no sanitization). Aside from the specific problems for Postgres that I've noted that the patch prevents or fixes, there is another reason to do this. The old style locale name format is officially deprecated by ICU, which makes it seem like we should never expose it to users in the first place. Per ICU docs: "Starting with ICU 54, the following naming scheme and its API functions are deprecated. Use ucol_open() with language tag collation keywords instead (see Collation API Details)" [1] [1] http://userguide.icu-project.org/collation/concepts#TOC-Collator-naming-scheme -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Вложения
On 9/25/17 00:24, Peter Geoghegan wrote: > * Creates root collation as root-x-icu (collcollate "root"), not > und-x-icu. "und" means undefined language. I'm curious about this point. "und" is defined in BCP 47. I don't see "root" defined anywhere. ICU converts the root collation to "und", AFAIK, so it seems to agree with the current naming. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Sep 25, 2017 at 11:42 AM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > On 9/25/17 00:24, Peter Geoghegan wrote: >> * Creates root collation as root-x-icu (collcollate "root"), not >> und-x-icu. "und" means undefined language. > > I'm curious about this point. "und" is defined in BCP 47. I don't see > "root" defined anywhere. ICU converts the root collation to "und", > AFAIK, so it seems to agree with the current naming. In my patch, "root" is a string that is passed to get a language tag. That's technically in the old format. I think that this is another ICU vs. UCA/CLDR thing (this causes much confusion). Note that "root" is mentioned in the ICU locale explorer, for example: https://ssl.icu-project.org/icu-bin/locexp Note also that ucol_open() comments/docs say this: * @param loc The locale containing the required collation rules.* Special values for locales can be passed in-* if NULL is passed for the locale, the default locale* collation rules will be used. If empty string("") or* "root" are passed, UCA rules will be used. I went with "root" because that produces a meaningful/useful display name for pg_collation, and seems to be widely used elsewhere. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Sun, Sep 24, 2017 at 9:24 PM, Peter Geoghegan <pg@bowt.ie> wrote: > * Documents the aforementioned keyword collation attribute restriction > on ICU versions before ICU 54. This was needed anyway. We only claim > for Postgres collations what the ICU docs claim for ICU collators, > even though there is reason to believe that some ICU versions before > ICU 54 actually can do better. Following some digging, it looks like the restriction actually only needs to apply on versions prior to ICU 4.6. The internal function _canonicalize() contains the functionality need for Postgres to avoid converting to BCP 47 itself on the fly (this was previously believed to be the case only in ICU 54): https://ssl.icu-project.org/trac/browser/tags/release-50-1-2/icu4c/source/common/uloc.cpp#L1614 (This is uloc.c on earlier ICU versions.) (The call stack here is that from ucol_open(), ucol_open_internal() is called, which calls uloc_getBaseName(), which calls _canonicalize().) ICU gained that _hasBCP47Extension() branch in ICU 46. The _canonicalize function has hardly changed since, despite the rewrite from C to C++ (I checked up until ICU 55, the reasonably current version I've done most testing on). This is totally consistent with what both Peter and I have observed, even though the ucol_open() docs suggest that that stuff only became available within ICU 54. I think that ucol_open() was updated to mention BCP 47 at the same time as it mentioned the lifting of the restrictions on extended keywords (when you no longer had to use ucol_setAttribute()), confusing two basically unrelated issues. I suggest that as a compromise, my proposal can be changed in either of the following ways: * Do the same thing as I propose, except only introduce the mapping from/to language tags when ICU version < 46 is in use. This would mean that we'd be doing that far less in practice. Or: * Never do *any* mapping/language tag conversion when opening a collator, but still consistently canonicalize as BCP 47 on all ICU versions. This can be done by only supporting versions that have the required _hasBCP47Extension() branch (ICU >= 46). We'd desupport the old ICU 42 and ICU 44 versions. Support for ICU 42 and ICU 44 came fairly late, in commit eccead9 on August 5th. Both Tom and I expressed reservations about that at the time. It doesn't seem too late to desupport those 2 versions, leaving us with a fix that is even less invasive. We actually wouldn't technically be changing anything about canonicalization at all, this way. We'd be changing our understanding of when a restriction applies (not < 54, < 46), at which point the "collcollate = U_ICU_VERSION_MAJOR_NUM >= 54 ? langtag : name" thing that I take particular issue with becomes "collcollate = U_ICU_VERSION_MAJOR_NUM >= 46 ? langtag : name". This is equivalent to "collcollate = langtag" (which is what it would actually look like), since we're desupporting earlier versions. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers