Re: ICU locale validation / canonicalization
From | Jeff Davis |
---|---|
Subject | Re: ICU locale validation / canonicalization |
Date | |
Msg-id | 060cb1b5d32c8693587b41f8f534ef79d3caecb1.camel@j-davis.com |
In reply to | Re: ICU locale validation / canonicalization (Jeff Davis <pgsql@j-davis.com>) |
List | pgsql-hackers |
On Thu, 2023-02-09 at 14:09 -0800, Jeff Davis wrote:
> It feels like BCP 47 is the right catalog representation. We are
> already using it for the import of initial collations, and it's a
> standard, and there seems to be good support in ICU.

Patch attached.

We should have been canonicalizing all along -- either with uloc_toLanguageTag(), as this patch does, or at least with uloc_canonicalize() -- before passing to ucol_open().

ucol_open() is documented[1] to work on either language tags or ICU format locale IDs. Anything else is invalid and ends up going through some fallback logic, probably after being mis-parsed.

For instance, in ICU 72, "fr_CA.UTF-8" is not a valid ICU format locale ID or a valid language tag, and is resolved by ucol_open() to the actual locale "root"; but if you canonicalize it first (to the ICU format locale ID "fr_CA" or the language tag "fr-CA"), it correctly resolves to the actual locale "fr_CA".

The correct thing to do is canonicalize first and then pass to ucol_open(). But because we didn't canonicalize in the past, there could be raw locale strings stored in the catalog that resolve to the wrong actual collator, and there could be indexes depending on the wrong collator, so we have to be careful during pg_upgrade.

Say someone created two ICU collations, one with locale "en_US.UTF-8" and one with locale "fr_CA.UTF-8" in PG15. When they upgrade to PG16, this patch will check the language tag "en-US" and see that it resolves to the same locale as "en_US.UTF-8", and change to the language tag during upgrade (so "en-US" will be in the new catalog). But when it checks the language tag "fr-CA", it will notice that it resolves to a different locale than "fr_CA.UTF-8", and keep the latter string even though it's wrong, because some indexes might be dependent on that wrong collator.

[1] https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/ucol_8h.html#a3b0bf34733dc208040e4157b0fe5fcd6

--
Jeff Davis
PostgreSQL Contributor Team - AWS
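The upgrade-time rule described in the message can be sketched roughly as follows. This is not the code from the attached patch: choose_catalog_locale(), resolve_fn and stub_resolve() are hypothetical names introduced for illustration only. The resolver is kept abstract so the sketch compiles without ICU; in a real build it would presumably wrap ucol_open() and ucol_getLocaleByType(). The stub hard-codes the fr_CA results quoted above for ICU 72, and it assumes "root" for both English strings, since the message only states that they resolve to the same locale.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical callback: maps a locale string to the "actual locale" of
 * the collator it would open. Assumed to stand in for a wrapper around
 * ucol_open() + ucol_getLocaleByType(); not part of PostgreSQL or ICU.
 */
typedef const char *(*resolve_fn)(const char *locale);

/*
 * Decide which string to store in the new catalog during upgrade:
 * adopt the language tag only if it resolves to the same actual locale
 * as the raw string stored by the old cluster; otherwise keep the raw
 * string so existing indexes keep using the collator they were built with.
 */
static const char *
choose_catalog_locale(const char *raw, const char *tag, resolve_fn resolve)
{
    assert(raw != NULL && tag != NULL && resolve != NULL);

    const char *raw_actual = resolve(raw);
    const char *tag_actual = resolve(tag);

    if (raw_actual != NULL && tag_actual != NULL &&
        strcmp(raw_actual, tag_actual) == 0)
        return tag;     /* same collator: safe to canonicalize */

    return raw;         /* different collator: keep the old string */
}

/*
 * Stub resolver for illustration. Only the fr_CA behaviour comes from
 * the message; returning "root" for everything else (including both
 * English strings) is an assumption made for this sketch.
 */
static const char *
stub_resolve(const char *locale)
{
    if (strcmp(locale, "fr-CA") == 0 || strcmp(locale, "fr_CA") == 0)
        return "fr_CA";
    return "root";
}
```

With this stub, choose_catalog_locale("en_US.UTF-8", "en-US", stub_resolve) yields "en-US", while choose_catalog_locale("fr_CA.UTF-8", "fr-CA", stub_resolve) keeps "fr_CA.UTF-8", matching the two cases in the example.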