Re: ICU locale validation / canonicalization
From | Jeff Davis |
---|---|
Subject | Re: ICU locale validation / canonicalization |
Date | |
Msg-id | 8c7af6820aed94dc7bc259d2aa7f9663518e6137.camel@j-davis.com |
In response to | Re: ICU locale validation / canonicalization (Peter Eisentraut <peter.eisentraut@enterprisedb.com>) |
Responses | Re: ICU locale validation / canonicalization |
List | pgsql-hackers |
New patch attached. The new patch also includes a GUC that (when
enabled) validates that the collator is actually found.

On Mon, 2023-02-20 at 15:46 +0100, Peter Eisentraut wrote:
> a) BCP47 tags are preferred, and

Agreed.

> b) They don't work with ICU versions before 54.

I tried in versions 50 through 53, and the language tags are supported,
but I think I know why we don't use them:

Prior to version 54, ICU would not set the collator attributes based on
the locale name. That is the same for either language tags or ICU
format locale IDs. However, for ICU format locale IDs, we added special
code to parse the locale string and set the attributes ourselves. We
didn't bother to add the same parsing logic for language tags, so if a
language tag is found in the catalog, the parts of it that specify
collation strength (for example) would be ignored.

I don't know if that's an actual problem when importing the system
collations, because I don't think we use any collator attributes, but
it makes sense that we'd not favor language tags in ICU prior to v54.

> I would support transitioning this forward somehow, but we would need
> to know exactly what the impact would be.

I've done quite a bit of investigation, which I've described upthread.

We need to transition somehow, because the prior behavior is incorrect
for locales like "fr_CA.UTF-8". Our tests suggest that's an acceptable
thing to do, but if we pass that straight to ucol_open(), then it gets
misinterpreted as plain "fr" because it doesn't understand the "." as a
valid separator. We must turn it into a language tag (or at least
canonicalize it) before passing the string to ucol_open().

This misbehavior only affects a small number of locales, which resolve
to a different actual collator than they should. The most problematic
case is during pg_upgrade, where a slight behavior change would result
in corrupt indexes.
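
To make the upgrade concern concrete, here is a minimal sketch of a
guard that keeps the old string whenever the language tag would select
a different collator. It is an illustration only, not code from the
patch: the function name choose_locale_to_store and all of its
parameters are hypothetical, and the two "actual" collator names are
passed in as plain strings, whereas real code would have to obtain them
from ICU after opening each collator.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper (not from the patch).
 *
 * raw         locale string as stored in the old catalog
 * langtag     language tag derived from "raw"
 * raw_actual  name of the collator that "raw" resolves to (assumed input)
 * tag_actual  name of the collator that "langtag" resolves to (assumed input)
 *
 * Returns the string that should be stored in the new catalog.
 */
static const char *
choose_locale_to_store(const char *raw, const char *langtag,
                       const char *raw_actual, const char *tag_actual,
                       bool binary_upgrade)
{
    assert(raw != NULL && langtag != NULL);
    assert(raw_actual != NULL && tag_actual != NULL);

    /* Both spellings select the same collator: the tag is safe to store. */
    if (strcmp(raw_actual, tag_actual) == 0)
        return langtag;

    /*
     * The spellings select different collators.  During binary upgrade,
     * keep the old string so existing indexes keep their sort order.
     */
    if (binary_upgrade)
        return raw;

    /* For a newly created collation, prefer the language tag. */
    return langtag;
}
```

The raw string is returned only when binary_upgrade is true and the two
resolved names differ; in every other case the language tag wins.
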
So during binary upgrade, my patch falls back to the original raw
string (not the language tag) when it resolves to a different actual
collator. If we want to be more paranoid, we could also provide a
compatibility GUC to preserve the old misbehavior for newly-created
collations, too, but I don't think that's necessary.

There is also some interaction with pg_upgrade's ability to check
whether the old and new cluster are compatible. If the catalog
representation of the locale changes, then it could falsely believe the
icu locales aren't compatible, because it's doing a simple string
comparison. But as we are discussing in the other thread[1], the whole
idea of checking for compatibility of the initialized cluster is
strange: pg_upgrade should be in charge of making a compatible cluster
to upgrade into (assuming the binaries are at least compatible). I
don't see this as a major problem; we'll sort out the other thread
first to allow ICU as the default, and then adapt this patch if
necessary.

[1]
https://www.postgresql.org/message-id/20230214175957.idkb7shsqzp5nbll@awork3.anarazel.de

--
Jeff Davis
PostgreSQL Contributor Team - AWS