Re: Character expansion with ICU collations
From | Finnerty, Jim |
---|---|
Subject | Re: Character expansion with ICU collations |
Date | |
Msg-id | 10F78B0E-3C4B-4BF8-9EF0-BEE684F4C8CC@amazon.com |
In response to | Re: Character expansion with ICU collations ("Finnerty, Jim" <jfinnert@amazon.com>) |
List | pgsql-hackers |
I have a proposal for how to support tailoring rules in ICU collations. The ucol_openRules() function is an alternative to the ucol_open() function that PostgreSQL calls today, but it takes the collation strength as one of its parameters, so the locale string would need to be parsed before creating the collator. After the collator is created using either ucol_openRules() or ucol_open(), the ucol_setAttribute() function may be used to set individual attributes from keyword=value pairs in the locale string, as is done now, except that the strength probably can't be changed after opening the collator with ucol_openRules(). So the logic in pg_locale.c would need to be reorganized a little bit, but that sounds straightforward.

One simple solution would be to have the tailoring rules be specified as a new keyword=value pair, such as colTailoringRules=<rulestring>. Since the <rulestring> may contain single quote characters or PostgreSQL escape characters, any single quote characters or escapes would need to be escaped using PostgreSQL escape rules. If colTailoringRules is present, colStrength would also be known prior to opening the collator, or would default to tertiary, and we would keep a local flag indicating that we should not process the colStrength keyword again, if specified.

Representing the tailoring rules as just another keyword=value in the locale string means that we don't need any change to the catalog to store it. It's just part of the locale specification. I think we wouldn't even need to bump the catversion.

Are there any tailoring rules, such as expansions and contractions, that we should disallow? I realize that we don't handle nondeterministic collations in LIKE or regular expression operations as of PG14, but given expr LIKE 'a%' on a database with a UTF-8 encoding and arbitrary tailoring rules that include expansions and contractions, is it still guaranteed that expr must sort BETWEEN 'a' AND ('a' || E'\uFFFF')?
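To make the pre-parsing step concrete, here is a minimal sketch in plain C of the first pass over the locale string. It is not PostgreSQL or ICU code: the type `PreparsedLocale`, the enum `SketchStrength`, and the functions `key_is`, `parse_strength`, and `preparse_locale` are all hypothetical names for illustration, and `colTailoringRules` is the keyword proposed above, not an existing one. The sketch assumes the old-style `locale@key=value;key=value` form and, as a simplification, that the rule string contains no `;` (real tailoring rules can, so a real implementation would need an escaping convention).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for ICU's UCollationStrength. */
typedef enum
{
    STRENGTH_PRIMARY,
    STRENGTH_SECONDARY,
    STRENGTH_TERTIARY,
    STRENGTH_QUATERNARY,
    STRENGTH_IDENTICAL
} SketchStrength;

/* Hypothetical result of the first pass over the locale string. */
typedef struct
{
    char            rules[256];     /* value of colTailoringRules, if any */
    bool            has_rules;      /* true if colTailoringRules was found */
    SketchStrength  strength;       /* colStrength, or tertiary by default */
    bool            strength_fixed; /* skip colStrength in the later pass */
} PreparsedLocale;

/* Compare a length-delimited token against a NUL-terminated name. */
static bool
key_is(const char *tok, size_t len, const char *name)
{
    return strlen(name) == len && strncmp(tok, name, len) == 0;
}

/* Map a colStrength value to a level; unknown values fall back to tertiary. */
static SketchStrength
parse_strength(const char *val, size_t len)
{
    if (key_is(val, len, "primary"))
        return STRENGTH_PRIMARY;
    if (key_is(val, len, "secondary"))
        return STRENGTH_SECONDARY;
    if (key_is(val, len, "quaternary"))
        return STRENGTH_QUATERNARY;
    if (key_is(val, len, "identical"))
        return STRENGTH_IDENTICAL;
    return STRENGTH_TERTIARY;
}

/*
 * First pass: pull out the two values that would have to be known before
 * the collator is opened with rules. Assumes no ';' inside the rule string.
 */
static void
preparse_locale(const char *locale, PreparsedLocale *out)
{
    const char *p;

    assert(locale != NULL && out != NULL);

    out->rules[0] = '\0';
    out->has_rules = false;
    out->strength = STRENGTH_TERTIARY;
    out->strength_fixed = false;

    p = strchr(locale, '@');
    if (p == NULL)
        return;                 /* no keyword section at all */
    p++;

    while (*p != '\0')
    {
        size_t      pairlen = strcspn(p, ";");
        const char *eq = memchr(p, '=', pairlen);

        if (eq != NULL)
        {
            size_t      keylen = (size_t) (eq - p);
            const char *val = eq + 1;
            size_t      vallen = pairlen - keylen - 1;

            if (key_is(p, keylen, "colTailoringRules") &&
                vallen < sizeof(out->rules))
            {
                memcpy(out->rules, val, vallen);
                out->rules[vallen] = '\0';
                out->has_rules = true;
            }
            else if (key_is(p, keylen, "colStrength"))
                out->strength = parse_strength(val, vallen);
        }

        p += pairlen;
        if (*p == ';')
            p++;
    }

    /* With rules present, strength is consumed at open time, not afterwards. */
    out->strength_fixed = out->has_rules;
}
```

For a string such as `und@colStrength=secondary;colTailoringRules=&a < b`, the caller would pass `rules` and `strength` to ucol_openRules() and then skip colStrength while applying the remaining keywords with ucol_setAttribute(); without colTailoringRules it would fall through to the existing ucol_open() path.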