Обсуждение: Built-in case-insensitive collation pg_unicode_ci
-------- Proposal -------- New builtin case-insensitive collation PG_UNICODE_CI, where the ordering semantics are just: strcmp(CASEFOLD(arg1), CASEFOLD(arg2)) and the character semantics are the same as PG_UNICODE_FAST. This does not perform the Unicode Collation algorithm (UCA), and it is not a natural language collation. It does not normalize the inputs, either. It's just comparing codepoint values after folding. One interesting case is something like: SELECT 'straße' = 'STRASSE' COLLATE pg_unicode_ci; which is true, because 'ß' folds to 'ss' when using Unicode Default Case Folding. (Note that ICU only makes this equivalence at "level1"; at "level2", the strings would compare as unequal in ICU.) ---------- Motivation ---------- Non-deterministic collations cannot be used by SIMILAR TO, and may cause problems for ILIKE and regexes. The reason is that pattern matching often depends on the character-by-character semantics, but ICU collations aren't constrained enough for these semantics to work. See: "But the definition there is pretty much impossible to implement for nondeterministic collations: It basically says, the predicate is true if the string to be matched is equal, using the applicable collation, to any of the strings in the set of strings described by the regular expression." https://www.postgresql.org/message-id/899e7b5f-b54a-4e1b-9218-bb23534fc2c4%40eisentraut.org However, PG_UNICODE_CI collation does have character-by-character semantics which are well-defined for pattern matching. That takes us a step closer to allowing the database default collation to be case-insensitive. Some also might prefer the simplicity of PG_UNICODE_CI compared with ICU. ICU is still much more flexible for users who know what they want: normalization, various levels of sensitivity, tailored behavior, etc. ------- Details ------- Patch attached. Implementation-wise, it folds one codepoint at a time to avoid unnecessary table lookups. The code must be careful about the case where the result of CASEFOLD() is a different size for each input, and perhaps still leading to a match. Does not implement pattern matching. Would need some discussion to see how it should be integrated with Peter's work. It's currently slower than ICU, but I don't think there's any inherent reason. ------------- Normalization ------------- The argument could be made that we should both normalize and casefold before comparing. ICU does that, or something like that, and we have the infrastructure to do it. But getting that to be both correct and fast is non-trivial, so I didn't want to add the complexity. It's also not clear that we want to do that, or at least not all the time. One option would be to introduce a normalizing collation PG_UNICODE_NCI later, if needed. And at the same time we could also introduce NCASEFOLD() which would have corresponding semantics. ---------- Versioning ---------- Unlike other built-in collations, the order does depend on the version of Unicode, so the collation is given a version equal to the version of Unicode. (Other builtin collations have a version of "1".) That means that indexes, including primary keys, can become inconsistent after a major version upgrade if the version of Unicode has changed. The conditions where this can happen are much narrower than with libc or ICU collations: (a) The database in the prior version must contain code points unassigned as of that version; and (b) Some of those previously-unassigned code points must be assigned to a Cased character in the newer version. It's a smaller problem than a libc or ICU upgrade, which can cause differences in sort order for the same reason (unassigned codepoints later being assigned) as well as many other reasons. Regards, Jeff Davis
Вложения
On Fri, 2025-09-19 at 17:21 -0700, Jeff Davis wrote:
> ----------
> Versioning
> ----------
>
> Unlike other built-in collations, the order does depend on the
> version
> of Unicode...
> That means that indexes, including primary keys, can become
> inconsistent after a major version upgrade...
There's another option here: we can have the PG_UNICODE_CI collation
throw an error when the comparison involves unassigned code points.
That would give us assurance that primary keys remain consistent across
upgrades.
While not every user would want that for their entire database, I think
it's a good idea in the case of PG_UNICODE_CI:
* It would ensure that all primary keys using any builtin collation
are stable across upgrades.
* If the data is somewhere else, like an unindexed column or an index
with a different collation, then unassigned code points would still be
just fine.
* The cases where you'd want to use the PG_UNICODE_CI collation are
also the cases where it's not so important to permit very-recently-
assigned code points.
* Applications already need to expect errors when inserting into a
primary key or unique index, so it wouldn't require rewriting
applications to handle such errors.
Regards,
Jeff Davis
On Fri, 2025-09-19 at 17:21 -0700, Jeff Davis wrote: > -------- > Proposal > -------- > > New builtin case-insensitive collation PG_UNICODE_CI, where the > ordering semantics are just: > > strcmp(CASEFOLD(arg1), CASEFOLD(arg2)) > > and the character semantics are the same as PG_UNICODE_FAST. I think that this is interesting. > ---------- > Motivation > ---------- > > Non-deterministic collations cannot be used by SIMILAR TO, and may > cause problems for ILIKE and regexes. The reason is that pattern > matching often depends on the character-by-character semantics, but ICU > collations aren't constrained enough for these semantics to work. See: > > However, PG_UNICODE_CI collation does have character-by-character > semantics which are well-defined for pattern matching. > > That takes us a step closer to allowing the database default collation > to be case-insensitive. What is still missing for that? Pattern matching? > ---------- > Versioning > ---------- > > Unlike other built-in collations, the order does depend on the version > of Unicode, so the collation is given a version equal to the version of > Unicode. (Other builtin collations have a version of "1".) > > That means that indexes, including primary keys, can become > inconsistent after a major version upgrade if the version of Unicode > has changed. The conditions where this can happen are much narrower > than with libc or ICU collations: > > (a) The database in the prior version must contain code points > unassigned as of that version; and > (b) Some of those previously-unassigned code points must be assigned > to a Cased character in the newer version. That's an improvement for people who are ready to perform a test upgrade and check if any indexes are corrupted - they will likely see that none are, so no index needs to be rebuilt. I tried your patch. It works as advertised, and I didn't manage to break it. Yours, Laurenz Albe
On 20.09.25 02:21, Jeff Davis wrote: > New builtin case-insensitive collation PG_UNICODE_CI, where the > ordering semantics are just: > > strcmp(CASEFOLD(arg1), CASEFOLD(arg2)) > > and the character semantics are the same as PG_UNICODE_FAST. If it's a variant of PG_UNICODE_FAST, then it ought to be called PG_UNICODE_FAST_CI or similar. Otherwise, one would expect it to be a variant of PG_UNICODE (if that existed, but there is also UNICODE). But that name is also dubious since you later write that it's not actually fast. > Non-deterministic collations cannot be used by SIMILAR TO, and may > cause problems for ILIKE and regexes. The reason is that pattern > matching often depends on the character-by-character semantics, but ICU > collations aren't constrained enough for these semantics to work. This reasoning is a bit narrow. SIMILAR TO is kind of deprecated, and ILIKE is kind of stupid, and regexes have their own way to control case-sensitivity. Nevertheless, I think there would be some value to provide CI (and maybe accent-insensitive?) collations that operate separately from the "nondeterministic" mechanism. But then I would like to see a comprehensive approach that covers a variety of providers and locales. For example, I would expect there to be something like a "sv_SE_CI" locale, either available by default or easily created.
On Thu, 2025-10-16 at 15:46 +0200, Peter Eisentraut wrote:
> If it's a variant of PG_UNICODE_FAST, then it ought to be called
> PG_UNICODE_FAST_CI or similar. Otherwise, one would expect it to be
> a
> variant of PG_UNICODE (if that existed, but there is also UNICODE).
>
> But that name is also dubious since you later write that it's not
> actually fast.
My reasoning for the naming was that "PG" means it's our locale,
"UNICODE" describes the ctype behavior and "FAST" describes the
collation behavior (that is, fast but not human-friendly).
For this new case-insensitive collation, "UNICODE" still describes the
ctype behavior, but "CI" is a better description of the collation
behavior -- the main purpose is to be case-insensitive, not to be fast.
Other naming suggestions are welcome.
>
> This reasoning is a bit narrow. SIMILAR TO is kind of deprecated,
I didn't know that. Deprecated in the standard, or in Postgres? Should
we document that?
> and
> ILIKE is kind of stupid,
Should we be discouraging its use in the docs?
> Nevertheless, I think there would be some value to provide CI (and
> maybe
> accent-insensitive?) collations that operate separately from the
> "nondeterministic" mechanism. But then I would like to see a
> comprehensive approach that covers a variety of providers and
> locales.
> For example, I would expect there to be something like a "sv_SE_CI"
> locale, either available by default or easily created.
I don't think that it's possible in another provider to get CI pattern
matching semantics that are consistent with collation semantics. libc
doesn't offer case-insensitive collations at all, and ICU doesn't give
us enough information about the nature of a collation to use it for
pattern matching.
For instance:
'e' SIMILAR TO 'e_' -- locale=en-u-ka-shifted
should be true, because the right side could expand to 'e ', which
matches the left side (because the locale ignores the space). But as
mentioned in the other thread, it's not practical to guess at all the
possible expansions that might lead to a match in that locale.
Even with a basic CI locale:
'é' SIMILAR TO 'e_' -- locale=en-u-ks-level2
should also return true, because the right hand side can expand to
U&'e\0301', and the en-u-ks-level2 locale does basic normalization and
will consider that a match to U&'\00E9'.
Given that we don't have a lot of visibility into what the ICU locale
is doing, I don't see a safe way to decide whether an ICU locale will
match our expectations about pattern matching. In fact, I don't think
any ICU locales work because of the normalization issue in the second
pattern, unless we redefine SIMILAR TO to be normalization-aware (which
I'm not suggesting).
The way I defined PG_UNICODE_CI (or whatever we want to name it), it
avoids these problems: it does codepoint-at-a-time folding with no
attempt to normalize, and that's all. Both SIMILAR TO expressions above
will return false, because the right hand side always expands to a
longer string than the left, and can never match.
That being said, PG_UNICODE_CI is a collation, not a complete solution
for SIMILAR TO or regexes.
Is there interest in the collation independently as just a simple case-
insensitive collation? It would also be nice for testing/documentation,
and it's the only other collation that would be in-scope for the
builtin provider (because it doesn't require human-friendly ordering,
which is better handled by ICU).
Regards,
Jeff Davis