On 05/02/2019 05:36 PM, Daniel Verite wrote:
> rihad wrote:
>
>> Thanks for the reply. Do you know what would a "decent" ICU collation be
>> to bind to a field's schema definition so it would mimic a UTF-8
>> encoding for a multilingual column? Maybe und-x-icu? We aren't as much
>> concerned about their sortability in most cases, we just want indexes to
>> better handle future PG/ICU upgrades. But what does und(efined) even
>> mean with respect to collations?
> "undefined" in this context means unspecified language and
> unspecified country or region. It implies that no language-specific
> nor regional rule will be applied to compare strings.
>
> Using C.UTF-8 as the collation for text fields to index may be the
> best trade-off in your case. It should be immune to libc and ICU
> upgrades.
>
> With C.UTF-8, a string like 'BC' will sort before 'ab', and punctuation
> and accents will also sort differently than with a linguistic-aware
> collation.
Thanks, I'm a bit confused here. AFAIK indexes are used for at least two
things: for speed and for skipping the ORDER BY step (since btree
indexes are already sorted). Will such an "upgrade-immune" C.UTF-8 index
still work correctly for table lookups? And can the existing
en_US.UTF-8 fields' definition be altered in place, without a
dump+restore? en_US.UTF-8 is the default encoding+locale+collation, it
isn't set explicitly for any of our string columns. I assume there's
some "catch-all" ordering taking place even for the C locale, so there
won't be any bizarre things like b coming before a, or generally for any
language, the second letter of its alphabet coming before the first?
> If your applications care about that, it can be fixed by simply
> adding COLLATE "default" to the ORDER BY clause of the queries that
> are meant to present data to users.
> COLLATE "default" means the collation of the database, which
> presumably would be something like "language_REGION.UTF-8" in your
> case. If you never specified it explicitly, it came from initdb which
> itself got it from the environment of the server.
>
>
> Best regards,