Re: ICU locale validation / canonicalization

From	Jeff Davis
Subject	Re: ICU locale validation / canonicalization
Date	10 February 2023, 17:53:58
Msg-id	33acfc3a772224d668042bd2cbef88e91704ce25.camel@j-davis.com
In response to	Re: ICU locale validation / canonicalization (Robert Haas <robertmhaas@gmail.com>)
Responses	Re: ICU locale validation / canonicalization, Re: ICU locale validation / canonicalization
List	pgsql-hackers

On Fri, 2023-02-10 at 09:43 -0500, Robert Haas wrote:
> On Thu, Feb 9, 2023 at 5:09 PM Jeff Davis <pgsql@j-davis.com> wrote:
> > I do like the ICU format locale IDs from a readability standpoint.
> > "en_US@colstrength=primary" is more meaningful to me than
> > "en-US-u-ks-level1" (the equivalent language tag).
>
> Sadly, neither of those means a whole lot to me? :-(
>
> How did you find out that those are equivalent?

In our tests you can see colstrength=primary is used to mean "case
insensitive". That's where I picked up the "colstrength" keyword, which
is also present in the ICU sources, but now that you ask I'm embarrassed
that I don't see the keyword itself documented very well.

This document
https://unicode-org.github.io/icu/userguide/locale/#keywords
lists keywords, but colstrength is not there. It's easy enough to find
in the ICU source; I'm probably just missing the document.

Here's the API reference, which tells you that you can set the strength
of a collator (using the API, not the keyword):
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/ucol_8h.html#acc801048729e684bcabed328be85f77a

The more precise definitions of the strengths are here:
https://unicode-org.github.io/icu/userguide/collation/concepts.html#comparison-levels

Regarding the equivalence of the two forms, uloc_toLanguageTag() and
uloc_forLanguageTag() are inverses. As far as I can tell (a lower degree
of assurance than you are looking for), if one succeeds, then the other
will also succeed and produce the original result.

There are a couple more documents here (TR35):
http://www.unicode.org/reports/tr35/
https://www.unicode.org/reports/tr35/tr35-collation.html#Setting_Options
that seem to cover the "ks-level1" setting and how it maps to the
collation strength.
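
To make the correspondence concrete, here is a toy sketch in Python. It
is not ICU's implementation and does not call ICU; the function names
are made up for illustration, and it handles only a language, a region,
and the collation-strength keyword. The value pairs (level1/primary and
so on) are the ones TR35 lists for the "ks" key.

```python
# Toy illustration only: covers language, region, and collation strength.
# The function names are hypothetical; this does not call or mirror ICU code.

# TR35 "ks" values paired with the ICU-format "colstrength" values.
KS_TO_STRENGTH = {
    "level1": "primary",
    "level2": "secondary",
    "level3": "tertiary",
    "level4": "quaternary",
    "identic": "identical",
}
STRENGTH_TO_KS = {v: k for k, v in KS_TO_STRENGTH.items()}


def to_language_tag_sketch(locale_id: str) -> str:
    """ICU-format locale ID -> BCP 47 language tag (strength keyword only)."""
    base, _, keywords = locale_id.partition("@")
    tag = base.replace("_", "-")
    strength = None
    for item in filter(None, keywords.split(";")):
        key, _, value = item.partition("=")
        if key.lower() != "colstrength":
            raise ValueError(f"keyword not handled by this sketch: {key!r}")
        strength = value.lower()
    if strength is not None:
        tag += "-u-ks-" + STRENGTH_TO_KS[strength]
    return tag


def for_language_tag_sketch(tag: str) -> str:
    """BCP 47 language tag -> ICU-format locale ID (ks key only)."""
    base, sep, extension = tag.partition("-u-")
    locale_id = base.replace("-", "_")
    if sep:
        parts = extension.split("-")
        if len(parts) != 2 or parts[0] != "ks":
            raise ValueError(f"extension not handled by this sketch: {extension!r}")
        locale_id += "@colstrength=" + KS_TO_STRENGTH[parts[1]]
    return locale_id


if __name__ == "__main__":
    tag = to_language_tag_sketch("en_US@colstrength=primary")
    print(tag)
    print(for_language_tag_sketch(tag))
```

The round trip through both functions returns the original string for
the inputs this sketch supports, which is the property I'm assuming
(but have not proven) holds for the real ICU functions.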

My examination of these standards is very superficial -- I'm basically
just checking that they seem to be there. If I search for a string like
"en-US-u-ks-level1", I only find Postgres-related results, so you could
also question whether these standards are actually used.

Using BCP 47 tags for ICU locale strings, and moving to ICU (as
discussed in the other thread) is basically a leap of faith in ICU. The
docs aren't perfect, the source is hard to read, and we've found bugs.
But it seems like a better place for us than libc for the reasons I
mentioned in the other thread.

> > And the format is specified[1],
> > even though it's not an independent standard. But I think the
> > benefits of better validation, an independent standard, and the
> > fact that we're already favoring BCP47 outweigh my subjective
> > opinion.
>
> See, I'm confused, because that link says "If a keyword list is
> present it must be preceded by an at-sign" which makes it sound like
> it is talking about stuff like en_US@colstrength=primary rather than
> stuff like en-US-u-ks-level1. The examples are all that way too, like
> it gives examples like en_IE@currency=IEP and
> fr@collation=phonebook;calendar=islamic-civil.

My paragraph was unclear; let me restate the point:

To represent ICU locale strings in the catalog consistently, we have
two choices, which as far as I can tell are equivalent:

1. ICU format Locale IDs. These are more readable, and still specified
(albeit non-standard).

2. BCP47 language tags. These are standardized, there's better
validation with "strict" mode, and we are already using them.

Honestly I don't think it's hugely important which one we pick. But
being consistent is important, so we need to pick one, and BCP 47 seems
like the better option to me.

--
Jeff Davis
PostgreSQL Contributor Team - AWS
