Re: encoding affects ICU regex character classification

Поиск

Список

Период

Сортировка

От	Jeff Davis
Тема	Re: encoding affects ICU regex character classification
Дата	12 декабря 2023 г. 21:39:55
Msg-id	03959b5f8b37b6126d0b9c6ac16c960a94fcd3bb.camel@j-davis.com обсуждение исходный текст
Ответ на	Re: encoding affects ICU regex character classification (Thomas Munro <thomas.munro@gmail.com>)
Ответы	Re: encoding affects ICU regex character classification
Список	pgsql-hackers

Дерево обсуждения

On Sun, 2023-12-10 at 10:39 +1300, Thomas Munro wrote:

>
> How would you specify what you want?

One proposal would be to have a builtin collation provider:

https://postgr.es/m/9d63548c4d86b0f820e1ff15a83f93ed9ded4543.camel@j-davis.com

I don't think there are very many ctype options, but they could be
specified as part of the locale, or perhaps even as some provider-
specific options specified at CREATE COLLATION time.

> As with collating, I like the
> idea of keeping support for libc even if it is terrible (some libcs
> more than others) and eventually not the default, because I think
> optional agreement with other software on the same host is a feature.

Of course we should keep the libc support around. I'm not sure how
relevant such a feature is, but I don't think we actually have to
remove it.

> Unless you also
> implement built-in case mapping, you'd still have to call libc or ICU
> for that, right?

We can do built-in case mapping, see:

https://postgr.es/m/ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.camel@j-davis.com

>   It seems a bit strange to use different systems for
> classification and mapping.  If you do implement mapping too, you
> have
> to decide if you believe it is language-dependent or not, I think?

A complete solution would need to do the language-dependent case
mapping. But that seems to only be 3 locales ("az", "lt", and "tr"),
and only a handful of mapping changes, so we can handle that with the
builtin provider as well.

> Hmm, let's see what we're doing now... for ICU the regex code is
> using
> "simple" case mapping functions like u_toupper(c) that don't take a
> locale, so no Turkish i/İ conversion for you, unlike our SQL
> upper()/lower(), which this is supposed to agree with according to
> the
> comments at the top.  I see why: POSIX can only do one-by-one
> character mappings (which cannot handle Greek's context-sensitive
> Σ->σ/ς or German's multi-character ß->SS)

Regexes are inherently character-by-character, so transformations like
ß->SS are not going to work for case-insensitive regex matching
regardless of the provider.

Σ->σ/ς does make sense, and what we have seems to be just broken:

  select 'ς' ~* 'Σ'; -- false in both libc and ICU
  select 'Σ' ~* 'ς'; -- true in both libc and ICU

Similarly for titlecase variants:

  select 'ǅ' ~* 'ǆ'; -- false in libc and ICU
  select 'ǆ' ~* 'ǅ'; -- true in libc and ICU

If we do the case mapping ourselves, we can make those work. We'd just
have to modify the APIs a bit so that allcases() can actually get all
of the case variants, rather than relying on just towupper/towlower.

Regards,
    Jeff Davis

В списке pgsql-hackers по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: encoding affects ICU regex character classification