Re: Built-in CTYPE provider

From	Jeff Davis
Subject	Re: Built-in CTYPE provider
Date	December 29, 2023, 02:57:16
Msg-id	804eb67b37f41d3afeb2b6469cbe8bfa79c562cc.camel@j-davis.com
In reply to	Re: Built-in CTYPE provider (Jeff Davis <pgsql@j-davis.com>)
Replies	Re: Built-in CTYPE provider Re: Built-in CTYPE provider Re: Built-in CTYPE provider
List	pgsql-hackers

On Wed, 2023-12-27 at 17:26 -0800, Jeff Davis wrote:
> Attached is an implementation of a built-in provider for the "C.UTF-
> 8"

Attached a more complete version that fixes a few bugs, stabilizes the
tests, and improves the documentation. I optimized the performance, too
-- now it's beating both libc's "C.utf8" and ICU "en-US-x-icu" for both
collation and case mapping (numbers below).

It's really nice to finally be able to have platform-independent tests
that work on any UTF-8 database.

Simple character classification:

  SELECT 'Á' ~ '[[:alpha:]]' COLLATE C_UTF8;

Case mapping is more interesting (note that accented characters are
being properly mapped, and it's using the titlecase variant "ǅ"):

  SELECT initcap('axxE áxxÉ ǄxxǄ ǅxxx ǆxxx' COLLATE C_UTF8);
           initcap
  --------------------------
   Axxe Áxxé ǅxxǆ ǅxxx ǅxxx
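
For readers without the patch applied, the titlecase behavior above can be
reproduced outside PostgreSQL. This is a minimal sketch in Python, assuming
that Python's str.title() is a fair stand-in for initcap() on this input.
Python applies full Unicode case mappings, which is not necessarily what
C_UTF8 does; the two happen to agree here. The helper name initcap_sketch is
hypothetical.

```python
import unicodedata

# The three forms of the digraph: U+01C4 (upper), U+01C5 (title), U+01C6 (lower).
DZ_UPPER, DZ_TITLE, DZ_LOWER = "\u01c4", "\u01c5", "\u01c6"


def initcap_sketch(text: str) -> str:
    """Approximate initcap(): titlecase the first letter of each word, lowercase the rest."""
    return text.title()


# Same input as the SQL example, with non-ASCII letters written as escapes.
sample = f"axxE \u00e1xx\u00c9 {DZ_UPPER}xx{DZ_UPPER} {DZ_TITLE}xxx {DZ_LOWER}xxx"
result = initcap_sketch(sample)
print(result)

# General category of the titlecase digraph, distinct from upper and lower.
print(unicodedata.category(DZ_TITLE))
```

The point of the example is that titlecase is a third case form: a mapping
built only from upper() and lower() would produce U+01C4 at the start of a
word rather than U+01C5.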

Even more interesting -- test that non-Latin characters can still be a
member of a case-insensitive range:

  -- capital delta is member of lowercase range gamma to lambda
  SELECT 'Δ' ~* '[γ-λ]' COLLATE C_UTF8;
  -- small delta is member of uppercase range gamma to lambda
  SELECT 'δ' ~* '[Γ-Λ]' COLLATE C_UTF8;
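
One way to read these two queries: a character matches a case-insensitive
range if any of its case variants falls inside the range. The sketch below
models that rule in Python. It is an illustration of the semantics only, not
the regex engine's actual algorithm, and the helper names are hypothetical.

```python
def case_variants(ch: str) -> set[str]:
    """Return ch plus its lower/upper/title forms that are a single code point."""
    candidates = {ch, ch.lower(), ch.upper(), ch.title()}
    return {c for c in candidates if len(c) == 1}


def in_range_icase(ch: str, lo: str, hi: str) -> bool:
    """True if any case variant of ch lies in the code point range [lo, hi]."""
    return any(lo <= v <= hi for v in case_variants(ch))


# Capital delta (U+0394) against the lowercase range gamma..lambda.
print(in_range_icase("\u0394", "\u03b3", "\u03bb"))
# Small delta (U+03B4) against the uppercase range Gamma..Lambda.
print(in_range_icase("\u03b4", "\u0393", "\u039b"))
```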

Moreover, a lot of this behavior is locked in by strong Unicode
guarantees like [1] and [2]. Behavior that can change probably won't
change very often, and in any case will be tied to a PG major version.

All of these behaviors are very close to what glibc "C.utf8" does on my
machine. The case transformations are identical (except titlecasing
because libc doesn't support it). The character classifications have
some differences, which might be worth discussing, but I didn't see
anything terribly concerning (I am following the Unicode
recommendations[3] on this topic).

Performance:

  Sorting 10M strings:
    libc    "C"               14s
    builtin  C_UTF8           14s
    libc    "C.utf8"          20s
    ICU     "en-US-x-icu"     31s

  Running UPPER() on 10M strings:
    libc    "C"               03s
    builtin  C_UTF8           07s
    libc    "C.utf8"          08s
    ICU     "en-US-x-icu"     15s

I didn't investigate or optimize regexes / pattern matching yet, but I
can do similar optimizations if there's any gap.
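
The message does not say how the patch compares strings. One plausible reason
a "C"-style UTF-8 collation can sort as fast as libc "C" is that comparing the
raw UTF-8 bytes already yields code point order, so no per-character decoding
or locale lookup is needed. That is an assumption about the implementation;
the property of UTF-8 itself can be checked with a short Python sketch:

```python
def utf8_key(s: str) -> bytes:
    """Sort key equivalent to a memcmp()-style comparison of the UTF-8 bytes."""
    return s.encode("utf-8")


# Mix of ASCII, accented Latin, Greek, a BMP code point near the top of the
# range, and a supplementary-plane code point.
words = ["zebra", "Zebra", "\u00e1rbol", "\u0394", "\U0001f600", "apple", "\uffff"]

by_bytes = sorted(words, key=utf8_key)
by_code_point = sorted(words)  # Python compares str values by code point

print(by_bytes == by_code_point)
```

Note that this ordering puts every uppercase ASCII letter before every
lowercase one, which is the usual trade-off of "C"-style collation.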

Note that I implemented the "simple" case mapping (which is what glibc
does) and the "POSIX compatible"[3] flavor of character classification
(which is closer to what glibc does than the "standard" flavor). I
opted to use title case mapping for initcap(), which is a difference
from libc and I may go back to just upper/lower. These seem like
reasonable choices if we're going to name the locale after C.UTF-8.
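
To make the two choices above concrete, here is a small Python sketch. It
assumes the definitions in UTS #18 [3], where the "standard" digit class is
any Decimal_Number and the "POSIX compatible" one is ASCII 0-9 only. Python's
str.upper() performs the full case mapping, so it serves here as the contrast
to a "simple" mapping, which always maps one code point to one code point. The
helper names are hypothetical, and none of this is taken from the patch itself.

```python
import unicodedata


def full_upper_expands(ch: str) -> bool:
    """True if the full uppercase mapping turns one code point into several."""
    return len(ch.upper()) > 1


def is_digit_standard(ch: str) -> bool:
    """'Standard' flavor: any Decimal_Number (general category Nd)."""
    return unicodedata.category(ch) == "Nd"


def is_digit_posix(ch: str) -> bool:
    """'POSIX compatible' flavor: only ASCII 0-9."""
    return len(ch) == 1 and "0" <= ch <= "9"


sharp_s = "\u00df"             # full uppercase is "SS"; no simple uppercase mapping exists
arabic_indic_three = "\u0663"  # a decimal digit outside ASCII

print(full_upper_expands(sharp_s))
print(is_digit_standard(arabic_indic_three), is_digit_posix(arabic_indic_three))
```

Under a simple mapping, upper() never changes the number of code points in a
string, which keeps results predictable at the cost of leaving characters such
as U+00DF unchanged.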

Regards,
    Jeff Davis

[1] https://www.unicode.org/policies/stability_policy.html#Case_Pair
[2] https://www.unicode.org/policies/stability_policy.html#Identity
[3] http://www.unicode.org/reports/tr18/#Compatibility_Properties
