Re: Built-in CTYPE provider

From Jeff Davis
Subject Re: Built-in CTYPE provider
Date
Msg-id 804eb67b37f41d3afeb2b6469cbe8bfa79c562cc.camel@j-davis.com
In response to Re: Built-in CTYPE provider  (Jeff Davis <pgsql@j-davis.com>)
Responses Re: Built-in CTYPE provider  (Jeremy Schneider <schneider@ardentperf.com>)
Re: Built-in CTYPE provider  (Jeremy Schneider <schneider@ardentperf.com>)
Re: Built-in CTYPE provider  ("Daniel Verite" <daniel@manitou-mail.org>)
List pgsql-hackers
On Wed, 2023-12-27 at 17:26 -0800, Jeff Davis wrote:
> Attached is an implementation of a built-in provider for the
> "C.UTF-8"

Attached is a more complete version that fixes a few bugs, stabilizes the
tests, and improves the documentation. I optimized the performance, too
-- now it's beating both libc's "C.utf8" and ICU "en-US-x-icu" for both
collation and case mapping (numbers below).

It's really nice to finally be able to have platform-independent tests
that work on any UTF-8 database.

Simple character classification:

  SELECT 'Á' ~ '[[:alpha:]]' COLLATE C_UTF8;

Case mapping is more interesting (note that accented characters are
being properly mapped, and it's using the titlecase variant "Dž"):

  SELECT initcap('axxE áxxÉ DŽxxDŽ Džxxx džxxx' COLLATE C_UTF8);
           initcap
  --------------------------
   Axxe Áxxé Džxxdž Džxxx Džxxx
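
For readers without the patch applied, the difference between the uppercase
and titlecase forms of the digraph can be reproduced outside PostgreSQL. This
is a minimal sketch in Python, not the patch's implementation. initcap_sketch
is a hypothetical helper that splits words on single spaces only (an
assumption; initcap() in PostgreSQL uses alphanumeric word boundaries), and it
relies on Python's str.title() and str.lower(), which use Unicode's full case
mappings rather than the simple ones.

```python
# U+01C4 = DŽ (uppercase), U+01C5 = Dž (titlecase), U+01C6 = dž (lowercase)
UPPER_DZ, TITLE_DZ, LOWER_DZ = "\u01c4", "\u01c5", "\u01c6"


def initcap_sketch(text: str) -> str:
    """Titlecase the first character of each space-separated word, lowercase the rest."""
    words = []
    for word in text.split(" "):
        if word:
            word = word[0].title() + word[1:].lower()
        words.append(word)
    return " ".join(words)


# Same input shape as the SQL example, with the digraphs written as escapes.
sample = f"axxE \u00e1xx\u00c9 {UPPER_DZ}xx{UPPER_DZ} {TITLE_DZ}xxx {LOWER_DZ}xxx"
print(initcap_sketch(sample))

# Uppercasing the small digraph gives U+01C4, not the titlecase form U+01C5.
print(LOWER_DZ.upper() == UPPER_DZ)
```

The point of the sketch is only that titlecase is a third case form with its
own code point, so an initcap() built from upper/lower alone cannot produce it.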

Even more interesting -- test that non-Latin characters can still be
members of a case-insensitive range:

  -- capital delta is member of lowercase range gamma to lambda
  SELECT 'Δ' ~* '[γ-λ]' COLLATE C_UTF8;
  -- small delta is member of uppercase range gamma to lambda
  SELECT 'δ' ~* '[Γ-Λ]' COLLATE C_UTF8;
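
The same two range checks can be tried without a PostgreSQL build. This is a
rough sketch using Python's re module, whose engine is unrelated to the one in
PostgreSQL, so it only illustrates the expected semantics. in_range_ci is a
hypothetical helper name, and it assumes the range endpoints are plain letters
that need no escaping inside a bracket expression.

```python
import re


def in_range_ci(ch: str, low: str, high: str) -> bool:
    """Return True if the single character ch matches [low-high] ignoring case."""
    pattern = f"[{low}-{high}]"
    return re.fullmatch(pattern, ch, flags=re.IGNORECASE) is not None


# capital delta against the lowercase range gamma to lambda
print(in_range_ci("\u0394", "\u03b3", "\u03bb"))
# small delta against the uppercase range gamma to lambda
print(in_range_ci("\u03b4", "\u0393", "\u039b"))
# capital alpha lies outside the range in either case
print(in_range_ci("\u0391", "\u03b3", "\u03bb"))
```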

Moreover, a lot of this behavior is locked in by strong Unicode
guarantees like [1] and [2]. Behavior that can change probably won't
change very often, and in any case will be tied to a PG major version.

All of these behaviors are very close to what glibc "C.utf8" does on my
machine. The case transformations are identical (except titlecasing,
because libc doesn't support it). The character classifications have
some differences, which might be worth discussing, but I didn't see
anything terribly concerning (I am following the Unicode
recommendations[3] on this topic).

Performance:

  Sorting 10M strings:
    libc    "C"               14s
    builtin  C_UTF8           14s
    libc    "C.utf8"          20s
    ICU     "en-US-x-icu"     31s

  Running UPPER() on 10M strings:
    libc    "C"               03s
    builtin  C_UTF8           07s
    libc    "C.utf8"          08s
    ICU     "en-US-x-icu"     15s

I didn't investigate or optimize regexes / pattern matching yet, but I
can do similar optimizations if there's any gap.

Note that I implemented the "simple" case mapping (which is what glibc
does) and the "POSIX compatible"[3] flavor of character classification
(which is closer to what glibc does than the "standard" flavor). I
opted to use title case mapping for initcap(), which is a difference
from libc, and I may go back to just upper/lower. These seem like
reasonable choices if we're going to name the locale after C.UTF-8.
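
To make the "simple" versus full distinction concrete: a simple case mapping
turns each code point into exactly one code point, while a full mapping may
expand it. Python only exposes the full mapping, so the sketch below uses a
hypothetical helper, one_to_one_upper, that keeps a character unchanged
whenever the full mapping would expand it. That is only an approximation of
the simple mapping. It agrees for U+00DF, which has no simple uppercase, but
not for every character, and it is not how the patch derives its tables.

```python
def one_to_one_upper(text: str) -> str:
    """Uppercase each character, but keep it unchanged when the full mapping expands."""
    out = []
    for ch in text:
        mapped = ch.upper()
        out.append(mapped if len(mapped) == 1 else ch)
    return "".join(out)


word = "stra\u00dfe"  # "strasse" spelled with U+00DF (sharp s)
print(word.upper())            # full mapping: the string gets longer
print(one_to_one_upper(word))  # one-to-one mapping: length is preserved
```

A one-to-one mapping never changes the number of code points in a string.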

Regards,
    Jeff Davis

[1] https://www.unicode.org/policies/stability_policy.html#Case_Pair
[2] https://www.unicode.org/policies/stability_policy.html#Identity
[3] http://www.unicode.org/reports/tr18/#Compatibility_Properties

