Re: Built-in CTYPE provider

From Daniel Verite
Subject Re: Built-in CTYPE provider
Msg-id 610d7f1b-c68c-4eb8-a03d-1515da304c58@manitou-mail.org
In response to Re: Built-in CTYPE provider (Jeff Davis <pgsql@j-davis.com>)
Responses Re: Built-in CTYPE provider (Jeff Davis <pgsql@j-davis.com>)
List pgsql-hackers
Jeff Davis wrote:

> The tests include initcap('123abc') which is '123abc' in the PG_C_UTF8
> collation vs '123Abc' in PG_UNICODE_FAST.
>
> The reason for the latter behavior is that the Unicode Default Case
> Conversion algorithm for toTitlecase() advances to the next Cased
> character before mapping to titlecase, and digits are not Cased. ICU
> has a configurable adjustment, and defaults in a way that produces
> '123abc'.

Even aside from ICU, glibc and pg_c_utf8 behave differently for
codepoints in the decimal digit category outside of the US-ASCII
range '0'..'9':

select initcap(concat(chr(0xff11), 'a') collate "C.utf8");   -- glibc 2.35
 initcap
---------
 1a

select initcap(concat(chr(0xff11), 'a') collate "pg_c_utf8");
 initcap
---------
 1A

Both collations consider that chr(0xff11) is not a digit
(isdigit()=>false) but C.utf8 says that it's alpha, whereas pg_c_utf8
says it's neither digit nor alpha.

AFAIU this is why in the above initcap() call, pg_c_utf8 considers
that 'a' is the first alphanumeric, whereas C.utf8 considers that '1'
is the first alphanumeric, leading to different capitalizations.
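
To make that reasoning concrete, here is a minimal sketch in Python of
how an initcap()-style function depends on the alphanumeric
classification. It is a simplified model, not PostgreSQL's actual code:
model_initcap and the two predicates are hypothetical names, and the
predicates only mimic the classifications of U+FF11 reported in this
message (alnum for C.utf8, not alnum for pg_c_utf8) on top of plain
ASCII rules.

```python
from typing import Callable

FULLWIDTH_ONE = "\uff11"  # U+FF11 FULLWIDTH DIGIT ONE


def model_initcap(text: str, is_alnum: Callable[[str], bool]) -> str:
    """Uppercase the first character of each alphanumeric run and
    lowercase the rest of the run; other characters pass through."""
    out = []
    prev_alnum = False
    for ch in text:
        cur_alnum = is_alnum(ch)
        if cur_alnum:
            # Inside a run: lowercase. At the start of a run: uppercase.
            out.append(ch.lower() if prev_alnum else ch.upper())
        else:
            out.append(ch)
        prev_alnum = cur_alnum
    return "".join(out)


def alnum_like_c_utf8(ch: str) -> bool:
    # Assumption: mimics glibc C.utf8 as reported here (U+FF11 is alpha).
    return ch == FULLWIDTH_ONE or (ch.isascii() and ch.isalnum())


def alnum_like_pg_c_utf8(ch: str) -> bool:
    # Assumption: mimics pg_c_utf8 as reported here
    # (U+FF11 is neither digit nor alpha).
    return ch != FULLWIDTH_ONE and ch.isascii() and ch.isalnum()


s = FULLWIDTH_ONE + "a"
glibc_like = model_initcap(s, alnum_like_c_utf8)       # 'a' stays lowercase
builtin_like = model_initcap(s, alnum_like_pg_c_utf8)  # 'a' becomes 'A'
```

In this model the only difference between the two calls is whether
U+FF11 starts the alphanumeric run, which is enough to flip the case of
the following 'a'.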

Comparing the 3 providers:

WITH v(provider,type,result) AS (values
 ('ICU', 'isalpha', chr(0xff11) ~ '[[:alpha:]]' collate "unicode"),
 ('glibc', 'isalpha', chr(0xff11) ~ '[[:alpha:]]' collate "C.utf8"),
 ('builtin', 'isalpha', chr(0xff11) ~ '[[:alpha:]]' collate "pg_c_utf8"),
 ('ICU', 'isdigit', chr(0xff11) ~ '[[:digit:]]' collate "unicode"),
 ('glibc', 'isdigit', chr(0xff11) ~ '[[:digit:]]' collate "C.utf8"),
 ('builtin', 'isdigit', chr(0xff11) ~ '[[:digit:]]' collate "pg_c_utf8")
 )
select * from v
\crosstabview


 provider | isalpha | isdigit
----------+---------+---------
 ICU      | f       | t
 glibc    | t       | f
 builtin  | f       | f
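
For reference, the ICU row agrees with the Unicode Character Database,
which puts U+FF11 in general category Nd (decimal digit). A quick way
to check this outside of PostgreSQL is Python's standard unicodedata
module. Python is used here only as a convenient way to query the UCD;
it says nothing about how any of the three providers is implemented.

```python
import unicodedata

ch = chr(0xFF11)

name = unicodedata.name(ch)              # official character name
category = unicodedata.category(ch)      # general category, e.g. 'Nd'
decimal_value = unicodedata.decimal(ch)  # decimal digit value
is_letter = category.startswith("L")     # letters are categories L*

print(name, category, decimal_value, is_letter)
```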


Are we fine with pg_c_utf8 differing from both ICU's point of view
(U+ff11 is a digit and not alpha) and glibc's point of view (U+ff11 is
not a digit, but it is alpha)?

Aside from initcap(), this is going to be significant for regular
expressions.


Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite


