Re: Built-in CTYPE provider
From | Jeff Davis |
---|---|
Subject | Re: Built-in CTYPE provider |
Date | |
Msg-id | 804eb67b37f41d3afeb2b6469cbe8bfa79c562cc.camel@j-davis.com |
In reply to | Re: Built-in CTYPE provider (Jeff Davis <pgsql@j-davis.com>) |
Responses |
Re: Built-in CTYPE provider
Re: Built-in CTYPE provider Re: Built-in CTYPE provider |
List | pgsql-hackers |
On Wed, 2023-12-27 at 17:26 -0800, Jeff Davis wrote:
> Attached is an implementation of a built-in provider for the "C.UTF-
> 8"

Attached a more complete version that fixes a few bugs, stabilizes the tests, and improves the documentation. I optimized the performance, too -- now it's beating both libc's "C.utf8" and ICU "en-US-x-icu" for both collation and case mapping (numbers below).

It's really nice to finally be able to have platform-independent tests that work on any UTF-8 database.

Simple character classification:

    SELECT 'Á' ~ '[[:alpha:]]' COLLATE C_UTF8;

Case mapping is more interesting (note that accented characters are being properly mapped, and it's using the titlecase variant "ǅ"):

    SELECT initcap('axxE áxxÉ ǄxxǄ ǅxxx ǆxxx' COLLATE C_UTF8);
             initcap
    --------------------------
     Axxe Áxxé ǅxxǆ ǅxxx ǅxxx

Even more interesting -- test that non-Latin characters can still be a member of a case-insensitive range:

    -- capital delta is member of lowercase range gamma to lambda
    SELECT 'Δ' ~* '[γ-λ]' COLLATE C_UTF8;
    -- small delta is member of uppercase range gamma to lambda
    SELECT 'δ' ~* '[Γ-Λ]' COLLATE C_UTF8;

Moreover, a lot of this behavior is locked in by strong Unicode guarantees like [1] and [2]. Behavior that can change probably won't change very often, and in any case will be tied to a PG major version.

All of these behaviors are very close to what glibc "C.utf8" does on my machine. The case transformations are identical (except titlecasing, because libc doesn't support it). The character classifications have some differences, which might be worth discussing, but I didn't see anything terribly concerning (I am following the Unicode recommendations[3] on this topic).

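
The titlecase and case-insensitive-range examples above can be checked against Unicode data without the patch applied. What follows is a rough Python sketch, not the patch's C implementation. `initcap_sketch` and `in_range_icase` are hypothetical helpers introduced only for illustration. It assumes words are separated by single spaces, and it relies on Python's full Unicode case mappings, which coincide with the "simple" mappings for the characters used here.

```python
import unicodedata

# The three forms of the "dz with caron" digraph: uppercase, titlecase, lowercase.
DZ_UPPER, DZ_TITLE, DZ_LOWER = "\u01C4", "\u01C5", "\u01C6"


def initcap_sketch(text: str) -> str:
    """Titlecase the first character of each space-separated word and
    lowercase the rest (an approximation of initcap() with title case)."""
    words = text.split(" ")
    return " ".join(w[:1].title() + w[1:].lower() for w in words)


def in_range_icase(ch: str, lo: str, hi: str) -> bool:
    """Return True if ch, or its lowercase or uppercase variant, lies in the
    code point range [lo, hi]. A simplification of case-insensitive ranges."""
    return any(lo <= c <= hi for c in (ch, ch.lower(), ch.upper()))


if __name__ == "__main__":
    # Print the general category of each digraph form (Lu, Lt, Ll).
    for ch in (DZ_UPPER, DZ_TITLE, DZ_LOWER):
        print(f"U+{ord(ch):04X} {unicodedata.category(ch)}")

    # Same input string as the initcap() example in the message.
    sample = f"axxE \u00E1xx\u00C9 {DZ_UPPER}xx{DZ_UPPER} {DZ_TITLE}xxx {DZ_LOWER}xxx"
    print(initcap_sketch(sample))

    # Capital delta against the lowercase range gamma..lambda, and the reverse.
    print(in_range_icase("\u0394", "\u03B3", "\u03BB"))
    print(in_range_icase("\u03B4", "\u0393", "\u039B"))
```

The sketch treats U+01C5 as the titlecase form distinct from the uppercase U+01C4, which is the distinction that an upper/lower-only initcap() would lose.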
Performance:

  Sorting 10M strings:

    libc "C"              14s
    builtin C_UTF8        14s
    libc "C.utf8"         20s
    ICU "en-US-x-icu"     31s

  Running UPPER() on 10M strings:

    libc "C"              03s
    builtin C_UTF8        07s
    libc "C.utf8"         08s
    ICU "en-US-x-icu"     15s

I didn't investigate or optimize regexes / pattern matching yet, but I can do similar optimizations if there's any gap.

Note that I implemented the "simple" case mapping (which is what glibc does) and the "posix compatible"[3] flavor of character classification (which is closer to what glibc does than the "standard" flavor). I opted to use title case mapping for initcap(), which is a difference from libc, and I may go back to just upper/lower. These seem like reasonable choices if we're going to name the locale after C.UTF-8.

Regards,
    Jeff Davis

[1] https://www.unicode.org/policies/stability_policy.html#Case_Pair
[2] https://www.unicode.org/policies/stability_policy.html#Identity
[3] http://www.unicode.org/reports/tr18/#Compatibility_Properties