Re: Pre-proposal: unicode normalized text
From | Jeff Davis |
---|---|
Subject | Re: Pre-proposal: unicode normalized text |
Date | |
Msg-id | c5e9dac884332824e0797937518da0b8766c1238.camel@j-davis.com |
In response to | Re: Pre-proposal: unicode normalized text (Peter Eisentraut <peter@eisentraut.org>) |
Responses |
Re: Pre-proposal: unicode normalized text
Re: Pre-proposal: unicode normalized text |
List | pgsql-hackers |
On Wed, 2023-10-11 at 08:56 +0200, Peter Eisentraut wrote:
> We need to be careful about precise terminology. "Valid" has a
> defined meaning for Unicode. A byte sequence can be valid or not as
> UTF-8. But a string containing unassigned code points is not
> not-"valid" as Unicode.

New patch attached, function name is "unicode_assigned".

I believe the patch has utility as-is, but I've been brainstorming a few more ideas that could build on it:

* Add a per-database option to enforce only storing assigned unicode code points.

* (More radical) Add a per-database option to normalize all text in NFC.

* Do character classification in Unicode rather than relying on glibc/ICU. This would affect regex character classes, etc., but not affect upper/lower/initcap nor collation. I did some experiments and the General Category doesn't change a lot: a total of 197 characters changed their General Category since Unicode 6.0.0, and only 5 since ICU 11.0.0. I'm not quite sure how to expose this, but it seems like a nicer way to handle it than tying it into the collation provider.

Regards,
Jeff Davis
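
For readers who want to experiment with the two properties discussed above, here is a minimal sketch in Python rather than the C of the attached patch. It assumes that "assigned" means "General Category is not Cn", which is a guess at the semantics of "unicode_assigned" and not a description of the patch. The helper names `all_assigned` and `is_nfc` are hypothetical, and results depend on the Unicode version bundled with the Python build (`unicodedata.unidata_version`), not on PostgreSQL, glibc, or ICU.

```python
import unicodedata


def all_assigned(text: str) -> bool:
    """Return True if no code point in text has General Category Cn.

    Assumption: Cn (unassigned, which also covers noncharacters) is the
    category a check like "unicode_assigned" would reject.
    """
    return all(unicodedata.category(ch) != "Cn" for ch in text)


def is_nfc(text: str) -> bool:
    """Return True if text is already in Normalization Form C."""
    return unicodedata.is_normalized("NFC", text)


if __name__ == "__main__":
    # The Unicode version this Python build classifies against.
    print("Unicode data version:", unicodedata.unidata_version)

    # U+10FFFF is a noncharacter, so its General Category stays Cn.
    print(all_assigned("abc"))
    print(all_assigned("abc\U0010FFFF"))

    # "e" followed by U+0301 (combining acute) composes to U+00E9 in NFC.
    decomposed = "e\u0301"
    print(is_nfc(decomposed))
    print(unicodedata.normalize("NFC", decomposed) == "\u00e9")
```

Because the answer for a given string can change as new code points are assigned, the same input may pass under a newer Unicode version and fail under an older one, which is the stability concern behind keeping this separate from the collation provider.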