Re: Pre-proposal: unicode normalized text
From | Jeff Davis |
---|---|
Subject | Re: Pre-proposal: unicode normalized text |
Date | |
Msg-id | 2bab90239c5264fa9a87372c16bbf8759c8f9e64.camel@j-davis.com |
In reply to | Re: Pre-proposal: unicode normalized text (Robert Haas <robertmhaas@gmail.com>) |
Responses | Re: Pre-proposal: unicode normalized text |
List | pgsql-hackers |
On Tue, 2023-10-10 at 10:02 -0400, Robert Haas wrote:
> On Tue, Oct 10, 2023 at 2:44 AM Peter Eisentraut
> <peter@eisentraut.org> wrote:
> > Can you restate what this is supposed to be for? This thread
> > appears to
> > have morphed from "let's normalize everything" to "let's check for
> > unassigned code points", but I'm not sure what we are aiming for
> > now.

It was a "pre-proposal", so yes, the goalposts have moved a bit. Right
now I'm aiming to get some primitives in place that will be useful by
themselves, but also that we can potentially build on.

Attached is a new version of the patch which introduces some SQL
functions as well:

  * unicode_is_valid(text): returns true if all codepoints are
    assigned, false otherwise
  * unicode_version(): version of Unicode Postgres is built with
  * icu_unicode_version(): version of Unicode ICU is built with

I'm not 100% clear on the consequences of differences between the PG
unicode version and the ICU unicode version, but because normalization
uses the Postgres version of Unicode, I believe the Postgres version of
Unicode should also be available to determine whether a code point is
assigned or not.

We may also find it interesting to use the PG Unicode tables for regex
character classification. This is just an idea and we can discuss
whether that makes sense or not, but having the primitives in place
seems like a good idea regardless.

> Jeff can say what he wants it for, but one obvious application would
> be to have the ability to add a CHECK constraint that forbids
> inserting unassigned code points into your database, which would be
> useful if you're worried about forward-compatibility with collation
> definitions that might be extended to cover those code points in the
> future. Another application would be to find data already in your
> database that has this potential problem.

Exactly. Avoiding unassigned code points also allows you to be
forward-compatible with normalization.

Regards,
	Jeff Davis
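
To make the "all codepoints are assigned" check concrete, here is a minimal Python sketch of the same idea. It is not the patch's implementation: the function names `all_code_points_assigned` and `python_unicode_version` are hypothetical, it relies on the Unicode tables bundled with Python rather than those built into Postgres, and it assumes that "unassigned" means general category `Cn`, which also covers noncharacters.

```python
import unicodedata


def all_code_points_assigned(text: str) -> bool:
    """Return True if no character in `text` has general category Cn.

    Assumption: "unassigned" is taken to mean category Cn in the Unicode
    tables bundled with this Python build, not in Postgres' own tables.
    """
    return all(unicodedata.category(ch) != "Cn" for ch in text)


def python_unicode_version() -> str:
    """Return the Unicode version of Python's tables, e.g. '15.0.0'."""
    return unicodedata.unidata_version


if __name__ == "__main__":
    print(python_unicode_version())
    print(all_code_points_assigned("na\u00efve"))  # assigned characters only
    print(all_code_points_assigned("a\u0378"))     # U+0378 is currently unassigned
```

The answer depends on which Unicode version the tables come from: a code point unassigned in one version may be assigned in a later one, which is why the check is reported alongside a version.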