Re: Pre-proposal: unicode normalized text
From | Jeff Davis |
---|---|
Subject | Re: Pre-proposal: unicode normalized text |
Date | |
Msg-id | a0e85aca6e03042881924c4b31a840a915a9d349.camel@j-davis.com |
In response to | Re: Pre-proposal: unicode normalized text (Robert Haas <robertmhaas@gmail.com>) |
Responses | Re: Pre-proposal: unicode normalized text; Re: Pre-proposal: unicode normalized text |
List | pgsql-hackers |
On Mon, 2023-10-02 at 16:06 -0400, Robert Haas wrote:
> It seems to me that this overlooks one of the major points of Jeff's
> proposal, which is that we don't reject text input that contains
> unassigned code points. That decision turns out to be really painful.

Attached is an implementation of a per-database option STRICT_UNICODE which enforces the use of assigned code points only.

Not everyone would want to use it. There are lots of applications that accept free-form text, and that may include recently-assigned code points not yet recognized by Postgres.

But it would offer protection/stability for some databases. It makes it possible to have a hard guarantee that Unicode normalization is stable[1]. And it may also mitigate the risk of collation changes -- using unassigned code points carries a high risk that the collation order changes as soon as the collation provider recognizes the assignment. (Though assigned code points can change, too, so limiting yourself to assigned code points is only a mitigation.)

I worry slightly that users will think at first that they want only assigned code points, and then later figure out that the application has increased in scope and now takes all kinds of free-form text. In that case, the user can "ALTER DATABASE ... STRICT_UNICODE FALSE", and follow up with some "CHECK (unicode_assigned(...))" constraints on the particular fields that they'd like to protect.

There's some weirdness that the set of assigned code points as Postgres sees it may not match what a collation provider sees due to differing Unicode versions. That's not great -- perhaps we could check that code points are considered assigned by *both* Postgres and ICU. I don't know if there's a way to tell if libc considers a code point to be assigned.

Regards,
Jeff Davis

[1] https://www.unicode.org/policies/stability_policy.html#Normalization
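
For readers who want to see what an "assigned code points only" check amounts to, here is a minimal sketch in Python. It is not the attached patch and not how Postgres implements it: the helper names `all_assigned` and `first_unassigned` are made up for illustration. It assumes that "unassigned" means Unicode general category Cn, which also covers noncharacters. The answer depends on the Unicode version bundled with the Python build (`unicodedata.unidata_version`), which may differ from the version Postgres or ICU uses. That is the same version-mismatch caveat raised in the message.

```python
import unicodedata


def all_assigned(text: str) -> bool:
    """Return True if no character in `text` has general category Cn.

    Cn ("Other, not assigned") is what this Python build's Unicode
    tables report for code points they do not know as assigned.
    """
    return all(unicodedata.category(ch) != "Cn" for ch in text)


def first_unassigned(text: str):
    """Return (index, 'U+XXXX') for the first Cn character, or None."""
    for i, ch in enumerate(text):
        if unicodedata.category(ch) == "Cn":
            return i, f"U+{ord(ch):04X}"
    return None


if __name__ == "__main__":
    # The Unicode version these answers are based on; a newer version
    # may consider more code points assigned.
    print("Unicode tables:", unicodedata.unidata_version)
    print(all_assigned("free-form text"))
    print(first_unassigned("abc\u0378"))
```

A strict mode would reject input for which `all_assigned` is false, while a per-column constraint would apply the same test only to selected fields.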