Re: Pre-proposal: unicode normalized text

From	Jeff Davis
Subject	Re: Pre-proposal: unicode normalized text
Date	March 1, 2024, 01:02:51
Msg-id	a0e85aca6e03042881924c4b31a840a915a9d349.camel@j-davis.com
In reply to	Re: Pre-proposal: unicode normalized text (Robert Haas <robertmhaas@gmail.com>)
Responses	Re: Pre-proposal: unicode normalized text Re: Pre-proposal: unicode normalized text
List	pgsql-hackers

On Mon, 2023-10-02 at 16:06 -0400, Robert Haas wrote:
> It seems to me that this overlooks one of the major points of Jeff's
> proposal, which is that we don't reject text input that contains
> unassigned code points. That decision turns out to be really painful.

Attached is an implementation of a per-database option STRICT_UNICODE
which enforces the use of assigned code points only.

Not everyone would want to use it. There are lots of applications that
accept free-form text, and that may include recently-assigned code
points not yet recognized by Postgres.

But it would offer protection/stability for some databases. It makes it
possible to have a hard guarantee that Unicode normalization is
stable[1]. And it may also mitigate the risk of collation changes --
using unassigned code points carries a high risk that the collation
order changes as soon as the collation provider recognizes the
assignment. (Though assigned code points can change, too, so limiting
yourself to assigned code points is only a mitigation.)

I worry slightly that users will think at first that they want only
assigned code points, and then later figure out that the application
has increased in scope and now takes all kinds of free-form text. In
that case, the user can "ALTER DATABASE ... STRICT_UNICODE FALSE", and
follow up with some "CHECK (unicode_assigned(...))" constraints on the
particular fields that they'd like to protect.
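
To make concrete what such a per-field check tests, here is a rough
sketch in Python. It is not the patch's code: it assumes Python's
unicodedata module as a stand-in for Postgres's own Unicode tables, and
the name text_is_assigned is hypothetical, chosen only to mirror what
unicode_assigned() is described as doing (true when no code point falls
in general category Cn, i.e. unassigned).

```python
import unicodedata


def text_is_assigned(text: str) -> bool:
    """Return True if no code point in `text` has general category Cn.

    Hypothetical helper for illustration; "assigned" here is relative to
    the Unicode version bundled with the running Python interpreter.
    """
    return all(unicodedata.category(ch) != "Cn" for ch in text)


print(text_is_assigned("naive text"))   # True
print(text_is_assigned("abc\u0378"))    # False: U+0378 is a reserved (unassigned) code point
```

The answer depends on the Unicode version of whatever library performs
the check, which is exactly why a value accepted today stays valid
later, while a rejected one may become acceptable after an upgrade.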

There's some weirdness: the set of assigned code points as Postgres
sees it may not match what a collation provider sees, because the two
may be built against different Unicode versions. That's not great --
perhaps we could check that code points are considered assigned by
*both* Postgres and ICU. I don't know if there's a way to tell whether
libc considers a code point to be assigned.
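
The "assigned by both" idea can be sketched as follows. This is only an
illustration under stated assumptions, not how Postgres or ICU would do
it: Python's current unicodedata tables stand in for one component, and
its frozen Unicode 3.2.0 tables (unicodedata.ucd_3_2_0) stand in for a
component built against an older Unicode version. The helper names
assigned_in and assigned_in_both are hypothetical.

```python
import unicodedata


def assigned_in(db, text: str) -> bool:
    """True if `db` (anything with a category() function) sees no Cn code point."""
    return all(db.category(ch) != "Cn" for ch in text)


def assigned_in_both(text: str) -> bool:
    """Accept text only if both Unicode versions consider every code point assigned."""
    newer = unicodedata            # stand-in for one component's tables
    older = unicodedata.ucd_3_2_0  # stand-in for a component on an older version
    return assigned_in(newer, text) and assigned_in(older, text)


grin = "\U0001F600"  # assigned in Unicode 6.1, so unknown to Unicode 3.2.0
print(assigned_in(unicodedata, grin))
print(assigned_in(unicodedata.ucd_3_2_0, grin))
print(assigned_in_both(grin))
```

Requiring agreement means the effective set of accepted code points is
the intersection of the two versions, so the older component sets the
limit.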

Regards,
    Jeff Davis

[1]
https://www.unicode.org/policies/stability_policy.html#Normalization

Attachments

v1-0001-CREATE-DATABASE-.-STRICT_UNICODE.patch
