Re: Pre-proposal: unicode normalized text
From | Chapman Flack |
---|---|
Subject | Re: Pre-proposal: unicode normalized text |
Date | |
Msg-id | 02d05bc98ca5b2d8ab38fec5fe5b7625@anastigmatix.net |
In response to | Re: Pre-proposal: unicode normalized text (Jeff Davis <pgsql@j-davis.com>) |
Responses |
Re: Pre-proposal: unicode normalized text
Re: Pre-proposal: unicode normalized text |
List | pgsql-hackers |
On 2023-10-04 16:38, Jeff Davis wrote:
> On Wed, 2023-10-04 at 14:02 -0400, Chapman Flack wrote:
>> The SQL standard would have me able to:
>>
>> CREATE TABLE foo (
>>   a CHARACTER VARYING CHARACTER SET UTF8,
>>   b CHARACTER VARYING CHARACTER SET LATIN1
>> )
>>
>> and so on
>
> Is there a use case for that? UTF-8 is able to encode any unicode code
> point, it's relatively compact, and it's backwards-compatible with
> 7-bit ASCII. If you have a variety of text data in your system (and in
> many cases even if not), then UTF-8 seems like the right solution.

Well, for what reason does anybody run PG now with the encoding set
to anything besides UTF-8? I don't really have my finger on that pulse.
Could it be that it bloats common strings in their local script, and
with enough of those to store, it could matter to use the local
encoding that stores them more economically?

Also, while any Unicode transfer format can encode any Unicode code
point, I'm unsure whether it's yet the case that {any Unicode code
point} is a superset of every character repertoire associated with
every non-Unicode encoding.

The cheap glaring counterexample is SQL_ASCII. Half those code points
are *nobody knows what Unicode character* (or even *whether*). I'm not
insisting that's a good thing, but it is a thing.

It might be a very tidy future to say all text is Unicode and all
server encodings are UTF-8, but I'm not sure it wouldn't still be a
good step on the way to be able to store some things in their own
encodings. We have JSON and XML now, two data types that are *formally
defined* to accept any Unicode content, and we hedge and mumble and
say (well, as long as it goes in the server encoding) and that makes
me sad. Things like that should be easy to handle even without
declaring UTF-8 as a server-wide encoding ... they already are their
own distinct data types, and could conceivably know their own
encodings.

But there again, it's possible that going with unconditional UTF-8
for JSON or XML documents could, in some regions, bloat them.

Regards,
-Chap
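
The storage concern raised above can be checked roughly with a short script. What follows is a minimal sketch and not something from the thread: the sample strings and the legacy encodings chosen (KOI8-R, TIS-620, EUC-JP, Latin-1) are assumptions picked for illustration, and it measures raw encoded length only, ignoring any compression or per-value overhead the server applies.

```python
# Compare the byte length of the same text in UTF-8 and in a legacy
# encoding. Sample strings and encodings are illustrative assumptions.
SAMPLES = [
    ("Russian", "Привет, мир", "koi8_r"),
    ("Thai", "สวัสดีชาวโลก", "tis_620"),
    ("Japanese", "こんにちは世界", "euc_jp"),
    ("English", "Hello, world", "latin_1"),
]


def byte_sizes(text, legacy_encoding):
    """Return (utf8_bytes, legacy_bytes) for one string."""
    return len(text.encode("utf-8")), len(text.encode(legacy_encoding))


def report(samples=SAMPLES):
    """Return one (label, encoding, utf8_bytes, legacy_bytes, ratio) row per sample."""
    rows = []
    for label, text, enc in samples:
        utf8_len, legacy_len = byte_sizes(text, enc)
        rows.append((label, enc, utf8_len, legacy_len, utf8_len / legacy_len))
    return rows


if __name__ == "__main__":
    for label, enc, utf8_len, legacy_len, ratio in report():
        print(f"{label:9} {enc:8} utf-8={utf8_len:3d} legacy={legacy_len:3d} ratio={ratio:.2f}")
```

With these particular samples, the Cyrillic string needs roughly 1.8 times as many bytes in UTF-8 as in KOI8-R, the Thai string three times as many as in TIS-620, and the Japanese string 1.5 times as many as in EUC-JP, while the ASCII-only string is the same size either way. Whether that difference matters in practice depends on how much of the stored data is text in such a script.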