Re: ICU for global collation

From	Marina Polyakova
Subject	Re: ICU for global collation
Date	16 September 2022, 07:31:42
Msg-id	1989d430b926be3c08735f97fffc6294@postgrespro.ru
In reply to	Re: ICU for global collation (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Responses	Re: ICU for global collation
List	pgsql-hackers

On 2022-09-16 07:55, Kyotaro Horiguchi wrote:
> At Thu, 15 Sep 2022 18:41:31 +0300, Marina Polyakova
> <m.polyakova@postgrespro.ru> wrote in
>> P.S. While working on the patch, I discovered that UTF8 encoding is
>> always used for the ICU provider in initdb unless it is explicitly
>> specified by the user:
>> 
>> if (!encoding && locale_provider == COLLPROVIDER_ICU)
>>     encodingid = PG_UTF8;
>> 
>> IMO this creates additional errors for locales with other encodings:
>> 
>> $ initdb --locale de_DE.iso885915@euro --locale-provider icu
>> --icu-locale de-DE
>> ...
>> initdb: error: encoding mismatch
>> initdb: detail: The encoding you selected (UTF8) and the encoding that
>> the selected locale uses (LATIN9) do not match. This would lead to
>> misbehavior in various character string processing functions.
>> initdb: hint: Rerun initdb and either do not specify an encoding
>> explicitly, or choose a matching combination.
>> 
>> And ICU supports many encodings, see the contents of pg_enc2icu_tbl in
>> encnames.c...
> 
> It seems to me the best default that fits almost all cases using icu
> locales.
> 
> So, we need to specify encoding explicitly in that case.
> 
> $ initdb --encoding iso-8859-15 --locale de_DE.iso885915@euro
> --locale-provider icu --icu-locale de-DE
> 
> However, I think it is hardly understandable from the documentation.
> 
> (I checked this using euc-jp [1] so it might be wrong..)
> 
> [1] initdb --encoding euc-jp --locale ja_JP.eucjp --locale-provider
> icu --icu-locale ja-x-icu
> 
> regards.

Thank you!

IMO it is hardly understandable from the program output either: it
looks as if I had chosen the UTF8 encoding manually. Maybe initdb should
first report which encoding it selected?

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 6aeec8d426c52414b827686781c245291f27ed1f..348bbbeba0f5bc7ff601912bf883510d580b814c 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2310,7 +2310,11 @@ setup_locale_encoding(void)
      }

      if (!encoding && locale_provider == COLLPROVIDER_ICU)
+    {
          encodingid = PG_UTF8;
+        printf(_("The default database encoding has been set to \"%s\" for a better experience with the ICU provider.\n"),
+               pg_encoding_to_char(encodingid));
+    }
      else if (!encoding)
      {
          int            ctype_enc;
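
For readers without the initdb source at hand, the decision the patch
touches can be reduced to a standalone sketch. This is not the initdb
code: the enum, the function name choose_encoding and its parameters are
hypothetical stand-ins, and the notice text is shortened. It only
illustrates the idea that an implicitly chosen encoding is reported,
while an explicit --encoding is taken silently.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for initdb's provider codes; not the real definitions. */
enum provider { PROVIDER_LIBC, PROVIDER_ICU };

/*
 * Pick the database encoding name. When the choice is the implicit ICU
 * default, a one-line notice is written into note; otherwise note is
 * left as an empty string.
 *   explicit_enc: value of --encoding, or NULL if the user gave none
 *   locale_enc:   encoding implied by the libc locale, e.g. "LATIN9"
 */
const char *
choose_encoding(const char *explicit_enc, enum provider prov,
                const char *locale_enc, char *note, size_t notelen)
{
    assert(note != NULL && notelen > 0);
    note[0] = '\0';

    /* The user's explicit choice wins and needs no notice. */
    if (explicit_enc != NULL)
        return explicit_enc;

    if (prov == PROVIDER_ICU)
    {
        /* Implicit default: report it so a later mismatch error is not a surprise. */
        snprintf(note, notelen,
                 "The default database encoding has been set to \"%s\" "
                 "for the ICU provider.", "UTF8");
        return "UTF8";
    }

    /* libc provider: follow the encoding of the locale. */
    return locale_enc;
}
```

Called as choose_encoding(NULL, PROVIDER_ICU, "LATIN9", buf, sizeof(buf)),
the sketch returns "UTF8" and fills buf with the notice, which corresponds
to the de_DE.iso885915@euro case quoted above.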

ISTM that such choices (e.g. UTF8 on Windows in some cases) are
described in the documentation [1] as follows:

By default, initdb uses the locale provider libc, takes the locale 
settings from the environment, and determines the encoding from the 
locale settings. This is almost always sufficient, unless there are 
special requirements.

[1] https://www.postgresql.org/docs/devel/app-initdb.html
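
As an illustration of what "determines the encoding from the locale
settings" can mean on a POSIX system, here is a minimal sketch that asks
libc for the codeset of a given LC_CTYPE locale. The function name
codeset_for_locale is hypothetical, and the sketch stops at the raw
codeset name; mapping that name to a PostgreSQL encoding, as initdb has
to do, is omitted.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/*
 * Copy the codeset name of the LC_CTYPE locale locale_name into buf.
 * Returns 0 on success, -1 if the locale cannot be selected.
 * The caller's LC_CTYPE setting is restored before returning.
 */
int
codeset_for_locale(const char *locale_name, char *buf, size_t buflen)
{
    char        saved[256];
    const char *cur = setlocale(LC_CTYPE, NULL);

    /* Remember the current locale; the returned pointer may be overwritten. */
    if (cur == NULL || strlen(cur) >= sizeof(saved))
        return -1;
    strcpy(saved, cur);

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;              /* locale not available on this system */

    /* nl_langinfo(CODESET) reports the character encoding of LC_CTYPE. */
    snprintf(buf, buflen, "%s", nl_langinfo(CODESET));

    setlocale(LC_CTYPE, saved);
    return 0;
}
```

The exact string is platform-dependent, so a locale such as
de_DE.iso885915@euro may report its codeset under different spellings on
different systems; that is one reason a lookup table is needed on top of
this call.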

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
