Thread: Re: fixing tsearch locale support

Re: fixing tsearch locale support

From

Peter Eisentraut

Date:

August 18, 2025, 13:23:03

On 09.12.24 11:11, Peter Eisentraut wrote:
> lowerstr() and lowerstr_with_len() in ts_locale.c do the same thing as 
> str_tolower(), except that the former don't use the common locale 
> provider framework but instead use the global libc locale settings.
> 
> This patch replaces uses of lowerstr*() with str_tolower(..., 
> DEFAULT_COLLATION_OID).  For instances that use a libc locale globally, 
> this will result in exactly the same behavior.  For instances that use 
> other locale providers, you now get consistent behavior and are no 
> longer dependent on the libc locale settings.
> 
> Most uses of these functions are for processing dictionary and 
> configuration files.  In those cases, using the default collation seems 
> appropriate.  At least we don't have a more specific collation 
> available.  But the code in contrib/pg_trgm should really depend on the 
> collation of the columns being processed.  This is not done here, this 
> can be done in a separate patch.
> 
> (You can probably construct some edge cases where this change would 
> create some locale-related upgrade incompatibility, for example if 
> before you used a combination of ICU and a differently-behaving libc 
> locale.  We can document this in the release notes, but I don't think 
> there is anything more we can do about this.)

There is a PG18 open item to document this possible upgrade incompatibility.

I think the following text could be added to the release notes:

"""
The locale implementation underlying full-text search was improved.  It 
now observes the locale provider configured for the database.  It was 
previously hardcoded to use the configured libc LC_CTYPE setting.  In 
database clusters that use a locale provider other than libc and where 
the locale configured through that locale provider behaves differently 
from the LC_CTYPE setting configured for the database, this could cause 
changes in behavior of some functions related to full-text search as 
well as the pg_trgm extension.  When upgrading such database clusters 
using pg_upgrade, it is recommended to reindex all indexes related to 
full-text search and pg_trgm after the upgrade.
"""
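The reindexing recommendation in the proposed text could be carried out roughly as follows. This is a minimal sketch, not an official procedure: the catalog query is one possible way to find candidate indexes and may miss some (for example, indexes built on custom operator classes), and the index name `docs_fts_idx` is hypothetical.

```sql
-- List indexes that use an operator class for tsvector or one of the
-- pg_trgm operator classes (gin_trgm_ops, gist_trgm_ops).
SELECT i.indexrelid::regclass AS index_name
FROM pg_index AS i
WHERE EXISTS (
    SELECT 1
    FROM pg_opclass AS oc
    WHERE oc.oid = ANY (i.indclass::oid[])
      AND (oc.opcintype = 'tsvector'::regtype
           OR oc.opcname IN ('gin_trgm_ops', 'gist_trgm_ops'))
);

-- Rebuild one of the reported indexes.
-- "docs_fts_idx" is a hypothetical name; substitute a name from the query above.
REINDEX INDEX docs_fts_idx;
```

Run the query in each database of the upgraded cluster, since `pg_index` only covers the current database.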

The commit reference is fb1a18810f0.

Thoughts?

Re: fixing tsearch locale support

From

"Daniel Verite"

Date:

August 18, 2025, 18:56:01

    Peter Eisentraut wrote:

> There is a PG18 open item to document this possible upgrade incompatibility.
>
> I think the following text could be added to the release notes:
>
> """
> The locale implementation underlying full-text search was improved.  It
> now observes the locale provider configured for the database.  It was
> previously hardcoded to use the configured libc LC_CTYPE setting
> [...]

That sounds misleading because LC_CTYPE is still used in 18.

To illustrate in an ICU database, the parser will classify "Em Dash"
as a separator or not depending on LC_CTYPE.

with LC_CTYPE=C

=> select alias, token,lexemes from ts_debug('simple', U&'ABCD\2014EFGH');
 alias |   token   |   lexemes
-------+-----------+-------------
 word  | ABCD—EFGH | {abcd—efgh}


with LC_CTYPE=en_US.utf8 (glibc 2.35):

=> select alias, token,lexemes from ts_debug('simple', U&'ABCD\2014EFGH');
   alias   | token | lexemes
-----------+-------+---------
 asciiword | ABCD  | {abcd}
 blank     | —     |
 asciiword | EFGH  | {efgh}
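
To reproduce comparisons like the two above, it helps to confirm which locale provider and libc `LC_CTYPE` a database actually uses. The following is a sketch under assumptions: it needs PostgreSQL 15 or later (for `datlocprovider`) and a server built with ICU support, and the database name `fts_icu_c` is hypothetical.

```sql
-- Show the locale provider ('c' = libc, 'i' = ICU, 'b' = builtin)
-- and the libc collate/ctype settings of the current database.
SELECT datname, datlocprovider, datcollate, datctype
FROM pg_database
WHERE datname = current_database();

-- Create a test database that combines the ICU provider with LC_CTYPE=C.
-- "fts_icu_c" is a hypothetical name; template0 is required when the
-- locale settings differ from those of the template database.
CREATE DATABASE fts_icu_c
    TEMPLATE template0
    ENCODING 'UTF8'
    LOCALE_PROVIDER icu
    ICU_LOCALE 'und'
    LC_COLLATE 'C'
    LC_CTYPE 'C';
```

A second database created the same way with `LC_CTYPE 'en_US.utf8'` (assuming that locale is installed on the operating system) would allow a side-by-side `ts_debug` comparison.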


OTOH lower casing uses LC_CTYPE in 17, but not in 18, leading
to better lexemes.

pg17, ICU locale, LC_CTYPE=C

=> select alias, token,lexemes from ts_debug('simple', 'ÉTÉ');
 alias | token | lexemes
-------+-------+---------
 word  | ÉTÉ   | {ÉtÉ}

pg18, ICU locale, LC_CTYPE=C

=> select alias, token,lexemes from ts_debug('simple', 'ÉTÉ');
 alias | token | lexemes
-------+-------+---------
 word  | ÉTÉ   | {été}

So maybe the release notes should say
"now observes the locale provider configured for the database to
convert strings to lower case".
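
The same kind of difference can be observed outside of full-text search by lower-casing with explicit collations. This is only an analogy, assuming a UTF8 database and a server built with ICU (so that the predefined `"und-x-icu"` collation exists); it does not exercise the tsearch code path itself.

```sql
-- Lower-case the same string under the "C" collation and under an ICU
-- collation; only the collation differs between the two columns.
SELECT lower('ÉTÉ' COLLATE "C")         AS lower_c,
       lower('ÉTÉ' COLLATE "und-x-icu") AS lower_icu;
```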

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/

Re: fixing tsearch locale support

From

Heikki Linnakangas

Date:

August 26, 2025, 19:52:11

On 18/08/2025 18:56, Daniel Verite wrote:
>> There is a PG18 open item to document this possible upgrade incompatibility.
>>
>> I think the following text could be added to the release notes:
>>
>> """
>> The locale implementation underlying full-text search was improved.  It
>> now observes the locale provider configured for the database.  It was
>> previously hardcoded to use the configured libc LC_CTYPE setting
>> [...]
> 
> That sounds misleading because LC_CTYPE is still used in 18.
> 
> To illustrate in an ICU database, the parser will classify "Em Dash"
> as a separator or not depending on LC_CTYPE.
> 
> with LC_CTYPE=C
> 
> => select alias, token,lexemes from ts_debug('simple', U&'ABCD\2014EFGH');
>   alias |   token   |   lexemes    
> -------+-----------+-------------
>   word  | ABCD—EFGH | {abcd—efgh}
> 
> 
> with LC_CTYPE=en_US.utf8 (glibc 2.35):
> 
> => select alias, token,lexemes from ts_debug('simple', U&'ABCD\2014EFGH');
>     alias   | token | lexemes
> -----------+-------+---------
>   asciiword | ABCD  | {abcd}
>   blank     | —     |
>   asciiword | EFGH  | {efgh}
> 
> 
> OTOH lower casing uses LC_CTYPE in 17, but not in 18, leading
> to better lexemes.
> 
> pg17, ICU locale, LC_CTYPE=C
> 
> => select alias, token,lexemes from ts_debug('simple', 'ÉTÉ');
>   alias | token | lexemes
> -------+-------+---------
>   word  | ÉTÉ   | {ÉtÉ}
> 
> pg18, ICU locale, LC_CTYPE=C
> 
> => select alias, token,lexemes from ts_debug('simple', 'ÉTÉ');
>   alias | token | lexemes
> -------+-------+---------
>   word  | ÉTÉ   | {été}
> 
> So maybe the release notes should say
> "now observes the locale provider configured for the database to
> convert strings to lower case".

Is it only used for converting to lower case, or are there any other 
operations that need to be mentioned? Converting to upper case too, I 
presume. (I haven't been following this thread.)

We only support two collation providers, libc and ICU, right? That makes 
Peter's phrasing "In database clusters that use a locale provider other 
than libc ..." an unnecessarily complicated way of saying ICU.

Putting those two changes together:

"""
The locale implementation underlying full-text search was improved.  It 
now observes the collation provider configured for the database for 
converting strings to upper and lower case.  It was previously hardcoded 
to use libc.  In databases that use the ICU collation provider and where 
the configured ICU locale behaves differently from the LC_CTYPE setting 
configured for the database, this could cause changes in behavior of 
some functions related to full-text search as well as the pg_trgm 
extension.  When upgrading such database clusters using pg_upgrade, it 
is recommended to reindex all indexes related to full-text search and 
pg_trgm after the upgrade.
"""

I wonder if it's clear enough that this applies to full-text search, not 
upper/lower case conversions in general. (Is that true?)

It's pretty urgent to get the release notes in shape, people are testing 
upgrades with the betas already...

- Heikki

Re: fixing tsearch locale support

From

Peter Eisentraut

Date:

August 29, 2025, 11:34:17

On 26.08.25 18:52, Heikki Linnakangas wrote:
>> So maybe the release notes should say
>> "now observes the locale provider configured for the database to
>> convert strings to lower case".
> 
> Is it only used for converting to lower case, or are there any other 
> operations that need to be mentioned? Converting to upper case too, I 
> presume. (I haven't been following this thread.)

It's actually only lower case.  (It should really be casefold, but that 
might be a separate project for another day.)  But after reading this a 
few times, just writing "for converting to lower case" led me to ask 
"but what about upper case", so I reworded it to "case conversion".

> We only support two collation providers, libc and ICU, right? That makes 
> Peter's phrasing "In database clusters that use a locale provider other 
> than libc ..." an unnecessarily complicated way of saying ICU.

There is the "builtin" provider, and it is affected by this as well.

> It's pretty urgent to get the release notes in shape, people are testing 
> upgrades with the betas already...

I have committed this release note item with some adjustment now.
