Thread: make tsearch use the database default locale

make tsearch use the database default locale

From: Jeff Davis
Date:
tsvector and tsquery are not collatable types, but they do need locale
information to parse the original text. It would not do any good to
make it a collatable type, because a COLLATE clause would typically be
applied after the parsing is done.

Previously, tsearch used the database CTYPE for parsing, but that's not
good because it creates an unnecessary dependency on libc even when the
user has requested another provider.

This patch series allows tsearch to use the database default locale for
parsing. If the database collation is libc, there's no change.


Motivation:

  (a) it reduces the dependence on setlocale(), which is not
thread-safe;
  (b) if a user is using the builtin or ICU providers, understanding
the effects of LC_CTYPE can be very confusing;
  (c) it would allow us to test more of the tsearch parsing behavior.


Notes:

* Behavior should be exactly the same as before if the database
locale provider is libc. If the database locale provider is builtin or
ICU, then there will be some differences in tsearch parsing behavior.

* Most of the patches are straightforward, but v1-0005 might need extra
attention. There are quite a few cases there with subtle distinctions,
and I might have missed something. For example, in the "C" locale,
tsearch treats non-ascii characters as alpha, even though the libc
functions do not do so (I preserved this behavior).

* This introduces redundancy between the character isxyz() functions in
regc_pg_locale.c and similar functions in pg_locale.c. It would be easy
enough to refactor to eliminate the redundancy, but that might have
performance implications, so I haven't done it yet.
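To illustrate the "C" locale note above, here is a minimal sketch rather than the patch's actual code. It contrasts what libc's isalpha() reports in the "C" locale with the preserved tsearch rule as described in this message. The function names and the charmaxlen parameter are hypothetical, and the multibyte condition is an assumption about when the rule applies.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stdbool.h>

/* What libc reports for a single byte under the "C" locale. */
bool
libc_isalpha_in_c_locale(unsigned char byte)
{
    /* setlocale() is not thread-safe; used here for demonstration only. */
    setlocale(LC_CTYPE, "C");
    return isalpha(byte) != 0;
}

/*
 * Hypothetical sketch of the preserved rule: in the "C" locale with a
 * multibyte encoding (assumed: charmaxlen > 1), any code point above
 * 0x7f is treated as alphabetic. ASCII is classified by hand so the
 * result does not depend on the current libc locale.
 */
bool
sketch_isalpha_c_locale(unsigned int wc, int charmaxlen)
{
    assert(charmaxlen >= 1);

    if (charmaxlen > 1 && wc > 0x7f)
        return true;

    return (wc >= 'A' && wc <= 'Z') || (wc >= 'a' && wc <= 'z');
}
```

With these definitions, libc_isalpha_in_c_locale(0xE9) is false, while sketch_isalpha_c_locale(0xE9, 4) is true, which is the distinction the note describes.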

Regards,
    Jeff Davis


Attachments

Re: make tsearch use the database default locale

From: Jeff Davis
Date:
On Tue, 2025-10-07 at 15:49 -0700, Jeff Davis wrote:
> This patch series allows tsearch to use the database default locale
> for parsing. If the database collation is libc, there's no change.

I committed a couple of the refactoring patches and rebased. v3
attached.

v3-0003 eliminates the "wstr" logic and uses only "pgwstr". I was a bit
confused about why both were needed, as the purpose of pg_wchar is to
abstract away the problems with wchar_t. Perhaps it's historical, or
perhaps I missed something.

Regarding the risk of behavior changes: this affects parsing the
values, but not the interpretation of values after parsing, so the risk
of index inconsistencies seems low. There's risk that a document parsed
in the old version would be parsed differently in the new version,
though. Overall, it seems comparable to the risk of fb1a18810f.

Regards,
    Jeff Davis


Attachments

Re: make tsearch use the database default locale

From: Peter Eisentraut
Date:
On 08.10.25 00:49, Jeff Davis wrote:
> Previously, tsearch used the database CTYPE for parsing, but that's not
> good because it creates an unnecessary dependency on libc even when the
> user has requested another provider.
> 
> This patch series allows tsearch to use the database default locale for
> parsing. If the database collation is libc, there's no change.

This looks good to me overall.

> * Most of the patches are straightforward, but v1-0005 might need extra
> attention. There are quite a few cases there with subtle distinctions,
> and I might have missed something. For example, in the "C" locale,
> tsearch treats non-ascii characters as alpha, even though the libc
> functions do not do so (I preserved this behavior).

This is indeed a bit mysterious.  AFAICT, the behavior you describe is 
conditional on if (prs->usewide), so it apparently depends also on the 
encoding?  I'm not sure if the new code covers this.

After this patch set, char2wchar() can become a local function in 
pg_locale_libc.c.  (But we still need wchar2char() externally, so maybe 
it's not worth changing this (yes).)




Re: make tsearch use the database default locale

From: Jeff Davis
Date:
On Fri, 2025-10-17 at 18:15 +0200, Peter Eisentraut wrote:
> This is indeed a bit mysterious.  AFAICT, the behavior you describe is
> conditional on if (prs->usewide), so it apparently depends also on the
> encoding?  I'm not sure if the new code covers this.

I believe the new code does cover this case:

Previously, the code was effectively:
   if (prs->usewide && prs->pgwstr != NULL && c > 0x7f)
      return nonascii;

and the new code is:
   if (prs->charmaxlen > 1 && locale->ctype_is_c && wc > 0x7f)
      return nonascii;

Unless I missed something, those are equivalent.
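The equivalence can be checked with a small standalone sketch. It is not the real parser code: the struct and function names are hypothetical, and it rests on two assumptions about the old setup, namely that usewide was set exactly when charmaxlen > 1 and that pgwstr was allocated only when the ctype was C.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the parser and locale state. */
typedef struct SketchParser
{
    int             charmaxlen; /* max bytes per character in the encoding */
    bool            usewide;    /* assumed: true iff charmaxlen > 1 */
    const uint32_t *pgwstr;     /* assumed: non-NULL only for C ctype */
} SketchParser;

typedef struct SketchLocale
{
    bool            ctype_is_c;
} SketchLocale;

/* Old condition, as quoted in the message. */
bool
old_is_nonascii(const SketchParser *prs, uint32_t c)
{
    return prs->usewide && prs->pgwstr != NULL && c > 0x7f;
}

/* New condition, as quoted in the message. */
bool
new_is_nonascii(const SketchParser *prs, const SketchLocale *locale,
                uint32_t wc)
{
    return prs->charmaxlen > 1 && locale->ctype_is_c && wc > 0x7f;
}

/* Build parser state the way the old setup is assumed to have done. */
SketchParser
sketch_init(int charmaxlen, bool ctype_is_c, const uint32_t *buf)
{
    SketchParser prs;

    assert(charmaxlen >= 1);
    prs.charmaxlen = charmaxlen;
    prs.usewide = charmaxlen > 1;
    prs.pgwstr = (prs.usewide && ctype_is_c) ? buf : NULL;
    return prs;
}

/* Returns true if both conditions agree for every combination tried. */
bool
predicates_agree(void)
{
    static const uint32_t buf[1] = {0};
    static const uint32_t chars[] = {0x41, 0x7f, 0x80, 0xe9, 0x4e2d};

    for (int maxlen = 1; maxlen <= 4; maxlen++)
        for (int is_c = 0; is_c <= 1; is_c++)
            for (size_t i = 0; i < sizeof(chars) / sizeof(chars[0]); i++)
            {
                SketchParser prs = sketch_init(maxlen, is_c != 0, buf);
                SketchLocale loc = {is_c != 0};

                if (old_is_nonascii(&prs, chars[i]) !=
                    new_is_nonascii(&prs, &loc, chars[i]))
                    return false;
            }
    return true;
}
```

Under those two assumptions, predicates_agree() returns true. If either assumption does not hold in the real code, the two conditions could differ.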

> After this patch set, char2wchar() can become a local function in
> pg_locale_libc.c.  (But we still need wchar2char() externally, so maybe
> it's not worth changing this (yes).)

Done.

The rest of the patches are rebased with no other changes. I plan to
commit soon.

Regards,
    Jeff Davis


Attachments

Re: make tsearch use the database default locale

From: Peter Eisentraut
Date:
On 19.10.25 02:29, Jeff Davis wrote:
> On Fri, 2025-10-17 at 18:15 +0200, Peter Eisentraut wrote:
>> This is indeed a bit mysterious.  AFAICT, the behavior you describe is
>> conditional on if (prs->usewide), so it apparently depends also on the
>> encoding?  I'm not sure if the new code covers this.
> 
> I believe the new code does cover this case:
> 
> Previously, the code was effectively:
>     if (prs->usewide && prs->pgwstr != NULL && c > 0x7f)
>        return nonascii;
> 
> and the new code is:
>     if (prs->charmaxlen > 1 && locale->ctype_is_c && wc > 0x7f)
>        return nonascii;
> 
> unless I missed something, those are equivalent.

Yes, this looks ok.