Bug concerning regular expressions and UTF-8

From	Helmar Spangenberg
Subject	Bug concerning regular expressions and UTF-8
Date	21 January 2006, 21:56:22
Msg-id	200601201803.17706.hspangenberg@frey.de
List	pgsql-bugs

Hello folks,

My system is SuSE Linux 10.0 with a plain PostgreSQL 8.1.2 (compiled by
myself, NLS enabled). The locale is set to de_DE.UTF-8.

The bug shows up when using the operator '~*' with umlauts. An easy way to
produce a faulty result is

select 'XXXMÜLLERYyyy' ~* '.*müller.*';

The result should be TRUE, but Postgres returns FALSE (see also the
discussion on www.pg-forum.de, subject "Konfiguration", thread "Umlaute bei
Regular Expressions"). The problem does not seem to exist in Windows-based
installations.

It seems to me that this bug originates in the file
src/backend/regex/regc_locale.c. The functions pg_wc_tolower(pg_wchar) and
pg_wc_toupper(pg_wchar) rely on the C functions toupper(unsigned char) and
tolower(unsigned char), which are definitely the wrong choice for UTF-8
characters beyond the ASCII range.

To check my assumption, I replaced the bodies of pg_wc_tolower and
pg_wc_toupper simply with "return towlower(c);" and "return towupper(c);",
which led to the correct result for
select 'XXXMÜLLERYyyy' ~* '.*müller.*';
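
To illustrate the difference, here is a minimal standalone sketch. It is not
the actual PostgreSQL source: demo_wchar, fold_narrow, fold_wide and
same_ignoring_case are hypothetical names made up for this example, and
fold_narrow only mimics the behaviour described above (case folding limited
to values that fit into an unsigned char). With a UTF-8 locale, tolower()
typically leaves values above 0x7F unchanged, so U+00DC (Ü) and U+00FC (ü)
never compare equal that way.

```c
#include <ctype.h>
#include <limits.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical stand-in for pg_wchar; an assumption made for this sketch. */
typedef unsigned int demo_wchar;

/*
 * Narrow folding: only values that fit into an unsigned char are passed to
 * tolower(); everything else is returned unchanged.
 */
static demo_wchar
fold_narrow(demo_wchar c)
{
    if (c <= UCHAR_MAX)
        return (demo_wchar) tolower((unsigned char) c);
    return c;
}

/*
 * Wide folding: hand the code point to towlower(). The result depends on
 * the current locale, and this assumes that wchar_t values are Unicode
 * code points (true for glibc, not guaranteed everywhere).
 */
static demo_wchar
fold_wide(demo_wchar c)
{
    return (demo_wchar) towlower((wint_t) c);
}

/* Compare two code points case-insensitively with the given fold function. */
static int
same_ignoring_case(demo_wchar a, demo_wchar b, demo_wchar (*fold) (demo_wchar))
{
    return fold(a) == fold(b);
}
```

After setlocale(LC_ALL, "de_DE.UTF-8"), I would expect
same_ignoring_case(0xDC, 0xFC, fold_wide) to report a match while the
fold_narrow variant does not, provided that locale is installed. Whether
wide folding is safe for every server encoding is a separate question that
this sketch does not answer.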

Since I have no idea what side effects this change might have, please let
me know as soon as an "official" patch is available. I definitely need
regular expressions that handle UTF-8 correctly...

Thanks,
Helmar Spangenberg
e-mail: hspangenberg@frey.de
