Bug concerning regular expressions and UTF-8
From | Helmar Spangenberg |
---|---|
Subject | Bug concerning regular expressions and UTF-8 |
Date | |
Msg-id | 200601201803.17706.hspangenberg@frey.de |
List | pgsql-bugs |
Hello folks,

my system is a SuSE 10.0 Linux with a plain PostgreSQL 8.1.2 (compiled by myself, NLS enabled). LOCALE is set to de_DE.UTF-8.

The bug shows up when using the operator '~*' with umlauts. An easy way to produce a faulty result is

    select 'XXXMÜLLERYyyy' ~* '.*müller.*';

The result should be "TRUE"; however, Postgres thinks it's "FALSE" (see also the discussion on www.pg-forum.de, subject "Konfiguration", thread "Umlaute bei Regular Expressions"). It seems that this problem does not exist in Windows-based installations.

It seems to me that this bug originates in the file src/backend/regex/regc_locale.c. The functions pg_wc_tolower(pg_wchar) and pg_wc_toupper(pg_wchar) rely on the C functions toupper(unsigned char) and tolower(unsigned char), which definitely are the wrong choice for UTF8 characters beyond the ASCII range.

To check my guess, I replaced the bodies of pg_wc_tolower and pg_wc_toupper simply by "return towlower(c);" and "return towupper(c);", which led to the correct result for

    select 'XXXMÜLLERYyyy' ~* '.*müller.*';

Since I don't have any idea concerning the side effects of this change, please let me know as soon as an "official" patch is available - I definitely do need regular expressions handling UTF8 correctly...

Thanks,
Helmar Spangenberg
e-mail: hspangenberg@frey.de
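The difference between the two approaches described in the message can be sketched in a few lines of C. This is only an illustration under stated assumptions, not the PostgreSQL source: the type `codepoint_t` and the functions `fold_narrow`, `fold_wide` and `same_letter` are hypothetical names invented for this sketch, and the byte-oriented variant is modeled on the description above (case folding through `tolower(unsigned char)`), not copied from regc_locale.c.

```c
#include <ctype.h>
#include <limits.h>
#include <wctype.h>

/* Hypothetical code-point type for this sketch (not PostgreSQL's pg_wchar). */
typedef unsigned int codepoint_t;

/*
 * Byte-oriented fold, modeled on the description in the message:
 * tolower() is only defined for values representable as unsigned char,
 * so any larger code point is returned unchanged.
 */
codepoint_t fold_narrow(codepoint_t c)
{
    if (c <= UCHAR_MAX)
        return (codepoint_t) tolower((unsigned char) c);
    return c;
}

/*
 * Wide-character fold, as in the replacement tried in the message.
 * towlower() receives the whole wide character; which non-ASCII letters
 * it maps depends on the current LC_CTYPE locale.
 */
codepoint_t fold_wide(codepoint_t c)
{
    return (codepoint_t) towlower((wint_t) c);
}

/* Case-insensitive comparison of two code points under a given fold. */
int same_letter(codepoint_t a, codepoint_t b,
                codepoint_t (*fold)(codepoint_t))
{
    return fold(a) == fold(b);
}
```

For ASCII letters both folds agree. For U+00DC (Ü) and U+00FC (ü), after `setlocale(LC_ALL, "de_DE.UTF-8")` on a system where `wchar_t` holds Unicode code points, `same_letter(0x00DC, 0x00FC, fold_wide)` would be expected to report a match, whereas the byte-oriented fold typically leaves 0xDC untouched, because a lone byte 0xDC is not a complete character in UTF-8. Both statements depend on the platform and locale and are assumptions here, not something verified against PostgreSQL 8.1.2.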