Re: BUG #15548: Unaccent does not remove combining diacritical characters
From | Tom Lane |
---|---|
Subject | Re: BUG #15548: Unaccent does not remove combining diacritical characters |
Date | |
Msg-id | 11345.1545114237@sss.pgh.pa.us |
In response to | Re: BUG #15548: Unaccent does not remove combining diacritical characters (Michael Paquier <michael@paquier.xyz>) |
Responses | Re: BUG #15548: Unaccent does not remove combining diacritical characters |
List | pgsql-bugs |

Michael Paquier <michael@paquier.xyz> writes:
> On Tue, Dec 18, 2018 at 12:36:02AM -0500, Tom Lane wrote:
>> tl;dr: I think we should convert unaccent.sql and unaccent.out
>> to UTF8 encoding.  Then, adding more test cases for this patch
>> will be easy.

> Do you think that we could also remove the non-ASCII characters from the
> tests?  It would be easy enough to use E'\xNN' (utf8 hex) or such in
> input, and show the output with bytea.

I'm not really for that, because it would make the test cases harder
to verify by eyeball.  With the current setup --- other than the
uncommon-outside-Russia encoding choice --- you don't really need
to read or speak Russian to see that this:

    SELECT unaccent('ёлка');
     unaccent
    ----------
     елка
    (1 row)

probably represents unaccent doing what it ought to.  If everything
is in hex then it's a lot harder.

Ten years ago I might've agreed with your point, but today it's
hard to believe that anyone who takes any interest at all in
unaccent's functionality would not have a UTF8-capable terminal.

> That's harder to read, still we
> discussed about not using UTF-8 in the python script to allow folks with
> simple terminals to touch the code the last time this was touched
> (5e8d670) and the characters used could be documented as comments in the
> tests.

Maybe I'm misremembering, but I thought that discussion was about
the code files.  I am still mistrustful of non-ASCII in our code files.
But for data and test files, we've been accepting UTF8 ever since
the text-search-in-core stuff landed.  Heck, unaccent.rules itself
is UTF8.

regards, tom lane