Re: BUG #15548: Unaccent does not remove combining diacritical characters
From | Tom Lane |
---|---|
Subject | Re: BUG #15548: Unaccent does not remove combining diacritical characters |
Date | |
Msg-id | 23237.1544899488@sss.pgh.pa.us |
In response to | Re: BUG #15548: Unaccent does not remove combining diacritical characters (Hugh Ranalli <hugh@whtc.ca>) |
Responses |
Re: BUG #15548: Unaccent does not remove combining diacritical characters
Re: BUG #15548: Unaccent does not remove combining diacritical characters |
List | pgsql-bugs |
Hugh Ranalli <hugh@whtc.ca> writes:
> On Fri, 14 Dec 2018 at 17:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Me too -- seems like that bears looking into. Perhaps the script's
>> results are platform dependent -- what were you testing on?

> I'm on Linux Mint 17, which is based on Ubuntu 14.04. But I don't think
> that's it. The program's decisions come from the two data files, the
> Unicode data set and the Latin-ASCII transliteration file. The script uses
> categories (
> ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html#General%20Category)
> to identify letters (and now combining marks) and if they are in range,
> performs a substitution. It then uses the transliteration file to find
> rules for particular character substitutions (for example, that file seems
> to handle the copyright symbol substitution). I don't see anything platform
> dependent in there.

Hm. Something funny is going on here. When I fetch the two reference
files from the URLs cited in the script, and do

    python2 generate_unaccent_rules.py --unicode-data-file UnicodeData.txt --latin-ascii-file Latin-ASCII.xml >newrules

I get something that's bit-for-bit the same as what's in unaccent.rules.
So there's clearly a platform difference between here and there.

I'm using Python 2.6.6, which is what ships with RHEL6; have not tried
it on anything newer.

regards, tom lane
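
For readers unfamiliar with the category-based approach described in the quoted text, here is a minimal sketch of the general idea. It is not the code of generate_unaccent_rules.py: it is written for Python 3, it relies on the interpreter's built-in unicodedata tables rather than a downloaded UnicodeData.txt, and the function name strip_combining_marks is hypothetical.

```python
import unicodedata


def strip_combining_marks(text):
    """Return text with combining marks removed.

    Hypothetical helper for illustration only; it is not taken from
    generate_unaccent_rules.py.
    """
    # NFD splits a precomposed character such as U+00E9 into its base
    # letter followed by one or more combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # General categories Mn, Mc and Me are the Unicode "mark" categories.
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )


if __name__ == "__main__":
    # A bare letter followed by a combining acute accent (U+0301).
    print(strip_combining_marks("e\u0301"))
    # A precomposed letter (U+00E9) is handled through decomposition.
    print(strip_combining_marks("caf\u00e9"))
```

The built-in tables are tied to the Unicode version a given Python release ships with (reported by unicodedata.unidata_version), whereas the script discussed here takes its data files as explicit arguments.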