BUG #15548: Unaccent does not remove combining diacritical characters
From | PG Bug reporting form |
---|---|
Subject | BUG #15548: Unaccent does not remove combining diacritical characters |
Date | |
Msg-id | 15548-cef1b3f8de190d4f@postgresql.org |
Responses | Re: BUG #15548: Unaccent does not remove combining diacritical characters |
List | pgsql-bugs |
The following bug has been logged on the website:

Bug reference: 15548
Logged by: Hugh Ranalli
Email address: hugh@whtc.ca
PostgreSQL version: 11.1
Operating system: Ubuntu 18.04
Description:

Apparently Unicode has two ways of accenting a character: as a single code point, which represents the base character and the accent together, or as a "combining diacritical mark" (https://en.wikipedia.org/wiki/Combining_Diacritical_Marks), in which case the mark applies itself to the preceding character. For example, A followed by U+0300 displays À. However, unaccent is not removing these accents.

    SELECT unaccent(U&'A\0300');

should result in 'A', but instead results in 'À'.

I'm running PostgreSQL 11.1, installed from the PostgreSQL repositories. I've read bug report #13440, and have tried with both the installed unaccent.rules and a new set generated by the generate_unaccent_rules.py distributed with the 11.1 source code:

    wget http://unicode.org/Public/7.0.0/ucd/UnicodeData.txt
    wget https://www.unicode.org/repos/cldr/trunk/common/transforms/Latin-ASCII.xml
    python generate_unaccent_rules.py --unicode-data-file UnicodeData.txt --latin-ascii-file Latin-ASCII.xml > unaccent.rules

I see there have been some updates to generate_unaccent_rules.py to handle Greek and Vietnamese characters, but neither of them seems to address this issue. I'm happy to contribute a patch to handle these cases, but of course wanted to check first whether this is the desired behaviour or whether I am misunderstanding something somewhere.

Thank you,
Hugh Ranalli
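
For readers who want to see the two Unicode forms side by side, here is a minimal Python sketch of the behaviour the report expects. It is only an illustration and not how unaccent itself works (unaccent applies the mappings listed in unaccent.rules). The function names `strip_combining_marks` and `strip_all_accents` are hypothetical, chosen for this example.

```python
import unicodedata

# The same visible letter can be stored two ways.
PRECOMPOSED = "\u00C0"   # single code point: LATIN CAPITAL LETTER A WITH GRAVE
DECOMPOSED = "A\u0300"   # 'A' followed by COMBINING GRAVE ACCENT


def strip_combining_marks(text: str) -> str:
    """Drop code points in the Combining Diacritical Marks block (U+0300..U+036F).

    The base characters are left untouched, so a precomposed letter
    such as U+00C0 passes through unchanged.
    """
    return "".join(ch for ch in text if not 0x0300 <= ord(ch) <= 0x036F)


def strip_all_accents(text: str) -> str:
    """Decompose to NFD first, then drop every combining character.

    This handles both the precomposed and the decomposed form, at the
    cost of also removing combining marks outside the U+0300 block.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(strip_combining_marks(DECOMPOSED) == "A")
    print(strip_all_accents(PRECOMPOSED) == "A")
```

The first function corresponds to the case described in the report (a base letter followed by a combining mark). The second shows one possible way to treat both representations uniformly, assuming NFD decomposition is acceptable for the data in question.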