Re: BUG #15548: Unaccent does not remove combining diacritical characters
From | Daniel Verite |
---|---|
Subject | Re: BUG #15548: Unaccent does not remove combining diacritical characters |
Date | |
Msg-id | 769f5b7c-42c6-435a-a062-a728891b7d81@manitou-mail.org |
In response to | BUG #15548: Unaccent does not remove combining diacritical characters (PG Bug reporting form <noreply@postgresql.org>) |
Responses | Re: BUG #15548: Unaccent does not remove combining diacritical characters |
List | pgsql-bugs |
PG Bug reporting form wrote:

> Apparently Unicode has two ways of accenting a character: as a separate code
> point, which represents the base character and the accent, or as a
> "combining diacritical mark"
> (https://en.wikipedia.org/wiki/Combining_Diacritical_Marks)

Yes. See also https://en.wikipedia.org/wiki/Unicode_equivalence

In general, PostgreSQL leaves it to applications to normalize
Unicode strings so that they are all in the same canonical form,
either composed or decomposed.

> the mark applies itself to the preceding character. For example, A
> followed by U+0300 displays À. However, unaccent is not removing
> these accents.

Short of having the input normalized by the application, ISTM that
the best solution would be to provide functions to do it in Postgres,
so you'd just write for example:

    unaccent(unicode_NFC(string))

Otherwise unaccent.rules can be customized. You may add replacements
for letter+diacritical sequences that are missing for the languages
you have to deal with. But doing it in general for all diacriticals
multiplied by all base characters seems unrealistic.

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
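For illustration, the application-side normalization mentioned above might look roughly like the following in Python. This is only a sketch under stated assumptions: it uses the standard-library `unicodedata` module, and the helper names `to_nfc` and `strip_combining_marks` are hypothetical, not part of PostgreSQL or the unaccent extension.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return text in composed canonical form (NFC).

    Hypothetical helper: an application could call this before
    sending strings to the database, so that unaccent only sees
    precomposed characters.
    """
    return unicodedata.normalize("NFC", text)


def strip_combining_marks(text: str) -> str:
    """Decompose text (NFD), then drop combining marks.

    Hypothetical application-side alternative to unaccent for the
    combining-mark case only. unicodedata.combining() returns a
    non-zero combining class for marks such as U+0300.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    decomposed_a_grave = "A\u0300"  # "A" followed by COMBINING GRAVE ACCENT
    composed = to_nfc(decomposed_a_grave)

    # The decomposed form is two code points, the composed form is one.
    print(len(decomposed_a_grave), len(composed))
    print(composed == "\u00c0")  # U+00C0 LATIN CAPITAL LETTER A WITH GRAVE
    print(strip_combining_marks(decomposed_a_grave))
```

With input normalized to NFC this way, the A + U+0300 sequence from the report reaches unaccent as the single code point U+00C0, which is the kind of precomposed character its rules file is written to cover.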