Re: [HACKERS] Extra Vietnamese unaccent rules
From | Thomas Munro |
---|---|
Subject | Re: [HACKERS] Extra Vietnamese unaccent rules |
Date | |
Msg-id | CAEepm=39zN5tkbWPVUMifK9uk+rVkyEaXDs-y+DO2R+CtUUEBA@mail.gmail.com |
In response to | Re: [HACKERS] Extra Vietnamese unaccent rules (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses | Re: [HACKERS] Extra Vietnamese unaccent rules |
List | pgsql-hackers |

On Sat, May 27, 2017 at 5:13 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> Nguyen Le Hoang Kha <nlhkha@gmail.com> writes:
>>> Most of the time in Vietnamese language, there are up to 2 accents in a
>>> character. These unaccent rules are added to handle such cases (which are
>>> very common).
>
>> I can't see any reason not to add these --- any objections out there?
>
> Oh, wait a minute. Patching unaccent.rules directly isn't the way
> to do this; that file is supposed to be generated by
> generate_unaccent_rules.py. Can you see how to modify that script
> to produce these rules?

Looking at one example from this patch:

UTF8: <E1><BA><A5>
Codepoint: 1EA5
Name: LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE

In UnicodeData.txt it's this line:

1EA5;LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE;Ll;0;L;00E2 0301;;;;N;;;1EA4;;1EA4

The problem is that generate_unaccent_rules.py assumes that the
decomposition data is a plain letter followed by some number of
diacritical modifiers. That's true for the characters with a single
accent, but in this multi-accent case it's a *composed* character,
00E2 (LATIN SMALL LETTER A WITH CIRCUMFLEX), followed by a diacritical
mark, 0301 (COMBINING ACUTE ACCENT). So we need to teach it to be
recursive.

--
Thomas Munro
http://www.enterprisedb.com
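
A minimal sketch of what "recursive" could mean here. It is not the actual code of generate_unaccent_rules.py: it assumes Python 3 and uses the standard `unicodedata` module instead of parsing UnicodeData.txt, and the helper names (`is_plain_letter`, `is_mark`, `base_letter`) are hypothetical.

```python
import string
import unicodedata


def is_plain_letter(ch):
    """True for an unaccented ASCII letter (assumption: only Latin bases matter)."""
    return ch in string.ascii_letters


def is_mark(ch):
    """True if ch is a combining mark (general category Mn, Mc or Me)."""
    return unicodedata.category(ch).startswith("M")


def base_letter(ch):
    """Strip combining marks recursively; return the plain letter, or None."""
    if is_plain_letter(ch):
        return ch
    decomposition = unicodedata.decomposition(ch)
    # No decomposition, or a compatibility one such as "<compat> ...": give up.
    if not decomposition or decomposition.startswith("<"):
        return None
    parts = [chr(int(cp, 16)) for cp in decomposition.split()]
    first, rest = parts[0], parts[1:]
    # Everything after the first code point must be a combining mark.
    if not rest or not all(is_mark(c) for c in rest):
        return None
    # The first code point may itself be composed (e.g. 00E2), so recurse.
    return base_letter(first)


if __name__ == "__main__":
    # 1EA5 decomposes to 00E2 0301, and 00E2 decomposes to 0061 0302.
    print(base_letter("\u1ea5"))  # a
```

With a non-recursive check, 1EA5 would be rejected because its first decomposition element, 00E2, is not a plain letter. The recursive call follows 00E2 down to 0061.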