Re: [HACKERS] Extra Vietnamese unaccent rules
From | Kha Nguyen |
---|---|
Subject | Re: [HACKERS] Extra Vietnamese unaccent rules |
Date | |
Msg-id | B7A3AD71-931B-4559-96FF-2E9D2B179651@gmail.com |
In reply to | Re: [HACKERS] Extra Vietnamese unaccent rules (Thomas Munro <thomas.munro@enterprisedb.com>) |
List | pgsql-hackers |
Does this mean that the Python script has to be updated to be recursive too?

> On 27 May 2017, at 0.48, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>
> On Sat, May 27, 2017 at 9:09 AM, Kha Nguyen <nlhkha@gmail.com> wrote:
>> Could you explain to me what this line means:
>> “
>> 1EA5;LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE;Ll;0;L;00E2 0301;;;;N;;;1EA4;;1EA4
>> “
>>
>> If you could give me an example of adding a rule for “recursive” case, I can do the rest. I am not familiar with this unaccent format generation yet.
>
> So contrib/unaccent/generate_unaccent_rules.py is a Python script that
> takes UnicodeData.txt, a list of information about all Unicode
> codepoints available at a URL that is shown in a comment, and
> generates unaccent.rules. The idea was to avoid having to change it
> manually every time someone finds characters that should be in there
> (as you have just done!) by doing it systematically.
>
> Unicode has two ways to represent characters with accents: either with
> composed codepoints like "é" or decomposed codepoints where you say
> "e" and then "´". The field "00E2 0301" is the decomposed form of
> that character above. Our job here is to identify the basic letter
> that each composed character contains, by analysing the decomposed
> field that you see in that line. I failed to realise that characters
> with TWO accents are described as a composed character with ONE accent
> plus another accent.
>
> You don't have to worry about decoding that line, it's all done in
> that Python script. The problem is just in the function
> is_letter_with_marks(). Instead of just checking if combining_ids[0]
> is a plain letter, it looks like it should also check if
> combining_ids[0] itself is a letter with marks. Also get_plain_letter
> would need to be able to recurse to extract the "a".
>
> I hope that helps!
>
> --
> Thomas Munro
> http://www.enterprisedb.com
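
For reference, the recursive check suggested in the quoted message could look roughly like the sketch below. It is not the code in generate_unaccent_rules.py: the names is_letter_with_marks, get_plain_letter and combining_ids come from the message, while is_plain_letter, is_mark and the function signatures are assumptions made for illustration. It also reads decompositions from Python's unicodedata module instead of parsing UnicodeData.txt, only to keep the example self-contained.

```python
import unicodedata


def combining_ids(cp):
    """Return the canonical decomposition of codepoint cp as a list of ints.

    Assumption: this stands in for the decomposition field the real script
    parses out of UnicodeData.txt. Compatibility decompositions (tagged
    like "<compat> ...") are ignored here.
    """
    decomp = unicodedata.decomposition(chr(cp))
    if decomp.startswith("<"):
        return []
    return [int(field, 16) for field in decomp.split()]


def is_mark(cp):
    """True if cp is a combining mark (general category Mn, Me or Mc)."""
    return unicodedata.category(chr(cp)) in ("Mn", "Me", "Mc")


def is_plain_letter(cp):
    """True if cp is an unaccented ASCII letter (hypothetical helper)."""
    return ord("a") <= cp <= ord("z") or ord("A") <= cp <= ord("Z")


def is_letter_with_marks(cp):
    """True if cp decomposes to a base letter followed only by marks.

    The base may be a plain letter or, recursively, another letter with
    marks, which covers characters carrying two accents.
    """
    ids = combining_ids(cp)
    if len(ids) < 2:
        return False
    if not all(is_mark(i) for i in ids[1:]):
        return False
    return is_plain_letter(ids[0]) or is_letter_with_marks(ids[0])


def get_plain_letter(cp):
    """Follow the first decomposition element down to the plain letter."""
    if is_plain_letter(cp):
        return cp
    if is_letter_with_marks(cp):
        return get_plain_letter(combining_ids(cp)[0])
    raise ValueError("U+%04X is not a letter with marks" % cp)
```

With this sketch, U+1EA5 (decomposition 00E2 0301) first resolves to U+00E2, which in turn decomposes to 0061 0302, so chr(get_plain_letter(0x1EA5)) gives "a".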