Re: [HACKERS] Extra Vietnamese unaccent rules
From | Michael Paquier |
---|---|
Subject | Re: [HACKERS] Extra Vietnamese unaccent rules |
Date | |
Msg-id | CAB7nPqTVYiaCqWgviJU1mM4LazGbuZ2sxp7qBAPNCHzSZAF4Dw@mail.gmail.com |
In reply to | Re: [HACKERS] Extra Vietnamese unaccent rules (Thomas Munro <thomas.munro@enterprisedb.com>) |
Responses | Re: [HACKERS] Extra Vietnamese unaccent rules |
List | pgsql-hackers |
On Fri, May 26, 2017 at 5:48 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> Unicode has two ways to represent characters with accents: either with
> composed codepoints like "é" or decomposed codepoints where you say
> "e" and then "´". The field "00E2 0301" is the decomposed form of
> that character above. Our job here is to identify the basic letter
> that each composed character contains, by analysing the decomposed
> field that you see in that line. I failed to realise that characters
> with TWO accents are described as a composed character with ONE accent
> plus another accent.

Doesn't that depend on the NF operation you are working on? With a canonical decomposition, it seems to me that a character with two accents can just as well be decomposed into one base character and two combining accent characters (NFKC does a canonical decomposition in one of its steps).

> You don't have to worry about decoding that line, it's all done in
> that Python script. The problem is just in the function
> is_letter_with_marks(). Instead of just checking if combining_ids[0]
> is a plain letter, it looks like it should also check if
> combining_ids[0] itself is a letter with marks. Also get_plain_letter
> would need to be able to recurse to extract the "a".

Actually, with the recent work done on unicode_norm_table.h, which transposed UnicodeData.txt into user-friendly tables, shouldn't the Python script of unaccent/ be replaced by something that works on this table? This does a canonical decomposition but keeps only the first characters with a class ordering of 0. So we have basic APIs able to look at UnicodeData.txt and let the caller decide what to do with the result returned.

--
Michael
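
To make the two-accent case concrete, here is a minimal sketch using Python's standard unicodedata module as a stand-in for UnicodeData.txt. It is not the script in contrib/unaccent, and it does not use unicode_norm_table.h. The helper names first_level, plain_letter and base_chars are hypothetical and chosen only for illustration. The example character U+1EA5 is assumed to be the one behind the "00E2 0301" field quoted above.

```python
import unicodedata


def first_level(ch):
    """Codepoints from the character's decomposition field, one level deep.

    Compatibility tags such as <compat> are dropped; an empty list means
    the character has no decomposition.
    """
    fields = unicodedata.decomposition(ch).split()
    return [int(f, 16) for f in fields if not f.startswith("<")]


def plain_letter(ch):
    """Follow the first codepoint of each decomposition until none is left."""
    ids = first_level(ch)
    if not ids:
        return ch
    return plain_letter(chr(ids[0]))


def base_chars(s):
    """Fully decompose (NFD) and keep only characters with combining class 0."""
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if unicodedata.combining(c) == 0)


ch = "\u1ea5"  # LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE

# One level: a composed letter with ONE accent, plus another accent.
print([f"{c:04X}" for c in first_level(ch)])
# Full canonical decomposition: one base letter and two combining accents.
print([f"{ord(c):04X}" for c in unicodedata.normalize("NFD", ch)])
# Both approaches arrive at the same plain letter.
print(plain_letter(ch), base_chars(ch))
```

The one-level field gives 00E2 0301, while the full canonical decomposition gives 0061 0302 0301. Recursing on the first codepoint and filtering the fully decomposed string by combining class 0 both yield "a", which is the choice being discussed for the unaccent rules.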