Re: [PATCH] Completed unaccent dictionary with many missing characters
От | Michael Paquier |
---|---|
Тема | Re: [PATCH] Completed unaccent dictionary with many missing characters |
Дата | |
Msg-id | Yq/SMQqOAK9w0nJZ@paquier.xyz обсуждение исходный текст |
Ответ на | Re: [PATCH] Completed unaccent dictionary with many missing characters (Przemysław Sztoch <przemyslaw@sztoch.pl>) |
Ответы |
Re: [PATCH] Completed unaccent dictionary with many missing characters
|
Список | pgsql-hackers |
On Wed, Jun 15, 2022 at 01:01:37PM +0200, Przemysław Sztoch wrote: > Two fixes (bad comment and fixed Latin-ASCII.xml). if codepoint.general_category.startswith('L') and \ - len(codepoint.combining_ids) > 1: + len(codepoint.combining_ids) > 0: So, this one checks for the case where a codepoint is within the letter category. As far as I can see this indeed adds a couple of characters, with a combination of Greek and Latin letters. So that looks fine. + elif codepoint.general_category.startswith('N') and \ + len(codepoint.combining_ids) > 0 and \ + args.noLigaturesExpansion is False and is_ligature(codepoint, table): + charactersSet.add((codepoint.id, + "".join(chr(combining_codepoint.id) + for combining_codepoint + in get_plain_letters(codepoint, table)))) And this one is for the numerical part of the change. Do you actually need to apply is_ligature() here? I would have thought that this only applies to letters. - assert(False) + assert False, 'Codepoint U+%0.2X' % codepoint.id [...] - assert(is_ligature(codepoint, table)) + assert is_ligature(codepoint, table), 'Codepoint U+%0.2X' % codepoint.id These two are a good idea for debugging. - return all(is_letter(table[i], table) for i in codepoint.combining_ids) + return all(i in table and is_letter(table[i], table) for i in codepoint.combining_ids) It looks like this makes the code weaker, as we would silently skip characters that are not part of the table rather than checking for them all the time? While recreating unaccent.rules with your patch, I have noticed what looks like an error. An extra rule mapping U+210C (black-letter capital h) to "x" gets added on top of te existing one for "H", but the correct answer is the existing rule, not the one added by the patch. -- Michael
Вложения
В списке pgsql-hackers по дате отправления: