Re: [PATCH] Completed unaccent dictionary with many missing characters
From | Przemysław Sztoch |
---|---|
Subject | Re: [PATCH] Completed unaccent dictionary with many missing characters |
Date | |
Msg-id | 4c9326a1-6554-262f-1f22-e636933086ed@sztoch.pl |
In response to | Re: [PATCH] Completed unaccent dictionary with many missing characters (Michael Paquier <michael@paquier.xyz>) |
Responses |
Re: [PATCH] Completed unaccent dictionary with many missing characters
Re: [PATCH] Completed unaccent dictionary with many missing characters |
List | pgsql-hackers |
Michael Paquier wrote on 7/5/2022 9:22 AM:
> On Tue, Jun 28, 2022 at 02:14:53PM +0900, Michael Paquier wrote:
>> Well, the addition of cyrillic does not make necessary the removal of SOUND RECORDING COPYRIGHT or the DEGREEs, that implies the use of a dictionary when manipulating the set of codepoints, but that's me being too picky. Just to say that I am fine with what you are proposing here.
>
> So, I have been looking at the change for cyrillic letters, and are you sure that the range of codepoints [U+0410,U+044f] is right when it comes to consider all those letters as plain letters? There are a couple of characters that itch me a bit with this range:
> - What of the letter CAPITAL SHORT I (U+0419) and SMALL SHORT I (U+0439)? Shouldn't U+0439 be translated to U+0438 and U+0419 translated to U+0418? That's what I get while looking at UnicodeData.txt, and it would mean that the range of plain letters should not include both of them.

1. It's good that you noticed it. I missed it. But it doesn't affect the generated rule list.

> - It seems like we are missing a couple of letters after U+044F, like U+0454, U+0456 or U+0455 just to name three of them?

2. I added a few more letters that are used in languages other than Russian, such as Belarusian and Ukrainian.
- (0x0410, 0x044f), # Cyrillic capital and small letters
+ (0x0402, 0x0402), # Cyrillic capital and small letters
+ (0x0404, 0x0406), #
+ (0x0408, 0x040b), #
+ (0x040f, 0x0418), #
+ (0x041a, 0x0438), #
+ (0x043a, 0x044f), #
+ (0x0452, 0x0452), #
+ (0x0454, 0x0456), #
I did not add more, because the remaining letters probably belong to older languages.
An alternative might be to rely entirely on Unicode decomposition ...
However, after the change, only one additional Ukrainian letter with an accent was added to the rule file.
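
The ranges in the diff can be cross-checked against Unicode decomposition data. The following is only a minimal sketch, assuming Python's standard `unicodedata` module is an acceptable stand-in for UnicodeData.txt (its Unicode version may differ from the file used to generate the rules). The helper names `base_letter` and `decomposable_in_ranges` are hypothetical and are not part of the patch.

```python
import unicodedata

# Ranges copied from the diff above.
PLAIN_CYRILLIC_RANGES = [
    (0x0402, 0x0402),
    (0x0404, 0x0406),
    (0x0408, 0x040b),
    (0x040f, 0x0418),
    (0x041a, 0x0438),
    (0x043a, 0x044f),
    (0x0452, 0x0452),
    (0x0454, 0x0456),
]


def base_letter(codepoint):
    """Return the first code point of the canonical decomposition,
    or None if the character has no canonical decomposition."""
    decomposition = unicodedata.decomposition(chr(codepoint))
    # Compatibility decompositions start with a tag such as "<compat>".
    if not decomposition or decomposition.startswith("<"):
        return None
    return int(decomposition.split()[0], 16)


def decomposable_in_ranges(ranges):
    """List code points inside the ranges that decompose canonically,
    i.e. candidates that should not be treated as plain letters."""
    return [
        cp
        for low, high in ranges
        for cp in range(low, high + 1)
        if base_letter(cp) is not None
    ]
```

With this sketch, `base_letter(0x0419)` gives `0x0418` and `base_letter(0x0439)` gives `0x0438`, matching the SHORT I observation, and `decomposable_in_ranges(PLAIN_CYRILLIC_RANGES)` is empty, so none of the listed plain letters has a canonical decomposition.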
> I have extracted from 0001 and applied the parts about the regression tests for degree signs, while adding two more for SOUND RECORDING COPYRIGHT (U+2117) and Black-Letter Capital H (U+210C) translated to 'x', while it should be probably 'H'.

3. The matter is not that simple. When I change the priorities (i.e. Latin-ASCII.xml becomes less important than Unicode decomposition), then U+33D7 is no longer translated to "pH" but to "PH".
In the end, I left it like it was before ...
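
The effect of preferring Unicode decomposition can be illustrated with compatibility normalization. This is only a sketch using Python's standard `unicodedata` module, not the logic of the rule generator itself, and `compat_plain` is a hypothetical helper name.

```python
import unicodedata


def compat_plain(char):
    """Apply NFKD and drop combining marks. This approximates what
    relying on Unicode decomposition alone would produce."""
    decomposed = unicodedata.normalize("NFKD", char)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


if __name__ == "__main__":
    # SQUARE PH, BLACK-LETTER CAPITAL H, SOUND RECORDING COPYRIGHT
    for cp in (0x33D7, 0x210C, 0x2117):
        print(f"U+{cp:04X} -> {compat_plain(chr(cp))!r}")
```

Here U+33D7 comes out as "PH" (its compatibility decomposition is U+0050 U+0048) and U+210C as "H", while U+2117 has no decomposition and is returned unchanged, so it still needs a rule from another source.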
If you decide what to do with point 3, I will correct it and send new patches.
--
Przemysław Sztoch | Mobile +48 509 99 00 66