Re: BUG #13440: unaccent does not remove all diacritics
From | Thomas Munro |
---|---|
Subject | Re: BUG #13440: unaccent does not remove all diacritics |
Date | |
Msg-id | CAEepm=0YVseDdN3Odjg2AZ2QvEPshqwJf=4zZbea5cwMQEP1Bw@mail.gmail.com |
In reply to | Re: BUG #13440: unaccent does not remove all diacritics (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: BUG #13440: unaccent does not remove all diacritics
Re: BUG #13440: unaccent does not remove all diacritics |
List | pgsql-bugs |
On Tue, Jun 16, 2015 at 12:55 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> My terminal shows these characters to be different. One is
>> http://graphemica.com/%C8%9B
>> latin small letter t with comma below (U+021B)
>
>> The other is
>> http://graphemica.com/%C5%A3
>> latin small letter t with cedilla (U+0163)
>
> Ah-hah -- I did not look closely enough. So the immediate answer for
> Michael is to add another entry to his unaccent.rules file.
>
> Should we add the missing character to the standard unaccent.rules file?

It looks like Romanian also has s with comma. Perhaps we should have
all these characters:

$ curl -s http://unicode.org/Public/7.0.0/ucd/UnicodeData.txt | egrep ';LATIN (SMALL|CAPITAL) LETTER [A-Z] WITH ' | wc -l
702

That's quite a lot more than the 187 we currently have. Of those, I
think only the following ligature characters don't fit the above
pattern: Æ, æ, Ĳ, ĳ, Œ, œ, ß. Incidentally, I don't believe that the
way we "unaccent" ligatures is correct anyway. Maybe they should be
expanded to AE, ae, IJ, ij, OE, oe, ss, respectively, not A, a, I, i,
O, o, S as we have it, but I guess it depends what the purpose of
unaccent is...

--
Thomas Munro
http://www.enterprisedb.com
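
To illustrate the idea in the message, here is a minimal Python sketch of how rules might be derived from character names using the same pattern as the egrep command. It is not the project's actual tooling. The `derive_rules` helper, the `SAMPLE` records (cut down to the code point and name fields of the UnicodeData.txt format), and the tab-separated output layout are assumptions made for illustration; the ligature expansions are the ones suggested above.

```python
import re

# Sample records, abbreviated to the first two UnicodeData.txt fields
# (code point in hex, character name). The real file has more fields.
SAMPLE = """\
0053;LATIN CAPITAL LETTER S
0162;LATIN CAPITAL LETTER T WITH CEDILLA
0163;LATIN SMALL LETTER T WITH CEDILLA
0219;LATIN SMALL LETTER S WITH COMMA BELOW
021B;LATIN SMALL LETTER T WITH COMMA BELOW
00E6;LATIN SMALL LETTER AE
"""

# Same pattern as the egrep command, with the base letter captured.
PATTERN = re.compile(r";LATIN (SMALL|CAPITAL) LETTER ([A-Z]) WITH ")

# Ligature expansions proposed in the message; names do not match PATTERN.
LIGATURES = {
    "\u00c6": "AE", "\u00e6": "ae",  # AE ligatures
    "\u0132": "IJ", "\u0133": "ij",  # IJ ligatures
    "\u0152": "OE", "\u0153": "oe",  # OE ligatures
    "\u00df": "ss",                  # sharp s
}


def derive_rules(lines):
    """Map each matching character to its base letter (hypothetical helper)."""
    rules = {}
    for line in lines:
        match = PATTERN.search(line)
        if match is None:
            continue  # plain letters, ligatures, non-Latin characters
        case, letter = match.groups()
        char = chr(int(line.split(";", 1)[0], 16))
        rules[char] = letter if case == "CAPITAL" else letter.lower()
    return rules


rules = derive_rules(SAMPLE.splitlines())
rules.update(LIGATURES)

# Emit one "source<TAB>target" pair per line (assumed rules-file layout).
for source, target in sorted(rules.items()):
    print(f"{source}\t{target}")
```

Running the same function over the full UnicodeData.txt would cover every character counted by the command in the message, though whether a plain base-letter mapping is appropriate for all of them is a separate question.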