Re: BUG #15347: Unaccent for greek characters does not work
From | Thomas Munro |
---|---|
Subject | Re: BUG #15347: Unaccent for greek characters does not work |
Date | |
Msg-id | CAEepm=3bFBfv9CBC1r+n7_TsrsWY_JxFqAsUKUozKDvcbstdhw@mail.gmail.com |
In response to | Re: BUG #15347: Unaccent for greek characters does not work (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: BUG #15347: Unaccent for greek characters does not work
|
List | pgsql-bugs |
On Fri, Aug 24, 2018 at 2:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> On Fri, Aug 24, 2018 at 12:12 PM, Michael Paquier <michael@paquier.xyz> wrote:
>>> Perhaps it would be better to avoid non-ASCII characters in this script?
>
>> You mean in the Python script?  Why?  At the top it has a PEP-263
>> encoding declaration:
>> # -*- coding: utf-8 -*-
>
> What happens if someone tries to view this in a non-UTF8 encoding?
>
> As a comparison point, we generally avoid using non-ASCII characters
> directly in the SGML docs; we write out the appropriate SGML entity
> instead.  I think we should try to do the equivalent thing here ---
> I assume python has some way to write "U+nnnn" or some such.

Ok, 2 against 1.  Done.

I'll wait for other opinions on what to do about lower case sigma
before committing.  I'm not keen on adding that special case because:

1.  It's a new kind of thing: previously we did only accent and
ligature removal, but this is removal of variants that exist in only
one case.  It's admittedly a bit like the German ß, which lacks an
upper case version according to some German speakers and undergoes a
lossy conversion to double-S, but that was already handled without a
special case by ligature expansion, so it's not the same thing.

2.  We are down to only 5 hardcoded special cases: two Cyrillic
characters which I suspect will go away if we allow Cyrillic to be
processed via the general mechanism as we are doing here with Greek,
and 3 oddballs that we inherited from the old hand-maintained
unaccent.rules files: DEGREE CELSIUS, DEGREE FAHRENHEIT, and SOUND
RECORDING COPYRIGHT.  I think the degrees signs can be done
automatically with just a bit more Unicode smarts, and I might try
reporting SOUND RECORDING COPYRIGHT as missing from
<character-fallback> to the CLDR project whose data we're using.

3.  The problem seems to go away by itself if you convert to upper case.

-- 
Thomas Munro
http://www.enterprisedb.com
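
For readers following the thread, here is a minimal Python 3 sketch of the points raised above. It is not the code from the unaccent rules generator script; the constant names and the `strip_marks` helper are made up for illustration. It shows non-ASCII characters written as ASCII-only escapes (what Tom's "U+nnnn" request amounts to in Python), both lower case sigmas upper-casing to the same letter (point 3), the lossy upper-casing of ß, and what the standard `unicodedata` module reports for the three inherited oddballs (point 2).

```python
import unicodedata

# Non-ASCII characters written with ASCII-only escapes, so the source file
# looks the same regardless of the encoding it is viewed in.
FINAL_SIGMA = "\u03c2"                   # GREEK SMALL LETTER FINAL SIGMA
SIGMA = "\N{GREEK SMALL LETTER SIGMA}"   # same idea, spelled by character name
CAPITAL_SIGMA = "\u03a3"                 # GREEK CAPITAL LETTER SIGMA


def strip_marks(text):
    """Hypothetical helper: canonically decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


# Point 3: both lower case sigmas map to the same upper case letter.
print(FINAL_SIGMA.upper() == SIGMA.upper() == CAPITAL_SIGMA)

# Accent removal for Greek: alpha with tonos (U+03AC) becomes plain alpha.
print(strip_marks("\u03ac") == "\u03b1")

# German sharp s upper-cases to a double S, a lossy conversion.
print("\u00df".upper() == "SS")

# Point 2: DEGREE CELSIUS has a compatibility decomposition (degree sign + C),
# while SOUND RECORDING COPYRIGHT has no decomposition at all.
print(unicodedata.normalize("NFKD", "\u2103") == "\u00b0C")
print(unicodedata.decomposition("\u2117") == "")
```

Whether the generator script should rely on case conversion or on compatibility decompositions in this way is exactly what the thread leaves open; the sketch only shows what the Python standard library reports for these characters.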