Re: BUG #15548: Unaccent does not remove combining diacritical characters
From | Thomas Munro |
---|---|
Subject | Re: BUG #15548: Unaccent does not remove combining diacritical characters |
Date | |
Msg-id | CAEepm=0qb_nx-f8cACS1=1NdmCj-3D9zXFU+RJHsFbZEztcqjg@mail.gmail.com |
In reply to | Re: BUG #15548: Unaccent does not remove combining diacritical characters (Hugh Ranalli <hugh@whtc.ca>) |
Responses |
Re: BUG #15548: Unaccent does not remove combining diacritical characters
Re: BUG #15548: Unaccent does not remove combining diacritical characters Re: BUG #15548: Unaccent does not remove combining diacritical characters |
List | pgsql-bugs |
On Tue, Dec 18, 2018 at 12:03 PM Hugh Ranalli <hugh@whtc.ca> wrote:
> On Mon, 17 Dec 2018 at 15:31, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Hugh Ranalli <hugh@whtc.ca> writes:
>> > I've attached two patches, one to update generate_unaccent_rules.py, and
>> > another that updates unaccent.rules from the v34 transliteration file.
>>
>> I think you forgot the patches?
>
> Sigh, yes I did. That's what I get for trying to get this sent out before
> heading to an appointment. Patches attached and will add to CF. Let me know
> if you see anything amiss.

+ʹ '
+ʺ "
+ʻ '
+ʼ '
+ʽ '
+˂ <
+˃ >
+˄ ^
+ˆ ^
+ˈ '
+ˋ `
+ː :
+˖ +
+˗ -
+˜ ~

I don't think this is quite right. Those don't seem to be the combining
codepoints[1], and in any case they are being replaced with ASCII
characters, whereas I thought we wanted to replace them with nothing at
all. Here is my attempt to come up with a test case using combining
characters:

    select unaccent('un café crème s''il vous plaît');

It's not stripping the accents. I've attached that in a file for
reference so you can run it with psql -f x.sql, and you can see that it's
using combining code points (code points 0301, 0300, 0302 which come out
as cc81, cc80, cc82 in UTF-8) like so:

    $ xxd x.sql
    00000000: 7365 6c65 6374 2075 6e61 6363 656e 7428  select unaccent(
    00000010: 2775 6e20 6361 6665 cc81 2063 7265 cc80  'un cafe.. cre..
    00000020: 6d65 2073 2727 696c 2076 6f75 7320 706c  me s''il vous pl
    00000030: 6169 cc82 7427 293b 0a0a                 ai..t');..

(To come up with that I used the trick of typing ":%!xxd" and then when
finished ":%!xxd -r", to turn vim into a hex editor.)

[1] https://en.wikipedia.org/wiki/Combining_Diacritical_Marks

--
Thomas Munro
http://www.enterprisedb.com
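
For anyone who wants to reproduce the byte values above without a hex editor, here is a minimal Python sketch. It builds the test string from explicit combining code points, prints each mark with its UTF-8 encoding, and shows what "replace them with nothing at all" would yield. The helper name strip_combining is hypothetical and chosen for illustration only; this is not how the unaccent extension works internally (unaccent applies the rules file), just a way to check the expected result.

```python
import unicodedata

# The test string from the email, written with explicit combining code
# points so it cannot be silently normalized to precomposed letters.
text = "un cafe\u0301 cre\u0300me s'il vous plai\u0302t"

# Print each combining mark together with its UTF-8 byte sequence.
for ch in text:
    if unicodedata.combining(ch):
        name = unicodedata.name(ch)
        print(f"U+{ord(ch):04X} {name} -> {ch.encode('utf-8').hex()}")


def strip_combining(s):
    """Hypothetical helper: decompose to NFD, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


stripped = strip_combining(text)
print(stripped)

# The characters in the proposed rules (e.g. U+02B9, U+02C2) are spacing
# modifier letters: their canonical combining class is 0.
print(unicodedata.combining("\u02b9"), unicodedata.combining("\u02c2"))
```

Because the helper normalizes to NFD first, it also handles precomposed input such as "café" typed as a single U+00E9. That is a design choice of this sketch, not something the thread specifies.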