Re: BUG #13440: unaccent does not remove all diacritics
From | Emre Hasegeli |
---|---|
Subject | Re: BUG #13440: unaccent does not remove all diacritics |
Date | |
Msg-id | CAE2gYzxRa6wWWL1NS2e8+sjzdNKRu5tMs-AGMdo2wcmq6RfTDg@mail.gmail.com |
In reply to | Re: BUG #13440: unaccent does not remove all diacritics (Alvaro Herrera <alvherre@2ndquadrant.com>) |
List | pgsql-bugs |

> To me, conceptually what unaccent does is turn whatever junk you have
> into a very basic common alphabet (ascii); then it's very easy to do
> full text searches without having to worry about what accents the people
> did or did not use in their searches. If we say "okay, but that funny
> char is not an accent so let's leave it alone" then the charter doesn't
> sound so useful to me.

It is the same for me. It is unfortunate that this module is named "unaccent". Many characters in the rules file have nothing to do with accents. They are ordinary letters in some alphabets that are not in ASCII. "replace-with-ascii" would be a better name for it.

> The cases I care about are okay anyway, because all the funny chars in
> spanish are already covered; and maybe German people always enter their
> queries using the funny ss thing I can't even write, and then this is
> not a problem for them.

I have been learning German for only a few months, and even I can confirm that replacing "ß" with "s", or "ü" with "u", is wrong. On the other hand, if they were correctly replaced with "ss" and "ue", I would be really unhappy, because it is just too common in Turkish to press "u" instead of "ü".

I think it is better for this module to replace those characters with a single ASCII character that sounds similar. From this point of view, I think it is fine to replace "ß" with "s" even though it is obviously wrong. This module will never be useful for German without breaking other usages anyway. We can try to cover as many characters as possible while keeping this in mind.

It would also be nice to support other rules for a real "unaccent" and for the correct German replacements. Maybe we can add different rule files to this module.

> Regarding back-patching unaccent.rules changes as discussed downthread,
> I think it's okay to simply document that any indexes using the module
> should be reindexed immediately after upgrading to that minor version.
> The consequence of not doing so is not *that* serious anyway. But then,
> since I'm not actually affected in any way, I'm not strongly holding
> this position either.

I think it would cause more trouble than help if we ever back-patched changes to these rules.
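
The trade-off between single-character and language-correct replacements can be illustrated with a minimal Python sketch. This is not how the unaccent module is implemented, and both rule sets below are hypothetical and cover only a few characters. It only shows why a single-character rule lets a Turkish query typed without "ü" still match, while a German-style rule does not.

```python
# Hypothetical rule sets for illustration only; these are NOT the contents
# of unaccent.rules, and apply_rules() is not part of any PostgreSQL API.
SINGLE_CHAR_RULES = {"ß": "s", "ü": "u", "Ü": "U"}
GERMAN_RULES = {"ß": "ss", "ü": "ue", "Ü": "Ue"}


def apply_rules(text, rules):
    """Replace each character that has a rule; leave all others unchanged."""
    return "".join(rules.get(ch, ch) for ch in text)


# A Turkish user looking for "gül" may simply type "gul".
stored, query = "gül", "gul"

# Single-character rules: stored word and query normalize to the same string.
print(apply_rules(stored, SINGLE_CHAR_RULES) == apply_rules(query, SINGLE_CHAR_RULES))

# German-style rules: the stored word becomes "guel", so the query no longer matches.
print(apply_rules(stored, GERMAN_RULES) == apply_rules(query, GERMAN_RULES))
```

Keeping the two mappings as separate dictionaries mirrors the idea of shipping different rule files: the replacement logic stays the same and only the table changes.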