Re: Remaining dependency on setlocale()
| From | Peter Eisentraut |
|---|---|
| Subject | Re: Remaining dependency on setlocale() |
| Date | |
| Msg-id | dd0cdd1f-e786-426e-b336-1ffa9b2f1fc6@eisentraut.org |
| In response to | Re: Remaining dependency on setlocale() (Jeff Davis <pgsql@j-davis.com>) |
| List | pgsql-hackers |
On 12.12.25 21:11, Jeff Davis wrote:
>>     case '\xc7': /* C with cedilla */
>>
>> so the premise that "fuzzystrmatch is designed for ASCII" does not
>> appear to be correct. Needs more analysis.
>>
>> (But apparently it's not multibyte aware at all, so I don't know what
>> to do about that.)
>
> I didn't notice that, thank you. Agreed, we need a bit more discussion
> around this case as well as soundex().

Soundex is an ASCII-only algorithm. There is no expectation that the
algorithm does anything useful with non-ASCII characters, and it doesn't
do so now. So I think using pg_ascii_toupper() is ok. (Users could, for
example, use unaccent to preprocess text.) One might wonder if the
presence of non-ASCII characters should be an error, but that doesn't
have to be the subject of this thread.

I noticed that the Wikipedia page for Soundex even calls out PostgreSQL
for doing things slightly differently than everyone else, but I haven't
studied the details.

For Metaphone, I found the reference implementation linked from its
Wikipedia page, and it looks like our implementation is pretty closely
aligned to that. That reference implementation also contains the
C-with-cedilla case explicitly.

The correct fix here would probably be to change the implementation to
work on wide characters. But I think for the moment you could try a
shortcut: use pg_ascii_toupper(), but if the encoding is LATIN1 (or
LATIN9 or whichever other encodings also contain C-with-cedilla at that
code point), then explicitly uppercase that one character as well. This
would preserve the existing behavior.

Note that the documentation calls out: "At present, the soundex,
metaphone, dmetaphone, and dmetaphone_alt functions do not work well
with multibyte encodings (such as UTF-8)."
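
To make the proposed shortcut concrete, here is a rough standalone sketch, not the actual fuzzystrmatch code. Every name in it is hypothetical: sketch_ascii_toupper() stands in for pg_ascii_toupper(), and the SketchEncoding enum stands in for the server's real encoding identifiers. It assumes that lowercase c-with-cedilla is byte 0xE7 and uppercase C-with-cedilla is byte 0xC7, which holds for ISO 8859-1 (LATIN1) and ISO 8859-15 (LATIN9); any other encodings would need to be checked individually.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the server's encoding identifiers. */
typedef enum
{
    SKETCH_ENC_UTF8,
    SKETCH_ENC_LATIN1,
    SKETCH_ENC_LATIN9
} SketchEncoding;

/* Stand-in for pg_ascii_toupper(): folds only a-z, independent of locale. */
static unsigned char
sketch_ascii_toupper(unsigned char ch)
{
    if (ch >= 'a' && ch <= 'z')
        ch -= 'a' - 'A';
    return ch;
}

/*
 * Assumption: only these single-byte encodings are treated as having
 * C-with-cedilla at 0xC7 and c-with-cedilla at 0xE7.
 */
static bool
sketch_has_latin_cedilla(SketchEncoding enc)
{
    return enc == SKETCH_ENC_LATIN1 || enc == SKETCH_ENC_LATIN9;
}

/*
 * ASCII-only uppercasing, plus one explicit special case so that the
 * existing "case '\xc7'" branch in Metaphone stays reachable for
 * lowercase input in the encodings listed above.
 */
static unsigned char
sketch_metaphone_toupper(unsigned char ch, SketchEncoding enc)
{
    if (ch == 0xE7 && sketch_has_latin_cedilla(enc))
        return 0xC7;
    return sketch_ascii_toupper(ch);
}
```

With this shape, a 0xE7 byte in a UTF-8 database is left untouched, since there it would be part of a multibyte sequence rather than a character on its own.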