Re: proposal: UTF8 to_ascii function
From | Jan Urbański |
---|---|
Subject | Re: proposal: UTF8 to_ascii function |
Date | |
Msg-id | 48A048FB.5030805@students.mimuw.edu.pl |
In response to | Re: proposal: UTF8 to_ascii function (Andrew Dunstan <andrew@dunslane.net>) |
List | pgsql-hackers |
Andrew Dunstan wrote:
>
>
> Jan Urbański wrote:
>> Andrew Dunstan wrote:
>>>
>>>
>>> Pavel Stehule wrote:
>>> What you have not said is how you propose to convert UTF8 to ASCII.
>>>
>>> Currently to_ascii() converts a small number of single byte charsets
>>> to ASCII by folding the chars with high bits set, so what we get is a
>>> pure ASCII result which is safe in any server encoding, as they are
>>> all ASCII supersets.
>>>
>>> But what conversion rule will you use for the gazillions of Unicode
>>> characters?
>>>
>>> I honestly do not understand the use case for this at all.
>>
>> I do. Often clients want their searches to be
>> accented-or-language-specific letters insensitive. So searching for
>> 'łódź' returns 'lodz'. So the use case is there (in fact, the lack of
>> such facility made me consider not upgrading particular client to
>> 8.3...).
>> Or maybe there's a better way to do it?
>
> Well, my first question would be "Why aren't you using a database
> encoding that supports to_ascii()?"

Because I want UTF-8 in it ;)
It's mostly LATIN2, but clients sometimes input Cyrillic, Greek or Hebrew letters, and sometimes use Unicode characters like (U+2026) HORIZONTAL ELLIPSIS.

I'd like to have

    to_ascii(text, [error_handling]) returns text

So no bytea: to_ascii would accept text that's legal in my current database encoding and return text in that encoding. And error_handling would be something like:

- 'error' (the default, throw an error if a character is untranslatable to ASCII)
- 'ignore' (omit untranslatable characters)
- 'transliterate' (do your best to transliterate the character, or leave it as it is if impossible).
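The three modes listed above can be sketched outside the database to make their behaviour concrete. The following is a minimal illustration in Python, not PostgreSQL code and not anything proposed in this thread: it assumes Unicode NFKD decomposition as the folding rule, and the `FALLBACK` table is a hypothetical helper for letters such as 'ł' that have no decomposition. The folding is lossy and covers far fewer characters than a real transliteration table would.

```python
import unicodedata

# Hypothetical fallback table (an assumption of this sketch): letters such as
# U+0142 have no Unicode decomposition, so normalization cannot fold them.
FALLBACK = {"ł": "l", "Ł": "L"}

MODES = ("error", "ignore", "transliterate")


def to_ascii(text, error_handling="error"):
    """Illustrative stand-in for the proposed to_ascii(text, [error_handling])."""
    if error_handling not in MODES:
        raise ValueError(f"unknown error_handling: {error_handling!r}")
    out = []
    for ch in text:
        if ord(ch) < 128:
            # Already ASCII: copy through unchanged.
            out.append(ch)
            continue
        # NFKD splits e.g. 'ó' into 'o' plus a combining acute accent;
        # keeping only the ASCII code points drops the accent.
        decomposed = unicodedata.normalize("NFKD", FALLBACK.get(ch, ch))
        folded = "".join(c for c in decomposed if ord(c) < 128)
        if folded:
            out.append(folded)
        elif error_handling == "error":
            raise ValueError(f"cannot translate {ch!r} to ASCII")
        elif error_handling == "transliterate":
            # No ASCII equivalent known: leave the character as it is.
            out.append(ch)
        # 'ignore' falls through and omits the character.
    return "".join(out)
```

With this sketch, `to_ascii('łódź')` gives `'lodz'`, while the Chinese characters in `'china is written 中國'` raise an error by default, are dropped under `'ignore'`, and pass through unchanged under `'transliterate'`.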
Examples would include (assuming UTF-8 database)

to_ascii('łódź') -> 'lodz'
to_ascii('china is written 中國') -> ERROR
to_ascii('china is written 中國', 'ignore') -> 'china is written '
to_ascii('china is written 中國', 'transliterate') -> 'china is written zhong guo' (in an ideal world)
to_ascii('china is written 中國', 'transliterate') -> 'china is written 中國' (in reality)

These would have the property that:

to_ascii(X, 'ignore') is always pure ASCII data and never throws an error
to_ascii(X, 'transliterate') is sometimes non-ASCII data and never throws an error
to_ascii(X) is always pure ASCII data and sometimes throws an error

It's something like PHP's iconv that can have //TRANSLIT or somesuch (forgive me for giving PHP as an example...).

Now I'd love to hear people punch holes in my daydreaming design ;)

Cheers,
Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin