Re: Initial ugly reverse-translator
From | Craig Ringer |
---|---|
Subject | Re: Initial ugly reverse-translator |
Date | |
Msg-id | 480A36F3.5050907@postnewspapers.com.au |
In reply to | Re: Initial ugly reverse-translator (Tom Lane <tgl@sss.pgh.pa.us>) |
List | pgsql-general |
Tom Lane wrote:
>> True. It's not so much the speed as the fragility when faced with small
>> changes to formatting. In addition to whitespace, some clients mangle
>> punctuation with features like automatic "curly"-quoting.
>
> Yeah. I was wondering whether encoding differences wouldn't be a huge
> problem in practice, as well.

I'm not *too* worried about text encoding issues. In general it's very obvious when text has been mangled due to bad encoding handling, and it's extremely rare to see anything subtle like an app that transforms accented chars to their base variants. Demangling strings damaged by bad encoding handling is way out of scope, and sometimes not possible anyway.

I guess that Unicode's delightful support for various composed and decomposed forms of the same glyph might be a problem. It's something I may face in some other work I'm doing too, so I might have to see how hard it'd be to put together a DB function that normalizes a UTF-8 string to its fully composed form (NFC). I don't think the decomposed forms see much use in the wild though; they mostly come up as a security issue for path/URL matching and the like.

http://unicode.org/reports/tr15/
http://msdn2.microsoft.com/en-us/library/ms776393(VS.85).aspx
http://earthlingsoft.net/ssp/blog/2006/07/unicode_normalisation

I don't know much about the CJK text representations, though, either in Unicode or in other encodings like Big5. I *hope* the Unicode normalization rules will be enough there but I'm not sure.

All strings must be converted from their original encoding to UTF-8 for queries, of course. That might be troublesome when using something like a web form, where it might be hard to know the encoding of the input text (and where browser bugs are the rule rather than the exception), but it's thankfully not necessary to cater to every weird and broken browser.
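As a rough illustration of the conversion and normalization steps described above, here is a minimal sketch in Python rather than as a DB function. The helper name `to_normalized_utf8` is hypothetical, and the sketch assumes the source encoding of the input is already known, which is exactly the part that can be hard with web forms.

```python
import unicodedata


def to_normalized_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known encoding and return NFC-normalized UTF-8.

    Hypothetical helper for illustration only; not part of any library.
    """
    # Strict decoding raises UnicodeDecodeError on bytes that are invalid
    # in the stated encoding, instead of silently substituting characters.
    text = raw.decode(source_encoding, errors="strict")
    # NFC composes a base character plus combining marks into the
    # precomposed code point wherever Unicode defines one.
    composed = unicodedata.normalize("NFC", text)
    return composed.encode("utf-8")


# Two spellings of the same visible string.
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
precomposed = "caf\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE

# The raw strings compare unequal, so a naive lookup would miss the match.
print(decomposed == precomposed)

# After normalization both yield identical UTF-8 bytes.
a = to_normalized_utf8(decomposed.encode("utf-8"), "utf-8")
b = to_normalized_utf8(precomposed.encode("utf-8"), "utf-8")
print(a == b)
```

Normalizing both the stored message strings and the incoming query text the same way would make the lookup insensitive to composed versus decomposed input. It does nothing for text that was already mangled by a wrong encoding guess, which stays out of scope.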
So in this case I don't think encodings will be *too* much trouble unless alternate Unicode normalization forms turn out to be more common than I think they are.

--
Craig Ringer