A thought about regex versus multibyte character sets
From | Tom Lane |
---|---|
Subject | A thought about regex versus multibyte character sets |
Date | |
Msg-id | 7341.1259604906@sss.pgh.pa.us |
Responses |
Re: A thought about regex versus multibyte character sets
Re: A thought about regex versus multibyte character sets |
List | pgsql-hackers |
We've had many complaints about the fact that the regex functions are not bright about locale-dependent operations in multibyte character sets, especially case-insensitive matching. The reason for this, as was discussed in this thread http://archives.postgresql.org/pgsql-hackers/2008-12/msg00433.php is that we'd need to use the <wctype.h> functions, but those expect the platform's wchar_t representation, whereas the regex stuff works on pg_wchar_t which might have a different character set mapping.

I just spent a bit of time considering what we might do to fix this. The idea mentioned in the above thread was to switch over to using wchar_t in the regex code, but that seems to have a number of problems. One showstopper is that on some platforms wchar_t is only 16 bits and can't represent the full range of Unicode characters. I don't want to fix case-folding only to break regexes for other uses.

However, it strikes me that we might be overstating the size of the mismatch between wchar_t and pg_wchar_t representations. In particular, for Unicode-based locales it seems virtually certain that every platform would use Unicode code points for the wchar_t representation, and that is also our representation in pg_wchar_t.

I therefore propose the following idea: if the database encoding is UTF8, allow the regc_locale.c functions to call the <wctype.h> functions, assuming that wchar_t and pg_wchar_t share the same representation. On platforms where wchar_t is only 16 bits, we can do this up to U+FFFF and be stupid about code points above that.

I think this will solve at least 99% of the problem for a fairly small amount of work. It does not do anything for non-UTF8 multibyte encodings, but so far as I can see the only such encodings are Far Eastern ones, in which the present ASCII-only behavior is probably good enough --- concepts like case don't apply to their non-ASCII characters anyhow. (Well, there's also MULE_INTERNAL, but I don't believe anyone runs their DB in that.)
However, not being a native user of any non-ASCII character set, I might be missing something big here. Comments?

regards, tom lane
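
As an illustration of the proposal in the message above, here is a minimal sketch of what a case-folding helper might look like. It is not the actual regc_locale.c code: the type `sketch_pg_wchar`, the functions `sketch_wchar_can_hold` and `sketch_tolower`, and the `is_utf8` parameter are all hypothetical names invented for this example. It assumes, as the message does, that in a UTF8 database the wide-character value is a Unicode code point that can be handed to `towlower()` directly.

```c
#include <assert.h>
#include <stdint.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical stand-in for the regex code's wide-character type. */
typedef uint32_t sketch_pg_wchar;

/*
 * Can the platform's wchar_t hold this code point?  Where wchar_t is
 * only 16 bits, WCHAR_MAX is 0xFFFF, so only U+0000..U+FFFF qualify.
 */
static int
sketch_wchar_can_hold(sketch_pg_wchar c)
{
    return c <= (sketch_pg_wchar) WCHAR_MAX;
}

/*
 * Lower-case one character.  is_utf8 is a hypothetical flag standing in
 * for "the database encoding is UTF8".
 */
static sketch_pg_wchar
sketch_tolower(sketch_pg_wchar c, int is_utf8)
{
    /* Sketch-only sanity check: callers pass valid Unicode code points. */
    assert(c <= 0x10FFFF);

    /* ASCII is handled directly, independent of locale. */
    if (c < 0x80)
        return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;

    /*
     * UTF8 database: assume wchar_t and the regex wide-character type
     * share the code-point representation and defer to <wctype.h>,
     * which consults the current LC_CTYPE locale.
     */
    if (is_utf8 && sketch_wchar_can_hold(c))
        return (sketch_pg_wchar) towlower((wint_t) c);

    /* Other encodings, or code points wchar_t cannot hold: unchanged. */
    return c;
}
```

The result for non-ASCII input depends on the process's LC_CTYPE setting, so a caller would need the right locale in place before relying on it. Characters outside the range of a 16-bit wchar_t simply pass through unchanged, which corresponds to the "be stupid about code points above that" fallback.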