Re: UTF-8 encoding problem w/ libpq
From | Heikki Linnakangas |
---|---|
Subject | Re: UTF-8 encoding problem w/ libpq |
Date | |
Msg-id | 51ACE361.2050006@vmware.com |
In response to | Re: UTF-8 encoding problem w/ libpq (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses | Re: UTF-8 encoding problem w/ libpq |
List | pgsql-hackers |

On 03.06.2013 21:28, Tom Lane wrote:
> Heikki Linnakangas <hlinnakangas@vmware.com> writes:
>> He *is* using UTF-8. Or trying to, anyway :-). The downcasing in the
>> backend is supposed to leave bytes with the high-bit set alone, ie. in
>> UTF-8 encoding, it's supposed to leave ä and ß alone.
>
> Well, actually, downcase_truncate_identifier() is doing this:
>
>     unsigned char ch = (unsigned char) ident[i];
>
>     if (ch >= 'A' && ch <= 'Z')
>         ch += 'a' - 'A';
>     else if (IS_HIGHBIT_SET(ch) && isupper(ch))
>         ch = tolower(ch);
>
> There's basically no way that that second case can give pleasant results
> in a multibyte encoding, other than by not doing anything.

Hmph, I see.

> I suspect
> that Windows' libc has fewer defenses than other implementations and
> performs some transformation that we don't get elsewhere. This may also
> explain the gripe yesterday in -general about funny results in OS X.

Can't really blame Windows on that. On Windows, we don't require that
the encoding and LC_CTYPE's charset match. The OP used UTF-8 encoding in
the server, but LC_CTYPE="English_United Kingdom.1252", ie. LC_CTYPE
implies WIN1252 encoding. We allow that and it generally works on
Windows because in varstr_cmp, we use MultiByteToWideChar() followed by
wcscoll_l(), which doesn't care about the charset implied by LC_CTYPE.
But for isupper(), it matters.

> We talked about this before and went off into the weeds about whether
> it was sensible to try to use towlower() and whether that wouldn't
> create undesirably platform-sensitive results. I wonder though if we
> couldn't just fix this code to not do anything to high-bit-set bytes
> in multibyte encodings.

Yeah, we should do that. It makes no sense to call isupper or tolower on
bytes belonging to multi-byte characters.

- Heikki
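
For illustration, the fix agreed on above (leave high-bit-set bytes alone when the encoding is multibyte) might look roughly like the sketch below. This is not the backend code: `downcase_identifier_sketch` and its `encoding_is_single_byte` parameter are hypothetical names made up for this example, truncation is left out, and `IS_HIGHBIT_SET` is defined locally as a plain 0x80 test so the snippet compiles on its own.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Local stand-in for the macro used in the quoted code: tests bit 0x80. */
#define IS_HIGHBIT_SET(ch) ((unsigned char) (ch) & 0x80)

/*
 * Hypothetical helper, not the real downcase_truncate_identifier().
 *
 * Copies len bytes of ident into out, downcasing as it goes, and
 * NUL-terminates the result; out must have room for len + 1 bytes.
 *
 * encoding_is_single_byte is an assumed flag supplied by the caller.
 * When it is false, bytes with the high bit set are copied unchanged,
 * because they are fragments of multibyte characters and isupper() /
 * tolower() cannot interpret them meaningfully.
 */
static void
downcase_identifier_sketch(const char *ident, size_t len,
                           bool encoding_is_single_byte, char *out)
{
    size_t i;

    for (i = 0; i < len; i++)
    {
        unsigned char ch = (unsigned char) ident[i];

        if (ch >= 'A' && ch <= 'Z')
            ch += 'a' - 'A';        /* plain ASCII: locale-independent */
        else if (encoding_is_single_byte &&
                 IS_HIGHBIT_SET(ch) && isupper(ch))
            ch = tolower(ch);       /* only safe in single-byte encodings */

        out[i] = (char) ch;
    }
    out[len] = '\0';
}
```

With the flag false, the two bytes of a UTF-8 "Ä" (0xC3 0x84) pass through untouched whatever LC_CTYPE says, which avoids the mismatch described above where the server encoding is UTF-8 but the locale implies WIN1252.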