Re: Mac OS: invalid byte sequence for encoding "UTF8"
From | Larry Rosenman |
---|---|
Subject | Re: Mac OS: invalid byte sequence for encoding "UTF8" |
Date | |
Msg-id | d94fdeb7997353bf0ba6906679a89d0c@thebighonker.lerctr.org |
In response to | Re: Mac OS: invalid byte sequence for encoding "UTF8" (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses | Re: Mac OS: invalid byte sequence for encoding "UTF8" |
List | pgsql-hackers |
On 2016-02-10 16:19, Tom Lane wrote:
> I wrote:
>> Artur Zakirov <a.zakirov@postgrespro.ru> writes:
>>> I think this is not a bug. It is a normal behavior. In Mac OS sscanf()
>>> with the %s format reads the string one character at a time. The size of
>>> letter 'х' is 2. And sscanf() separate it into two wrong characters.
>
>> That argument might be convincing if OSX behaved that way for all
>> multibyte characters, but it doesn't seem to be doing that. Why is
>> only 'х' affected?
>
> I looked into the OS X sources, and found that indeed you are right:
> *scanf processes the input a byte at a time, and applies isspace() to
> each byte separately, even when the locale is such that that's a clearly
> insane thing to do. Since this code was derived from FreeBSD, FreeBSD
> has or once had the same issue. (A look at the freebsd project on github
> says it still does, assuming that's the authoritative repo.) Not sure
> about other BSDen.
>
> I also verified that in UTF8-based locales, isspace() thinks that 0x85 and
> 0xA0, and no other high-bit-set values, are spaces. Not sure exactly why
> it thinks that, but that explains why 'х' fails when adjacent code points
> don't.
>
> So apparently the coding rule we have to adopt is "don't use *scanf()
> on data that might contain multibyte characters". (There might be corner
> cases where it'd work all right for conversion specifiers other than %s,
> but probably you might as well just use strtol and friends in such cases.)
> Ugh.
>
> regards, tom lane

Definitive FreeBSD Sources:
https://svnweb.freebsd.org/base/

--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 214-642-9640 E-Mail: ler@lerctr.org
US Mail: 7011 W Parmer Ln, Apt 1115, Austin, TX 78729-6961