Re: [BUGS] COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence
From | Tom Lane |
---|---|
Subject | Re: [BUGS] COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence |
Date | |
Msg-id | 28944.1282256681@sss.pgh.pa.us |
In reply to | Re: [BUGS] COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: [BUGS] COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence
Re: [BUGS] COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence |
List | pgsql-hackers |
Steven Schlansker <steven@trumpet.io> writes:
> On Aug 19, 2010, at 2:35 PM, Tom Lane wrote:
>> I was able to reproduce this on my own Mac.  Some tracing shows that the
>> problem is that isspace(0x85) returns true when in locale en_US.utf-8.
>> This causes array_in to drop the final byte of the array element string,
>> thinking that it's insignificant whitespace.

> The 0x85 seems to be the second byte of a multibyte UTF-8
> sequence.

Check.

> I'm not at all experienced with character encodings so I could
> be totally off base, but isn't it wrong to ever call isspace(0x85),
> whatever the result may be, given that the actual character is 0xCF85?
> (U+03C5, GREEK SMALL LETTER UPSILON)

We generally assume that in server-safe encodings, the ctype.h functions
will behave sanely on any single-byte value.  You can argue the wisdom
of that, but deciding to change that policy would be a rather massive
code change; I'm not excited about going that direction.

>> I believe that you must
>> not have produced the data file data.copy on a Mac, or at least not in
>> that locale setting, because array_out should have double-quoted the
>> array element given that behavior of isspace().

> Correct, it was produced on a Linux machine.  That said, the charset
> there was also UTF-8.

Right ... but you had an isspace function that meets our expectations.

> I actually can't reproduce that behavior here:

You need a setlocale() call, else the program acts as though it's in C
locale regardless of environment.  My test case looks like this:

    $ cat isspace.c
    #include <stdio.h>
    #include <ctype.h>
    #include <locale.h>

    int main()
    {
        int c;

        setlocale(LC_ALL, "");

        for (c = 1; c < 256; c++)
        {
            if (isspace(c))
                printf("%3o is space\n", c);
        }

        return 0;
    }
    $ gcc -O -Wall isspace.c
    $ LANG=C ./a.out
     11 is space
     12 is space
     13 is space
     14 is space
     15 is space
     40 is space
    $ LANG=en_US.utf-8 ./a.out
     11 is space
     12 is space
     13 is space
     14 is space
     15 is space
     40 is space
    205 is space
    240 is space
    $

			regards, tom lane
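
To make the failure mode concrete: octal 205 in the output above is 0x85, the second byte of the two-byte sequence 0xCF 0x85, so any code that trims trailing bytes using the locale-dependent isspace() will cut that character in half on such a platform. The following is only a minimal sketch of a trim that tests against the six ASCII whitespace bytes instead. The names ascii_isspace and trimmed_length are hypothetical, and this is not presented as the code in array_in or as the fix chosen for PostgreSQL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: true only for the six ASCII whitespace bytes.
 * Unlike isspace(), the answer does not depend on the current locale,
 * so bytes >= 0x80 (such as 0x85 or 0xA0) are never reported as space.
 */
static bool
ascii_isspace(unsigned char ch)
{
    return ch == ' ' || ch == '\t' || ch == '\n' ||
           ch == '\r' || ch == '\v' || ch == '\f';
}

/*
 * Hypothetical helper: return the length of s[0..len) after dropping
 * trailing whitespace bytes.  Because only ASCII bytes can match, the
 * trailing byte of a multibyte UTF-8 sequence is never stripped.
 */
static size_t
trimmed_length(const char *s, size_t len)
{
    assert(s != NULL);

    while (len > 0 && ascii_isspace((unsigned char) s[len - 1]))
        len--;

    return len;
}
```

Called as trimmed_length("\xCF\x85", 2), this sketch keeps both bytes, whereas a loop using isspace() under a locale that classifies 0x85 as whitespace would stop at length 1.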