Re: Differences in UTF8 between 8.0 and 8.1
From | Paul Lindner |
---|---|
Subject | Re: Differences in UTF8 between 8.0 and 8.1 |
Date | |
Msg-id | 20051027005951.GA27655@inuus.com |
In reply to | Re: Differences in UTF8 between 8.0 and 8.1 (Andrew - Supernews <andrew+nonews@supernews.com>) |
Replies |
Re: Differences in UTF8 between 8.0 and 8.1
Re: Differences in UTF8 between 8.0 and 8.1 |
List | pgsql-hackers |
On Mon, Oct 24, 2005 at 05:07:40AM -0000, Andrew - Supernews wrote:
>
> I'm inclined to suspect that the whole sequence c1 f9 d4 c2 d0 c7 d2 b9
> was never actually a valid utf-8 string, and that the d2 b9 is only valid
> by coincidence (it's a Cyrillic letter from Azerbaijani). I know the 8.0
> utf-8 check was broken, but I didn't realize it was quite so bad.

Looking at the data it appears that it is a sequence of latin1 characters. They all have the eighth bit set and all seem to pass the check. In a million rows I found 2 examples of this.

However I'm running into another problem now. The command:

  iconv -c -f UTF8 -t UTF8

does strip out the invalid characters. However, iconv reads the entire file into memory before it writes out any data. This is not so good for multi-gigabyte dump files and doesn't allow for it to be used in a pipe between pg_dump and psql.

Anyone have any other recommendations? GNU recode might do it, but I'm a bit stymied by the syntax. A quick perl script using Text::Iconv didn't work either.

I'm off to look at some other perl modules and will try to create a script so I can strip out the invalid characters.

--
Paul Lindner ||||| | | | | | | | | | lindner@inuus.com
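
For illustration only, here is a minimal sketch of the kind of streaming filter asked for above, written in Python rather than the perl the post mentions. It is not the script from this thread. It assumes that silently dropping invalid bytes (the same effect as iconv -c) is acceptable, and the names strip_invalid_utf8, main and chunk_size are hypothetical. It reads fixed-size chunks, so memory use stays bounded regardless of dump size, and it relies on the standard library's incremental UTF-8 decoder to carry a multi-byte sequence across a chunk boundary.

```python
import codecs
import sys


def strip_invalid_utf8(src, dst, chunk_size=1 << 16):
    """Copy binary stream src to dst, dropping bytes that are not valid UTF-8."""
    # errors="ignore" discards invalid bytes instead of raising an exception.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="ignore")
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        # The decoder buffers an incomplete trailing sequence until the
        # next chunk arrives, so chunk boundaries do not corrupt characters.
        dst.write(decoder.decode(chunk).encode("utf-8"))
    # final=True flushes the decoder; a truncated sequence at end of input
    # is dropped like any other invalid data.
    dst.write(decoder.decode(b"", final=True).encode("utf-8"))


def main():
    # Use the binary layers of stdin/stdout so no implicit decoding happens.
    strip_invalid_utf8(sys.stdin.buffer, sys.stdout.buffer)
```

Saved as a script that calls main() on its last line, a filter like this could sit in a pipe between pg_dump and psql. Applied to the byte sequence quoted above (c1 f9 d4 c2 d0 c7 d2 b9), it would keep only d2 b9, the one pair that happens to be well-formed UTF-8, so it removes invalid bytes but cannot recover the latin1 text they originally represented.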