Re: BUG #5532: Valid UTF8 sequence errors as invalid
From | Tom Lane |
---|---|
Subject | Re: BUG #5532: Valid UTF8 sequence errors as invalid |
Date | |
Msg-id | 14170.1277922093@sss.pgh.pa.us |
In response to | Re: BUG #5532: Valid UTF8 sequence errors as invalid (Mike Lewis <mikelikespie@gmail.com>) |
Responses | Re: BUG #5532: Valid UTF8 sequence errors as invalid |
List | pgsql-bugs |
Mike Lewis <mikelikespie@gmail.com> writes:
> I've run into a fair amount of unicode errors when trying to copy in log
> files. Would you recommend using bytea or another data type instead of text
> or varchar... or at least copying to a staging table with bytea's and
> filtering out invalid rows when moving it to the main table?

My guess is that you're working with data that was originally represented in UTF16, and you've used a tool that doesn't really know what it's doing to convert to UTF8. A correct conversion has to reunite surrogate pairs into wider-than-16-bit Unicode characters and then encode those as single UTF8 sequences. Dunno if you can easily identify the culprit, but fixing that conversion is the long-term solution.

(BTW, I should think that iconv or some related tool would have a solution for fixing this miscoding; it's not an uncommon problem.)

regards, tom lane
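For illustration only, here is a minimal Python sketch of the kind of repair described above. It assumes the bad input really is UTF8-like data in which each half of a UTF16 surrogate pair was encoded as its own 3-byte sequence; that is a guess about the data, not something confirmed in this thread. The function name `repair_split_surrogates` is hypothetical, and this is not what iconv does internally.

```python
def repair_split_surrogates(raw: bytes) -> bytes:
    """Re-encode UTF8-like bytes whose UTF16 surrogates were encoded one by one.

    Assumption: the input is otherwise well-formed UTF8 and every surrogate
    is part of a proper high/low pair.
    """
    # "surrogatepass" lets the decoder accept the 3-byte encodings of
    # U+D800..U+DFFF, which a strict UTF8 decoder rejects.
    text = raw.decode("utf-8", errors="surrogatepass")

    # Writing the string out as UTF16 turns each surrogate back into a
    # plain 16-bit code unit, so the two halves of a pair sit side by side.
    utf16 = text.encode("utf-16-le", errors="surrogatepass")

    # A strict UTF16 decode reunites each pair into one code point; an
    # unpaired surrogate raises UnicodeDecodeError instead of passing through.
    return utf16.decode("utf-16-le").encode("utf-8")


if __name__ == "__main__":
    # U+1F600 miscoded as two 3-byte sequences (surrogates D83D and DE00).
    broken = bytes.fromhex("eda0bdedb880")
    print(repair_split_surrogates(broken).hex())
```

Input that is already valid UTF8 passes through unchanged, so a function like this could be applied to whole log lines before COPY. Rows that still raise an error would need separate handling, for example the bytea staging table suggested in the question.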