Re: error while trying to change the database encoding on a database
From | Adrian Klaver |
---|---|
Subject | Re: error while trying to change the database encoding on a database |
Date | |
Msg-id | 4D3DE4CA.3080706@gmail.com |
In reply to | Re: error while trying to change the database encoding on a database (Geoffrey Myers <lists@serioustechnology.com>) |
List | pgsql-general |
On 01/24/2011 10:57 AM, Geoffrey Myers wrote:
> Adrian Klaver wrote:
>> On 01/24/2011 09:16 AM, Geoffrey Myers wrote:
>>
>>>
>>> We hope to identify the characters and fix them in the existing
>>> database, then convert. It appears to be very limited, but it would help
>>> if there was some way to identify these characters outside of simply
>>> doing the reload of the data and finding the errors.
>>>
>>> Hence the reason I asked about a resource that might identify the
>>> characters.
>>
>> The problem is that from the standpoint of the SQL_ASCII database
>> there is nothing wrong with the characters per se. AFAIK there is no
>> built in function to validate characters. The reason is that valid is
>> determined by the encoding and if you know the encoding then you
>> really don't need to determine validity. If you want to see one way
>> others have tackled this, search on iconv in the mailing list archive.
>> This requires working on an external copy of the data and knowing
>> something about the encodings involved. The nearest I could ever find
>> to an encoding detector is:
>>
>> http://chardet.feedparser.org/
>>
>> It is a Python program and the encodings it detects are limited but it
>> might work for you.
>>
>> Given all the above, when I was faced with the problem you are facing
>> I found it easiest to make an educated guess as to the original
>> encoding and then do test restores with client_encoding set to my guess.
>
> Understood. We had figured the problem to be small, and it appears it is
> and thus felt we could address it a character at a time. Then get this
> error:
>
> pg_restore: [archiver (db)] Error from TOC entry 5258; 0 17549 TABLE DATA fax postgres
> pg_restore: [archiver (db)] COPY failed: ERROR: invalid byte sequence for encoding "UTF8": 0xe28053
>
> That hex value doesn't translate to a single character. I've dumped the
> data to a file as you suggested, but reviewing the identified line
> brings no joy.
>

The only thing I can think of is to use iconv like:

iconv -c -t utf8 -f utf8 -o converted_txt.txt 'original.txt'

where original.txt is your plain text data dump. The -c switch causes
iconv to omit any invalid characters from the output. You could then run
a diff of converted_txt.txt against original.txt to see which characters
in the original text are causing the problem.

-- 
Adrian Klaver
adrian.klaver@gmail.com
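
If the diff is hard to read on a large dump, the same check can be done by scanning the dump for byte sequences that do not decode as UTF-8. The following is only a rough sketch of that idea, not something from the thread: the names find_invalid_utf8 and report are made up for illustration, and it assumes the dump is a plain text file such as original.txt.

```python
def find_invalid_utf8(lines):
    """Yield (line_number, byte_offset, hex_of_bad_bytes) for each
    byte sequence that is not valid UTF-8.

    `lines` is any iterable of bytes objects, for example a file
    opened in binary mode. Line numbers start at 1 and offsets are
    byte offsets within the line.
    """
    for line_no, line in enumerate(lines, start=1):
        pos = 0
        while pos < len(line):
            try:
                # Decode the rest of the line; success means no more errors.
                line[pos:].decode("utf-8")
                break
            except UnicodeDecodeError as err:
                # err.start and err.end are relative to the slice we decoded.
                start = pos + err.start
                end = pos + err.end
                yield line_no, start, line[start:end].hex()
                # Resume scanning right after the offending bytes.
                pos = end


def report(path):
    """Print every invalid sequence found in the file at `path`."""
    with open(path, "rb") as dump:
        for line_no, offset, bad in find_invalid_utf8(dump):
            print(f"line {line_no}, byte offset {offset}: 0x{bad}")
```

Calling report("original.txt") would list the line and byte offset of each offending sequence, which can then be compared with the row named in the pg_restore error. It only tells you where the invalid bytes are, not what encoding they were originally written in.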