Re: finding bogus UTF-8
От | Geoffrey Myers |
---|---|
Тема | Re: finding bogus UTF-8 |
Дата | |
Msg-id | 4D5AA54E.9080104@serioustechnology.com обсуждение исходный текст |
Ответ на | Re: finding bogus UTF-8 (Glenn Maynard <glenn@zewt.org>) |
Ответы |
Re: finding bogus UTF-8
|
Список | pgsql-general |
Glenn Maynard wrote: > On Thu, Feb 10, 2011 at 2:02 PM, Scott Ribe <scott_ribe@elevated-dev.com > <mailto:scott_ribe@elevated-dev.com>> wrote: > > I know that I have at least one instance of a varchar that is not > valid UTF-8, imported from a source with errors (AMA CPT files, > actually) before PG's checking was as stringent as it is today. Can > anybody suggest a query to find such values? > > > I hit this problem too, if I remember correctly when trying to upgrade a > database from 8.3 to 8.4. I ended up aborting the upgrade, since the > upgrade documentation made no mention of this and I didn't have time to > dig into it at the time. A tool to find all instances of this would be > very helpful. I'm about to pipe the ascii output of a database dump through a perl script that removes any unwanted characters. To help define what 'unwanted characters' are, compare the ord() values to decimal values at http://www.asciitable.com/ while (<>) { $_ =~ s/(.)/((ord($1) >= 0) && (ord($1) <= 8)) || ((ord($1) >= 11) && (ord($1) <= 31)) || ((ord($1) >= 127)) ?"": $1/egs; print; } comments would be appreciated. -- Until later, Geoffrey "I predict future happiness for America if they can prevent the government from wasting the labors of the people under the pretense of taking care of them." - Thomas Jefferson
В списке pgsql-general по дате отправления: