Обсуждение: [Fwd: Re: Question about pgfsck]
Is there a FAQ or a HOWTO on recovering from database corruption? If not, then I am willing to write it. I will need help getting the document together because as much as I google, I come up empty handed. What I need is something that tells you: A> You get errors in your server log like: PANIC: open of /psql/pg_clog/0C95 failed: No such file or directory - or - PANIC: ZeroFill failed to write /psql/pg_xlog/xlogtemp.27517: No space left on device :) - or - LOG: server process (pid 48146) was terminated by signal 6 - or - LOG: ReadRecord: record with zero length at 39D/A53F7800 - or - LOG: ReadRecord: unexpected pageaddr 39A/78D4000 in log file 922, segment 15, offset 9256960 - or - ERROR: Invalid page header in block 2879878 of channeldata etc. Which are serious? What needs to be done? B> How to find all corrupt tuples, whether they are in indexes or the tables themselves. C> How to eliminate the corrupt tuples. Delete them of course. But how when the server errors on deleting them?? pgfsck looks like an awesome tool. It allows you to search areas of your database that are rarely used. So you don't get a 1AM phone call from the sales guy needing some report for tomorrow at 7AM and he can't get it. I want to have something that can be easily digested and doesn't require advanced knowledge of postgres. What does the beginning PostgreSQL admin do? A database disaster guide. Thanks Dan Hrabarchuk -----Forwarded Message----- > From: Martijn van Oosterhout <kleptog@svana.org> > To: Dan Hrabarchuk <dan@kwasar.biz> > Subject: Re: Question about pgfsck > Date: Thu, 07 Oct 2004 10:38:33 +0200 > > The problem is obviously with the structure of the tuples. So you need > to identify it by something then does not require parsing the tuples. > Hence the suggestion of ctid or oid or something else fixed. > > Good luck, > > On Wed, Oct 06, 2004 at 02:57:31PM -0700, Dan Hrabarchuk wrote: > > > > > > 3. Finding the file with the error. > > > > > > And the tablename, ergo, filename > > > > > > > 4. Finally fixing the error. > > > > > > Delete the record is the usual way. > > > > Okay the following query: > > SELECT * FROM channeldata where cd_id=6268 and tstamp<???2004-09-20??? and > > tstamp>???2004-09-15???; > > > > Will cause a signal 6 and have the database backend restart. > > > > If I change it to: DELETE FROM channeldata where cd_id=6268 and > > tstamp<???2004-09-20??? and tstamp>???2004-09-15???; > > > > I get an error and the transaction does not commit. How do you delete > > the records? Or is this not the usual way. > > > > I can recover the data after it's deleted. I just need to delete it. > > > > Thanks again. > > > > Dan > >
Dan Hrabarchuk <dan@kwasar.biz> writes: > Is there a FAQ or a HOWTO on recovering from database corruption? If > not, then I am willing to write it. I will need help getting the > document together because as much as I google, I come up empty handed. > ... > I want to have something that can be easily digested and doesn't require > advanced knowledge of postgres. I think this is an impossible goal. The facts of the matter are that it generally *isn't* easy to get out of a corrupt-data scenario, and it certainly isn't something that can be reduced to a cookbook recipe that a novice can follow. If we had bugs with effects as predictable as that, we would have found them and fixed them long ago. I think every corruption situation is unique and has to be approached as a fresh problem. Part of that feeling stems from the fact that you usually want to try to identify *why* the corruption occurred, not just fix its immediate effects, and that almost always requires examining the entrails at a pretty low level of detail. If you don't find out why it happened then you have no confidence that it won't happen again. Don't get me wrong --- I think a HOWTO on this is a fine idea. I'm just telling you that it's going to have to be a fairly detailed document that is going to teach the reader quite a bit about postgres internals. You can collect a lot of material for it by trawling the PG mailing list archives for old threads in which people were having corruption problems. Aside from the obvious of "corruption", try looking for threads mentioning REINDEX, pg_resetxlog, pg_filedump, locating tuples by ctid, the zero_damaged_pages parameter, memtest86, and badblocks, as those are the common tools for this sort of activity. You can search the archives either directly at http://archives.postgresql.org/ or via Oleg and Teodor's indexer at http://www.pgsql.ru/db/pgsearch/. I tend to use both as they seem to have different searching behavior. Google has a good indexer but a rather incomplete set of PG list archives, so that's usually my last choice... regards, tom lane