Thread: Trying to recover a corrupted database


Trying to recover a corrupted database

From
John Scalia
Date:
Hi all,

You may have seen my post from yesterday about our production database getting corrupted. Well, this morning we brought the system down to single user and ran an fsck, which did report some drive errors. We repeated it until no additional errors were reported. Then we brought the system back to multi-user status and ran a successful pg_basebackup on the broken database. Since then we restarted the database, and a ps -ef result looks like:

/usr/pgsql-9.2/bin/postmaster -D /opt/datacenter -o -c zero_damaged_pages=true -i -N 384 -p 5431

After the DB started up, we ran a VACUUM FULL ANALYZE, which ran for about 3 hours, but the database is still showing the same type of errors in its log: invalid page header in block 29718... etc. What disturbed me a little is that I don't think zero_damaged_pages got applied. Checking the pg_settings view, we got:

select name, setting, boot_val, reset_val from pg_settings where name = 'zero_damaged_pages';
        name        | setting | boot_val | reset_val
--------------------+---------+----------+-----------
 zero_damaged_pages | on      | off      | on

Now, my colleague ran this after he tried running some operations again, after I told him how to set zero_damaged_pages again. He swears that it was on when the first VACUUM FULL ANALYZE was run, but I'm not as sure. Plus, I don't understand why the boot_val shows as off. In any event, as we're still getting log errors like before, I don't really know what to try next other than rerunning the VACUUM FULL again. Help?
--
Jay
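
For what it's worth, boot_val in pg_settings is the compiled-in default (off for zero_damaged_pages), while setting is the value in effect for the session running the query, so the output above only proves the flag is on now, not that it was on while the first VACUUM ran. Since a superuser can set zero_damaged_pages per session, one way to remove any doubt is to set it explicitly right before the vacuum; a minimal sketch (the table name is hypothetical):

-- confirm the value in this session, then force it on before the scan
SHOW zero_damaged_pages;
SET zero_damaged_pages = on;
-- VACUUM FULL rewrites the table and reads every page; damaged pages are
-- zeroed out (their rows are lost) rather than raising an error
VACUUM FULL ANALYZE damaged_table;  -- hypothetical table name

Note that even with the flag applied, the server still logs a warning of the form "invalid page header in block N ...; zeroing out page", so similar-looking log lines do not by themselves mean the setting was ignored.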



Re: Trying to recover a corrupted database

From
Scott Whitney
Date: 7/17/2014 3:16 PM
a) How many databases do you have?
b) Have you tried to pg_dump the database(s) to see if that succeeds?

If you _can_ pg_dump (which you might or might not be able to do), you could re-init the cluster and restore.
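
If the dump does succeed, the cycle is roughly this (a sketch only; the database name mydb and the new data directory are illustrative, and 5431 is the port from the postmaster line in the first message):

pg_dump -p 5431 mydb > mydb.sql            # aborts if it hits an unreadable page
initdb -D /opt/datacenter.new              # fresh cluster beside the damaged one
pg_ctl -D /opt/datacenter.new -o "-p 5432" start
createdb -p 5432 mydb
psql -p 5432 mydb < mydb.sql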



Re: Trying to recover a corrupted database

From
John Scalia
Date:
We're only working with a single database on this system, and yes, I attempted a pg_dump earlier today on just one table - it failed. (I know pg_dump and pg_basebackup use different mechanisms.) Mind you, it's a large table with 154,000 rows in it. The overall database is somewhere around 43GB.
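
If the damage really is confined to particular tables, pg_dump's exclusion switch can at least salvage everything else first, leaving the broken table to be fought separately (database and table names are hypothetical):

pg_dump -p 5431 --exclude-table=big_broken_table mydb > rest_of_db.sql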



Re: Trying to recover a corrupted database

From
John Scalia
Date:
Well, we already have the schema in a canned file, so we wouldn't need a pg_dump for the schema, and I can obviously get into the database using psql - that's how I produced the setting from pg_settings. But I really don't want to have to go through 150K+ records to find the multiple broken ones if I don't have to. I'm just hoping for a faster way.

On 7/17/2014 3:30 PM, Scott Whitney wrote:
Well, at this point, let me let the big brains answer your main question on the thread and tell you what _I_ did when I got into a similar situation.

a) pg_dump --schema-only <database name> > db.structure
b) From here it was a lot of SELECT * FROM each table until I found the broken one(s), and I worked around the actual bad data with OFFSET/LIMIT on the selects against the busted tables.

It took me several hours, but it did work (for me).

That's assuming you can even psql to the db in the first place when the postmaster is running.
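
A sketch of that bisection against a table in the 154,000-row range (table name hypothetical; note that without an ORDER BY the scan order is only approximately stable, so this is strictly best-effort):

-- walk the table in fixed-size slices and note which slices error out
SELECT * FROM big_table OFFSET 0     LIMIT 10000;
SELECT * FROM big_table OFFSET 10000 LIMIT 10000;
-- ...then halve any failing window until the bad rows are bracketed
SELECT * FROM big_table OFFSET 10000 LIMIT 5000;
SELECT * FROM big_table OFFSET 15000 LIMIT 5000;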



Re: Trying to recover a corrupted database

From
John Scalia
Date:
I was attempting to do that with one table that I know is broken, but I can't get anything from it in a select. All I get, no matter what, is:

ERROR: could not access status of transaction 3706060803
DETAIL: Could not open file "pg_clog/0DCE": No such file or directory

I think the pg_clog files are transaction journal files that tell the DB whether a record is or is not in the table, but I'm not certain. I just wish there were a pg_resetclog executable like there is for the pg_xlog segments.
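
That's close: pg_clog holds two commit-status bits per transaction ID - four transactions per byte, 32,768 per 8 kB page, and 1,048,576 per 32-page segment file, with the segment number in hex as the file name. That geometry maps transaction 3706060803 to segment file 0DCE; a quick check in SQL:

-- which clog segment and page offset hold the status of xid 3706060803?
SELECT lpad(upper(to_hex(3706060803::bigint / 1048576)), 4, '0') AS segment,
       (3706060803::bigint / 32768) % 32 * 8192 AS page_offset;
--  segment | page_offset
-- ---------+-------------
--  0DCE    |       98304

And indeed there is no pg_resetclog; pg_resetxlog covers only the WAL.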

On 7/17/2014 3:58 PM, Scott Whitney wrote:
Yeah. I know the feeling, man. What I ended up doing was "select * from" until I found the broken one then jumped down into groups of thousands of records and basically "Newtoned" it down until I found the affected data.

It sucked. A lot.



Re: Trying to recover a corrupted database

From
John Scalia
Date:
Well, nothing to lose, so I did a touch on pg_clog/0DCE, and at least the error changed. Now it says:

DETAIL: Could not read from file "pg_clog/0DCE" at offset 98304: Success.

But I don't understand why it says "Success"?

On 7/17/2014 4:09 PM, Scott Whitney wrote:
OOhhh...IIRC I think that I created the clog file as a zero byte file to get around that. It was _absolutely_ a step of last resort. I would be certain that you have a backup that can get you back to this point at least before proceeding.
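
The "Success" in the detail message is errno leaking through: with a zero-byte segment file, the read() for that page hits end-of-file and returns 0 bytes without setting an error, so the %m placeholder in the message prints "Success". Padding the segment out to its full 256 kB along the lines Scott describes would look roughly like this (a last-resort sketch; take a filesystem-level backup first, and the path assumes the data directory from the postmaster line above):

cd /opt/datacenter/pg_clog
# zero bytes mark every transaction in this segment "in progress", so their
# rows stay invisible; some recovery write-ups fill with 0x55 bytes instead
# to mark them all committed
dd if=/dev/zero of=0DCE bs=256k count=1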


