Discussion: Data corruption
Hello,
we have a database that apparently got corrupted in a crash a few months ago, long enough that there is no "good" backup.
There are at least two tables occasionally emitting errors like this:
ERROR: invalid page header in block 39684 of relation base/12926/32397
while performing SQL query: insert into foo (a, b, c) values (?, ?, ?) | PARAMETERS: foo, bar, baz
org.postgresql.util.PSQLException: ERROR: invalid page header in block 39684 of relation base/12926/32397
The files in question (e.g. 32397) correspond to the table itself, not an index. The larger of these tables has two files (32397 & 32397.1), almost 2G total.
To make things worse, vacuum ends with:
ERROR: could not access status of transaction 2828785257
DETAIL: Could not open file "pg_clog/0A89": No such file or directory.
That file is nowhere to be found.
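As an aside, the segment name in that error follows arithmetically from the transaction id. A small sketch (assuming the default build: two status bits per transaction, 8 kB clog pages, 32 pages per segment) shows why xid 2828785257 lands in segment 0A89:

```python
# Map a transaction id to the pg_clog segment file covering it.
# Assumed defaults: 4 xids per byte (2 status bits each),
# 8192-byte clog pages, 32 pages per SLRU segment file.
XIDS_PER_SEGMENT = 4 * 8192 * 32  # 1,048,576 xids per segment

def clog_segment_for_xid(xid: int) -> str:
    """Return the pg_clog file name (e.g. '0A89') covering the given xid."""
    return format(xid // XIDS_PER_SEGMENT, "04X")

print(clog_segment_for_xid(2828785257))  # -> 0A89
```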
When I do this:
set zero_damaged_pages = on;
select count(*) from foo
It shows 3 errors like this:
WARNING: invalid page header in block 78550 of relation base/12926/31642; zeroing out page
WARNING: invalid page header in block 78551 of relation base/12926/31642; zeroing out page
WARNING: invalid page header in block 78552 of relation base/12926/31642; zeroing out page
One idea I found is to generate a zero file for the missing clog, set zero_damaged_pages=on, then vacuum or recreate the table. Looks like it might work, but I'm trying to understand the implications.
1. When I generate a zero clog file like this, what impact will it have on everything? Can it break anything else in the database, introduce some kind of inconsistency with other (correctly committed) table data?
2. How can I gauge how much actual data (e.g. how many rows) will be lost by doing this (zero clog file + vacuum/recreate table)?
3. Is there some place I can look to reliably tell how many rows the table should have, so I can compare to how many I can read after setting zero_damaged_pages?
4. Is there any way at all to recover/repair the corrupted data at this point?
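For what it's worth, question 2 can at least be bounded: each zeroed page loses at most one page's worth of tuples, so an average-tuples-per-page figure (e.g. reltuples/relpages from pg_class, which are planner estimates, not exact counts) gives a rough ceiling. A sketch with made-up numbers:

```python
# Rough ceiling on rows lost to zeroed pages. All numbers below are
# hypothetical; take reltuples/relpages for the real table from pg_class.
def rows_lost_ceiling(zeroed_pages: int, reltuples: float, relpages: int) -> int:
    """Zeroed pages times the table's average tuples per page."""
    avg_tuples_per_page = reltuples / relpages
    return round(zeroed_pages * avg_tuples_per_page)

# e.g. 3 zeroed pages in a table averaging 40 rows per 8 kB page:
print(rows_lost_ceiling(3, reltuples=10_000_000, relpages=250_000))  # -> 120
```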
Konrad Garus
On Wed, 21 Jul 2021 at 21:44, Konrad Garus <konrad.garus@gmail.com> wrote:
[...]
I have not tried this (I'm not sure how to simulate corruption on a laptop), but maybe the extension and demo below can help you see how much of it has been corrupted?
Also, does pg_dump on those relations complete, or does it throw errors and abort?
Thanks,
Vijay
Mumbai, India
This is a very nasty error scenario.
You can try faking the clog blocks. Either way, your data is corrupted, and if you don't have any backup, some of it is lost.
Thanks
Soumitra
Sent from my iPhone
On 21-Jul-2021, at 10:02 PM, Vijaykumar Jain <vijaykumarjain.github@gmail.com> wrote:
[...]
Do you have any clue as to how this happened? If you have a hardware problem (disk, controller, or cable) that is corrupting data as it is written, then any efforts you make at recovery on the same hardware could just result in more and more corruption. You don't mention, I think, whether you've copied the data to different storage (and preferably a different server), but if you haven't already done so, STOP RIGHT NOW and do nothing else with this db until you've done that.
I've zeroed out Pg 8.4 pg_clog files using dd (instructions are out there on the web) without any visible negative effect.
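For reference, that dd recipe amounts to writing one zero-filled 256 kB segment; here is a Python sketch of the same thing (assuming the default 8 kB block size and 32-page SLRU segments; stop the server and work on a copy of the data directory first, and note that all-zero status bits read as "not committed", so vacuum will discard the affected tuples):

```python
# Create a zero-filled stand-in for a missing pg_clog segment; the Python
# equivalent of: dd if=/dev/zero of=$PGDATA/pg_clog/0A89 bs=8192 count=32
# Assumed defaults: 8 kB blocks, 32 pages per segment. Only run this
# against a COPY of the data directory with the server stopped.
import os

SEGMENT_BYTES = 32 * 8192  # 262,144 bytes per clog segment

def write_zero_clog_segment(pgdata: str, segment: str = "0A89") -> str:
    """Write an all-zero clog segment file and return its path."""
    path = os.path.join(pgdata, "pg_clog", segment)
    with open(path, "wb") as f:
        f.write(b"\x00" * SEGMENT_BYTES)
    return path
```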
Naturally, YMMV.
On 7/21/21 11:14 AM, Konrad Garus wrote:
[...]
--
Angular momentum makes the world go 'round.
On Wed, 2021-07-21 at 18:14 +0200, Konrad Garus wrote:
> [...]
> 1. When I generate a zero clog file like this, what impact will it have on everything?
> Can it break anything else in the database, introduce some kind of inconsistency with other (correctly committed) table data?

Certainly. I would create a fake clog file with only 0x55 in it, which corresponds to all transactions being committed, so you will "see" more of the damaged data. But of course there is no guarantee that the data are consistent.

For example, foreign keys could be violated. You will have to manually fix these data before you can re-create the constraints.

> 2. How can I gauge how much actual data (e.g. how many rows) will be lost by doing this (zero clog file + vacuum/recreate table)?

You'd have to look at the binary data of the affected blocks to get an idea. The "pageinspect" contrib module can extract a block from a data file.

> 3. Is there some place I can look to reliably tell how many rows the table should have, so I can compare to how many I can read after setting zero_damaged_pages?

No.

> 4. Is there any way at all to recover/repair the corrupted data at this point?

With the techniques you describe, you can salvage some data. Carefully reading the damaged blocks, you may glean some more. But typically, at that point you are happy if you lose only some data. You may have to delete some more to make the data consistent. Then cut your losses and load the dump into a new cluster on good hardware.

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com
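The 0x55 trick works because clog stores two status bits per transaction and 0b01 means "committed", so a byte of 0b01010101 marks all four of its transactions committed. A sketch (assuming default 8 kB pages and 32-page, 256 kB segments):

```python
# Why a clog segment filled with 0x55 reads as "everything committed":
# each byte packs four 2-bit transaction statuses, and 0b01 is the
# "committed" status, so 0x55 = 0b01010101 marks all four as committed.
# Assumed defaults: 8 kB pages, 32 pages (256 kB) per segment.
SEGMENT_BYTES = 32 * 8192

def make_committed_segment(path: str) -> None:
    """Write a fake clog segment in which every covered xid appears committed."""
    with open(path, "wb") as f:
        f.write(b"\x55" * SEGMENT_BYTES)

# Decode one 0x55 byte into its four 2-bit statuses:
statuses = [(0x55 >> (2 * i)) & 0b11 for i in range(4)]
print(statuses)  # -> [1, 1, 1, 1], i.e. all committed
```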