Re: Invalid headers and xlog flush failures
От | Bricklen Anderson |
---|---|
Тема | Re: Invalid headers and xlog flush failures |
Дата | |
Msg-id | 42039146.7020808@PresiNET.com обсуждение исходный текст |
Ответ на | Re: Invalid headers and xlog flush failures (Tom Lane <tgl@sss.pgh.pa.us>) |
Список | pgsql-general |
Tom Lane wrote: > Bricklen Anderson <BAnderson@PresiNET.com> writes: > >>>Tom Lane wrote: >>> >>>>But anyway, the evidence seems pretty clear that in fact end of WAL is >>>>in the 73 range, and so those page LSNs with 972 and 973 have to be >>>>bogus. I'm back to thinking about dropped bits in RAM or on disk. > > >>memtest86+ ran for over 15 hours with no errors reported. >>e2fsck -c completed with no errors reported. > > > Hmm ... that's not proof your hardware is ok, but it at least puts the > ball back in play. > > >>Any ideas on what I should try next? Considering that this db is not >>in production yet, I _do_ have the liberty to rebuild the database if >>necessary. Do you have any further recommendations? > > > If the database isn't too large, I'd suggest saving aside a physical > copy (eg, cp or tar dump taken with postmaster stopped) for forensic > purposes, and then rebuilding so you can get on with your own work. > > One bit of investigation that might be worth doing is to look at every > single 8K page in the database files and collect information about the > LSN fields, which are the first 8 bytes of each page. Do you mean this line from pg_filedump's results: LSN: logid 56 recoff 0x3f4be440 Special 8176 (0x1ff0) If so, I've set up a shell script that looped all of the files and emitted that line. It's not particularly elegant, but it worked. Again, that's assuming that it was the correct line. I'll write a perl script to parse out the LSN values to see if any are greater than 116 (which I believe is the hex of 74?). In case anyone wants the script that I ran to get the LSN: #!/bin/sh for FILE in /var/postgres/data/base/17235/*; do i=0 echo $FILE >> test_file; while [ 1==1 ]; do str=`pg_filedump -R $i $FILE | grep LSN`; if [ "$?" -eq "1" ]; then break fi echo "$FILE: $str" >> LSN_out; i=$((i+1)); done done > In a non-broken database all of these should be less than or equal to the current ending > WAL offset (which you can get with pg_controldata if the postmaster is > stopped). We know there are at least two bad pages, but are there more? > Is there any pattern to the bad LSN values? Also it would be useful to > look at each bad page in some detail to see if there's any evidence of > corruption extending beyond the LSN value. > > regards, tom lane NB. I've recreated the database, and saved off the old directory (all 350 gigs of it) so I can dig into it further. Thanks again for you help, Tom. Cheers, Bricklen
В списке pgsql-general по дате отправления: