Re: Online verification of checksums
From | Stephen Frost |
---|---|
Subject | Re: Online verification of checksums |
Date | |
Msg-id | 20180918154536.GE4184@tamriel.snowman.net |
In response to | Re: Online verification of checksums (Michael Banck <michael.banck@credativ.de>) |
Responses |
Re: Online verification of checksums
|
List | pgsql-hackers |
Greetings,

* Michael Banck (michael.banck@credativ.de) wrote:
> please find attached version 2 of the patch.
>
> On Thursday, 26.07.2018, at 13:59 +0200, Michael Banck wrote:
> > I've now forward-ported this change to pg_verify_checksums, in order to
> > make this application useful for online clusters, see attached patch.
> >
> > I've tested this in a tight loop (while true; do pg_verify_checksums -D
> > data1 -d > /dev/null || /bin/true; done)[2] while doing "while true; do
> > createdb pgbench; pgbench -i -s 10 pgbench > /dev/null; dropdb pgbench;
> > done", which I already used to develop the original code in the fork and
> > which brought up a few bugs.
> >
> > I got one checksums verification failure this way, all others were
> > caught by the recheck (I've introduced a 500ms delay for the first ten
> > failures) like this:
> >
> > > pg_verify_checksums: checksum verification failed on first attempt in
> > > file "data1/base/16837/16850", block 7770: calculated checksum 785 but
> > > expected 5063
> > > pg_verify_checksums: block 7770 in file "data1/base/16837/16850"
> > > verified ok on recheck
>
> I have now changed this from the pg_sleep() to a check against the
> checkpoint LSN as discussed upthread.

Ok.

> > However, I am also seeing sporadic (maybe 0.5 times per pgbench run)
> > failures like this:
> >
> > > pg_verify_checksums: short read of block 2644 in file
> > > "data1/base/16637/16650", got only 4096 bytes
> >
> > This is not strictly a verification failure, should we do anything about
> > this? In my fork, I am also rechecking on this[3] (and I am happy to
> > extend the patch that way), but that makes the code and the patch more
> > complicated and I wanted to check the general opinion on this case
> > first.
>
> I have added a retry for this as well now, without a pg_sleep() as well.
> This catches around 80% of the half-reads, but a few slip through. At
> that point we bail out with exit(1), and the user can try again, which I
> think is fine?

No, this is perfectly normal behavior, as is having completely blank
pages, now that I think about it. If we get a short read then I'd say
we simply check that we got an EOF and, in that case, we just move on.

> Alternatively, we could just skip to the next file then and don't make
> it count as a checksum failure.

No, I wouldn't count it as a checksum failure. We could possibly count
it towards the skipped pages, though I'm even on the fence about that.

Thanks!

Stephen
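To make the short-read suggestion concrete, here is a minimal sketch in C of a per-file scan loop that treats a partial block at EOF as normal and moves on, and skips completely blank pages instead of reporting them. It is not the patch's code: `ScanStats`, `page_is_blank()` and `scan_file()` are hypothetical names, the sketch uses stdio rather than raw file descriptors, it assumes the default 8192-byte block size, it takes "blank" to mean an all-zero page, and it leaves out the checksum computation itself.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define BLCKSZ 8192 /* assumption: PostgreSQL's default block size */

/* Hypothetical counters, for illustration only. */
typedef struct ScanStats
{
    long blocks_checked; /* full, non-blank blocks (checksum would be verified) */
    long blocks_skipped; /* full blocks that are entirely zero */
    long short_reads;    /* partial block seen at end of file */
} ScanStats;

/* Returns true if every byte of the page is zero. */
static bool
page_is_blank(const unsigned char *page)
{
    for (size_t i = 0; i < BLCKSZ; i++)
    {
        if (page[i] != 0)
            return false;
    }
    return true;
}

/*
 * Scan one relation segment file block by block.
 * Returns 0 when the end of the file is reached, -1 on a real I/O error.
 */
int
scan_file(FILE *fp, ScanStats *stats)
{
    unsigned char buf[BLCKSZ];

    for (;;)
    {
        size_t nread = fread(buf, 1, BLCKSZ, fp);

        if (nread == BLCKSZ)
        {
            if (page_is_blank(buf))
                stats->blocks_skipped++;
            else
                stats->blocks_checked++; /* checksum verification would go here */
            continue;
        }

        /* Fewer bytes than a block: either an I/O error or end of file. */
        if (ferror(fp))
            return -1;

        /*
         * No error, so we hit EOF. A partial block here is not a checksum
         * failure; note it and move on to the next file.
         */
        if (nread > 0)
            stats->short_reads++;
        return 0;
    }
}
```

Short reads are counted separately from skipped pages here only so the two cases stay visible; whether they should count towards the skipped pages at all is left open above.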