Re: Online verification of checksums

From	Stephen Frost
Subject	Re: Online verification of checksums
Date	18 September 2018, 18:45:36
Msg-id	20180918154536.GE4184@tamriel.snowman.net
In reply to	Re: Online verification of checksums (Michael Banck <michael.banck@credativ.de>)
Responses	Re: Online verification of checksums
List	pgsql-hackers

Greetings,

* Michael Banck (michael.banck@credativ.de) wrote:
> please find attached version 2 of the patch.
>
> On Thursday, 26.07.2018, at 13:59 +0200, Michael Banck wrote:
> > I've now forward-ported this change to pg_verify_checksums, in order to
> > make this application useful for online clusters, see attached patch.
> >
> > I've tested this in a tight loop (while true; do pg_verify_checksums -D
> > data1 -d > /dev/null || /bin/true; done)[2] while doing "while true; do
> > createdb pgbench; pgbench -i -s 10 pgbench > /dev/null; dropdb pgbench;
> > done", which I already used to develop the original code in the fork and
> > which brought up a few bugs.
> >
> > I got one checksums verification failure this way, all others were
> > caught by the recheck (I've introduced a 500ms delay for the first ten
> > failures) like this:
> >
> > > pg_verify_checksums: checksum verification failed on first attempt in
> > > file "data1/base/16837/16850", block 7770: calculated checksum 785 but
> > > expected 5063
> > > pg_verify_checksums: block 7770 in file "data1/base/16837/16850"
> > > verified ok on recheck
>
> I have now changed this from the pg_sleep() to a check against the
> checkpoint LSN as discussed upthread.

Ok.
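
For readers following along, here is a minimal sketch of how a recheck against the checkpoint LSN might be decided. It is written in Python for brevity rather than the C of pg_verify_checksums, and it is not taken from the patch. The function names, the `checksum_ok` input, and the three verdict strings are hypothetical; the only layout fact relied on is that a page header starts with the 8-byte `pd_lsn`, stored as a high and a low 32-bit half. Little-endian byte order is an assumption.

```python
import struct

BLCKSZ = 8192  # PostgreSQL's default block size


def page_lsn(page: bytes) -> int:
    """Return the 64-bit LSN held in the first 8 bytes of a page header."""
    # pd_lsn is two 32-bit halves, high half first, in the server's native
    # byte order. Little-endian is assumed here.
    xlogid, xrecoff = struct.unpack_from("<II", page, 0)
    return (xlogid << 32) | xrecoff


def recheck_verdict(reread_page: bytes, checksum_ok: bool, checkpoint_lsn: int) -> str:
    """Classify a block that failed verification once and was read again.

    checksum_ok is the result of verifying the re-read page; computing the
    checksum itself is outside the scope of this sketch.
    """
    if checksum_ok:
        return "ok"        # first attempt saw a torn, in-flight write
    if page_lsn(reread_page) > checkpoint_lsn:
        return "skipped"   # page changed after the checkpoint, cannot judge it
    return "failed"        # page is old and still does not verify
```

The point of comparing LSNs instead of sleeping is that the verdict no longer depends on timing: a page written after the checkpoint is known to be in flux, whatever the delay between the two reads.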

> > However, I am also seeing sporadic (maybe 0.5 times per pgbench run)
> > failures like this:
> >
> > > pg_verify_checksums: short read of block 2644 in file
> > > "data1/base/16637/16650", got only 4096 bytes
> >
> > This is not strictly a verification failure, should we do anything about
> > this? In my fork, I am also rechecking on this[3] (and I am happy to
> > extend the patch that way), but that makes the code and the patch more
> > complicated and I wanted to check the general opinion on this case
> > first.
>
> I have added a retry for this as well now, also without a pg_sleep().
>
> This catches around 80% of the half-reads, but a few slip through. At
> that point we bail out with exit(1), and the user can try again, which I
> think is fine?

No, this is perfectly normal behavior, as is having completely blank
pages, now that I think about it.  If we get a short read then I'd say
we simply check that we got an EOF and, in that case, we just move on.

> Alternatively, we could just skip to the next file then and don't make
> it count as a checksum failure.

No, I wouldn't count it as a checksum failure.  We could possibly count
it towards the skipped pages, though I'm even on the fence about that.
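
The handling suggested above can be sketched roughly as follows, again in Python rather than the C of pg_verify_checksums. This is an illustration under stated assumptions, not the patch: `read_block`, `scan_file`, the `verify` callback, and the counter names are all hypothetical. A short read that ends in EOF simply ends the scan of that file, blank pages are skipped, and neither is counted as a checksum failure. Whether the partial trailing block should count as skipped at all is left open in the discussion; the sketch counts it.

```python
import io

BLCKSZ = 8192  # PostgreSQL's default block size
ZERO_PAGE = bytes(BLCKSZ)


def read_block(f):
    """Read up to one block. Return (data, hit_eof)."""
    buf = b""
    while len(buf) < BLCKSZ:
        chunk = f.read(BLCKSZ - len(buf))
        if not chunk:  # an empty read means end of file
            return buf, True
        buf += chunk
    return buf, False


def scan_file(f, verify):
    """Walk a file block by block; verify(blockno, page) returns a bool."""
    counts = {"scanned": 0, "skipped": 0, "failed": 0}
    blockno = 0
    while True:
        page, hit_eof = read_block(f)
        if hit_eof:
            # A partial trailing block at EOF is not a checksum failure.
            if page:
                counts["skipped"] += 1
            break
        if page == ZERO_PAGE:
            counts["skipped"] += 1  # completely blank page, nothing to check
        elif verify(blockno, page):
            counts["scanned"] += 1
        else:
            counts["failed"] += 1
        blockno += 1
    return counts


if __name__ == "__main__":
    # One full block followed by a 4096-byte tail, as in the report above.
    data = b"\x01" * BLCKSZ + b"\x01" * 4096
    print(scan_file(io.BytesIO(data), lambda blockno, page: True))
```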

Thanks!

Stephen

Attachments

signature.asc
