Re: Online verification of checksums
From | Anastasia Lubennikova
---|---
Subject | Re: Online verification of checksums
Date |
Msg-id | 97cbb4a6-ee0a-4cad-6b65-84e06d14dfe9@postgrespro.ru
In reply to | Re: Online verification of checksums (Stephen Frost <sfrost@snowman.net>)
Responses | Re: Online verification of checksums
List | pgsql-hackers
On 23.11.2020 18:35, Stephen Frost wrote:
> Greetings,
>
> * Anastasia Lubennikova (a.lubennikova@postgrespro.ru) wrote:
>> On 21.11.2020 04:30, Michael Paquier wrote:
>>> The only method I can think as being really reliable is based on two facts:
>>> - Do a check only on pd_checksums, as that validates the full contents of the page.
>>> - When doing a retry, make sure that there is no concurrent I/O activity in the shared buffers. This requires an API we don't have yet.
>> It seems reasonable to me to rely on checksums only.
>> As for retry, I think that an API for concurrent I/O will be complicated. Instead, we can introduce a function to read the page directly from shared buffers after PAGE_RETRY_THRESHOLD attempts. It looks like a bullet-proof solution to me. Do you see any possible problems with it?
>
> We might end up reading pages back in that have been evicted, for one thing, which doesn't seem great,

TBH, I think it is highly unlikely that a page that was just updated will be evicted.

> and this also seems likely to be awkward for cases which aren't using the replication protocol, unless every process maintains a connection to PG the entire time, which also doesn't seem great.

Have I missed something? Right now pg_basebackup has only one process plus one child process for streaming. Anyway, I totally agree with your argument: the need to maintain connection(s) to PG is the most unpleasant part of the proposed approach.

> Also- what is the point of reading the page from shared buffers anyway..?

Well... Reading a page from shared buffers is a reliable way to get a correct page from postgres under any concurrent load, so it just seems natural to me.

> All we need to do is prove that the page will be rewritten during WAL replay.

Yes, and this is the tricky part. Until you explained it in your latest message, I wasn't sure how we could distinguish a concurrent update from a page-header corruption. Now I agree that if the page LSN has changed and increased between rereads, it is safe enough to conclude that we are seeing concurrent load.

> If we can prove that, we don't actually care what the contents of the page are. We certainly can't calculate the checksum on a page we plucked out of shared buffers since we only calculate the checksum when we go to write the page out.

Good point. I was thinking that we could recalculate the checksum, or even save the page without one, since we have checked the LSN and know for sure that the page will be rewritten by WAL replay.
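Just to make sure we mean the same thing, here is a rough sketch of how a reread-plus-ascending-LSN check could look on the backup side. This is illustration only, not the referenced patch; read_page() and the PAGE_RETRY_THRESHOLD value are placeholders, while the page macros and pg_checksum_page() are the existing ones from bufpage.h and checksum.h:

```c
/*
 * Illustration only, not the actual patch.  read_page() and
 * PAGE_RETRY_THRESHOLD are assumed placeholders.
 */
#include "postgres.h"

#include "access/xlogdefs.h"
#include "storage/bufpage.h"
#include "storage/checksum.h"

#define PAGE_RETRY_THRESHOLD 5		/* assumed retry limit */

/* assumed helper: reads one block of the file into buf, true on success */
extern bool read_page(int fd, BlockNumber blkno, char *buf);

static bool
verify_page_with_retry(int fd, BlockNumber blkno, char *buf)
{
	XLogRecPtr	prev_lsn = InvalidXLogRecPtr;

	for (int attempt = 0; attempt < PAGE_RETRY_THRESHOLD; attempt++)
	{
		XLogRecPtr	cur_lsn;

		if (!read_page(fd, blkno, buf))
			return false;		/* I/O error: let the caller complain */

		/* Freshly extended pages carry no checksum to verify. */
		if (PageIsNew(buf))
			return true;

		/* Checksum matches: the copy we read is consistent. */
		if (pg_checksum_page(buf, blkno) == ((PageHeader) buf)->pd_checksum)
			return true;

		/*
		 * Checksum mismatch.  If the page LSN advanced between rereads,
		 * we almost certainly raced with a concurrent write; the page
		 * will be rewritten during WAL replay, so it is safe to accept
		 * the copy we have instead of reporting corruption.
		 */
		cur_lsn = PageGetLSN(buf);
		if (attempt > 0 && cur_lsn > prev_lsn)
			return true;

		prev_lsn = cur_lsn;
	}

	/* Checksum kept failing and the LSN never moved: report corruption. */
	return false;
}
```

In a real patch the LSN comparison would presumably also need a sanity check that the value lies in a plausible range (between the backup start LSN and the current insert position), but that is orthogonal to the retry logic itself.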
To sum up, I agree with your proposal to reread the page and rely on ascending LSNs. Can you submit a patch?
You can write it on top of the latest attachment in this thread:
v8-master-0001-Fix-page-verifications-in-base-backups.patch from this message https://www.postgresql.org/message-id/20201030023028.GC1693@paquier.xyz
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company