On 2018-04-07 01:27:13 +0200, Daniel Gustafsson wrote:
> > On 07 Apr 2018, at 01:13, Andres Freund <andres@anarazel.de> wrote:
> >
> > On 2018-04-07 01:04:50 +0200, Daniel Gustafsson wrote:
> >>> I'm fairly certain that the bug here is a simple race condition in the
> >>> test (not the main code!):
> >>
> >> I wonder if it may perhaps be a case of both?
> >
> > See my other message about the atomic fallback bit.
>
> Yep, my MUA pulled it down just as I had sent this. Thanks for confirming my
> suspicion.
But note it fails because the code using it is WRONG. There's a reason
there's "unlocked" in the name. But even leaving that aside, it probably
*still* be wrong if it were locked.
It seems *extremely* dubious that we'll allow to re-enable the checksums
while a worker is still doing stuff for the old cycle in the
background. Consider what happens if the checksum helper is currently
doing RequestCheckpoint() (something that can certainly take a *LONG*)
while. Another process disables checksums. Pages get written out
without checksums. Yet another process re-enables checksums. Helper
process does SetDataChecksumsOn(). Which succeeds because
if (ControlFile->data_checksum_version != PG_DATA_CHECKSUM_INPROGRESS_VERSION)
{
LWLockRelease(ControlFileLock);
elog(ERROR, "Checksums not in inprogress mode");
}
succeeds. Boom. Cluster with partially set checksums but marked as
valid.
Greetings,
Andres Freund