Обсуждение: Incorrect logic in XLogNeedsFlush()
Hi, If you call XLogNeedsFlush() immediately after calling XLogFlush() in FlushBuffer(), it can return true. With this diff: diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 350cc0402aa..91c3fe99d6e 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -4342,7 +4342,11 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object, * skip the flush if the buffer isn't permanent. */ if (buf_state & BM_PERMANENT) + { XLogFlush(recptr); + if(XLogNeedsFlush(recptr)) + elog(ERROR, "should be flushed and isn't"); + } recovery/015_promotion_pages errors out. During an end-of-recovery checkpoint, XLogNeedsFlush() should compare the provided LSN to the flush pointer and not the min recovery point. This only errors out during an end-of-recovery checkpoint after a crash when the ControlFile->minRecoveryPoint has not been correctly updated. In this case, LocalMinRecoveryPoint is initialized to an incorrect value potentially causing XLogNeedsFlush() to incorrectly return true. This trivial diff makes the test pass: @@ -3115,7 +3125,7 @@ XLogNeedsFlush(XLogRecPtr record) * instead. So "needs flush" is taken to mean whether minRecoveryPoint * would need to be updated. */ - if (RecoveryInProgress()) + if (RecoveryInProgress() && !XLogInsertAllowed()) { However, since all of this code is new to me, there are some remaining things I don't understand: Why is it okay for other processes than the startup process to initialize LocalMinRecoveryPoint from ControlFile->minRecoveryPoint during crash recovery? All of the comments and the code seem to indicate that this is allowed. Couldn't it just as likely be invalid for them (e.g. checkpointer here) as it could for the startup process? Like if we crashed before updating it? Besides this question, I find the use of the global variable InRecovery as a proxy for RecoveryInProgress() + "I'm the startup process" to be confusing. It seems like UpdateMinRecoveryPoint() and XLogNeedsFlush() could just test MyBackendType == B_STARTUP and make it explicit. - Melanie
On Sat, Aug 30, 2025 at 4:16 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > > Hi, > > If you call XLogNeedsFlush() immediately after calling XLogFlush() in > FlushBuffer(), it can return true. > > With this diff: > > diff --git a/src/backend/storage/buffer/bufmgr.c > b/src/backend/storage/buffer/bufmgr.c > index 350cc0402aa..91c3fe99d6e 100644 > --- a/src/backend/storage/buffer/bufmgr.c > +++ b/src/backend/storage/buffer/bufmgr.c > @@ -4342,7 +4342,11 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, > IOObject io_object, > * skip the flush if the buffer isn't permanent. > */ > if (buf_state & BM_PERMANENT) > + { > XLogFlush(recptr); > + if(XLogNeedsFlush(recptr)) > + elog(ERROR, "should be flushed and isn't"); > + } > > recovery/015_promotion_pages errors out. > > During an end-of-recovery checkpoint, XLogNeedsFlush() should compare > the provided LSN to the flush pointer and not the min recovery point. That right, "end-of-recovery checkpoint" actually performs a flush instead of updating the minRecoveryPoint so we should compare it with flush pointer. > This only errors out during an end-of-recovery checkpoint after a > crash when the ControlFile->minRecoveryPoint has not been correctly > updated. In this case, LocalMinRecoveryPoint is initialized to an > incorrect value potentially causing XLogNeedsFlush() to incorrectly > return true. > > This trivial diff makes the test pass: > > @@ -3115,7 +3125,7 @@ XLogNeedsFlush(XLogRecPtr record) > * instead. So "needs flush" is taken to mean whether minRecoveryPoint > * would need to be updated. > */ > - if (RecoveryInProgress()) > + if (RecoveryInProgress() && !XLogInsertAllowed()) > { > > However, since all of this code is new to me, there are some remaining > things I don't understand: > > Why is it okay for other processes than the startup process to > initialize LocalMinRecoveryPoint from ControlFile->minRecoveryPoint > during crash recovery? During crash recovery we don't yet know the minRecoveryPoint until we replay all the WALs and once it is done it will be updated in the ControlFile at PerformRecoveryXLogAction()->CreateEndOfRecoveryRecord(). After this point it will be valid to use by any process including the StartupProcess. So I don't think until ControlFile->minRecoveryPoint is initialized it is safe to use for any process. In fact, before any connections are allowed this has to be set either by CreateEndOfRecoveryRecord() if recovering a primary from crash or by ReachedEndOfBackup() if you are starting a standby from backup. > Besides this question, I find the use of the global variable > InRecovery as a proxy for RecoveryInProgress() + "I'm the startup > process" to be confusing. It seems like UpdateMinRecoveryPoint() and > XLogNeedsFlush() could just test MyBackendType == B_STARTUP and make > it explicit. IIUC InRecovery is slightly different than "RecoveryInProgress() + "I'm the startup", in StartupXlog, we first set InRecovery to false and then update the minRecoveryPoint in CreateEndOfRecoveryRecord() and after that we clear RecoveryInProgress (i.e. set XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;). -- Regards, Dilip Kumar Google
Thanks for taking time to reply! On Wed, Sep 3, 2025 at 11:38 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Sat, Aug 30, 2025 at 4:16 AM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > > Why is it okay for other processes than the startup process to > > initialize LocalMinRecoveryPoint from ControlFile->minRecoveryPoint > > during crash recovery? > > During crash recovery we don't yet know the minRecoveryPoint until we > replay all the WALs and once it is done it will be updated in the > ControlFile at PerformRecoveryXLogAction()->CreateEndOfRecoveryRecord(). > After this point it will be valid to use by any process including the > StartupProcess. So I don't think until ControlFile->minRecoveryPoint > is initialized it is safe to use for any process. In fact, before any > connections are allowed this has to be set either by > CreateEndOfRecoveryRecord() if recovering a primary from crash or by > ReachedEndOfBackup() if you are starting a standby from backup. So, looking at the code, it seems like most places we examine ControlFile->minRecoveryPoint, we condition it on "InArchiveRecovery". Is this something we want to do in XLogNeedsFlush() to avoid reading from ControlFile->minRecoveryPoint()? Though, it seems like LocalMinRecoveryPoint must be getting incorrectly set elsewhere, otherwise this would have guarded us from examining the control file: if (XLogRecPtrIsInvalid(LocalMinRecoveryPoint) && InRecovery) updateMinRecoveryPoint = false; /* Quick exit if already known to be updated or cannot be updated */ if (record <= LocalMinRecoveryPoint || !updateMinRecoveryPoint) return false; > > Besides this question, I find the use of the global variable > > InRecovery as a proxy for RecoveryInProgress() + "I'm the startup > > process" to be confusing. It seems like UpdateMinRecoveryPoint() and > > XLogNeedsFlush() could just test MyBackendType == B_STARTUP and make > > it explicit. > > IIUC InRecovery is slightly different than "RecoveryInProgress() + > "I'm the startup", in StartupXlog, we first set InRecovery to false > and then update the minRecoveryPoint in CreateEndOfRecoveryRecord() > and after that we clear RecoveryInProgress (i.e. set > XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;). Hmm. So, RecoveryInProgress() can be true in the startup process and InRecovery is false. I see. - Melanie
On Tue, 2025-09-09 at 16:29 -0400, Melanie Plageman wrote: > On Wed, Sep 3, 2025 at 11:38 PM Dilip Kumar <dilipbalaut@gmail.com> > wrote: > > > > On Sat, Aug 30, 2025 at 4:16 AM Melanie Plageman > > <melanieplageman@gmail.com> wrote: > > > > > Why is it okay for other processes than the startup process to > > > initialize LocalMinRecoveryPoint from ControlFile- > > > >minRecoveryPoint > > > during crash recovery? > > > > During crash recovery we don't yet know the minRecoveryPoint until > > we > > replay all the WALs and once it is done it will be updated in the > > ControlFile at PerformRecoveryXLogAction()- > > >CreateEndOfRecoveryRecord(). ... > > Though, it seems like LocalMinRecoveryPoint must be getting > incorrectly set elsewhere, otherwise this would have guarded us from > examining the control file: I am confused about whether we are discussing "incorrect" or "invalid". Regards, Jeff Davis > >
On Wed, Sep 10, 2025 at 1:59 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > > Thanks for taking time to reply! > > On Wed, Sep 3, 2025 at 11:38 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Sat, Aug 30, 2025 at 4:16 AM Melanie Plageman > > <melanieplageman@gmail.com> wrote: > > > > > Why is it okay for other processes than the startup process to > > > initialize LocalMinRecoveryPoint from ControlFile->minRecoveryPoint > > > during crash recovery? > > > > During crash recovery we don't yet know the minRecoveryPoint until we > > replay all the WALs and once it is done it will be updated in the > > ControlFile at PerformRecoveryXLogAction()->CreateEndOfRecoveryRecord(). > > After this point it will be valid to use by any process including the > > StartupProcess. So I don't think until ControlFile->minRecoveryPoint > > is initialized it is safe to use for any process. In fact, before any > > connections are allowed this has to be set either by > > CreateEndOfRecoveryRecord() if recovering a primary from crash or by > > ReachedEndOfBackup() if you are starting a standby from backup. > > So, looking at the code, it seems like most places we examine > ControlFile->minRecoveryPoint, we condition it on "InArchiveRecovery". > Is this something we want to do in XLogNeedsFlush() to avoid reading > from ControlFile->minRecoveryPoint()? > > Though, it seems like LocalMinRecoveryPoint must be getting > incorrectly set elsewhere, otherwise this would have guarded us from > examining the control file: > > if (XLogRecPtrIsInvalid(LocalMinRecoveryPoint) && InRecovery) > updateMinRecoveryPoint = false; > > /* Quick exit if already known to be updated or cannot be updated */ > if (record <= LocalMinRecoveryPoint || !updateMinRecoveryPoint) > return false; That's not quite right. Before the end-of-recovery checkpoint, the InRecovery flag is already set to false. This means that even if LocalMinRecoveryPoint is invalid, it won't matter, and updateMinRecoveryPoint will not be set to false. Since LocalMinRecoveryPoint is 0, the condition if (record <= LocalMinRecoveryPoint) will also fail, causing the process to continue and read from the ControlFile. -- Regards, Dilip Kumar Google
On Wed, Sep 10, 2025 at 12:48:21PM +0530, Dilip Kumar wrote: >> So, looking at the code, it seems like most places we examine >> ControlFile->minRecoveryPoint, we condition it on "InArchiveRecovery". >> Is this something we want to do in XLogNeedsFlush() to avoid reading >> from ControlFile->minRecoveryPoint()? How would that help when the checkpointer does XLogNeedsFlush() calls, which is where you are reporting the disturbance is during the end-of-recovery checkpoint? InArchiveRecovery is only true in the startup process, set when we are doing archive recovery. This can be switched from the beginning of recovery or later, once all the local WAL has been replayed. >> Though, it seems like LocalMinRecoveryPoint must be getting >> incorrectly set elsewhere, otherwise this would have guarded us from >> examining the control file: >> >> if (XLogRecPtrIsInvalid(LocalMinRecoveryPoint) && InRecovery) >> updateMinRecoveryPoint = false; >> >> /* Quick exit if already known to be updated or cannot be updated */ >> if (record <= LocalMinRecoveryPoint || !updateMinRecoveryPoint) >> return false; > > That's not quite right. Before the end-of-recovery checkpoint, the > InRecovery flag is already set to false. This means that even if > LocalMinRecoveryPoint is invalid, it won't matter, and > updateMinRecoveryPoint will not be set to false. Since > LocalMinRecoveryPoint is 0, the condition if (record <= > LocalMinRecoveryPoint) will also fail, causing the process to continue > and read from the ControlFile. Similarly, InRecovery can only be true in the startup process. In the case of 015_promotion_pages, I can see that when the check you are adding fails, the LSN is already flushed by XLogFLush(). So, while I don't see any active bug here, I tend to agree with your position that XLogNeedsFlush() could be improved in the context of the end-of-recovery checkpoint or just when a process is allowed WAL inserts while we are still in recovery. You have a good point here, especially knowing that for CHECKPOINT_END_OF_RECOVERY case in CreateCheckPoint(), called by the checkpointer, we allow WAL inserts with LocalSetXLogInsertAllowed(). So, relying on RecoveryInProgress() in this context leads to a wrong result with XLogNeedsFlush(): WAL inserts are allowed, the minimum recovery point is not relevant for the evaluation. Or are you worried about the case of GetVictimBuffer() where a second call of XLogNeedsFlush() lives? -- Michael
Вложения
On Wed, Sep 10, 2025 at 3:18 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Sep 10, 2025 at 1:59 AM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > > Though, it seems like LocalMinRecoveryPoint must be getting > > incorrectly set elsewhere, otherwise this would have guarded us from > > examining the control file: > > > > if (XLogRecPtrIsInvalid(LocalMinRecoveryPoint) && InRecovery) > > updateMinRecoveryPoint = false; > > > > /* Quick exit if already known to be updated or cannot be updated */ > > if (record <= LocalMinRecoveryPoint || !updateMinRecoveryPoint) > > return false; > > That's not quite right. Before the end-of-recovery checkpoint, the > InRecovery flag is already set to false. This means that even if > LocalMinRecoveryPoint is invalid, it won't matter, and > updateMinRecoveryPoint will not be set to false. Since > LocalMinRecoveryPoint is 0, the condition if (record <= > LocalMinRecoveryPoint) will also fail, causing the process to continue > and read from the ControlFile. Ah, right, I got turned around. My original investigation showed me that the checkpointer incorrectly read from the ControlFile when I added the XLogNeedsFlush() precisely because InRecovery is false outside of the startup process. What I want is for it to be safe and accurate to call XLogNeedsFlush() in any backend (one that might flush WAL, that is). - Melanie
On Tue, Sep 9, 2025 at 9:50 PM Jeff Davis <pgsql@j-davis.com> wrote: > > On Tue, 2025-09-09 at 16:29 -0400, Melanie Plageman wrote: > > > Though, it seems like LocalMinRecoveryPoint must be getting > > incorrectly set elsewhere, otherwise this would have guarded us from > > examining the control file: > > I am confused about whether we are discussing "incorrect" or "invalid". I meant incorrect as in when it was not supposed to be -- not InvalidXLogRecPtr. But, I've realized I misspoke -- because InRecovery is only true in the startup process, so if (XLogRecPtrIsInvalid(LocalMinRecoveryPoint) && InRecovery) is false for checkpointer. - Melanie
On Wed, Sep 10, 2025 at 3:58 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Wed, Sep 10, 2025 at 12:48:21PM +0530, Dilip Kumar wrote: > >> So, looking at the code, it seems like most places we examine > >> ControlFile->minRecoveryPoint, we condition it on "InArchiveRecovery". > >> Is this something we want to do in XLogNeedsFlush() to avoid reading > >> from ControlFile->minRecoveryPoint()? > > How would that help when the checkpointer does XLogNeedsFlush() calls, > which is where you are reporting the disturbance is during the > end-of-recovery checkpoint? InArchiveRecovery is only true in the > startup process, set when we are doing archive recovery. This can be > switched from the beginning of recovery or later, once all the local > WAL has been replayed. Ah right, I made the same mistake with the InRecovery global variable initially too. This is my first time reading this code. I thought that there needed to be something guarding processes other than the startup process from reading the minimum recovery point from the ControlFile during crash recovery until it is valid. But perhaps not -- more on that below. > In the case of 015_promotion_pages, I can see that when the check you > are adding fails, the LSN is already flushed by XLogFLush(). So, > while I don't see any active bug here, I tend to agree with your > position that XLogNeedsFlush() could be improved in the context of the > end-of-recovery checkpoint or just when a process is allowed WAL > inserts while we are still in recovery. > > You have a good point here, especially knowing that for > CHECKPOINT_END_OF_RECOVERY case in CreateCheckPoint(), called by the > checkpointer, we allow WAL inserts with LocalSetXLogInsertAllowed(). > So, relying on RecoveryInProgress() in this context leads to a wrong > result with XLogNeedsFlush(): WAL inserts are allowed, the minimum > recovery point is not relevant for the evaluation. So, would you consider the defining characteristic of whether or not we should use the flush pointer instead of min recovery point in XLogNeedsFlush() to be whether or not WAL inserts are allowed? e.g. @@ -3115,7 +3115,7 @@ XLogNeedsFlush(XLogRecPtr record) * instead. So "needs flush" is taken to mean whether minRecoveryPoint * would need to be updated. */ - if (RecoveryInProgress()) + if (RecoveryInProgress() && !XLogInsertAllowed()) If I look at the difference between XLogFlush() and XLogNeedsFlush(), it checks if XLogInsertAllowed() and, if not, does not try to flush WAL -- which makes sense. There shouldn't be WAL to flush if we aren't allowed to insert any. In XLogFlush() -> UpdateMinRecoveryPoint() (called after checking if WAL inserts are allowed), it has similar logic to XLogNeedsFlush(): if (!updateMinRecoveryPoint || (!force && lsn <= LocalMinRecoveryPoint)) return; if (XLogRecPtrIsInvalid(LocalMinRecoveryPoint) && InRecovery) updateMinRecoveryPoint = false; The order of these two if statements is reversed in XLogNeedsFlush() So, what I don't understand is: why it would be okay for processes other than the startup process to read the minimum recovery point from the control file during crash recovery before it is valid. It seems in crash recovery, a backend other than the startup process will fail this test: if (XLogRecPtrIsInvalid(LocalMinRecoveryPoint) && InRecovery) and then potentially read from the control file LocalMinRecoveryPoint = ControlFile->minRecoveryPoint; LocalMinRecoveryPointTLI = ControlFile->minRecoveryPointTLI; There is a comment right after that says /* * Check minRecoveryPoint for any other process than the startup * process doing crash recovery, which should not update the control * file value if crash recovery is still running. */ Which seems to say that it is okay for other processes than the startup process to read the minimum recovery point from the Control File during crash recovery. But I don't quite understand why. > Or are you worried about the case of GetVictimBuffer() where a second > call of XLogNeedsFlush() lives? I want it to be fixed because of another patch I am writing to do batched writes with smgrwritev() [1]. In this case, I've split up the code in FlushBuffer() and flush the WAL to the required LSN in a different place in code than where I call smgrwritev(). So, I want to add an assert-only verification that checks that none of the buffer's need WAL flushed. This would be invoked by any process calling FlushBuffer(), so it would have to work for all types of backends. - Melanie [1] https://www.postgresql.org/message-id/flat/CAAKRu_Yurj0sabXZB5EVqO3fKgwN1vELe1H4e0s1zXV3MKbZjQ%40mail.gmail.com#c8caa5ff77d89e75b6f56f16619a77b7
On Wed, 2025-09-10 at 11:12 -0400, Melanie Plageman wrote: > So, would you consider the defining characteristic of whether or not > we should use the flush pointer instead of min recovery point in > XLogNeedsFlush() to be whether or not WAL inserts are allowed? That was my question here: https://www.postgresql.org/message-id/b4ad535a72fc02ea43076cf525e4dbaa72b00d5b.camel@j-davis.com It seems like XLogFlush() and XLogNeedsFlush() should use the same test, otherwise you could always get some confusing inconsistency. Right? Regards, Jeff Davis
On Wed, Sep 10, 2025 at 11:12:37AM -0400, Melanie Plageman wrote: > On Wed, Sep 10, 2025 at 3:58 AM Michael Paquier <michael@paquier.xyz> wrote: >> You have a good point here, especially knowing that for >> CHECKPOINT_END_OF_RECOVERY case in CreateCheckPoint(), called by the >> checkpointer, we allow WAL inserts with LocalSetXLogInsertAllowed(). >> So, relying on RecoveryInProgress() in this context leads to a wrong >> result with XLogNeedsFlush(): WAL inserts are allowed, the minimum >> recovery point is not relevant for the evaluation. > > So, would you consider the defining characteristic of whether or not > we should use the flush pointer instead of min recovery point in > XLogNeedsFlush() to be whether or not WAL inserts are allowed? Yeah, using the flush pointer makes sense if WAL inserts are allowed, even if that's just temporary. We're telling the process where the inserts are allowed that we are not officially out of recovery yet, but close to it. If your suggestion is to update XLogNeedsFlush() so as it uses !XLogInsertAllowed() rather than RecoveryInProgress(), that makes sense to me. It looks like we've never really had a case up to now where XLogNeedsFlush() would matter for the processes where LocalSetXLogInsertAllowed() has been called to temporarily allow WAL inserts. LocalSetXLogInsertAllowed() is not a one-way trip as far as I recall, with the checkpointer being a special case. > In XLogFlush() -> UpdateMinRecoveryPoint() (called after checking if > WAL inserts are allowed), it has similar logic to XLogNeedsFlush(): > > if (!updateMinRecoveryPoint || (!force && lsn <= LocalMinRecoveryPoint)) > return; > if (XLogRecPtrIsInvalid(LocalMinRecoveryPoint) && InRecovery) > updateMinRecoveryPoint = false; > > The order of these two if statements is reversed in XLogNeedsFlush() You mean in XLogNeedsFlush()->RecoveryInProgress() for the first set of checks, and the top of UpdateMinRecoveryPoint() for the second set of similar checks. Alexander Kukushkin may have some test cases fancier than the in-core tests to stress this code path, still +1 for more consistency. > So, what I don't understand is: why it would be okay for processes > other than the startup process to read the minimum recovery point from > the control file during crash recovery before it is valid. InvalidXLogRecPtr stored in ControlFile->minRecoveryPoint can be used as a way to check if we are running crash recovery. I don't see why we could not reverse the order of the checks to make both code paths more consistent. Cannot think of a reason on top of my mind. > It seems in crash recovery, a backend other than the startup process > will fail this test: > > if (XLogRecPtrIsInvalid(LocalMinRecoveryPoint) && InRecovery) Yes. InRecovery can only ever be true in the startup process. Anything else will fail this test. FYI, looking a bit around with the history of XLogNeedsFlush(), the last changes were a few years ago, with c186ba135eca and 8d68ee6f31ca (cough): https://www.postgresql.org/message-id/20180831.145206.05203037.horiguchi.kyotaro@lab.ntt.co.jp The only process calling UpdateMinRecoveryPoint() during crash recovery should be the startup process, and I don't recall that we've changed anything in this area lately to change this rule. > There is a comment right after that says > > /* > * Check minRecoveryPoint for any other process than the startup > * process doing crash recovery, which should not update the control > * file value if crash recovery is still running. > */ > > Which seems to say that it is okay for other processes than the > startup process to read the minimum recovery point from the Control > File during crash recovery. But I don't quite understand why. minRecoveryPoint as an invalid LSN can allow another process to know if we are in crash recovery. These days one could just use GetRecoveryState() if they'd wish to know the state of recovery we are in based on how much the startup process has replayed. I suspect that this code could be simplified to rely more on GetRecoveryState(), instead. This API has been introduced later than the recent changes of XLogNeedsFlush(), as in 4e87c4836ab9 vs c186ba135eca. You are right that there are some simplications that can be done here. That looks like an independent list of patches, at least three: 1) The order of the sanity checks. 2) Potential simplifications in XLogNeedsFlush() with GetRecoveryState(). 3) Switch XLogNeedsFlush() to use !XLogInsertAllowed() rather than RecoveryInProgress() that discard the local-wal-insert case, with the addition of an extra assertion. bufmgr.c would be one location, I am wondering if we should not plant that elsewhere, like in XLogFlush() itself. There should be no risk of an infinite recursion, if my reading of the code is right this morning. Anyway, yes, it really comes down to the point that we've never bothered using XLogNeedsFlush() to support your case. 3) would be enough for your case, the other two seem like nice improvements to make this part of the recovery code leaner. >> Or are you worried about the case of GetVictimBuffer() where a second >> call of XLogNeedsFlush() lives? > > I want it to be fixed because of another patch I am writing to do > batched writes with smgrwritev() [1]. In this case, I've split up the > code in FlushBuffer() and flush the WAL to the required LSN in a > different place in code than where I call smgrwritev(). So, I want to > add an assert-only verification that checks that none of the buffer's > need WAL flushed. This would be invoked by any process calling > FlushBuffer(), so it would have to work for all types of backends. Adding an assertion based on needs-flush sounds like a good idea in the context of what you are doing in the other patch, yes. Recovery test 015 would fail as well in your case, right? -- Michael
Вложения
On Wed, Sep 10, 2025 at 09:58:08AM -0700, Jeff Davis wrote: > It seems like XLogFlush() and XLogNeedsFlush() should use the same > test, otherwise you could always get some confusing inconsistency. > Right? Even if the checks are duplicated (dependency could be documented as well), it would make sense to me to plant a check of XLogNeedsFlush() inside XLogFlush(), I think. I have not tried if some parts of the tests blow up when trying to do that even after switching XLogNeedsFlush() to check if WAL inserts are allowed rather than if we are in recovery. -- Michael
Вложения
On Thu, Sep 11, 2025 at 4:51 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Wed, Sep 10, 2025 at 09:58:08AM -0700, Jeff Davis wrote: > > It seems like XLogFlush() and XLogNeedsFlush() should use the same > > test, otherwise you could always get some confusing inconsistency. > > Right? > > Even if the checks are duplicated (dependency could be documented as > well), it would make sense to me to plant a check of XLogNeedsFlush() > inside XLogFlush(), I think. I have not tried if some parts of the > tests blow up when trying to do that even after switching > XLogNeedsFlush() to check if WAL inserts are allowed rather than if we > are in recovery. +1, it really makes XLogFlush() to directly check using XLogNeedsFlush() after adding the "WAL inserts are allowed" check in XLogNeedsFlush(), this is the best way to avoid any inconsistencies in future as well. -- Regards, Dilip Kumar Google
On Fri, Sep 12, 2025 at 8:07 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Sep 11, 2025 at 4:51 AM Michael Paquier <michael@paquier.xyz> wrote: > > > > On Wed, Sep 10, 2025 at 09:58:08AM -0700, Jeff Davis wrote: > > > It seems like XLogFlush() and XLogNeedsFlush() should use the same > > > test, otherwise you could always get some confusing inconsistency. > > > Right? > > > > Even if the checks are duplicated (dependency could be documented as > > well), it would make sense to me to plant a check of XLogNeedsFlush() > > inside XLogFlush(), I think. I have not tried if some parts of the > > tests blow up when trying to do that even after switching > > XLogNeedsFlush() to check if WAL inserts are allowed rather than if we > > are in recovery. > > +1, it really makes XLogFlush() to directly check using > XLogNeedsFlush() after adding the "WAL inserts are allowed" check in > XLogNeedsFlush(), this is the best way to avoid any inconsistencies in > future as well. I tried with the attached patch, at least check-world reports no issue. -- Regards, Dilip Kumar Google
Вложения
On Fri, Sep 12, 2025 at 08:45:36AM +0530, Dilip Kumar wrote: > On Fri, Sep 12, 2025 at 8:07 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: >> +1, it really makes XLogFlush() to directly check using >> XLogNeedsFlush() after adding the "WAL inserts are allowed" check in >> XLogNeedsFlush(), this is the best way to avoid any inconsistencies in >> future as well. > > I tried with the attached patch, at least check-world reports no issue. @@ -2797,7 +2797,7 @@ XLogFlush(XLogRecPtr record) } /* Quick exit if already known flushed */ - if (record <= LogwrtResult.Flush) + if (!XLogNeedsFlush(record)) return; Hmm, no. You are making more expensive a check that is written to be cheap. I was more thinking about an assertion at the bottom of XLogFlush() once a flush is completed. Input and ideas from others are of course welcome on the matter. -- Michael
Вложения
On Fri, Sep 12, 2025 at 9:47 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Fri, Sep 12, 2025 at 08:45:36AM +0530, Dilip Kumar wrote: > > On Fri, Sep 12, 2025 at 8:07 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > >> +1, it really makes XLogFlush() to directly check using > >> XLogNeedsFlush() after adding the "WAL inserts are allowed" check in > >> XLogNeedsFlush(), this is the best way to avoid any inconsistencies in > >> future as well. > > > > I tried with the attached patch, at least check-world reports no issue. > > @@ -2797,7 +2797,7 @@ XLogFlush(XLogRecPtr record) > } > > /* Quick exit if already known flushed */ > - if (record <= LogwrtResult.Flush) > + if (!XLogNeedsFlush(record)) > return; > > Hmm, no. You are making more expensive a check that is written to be > cheap. I was more thinking about an assertion at the bottom of > XLogFlush() once a flush is completed. Input and ideas from others > are of course welcome on the matter. Yeah, asserting it at the end makes sense, as we can ensure that XLogFlush() and XLogNeedsFlush() agree on the same thing without adding additional overhead. Here is the revised one. -- Regards, Dilip Kumar Google
Вложения
On Fri, Sep 12, 2025 at 10:21:27AM +0530, Dilip Kumar wrote: > Yeah, asserting it at the end makes sense, as we can ensure that > XLogFlush() and XLogNeedsFlush() agree on the same thing without > adding additional overhead. Here is the revised one. Melanie and Jeff, what do you think? -- Michael
Вложения
On Fri, 2025-09-12 at 14:04 +0900, Michael Paquier wrote: > On Fri, Sep 12, 2025 at 10:21:27AM +0530, Dilip Kumar wrote: > > Yeah, asserting it at the end makes sense, as we can ensure that > > XLogFlush() and XLogNeedsFlush() agree on the same thing without > > adding additional overhead. Here is the revised one. > > Melanie and Jeff, what do you think? IIUC: during the end-of-recovery checkpoint, a hypothetical call to XLogNeedsFlush() would have incorrectly used the LocalMinRecoveryPoint; and with the change, it would correctly use the flush pointer. That wasn't a live bug because XLogNeedsFlush() was never actually called during the end-of-recovery checkpoint. CreateCheckPoint() called XLogFlush() directly, which correctly checked the flush pointer. But, hypothetically, if that code had logic based around XLogNeedsFlush() before calling XLogFlush(), that would have been a bug. So this fix has no behavior change, but it would make the code more resilient against changes to CreateCheckPoint(), more consistent, and less confusing. Is that right? Assuming I'm right so far, then I agree with changing the test in XLogNeedsFlush() to !XLogInsertAllowed(), as the patch does. The comment in the patch is describing what's happening, but lost the "why". Perhaps something like: "During recovery, we don't flush WAL but update minRecoveryPoint instead. So "needs flush" is taken to mean whether minRecoveryPoint would need to be updated. The end-of-recovery checkpoint still must flush WAL like normal, though, so check !XLogInsertAllowed(). This check must be consistent with XLogFlush()." The commit message is vague about whether there's a live bug or not; I suggest clarifying that the inconsistency between the two functions could have been a bug but wasn't. Also, I generally use past tense for the stuff that's being fixed and present tense for what the patch does. One loose end: there was some discussion about the times when LocalMinRecoveryPoint is valid/correct. I'm not sure I entirely understood -- is that no longer a concern? Regards, Jeff Davis
On Fri, Sep 12, 2025 at 10:02 PM Jeff Davis <pgsql@j-davis.com> wrote: > > On Fri, 2025-09-12 at 14:04 +0900, Michael Paquier wrote: > > On Fri, Sep 12, 2025 at 10:21:27AM +0530, Dilip Kumar wrote: > > > Yeah, asserting it at the end makes sense, as we can ensure that > > > XLogFlush() and XLogNeedsFlush() agree on the same thing without > > > adding additional overhead. Here is the revised one. > > > > Melanie and Jeff, what do you think? > > IIUC: during the end-of-recovery checkpoint, a hypothetical call to > XLogNeedsFlush() would have incorrectly used the LocalMinRecoveryPoint; > and with the change, it would correctly use the flush pointer. > > That wasn't a live bug because XLogNeedsFlush() was never actually > called during the end-of-recovery checkpoint. CreateCheckPoint() > called XLogFlush() directly, which correctly checked the flush pointer. > But, hypothetically, if that code had logic based around > XLogNeedsFlush() before calling XLogFlush(), that would have been a > bug. > > So this fix has no behavior change, but it would make the code more > resilient against changes to CreateCheckPoint(), more consistent, and > less confusing. Is that right? That's right. > Assuming I'm right so far, then I agree with changing the test in > XLogNeedsFlush() to !XLogInsertAllowed(), as the patch does. > > The comment in the patch is describing what's happening, but lost the > "why". Perhaps something like: > > "During recovery, we don't flush WAL but update minRecoveryPoint > instead. So "needs flush" is taken to mean whether minRecoveryPoint > would need to be updated. The end-of-recovery checkpoint still must > flush WAL like normal, though, so check !XLogInsertAllowed(). This > check must be consistent with XLogFlush()." Updated as per suggestion > The commit message is vague about whether there's a live bug or not; I > suggest clarifying that the inconsistency between the two functions > could have been a bug but wasn't. Also, I generally use past tense for > the stuff that's being fixed and present tense for what the patch does. I tried to improve it in v2 > One loose end: there was some discussion about the times when > LocalMinRecoveryPoint is valid/correct. I'm not sure I entirely > understood -- is that no longer a concern? IMHO during crash recovery LocalMinRecoveryPoint and ControlFile->minRecoveryPoint can not initialized until we replay all WAL so those should never get accessed. However if XLogNeedsFlush() were called during the end of the recovery checkpoint it was accessing this which was wrong although it was not a live bug and this patch is making that more resilient. OTOH if we are creating a restartpoint during standby replay or archive recovery then we do access LocalMinRecoveryPoint and ControlFile->minRecoveryPoint because by that time we have already done the minimum recovery require for reaching to the consistent state and we would have initialized ControlFile->minRecoveryPoint. IMHO during crash recovery, the variables LocalMinRecoveryPoint and ControlFile->minRecoveryPoint remain uninitialized until all required WAL has been replayed. Consequently, they must not be accessed during this phase. Previously, calling XLogNeedsFlush() during the "end of recovery checkpoint" incorrectly accessed these uninitialized values. While this was not a live bug, this patch hardens the code against this type of race condition. OTOH, when creating a RestartPoint during standby replay or doing archive recovery, access to both LocalMinRecoveryPoint and ControlFile->minRecoveryPoint is correct. By this stage, it has completed the minimum recovery required to achieve a consistent state, and ControlFile->minRecoveryPoint would have been properly initialized. -- Regards, Dilip Kumar Google
Вложения
On Sun, 2025-09-14 at 13:39 +0530, Dilip Kumar wrote: > I tried to improve it in v2 Thank you. Looks good to me. > > IMHO during crash recovery LocalMinRecoveryPoint and > ControlFile->minRecoveryPoint can not initialized until we replay all > WAL so those should never get accessed. However if XLogNeedsFlush() > were called during the end of the recovery checkpoint it was > accessing > this which was wrong although it was not a live bug and this patch is > making that more resilient. OK, so this will no longer be a concern after the patch (and not a live bug before, anyway). Regards, Jeff Davis
On Sun, Sep 14, 2025 at 11:23 PM Jeff Davis <pgsql@j-davis.com> wrote: > > On Sun, 2025-09-14 at 13:39 +0530, Dilip Kumar wrote: > > I tried to improve it in v2 > > Thank you. Looks good to me. Thanks > > > > IMHO during crash recovery LocalMinRecoveryPoint and > > ControlFile->minRecoveryPoint can not initialized until we replay all > > WAL so those should never get accessed. However if XLogNeedsFlush() > > were called during the end of the recovery checkpoint it was > > accessing > > this which was wrong although it was not a live bug and this patch is > > making that more resilient. > > OK, so this will no longer be a concern after the patch (and not a live > bug before, anyway). That's my take, but I'd like to hear Melanie's thoughts on this point as she raised this concern. -- Regards, Dilip Kumar Google
On Sun, Sep 14, 2025 at 4:10 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Fri, Sep 12, 2025 at 10:02 PM Jeff Davis <pgsql@j-davis.com> wrote: > > > Assuming I'm right so far, then I agree with changing the test in > > XLogNeedsFlush() to !XLogInsertAllowed(), as the patch does. > > > > The comment in the patch is describing what's happening, but lost the > > "why". Perhaps something like: > > > > "During recovery, we don't flush WAL but update minRecoveryPoint > > instead. So "needs flush" is taken to mean whether minRecoveryPoint > > would need to be updated. The end-of-recovery checkpoint still must > > flush WAL like normal, though, so check !XLogInsertAllowed(). This > > check must be consistent with XLogFlush()." > > Updated as per suggestion I like the idea of adding an assert to keep the two in sync. However, two things 1) Since XLogNeedsFlush() refreshes LogwrtResult, is it possible for this assert to fail because the flush pointer was updated? I realize that that is the case with my original patch as well. If this is true, perhaps there is not a way to use XLogNeedsFlush() after XLogFlush() -- only before as a check to see if we must call XLogFlush(). 2) You've added the assert at the bottom which will only cover the flush pointer case and not the min recovery point case. Maybe that's good enough though... > > One loose end: there was some discussion about the times when > > LocalMinRecoveryPoint is valid/correct. I'm not sure I entirely > > understood -- is that no longer a concern? > > IMHO during crash recovery LocalMinRecoveryPoint and > ControlFile->minRecoveryPoint can not initialized until we replay all > WAL so those should never get accessed. However if XLogNeedsFlush() > were called during the end of the recovery checkpoint it was accessing > this which was wrong although it was not a live bug and this patch is > making that more resilient. OTOH if we are creating a restartpoint > during standby replay or archive recovery then we do access > LocalMinRecoveryPoint and ControlFile->minRecoveryPoint because by > that time we have already done the minimum recovery require for > reaching to the consistent state and we would have initialized > ControlFile->minRecoveryPoint. > > IMHO during crash recovery, the variables LocalMinRecoveryPoint and > ControlFile->minRecoveryPoint remain uninitialized until all required > WAL has been replayed. Consequently, they must not be accessed during > this phase. Previously, calling XLogNeedsFlush() during the "end of > recovery checkpoint" incorrectly accessed these uninitialized values. > While this was not a live bug, this patch hardens the code against > this type of race condition. > > OTOH, when creating a RestartPoint during standby replay or doing > archive recovery, access to both LocalMinRecoveryPoint and > ControlFile->minRecoveryPoint is correct. By this stage, it has > completed the minimum recovery required to achieve a consistent state, > and ControlFile->minRecoveryPoint would have been properly > initialized. My confusion was two things: how the startup process is getting a valid minimum recovery point during crash recovery on a standby and how bgwriter and checkpointer avoid reading the control file during crash recovery on the standby. IIUC, the minimum recovery point on the primary will always be InvalidXLogRecPtr. On the standby, during "normal" recovery, all backends (regular backends running read-only queries, checkpointer, etc) update the minimum recovery point so it is ready in case of a crash. During recovery from a crash on the standby, no regular backends will be running read-only queries because those are forbidden until crash recovery finishes. Crash recovery on the standby is finished when WAL has been replayed through the minimum recovery point -- or, at least I thought that was the point of the minimum recovery point. But, the startup process needs to replay WAL through the minimum recovery point. So, where does it get this value from? The control file? And if the startup process gets the minimum recovery point from the ControlFile, then how are bgwriter and checkpointer protected from reading it since they should only see InvalidXLogRecPtr during crash recovery? Looking at this logic in XLogFlush() -> UpdateMinRecoveryPoint() if (XLogRecPtrIsInvalid(LocalMinRecoveryPoint) && InRecovery) { updateMinRecoveryPoint = false; return; } Only the startup process will pass this test. Then other processes will continue to read the control file. This only works if ControlFile->minRecoveryPoint is InvalidXLogRecPtr during crash recovery. As for your patch, I think we should also update the comment at the top of XLogNeedsFlush(). Something like this when combined with your comment update and code. /* - * Test whether XLOG data has been flushed up to (at least) the given position. + * Test whether XLOG data has been flushed up to (at least) the given position + * or whether the minimum recovery point is updated past the given position. * - * Returns true if a flush is still needed. (It may be that someone else - * is already in process of flushing that far, however.) + * Returns true if a flush is still needed or if the minimum recovery point + * must be updated. + * + * It is possible that someone else is already in the process of flushing that + * far or updating the minimum recovery point that far. */ bool XLogNeedsFlush(XLogRecPtr record) { /* - * During recovery, we don't flush WAL but update minRecoveryPoint - * instead. So "needs flush" is taken to mean whether minRecoveryPoint - * would need to be updated. + * During recovery, when WAL inserts are forbidden, "needs flush" is + * really "minimum recovery point needs updating". The end-of-recovery + * checkpoint still must flush WAL like normal, though, so check + * !XLogInsertAllowed(). This check must be consistent with XLogFlush(). */ - if (RecoveryInProgress()) + if (!XLogInsertAllowed()) { + Assert(RecoveryInProgress()); And I don't know if WAL inserts could ever be forbidden outside of recovery (maybe if a read-only database mode is added) -- but maybe it is worth adding an assertion that recovery is in progress? - Melanie
On Mon, Sep 15, 2025 at 10:58 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Sun, Sep 14, 2025 at 4:10 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > Updated as per suggestion > > I like the idea of adding an assert to keep the two in sync. However, two things > > 1) Since XLogNeedsFlush() refreshes LogwrtResult, is it possible for > this assert to fail because the flush pointer was updated? I realize > that that is the case with my original patch as well. If this is true, > perhaps there is not a way to use XLogNeedsFlush() after XLogFlush() > -- only before as a check to see if we must call XLogFlush(). Oops, ignore this -- obviously the flush pointer can only move forwards :) So, this is not an issue. > 2) You've added the assert at the bottom which will only cover the > flush pointer case and not the min recovery point case. Maybe that's > good enough though... I actually think this is fine also. The assert will ensure that XLogNeedsFlush() and XLogFlush() stay in sync WRT which path is taken (either comparing to the min recovery point or comparing to the flush pointer). It doesn't need to recheck what XLogFlush() just checked. As such, we should clarify the comment above the assert in your patch to make it clear that the point of the assert is not to check the logic in XLogFlush() but to protect against drift between XLogNeedsFlush() and XLogFlush() WRT the comparison logic used. - Melanie
On Mon, Sep 15, 2025 at 03:02:36PM -0400, Melanie Plageman wrote: > As such, we should clarify the comment above the assert in your patch > to make it clear that the point of the assert is not to check the > logic in XLogFlush() but to protect against drift between > XLogNeedsFlush() and XLogFlush() WRT the comparison logic used. Agreed. While looking at this patch, I have planted the following assertion in XLogNeedsFlush(): @@ -3142,6 +3150,13 @@ XLogNeedsFlush(XLogRecPtr record) LocalMinRecoveryPointTLI = ControlFile->minRecoveryPointTLI; LWLockRelease(ControlFileLock); + /* + * An invalid minRecoveryPoint can only be accessed in the startup + * process or while in single-user mode. + */ + Assert(AmStartupProcess() || !IsUnderPostmaster || + !XLogRecPtrIsInvalid(LocalMinRecoveryPoint)); Then wondered if we should try to push for more efforts about making XLogNeedsFlush() callable in non-startup non-postmaster processes while we perform crash recovery (aka mininum recovery point is still invalid, because XLogFlush() can be called by the checkpointer since 7ff23c6d277d while in crash recovery). However, I cannot get much excited about this without an actual use case, also knowing that UpdateMinRecoveryPoint() is coded to be defensive enough to discard this case. As a whole, the patch looks like a good balance, able to satisfy the new case you want to handle, Melanie. I am guessing that you'd want to tweak it and apply it yourself, so please feel free. -- Michael
Вложения
On Mon, Sep 15, 2025 at 8:28 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Sun, Sep 14, 2025 at 4:10 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > OTOH, when creating a RestartPoint during standby replay or doing > > archive recovery, access to both LocalMinRecoveryPoint and > > ControlFile->minRecoveryPoint is correct. By this stage, it has > > completed the minimum recovery required to achieve a consistent state, > > and ControlFile->minRecoveryPoint would have been properly > > initialized. > > My confusion was two things: how the startup process is getting a > valid minimum recovery point during crash recovery on a standby and > how bgwriter and checkpointer avoid reading the control file during > crash recovery on the standby. > > IIUC, the minimum recovery point on the primary will always be > InvalidXLogRecPtr. Right > On the standby, during "normal" recovery, all backends (regular > backends running read-only queries, checkpointer, etc) update the > minimum recovery point so it is ready in case of a crash. Right > During recovery from a crash on the standby, no regular backends will > be running read-only queries because those are forbidden until crash > recovery finishes. Crash recovery on the standby is finished when WAL > has been replayed through the minimum recovery point -- or, at least I > thought that was the point of the minimum recovery point. > > But, the startup process needs to replay WAL through the minimum > recovery point. So, where does it get this value from? The control > file? In a recovery scenario, the ControlFile->minRecoveryPoint on a standby server is continuously updated. This ensures that even in the event of a crash, a valid recovery point is available. However, if a crashed standalone server is started as a standby, the ControlFile->minRecoveryPoint is initially unset, and this will be set in ReadRecord()->SwitchIntoArchiveRecovery() once all the WAL from pg_wal directory are applied, this happens right before start streaming from primary. > And if the startup process gets the minimum recovery point from the > ControlFile, then how are bgwriter and checkpointer protected from > reading it since they should only see InvalidXLogRecPtr during crash > recovery? > > Looking at this logic in XLogFlush() -> UpdateMinRecoveryPoint() > > if (XLogRecPtrIsInvalid(LocalMinRecoveryPoint) && InRecovery) > { > updateMinRecoveryPoint = false; > return; > } > > Only the startup process will pass this test. Then other processes > will continue to read the control file. This only works if > ControlFile->minRecoveryPoint is InvalidXLogRecPtr during crash > recovery. Until the startup process hasn't completed the crash recovery i.g. not applied the WAL unto 'ControlFile->minRecoveryPoint' logically 'bgwriter' can not write any buffer whose WAL location is beyond the ControlFile->minRecoveryPoint because we haven't yet applied those WALs and also there should not any RestartPoint()/Checkpoint record before reaching the ControlFile->minRecoveryPoint otherwise we would have started crash recovery from that point. During the crash recovery process, before the WAL has been fully applied up to the ControlFile->minRecoveryPoint, the bgwriter cannot write any buffers that have a page LSN beyond this min recovery point. This is because those WALs which are beyond min recovery point have not yet been applied, so any buffer can not have page_lsn beyond this point. Therefore, I agree that when UpdateMinRecoveryPoint() is invoked by the bgwriter during standby crash recovery, it will indeed retrieve a valid value from ControlFile->minRecoveryPoint. However, IMHO, this cannot be called with LSN greater than ControlFile->minRecoveryPoint itself. Consequently, the conditional check else if (force || LocalMinRecoveryPoint < lsn) will not evaluate to true, and no action will be taken. And as far as the checkpointer is concerned, on standby systems, we only execute RestartPoint when a Checkpoint WAL is detected. And logically, no Checkpoint WAL should exist between the recovery start point and minRecoveryPoint, as crash recovery initiates from the most recent checkpoint redo. -- Regards, Dilip Kumar Google
On Wed, Sep 17, 2025 at 07:05:32PM +0530, Dilip Kumar wrote: > In a recovery scenario, the ControlFile->minRecoveryPoint on a standby > server is continuously updated. This ensures that even in the event of > a crash, a valid recovery point is available. However, if a crashed > standalone server is started as a standby, the > ControlFile->minRecoveryPoint is initially unset, and this will be set > in ReadRecord()->SwitchIntoArchiveRecovery() once all the WAL from > pg_wal directory are applied, this happens right before start > streaming from primary. This last sentence is incomplete. This happens right before streaming or before reading WAL from the archives, where the source of WAL is switched when the startup process waits for some WAL to become available (when fetching a new page). If streaming and archives are set, the order is: 1) local pg_wal 2) archives 3) streaming That's also documented. The startup process can switch between 2 and 3 depending on the availability of the streaming source. > Until the startup process hasn't completed the crash recovery i.g. not > applied the WAL unto 'ControlFile->minRecoveryPoint' logically > 'bgwriter' can not write any buffer whose WAL location is beyond the > ControlFile->minRecoveryPoint because we haven't yet applied those > WALs and also there should not any RestartPoint()/Checkpoint record > before reaching the ControlFile->minRecoveryPoint otherwise we would > have started crash recovery from that point. > > During the crash recovery process, before the WAL has been fully > applied up to the ControlFile->minRecoveryPoint, the bgwriter cannot > write any buffers that have a page LSN beyond this min recovery point. > This is because those WALs which are beyond min recovery point have > not yet been applied, so any buffer can not have page_lsn beyond this > point. Yep. Doing so is pointless, a newer page version is already on disk, and we expect crash recovery to replay up to a point where it can recover. > And as far as the checkpointer is concerned, on standby systems, we > only execute RestartPoint when a Checkpoint WAL is detected. Not exactly. Restart points can be triggered on-the-fly as well depending on the server activity. It's OK to run a manual CHECKPOINT on a standby, for example. > And logically, no Checkpoint WAL should exist between the recovery start > point and minRecoveryPoint, as crash recovery initiates from the most > recent checkpoint redo. We allow restart points even during crash recovery to make restarts more responsive. If the crash recovery phase it very long, this matters because we don't to replay a bunch of WAL if the standby is stopped before crash recovery finishes, still it has enough room to begin and finish a restart point. See 7ff23c6d277d. -- Michael
Вложения
On Tue, Sep 16, 2025 at 09:40:50AM +0900, Michael Paquier wrote: > As a whole, the patch looks like a good balance, able to satisfy the > new case you want to handle, Melanie. I am guessing that you'd want > to tweak it and apply it yourself, so please feel free. Hearing nothing, I'd like to move ahead with this improvement. I have tweaked a bit the comments, as suggested. If one switches the check of XLogNeedsFlush() from XLogInsertAllowed() to RecoveryInProgress(), the recovery test 015 blows up as expected. Any opinions or more word-smithing required? -- Michael
Вложения
On Sep 18, 2025, at 07:20, Michael Paquier <michael@paquier.xyz> wrote:On Tue, Sep 16, 2025 at 09:40:50AM +0900, Michael Paquier wrote:As a whole, the patch looks like a good balance, able to satisfy the
new case you want to handle, Melanie. I am guessing that you'd want
to tweak it and apply it yourself, so please feel free.
Hearing nothing, I'd like to move ahead with this improvement. I have
tweaked a bit the comments, as suggested. If one switches the check
of XLogNeedsFlush() from XLogInsertAllowed() to RecoveryInProgress(),
the recovery test 015 blows up as expected.
Any opinions or more word-smithing required?
--
Michael
<v3-0001-Make-XLogFlush-and-XLogNeedsFlush-decision-more-c.patch>
My 5 cents:
```
+ * XLogInsertAllowed() is used as an end-of-recovery checkpoint is
+ * launched while recovery is still in progress, RecoveryInProgress()
“XLogInsertAllowed() is used as an end-of-recovery checkpoint is launched” sounds like XLogInsertAllowed() is being “used as a checkpoint,” how about rephrase as:
XLogInsertAllowed() is needed because an end-of-recovery checkpoint can be launched while recovery is still in progress,
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
HighGo Software Co., Ltd.
https://www.highgo.com/
On Wed, Sep 17, 2025 at 7:20 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Tue, Sep 16, 2025 at 09:40:50AM +0900, Michael Paquier wrote: > > As a whole, the patch looks like a good balance, able to satisfy the > > new case you want to handle, Melanie. I am guessing that you'd want > > to tweak it and apply it yourself, so please feel free. > > Hearing nothing, I'd like to move ahead with this improvement. I have > tweaked a bit the comments, as suggested. If one switches the check > of XLogNeedsFlush() from XLogInsertAllowed() to RecoveryInProgress(), > the recovery test 015 blows up as expected. > > Any opinions or more word-smithing required? Commit message looks good. In terms of comments, I think it is best to update the comment above XLogNeedsFlush(). Something like : /* - * Test whether XLOG data has been flushed up to (at least) the given position. + * Test whether XLOG data has been flushed up to (at least) the given position + * or whether the minimum recovery point is updated past the given position. * - * Returns true if a flush is still needed. (It may be that someone else - * is already in process of flushing that far, however.) + * Returns true if a flush is still needed or if the minimum recovery point + * must be updated. + * + * It is possible that someone else is already in the process of flushing that + * far or updating the minimum recovery point that far. Though maybe that's too literal... > + /* > + * Cross-check XLogNeedsFlush(). Some of the checks of XLogFlush() and > + * XLogNeedsFlush() are duplicated, and this assertion ensures that these > + * remain consistent. > + */ > + Assert(!XLogNeedsFlush(record)); > } > > /* > @@ -3114,8 +3121,13 @@ XLogNeedsFlush(XLogRecPtr record) > * During recovery, we don't flush WAL but update minRecoveryPoint > * instead. So "needs flush" is taken to mean whether minRecoveryPoint > * would need to be updated. > + * > + * XLogInsertAllowed() is used as an end-of-recovery checkpoint is > + * launched while recovery is still in progress, RecoveryInProgress() > + * returning true in this case. This check should be consistent with the > + * one in XLogFlush(). > */ I can't quite parse the wording of the above comment. - Melanie
On Wed, Sep 17, 2025 at 09:23:05PM -0400, Melanie Plageman wrote: > In terms of comments, I think it is best to update the comment above > XLogNeedsFlush(). Something like : > > /* > - * Test whether XLOG data has been flushed up to (at least) the given position. > + * Test whether XLOG data has been flushed up to (at least) the given position > + * or whether the minimum recovery point is updated past the given position. Okay. > - * Returns true if a flush is still needed. (It may be that someone else > - * is already in process of flushing that far, however.) > + * Returns true if a flush is still needed or if the minimum recovery point > + * must be updated. Okay with that as well. Why not if you find that confusing.. > + * It is possible that someone else is already in the process of flushing that > + * far or updating the minimum recovery point that far. I am not sure about the value this part brings, considering the addition with the two other paragraphs. >> /* >> @@ -3114,8 +3121,13 @@ XLogNeedsFlush(XLogRecPtr record) >> * During recovery, we don't flush WAL but update minRecoveryPoint >> * instead. So "needs flush" is taken to mean whether minRecoveryPoint >> * would need to be updated. >> + * >> + * XLogInsertAllowed() is used as an end-of-recovery checkpoint is >> + * launched while recovery is still in progress, RecoveryInProgress() >> + * returning true in this case. This check should be consistent with the >> + * one in XLogFlush(). >> */ > > I can't quite parse the wording of the above comment. Yep. I am missing an "if" here for "is used as, if an end-of-recovery", but I am coming with simpler, like: Using XLogInsertAllowed() rather than RecoveryInProgress() matters for the case of an end-of-recovery checkpoint, where WAL data is flushed. This check should be consistent with the one in XLogFlush(). -- Michael
Вложения
On Thu, Sep 18, 2025 at 9:24 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Wed, Sep 17, 2025 at 09:23:05PM -0400, Melanie Plageman wrote: > > In terms of comments, I think it is best to update the comment above > > XLogNeedsFlush(). Something like : > > > > /* > > - * Test whether XLOG data has been flushed up to (at least) the given position. > > + * Test whether XLOG data has been flushed up to (at least) the given position > > + * or whether the minimum recovery point is updated past the given position. > > Okay. > > > - * Returns true if a flush is still needed. (It may be that someone else > > - * is already in process of flushing that far, however.) > > + * Returns true if a flush is still needed or if the minimum recovery point > > + * must be updated. > > Okay with that as well. Why not if you find that confusing.. > > > + * It is possible that someone else is already in the process of flushing that > > + * far or updating the minimum recovery point that far. > > I am not sure about the value this part brings, considering the > addition with the two other paragraphs. I think this comment is a side note which is stating that it is possible that while XLogNeedFlush() is deciding that based on the current flush position or min recovery point parallely someone might flush beyond that point. And it was existing comment which has been improved by adding min recovery points, so I think it makes sense. > >> /* > >> @@ -3114,8 +3121,13 @@ XLogNeedsFlush(XLogRecPtr record) > >> * During recovery, we don't flush WAL but update minRecoveryPoint > >> * instead. So "needs flush" is taken to mean whether minRecoveryPoint > >> * would need to be updated. > >> + * > >> + * XLogInsertAllowed() is used as an end-of-recovery checkpoint is > >> + * launched while recovery is still in progress, RecoveryInProgress() > >> + * returning true in this case. This check should be consistent with the > >> + * one in XLogFlush(). > >> */ > > > > I can't quite parse the wording of the above comment. > > Yep. I am missing an "if" here for "is used as, if an > end-of-recovery", but I am coming with simpler, like: > Using XLogInsertAllowed() rather than RecoveryInProgress() matters for > the case of an end-of-recovery checkpoint, where WAL data is flushed. > This check should be consistent with the one in XLogFlush(). I tried improving this comment as well. Feel free to disregard it if you think it's not improving it. -- Regards, Dilip Kumar Google
Вложения
On Thu, Sep 18, 2025 at 05:07:00PM +0530, Dilip Kumar wrote: > I think this comment is a side note which is stating that it is > possible that while XLogNeedFlush() is deciding that based on the > current flush position or min recovery point parallely someone might > flush beyond that point. And it was existing comment which has been > improved by adding min recovery points, so I think it makes sense. Indeed. I have kept this one after drinking more caffeine, rewording it slightly. > I tried improving this comment as well. Feel free to disregard it if > you think it's not improving it. The new additions in XLogNeedsFlush() felt overweight, though, so I have kept a shorter and reworded version. Then, applied the result. Do we want to make the order of the checks to be more consistent in both routines? These would require a separate set of double-checks and review, but while we're looking at this area of the code we may as tweak it more.. -- Michael
Вложения
On Fri, Sep 19, 2025 at 1:28 AM Michael Paquier <michael@paquier.xyz> wrote: > > Do we want to make the order of the checks to be more consistent in > both routines? These would require a separate set of double-checks > and review, but while we're looking at this area of the code we may as > tweak it more.. That makes sense to me to avoid future questions such as mine. - Melanie
On Mon, Sep 22, 2025 at 08:09:18AM -0400, Melanie Plageman wrote: > On Fri, Sep 19, 2025 at 1:28 AM Michael Paquier <michael@paquier.xyz> wrote: >> Do we want to make the order of the checks to be more consistent in >> both routines? These would require a separate set of double-checks >> and review, but while we're looking at this area of the code we may as >> tweak it more.. > > That makes sense to me to avoid future questions such as mine. Would somebody like to propose a patch? Doing so on this thread would be OK for me, but perhaps it would be better to discuss that on a new one as we would deal with a separate problem. -- Michael
Вложения
On Fri, Sep 19, 2025 at 10:58 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Thu, Sep 18, 2025 at 05:07:00PM +0530, Dilip Kumar wrote: > > I think this comment is a side note which is stating that it is > > possible that while XLogNeedFlush() is deciding that based on the > > current flush position or min recovery point parallely someone might > > flush beyond that point. And it was existing comment which has been > > improved by adding min recovery points, so I think it makes sense. > > Indeed. I have kept this one after drinking more caffeine, rewording > it slightly. > > > I tried improving this comment as well. Feel free to disregard it if > > you think it's not improving it. > > The new additions in XLogNeedsFlush() felt overweight, though, so I > have kept a shorter and reworded version. Then, applied the result. Thanks. > Do we want to make the order of the checks to be more consistent in > both routines? These would require a separate set of double-checks > and review, but while we're looking at this area of the code we may as > tweak it more.. I see both routines have the same order i.e. first check if (!XLogInsertAllowed()) and then if (record <= LogwrtResult.Flush), what am I missing? -- Regards, Dilip Kumar Google
On Wed, Sep 24, 2025 at 7:58 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Fri, Sep 19, 2025 at 10:58 AM Michael Paquier <michael@paquier.xyz> wrote: > > > > Do we want to make the order of the checks to be more consistent in > > both routines? These would require a separate set of double-checks > > and review, but while we're looking at this area of the code we may as > > tweak it more.. > > I see both routines have the same order i.e. first check if > (!XLogInsertAllowed()) and then if (record <= LogwrtResult.Flush), > what am I missing? I think he's referring to the order of checks inside UpdateMinRecoveryPoint(). Something like the attached patch. And, if I'm not mistaken, there was another idea mentioned in [1] which was to move to using GetRecoveryState() in XLogNeedsFlush() and UpdateMinRecoveryPoint instead of relying on the variable InRecovery + XLogRecPtrIsInvalid(LocalMinRecoveryPoint) to distinguish crash recovery. Is this roughly what you were thinking, Michael? diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 3929b2ff224..73d4e62e4eb 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -2714,7 +2714,7 @@ UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force) * available is replayed in this case. This also saves from extra locks * taken on the control file from the startup process. */ - if (XLogRecPtrIsInvalid(LocalMinRecoveryPoint) && InRecovery) + if (GetRecoveryState() == RECOVERY_STATE_CRASH) { updateMinRecoveryPoint = false; return; @@ -3146,7 +3146,7 @@ XLogNeedsFlush(XLogRecPtr record) * which cannot update its local copy of minRecoveryPoint as long as * it has not replayed all WAL available when doing crash recovery. */ - if (XLogRecPtrIsInvalid(LocalMinRecoveryPoint) && InRecovery) + if (GetRecoveryState() == RECOVERY_STATE_CRASH) { updateMinRecoveryPoint = false; return false; Because we reordered the checks, we will set updateMinRecoveryPoint to false the first time and avoid the spin lock acquisition in GetRecoveryState() happening more than once. These should probably be in the same commit. I think that if what I have is correct, it is a clarity improvement. - Melanie [1] https://www.postgresql.org/message-id/aMIHNRTP6Wj6vw1s%40paquier.xyz
Вложения
On Wed, Sep 24, 2025 at 05:33:08PM -0400, Melanie Plageman wrote: > I think he's referring to the order of checks inside > UpdateMinRecoveryPoint(). Something like the attached patch. Yep, thanks. What you are doing with XLogNeedsFlush() looks correct. > And, if I'm not mistaken, there was another idea mentioned in [1] > which was to move to using GetRecoveryState() in XLogNeedsFlush() and > UpdateMinRecoveryPoint instead of relying on the variable InRecovery + > XLogRecPtrIsInvalid(LocalMinRecoveryPoint) to distinguish crash > recovery. The removal of InRecovery means that we would disable a bit more aggressively updateMinRecoveryPoint for other processes than the startup process as InRecovery is only set in the startup process, which should be OK. Some comments of XLogNeedsFlush(), like the comment block on top of the new GetRecoveryState() call, would need a refresh. XLogNeedsFlush() has a bit further down your change a path where updateMinRecoveryPoint is set to false for other processes than the startup process. A shortcut based on GetRecoveryState() should make its removal possible, simplifying a bit more the code. > These should probably be in the same commit. I think that if what I > have is correct, it is a clarity improvement. I'd rather tackle each thing separately. We're changing slightly different things here: depending on GetRecoveryState() is one thing a bit more invasive than the second thing, which is to switch the order of the checks of updateMinRecoveryPoint. Perhaps I'm the only one picky here as the lines are thin, though :o) -- Michael
Вложения
On Thu, Sep 25, 2025 at 04:53:29PM +0900, Michael Paquier wrote: > I'd rather tackle each thing separately. I have just applied the patch to reorder the checks as of bb68cde4136b. If you have patches for the rest, feel free to submit them. -- Michael