Thread: VM corruption on standby
Hi hackers!

I was reviewing the patch about removing xl_heap_visible and found the VM/WAL machinery very interesting. At Yandex we had several incidents with corrupted VM, and at pgconf.dev colleagues from AWS confirmed that they saw something similar too. So I toyed around and accidentally wrote a test that reproduces $subj.

I think the corruption happens as follows:
0. we create a table with one frozen tuple
1. the next heap_insert() clears the VM bit and hangs immediately; nothing was logged yet
2. the VM buffer is flushed to disk by the checkpointer or bgwriter
3. the primary is killed with -9
Now we have a page that is ALL_VISIBLE/ALL_FROZEN on the standby, but clear VM bits on the primary.
4. a subsequent insert does not set XLH_LOCK_ALL_FROZEN_CLEARED in its WAL record
5. pg_visibility detects corruption

Interestingly, in an off-list conversation Melanie explained to me how ALL_VISIBLE is protected from this: WAL-logging depends on the PD_ALL_VISIBLE heap page bit, not on the state of the VM. But for ALL_FROZEN this is not the case:

    /* Clear only the all-frozen bit on visibility map if needed */
    if (PageIsAllVisible(page) &&
        visibilitymap_clear(relation, block, vmbuffer,
                            VISIBILITYMAP_ALL_FROZEN))
        cleared_all_frozen = true; // this won't happen due to the VM buffer flushed before the crash

Anyway, the test reproduces corruption of both bits. It also reproduces selecting deleted data on the standby.

The test is not intended to be committed when we fix the problem, so some waits are simulated with sleep(1) and the test is placed in modules/test_slru where it was easier to write. But if we ever want something like this, I can design a less hacky version. And, probably, a more generic one.

Thanks!

Best regards, Andrey Borodin.
Attachments
Hi Andrey,

> I was reviewing the patch about removing xl_heap_visible and found the VM/WAL machinery very interesting.
> At Yandex we had several incidents with corrupted VM and at pgconf.dev colleagues from AWS confirmed that they saw something similar too.
> So I toyed around and accidentally wrote a test that reproduces $subj.
>
> I think the corruption happens as follows:
> 0. we create a table with one frozen tuple
> 1. next heap_insert() clears VM bit and hangs immediately, nothing was logged yet
> 2. VM buffer is flushed on disk with checkpointer or bgwriter
> 3. primary is killed with -9
> now we have a page that is ALL_VISIBLE/ALL_FROZEN on standby, but clear VM bits on primary
> 4. subsequent insert does not set XLH_LOCK_ALL_FROZEN_CLEARED in its WAL record
> 5. pg_visibility detects corruption
>
> Interestingly, in an off-list conversation Melanie explained to me how ALL_VISIBLE is protected from this: WAL-logging depends on the PD_ALL_VISIBLE heap page bit, not the state of the VM. But for ALL_FROZEN this is not the case:
>
> /* Clear only the all-frozen bit on visibility map if needed */
> if (PageIsAllVisible(page) &&
>     visibilitymap_clear(relation, block, vmbuffer,
>                         VISIBILITYMAP_ALL_FROZEN))
>     cleared_all_frozen = true; // this won't happen due to flushed VM buffer before a crash
>
> Anyway, the test reproduces corruption of both bits. And also reproduces selecting deleted data on standby.

Great find. I executed your test on a pretty much regular Linux x64 machine and indeed it failed:

```
not ok 1 - pg_check_frozen() observes corruption
not ok 2 - pg_check_visible() observes corruption
not ok 3 - deleted data returned by select
1..3
# test failed
----------------------------------- stderr -----------------------------------
# Failed test 'pg_check_frozen() observes corruption'
# at /home/eax/projects/c/postgresql/src/test/modules/test_slru/t/001_multixact.pl line 110.
#          got: '(0,2)
# (0,3)
# (0,4)'
#     expected: ''
# Failed test 'pg_check_visible() observes corruption'
# at /home/eax/projects/c/postgresql/src/test/modules/test_slru/t/001_multixact.pl line 111.
#          got: '(0,2)
# (0,4)'
#     expected: ''
# Failed test 'deleted data returned by select'
# at /home/eax/projects/c/postgresql/src/test/modules/test_slru/t/001_multixact.pl line 112.
#          got: '2'
#     expected: ''
# Looks like you failed 3 tests of 3.
```

This is a tricky bug. Do you also have a proposal for a particular fix?

> The test is not intended to be committed when we fix the problem, so some waits are simulated with sleep(1) and the test is placed at modules/test_slru where it was easier to write. But if we ever want something like this - I can design a less hacky version. And, probably, more generic.

IMO - yes, we do need this regression test.
Hi,

> This is a tricky bug. Do you also have a proposal of a particular fix?

If my understanding is correct, we should make a WAL record with the XLH_LOCK_ALL_FROZEN_CLEARED flag *before* we modify the VM but within the same critical section (in order to avoid race conditions within the same backend).

A draft patch is attached. It makes the test pass and doesn't seem to break any other tests.

Thoughts?
Attachments
> > This is a tricky bug. Do you also have a proposal of a particular fix?
>
> If my understanding is correct, we should make a WAL record with the
> XLH_LOCK_ALL_FROZEN_CLEARED flag *before* we modify the VM but within
> the same critical section (in order to avoid race conditions within
> the same backend).

I meant instance, not backend. Sorry for the confusion.

> A draft patch is attached. It makes the test pass and doesn't seem to
> break any other tests.
Hi again,

> I meant instance, not backend. Sorry for confusion.

It looks like I completely misunderstood what START_CRIT_SECTION() / END_CRIT_SECTION() are for here. Simply ignore this part :)

Apologies for the noise.
Hi,

> If my understanding is correct, we should make a WAL record with the
> XLH_LOCK_ALL_FROZEN_CLEARED flag *before* we modify the VM but within
> the same critical section [...]
>
> A draft patch is attached. It makes the test pass and doesn't seem to
> break any other tests.
>
> Thoughts?

In order not to forget - assuming I'm not wrong about the cause of the issue, we might want to recheck the order of visibilitymap_* and XLog* calls in the following functions too:

- heap_multi_insert
- heap_delete
- heap_update
- heap_lock_tuple
- heap_lock_updated_tuple_rec

At a quick look, all the named functions modify the VM before making a corresponding WAL record. This can cause a similar issue:

1. VM modified
2. evicted asynchronously before logging
3. kill -9
4. different state of VM on primary and standby
> On 7 Aug 2025, at 17:09, Aleksander Alekseev <aleksander@tigerdata.com> wrote:
>
> If my understanding is correct, we should make a WAL record with the
> XLH_LOCK_ALL_FROZEN_CLEARED flag *before* we modify the VM but within
> the same critical section (in order to avoid race conditions within
> the same backend).

Well, the test passes because you moved the injection point to a very safe position. I can't comment on other aspects of moving visibilitymap_clear() around.

The approach seems viable to me, but before fixing anything I'd like to understand why PD_ALL_VISIBLE in the heap page header did not save the day.

Best regards, Andrey Borodin.
> On 7 Aug 2025, at 18:54, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> moved injection point to a very safe position.

BTW, your fix also fixes the ALL_FROZEN stuff, just because the WAL for the heap insert is already emitted by the time of the kill -9. I want to emphasize that it seems to me that the position of the injection point is not a hint, but rather coincidental.

I concur that all other users of visibilitymap_clear() will likely need to be fixed. But only when we have a good picture of what exactly is broken.

Best regards, Andrey Borodin.
Hi Andrey,

> the test passes because you moved injection point to a very safe position
> [...]
> I want to emphasize that it seems to me that position of injection point is not a hint, but rather coincidental.

Well, I wouldn't say that the test passes merely because the location of the injection point was moved. For sure it was moved, because the visibilitymap_clear() call was moved. Maybe I misunderstood the intent of the test. Wasn't it to call the injection point right after updating the VM?

I tried to place it between updating the WAL and updating the VM and the effect was the same - the test still passes. In any case we can place it anywhere we want if we agree to include the test in the final version of the patch.

> I concur that all other users of visibilitymap_clear() likely will need to be fixed.

Right, I realized there are a few places besides heapam.c that might need a change.

> The approach seems viable to me, but I'd like to have understanding why PD_ALL_VISIBLE in a heap page header did not save the day before fixing anything
> ... But only when we have a good picture what exactly is broken.

Agreed. I would especially like to know the opinion of somebody who has been hacking on Postgres longer than I have. Perhaps there was a good reason to update the VM *before* creating WAL records that I'm unaware of.
> On 7 Aug 2025, at 19:36, Aleksander Alekseev <aleksander@tigerdata.com> wrote:
>
> Maybe I misunderstood the intent of the test.

You understood my intent in writing the test precisely. But it fails not due to the bug I anticipated!

So far I have noticed that the test passes if I move the injection point before PageClearAllVisible(BufferGetPage(buffer)); or after writing WAL.

I also found that at the moment of the kill -9 the checkpointer flushes the heap page to disk despite the content lock. I haven't found who released the content lock though.

Best regards, Andrey Borodin.
> On 9 Aug 2025, at 18:28, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> I also found that at the moment of the kill -9 the checkpointer flushes the heap page to disk despite the content lock. I haven't found who released the content lock though.

I've written this message and understood: it's LWLockReleaseAll().

0. checkpointer is going to flush a heap buffer but waits on the content lock
1. client resets PD_ALL_VISIBLE on the page
2. postmaster is killed and commands the client to go down
3. client calls LWLockReleaseAll() at ProcKill() (?)
4. checkpointer flushes the buffer with the reset PD_ALL_VISIBLE that is not WAL-logged to the standby
5. subsequent deletes do not log resetting this bit
6. deleted data is observable on the standby with an IndexOnlyScan

Any idea how to fix this?

Best regards, Andrey Borodin.
On Thu, 7 Aug 2025 at 21:36, Aleksander Alekseev <aleksander@tigerdata.com> wrote:
> Perhaps there was a good
> reason to update the VM *before* creating WAL records I'm unaware of.

Looks like 503c730 intentionally does it this way; however, I have not yet fully understood the reasoning behind it.

--
Best regards,
Kirill Reshke
Hi Andrey,

> 0. checkpointer is going to flush a heap buffer but waits on content lock
> 1. client is resetting PD_ALL_VISIBLE from page
> 2. postmaster is killed and command client to go down
> 3. client calls LWLockReleaseAll() at ProcKill() (?)
> 4. checkpointer flushes buffer with reset PD_ALL_VISIBLE that is not WAL-logged to standby
> 5. subsequent deletes do not log resetting this bit
> 6. deleted data is observable on standby with IndexOnlyScan

Thanks for investigating this in more detail. If this is indeed what happens, it is a violation of the "log before changing" approach. For this reason we have PageHeaderData.pd_lsn, for instance - to make sure pages are evicted only *after* the record that changed them is written to disk (because WAL records can't be applied to pages from the future).

I guess the intent here could be an optimization of some sort, but the facts that 1. the instance can be killed at any time and 2. there might be replicas - were not considered.

> Any idea how to fix this?

IMHO: log the changes first, then allow the page to be evicted.
On Sun, 10 Aug 2025 at 01:55, Aleksander Alekseev <aleksander@tigerdata.com> wrote:
> For this reason we have PageHeaderData.pd_lsn for instance - to make sure
> pages are evicted only *after* the record that changed it is written
> to disk (because WAL records can't be applied to pages from the
> future).

We don't bump the LSN of the heap page when setting the visibility map bit.

> I guess the intent here could be to do an optimization of some sort
> but the facts that 1. the instance can be killed at any time and 2.
> there might be replicas - were not considered.
> IMHO: logging the changes first, then allowing to evict the page.

Clearing the VM before logging the changes was intentional [0]. So I assume we should not change the approach, but rather just tweak things a bit to make the whole thing work.

[0] https://www.postgresql.org/message-id/BANLkTimuLk4RHXSQHEEiYGbxiXp2mh5KCA%40mail.gmail.com

--
Best regards,
Kirill Reshke
> On 9 Aug 2025, at 23:54, Aleksander Alekseev <aleksander@tigerdata.com> wrote:
>
> IMHO: logging the changes first, then allowing to evict the page.

The VM and buffer-manager code does not allow a flush of a buffer until the changes are logged. The problem is that our crash-exiting system destroys the locks that protect the buffer from being flushed.

Best regards, Andrey Borodin.
On Wed, 6 Aug 2025 at 20:00, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> Hi hackers!
>
> I was reviewing the patch about removing xl_heap_visible and found the VM/WAL machinery very interesting.
> At Yandex we had several incidents with corrupted VM and at pgconf.dev colleagues from AWS confirmed that they saw something similar too.
> So I toyed around and accidentally wrote a test that reproduces $subj.
>
> I think the corruption happens as follows:
> 0. we create a table with one frozen tuple
> 1. next heap_insert() clears VM bit and hangs immediately, nothing was logged yet
> 2. VM buffer is flushed on disk with checkpointer or bgwriter
> 3. primary is killed with -9
> now we have a page that is ALL_VISIBLE/ALL_FROZEN on standby, but clear VM bits on primary
> 4. subsequent insert does not set XLH_LOCK_ALL_FROZEN_CLEARED in its WAL record
> 5. pg_visibility detects corruption
>
> Anyway, the test reproduces corruption of both bits. And also reproduces selecting deleted data on standby.
>
> The test is not intended to be committed when we fix the problem, so some waits are simulated with sleep(1) and the test is placed at modules/test_slru where it was easier to write. But if we ever want something like this - I can design a less hacky version. And, probably, more generic.
>
> Thanks!
>
> Best regards, Andrey Borodin.

The attached reproduces the same but without any standby node. CHECKPOINT somehow manages to flush the heap page when the instance is kill-9-ed.
As a result, we have an inconsistency between the heap and VM pages:

```
reshke=# select * from pg_visibility('x');
 blkno | all_visible | all_frozen | pd_all_visible
-------+-------------+------------+----------------
     0 | t           | t          | f
(1 row)
```

Notice I moved the INJECTION point one line above visibilitymap_clear. Without this change, such behaviour was also reproduced, but with much lower frequency.

--
Best regards,
Kirill Reshke
Attachments
On Tue, 12 Aug 2025 at 10:38, I wrote:
> CHECKPOINT
> somehow manages to flush the heap page when instance kill-9-ed.

This corruption does not reproduce without the CHECKPOINT call; however, I do not see any suspicious syscall that the CHECKPOINT process makes. It does not write anything to disk here, does it? PFA strace.

--
Best regards,
Kirill Reshke
Attachments
On Wed, 6 Aug 2025 at 20:00, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> Hi hackers!
>
> I was reviewing the patch about removing xl_heap_visible and found the VM/WAL machinery very interesting.
> At Yandex we had several incidents with corrupted VM and at pgconf.dev colleagues from AWS confirmed that they saw something similar too.

While this aims to find existing VM corruption (I mean, in PG <= 17), this reproducer does not seem to work on pg17. At least, I did not manage to reproduce this scenario on pg17.

This makes me think this exact corruption may be pg18-only. Is it possible that AIO is somehow involved here?

--
Best regards,
Kirill Reshke
On Tue, 12 Aug 2025 at 13:00, I wrote:
> While this aims to find existing VM corruption (i mean, in PG <= 17),
> this reproducer does not seem to work on pg17. At least, I did not
> manage to reproduce this scenario on pg17.
>
> This makes me think this exact corruption may be pg18-only. Is it
> possible that AIO is somehow involved here?

First of all, the "corruption" is reproducible with io_method = sync, so AIO is not under suspicion.

Then, I did a gdb session many times and ended up with the conclusion that this test is NOT a valid corruption reproducer. The thing is, when you involve injection point logic, due to how injection points are implemented, you allow postgres to enter the WaitLatch function, which has its own logic for postmaster death handling [0]. So, when we add an injection point here, we allow this sequence of events to happen:

1) INSERT enters `heap_insert`, modifies the HEAP page, reaches the injection point and hangs.
2) The CHECKPOINT process tries to FLUSH this page and waits.
3) kill -9 to postmaster.
4) INSERT wakes up on postmaster death, goes to [0] and releases all locks.
5) The CHECKPOINT-er flushes the HEAP page to disk, causing corruption.

This execution will NOT happen without injection points. So, overall, injection points are not suitable for this critical-section testing (I think).

==

Off-list Andrey sent me this patch:

```
diff --git a/src/backend/storage/ipc/waiteventset.c b/src/backend/storage/ipc/waiteventset.c
index 7c0e66900f9..e89e1d115cb 100644
--- a/src/backend/storage/ipc/waiteventset.c
+++ b/src/backend/storage/ipc/waiteventset.c
@@ -1044,6 +1044,7 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
 	long		cur_timeout = -1;
 
 	Assert(nevents > 0);
+	Assert(CritSectionCount == 0);
 
 	/*
 	 * Initialize timeout if requested.  We must record the current time so
```

The objective is to confirm our assumption that the WaitEventSetWait call ought not to occur during critical sections.
This patch causes `make check` to fail, indicating that this assumption is incorrect. The assertion breaks due to the AdvanceXLInsertBuffer call (which uses condvar logic) inside XLogInsertRecord.

I did not find any doc or other piece of information indicating whether WaitEventSetWait is allowed inside critical sections. But I do think this is bad, because we do not process interrupts during critical sections, so it is unclear to me why we should handle postmaster death any differently.

[0] https://github.com/postgres/postgres/blob/393e0d2314050576c9c039853fdabe7f685a4f47/src/backend/storage/ipc/waiteventset.c#L1260-L1261

--
Best regards,
Kirill Reshke
On Wed, 13 Aug 2025 at 16:15, I wrote:
> I did not find any doc or other piece of information indicating
> whether WaitEventSetWait and critical sections are allowed. But I do
> think this is bad, because we do not process interruptions during
> critical sections, so it is unclear to me why we should handle
> postmaster death any differently.

Maybe I'm very wrong about this, but I currently suspect there is corruption involving CHECKPOINT, a process in a CRIT section, and kill -9. The scenario I am trying to reproduce is the following:

1) Some process p1 locks some buffer (name it buf1), enters a CRIT section, calls MarkBufferDirty and hangs inside XLogInsert on a CondVar in (GetXLogBuffer -> AdvanceXLInsertBuffer).
2) CHECKPOINT (p2) starts and tries to FLUSH dirty buffers, awaiting the lock on buf1.
3) Postmaster is kill-9-ed.
4) The signal of postmaster death is delivered to p1; it wakes up in the WaitLatch/WaitEventSetWaitBlock functions, checks postmaster aliveness, and exits, releasing all locks.
5) p2 acquires the lock on buf1 and flushes it to disk.
6) The signal of postmaster death is delivered to p2; p2 exits.

And we now have a case where the buffer is flushed to disk, while the xlog record that describes this change never makes it to disk. This is very bad.

To be clear, I am trying to avoid the use of injection points to reproduce the corruption. I am not yet successful in this though.

--
Best regards,
Kirill Reshke
On Thu, 14 Aug 2025 at 10:41, Kirill Reshke <reshkekirill@gmail.com> wrote:
> The scenario I am trying to reproduce is the following:
>
> 1) Some process p1 locks some buffer (name it buf1), enters CRIT
> section, calls MarkBufferDirty and hangs inside XLogInsert on CondVar
> in (GetXLogBuffer -> AdvanceXLInsertBuffer).
> 2) CHECKPOINT (p2) starts and tries to FLUSH dirty buffers, awaiting lock on buf1
> 3) Postmaster kill-9-ed
> 4) signal of postmaster death delivered to p1, it wakes up in
> WaitLatch/WaitEventSetWaitBlock functions, checks postmaster
> aliveness, and exits releasing all locks.
> 5) p2 acquires locks on buf1 and flushes it to disk.
> 6) signal of postmaster death delivered to p2, p2 exits.

Andrey told me to create a CF entry and attach the fix, so I'm doing it.

[0] https://commitfest.postgresql.org/patch/5964/

--
Best regards,
Kirill Reshke
Attachments
Kirill Reshke <reshkekirill@gmail.com> writes:
> [ v1-0001-Do-not-exit-on-postmaster-death-ever-inside-CRIT-.patch ]

I do not like this patch one bit: it will replace one set of problems with another set, namely systems that fail to shut down.

I think the actual bug here is the use of proc_exit(1) after observing postmaster death. That is what creates the hazard, because it releases the locks that are preventing other processes from observing the inconsistent state in shared memory. Compare this to what we do, for example, on receipt of SIGQUIT:

	/*
	 * We DO NOT want to run proc_exit() or atexit() callbacks -- we're here
	 * because shared memory may be corrupted, so we don't want to try to
	 * clean up our transaction.  Just nail the windows shut and get out of
	 * town.  The callbacks wouldn't be safe to run from a signal handler,
	 * anyway.
	 *
	 * Note we do _exit(2) not _exit(0).  This is to force the postmaster into
	 * a system reset cycle if someone sends a manual SIGQUIT to a random
	 * backend.  This is necessary precisely because we don't clean up our
	 * shared memory state.  (The "dead man switch" mechanism in pmsignal.c
	 * should ensure the postmaster sees this as a crash, too, but no harm in
	 * being doubly sure.)
	 */
	_exit(2);

So I think the correct fix here is s/proc_exit(1)/_exit(2)/ in the places that are responding to postmaster death. There might be more than just WaitEventSetWaitBlock; I didn't look.

			regards, tom lane
> On 17 Aug 2025, at 17:33, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> So I think the correct fix here is s/proc_exit(1)/_exit(2)/ in the
> places that are responding to postmaster death.

+1. But should we _exit(2) only in a critical section, or always in case of postmaster death?

Another question that was bothering Kirill: do we want a test that naturally reproduces WaitEventSetWaitBlock() under a critical section so that we can observe a corruption? Or is it kind of obvious from the code that such things might happen?

The existing test adds WaitEventSetWaitBlock() via an injection point to a place where it was not present before. Though with the existing test at hand we can check that the fix cures WaitEventSetWaitBlock() against critical sections.

Best regards, Andrey Borodin.
Hi! Thank you for paying attention to this.

On Sun, 17 Aug 2025 at 19:33, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Kirill Reshke <reshkekirill@gmail.com> writes:
> > [ v1-0001-Do-not-exit-on-postmaster-death-ever-inside-CRIT-.patch ]
>
> I do not like this patch one bit: it will replace one set of problems
> with another set, namely systems that fail to shut down.

I did not observe this during my by-hand testing. I am under the impression that CRIT sections are something that backend (or other) postgres processes try to pass quickly. So, what this patch does is defer the process's reaction to postmaster death until the end of the CRIT section. The typical scenario here (as I understand it) is this:

(1) Process does its work, enters a CRIT section.
(2) Postmaster dies.
(3a) The signal of postmaster death (SIGPWR on my VM) is delivered to the process.
(3b) The process exits the CRIT section, then does CFI logic, observes postmaster death and quits.

This is why I wrote my patch the way I did. I mean, it is always possible for race conditions to occur which result in late signal delivery, so why bother at all?

> I think the actual bug here is the use of proc_exit(1) after
> observing postmaster death.

Agreed.

> So I think the correct fix here is s/proc_exit(1)/_exit(2)/ in the
> places that are responding to postmaster death. There might be
> more than just WaitEventSetWaitBlock; I didn't look.

Well, I see that patching this way will be a much safer way to fix the issue. I can see that making more conservative changes can be more beneficial (more future-proof and less bug-prone). I will take a detailed look and try to send a patch soon.

--
Best regards,
Kirill Reshke
On Mon, 18 Aug 2025 at 13:15, I wrote:
> > I do not like this patch one bit: it will replace one set of problems
> > with another set, namely systems that fail to shut down.
>
> I did not observe this during my by-hand testing.

I am sorry: I was wrong. This is exactly what happens in this test (modified 001_multixact.pl). To be precise, after the INSERT process exits, the CHECKPOINT process waits indefinitely in LWLockAcquire.

It looks like the reason proc_exit(1) releases all held lwlocks is that we use it to notify all lwlock contenders through shared memory about the state change, which would not be noticed otherwise, since we do not check for signals inside LWLockAcquire. It looks like we need to do something like ConditionVariableBroadcast(), but without the lwlock release, to notify all lwlock contenders and then exit(2).

===

As for the fix, I am now trying to make the attached work. The idea to "escalate" proc_exit to an immediate exit via syscall came to my mind from how elog(ERROR) behaves in CRIT sections (every elog(ERROR) effectively becomes elog(PANIC)).

--
Best regards,
Kirill Reshke
Attachments
Kirill Reshke <reshkekirill@gmail.com> writes:
> On Sun, 17 Aug 2025 at 19:33, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I do not like this patch one bit: it will replace one set of problems
>> with another set, namely systems that fail to shut down.

> I did not observe this during my by-hand testing. I am under the
> impression that CRIT sections are something that backend (or other)
> postgres processes try to pass quickly. So, what this patch is doing,
> is that it defers the process reaction to postmaster death until the
> end of the CRIT section.

Well, if you're inside WaitEventSetWaitBlock, you have little control over how long you're going to sit. There is a separate discussion to be had over whether we should prohibit calling that function inside a critical section. But I'm of the opinion that proc_exit is the wrong thing to use after seeing postmaster death, critical section or no. We should assume that system integrity is already compromised, and get out as fast as we can with as few side-effects as possible. It'll be up to the next generation of postmaster to try to clean up.

			regards, tom lane
On Tue, Aug 19, 2025 at 4:52 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> But I'm of the opinion that proc_exit
> is the wrong thing to use after seeing postmaster death, critical
> section or no. We should assume that system integrity is already
> compromised, and get out as fast as we can with as few side-effects
> as possible. It'll be up to the next generation of postmaster to
> try to clean up.

Then wouldn't backends blocked in LWLockAcquire(x) hang forever, after someone who holds x calls _exit()?

I don't know if there are other ways that LWLockReleaseAll() can lead to persistent corruption that won't be corrected by crash recovery, but this one is probably new since the following commit, explaining the failure to reproduce on v17:

commit bc22dc0e0ddc2dcb6043a732415019cc6b6bf683
Author: Alexander Korotkov <akorotkov@postgresql.org>
Date:   Wed Apr 2 12:44:24 2025 +0300

    Get rid of WALBufMappingLock

Any idea involving deferring the handling of PM death from here doesn't seem right: you'd keep waiting for the CV, but the backend that would wake you might have exited.

Hmm, I wonder if there could be a solution in between where we don't release the locks on PM exit, but we still wake the waiters so they can observe a new dead state in the lock word (or perhaps a shared postmaster_is_dead flag), and exit themselves.

Nice detective work Andrey and others! That's a complicated and rare interaction.
Hi! Thank you for paying attention to this.

On Tue, 19 Aug 2025 at 10:32, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> Then wouldn't backends blocked in LWLockAcquire(x) hang forever, after
> someone who holds x calls _exit()?
>
> I don't know if there are other ways that LWLockReleaseAll() can lead
> to persistent corruption that won't be corrected by crash recovery,
> but this one is probably new since the following commit, explaining
> the failure to reproduce on v17:
>
> commit bc22dc0e0ddc2dcb6043a732415019cc6b6bf683
> Author: Alexander Korotkov <akorotkov@postgresql.org>
> Date:   Wed Apr 2 12:44:24 2025 +0300
>
>     Get rid of WALBufMappingLock
>
> Any idea involving deferring the handling of PM death from here
> doesn't seem right: you'd keep waiting for the CV, but the backend
> that would wake you might have exited.

OK.

> Hmm, I wonder if there could be a solution in between where we don't
> release the locks on PM exit, but we still wake the waiters so they
> can observe a new dead state in the lock word (or perhaps a shared
> postmaster_is_dead flag), and exit themselves.

Since yesterday I have been thinking about adding a new state bit for LWLockWaitState. Something like LW_WS_OWNER_DEAD, which would be set by the lwlock owner after observing PM death and then checked by contenders in LWLockAcquire.
So something like:

*lock holder in proc_exit(1)*

```
for all my lwlocks do:
    waiter->lwWaiting = LW_WS_OWNER_DEAD;
    PGSemaphoreUnlock(waiter->sem);
```

*lock contender in LWLockAttemptLock*

```
old_state = pg_atomic_read_u32(&lock->state);

/* loop until we've determined whether we could acquire the lock or not */
while (true)
{
    if (old_state & (1 << LW_WS_OWNER_DEAD))
        _exit(2)  /* or maybe proc_exit(1) */
    ....
    if (pg_atomic_compare_exchange_u32(&lock->state,
                                       &old_state, desired_state))
    ...
    /* retry */
}
```

I am not sure this idea is workable though.

--
Best regards,
Kirill Reshke
Thomas Munro <thomas.munro@gmail.com> writes:
> On Tue, Aug 19, 2025 at 4:52 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> But I'm of the opinion that proc_exit
>> is the wrong thing to use after seeing postmaster death, critical
>> section or no. We should assume that system integrity is already
>> compromised, and get out as fast as we can with as few side-effects
>> as possible. It'll be up to the next generation of postmaster to
>> try to clean up.

> Then wouldn't backends blocked in LWLockAcquire(x) hang forever, after
> someone who holds x calls _exit()?

If someone who holds x is killed by (say) the OOM killer, how do we get out of that?

			regards, tom lane
On Tue, 19 Aug 2025 at 11:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Thomas Munro <thomas.munro@gmail.com> writes:
> > Then wouldn't backends blocked in LWLockAcquire(x) hang forever, after
> > someone who holds x calls _exit()?
>
> If someone who holds x is killed by (say) the OOM killer, how do
> we get out of that?

+1, if we kill -9 the PM and then immediately kill -9 the lwlock holder, there is no way for the system to shut down (both HEAD and back branches). So we can ignore this case.

--
Best regards,
Kirill Reshke
On Tue, 19 Aug 2025 at 10:32, Thomas Munro <thomas.munro@gmail.com> wrote: > > I don't know if there are other ways that LWLockReleaseAll() can lead > to persistent corruption that won't be corrected by crash recovery, > but this one is probably new since the following commit, explaining > the failure to reproduce on v17: > > commit bc22dc0e0ddc2dcb6043a732415019cc6b6bf683 > Author: Alexander Korotkov <akorotkov@postgresql.org> > Date: Wed Apr 2 12:44:24 2025 +0300 > > Get rid of WALBufMappingLock > > Any idea involving deferring the handling of PM death from here > doesn't seem right: you'd keep waiting for the CV, but the backend > that would wake you might have exited. I reverted this commit (there were conflicts but I resolved them) and added an assert for crit sections in WaitEventSetWait. make check passes (without v2-0001 it fails) -- Best regards, Kirill Reshke
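For reference, the assertion in question is essentially a one-liner at the top of WaitEventSetWait() (a sketch; the exact placement and wording in the patch may differ):

```c
/* sketch: guard at the top of WaitEventSetWait() in
 * src/backend/storage/ipc/latch.c (placement is illustrative) */
int
WaitEventSetWait(WaitEventSet *set, long timeout,
                 WaitEvent *occurred_events, int nevents,
                 uint32 wait_event_info)
{
    /*
     * Waiting here inside a critical section is unsafe: on postmaster
     * death this path ends in proc_exit(1), which runs LWLockReleaseAll()
     * and can expose half-done, unlogged page changes to other backends.
     */
    Assert(CritSectionCount == 0);
    ...
}
```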
This thread is a candidate for [0] [0] https://wiki.postgresql.org/wiki/PostgreSQL_18_Open_Items
Best regards,
Kirill Reshke
On 8/19/25 11:14, Kirill Reshke wrote: > This thread is a candidate for [0] > > > [0] https://wiki.postgresql.org/wiki/PostgreSQL_18_Open_Items > Added, with Alexander as an owner (assuming it really is caused by commit bc22dc0e0d). regards -- Tomas Vondra
10.08.2025 08:45, Kirill Reshke writes: > On Sun, 10 Aug 2025 at 01:55, Aleksander Alekseev > <aleksander@tigerdata.com> wrote: >> For this reason we have PageHeaderData.pd_lsn for instance - to make sure >> pages are evicted only *after* the record that changed it is written >> to disk (because WAL records can't be applied to pages from the >> future). > > We don't bump the LSN of the heap page when setting the visibility > map bit. And that is very bad!!! In fact, I believe "incremental backup" is not safe because of that. >> I guess the intent here could be to do an optimization of some sort >> but the facts that 1. the instance can be killed at any time and 2. >> there might be replicas - were not considered. > >> IMHO: logging the changes first, then allowing to evict the page. > > Clearing the vm before the logging changes was intentional [0]. > So I assume we should not change the approach, but rather just tweak > things a bit to make the whole thing work. > > [0] https://www.postgresql.org/message-id/BANLkTimuLk4RHXSQHEEiYGbxiXp2mh5KCA%40mail.gmail.com Let's cite that message: On Fri, May 6, 2011 at 5:47 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote: > On Wed, Mar 30, 2011 at 8:52 AM, Heikki Linnakangas > <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote: >>> Another question: >>> To address the problem in >>> http://archives.postgresql.org/pgsql-hackers/2010-02/msg02097.php >>> , should we just clear the vm before the log of insert/update/delete? >>> This may reduce the performance, is there another solution? >> >> Yeah, that's a straightforward way to fix it. I don't think the performance >> hit will be too bad. But we need to be careful not to hold locks while doing >> I/O, which might require some rearrangement of the code. We might want to do >> a similar dance that we do in vacuum, and call visibilitymap_pin first, then >> lock and update the heap page, and then set the VM bit while holding the >> lock on the heap page. 
> > Here's an attempt at implementing the necessary gymnastics. So we see: Heikki said "lock and update the heap page, and then set the VM bit while holding the lock on the heap page". He didn't say "set the VM bit before logging the heap update". It is not clear why Robert put visibilitymap_clear before logging of the update, so your point that it "was intentional" is not valid. Calling visibilitymap_clear after logging but before exiting the critical section (as Aleksander Alekseev did in his suggested patch) is the correct way to do it. -- regards Yura Sokolov aka funny-falcon
09.08.2025 22:54, Kirill Reshke writes: > On Thu, 7 Aug 2025 at 21:36, Aleksander Alekseev > <aleksander@tigerdata.com> wrote: > >> Perhaps there was a good >> reason to update the VM *before* creating WAL records I'm unaware of. > > Looks like 503c730 intentionally does it this way; however, I have not > yet fully understood the reasoning behind it. I repeat: there was no intention. Neither in the commit message, nor in the discussion about it. There was an intention to move visibilitymap_clear under the heap page lock and into the critical section, but there was not a word about logging. I believe it was just an unfortunate oversight that the change was made before logging. -- regards Yura Sokolov aka funny-falcon
Hi, On 2025-08-19 15:56:05 +0300, Yura Sokolov wrote: > 09.08.2025 22:54, Kirill Reshke пишет: > > On Thu, 7 Aug 2025 at 21:36, Aleksander Alekseev > > <aleksander@tigerdata.com> wrote: > > > >> Perhaps there was a good > >> reason to update the VM *before* creating WAL records I'm unaware of. > > > > Looks like 503c730 intentionally does it this way; however, I have not > > yet fully understood the reasoning behind it. > > I repeat: there was no intention. Neither in commit message, nor in > discussion about. > > There was intention to move visibilitymap_clear under heap page lock and > into critical section, but there were no any word about logging. > > I believe, it was just an unfortunate oversight that the change is made > before logging. The normal pattern *is* to modify the buffer while holding an exclusive lock, in a critical section, before WAL logging. Look at src/backend/access/transam/README: > The general schema for executing a WAL-logged action is > > 1. Pin and exclusive-lock the shared buffer(s) containing the data page(s) > to be modified. > > 2. START_CRIT_SECTION() (Any error during the next three steps must cause a > PANIC because the shared buffers will contain unlogged changes, which we > have to ensure don't get to disk. Obviously, you should check conditions > such as whether there's enough free space on the page before you start the > critical section.) > > 3. Apply the required changes to the shared buffer(s). > > 4. Mark the shared buffer(s) as dirty with MarkBufferDirty(). (This must > happen before the WAL record is inserted; see notes in SyncOneBuffer().) > Note that marking a buffer dirty with MarkBufferDirty() should only > happen iff you write a WAL record; see Writing Hints below. > > 5. If the relation requires WAL-logging, build a WAL record using > XLogBeginInsert and XLogRegister* functions, and insert it. (See > "Constructing a WAL record" below). Then update the page's LSN using the > returned XLOG location. 
For instance, > > XLogBeginInsert(); > XLogRegisterBuffer(...) > XLogRegisterData(...) > recptr = XLogInsert(rmgr_id, info); > > PageSetLSN(dp, recptr); > > 6. END_CRIT_SECTION() > > 7. Unlock and unpin the buffer(s). Greetings, Andres Freund
On Tue, 19 Aug 2025 at 14:14, Kirill Reshke <reshkekirill@gmail.com> wrote: > > This thread is a candidate for [0] > > > [0]https://wiki.postgresql.org/wiki/PostgreSQL_18_Open_Items > Let me summarize this thread for ease of understanding what's going on: Timeline: 1) Andrey Borodin sends a patch (on 6 Aug) claiming there is corruption in VM bits. 2) We investigated; the problem is not with how PostgreSQL modifies buffers or logs changes, but with LWLockReleaseAll in proc_exit(1) after kill -9 of the PM 3) We have reached the conclusion that there is no corruption, and that injection points are not a valid way to reproduce it, because of WaitLatch and friends. 4) But we now suspect there is another corruption with ANY critical section in this scenario: I wrote: > Maybe I'm very wrong about this, but I'm currently suspecting there is > corruption involving CHECKPOINT, process in CRIT section and kill -9. >1) Some process p1 locks some buffer (name it buf1), enters CRIT >section, calls MarkBufferDirty and hangs inside XLogInsert on CondVar >in (GetXLogBuffer -> AdvanceXLInsertBuffer). >2) CHECKPOINT (p2) starts and tries to FLUSH dirty buffers, awaiting lock on buf1 >3) Postmaster kill-9-ed >4) signal of postmaster death delivered to p1, it wakes up in >WaitLatch/WaitEventSetWaitBlock functions, checks postmaster >aliveness, and exits releasing all locks. >5) p2 acquires locks on buf1 and flushes it to disk. >6) signal of postmaster death delivered to p2, p2 exits. 5) We create an open item for pg18 and propose reverting bc22dc0e0ddc2dcb6043a732415019cc6b6bf683 or fixing it quickly. Please note that the patches in this thread are NOT reproducers of corruption; as of today we have NO valid repro of corruption -- Best regards, Kirill Reshke
19.08.2025 16:09, Andres Freund writes: > Hi, > > On 2025-08-19 15:56:05 +0300, Yura Sokolov wrote: >> 09.08.2025 22:54, Kirill Reshke writes: >>> On Thu, 7 Aug 2025 at 21:36, Aleksander Alekseev >>> <aleksander@tigerdata.com> wrote: >>> >>>> Perhaps there was a good >>>> reason to update the VM *before* creating WAL records I'm unaware of. >>> >>> Looks like 503c730 intentionally does it this way; however, I have not >>> yet fully understood the reasoning behind it. >> >> I repeat: there was no intention. Neither in commit message, nor in >> discussion about. >> >> There was intention to move visibilitymap_clear under heap page lock and >> into critical section, but there were no any word about logging. >> >> I believe, it was just an unfortunate oversight that the change is made >> before logging. > > The normal pattern *is* to modify the buffer while holding an exclusive lock, > in a critical section, before WAL logging. Look at > src/backend/access/transam/README: > >> The general schema for executing a WAL-logged action is >> >> 1. Pin and exclusive-lock the shared buffer(s) containing the data page(s) >> to be modified. >> >> 2. START_CRIT_SECTION() (Any error during the next three steps must cause a >> PANIC because the shared buffers will contain unlogged changes, which we >> have to ensure don't get to disk. Obviously, you should check conditions >> such as whether there's enough free space on the page before you start the >> critical section.) >> >> 3. Apply the required changes to the shared buffer(s). >> >> 4. Mark the shared buffer(s) as dirty with MarkBufferDirty(). (This must >> happen before the WAL record is inserted; see notes in SyncOneBuffer().) >> Note that marking a buffer dirty with MarkBufferDirty() should only >> happen iff you write a WAL record; see Writing Hints below. >> >> 5. If the relation requires WAL-logging, build a WAL record using >> XLogBeginInsert and XLogRegister* functions, and insert it. 
(See >> "Constructing a WAL record" below). Then update the page's LSN using the >> returned XLOG location. For instance, >> >> XLogBeginInsert(); >> XLogRegisterBuffer(...) >> XLogRegisterData(...) >> recptr = XLogInsert(rmgr_id, info); >> >> PageSetLSN(dp, recptr); >> >> 6. END_CRIT_SECTION() >> >> 7. Unlock and unpin the buffer(s). There is a quite important step in this instruction: > Then update the page's LSN using the returned XLOG location. This step is violated for the call of visibilitymap_clear. Though probably I'm mistaken that this is the source of the bug. But it is really a source of other kinds of issues. -- regards Yura Sokolov aka funny-falcon
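For illustration, the ordering being argued for here would look roughly like the following (a simplified sketch of the pattern, not the actual heap code: visibilitymap_clear moved after XLogInsert but kept inside the critical section):

```c
START_CRIT_SECTION();
/* ... apply changes to the heap page ... */
MarkBufferDirty(buffer);
recptr = XLogInsert(RM_HEAP_ID, info);
PageSetLSN(page, recptr);
/* Only now clear the VM bits: a VM page flushed after this point
 * can never get ahead of the WAL record that justifies the clear. */
if (PageIsAllVisible(page))
    visibilitymap_clear(relation, block, vmbuffer,
                        VISIBILITYMAP_VALID_BITS);
END_CRIT_SECTION();
```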
19.08.2025 16:17, Kirill Reshke writes: > On Tue, 19 Aug 2025 at 14:14, Kirill Reshke <reshkekirill@gmail.com> wrote: >> >> This thread is a candidate for [0] >> >> >> [0]https://wiki.postgresql.org/wiki/PostgreSQL_18_Open_Items >> > > Let me summarize this thread for ease of understanding what's going on: > > Timeline: > 1) Andrey Borodin sends a patch (on 6 Aug) claiming there is > corruption in VM bits. > 2) We investigated; the problem is not with how PostgreSQL modifies buffers > or logs changes, but with LWLockReleaseAll in proc_exit(1) after > kill -9 of the PM > 3) We have reached the conclusion that there is no corruption, and > that injection points are not a valid way to reproduce it, because > of WaitLatch and friends. > > 4) But we now suspect there is another corruption with ANY critical > section in this scenario: > > I wrote: > >> Maybe I'm very wrong about this, but I'm currently suspecting there is >> corruption involving CHECKPOINT, process in CRIT section and kill -9. >> 1) Some process p1 locks some buffer (name it buf1), enters CRIT >> section, calls MarkBufferDirty and hangs inside XLogInsert on CondVar >> in (GetXLogBuffer -> AdvanceXLInsertBuffer). >> 2) CHECKPOINT (p2) starts and tries to FLUSH dirty buffers, awaiting lock on buf1 >> 3) Postmaster kill-9-ed >> 4) signal of postmaster death delivered to p1, it wakes up in >> WaitLatch/WaitEventSetWaitBlock functions, checks postmaster >> aliveness, and exits releasing all locks. >> 5) p2 acquires locks on buf1 and flushes it to disk. >> 6) signal of postmaster death delivered to p2, p2 exits. > > 5) We create an open item for pg18 and propose reverting > bc22dc0e0ddc2dcb6043a732415019cc6b6bf683 or fixing it quickly. Latch and ConditionVariable (which uses Latch) are among the basic synchronization primitives in PostgreSQL. Therefore they have to work correctly anywhere: in a critical section, in WAL logging, etc. The current behavior of WaitEventSetWaitBlock is certainly a bug and it ought to be fixed. 
So +1 for _exit(2) as Tom suggested. -- regards Yura Sokolov aka funny-falcon
On Tue, 19 Aug 2025 at 18:29, Yura Sokolov <y.sokolov@postgrespro.ru> wrote: > Latch and ConditionVariable (which uses Latch) are among the basic > synchronization primitives in PostgreSQL. Sure > Therefore they have to work correctly anywhere: in a critical section, in > WAL logging, etc. No. Before bc22dc0e0ddc2dcb6043a732415019cc6b6bf683 the ConditionVariable code path was never exercised in critical sections. After bc22dc0e0ddc2dcb6043a732415019cc6b6bf683 it is exercised in almost every one (if the system is highly loaded). This is a crucial change with corruption as a drawback (until we fix this). Replacing proc_exit(1) with _exit(2) is not a cure either: if we exit inside a CRIT section without any message to LWLock contenders, they will never do the same (never exit), because they are waiting on the semaphore and do not respond to signals (so the only way to stop them is to kill -9). Before bc22dc0e0ddc2dcb6043a732415019cc6b6bf683 lwlock holders did not exit inside crit sections (unless kill -9-ed). I had one suggestion about what can be done [0]. However there is little to no time until the pg18 release for a change this scary and big (my own understanding), so the safest option is to revert. [0] https://www.postgresql.org/message-id/CALdSSPgDAyqt%3DORyLMWMpotb9V4Jk1Am%2Bhe39mNtBA8%2Ba8TQDw%40mail.gmail.com -- Best regards, Kirill Reshke
Hi, On 2025-08-19 02:13:43 -0400, Tom Lane wrote: > Thomas Munro <thomas.munro@gmail.com> writes: > > On Tue, Aug 19, 2025 at 4:52 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> But I'm of the opinion that proc_exit > >> is the wrong thing to use after seeing postmaster death, critical > >> section or no. We should assume that system integrity is already > >> compromised, and get out as fast as we can with as few side-effects > >> as possible. It'll be up to the next generation of postmaster to > >> try to clean up. > > > Then wouldn't backends blocked in LWLockAcquire(x) hang forever, after > > someone who holds x calls _exit()? > > If someone who holds x is killed by (say) the OOM killer, how do > we get out of that? On linux - the primary OS with OOM killer troubles - I'm pretty sure lwlock waiters would get killed due to the postmaster death signal we've configured (c.f. PostmasterDeathSignalInit()). A long while back I had experimented with replacing waiting on semaphores (within lwlocks) with a latch wait. IIRC it was a bit slower under heavy contention, but that vanished when adding some adaptive spinning to lwlocks - which is also what we need to make it more feasible to replace some of the remaining spinlocks... Greetings, Andres Freund
On Wed, Aug 20, 2025 at 1:56 AM Andres Freund <andres@anarazel.de> wrote: > On 2025-08-19 02:13:43 -0400, Tom Lane wrote: > > > Then wouldn't backends blocked in LWLockAcquire(x) hang forever, after > > > someone who holds x calls _exit()? > > > > If someone who holds x is killed by (say) the OOM killer, how do > > we get out of that? If a backend is killed by the OOM killer, the postmaster will of course send SIGQUIT/SIGKILL to all backends. If the postmaster itself is killed, then surviving backends will notice at their next WaitEventSetWait() and exit, but if any are blocked in sem_wait(), they will only make progress because other exiting backends release their LWLocks on their way out. So if we change that to _exit(), I assume such backends would linger forever in sem_wait() after the postmaster dies. I do agree that it seems quite weird to release all locks as if this is a "normal" exit though, which is why Kirill and I both wondered about other ways to boot them out of sem_wait()... > On linux - the primary OS with OOM killer troubles - I'm pretty sure lwlock > waiters would get killed due to the postmaster death signal we've configured > (c.f. PostmasterDeathSignalInit()). No, that has a handler that just sets a global variable. That was done because recovery used to try to read() from the postmaster pipe after replaying every record. Also we currently have some places that don't want to be summarily killed (off the top of my head, syncrep wants to send a special error message, and the logger wants to survive longer than everyone else to catch as much output as possible, things I've been thinking about in the context of threads).
Hi, On 2025-08-20 02:54:09 +1200, Thomas Munro wrote: > > On linux - the primary OS with OOM killer troubles - I'm pretty sure'll lwlock > > waiters would get killed due to the postmaster death signal we've configured > > (c.f. PostmasterDeathSignalInit()). > > No, that has a handler that just sets a global variable. That was > done because recovery used to try to read() from the postmaster pipe > after replaying every record. Also we currently have some places that > don't want to be summarily killed (off the top of my head, syncrep > wants to send a special error message, and the logger wants to survive > longer than everyone else to catch as much output as possible, things > I've been thinking about in the context of threads). That makes no sense. We should just _exit(). If postmaster has been killed, trying to stay up longer just makes everything more fragile. Waiting for the logger is *exactly* what we should *not* do - what if the logger also crashed? There's no postmaster around to start it. Greetings, Andres Freund
On Wed, Aug 20, 2025 at 2:57 AM Andres Freund <andres@anarazel.de> wrote: > On 2025-08-20 02:54:09 +1200, Thomas Munro wrote: > > > On linux - the primary OS with OOM killer troubles - I'm pretty sure'll lwlock > > > waiters would get killed due to the postmaster death signal we've configured > > > (c.f. PostmasterDeathSignalInit()). > > > > No, that has a handler that just sets a global variable. That was > > done because recovery used to try to read() from the postmaster pipe > > after replaying every record. Also we currently have some places that > > don't want to be summarily killed (off the top of my head, syncrep > > wants to send a special error message, and the logger wants to survive > > longer than everyone else to catch as much output as possible, things > > I've been thinking about in the context of threads). > > That makes no sense. We should just _exit(). If postmaster has been killed, > trying to stay up longer just makes everything more fragile. Waiting for the > logger is *exactly* what we should *not* do - what if the logger also crashed? > There's no postmaster around to start it. Nobody is waiting for the logger. The logger waits for everyone else to exit first to collect forensics: * Unlike all other postmaster child processes, we'll ignore postmaster * death because we want to collect final log output from all backends and * then exit last. We'll do that by running until we see EOF on the * syslog pipe, which implies that all other backends have exited * (including the postmaster). The syncrep case is a bit weirder: it wants to tell the user that syncrep is broken, so its own WaitEventSetWait() has WL_POSTMASTER_DEATH, but that's basically bogus because the backend can reach WaitEventSetWait(WL_EXIT_ON_PM_DEATH) in many other code paths. I've proposed nuking that before.
Hi, On 2025-08-20 03:19:38 +1200, Thomas Munro wrote: > On Wed, Aug 20, 2025 at 2:57 AM Andres Freund <andres@anarazel.de> wrote: > > On 2025-08-20 02:54:09 +1200, Thomas Munro wrote: > > > > On linux - the primary OS with OOM killer troubles - I'm pretty sure'll lwlock > > > > waiters would get killed due to the postmaster death signal we've configured > > > > (c.f. PostmasterDeathSignalInit()). > > > > > > No, that has a handler that just sets a global variable. That was > > > done because recovery used to try to read() from the postmaster pipe > > > after replaying every record. Also we currently have some places that > > > don't want to be summarily killed (off the top of my head, syncrep > > > wants to send a special error message, and the logger wants to survive > > > longer than everyone else to catch as much output as possible, things > > > I've been thinking about in the context of threads). > > > > That makes no sense. We should just _exit(). If postmaster has been killed, > > trying to stay up longer just makes everything more fragile. Waiting for the > > logger is *exactly* what we should *not* do - what if the logger also crashed? > > There's no postmaster around to start it. > > Nobody is waiting for the logger. Error messages that we might be printing will wait for logger if the pipe is full, no? > The logger waits for everyone else to exit first to collect forensics: > > * Unlike all other postmaster child processes, we'll ignore postmaster > * death because we want to collect final log output from all backends and > * then exit last. We'll do that by running until we see EOF on the > * syslog pipe, which implies that all other backends have exited > * (including the postmaster). > The syncrep case is a bit weirder: it wants to tell the user that > syncrep is broken, so its own WaitEventSetWait() has > WL_POSTMASTER_DEATH, but that's basically bogus because the backend > can reach WaitEventSetWait(WL_EXIT_ON_PM_DEATH) in many other code > paths. 
I've proposed nuking that before. Yea, that's just bogus. I think this is one more instance of "let's try hard to continue limping along" making things way more fragile than the simpler "let's just do crash-restart in the most normal way possible". Greetings, Andres Freund
19.08.2025 16:43, Kirill Reshke writes: > On Tue, 19 Aug 2025 at 18:29, Yura Sokolov <y.sokolov@postgrespro.ru> wrote: > > >> Latch and ConditionVariable (which uses Latch) are among the basic >> synchronization primitives in PostgreSQL. > > Sure > >> Therefore they have to work correctly anywhere: in a critical section, in >> WAL logging, etc. > > No. Before bc22dc0e0ddc2dcb6043a732415019cc6b6bf683 the ConditionVariable > code path was never exercised in critical sections. After > bc22dc0e0ddc2dcb6043a732415019cc6b6bf683 it is exercised in almost > every one (if the system is highly loaded). This is a crucial change > with corruption as a drawback (until we fix this). > > Replacing proc_exit(1) with _exit(2) is not a cure either: if we exit > inside a CRIT section without any message to LWLock contenders, they > will never do the same (never exit), because they are waiting on the > semaphore and do not respond to signals (so the only way to stop them > is to kill -9). Before bc22dc0e0ddc2dcb6043a732415019cc6b6bf683 lwlock > holders did not exit inside crit sections (unless kill -9-ed) That is not true. elog(PANIC) doesn't clear LWLocks. And XLogWrite, which could be called from AdvanceXLInsertBuffer, may call elog(PANIC) from several places. It doesn't lead to any error, because usually the postmaster is alive and it will kill -9 all its children if any one of them dies in a critical section. So the problem is that the postmaster is already killed with SIGKILL by definition of the issue. Documentation says [0]: > If at all possible, do not use SIGKILL to kill the main postgres server. > Doing so will prevent postgres from freeing the system resources (e.g., shared memory and semaphores) that it holds before terminating. Therefore if the postmaster is SIGKILL-ed, the administrator already has to take some action. 
So the issue Andrey Borodin raised is that, without any fix, Pg18 provides an inconsistency between the data and the WAL log: data could be written to the main fork and the vm fork although the WAL is not written yet IF POSTMASTER IS SIGKILL-ed. `if (CritSectionCount != 0) _exit(2); else proc_exit(1);` in WaitEventSetWaitBlock() solves the issue of inconsistency IF POSTMASTER IS SIGKILLED, and doesn't lead to any problem if the postmaster is not SIGKILL-ed (since the postmaster will SIGKILL its children). > I had one suggestion about what can be done [0]. However there is > little to no time until the pg18 release for a change this scary and big > (my own understanding), so the safest option is to revert. > > [0] https://www.postgresql.org/message-id/CALdSSPgDAyqt%3DORyLMWMpotb9V4Jk1Am%2Bhe39mNtBA8%2Ba8TQDw%40mail.gmail.com [0] https://www.postgresql.org/docs/17/app-postgres.html -- regards Yura Sokolov aka funny-falcon
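In context, this proposal amounts to something like the following in the postmaster-death branch (a simplified sketch; the real WaitEventSetWaitBlock code differs between the epoll/kqueue/poll implementations, and the exact field and function names around the WL_EXIT_ON_PM_DEATH handling may not match):

```c
/* sketch: postmaster-death handling in WaitEventSetWaitBlock() */
if (!PostmasterIsAliveInternal())
{
    if (set->exit_on_postmaster_death)
    {
        if (CritSectionCount != 0)
            _exit(2);   /* don't run proc_exit callbacks or
                         * LWLockReleaseAll with unlogged changes */
        proc_exit(1);   /* current behavior */
    }
}
```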
On Tue, 19 Aug 2025 at 21:16, Yura Sokolov <y.sokolov@postgrespro.ru> wrote: > > That is not true. > elog(PANIC) doesn't clear LWLocks. And XLogWrite, which could be called > from AdvanceXLInsertBuffer, may call elog(PANIC) from several places. > > It doesn't lead to any error, because usually the postmaster is alive and it > will kill -9 all its children if any one of them dies in a critical section. > > So the problem is that the postmaster is already killed with SIGKILL by definition > of the issue. > > Documentation says [0]: > > If at all possible, do not use SIGKILL to kill the main postgres server. > > Doing so will prevent postgres from freeing the system resources (e.g., > shared memory and semaphores) that it holds before terminating. > > Therefore if the postmaster is SIGKILL-ed, the administrator already has to take > some action. > There are surely many cases when a system reaches a state which can only be fixed by admin action. An elog(PANIC) in a CRIT section is very rare (and very probably means corruption already). A simpler example is to kill -9 the postmaster and then immediately kill -9 someone who holds an LWLock. The problem in pgv18 is that the probability of this state is much higher due to the aforementioned commit. It can happen with almost any OOM on highly loaded systems. -- Best regards, Kirill Reshke
On Tue, 19 Aug 2025 at 21:16, Yura Sokolov <y.sokolov@postgrespro.ru> wrote: > > > `if (CritSectionCount != 0) _exit(2); else proc_exit(1);` in > WaitEventSetWaitBlock() solves the issue of inconsistency IF POSTMASTER IS > SIGKILLED, and doesn't lead to any problem if the postmaster is not SIGKILL-ed > (since the postmaster will SIGKILL its children). > This fix was proposed in this thread. It fixes the inconsistency but it replaces one set of problems with another set, namely systems that fail to shut down. -- Best regards, Kirill Reshke
On Tue, 19 Aug 2025 at 20:24, Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2025-08-20 03:19:38 +1200, Thomas Munro wrote: > > On Wed, Aug 20, 2025 at 2:57 AM Andres Freund <andres@anarazel.de> wrote: > > > On 2025-08-20 02:54:09 +1200, Thomas Munro wrote: > > > > > On linux - the primary OS with OOM killer troubles - I'm pretty sure lwlock > > > > > waiters would get killed due to the postmaster death signal we've configured > > > > > (c.f. PostmasterDeathSignalInit()). > > > > > > > > No, that has a handler that just sets a global variable. That was > > > > done because recovery used to try to read() from the postmaster pipe > > > > after replaying every record. Also we currently have some places that > > > > don't want to be summarily killed (off the top of my head, syncrep > > > > wants to send a special error message, and the logger wants to survive > > > > longer than everyone else to catch as much output as possible, things > > > > I've been thinking about in the context of threads). > > > > > > That makes no sense. We should just _exit(). If postmaster has been killed, > > > trying to stay up longer just makes everything more fragile. Waiting for the > > > logger is *exactly* what we should *not* do - what if the logger also crashed? > > > There's no postmaster around to start it. > > > > Nobody is waiting for the logger. > > Error messages that we might be printing will wait for logger if the pipe is > full, no? I did some checking of crit sections for elog usage, and I did not really find any crit section that uses elog(elevel) with elevel < ERROR. But surely there are cases when we might be printing messages inside crit sections, or there can be such sections in backbranches/future branches. Anyway, this case, when we are hung indefinitely while waiting for an (already dead) logger, might be a rare one. The problem is, even without a logger, on current HEAD, we will fail to stop the system when the PM dies, and there is no simple fix. 
It would be very helpful if the LWLock implementation were done using latch waits; there would be no problem then. But we are where we are, so I can see there is sense in trying to notify other processes that we are going to die soon and that they need to do the same (through shared memory), and then _exit(1). -- Best regards, Kirill Reshke
Kirill Reshke <reshkekirill@gmail.com> writes: > On Tue, 19 Aug 2025 at 21:16, Yura Sokolov <y.sokolov@postgrespro.ru> wrote: >> `if (CritSectionCount != 0) _exit(2) else proc_exit(1)` in >> WaitEventSetWaitBlock () solves the issue of inconsistency IF POSTMASTER IS >> SIGKILLED, and doesn't lead to any problem, if postmaster is not SIGKILL-ed >> (since postmaster will SIGKILL its children). > This fix was proposed in this thread. It fixes inconsistency but it > replaces one set of problems with another set, namely systems that > fail to shut down. I think a bigger objection is that it'd result in two separate shutdown behaviors in what's already an extremely under-tested (and hard to test) scenario. I don't want to have to deal with the ensuing state-space explosion. I still think that proc_exit(1) is fundamentally the wrong thing to do if the postmaster is gone: that code path assumes that the cluster is still functional, which is at best shaky. I concur though that we'd have to do some more engineering work before _exit(2) would be a practical solution. In the meantime, it seems like this discussion point arises only because the presented test case is doing something that seems pretty unsafe, namely invoking WaitEventSet inside a critical section. We'd probably be best off to get back to the actual bug the thread started with, namely whether we aren't doing the wrong thing with VM-update order of operations. regards, tom lane
On Tue, 19 Aug 2025 at 23:08, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Kirill Reshke <reshkekirill@gmail.com> writes: > > On Tue, 19 Aug 2025 at 21:16, Yura Sokolov <y.sokolov@postgrespro.ru> wrote: > >> `if (CritSectionCount != 0) _exit(2) else proc_exit(1)` in > >> WaitEventSetWaitBlock () solves the issue of inconsistency IF POSTMASTER IS > >> SIGKILLED, and doesn't lead to any problem, if postmaster is not SIGKILL-ed > >> (since postmaster will SIGKILL its children). > > > This fix was proposed in this thread. It fixes inconsistency but it > > replaces one set of problems with another set, namely systems that > > fail to shut down. > > I think a bigger objection is that it'd result in two separate > shutdown behaviors in what's already an extremely under-tested > (and hard to test) scenario. I don't want to have to deal with > the ensuing state-space explosion. Agreed. > I still think that proc_exit(1) is fundamentally the wrong thing > to do if the postmaster is gone: that code path assumes that > the cluster is still functional, which is at best shaky. > I concur though that we'd have to do some more engineering work > before _exit(2) would be a practical solution. Agreed. > In the meantime, it seems like this discussion point arises > only because the presented test case is doing something that > seems pretty unsafe, namely invoking WaitEventSet inside a > critical section. Agreed. > We'd probably be best off to get back to the actual bug the > thread started with, namely whether we aren't doing the wrong > thing with VM-update order of operations. > > regards, tom lane My understanding is that there is no bug in the VM. At least not in the [0] test, because it uses an injection point in a CRIT section, making the server exit too early. So the behaviour with an inj point and without one are very different. The corruption we are looking for has no reproducer (see [1]). 
[0] https://www.postgresql.org/message-id/B3C69B86-7F82-4111-B97F-0005497BB745%40yandex-team.ru [1] https://www.postgresql.org/message-id/CALdSSPhGQ1xx10c2NaZgce8qmi%2BSuKFp6T1uWG_aZvPpvoJRkQ%40mail.gmail.com -- Best regards, Kirill Reshke
> On 19 Aug 2025, at 23:23, Kirill Reshke <reshkekirill@gmail.com> wrote:
>
>> We'd probably be best off to get back to the actual bug the
>> thread started with, namely whether we aren't doing the wrong
>> thing with VM-update order of operations.
>>
>> regards, tom lane
>
> My understanding is that there is no bug in the VM. At least not in
> [0] test, because it uses an injection point in the CRIT section,
> making the server exit too early.
> So, behaviour with inj point and without are very different.
> The corruption we are looking for has no reproducer (see [1]).

I believe there is a bug with PageIsAllVisible(page) && visibilitymap_clear(). But I cannot prove it with an injection point test. Because injection points rely on CondVar, that per se creates corruption in critical section. So I'm reading this discussion and wonder if CondVar will be fixed in some clever way or I'd better invent a new injection point wait mechanism.

Best regards, Andrey Borodin.
Kirill Reshke <reshkekirill@gmail.com> writes: > On Tue, 19 Aug 2025 at 10:32, Thomas Munro <thomas.munro@gmail.com> wrote: >> Any idea involving deferring the handling of PM death from here >> doesn't seem right: you'd keep waiting for the CV, but the backend >> that would wake you might have exited. Yeah. Taking the check for PM death out of here seems just about as hazardous as leaving it in :-(. Not something I want to mess with so late in the v18 cycle. > I revert this commit (these were conflicts but i resolved them) and > added assert for crit sections in WaitEventSetWait. Your patch still contains some conflict markers :-(. Attached is a corrected version, just to save other people the effort of fixing the diffs themselves. > make check passes (without v2-0001 it fails) While 'make check' is okay with this assertion, 'make check-world' still falls over if you have injection points enabled, because src/test/modules/test_slru/t/001_multixact.pl also had the cute idea of putting an injection-point wait inside a critical section. I did not find any other failures though. I'm inclined to think that we do want to prohibit WaitEventSetWait inside a critical section --- it just seems like a bad idea all around, even without considering this specific failure mode. Therefore, I vote for reverting bc22dc0e0. Hopefully only temporarily, but it's too late to figure out another way for v18, and I don't think that bc22dc0e0 is such an essential improvement that we can't afford to give it up for v18. However, we can't install the proposed assertion until we do something about that test_slru test. It seems like in general it would be sad if we can't put injection points inside critical sections, so I'm wondering if there's a way to re-implement the injection point "wait" functionality without depending on WaitEventSetWait. I would be willing to accept weaker safety guarantees in this context, since we only anticipate such cases being used in test scaffolding. 
regards, tom lane diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index e8909406686..7ffb2179151 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -303,6 +303,11 @@ static bool doPageWrites; * so it's a plain spinlock. The other locks are held longer (potentially * over I/O operations), so we use LWLocks for them. These locks are: * + * WALBufMappingLock: must be held to replace a page in the WAL buffer cache. + * It is only held while initializing and changing the mapping. If the + * contents of the buffer being replaced haven't been written yet, the mapping + * lock is released while the write is done, and reacquired afterwards. + * * WALWriteLock: must be held to write WAL buffers to disk (XLogWrite or * XLogFlush). * @@ -468,37 +473,21 @@ typedef struct XLogCtlData pg_atomic_uint64 logFlushResult; /* last byte + 1 flushed */ /* - * First initialized page in the cache (first byte position). - */ - XLogRecPtr InitializedFrom; - - /* - * Latest reserved for initialization page in the cache (last byte - * position + 1). + * Latest initialized page in the cache (last byte position + 1). * - * To change the identity of a buffer, you need to advance - * InitializeReserved first. To change the identity of a buffer that's + * To change the identity of a buffer (and InitializedUpTo), you need to + * hold WALBufMappingLock. To change the identity of a buffer that's * still dirty, the old page needs to be written out first, and for that * you need WALWriteLock, and you need to ensure that there are no * in-progress insertions to the page by calling * WaitXLogInsertionsToFinish(). */ - pg_atomic_uint64 InitializeReserved; - - /* - * Latest initialized page in the cache (last byte position + 1). - * - * InitializedUpTo is updated after the buffer initialization. After - * update, waiters got notification using InitializedUpToCondVar. 
- */ - pg_atomic_uint64 InitializedUpTo; - ConditionVariable InitializedUpToCondVar; + XLogRecPtr InitializedUpTo; /* * These values do not change after startup, although the pointed-to pages - * and xlblocks values certainly do. xlblocks values are changed - * lock-free according to the check for the xlog write position and are - * accompanied by changes of InitializeReserved and InitializedUpTo. + * and xlblocks values certainly do. xlblocks values are protected by + * WALBufMappingLock. */ char *pages; /* buffers for unwritten XLOG pages */ pg_atomic_uint64 *xlblocks; /* 1st byte ptr-s + XLOG_BLCKSZ */ @@ -821,9 +810,9 @@ XLogInsertRecord(XLogRecData *rdata, * fullPageWrites from changing until the insertion is finished. * * Step 2 can usually be done completely in parallel. If the required WAL - * page is not initialized yet, you have to go through AdvanceXLInsertBuffer, - * which will ensure it is initialized. But the WAL writer tries to do that - * ahead of insertions to avoid that from happening in the critical path. + * page is not initialized yet, you have to grab WALBufMappingLock to + * initialize it, but the WAL writer tries to do that ahead of insertions + * to avoid that from happening in the critical path. * *---------- */ @@ -2005,79 +1994,32 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic) XLogRecPtr NewPageEndPtr = InvalidXLogRecPtr; XLogRecPtr NewPageBeginPtr; XLogPageHeader NewPage; - XLogRecPtr ReservedPtr; int npages pg_attribute_unused() = 0; - /* - * We must run the loop below inside the critical section as we expect - * XLogCtl->InitializedUpTo to eventually keep up. The most of callers - * already run inside the critical section. Except for WAL writer, which - * passed 'opportunistic == true', and therefore we don't perform - * operations that could error out. - * - * Start an explicit critical section anyway though. 
- */ - Assert(CritSectionCount > 0 || opportunistic); - START_CRIT_SECTION(); + LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE); - /*-- - * Loop till we get all the pages in WAL buffer before 'upto' reserved for - * initialization. Multiple process can initialize different buffers with - * this loop in parallel as following. - * - * 1. Reserve page for initialization using XLogCtl->InitializeReserved. - * 2. Initialize the reserved page. - * 3. Attempt to advance XLogCtl->InitializedUpTo, + /* + * Now that we have the lock, check if someone initialized the page + * already. */ - ReservedPtr = pg_atomic_read_u64(&XLogCtl->InitializeReserved); - while (upto >= ReservedPtr || opportunistic) + while (upto >= XLogCtl->InitializedUpTo || opportunistic) { - Assert(ReservedPtr % XLOG_BLCKSZ == 0); + nextidx = XLogRecPtrToBufIdx(XLogCtl->InitializedUpTo); /* - * Get ending-offset of the buffer page we need to replace. - * - * We don't lookup into xlblocks, but rather calculate position we - * must wait to be written. If it was written, xlblocks will have this - * position (or uninitialized) + * Get ending-offset of the buffer page we need to replace (this may + * be zero if the buffer hasn't been used yet). Fall through if it's + * already written out. */ - if (ReservedPtr + XLOG_BLCKSZ > XLogCtl->InitializedFrom + XLOG_BLCKSZ * XLOGbuffers) - OldPageRqstPtr = ReservedPtr + XLOG_BLCKSZ - (XLogRecPtr) XLOG_BLCKSZ * XLOGbuffers; - else - OldPageRqstPtr = InvalidXLogRecPtr; - - if (LogwrtResult.Write < OldPageRqstPtr && opportunistic) + OldPageRqstPtr = pg_atomic_read_u64(&XLogCtl->xlblocks[nextidx]); + if (LogwrtResult.Write < OldPageRqstPtr) { /* - * If we just want to pre-initialize as much as we can without - * flushing, give up now. + * Nope, got work to do. If we just want to pre-initialize as much + * as we can without flushing, give up now. */ - upto = ReservedPtr - 1; - break; - } - - /* - * Attempt to reserve the page for initialization. 
Failure means that - * this page got reserved by another process. - */ - if (!pg_atomic_compare_exchange_u64(&XLogCtl->InitializeReserved, - &ReservedPtr, - ReservedPtr + XLOG_BLCKSZ)) - continue; - - /* - * Wait till page gets correctly initialized up to OldPageRqstPtr. - */ - nextidx = XLogRecPtrToBufIdx(ReservedPtr); - while (pg_atomic_read_u64(&XLogCtl->InitializedUpTo) < OldPageRqstPtr) - ConditionVariableSleep(&XLogCtl->InitializedUpToCondVar, WAIT_EVENT_WAL_BUFFER_INIT); - ConditionVariableCancelSleep(); - Assert(pg_atomic_read_u64(&XLogCtl->xlblocks[nextidx]) == OldPageRqstPtr); - - /* Fall through if it's already written out. */ - if (LogwrtResult.Write < OldPageRqstPtr) - { - /* Nope, got work to do. */ + if (opportunistic) + break; /* Advance shared memory write request position */ SpinLockAcquire(&XLogCtl->info_lck); @@ -2092,6 +2034,14 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic) RefreshXLogWriteResult(LogwrtResult); if (LogwrtResult.Write < OldPageRqstPtr) { + /* + * Must acquire write lock. Release WALBufMappingLock first, + * to make sure that all insertions that we need to wait for + * can finish (up to this same position). Otherwise we risk + * deadlock. + */ + LWLockRelease(WALBufMappingLock); + WaitXLogInsertionsToFinish(OldPageRqstPtr); LWLockAcquire(WALWriteLock, LW_EXCLUSIVE); @@ -2119,6 +2069,9 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic) */ pgstat_report_fixed = true; } + /* Re-acquire WALBufMappingLock and retry */ + LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE); + continue; } } @@ -2126,9 +2079,11 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic) * Now the next buffer slot is free and we can set it up to be the * next output page. 
*/ - NewPageBeginPtr = ReservedPtr; + NewPageBeginPtr = XLogCtl->InitializedUpTo; NewPageEndPtr = NewPageBeginPtr + XLOG_BLCKSZ; + Assert(XLogRecPtrToBufIdx(NewPageBeginPtr) == nextidx); + NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size) XLOG_BLCKSZ); /* @@ -2192,100 +2147,12 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic) */ pg_write_barrier(); - /*----- - * Update the value of XLogCtl->xlblocks[nextidx] and try to advance - * XLogCtl->InitializedUpTo in a lock-less manner. - * - * First, let's provide a formal proof of the algorithm. Let it be 'n' - * process with the following variables in shared memory: - * f - an array of 'n' boolean flags, - * v - atomic integer variable. - * - * Also, let - * i - a number of a process, - * j - local integer variable, - * CAS(var, oldval, newval) - compare-and-swap atomic operation - * returning true on success, - * write_barrier()/read_barrier() - memory barriers. - * - * The pseudocode for each process is the following. - * - * j := i - * f[i] := true - * write_barrier() - * while CAS(v, j, j + 1): - * j := j + 1 - * read_barrier() - * if not f[j]: - * break - * - * Let's prove that v eventually reaches the value of n. - * 1. Prove by contradiction. Assume v doesn't reach n and stucks - * on k, where k < n. - * 2. Process k attempts CAS(v, k, k + 1). 1). If, as we assumed, v - * gets stuck at k, then this CAS operation must fail. Therefore, - * v < k when process k attempts CAS(v, k, k + 1). - * 3. If, as we assumed, v gets stuck at k, then the value k of v - * must be achieved by some process m, where m < k. The process - * m must observe f[k] == false. Otherwise, it will later attempt - * CAS(v, k, k + 1) with success. - * 4. Therefore, corresponding read_barrier() (while j == k) on - * process m reached before write_barrier() of process k. But then - * process k attempts CAS(v, k, k + 1) after process m successfully - * incremented v to k, and that CAS operation must succeed. 
- * That leads to a contradiction. So, there is no such k (k < n) - * where v gets stuck. Q.E.D. - * - * To apply this proof to the code below, we assume - * XLogCtl->InitializedUpTo will play the role of v with XLOG_BLCKSZ - * granularity. We also assume setting XLogCtl->xlblocks[nextidx] to - * NewPageEndPtr to play the role of setting f[i] to true. Also, note - * that processes can't concurrently map different xlog locations to - * the same nextidx because we previously requested that - * XLogCtl->InitializedUpTo >= OldPageRqstPtr. So, a xlog buffer can - * be taken for initialization only once the previous initialization - * takes effect on XLogCtl->InitializedUpTo. - */ - pg_atomic_write_u64(&XLogCtl->xlblocks[nextidx], NewPageEndPtr); - - pg_write_barrier(); - - while (pg_atomic_compare_exchange_u64(&XLogCtl->InitializedUpTo, &NewPageBeginPtr, NewPageEndPtr)) - { - NewPageBeginPtr = NewPageEndPtr; - NewPageEndPtr = NewPageBeginPtr + XLOG_BLCKSZ; - nextidx = XLogRecPtrToBufIdx(NewPageBeginPtr); - - pg_read_barrier(); - - if (pg_atomic_read_u64(&XLogCtl->xlblocks[nextidx]) != NewPageEndPtr) - { - /* - * Page at nextidx wasn't initialized yet, so we can't move - * InitializedUpto further. It will be moved by backend which - * will initialize nextidx. - */ - ConditionVariableBroadcast(&XLogCtl->InitializedUpToCondVar); - break; - } - } + XLogCtl->InitializedUpTo = NewPageEndPtr; npages++; } - - END_CRIT_SECTION(); - - /* - * All the pages in WAL buffer before 'upto' were reserved for - * initialization. However, some pages might be reserved by concurrent - * processes. Wait till they finish initialization. 
- */ - while (upto >= pg_atomic_read_u64(&XLogCtl->InitializedUpTo)) - ConditionVariableSleep(&XLogCtl->InitializedUpToCondVar, WAIT_EVENT_WAL_BUFFER_INIT); - ConditionVariableCancelSleep(); - - pg_read_barrier(); + LWLockRelease(WALBufMappingLock); #ifdef WAL_DEBUG if (XLOG_DEBUG && npages > 0) @@ -5178,10 +5045,6 @@ XLOGShmemInit(void) pg_atomic_init_u64(&XLogCtl->logWriteResult, InvalidXLogRecPtr); pg_atomic_init_u64(&XLogCtl->logFlushResult, InvalidXLogRecPtr); pg_atomic_init_u64(&XLogCtl->unloggedLSN, InvalidXLogRecPtr); - - pg_atomic_init_u64(&XLogCtl->InitializeReserved, InvalidXLogRecPtr); - pg_atomic_init_u64(&XLogCtl->InitializedUpTo, InvalidXLogRecPtr); - ConditionVariableInit(&XLogCtl->InitializedUpToCondVar); } /* @@ -6205,8 +6068,7 @@ StartupXLOG(void) memset(page + len, 0, XLOG_BLCKSZ - len); pg_atomic_write_u64(&XLogCtl->xlblocks[firstIdx], endOfRecoveryInfo->lastPageBeginPtr + XLOG_BLCKSZ); - pg_atomic_write_u64(&XLogCtl->InitializedUpTo, endOfRecoveryInfo->lastPageBeginPtr + XLOG_BLCKSZ); - XLogCtl->InitializedFrom = endOfRecoveryInfo->lastPageBeginPtr; + XLogCtl->InitializedUpTo = endOfRecoveryInfo->lastPageBeginPtr + XLOG_BLCKSZ; } else { @@ -6215,10 +6077,8 @@ StartupXLOG(void) * let the first attempt to insert a log record to initialize the next * buffer. */ - pg_atomic_write_u64(&XLogCtl->InitializedUpTo, EndOfLog); - XLogCtl->InitializedFrom = EndOfLog; + XLogCtl->InitializedUpTo = EndOfLog; } - pg_atomic_write_u64(&XLogCtl->InitializeReserved, pg_atomic_read_u64(&XLogCtl->InitializedUpTo)); /* * Update local and shared status. 
This is OK to do without any locks diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt index 0be307d2ca0..5427da5bc1b 100644 --- a/src/backend/utils/activity/wait_event_names.txt +++ b/src/backend/utils/activity/wait_event_names.txt @@ -156,7 +156,6 @@ REPLICATION_SLOT_DROP "Waiting for a replication slot to become inactive so it c RESTORE_COMMAND "Waiting for <xref linkend="guc-restore-command"/> to complete." SAFE_SNAPSHOT "Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFERRABLE</literal> transaction." SYNC_REP "Waiting for confirmation from a remote server during synchronous replication." -WAL_BUFFER_INIT "Waiting on WAL buffer to be initialized." WAL_RECEIVER_EXIT "Waiting for the WAL receiver to exit." WAL_RECEIVER_WAIT_START "Waiting for startup process to send initial data for streaming replication." WAL_SUMMARY_READY "Waiting for a new WAL summary to be generated." @@ -316,6 +315,7 @@ XidGen "Waiting to allocate a new transaction ID." ProcArray "Waiting to access the shared per-process data structures (typically, to get a snapshot or report a session's transaction ID)." SInvalRead "Waiting to retrieve messages from the shared catalog invalidation queue." SInvalWrite "Waiting to add a message to the shared catalog invalidation queue." +WALBufMapping "Waiting to replace a page in WAL buffers." WALWrite "Waiting for WAL buffers to be written to disk." ControlFile "Waiting to read or update the <filename>pg_control</filename> file or create a new WAL file." MultiXactGen "Waiting to read or update shared multixact state."
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h index 208d2e3a8ed..06a1ffd4b08 100644 --- a/src/include/storage/lwlocklist.h +++ b/src/include/storage/lwlocklist.h @@ -38,7 +38,7 @@ PG_LWLOCK(3, XidGen) PG_LWLOCK(4, ProcArray) PG_LWLOCK(5, SInvalRead) PG_LWLOCK(6, SInvalWrite) -/* 7 was WALBufMapping */ +PG_LWLOCK(7, WALBufMapping) PG_LWLOCK(8, WALWrite) PG_LWLOCK(9, ControlFile) /* 10 was CheckpointLock */
Andrey Borodin <x4mmm@yandex-team.ru> writes:
> I believe there is a bug with PageIsAllVisible(page) && visibilitymap_clear(). But I cannot prove it with an injection point test. Because injection points rely on CondVar, that per se creates corruption in critical section. So I'm reading this discussion and wonder if CondVar will be fixed in some clever way or I'd better invent a new injection point wait mechanism.

Yeah, I was coming to similar conclusions in the reply I just sent: we don't really want a policy that we can't put injection-point-based delays inside critical sections. So that infrastructure is leaving something to be desired.

Having said that, the test script is also doing something we tell people not to do, namely SIGKILL the postmaster. Could we use SIGQUIT (immediate shutdown) instead?

regards, tom lane
On Tue, Aug 19, 2025 at 03:55:41PM -0400, Tom Lane wrote:
> Yeah, I was coming to similar conclusions in the reply I just sent:
> we don't really want a policy that we can't put injection-point-based
> delays inside critical sections. So that infrastructure is leaving
> something to be desired.

Yes, it doesn't make sense to restrict the use of injection point waits within critical sections. A simple switch that we could do is to rely on a clock-based check in the wait() routine, removing the condition variable part. It costs in responsiveness because the wakeup() routine would not be able to ping the wait() part to recheck the shmem counter immediately. But we could reduce the delay with a variable recheck timing, say doubling the time to recheck the counter after each loop, capped at a maximum of a couple of hundred ms so that it's still good enough on fast machines and does not stress slow machines too much. That costs a bit more in terms of clock calls and delay checks, but it would be low-level enough that the internal interrupts would not matter if we rely on syscalls, I guess? And we don't care about efficiency in this case.

--
Michael
Attachments
On Wed, Aug 20, 2025 at 7:50 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm inclined to think that we do want to prohibit WaitEventSetWait
> inside a critical section --- it just seems like a bad idea all
> around, even without considering this specific failure mode.

FWIW aio/README.md describes a case where we'd need to wait for an IO, which might involve a CV to wait for an IO worker to do something, in order to start writing WAL, which is in a CS. Starting IO involves calling pgaio_io_acquire(), and if it can't find a handle it calls pgaio_io_wait_for_free(). That's all hypothetical for now as v18 is only doing reads, but it's an important architectural principle. That makes me suspect this new edict can't be the final policy, even if v18 uses it to solve the immediate problem.

For v19 I think we should probably attack the original sin and make this work. Several mechanisms for unwedging LWLockAcquire() have been mentioned: (1) the existing LWLockReleaseAll(), which clearly makes bogus assumptions about system state and cannot stay, (2) some new thing that would sem_post() all the waiters having set flags that cause LWLockAcquire() to exit (ie a sort of multiplexing, making our semaphore-based locks inch towards latch-nature), (3) moving LWLock over to latches, so the wait would already be multiplexed with PM death detection, (4) having the parent death signal handler exit directly (unfortunately Linux and FreeBSD only*), (5) in multi-threaded prototype work, the whole process exits anyway taking all backend threads with it** which is a strong motivation to make multi-process mode act as much like that as possible, eg something that exits a lot more eagerly and hopefully preemptively than today.

* That's an IRIX invention picked up by Linux and FreeBSD, a sort of reverse SIGCHLD, and I've tried to recreate it for pure POSIX systems before.
(1) Maybe it's enough for any backend that learns of postmaster death to signal everyone else since they can't all be blocked in sig_wait() unless there is already a deadlock. (2) I once tried making the postmaster death pipe O_ASYNC so that the "owner" gets a signal when it becomes readable, but it turned out to require a separate postmaster pipe for every backend (not just a dup'd descriptor); perhaps this would be plausible if we already had a bidirectional postmaster control socket protocol and choose to give every backend process its own socket pair in MP mode, something I've been looking into for various other reasons.

** I've been studying the unusual logger case in this context and contemplated running it as a separate process even in MT mode, as its stated aim didn't sound crazy to me and I was humbly attempting to preserve that characteristic in MT mode. Another way to achieve MP/MT consistency is to decide that the MP design already isn't robust enough on full pipe and just nuke the logger like everything else. Reading Andres's earlier comments, I briefly wondered about a compromise where log senders would make a best effort to send nonblockingly when they know the postmaster is gone, but that's neither as reliable as whoever wrote that had in mind (and in their defence, the logger is basically independent of shared memory state so whether it should be exiting ASAP or draining final log statements is at least debatable; barring bugs, it's only going to block progress if your system is really hosed), nor free entirely of "limping along" syndrome as Andres argues quite compellingly, so I cancelled that thought.
On Wed, Aug 20, 2025 at 11:59 AM Thomas Munro <thomas.munro@gmail.com> wrote: > they can't all be > blocked in sig_wait() unless there is already a deadlock. s/sig_wait()/sem_wait()/
Thomas Munro <thomas.munro@gmail.com> writes: > On Wed, Aug 20, 2025 at 7:50 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I'm inclined to think that we do want to prohibit WaitEventSetWait >> inside a critical section --- it just seems like a bad idea all >> around, even without considering this specific failure mode. > FWIW aio/README.md describes a case where we'd need to wait for an IO, > which might involve a CV to wait for an IO worker to do something, in > order to start writing WAL, which is in a CS. Hm. It still makes me mighty uncomfortable, because the point of a critical section is "crash the database if anything goes wrong during this bit". Waiting for another process --- or thread --- greatly increases the scope of ways for things to go wrong. So I'm not exactly convinced that this aspect of the AIO architecture is well-thought-out. Having said that, we should in any case have a better story on what WaitEventSetWait should do after detecting postmaster death. So I'm all for trying to avoid the proc_exit path if we can design a better answer. regards, tom lane
On Wed, 20 Aug 2025 at 00:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Kirill Reshke <reshkekirill@gmail.com> writes:
> > I revert this commit (these were conflicts but i resolved them) and
> > added assert for crit sections in WaitEventSetWait.
>
> Your patch still contains some conflict markers :-(. Attached is
> a corrected version, just to save other people the effort of fixing
> the diffs themselves.

Oh, yes, it indeed contains some conflict markers, but surprisingly compiles without errors; that's why I missed it. Attached looks good to me.

> While 'make check' is okay with this assertion, 'make check-world'
> still falls over if you have injection points enabled, because
> src/test/modules/test_slru/t/001_multixact.pl also had the
> cute idea of putting an injection-point wait inside a critical
> section. I did not find any other failures though.

Okay, we probably need to reimplement the wait function in injection points without a latch, got it.

> I'm inclined to think that we do want to prohibit WaitEventSetWait
> inside a critical section --- it just seems like a bad idea all
> around, even without considering this specific failure mode.
> Therefore, I vote for reverting bc22dc0e0. Hopefully only
> temporarily, but it's too late to figure out another way for v18,
> and I don't think that bc22dc0e0 is such an essential improvement
> that we can't afford to give it up for v18.

+1

--
Best regards,
Kirill Reshke
Hi,

On 2025-08-19 23:47:21 -0400, Tom Lane wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > On Wed, Aug 20, 2025 at 7:50 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> I'm inclined to think that we do want to prohibit WaitEventSetWait
> >> inside a critical section --- it just seems like a bad idea all
> >> around, even without considering this specific failure mode.
>
> > FWIW aio/README.md describes a case where we'd need to wait for an IO,
> > which might involve a CV to wait for an IO worker to do something, in
> > order to start writing WAL, which is in a CS.
>
> Hm. It still makes me mighty uncomfortable, because the point of a
> critical section is "crash the database if anything goes wrong during
> this bit". Waiting for another process --- or thread --- greatly
> increases the scope of ways for things to go wrong. So I'm not
> exactly convinced that this aspect of the AIO architecture is
> well-thought-out.

I don't see the alternative:

1) Some IO is done in critical sections (e.g. WAL writes / flushes)

2) Sometimes we need to wait for already started IO in critical sections (also WAL)

3) With some ways of doing AIO the IO is offloaded to other processes, and thus waiting for the IO to complete always requires waiting for another process

How could we avoid the need to wait for another process in critical sections given these points?

Greetings,

Andres Freund
On Wed, Aug 20, 2025 at 09:14:04AM -0400, Andres Freund wrote:
> On 2025-08-19 23:47:21 -0400, Tom Lane wrote:
>> Hm. It still makes me mighty uncomfortable, because the point of a
>> critical section is "crash the database if anything goes wrong during
>> this bit". Waiting for another process --- or thread --- greatly
>> increases the scope of ways for things to go wrong. So I'm not
>> exactly convinced that this aspect of the AIO architecture is
>> well-thought-out.
>
> I don't see the alternative:
>
> 1) Some IO is done in critical sections (e.g. WAL writes / flushes)
>
> 2) Sometimes we need to wait for already started IO in critical sections
> (also WAL)
>
> 3) With some ways of doing AIO the IO is offloaded to other processes, and
> thus waiting for the IO to complete always requires waiting for another
> process
>
> How could we avoid the need to wait for another process in critical sections
> given these points?

Yes, it comes down to the point that for some code paths we just cannot accept a soft failure: some IOs are critical enough that if they fail, the only thing we should and can do is to recover and replay based on the past IOs that we know did succeed and made it durably to disk. Having something that can be qualified as safe and efficient to use in a critical section for event broadcasting and waits would be really, really nice.

--
Michael
Attachments
> On 20 Aug 2025, at 00:55, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Andrey Borodin <x4mmm@yandex-team.ru> writes:
>> I believe there is a bug with PageIsAllVisible(page) && visibilitymap_clear(). But I cannot prove it with an injection point test. Because injection points rely on CondVar, that per se creates corruption in critical section. So I'm reading this discussion and wonder if CondVar will be fixed in some clever way or I'd better invent a new injection point wait mechanism.
>
> Yeah, I was coming to similar conclusions in the reply I just sent:
> we don't really want a policy that we can't put injection-point-based
> delays inside critical sections. So that infrastructure is leaving
> something to be desired.
>
> Having said that, the test script is also doing something we tell
> people not to do, namely SIGKILL the postmaster. Could we use
> SIGQUIT (immediate shutdown) instead?

I'm working backwards from corruptions I see on our production. And almost always I see stormbringers like OOM, power outage, or Debian scripts that (I think) do kill -9 when `service postgresql stop` takes too long.

Best regards, Andrey Borodin.
On Wed, Aug 20, 2025 at 3:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Having said that, we should in any case have a better story on > what WaitEventSetWait should do after detecting postmaster death. > So I'm all for trying to avoid the proc_exit path if we can > design a better answer. Yeah. I've posted a concept patch in a new thread: https://www.postgresql.org/message-id/flat/CA%2BhUKGKp0kTpummCPa97%2BWFJTm%2BuYzQ9Ex8UMdH8ZXkLwO0QgA%40mail.gmail.com
Hi, Tom! On Tue, Aug 19, 2025 at 10:50 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I'm inclined to think that we do want to prohibit WaitEventSetWait > inside a critical section --- it just seems like a bad idea all > around, even without considering this specific failure mode. > Therefore, I vote for reverting bc22dc0e0. Hopefully only > temporarily, but it's too late to figure out another way for v18, > and I don't think that bc22dc0e0 is such an essential improvement > that we can't afford to give it up for v18. I'm OK about this. Do you mind if I revert bc22dc0e0 myself? And let's retry it for v19. ------ Regards, Alexander Korotkov Supabase
On Fri, Aug 22, 2025 at 01:27:17AM +0300, Alexander Korotkov wrote: > I'm OK about this. Do you mind if I revert bc22dc0e0 myself? > And let's retry it for v19. Yes, agreed that it may be the best thing to do for v18 based on the information we have gathered until now. -- Michael
Attachments
Alexander Korotkov <aekorotkov@gmail.com> writes: > On Tue, Aug 19, 2025 at 10:50 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Therefore, I vote for reverting bc22dc0e0. Hopefully only >> temporarily, but it's too late to figure out another way for v18, >> and I don't think that bc22dc0e0 is such an essential improvement >> that we can't afford to give it up for v18. > I'm OK about this. Do you mind if I revert bc22dc0e0 myself? Not at all, go for it. > And let's retry it for v19. +1 regards, tom lane
On Fri, Aug 22, 2025 at 10:27 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> And let's retry it for v19.

+1

I'm hoping we can fix PM death handling soon, and then I assume this
can go straight back in without modification.  CVs are an essential
low-level synchronisation component that really should work in lots of
environments, but we need to straighten out some historical mistakes
and rough edges.  Commit cfdf4dc4 removed open-coded exits to fix a
lot of bugs of omission, but it failed to fully consider all the
consequences of "composition", ie hiding that behaviour out of sight.
We really need a sort of postmasterless PANIC here, and I am happy to
work on that (see new thread), not least because it aligns better with
the behaviour of a multithreaded server.  There, the answer is "what
other backends?" so I'd already been looking sideways at system states
including the lingering logger, lingering query execution and the
special but inherently flaky error reporting in a few spots.
[RMT hat]

On Thu, Aug 21, 2025 at 06:42:48PM -0400, Tom Lane wrote:
> Alexander Korotkov <aekorotkov@gmail.com> writes:
>> On Tue, Aug 19, 2025 at 10:50 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Therefore, I vote for reverting bc22dc0e0.  Hopefully only
>>> temporarily, but it's too late to figure out another way for v18,
>>> and I don't think that bc22dc0e0 is such an essential improvement
>>> that we can't afford to give it up for v18.
>
>> I'm OK about this.  Do you mind if I revert bc22dc0e0 myself?
>
> Not at all, go for it.

Now that this is reverted, can the related open item be marked as resolved?

--
nathan
On Mon, Aug 25, 2025 at 10:07:26AM -0500, Nathan Bossart wrote:
> Now that this is reverted, can the related open item be marked as resolved?

Since there has been no further discussion, I will go ahead and
resolve the open item.

--
nathan
On Tue, Aug 12, 2025 at 8:38 AM Kirill Reshke <reshkekirill@gmail.com> wrote:
> On Wed, 6 Aug 2025 at 20:00, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> >
> > Hi hackers!
> >
> > I was reviewing the patch about removing xl_heap_visible and found the VM\WAL machinery very interesting.
> > At Yandex we had several incidents with corrupted VM and on pgconf.dev colleagues from AWS confirmed that they saw something similar too.
> > So I toyed around and accidentally wrote a test that reproduces $subj.
> >
> > I think the corruption happens as follows:
> > 0. we create a table with one frozen tuple
> > 1. next heap_insert() clears VM bit and hangs immediately, nothing was logged yet
> > 2. VM buffer is flushed on disk with checkpointer or bgwriter
> > 3. primary is killed with -9
> > now we have a page that is ALL_VISIBLE\ALL_FROZEN on standby, but clear VM bits on primary
> > 4. subsequent insert does not set XLH_LOCK_ALL_FROZEN_CLEARED in its WAL record
> > 5. pg_visibility detects corruption
> >
> > Interestingly, in an off-list conversation Melanie explained me how ALL_VISIBLE is protected from this: WAL-logging depends on PD_ALL_VISIBLE heap page bit, not a state of the VM. But for ALL_FROZEN this is not a case:
> >
> > /* Clear only the all-frozen bit on visibility map if needed */
> > if (PageIsAllVisible(page) &&
> > visibilitymap_clear(relation, block, vmbuffer,
> > VISIBILITYMAP_ALL_FROZEN))
> > cleared_all_frozen = true; // this won't happen due to flushed VM buffer before a crash
> >
> > Anyway, the test reproduces corruption of both bits. And also reproduces selecting deleted data on standby.
> >
> > The test is not intended to be committed when we fix the problem, so some waits are simulated with sleep(1) and test is placed at modules/test_slru where it was easier to write. But if we ever want something like this - I can design a less hacky version. And, probably, more generic.
> >
> > Thanks!
> >
> >
> > Best regards, Andrey Borodin.
> >
> >
> >
>
> Attached reproduces the same but without any standby node. CHECKPOINT
> somehow manages to flush the heap page when the instance is kill-9-ed.
> As a result, we have inconsistency between heap and VM pages:
>
> ```
> reshke=# select * from pg_visibility('x');
> blkno | all_visible | all_frozen | pd_all_visible
> -------+-------------+------------+----------------
> 0 | t | t | f
> (1 row)
> ```
>
> Notice I moved INJECTION point one line above visibilitymap_clear.
> Without this change, such behaviour also reproduced, but with much
> less frequency.
BTW, I've tried this patch on the current master, where bc22dc0e0d was reverted. And it fails for me.
t/001_multixact.pl .. 1/?
# Failed test 'pg_check_frozen() observes corruption'
# at t/001_multixact.pl line 102.
# got: '(0,2)
# (0,3)
# (0,4)'
# expected: ''
# Failed test 'pg_check_visible() observes corruption'
# at t/001_multixact.pl line 103.
# got: '(0,2)
# (0,4)'
# expected: ''
# Failed test 'deleted data returned by select'
# at t/001_multixact.pl line 104.
# got: '2'
# expected: ''
# Looks like you failed 3 tests of 3.
t/001_multixact.pl .. Dubious, test returned 3 (wstat 768, 0x300)
Failed 3/3 subtests
Test Summary Report
-------------------
t/001_multixact.pl (Wstat: 768 Tests: 3 Failed: 3)
Failed tests: 1-3
Non-zero exit status: 3
Files=1, Tests=3, 2 wallclock secs ( 0.01 usr 0.00 sys + 0.09 cusr 0.27 csys = 0.37 CPU)
Result: FAIL
make: *** [../../../../src/makefiles/pgxs.mk:452: check] Error 1
Could you, please, recheck?
------
Regards,
Alexander Korotkov
Supabase
> On 3 Sep 2025, at 11:37, Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> Could you, please, recheck?

That patch also adds a CondVar sleep in a critical section. That patch is how we understood that such a sleep is dangerous.

The actual patch to detect the problem is much simpler:
```
diff --git a/src/backend/storage/ipc/waiteventset.c b/src/backend/storage/ipc/waiteventset.c
index 7c0e66900f9..e89e1d115cb 100644
--- a/src/backend/storage/ipc/waiteventset.c
+++ b/src/backend/storage/ipc/waiteventset.c
@@ -1044,6 +1044,7 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
 	long		cur_timeout = -1;

 	Assert(nevents > 0);
+	Assert(CritSectionCount == 0);

 	/*
 	 * Initialize timeout if requested.  We must record the current time so
```

Though it will fail in the multixact test.

Best regards, Andrey Borodin.
On Wed, Sep 3, 2025 at 9:47 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> > On 3 Sep 2025, at 11:37, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> >
> > Could you, please, recheck?
>
> That patch also adds a CondVar sleep in a critical section. That patch is how we understood that such a sleep is dangerous.
>
> The actual patch to detect the problem is much simpler:
> ```
> diff --git a/src/backend/storage/ipc/waiteventset.c
> b/src/backend/storage/ipc/waiteventset.c
> index 7c0e66900f9..e89e1d115cb 100644
> --- a/src/backend/storage/ipc/waiteventset.c
> +++ b/src/backend/storage/ipc/waiteventset.c
> @@ -1044,6 +1044,7 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
>  	long		cur_timeout = -1;
>
>  	Assert(nevents > 0);
> +	Assert(CritSectionCount == 0);
>
>  	/*
>  	 * Initialize timeout if requested.  We must record the current time so
> ```
>
> Though it will fail in the multixact test.

I thought that patch would allow reproducing the problem of bc22dc0e0d.
Sorry for the noise.

------
Regards,
Alexander Korotkov
Supabase
On Wed, Sep 3, 2025 at 11:28 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Wed, Sep 3, 2025 at 9:47 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> > > On 3 Sep 2025, at 11:37, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > >
> > > Could you, please, recheck?
> >
> > That patch also adds a CondVar sleep in a critical section. That patch is how we understood that such a sleep is dangerous.
> >
> > The actual patch to detect the problem is much simpler:
> > ```
> > diff --git a/src/backend/storage/ipc/waiteventset.c
> > b/src/backend/storage/ipc/waiteventset.c
> > index 7c0e66900f9..e89e1d115cb 100644
> > --- a/src/backend/storage/ipc/waiteventset.c
> > +++ b/src/backend/storage/ipc/waiteventset.c
> > @@ -1044,6 +1044,7 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
> >  	long		cur_timeout = -1;
> >
> >  	Assert(nevents > 0);
> > +	Assert(CritSectionCount == 0);
> >
> >  	/*
> >  	 * Initialize timeout if requested.  We must record the current time so
> > ```
> >
> > Though it will fail in the multixact test.
>
> I thought that patch would allow reproducing the problem of bc22dc0e0d.
> Sorry for the noise.

The attached is my attempt to rework bc22dc0e0d.

The problem discussed in this thread occurs because a condition variable ends the process with proc_exit(1) in the case of postmaster death, even in a critical section. However, the shared memory cleanup that happens after proc_exit(1) is called in a critical section leaves shared memory in an inconsistent state. Generally, I see two possible behaviors in this situation.

1) Exit without the shared memory cleanup (e.g., proc_exit(2) as proposed upthread). That would save us from data corruption caused by the inconsistent shared memory state. However, if our process has occupied shared resources (e.g., locks), they wouldn't be released. That might cause other processes to hang waiting for those resources.

2) Continue to wait after postmaster death. While bc22dc0e0d was the first case where we waited on a condition variable in a critical section, there are a lot of cases where we take LWLocks in a critical section. Unlike a condition variable, an LWLock will continue to wait in the case of postmaster death. The critical section will continue its operation until completion, and postmaster death will be handled afterwards. If we applied the same logic to an arbitrary condition variable, however, it might cause our process to get stuck if we're waiting for something from another process that would exit upon seeing postmaster death.

I think approach #2 is more appropriate for bc22dc0e0d, because in the critical section we only wait for other processes that are also in critical sections (so there is no risk they will exit immediately after postmaster death, leaving us stuck). I've implemented a patch where waiting on a condition variable is replaced with LWLock-style waiting on a semaphore. However, custom waiting code just for AdvanceXLInsertBuffer() doesn't look good. I believe we need some general solution. We might have a special kind of condition variable, a critical-section condition variable, where both waiting and signaling must be invoked only in a critical section. However, I dug into our Latch and WaitEventSet; it seems there are too many assumptions about postmaster death. So, a critical-section condition variable should probably be implemented on top of a semaphore. Any thoughts?

------
Regards,
Alexander Korotkov
Supabase
Attachments
> On 10 Sep 2025, at 15:25, Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> I think the approach #2 is more appropriate for bc22dc0e0d, because in
> the critical section we only wait for other processes also in the
> critical section (so, there is no risk they will exit immediately
> after postmaster death making us stuck). I've implemented a patch,
> where waiting on conditional variable is replaced with LWLock-style
> waiting on semaphore. However, custom waiting code just for
> AdvanceXLInsertBuffer() doesn't look good.

Well, at least I'd like to see a corruption-free solution for the injection point wait too.

> I believe we need some
> general solution. We might have a special kind of condition variable,
> a critical section condition variable, where both waiting and
> signaling must be invoked only in a critical section. However, I dig
> into our Latch and WaitEventSet, it seems there are too many
> assumptions about postmaster death. So, a critical section condition
> variable probably should be implemented on top of semaphore. Any
> thoughts?

We want Latch\WaitEventSet, but for critical sections. Is it easier to implement from scratch (from semaphores), or is it easier to fix and maintain the existing Latch\WaitEventSet?

Best regards, Andrey Borodin.
On Thu, Sep 11, 2025 at 12:00 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> > On 10 Sep 2025, at 15:25, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > I believe we need some
> > general solution. We might have a special kind of condition variable,
> > a critical section condition variable, where both waiting and
> > signaling must be invoked only in a critical section. However, I dig
> > into our Latch and WaitEventSet, it seems there are too many
> > assumptions about postmaster death. So, a critical section condition
> > variable probably should be implemented on top of semaphore. Any
> > thoughts?
>
> We want Latch\WaitEventSet, but for critical sections. Is it easier to implement from scratch (from semaphores), or is it easier to fix and maintain the existing Latch\WaitEventSet?

FWIW I'm working on a patch set that kills all backends without
releasing any locks when the postmaster exits.  Then CVs and other
latch-based stuff should be safe in this context.  Work was
interrupted by a vacation, but I hope to post something in the next
couple of days, over on that other thread I started...
On Thu, Sep 11, 2025 at 10:59:19AM +1200, Thomas Munro wrote:
> FWIW I'm working on a patch set that kills all backends without
> releasing any locks when the postmaster exits.  Then CVs and other
> latch-based stuff should be safe in this context.  Work was
> interrupted by a vacation, but I hope to post something in the next
> couple of days, over on that other thread I started...

Wow.  Thanks!
--
Michael
Attachments
On Thu, Sep 11, 2025 at 1:59 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Thu, Sep 11, 2025 at 12:00 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> > We want Latch\WaitEventSet, but for critical sections. Is it easier to implement from scratch (from semaphores), or is it easier to fix and maintain the existing Latch\WaitEventSet?
>
> FWIW I'm working on a patch set that kills all backends without
> releasing any locks when the postmaster exits.  Then CVs and other
> latch-based stuff should be safe in this context.  Work was
> interrupted by a vacation, but I hope to post something in the next
> couple of days, over on that other thread I started...

Thank you!  I'm looking forward to seeing it!

------
Regards,
Alexander Korotkov
Supabase
Hi, Thomas!

On Thu, Sep 11, 2025 at 1:59 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Thu, Sep 11, 2025 at 12:00 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> > We want Latch\WaitEventSet, but for critical sections. Is it easier to implement from scratch (from semaphores), or is it easier to fix and maintain the existing Latch\WaitEventSet?
>
> FWIW I'm working on a patch set that kills all backends without
> releasing any locks when the postmaster exits.  Then CVs and other
> latch-based stuff should be safe in this context.  Work was
> interrupted by a vacation, but I hope to post something in the next
> couple of days, over on that other thread I started...

How is it going?

------
Regards,
Alexander Korotkov
Supabase