Thread: VM corruption on standby
Hi hackers!

I was reviewing the patch about removing xl_heap_visible and found the VM/WAL machinery very interesting. At Yandex we had several incidents with corrupted VM, and at pgconf.dev colleagues from AWS confirmed that they saw something similar too. So I toyed around and accidentally wrote a test that reproduces $subj.

I think the corruption happens as follows:
0. we create a table with one frozen tuple
1. the next heap_insert() clears the VM bit and hangs immediately; nothing was logged yet
2. the VM buffer is flushed to disk by the checkpointer or bgwriter
3. the primary is killed with -9
Now we have a page that is ALL_VISIBLE/ALL_FROZEN on the standby, but clear VM bits on the primary.
4. a subsequent insert does not set XLH_LOCK_ALL_FROZEN_CLEARED in its WAL record
5. pg_visibility detects corruption

Interestingly, in an off-list conversation Melanie explained to me how ALL_VISIBLE is protected from this: WAL-logging depends on the PD_ALL_VISIBLE heap page bit, not on the state of the VM. But for ALL_FROZEN this is not the case:

    /* Clear only the all-frozen bit on visibility map if needed */
    if (PageIsAllVisible(page) &&
        visibilitymap_clear(relation, block, vmbuffer,
                            VISIBILITYMAP_ALL_FROZEN))
        cleared_all_frozen = true; // this won't happen due to the VM buffer flushed before the crash

Anyway, the test reproduces corruption of both bits. It also reproduces selecting deleted data on the standby.

The test is not intended to be committed when we fix the problem, so some waits are simulated with sleep(1) and the test is placed in modules/test_slru where it was easier to write. But if we ever want something like this, I can design a less hacky version. And, probably, a more generic one.

Thanks!

Best regards, Andrey Borodin.
Attachments
Hi Andrey,

> I was reviewing the patch about removing xl_heap_visible and found the VM/WAL machinery very interesting.
> At Yandex we had several incidents with corrupted VM and at pgconf.dev colleagues from AWS confirmed that they saw something similar too.
> So I toyed around and accidentally wrote a test that reproduces $subj.
>
> I think the corruption happens as follows:
> 0. we create a table with one frozen tuple
> 1. next heap_insert() clears VM bit and hangs immediately, nothing was logged yet
> 2. VM buffer is flushed on disk with checkpointer or bgwriter
> 3. primary is killed with -9
> now we have a page that is ALL_VISIBLE/ALL_FROZEN on standby, but clear VM bits on primary
> 4. subsequent insert does not set XLH_LOCK_ALL_FROZEN_CLEARED in its WAL record
> 5. pg_visibility detects corruption
>
> Interestingly, in an off-list conversation Melanie explained to me how ALL_VISIBLE is protected from this: WAL-logging depends on the PD_ALL_VISIBLE heap page bit, not the state of the VM. But for ALL_FROZEN this is not the case:
>
> /* Clear only the all-frozen bit on visibility map if needed */
> if (PageIsAllVisible(page) &&
>     visibilitymap_clear(relation, block, vmbuffer,
>                         VISIBILITYMAP_ALL_FROZEN))
>     cleared_all_frozen = true; // this won't happen due to flushed VM buffer before a crash
>
> Anyway, the test reproduces corruption of both bits. And also reproduces selecting deleted data on standby.

Great find. I executed your test on a pretty much regular Linux x64 machine and indeed it failed:

```
not ok 1 - pg_check_frozen() observes corruption
not ok 2 - pg_check_visible() observes corruption
not ok 3 - deleted data returned by select
1..3
# test failed
----------------------------------- stderr -----------------------------------
# Failed test 'pg_check_frozen() observes corruption'
# at /home/eax/projects/c/postgresql/src/test/modules/test_slru/t/001_multixact.pl line 110.
#          got: '(0,2)
# (0,3)
# (0,4)'
#     expected: ''
# Failed test 'pg_check_visible() observes corruption'
# at /home/eax/projects/c/postgresql/src/test/modules/test_slru/t/001_multixact.pl line 111.
#          got: '(0,2)
# (0,4)'
#     expected: ''
# Failed test 'deleted data returned by select'
# at /home/eax/projects/c/postgresql/src/test/modules/test_slru/t/001_multixact.pl line 112.
#          got: '2'
#     expected: ''
# Looks like you failed 3 tests of 3.
```

This is a tricky bug. Do you also have a proposal for a particular fix?

> The test is not intended to be committed when we fix the problem, so some waits are simulated with sleep(1) and the test is placed at modules/test_slru where it was easier to write. But if we ever want something like this - I can design a less hacky version. And, probably, more generic.

IMO - yes, we do need this regression test.
Hi,

> This is a tricky bug. Do you also have a proposal of a particular fix?

If my understanding is correct, we should make a WAL record with the XLH_LOCK_ALL_FROZEN_CLEARED flag *before* we modify the VM but within the same critical section (in order to avoid race conditions within the same backend).

A draft patch is attached. It makes the test pass and doesn't seem to break any other tests.

Thoughts?
Attachments
> > This is a tricky bug. Do you also have a proposal of a particular fix?
>
> If my understanding is correct, we should make a WAL record with the
> XLH_LOCK_ALL_FROZEN_CLEARED flag *before* we modify the VM but within
> the same critical section (in order to avoid race conditions within
> the same backend).

I meant instance, not backend. Sorry for the confusion.

> A draft patch is attached. It makes the test pass and doesn't seem to
> break any other tests.
Hi again,

> I meant instance, not backend. Sorry for confusion.

It looks like I completely misunderstood what START_CRIT_SECTION() / END_CRIT_SECTION() are for here. Simply ignore this part :)

Apologies for the noise.
Hi,

> If my understanding is correct, we should make a WAL record with the
> XLH_LOCK_ALL_FROZEN_CLEARED flag *before* we modify the VM but within
> the same critical section [...]
>
> A draft patch is attached. It makes the test pass and doesn't seem to
> break any other tests.
>
> Thoughts?

In order not to forget - assuming I'm not wrong about the cause of the issue, we might want to recheck the order of visibilitymap_* and XLog* calls in the following functions too:

- heap_multi_insert
- heap_delete
- heap_update
- heap_lock_tuple
- heap_lock_updated_tuple_rec

At a quick look, all the named functions modify the VM before making a corresponding WAL record. This can cause a similar issue:

1. VM modified
2. evicted asynchronously before logging
3. kill -9
4. different state of VM on primary and standby
> On 7 Aug 2025, at 17:09, Aleksander Alekseev <aleksander@tigerdata.com> wrote:
>
> If my understanding is correct, we should make a WAL record with the
> XLH_LOCK_ALL_FROZEN_CLEARED flag *before* we modify the VM but within
> the same critical section (in order to avoid race conditions within
> the same backend).

Well, the test passes because you moved the injection point to a very safe position. I can't comment on other aspects of moving visibilitymap_clear() around.

The approach seems viable to me, but before fixing anything I'd like to understand why PD_ALL_VISIBLE in the heap page header did not save the day.

Best regards, Andrey Borodin.
> On 7 Aug 2025, at 18:54, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> moved injection point to a very safe position.

BTW, your fix also fixes the ALL_FROZEN stuff, just because the WAL for the heap insert is already emitted by the time of the kill -9. I want to emphasize that it seems to me that the position of the injection point is not a hint, but rather coincidental.

I concur that all other users of visibilitymap_clear() will likely need to be fixed. But only when we have a good picture of what exactly is broken.

Best regards, Andrey Borodin.
Hi Andrey,

> the test passes because you moved injection point to a very safe position
> [...]
> I want to emphasize that it seems to me that position of injection point is not a hint, but rather coincidental.

Well, I wouldn't say that the test passes merely because the location of the injection point was moved. For sure it was moved, because the visibilitymap_clear() call was moved. Maybe I misunderstood the intent of the test. Wasn't it to call the injection point right after updating the VM?

I tried to place it between updating the WAL and updating the VM and the effect was the same - the test still passes. In any case we can place it anywhere we want if we agree to include the test in the final version of the patch.

> I concur that all other users of visibilitymap_clear() likely will need to be fixed.

Right, I realized there are a few places besides heapam.c that might need a change.

> The approach seems viable to me, but I'd like to have understanding why PD_ALL_VISIBLE in a heap page header did not save the day before fixing anything
> ... But only when we have a good picture what exactly is broken.

Agreed. I would especially like to know the opinion of somebody who has been hacking on Postgres longer than I have. Perhaps there was a good reason to update the VM *before* creating WAL records that I'm unaware of.
> On 7 Aug 2025, at 19:36, Aleksander Alekseev <aleksander@tigerdata.com> wrote:
>
> Maybe I misunderstood the intent of the test.

You understood my intent in writing the test precisely. But it fails not due to the bug I anticipated!

So far I have noticed that the test passes if I move the injection point before PageClearAllVisible(BufferGetPage(buffer)); or after writing WAL.

I also found that at the moment of the kill -9 the checkpointer flushes the heap page to disk despite the content lock. I haven't found who released the content lock though.

Best regards, Andrey Borodin.
> On 9 Aug 2025, at 18:28, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> I also found that at the moment of the kill -9 the checkpointer flushes the heap page to disk despite the content lock. I haven't found who released the content lock though.

I've written this message and understood: it's LWLockReleaseAll().

0. checkpointer is going to flush a heap buffer but waits on the content lock
1. client resets PD_ALL_VISIBLE on the page
2. postmaster is killed and commands the client to go down
3. client calls LWLockReleaseAll() at ProcKill() (?)
4. checkpointer flushes the buffer with the reset PD_ALL_VISIBLE that is not WAL-logged to the standby
5. subsequent deletes do not log resetting this bit
6. deleted data is observable on the standby with an IndexOnlyScan

Any idea how to fix this?

Best regards, Andrey Borodin.
On Thu, 7 Aug 2025 at 21:36, Aleksander Alekseev <aleksander@tigerdata.com> wrote:
> Perhaps there was a good
> reason to update the VM *before* creating WAL records I'm unaware of.

Looks like 503c730 intentionally does it this way; however, I have not yet fully understood the reasoning behind it.

--
Best regards,
Kirill Reshke
Hi Andrey,

> 0. checkpointer is going to flush a heap buffer but waits on content lock
> 1. client is resetting PD_ALL_VISIBLE from page
> 2. postmaster is killed and command client to go down
> 3. client calls LWLockReleaseAll() at ProcKill() (?)
> 4. checkpointer flushes buffer with reset PD_ALL_VISIBLE that is not WAL-logged to standby
> 5. subsequent deletes do not log resetting this bit
> 6. deleted data is observable on standby with IndexOnlyScan

Thanks for investigating this in more detail. If this is indeed what happens, it is a violation of the "log before changing" approach. For this reason we have PageHeaderData.pd_lsn, for instance - to make sure pages are evicted only *after* the record that changed them is written to disk (because WAL records can't be applied to pages from the future).

I guess the intent here could be an optimization of some sort, but the facts that 1. the instance can be killed at any time and 2. there might be replicas - were not considered.

> Any idea how to fix this?

IMHO: log the changes first, then allow the page to be evicted.
On Sun, 10 Aug 2025 at 01:55, Aleksander Alekseev <aleksander@tigerdata.com> wrote:
> For this reason we have PageHeaderData.pd_lsn for instance - to make sure
> pages are evicted only *after* the record that changed it is written
> to disk (because WAL records can't be applied to pages from the
> future).

We don't bump the LSN of the heap page when setting the visibility map bit.

> I guess the intent here could be to do an optimization of some sort
> but the facts that 1. the instance can be killed at any time and 2.
> there might be replicas - were not considered.
> IMHO: logging the changes first, then allowing to evict the page.

Clearing the VM before logging the changes was intentional [0]. So I assume we should not change the approach, but rather just tweak things a bit to make the whole thing work.

[0] https://www.postgresql.org/message-id/BANLkTimuLk4RHXSQHEEiYGbxiXp2mh5KCA%40mail.gmail.com

--
Best regards,
Kirill Reshke
> On 9 Aug 2025, at 23:54, Aleksander Alekseev <aleksander@tigerdata.com> wrote:
>
> IMHO: logging the changes first, then allowing to evict the page.

The VM and buffer-manager code does not allow a flush of a buffer until the changes are logged. The problem is that our crash-exiting system destroys the locks that protect the buffer from being flushed.

Best regards, Andrey Borodin.
On Wed, 6 Aug 2025 at 20:00, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> Hi hackers!
>
> I was reviewing the patch about removing xl_heap_visible and found the VM/WAL machinery very interesting.
> At Yandex we had several incidents with corrupted VM and at pgconf.dev colleagues from AWS confirmed that they saw something similar too.
> So I toyed around and accidentally wrote a test that reproduces $subj.
>
> I think the corruption happens as follows:
> 0. we create a table with one frozen tuple
> 1. next heap_insert() clears VM bit and hangs immediately, nothing was logged yet
> 2. VM buffer is flushed on disk with checkpointer or bgwriter
> 3. primary is killed with -9
> now we have a page that is ALL_VISIBLE/ALL_FROZEN on standby, but clear VM bits on primary
> 4. subsequent insert does not set XLH_LOCK_ALL_FROZEN_CLEARED in its WAL record
> 5. pg_visibility detects corruption
>
> Anyway, the test reproduces corruption of both bits. And also reproduces selecting deleted data on standby.
>
> The test is not intended to be committed when we fix the problem, so some waits are simulated with sleep(1) and the test is placed at modules/test_slru where it was easier to write. But if we ever want something like this - I can design a less hacky version. And, probably, more generic.
>
> Thanks!
>
> Best regards, Andrey Borodin.

The attached reproduces the same but without any standby node. CHECKPOINT somehow manages to flush the heap page when the instance is kill-9-ed.
As a result, we have an inconsistency between the heap and VM pages:

```
reshke=# select * from pg_visibility('x');
 blkno | all_visible | all_frozen | pd_all_visible
-------+-------------+------------+----------------
     0 | t           | t          | f
(1 row)
```

Notice I moved the INJECTION point one line above visibilitymap_clear. Without this change, such behaviour was also reproduced, but with much lower frequency.

--
Best regards,
Kirill Reshke
Attachments
On Tue, 12 Aug 2025 at 10:38, I wrote:
> CHECKPOINT
> somehow manages to flush the heap page when instance kill-9-ed.

This corruption does not reproduce without the CHECKPOINT call; however, I do not see any suspicious syscall that the CHECKPOINT process makes. It does not write anything to disk here, does it? PFA strace.

--
Best regards,
Kirill Reshke
Attachments
On Wed, 6 Aug 2025 at 20:00, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> Hi hackers!
>
> I was reviewing the patch about removing xl_heap_visible and found the VM/WAL machinery very interesting.
> At Yandex we had several incidents with corrupted VM and at pgconf.dev colleagues from AWS confirmed that they saw something similar too.

While this aims to find existing VM corruption (I mean, in PG <= 17), this reproducer does not seem to work on pg17. At least, I did not manage to reproduce this scenario on pg17.

This makes me think this exact corruption may be pg18-only. Is it possible that AIO is somehow involved here?

--
Best regards,
Kirill Reshke
On Tue, 12 Aug 2025 at 13:00, I wrote:
> While this aims to find existing VM corruption (i mean, in PG <= 17),
> this reproducer does not seem to work on pg17. At least, I did not
> manage to reproduce this scenario on pg17.
>
> This makes me think this exact corruption may be pg18-only. Is it
> possible that AIO is somehow involved here?

First of all, the "corruption" is reproducible with io_method = sync, so AIO is not under suspicion.

Then, I did a gdb session many times and ended up with the conclusion that this test is NOT a valid corruption reproducer. The thing is, when you involve injection point logic, due to how injection points are implemented, you allow postgres to enter the WaitLatch function, which has its own logic for postmaster death handling [0]. So, when we add an injection point here, we allow this sequence of events to happen:

1) INSERT enters `heap_insert`, modifies the HEAP page, reaches the injection point and hangs.
2) The CHECKPOINT process tries to FLUSH this page and waits.
3) kill -9 to postmaster.
4) INSERT wakes up on postmaster death, goes to [0] and releases all locks.
5) The CHECKPOINT-er flushes the HEAP page to disk, causing corruption.

This execution will NOT happen without injection points. So, overall, injection points are not suitable for this critical-section testing (I think).

==

Off-list Andrey sent me this patch:

```
diff --git a/src/backend/storage/ipc/waiteventset.c b/src/backend/storage/ipc/waiteventset.c
index 7c0e66900f9..e89e1d115cb 100644
--- a/src/backend/storage/ipc/waiteventset.c
+++ b/src/backend/storage/ipc/waiteventset.c
@@ -1044,6 +1044,7 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
 	long		cur_timeout = -1;
 
 	Assert(nevents > 0);
+	Assert(CritSectionCount == 0);
 
 	/*
 	 * Initialize timeout if requested.  We must record the current time so
```

The objective is to confirm our assumption that the WaitEventSetWait call ought not to occur during critical sections.
This patch causes `make check` to fail, indicating that this assumption is incorrect. The assertion breaks due to the AdvanceXLInsertBuffer call (which uses condvar logic) inside XLogInsertRecord.

I did not find any doc or other piece of information indicating whether WaitEventSetWait is allowed inside critical sections. But I do think this is bad, because we do not process interrupts during critical sections, so it is unclear to me why we should handle postmaster death any differently.

[0] https://github.com/postgres/postgres/blob/393e0d2314050576c9c039853fdabe7f685a4f47/src/backend/storage/ipc/waiteventset.c#L1260-L1261

--
Best regards,
Kirill Reshke
On Wed, 13 Aug 2025 at 16:15, I wrote:
> I did not find any doc or other piece of information indicating
> whether WaitEventSetWait and critical sections are allowed. But I do
> think this is bad, because we do not process interruptions during
> critical sections, so it is unclear to me why we should handle
> postmaster death any differently.

Maybe I'm very wrong about this, but I currently suspect there is corruption involving CHECKPOINT, a process in a CRIT section, and kill -9. The scenario I am trying to reproduce is the following:

1) Some process p1 locks some buffer (name it buf1), enters a CRIT section, calls MarkBufferDirty and hangs inside XLogInsert on a CondVar in (GetXLogBuffer -> AdvanceXLInsertBuffer).
2) CHECKPOINT (p2) starts and tries to FLUSH dirty buffers, awaiting the lock on buf1.
3) Postmaster is kill-9-ed.
4) The signal of postmaster death is delivered to p1; it wakes up in the WaitLatch/WaitEventSetWaitBlock functions, checks postmaster aliveness, and exits, releasing all locks.
5) p2 acquires the lock on buf1 and flushes it to disk.
6) The signal of postmaster death is delivered to p2; p2 exits.

And we now have a case where the buffer is flushed to disk, while the xlog record that describes this change never makes it to disk. This is very bad.

To be clear, I am trying to avoid the use of injection points to reproduce the corruption. I am not yet successful in this though.

--
Best regards,
Kirill Reshke
On Thu, 14 Aug 2025 at 10:41, Kirill Reshke <reshkekirill@gmail.com> wrote:
> The scenario I am trying to reproduce is the following:
>
> 1) Some process p1 locks some buffer (name it buf1), enters CRIT
> section, calls MarkBufferDirty and hangs inside XLogInsert on CondVar
> in (GetXLogBuffer -> AdvanceXLInsertBuffer).
> 2) CHECKPOINT (p2) starts and tries to FLUSH dirty buffers, awaiting lock on buf1
> 3) Postmaster kill-9-ed
> 4) signal of postmaster death delivered to p1, it wakes up in
> WaitLatch/WaitEventSetWaitBlock functions, checks postmaster
> aliveness, and exits releasing all locks.
> 5) p2 acquires locks on buf1 and flushes it to disk.
> 6) signal of postmaster death delivered to p2, p2 exits.

Andrey told me to create a CF entry and attach the fix, so I'm doing it.

[0] https://commitfest.postgresql.org/patch/5964/

--
Best regards,
Kirill Reshke
Attachments
Kirill Reshke <reshkekirill@gmail.com> writes:
> [ v1-0001-Do-not-exit-on-postmaster-death-ever-inside-CRIT-.patch ]

I do not like this patch one bit: it will replace one set of problems with another set, namely systems that fail to shut down.

I think the actual bug here is the use of proc_exit(1) after observing postmaster death. That is what creates the hazard, because it releases the locks that are preventing other processes from observing the inconsistent state in shared memory. Compare this to what we do, for example, on receipt of SIGQUIT:

	/*
	 * We DO NOT want to run proc_exit() or atexit() callbacks -- we're here
	 * because shared memory may be corrupted, so we don't want to try to
	 * clean up our transaction.  Just nail the windows shut and get out of
	 * town.  The callbacks wouldn't be safe to run from a signal handler,
	 * anyway.
	 *
	 * Note we do _exit(2) not _exit(0).  This is to force the postmaster into
	 * a system reset cycle if someone sends a manual SIGQUIT to a random
	 * backend.  This is necessary precisely because we don't clean up our
	 * shared memory state.  (The "dead man switch" mechanism in pmsignal.c
	 * should ensure the postmaster sees this as a crash, too, but no harm in
	 * being doubly sure.)
	 */
	_exit(2);

So I think the correct fix here is s/proc_exit(1)/_exit(2)/ in the places that are responding to postmaster death. There might be more than just WaitEventSetWaitBlock; I didn't look.

			regards, tom lane
> On 17 Aug 2025, at 17:33, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> So I think the correct fix here is s/proc_exit(1)/_exit(2)/ in the
> places that are responding to postmaster death.

+1. But should we _exit(2) only in a critical section, or always in case of postmaster death?

Another question that was bothering Kirill: do we want a test that naturally reproduces WaitEventSetWaitBlock() under a critical section so that we can observe a corruption? Or is it kind of obvious from the code that such things might happen?

The existing test adds WaitEventSetWaitBlock() via an injection point to a place where it was not present before. Though with the existing test at hand we can check that the fix cures WaitEventSetWaitBlock() against critical sections.

Best regards, Andrey Borodin.
Hi! Thank you for paying attention to this.

On Sun, 17 Aug 2025 at 19:33, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Kirill Reshke <reshkekirill@gmail.com> writes:
> > [ v1-0001-Do-not-exit-on-postmaster-death-ever-inside-CRIT-.patch ]
>
> I do not like this patch one bit: it will replace one set of problems
> with another set, namely systems that fail to shut down.

I did not observe this during my by-hand testing. I am under the impression that CRIT sections are something that backend (or other) postgres processes try to pass quickly. So, what this patch does is defer the process's reaction to postmaster death until the end of the CRIT section. The typical scenario here (as I understand it) is this:

(1) Process does its work, enters a CRIT section.
(2) Postmaster dies.
(3a) The signal of postmaster death (SIGPWR on my VM) is delivered to the process.
(3b) The process exits the CRIT section, then does CFI logic, observes postmaster death and quits.

This is why I wrote my patch the way I did. I mean, it is always possible for race conditions to occur which result in late signal delivery, so why bother at all?

> I think the actual bug here is the use of proc_exit(1) after
> observing postmaster death.

Agreed.

> So I think the correct fix here is s/proc_exit(1)/_exit(2)/ in the
> places that are responding to postmaster death. There might be
> more than just WaitEventSetWaitBlock; I didn't look.

Well, I see that patching this way will be a much safer way to fix the issue. I can see that making more conservative changes can be more beneficial (more future-proof and less bug-prone). I will take a detailed look and try to send a patch soon.

--
Best regards,
Kirill Reshke
On Mon, 18 Aug 2025 at 13:15, I wrote:
> > I do not like this patch one bit: it will replace one set of problems
> > with another set, namely systems that fail to shut down.
>
> I did not observe this during my by-hand testing.

I am sorry: I was wrong. This is exactly what happens in this test (modified 001_multixact.pl). To be precise, after the INSERT process exits, the CHECKPOINT process waits indefinitely in LWLockAcquire.

It looks like the reason proc_exit(1) releases all held lwlocks is that we use it to notify all lwlock contenders through shared memory about the state change, which would not be noticed otherwise, since we do not check for signals inside LWLockAcquire. It looks like we need to do something like ConditionVariableBroadcast(), but without the lwlock release, to notify all lwlock contenders and then exit(2).

===

As for the fix, I am now trying to make the attached work. The idea to "escalate" proc_exit to an immediate exit via syscall came to my mind from how elog(ERROR) behaves in CRIT sections (every elog(ERROR) effectively becomes elog(PANIC)).

--
Best regards,
Kirill Reshke
Attachments
Kirill Reshke <reshkekirill@gmail.com> writes:
> On Sun, 17 Aug 2025 at 19:33, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I do not like this patch one bit: it will replace one set of problems
>> with another set, namely systems that fail to shut down.

> I did not observe this during my by-hand testing. I am under the
> impression that CRIT sections are something that backend (or other)
> postgres processes try to pass quickly. So, what this patch is doing,
> is that it defers the process reaction to postmaster death until the
> end of the CRIT section.

Well, if you're inside WaitEventSetWaitBlock, you have little control over how long you're going to sit. There is a separate discussion to be had over whether we should prohibit calling that function inside a critical section. But I'm of the opinion that proc_exit is the wrong thing to use after seeing postmaster death, critical section or no. We should assume that system integrity is already compromised, and get out as fast as we can with as few side-effects as possible. It'll be up to the next generation of postmaster to try to clean up.

			regards, tom lane
On Tue, Aug 19, 2025 at 4:52 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> But I'm of the opinion that proc_exit
> is the wrong thing to use after seeing postmaster death, critical
> section or no. We should assume that system integrity is already
> compromised, and get out as fast as we can with as few side-effects
> as possible. It'll be up to the next generation of postmaster to
> try to clean up.

Then wouldn't backends blocked in LWLockAcquire(x) hang forever, after someone who holds x calls _exit()?

I don't know if there are other ways that LWLockReleaseAll() can lead to persistent corruption that won't be corrected by crash recovery, but this one is probably new since the following commit, explaining the failure to reproduce on v17:

commit bc22dc0e0ddc2dcb6043a732415019cc6b6bf683
Author: Alexander Korotkov <akorotkov@postgresql.org>
Date:   Wed Apr 2 12:44:24 2025 +0300

    Get rid of WALBufMappingLock

Any idea involving deferring the handling of PM death from here doesn't seem right: you'd keep waiting for the CV, but the backend that would wake you might have exited.

Hmm, I wonder if there could be a solution in between where we don't release the locks on PM exit, but we still wake the waiters so they can observe a new dead state in the lock word (or perhaps a shared postmaster_is_dead flag), and exit themselves.

Nice detective work Andrey and others! That's a complicated and rare interaction.
Hi! Thank you for paying attention to this.

On Tue, 19 Aug 2025 at 10:32, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> Then wouldn't backends blocked in LWLockAcquire(x) hang forever, after
> someone who holds x calls _exit()?
>
> I don't know if there are other ways that LWLockReleaseAll() can lead
> to persistent corruption that won't be corrected by crash recovery,
> but this one is probably new since the following commit, explaining
> the failure to reproduce on v17:
>
> commit bc22dc0e0ddc2dcb6043a732415019cc6b6bf683
> Author: Alexander Korotkov <akorotkov@postgresql.org>
> Date:   Wed Apr 2 12:44:24 2025 +0300
>
>     Get rid of WALBufMappingLock
>
> Any idea involving deferring the handling of PM death from here
> doesn't seem right: you'd keep waiting for the CV, but the backend
> that would wake you might have exited.

OK.

> Hmm, I wonder if there could be a solution in between where we don't
> release the locks on PM exit, but we still wake the waiters so they
> can observe a new dead state in the lock word (or perhaps a shared
> postmaster_is_dead flag), and exit themselves.

Since yesterday I have been thinking about adding a new state bit for LWLockWaitState. Something like LW_WS_OWNER_DEAD, which would be set by the lwlock owner after observing PM death and then checked by contenders in LWLockAcquire.
So something like:

*lock holder in proc_exit(1)*

```
for all my lwlocks do:
    waiter->lwWaiting = LW_WS_OWNER_DEAD;
    PGSemaphoreUnlock(waiter->sem);
```

*lock contender in LWLockAttemptLock*

```
old_state = pg_atomic_read_u32(&lock->state);

/* loop until we've determined whether we could acquire the lock or not */
while (true)
{
    if (old_state & (1 << LW_WS_OWNER_DEAD))
        _exit(2)  /* or maybe proc_exit(1) */
    ....
    if (pg_atomic_compare_exchange_u32(&lock->state,
                                       &old_state, desired_state))
    ...
    /* retry */
}
```

I am not sure this idea is workable though.

--
Best regards,
Kirill Reshke
Thomas Munro <thomas.munro@gmail.com> writes:
> On Tue, Aug 19, 2025 at 4:52 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> But I'm of the opinion that proc_exit
>> is the wrong thing to use after seeing postmaster death, critical
>> section or no. We should assume that system integrity is already
>> compromised, and get out as fast as we can with as few side-effects
>> as possible. It'll be up to the next generation of postmaster to
>> try to clean up.

> Then wouldn't backends blocked in LWLockAcquire(x) hang forever, after
> someone who holds x calls _exit()?

If someone who holds x is killed by (say) the OOM killer, how do we get out of that?

			regards, tom lane
On Tue, 19 Aug 2025 at 11:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Thomas Munro <thomas.munro@gmail.com> writes:
> > Then wouldn't backends blocked in LWLockAcquire(x) hang forever, after
> > someone who holds x calls _exit()?
>
> If someone who holds x is killed by (say) the OOM killer, how do
> we get out of that?

+1, if we kill -9 the PM and then immediately kill -9 the lwlock holder, there is no way for the system to shut down (both HEAD and back branches). So we can ignore this case.

--
Best regards,
Kirill Reshke
On Tue, 19 Aug 2025 at 10:32, Thomas Munro <thomas.munro@gmail.com> wrote: > > I don't know if there are other ways that LWLockReleaseAll() can lead > to persistent corruption that won't be corrected by crash recovery, > but this one is probably new since the following commit, explaining > the failure to reproduce on v17: > > commit bc22dc0e0ddc2dcb6043a732415019cc6b6bf683 > Author: Alexander Korotkov <akorotkov@postgresql.org> > Date: Wed Apr 2 12:44:24 2025 +0300 > > Get rid of WALBufMappingLock > > Any idea involving deferring the handling of PM death from here > doesn't seem right: you'd keep waiting for the CV, but the backend > that would wake you might have exited. I reverted this commit (there were conflicts but I resolved them) and added an assert for crit sections in WaitEventSetWait. make check passes (without v2-0001 it fails) -- Best regards, Kirill Reshke
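For reference, the assertion in question is essentially a one-liner at the top of WaitEventSetWait() (a sketch; the exact placement and wording in the patch may differ):

```c
/* sketch: guard at the top of WaitEventSetWait() in
 * src/backend/storage/ipc/latch.c (placement is illustrative) */
int
WaitEventSetWait(WaitEventSet *set, long timeout,
                 WaitEvent *occurred_events, int nevents,
                 uint32 wait_event_info)
{
    /*
     * Waiting here inside a critical section is unsafe: on postmaster
     * death this path ends in proc_exit(1), which runs LWLockReleaseAll()
     * and can expose half-done, unlogged page changes to other backends.
     */
    Assert(CritSectionCount == 0);
    ...
}
```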
This thread is a candidate for [0] [0] https://wiki.postgresql.org/wiki/PostgreSQL_18_Open_Items
Best regards,
Kirill Reshke
On 8/19/25 11:14, Kirill Reshke wrote: > This thread is a candidate for [0] > > > [0] https://wiki.postgresql.org/wiki/PostgreSQL_18_Open_Items > Added, with Alexander as an owner (assuming it really is caused by commit bc22dc0e0d). regards -- Tomas Vondra
10.08.2025 08:45, Kirill Reshke writes: > On Sun, 10 Aug 2025 at 01:55, Aleksander Alekseev > <aleksander@tigerdata.com> wrote: >> For this reason we have PageHeaderData.pd_lsn for instance - to make sure >> pages are evicted only *after* the record that changed it is written >> to disk (because WAL records can't be applied to pages from the >> future). > > We don't bump the LSN of the heap page when setting the visibility > map bit. And that is very bad!!! In fact, I believe "incremental backup" is not safe because of that. >> I guess the intent here could be to do an optimization of some sort >> but the facts that 1. the instance can be killed at any time and 2. >> there might be replicas - were not considered. > >> IMHO: logging the changes first, then allowing to evict the page. > > Clearing the vm before the logging changes was intentional [0]. > So I assume we should not change the approach, but rather just tweak > things a bit to make the whole thing work. > > [0] https://www.postgresql.org/message-id/BANLkTimuLk4RHXSQHEEiYGbxiXp2mh5KCA%40mail.gmail.com Let's cite that message: On Fri, May 6, 2011 at 5:47 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote: > On Wed, Mar 30, 2011 at 8:52 AM, Heikki Linnakangas > <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote: >>> Another question: >>> To address the problem in >>> http://archives.postgresql.org/pgsql-hackers/2010-02/msg02097.php >>> , should we just clear the vm before the log of insert/update/delete? >>> This may reduce the performance, is there another solution? >> >> Yeah, that's a straightforward way to fix it. I don't think the performance >> hit will be too bad. But we need to be careful not to hold locks while doing >> I/O, which might require some rearrangement of the code. We might want to do >> a similar dance that we do in vacuum, and call visibilitymap_pin first, then >> lock and update the heap page, and then set the VM bit while holding the >> lock on the heap page. 
> > Here's an attempt at implementing the necessary gymnastics. So we see: Heikki said "lock and update the heap page, and then set the VM bit while holding the lock on the heap page". He didn't say "set the VM bit before logging the heap update". It is not clear why Robert put visibilitymap_clear before logging of the update, so your point that it "was intentional" is not valid. Calling visibilitymap_clear after logging but before exiting the critical section (as Aleksander Alekseev did in his suggested patch) is the correct way to do it. -- regards Yura Sokolov aka funny-falcon
09.08.2025 22:54, Kirill Reshke writes: > On Thu, 7 Aug 2025 at 21:36, Aleksander Alekseev > <aleksander@tigerdata.com> wrote: > >> Perhaps there was a good >> reason to update the VM *before* creating WAL records I'm unaware of. > > Looks like 503c730 intentionally does it this way; however, I have not > yet fully understood the reasoning behind it. I repeat: there was no intention. Neither in the commit message, nor in the discussion about it. There was an intention to move visibilitymap_clear under the heap page lock and into the critical section, but there was not a word about logging. I believe it was just an unfortunate oversight that the change was made before logging. -- regards Yura Sokolov aka funny-falcon
Hi, On 2025-08-19 15:56:05 +0300, Yura Sokolov wrote: > 09.08.2025 22:54, Kirill Reshke пишет: > > On Thu, 7 Aug 2025 at 21:36, Aleksander Alekseev > > <aleksander@tigerdata.com> wrote: > > > >> Perhaps there was a good > >> reason to update the VM *before* creating WAL records I'm unaware of. > > > > Looks like 503c730 intentionally does it this way; however, I have not > > yet fully understood the reasoning behind it. > > I repeat: there was no intention. Neither in commit message, nor in > discussion about. > > There was intention to move visibilitymap_clear under heap page lock and > into critical section, but there were no any word about logging. > > I believe, it was just an unfortunate oversight that the change is made > before logging. The normal pattern *is* to modify the buffer while holding an exclusive lock, in a critical section, before WAL logging. Look at src/backend/access/transam/README: > The general schema for executing a WAL-logged action is > > 1. Pin and exclusive-lock the shared buffer(s) containing the data page(s) > to be modified. > > 2. START_CRIT_SECTION() (Any error during the next three steps must cause a > PANIC because the shared buffers will contain unlogged changes, which we > have to ensure don't get to disk. Obviously, you should check conditions > such as whether there's enough free space on the page before you start the > critical section.) > > 3. Apply the required changes to the shared buffer(s). > > 4. Mark the shared buffer(s) as dirty with MarkBufferDirty(). (This must > happen before the WAL record is inserted; see notes in SyncOneBuffer().) > Note that marking a buffer dirty with MarkBufferDirty() should only > happen iff you write a WAL record; see Writing Hints below. > > 5. If the relation requires WAL-logging, build a WAL record using > XLogBeginInsert and XLogRegister* functions, and insert it. (See > "Constructing a WAL record" below). Then update the page's LSN using the > returned XLOG location. 
For instance, > > XLogBeginInsert(); > XLogRegisterBuffer(...) > XLogRegisterData(...) > recptr = XLogInsert(rmgr_id, info); > > PageSetLSN(dp, recptr); > > 6. END_CRIT_SECTION() > > 7. Unlock and unpin the buffer(s). Greetings, Andres Freund
On Tue, 19 Aug 2025 at 14:14, Kirill Reshke <reshkekirill@gmail.com> wrote: > > This thread is a candidate for [0] > > > [0]https://wiki.postgresql.org/wiki/PostgreSQL_18_Open_Items > Let me summarize this thread for ease of understanding what's going on: Timeline: 1) Andrey Borodin sends a patch (on 6 Aug) claiming there is corruption in VM bits. 2) We investigated; the problem is not with how PostgreSQL modifies buffers or logs changes, but with LWLockReleaseAll in proc_exit(1) after kill -9 of the PM 3) We have reached the conclusion that there is no corruption, and that injection points are not a valid way to reproduce it, because of WaitLatch and friends. 4) But we now suspect there is another corruption with ANY critical section in this scenario: I wrote: > Maybe I'm very wrong about this, but I'm currently suspecting there is > corruption involving CHECKPOINT, process in CRIT section and kill -9. >1) Some process p1 locks some buffer (name it buf1), enters CRIT >section, calls MarkBufferDirty and hangs inside XLogInsert on CondVar >in (GetXLogBuffer -> AdvanceXLInsertBuffer). >2) CHECKPOINT (p2) starts and tries to FLUSH dirty buffers, awaiting lock on buf1 >3) Postmaster kill-9-ed >4) signal of postmaster death delivered to p1, it wakes up in >WaitLatch/WaitEventSetWaitBlock functions, checks postmaster >aliveness, and exits releasing all locks. >5) p2 acquires locks on buf1 and flushes it to disk. >6) signal of postmaster death delivered to p2, p2 exits. 5) We create an open item for pg18 and propose reverting bc22dc0e0ddc2dcb6043a732415019cc6b6bf683 or fixing it quickly. Please note that the patches in this thread are NOT reproducers of corruption; as of today we have NO valid repro of corruption -- Best regards, Kirill Reshke
19.08.2025 16:09, Andres Freund writes: > Hi, > > On 2025-08-19 15:56:05 +0300, Yura Sokolov wrote: >> 09.08.2025 22:54, Kirill Reshke writes: >>> On Thu, 7 Aug 2025 at 21:36, Aleksander Alekseev >>> <aleksander@tigerdata.com> wrote: >>> >>>> Perhaps there was a good >>>> reason to update the VM *before* creating WAL records I'm unaware of. >>> >>> Looks like 503c730 intentionally does it this way; however, I have not >>> yet fully understood the reasoning behind it. >> >> I repeat: there was no intention. Neither in commit message, nor in >> discussion about. >> >> There was intention to move visibilitymap_clear under heap page lock and >> into critical section, but there were no any word about logging. >> >> I believe, it was just an unfortunate oversight that the change is made >> before logging. > > The normal pattern *is* to modify the buffer while holding an exclusive lock, > in a critical section, before WAL logging. Look at > src/backend/access/transam/README: > >> The general schema for executing a WAL-logged action is >> >> 1. Pin and exclusive-lock the shared buffer(s) containing the data page(s) >> to be modified. >> >> 2. START_CRIT_SECTION() (Any error during the next three steps must cause a >> PANIC because the shared buffers will contain unlogged changes, which we >> have to ensure don't get to disk. Obviously, you should check conditions >> such as whether there's enough free space on the page before you start the >> critical section.) >> >> 3. Apply the required changes to the shared buffer(s). >> >> 4. Mark the shared buffer(s) as dirty with MarkBufferDirty(). (This must >> happen before the WAL record is inserted; see notes in SyncOneBuffer().) >> Note that marking a buffer dirty with MarkBufferDirty() should only >> happen iff you write a WAL record; see Writing Hints below. >> >> 5. If the relation requires WAL-logging, build a WAL record using >> XLogBeginInsert and XLogRegister* functions, and insert it. 
(See >> "Constructing a WAL record" below). Then update the page's LSN using the >> returned XLOG location. For instance, >> >> XLogBeginInsert(); >> XLogRegisterBuffer(...) >> XLogRegisterData(...) >> recptr = XLogInsert(rmgr_id, info); >> >> PageSetLSN(dp, recptr); >> >> 6. END_CRIT_SECTION() >> >> 7. Unlock and unpin the buffer(s). There is a quite important step in this instruction: > Then update the page's LSN using the returned XLOG location. This step is violated for the call of visibilitymap_clear. Though probably I'm mistaken that this is the source of the bug. But it is really a source of other kinds of issues. -- regards Yura Sokolov aka funny-falcon
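For illustration, the ordering being argued for here would look roughly like the following (a simplified sketch of the pattern, not the actual heap code: visibilitymap_clear moved after XLogInsert but kept inside the critical section):

```c
START_CRIT_SECTION();
/* ... apply changes to the heap page ... */
MarkBufferDirty(buffer);
recptr = XLogInsert(RM_HEAP_ID, info);
PageSetLSN(page, recptr);
/* Only now clear the VM bits: a VM page flushed after this point
 * can never get ahead of the WAL record that justifies the clear. */
if (PageIsAllVisible(page))
    visibilitymap_clear(relation, block, vmbuffer,
                        VISIBILITYMAP_VALID_BITS);
END_CRIT_SECTION();
```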
19.08.2025 16:17, Kirill Reshke writes: > On Tue, 19 Aug 2025 at 14:14, Kirill Reshke <reshkekirill@gmail.com> wrote: >> >> This thread is a candidate for [0] >> >> >> [0]https://wiki.postgresql.org/wiki/PostgreSQL_18_Open_Items >> > > Let me summarize this thread for ease of understanding what's going on: > > Timeline: > 1) Andrey Borodin sends a patch (on 6 Aug) claiming there is > corruption in VM bits. > 2) We investigated; the problem is not with how PostgreSQL modifies buffers > or logs changes, but with LWLockReleaseAll in proc_exit(1) after > kill -9 of the PM > 3) We have reached the conclusion that there is no corruption, and > that injection points are not a valid way to reproduce it, because > of WaitLatch and friends. > > 4) But we now suspect there is another corruption with ANY critical > section in this scenario: > > I wrote: > >> Maybe I'm very wrong about this, but I'm currently suspecting there is >> corruption involving CHECKPOINT, process in CRIT section and kill -9. >> 1) Some process p1 locks some buffer (name it buf1), enters CRIT >> section, calls MarkBufferDirty and hangs inside XLogInsert on CondVar >> in (GetXLogBuffer -> AdvanceXLInsertBuffer). >> 2) CHECKPOINT (p2) starts and tries to FLUSH dirty buffers, awaiting lock on buf1 >> 3) Postmaster kill-9-ed >> 4) signal of postmaster death delivered to p1, it wakes up in >> WaitLatch/WaitEventSetWaitBlock functions, checks postmaster >> aliveness, and exits releasing all locks. >> 5) p2 acquires locks on buf1 and flushes it to disk. >> 6) signal of postmaster death delivered to p2, p2 exits. > > 5) We create an open item for pg18 and propose reverting > bc22dc0e0ddc2dcb6043a732415019cc6b6bf683 or fixing it quickly. Latch and ConditionVariable (which uses Latch) are among the basic synchronization primitives in PostgreSQL. Therefore they have to work correctly anywhere: in a critical section, in WAL logging, etc. The current behavior of WaitEventSetWaitBlock is certainly a bug and it ought to be fixed. 
So +1 for _exit(2) as Tom suggested. -- regards Yura Sokolov aka funny-falcon
On Tue, 19 Aug 2025 at 18:29, Yura Sokolov <y.sokolov@postgrespro.ru> wrote: > Latch and ConditionVariable (which uses Latch) are among the basic > synchronization primitives in PostgreSQL. Sure > Therefore they have to work correctly anywhere: in a critical section, in > WAL logging, etc. No. Before bc22dc0e0ddc2dcb6043a732415019cc6b6bf683 the ConditionVariable code path was never exercised in critical sections. After bc22dc0e0ddc2dcb6043a732415019cc6b6bf683 it is exercised in almost every one (if the system is highly loaded). This is a crucial change with corruption as a drawback (until we fix this). Replacing proc_exit(1) with _exit(2) is not a cure either: if we exit inside a CRIT section without any message to LWLock contenders, they will never do the same (never exit), because they are waiting on the semaphore and do not respond to signals (so the only way to stop them is to kill -9). Before bc22dc0e0ddc2dcb6043a732415019cc6b6bf683 lwlock holders did not exit inside crit sections (unless kill -9-ed). I had one suggestion about what can be done [0]. However there is little to no time until the pg18 release for a change this scary and big (my own understanding), so the safest option is to revert. [0] https://www.postgresql.org/message-id/CALdSSPgDAyqt%3DORyLMWMpotb9V4Jk1Am%2Bhe39mNtBA8%2Ba8TQDw%40mail.gmail.com -- Best regards, Kirill Reshke
Hi, On 2025-08-19 02:13:43 -0400, Tom Lane wrote: > Thomas Munro <thomas.munro@gmail.com> writes: > > On Tue, Aug 19, 2025 at 4:52 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> But I'm of the opinion that proc_exit > >> is the wrong thing to use after seeing postmaster death, critical > >> section or no. We should assume that system integrity is already > >> compromised, and get out as fast as we can with as few side-effects > >> as possible. It'll be up to the next generation of postmaster to > >> try to clean up. > > > Then wouldn't backends blocked in LWLockAcquire(x) hang forever, after > > someone who holds x calls _exit()? > > If someone who holds x is killed by (say) the OOM killer, how do > we get out of that? On linux - the primary OS with OOM killer troubles - I'm pretty sure lwlock waiters would get killed due to the postmaster death signal we've configured (c.f. PostmasterDeathSignalInit()). A long while back I had experimented with replacing waiting on semaphores (within lwlocks) with a latch wait. IIRC it was a bit slower under heavy contention, but that vanished when adding some adaptive spinning to lwlocks - which is also what we need to make it more feasible to replace some of the remaining spinlocks... Greetings, Andres Freund
On Wed, Aug 20, 2025 at 1:56 AM Andres Freund <andres@anarazel.de> wrote: > On 2025-08-19 02:13:43 -0400, Tom Lane wrote: > > > Then wouldn't backends blocked in LWLockAcquire(x) hang forever, after > > > someone who holds x calls _exit()? > > > > If someone who holds x is killed by (say) the OOM killer, how do > > we get out of that? If a backend is killed by the OOM killer, the postmaster will of course send SIGQUIT/SIGKILL to all backends. If the postmaster itself is killed, then surviving backends will notice at their next WaitEventSetWait() and exit, but if any are blocked in sem_wait(), they will only make progress because other exiting backends release their LWLocks on their way out. So if we change that to _exit(), I assume such backends would linger forever in sem_wait() after the postmaster dies. I do agree that it seems quite weird to release all locks as if this is a "normal" exit though, which is why Kirill and I both wondered about other ways to boot them out of sem_wait()... > On linux - the primary OS with OOM killer troubles - I'm pretty sure lwlock > waiters would get killed due to the postmaster death signal we've configured > (c.f. PostmasterDeathSignalInit()). No, that has a handler that just sets a global variable. That was done because recovery used to try to read() from the postmaster pipe after replaying every record. Also we currently have some places that don't want to be summarily killed (off the top of my head, syncrep wants to send a special error message, and the logger wants to survive longer than everyone else to catch as much output as possible, things I've been thinking about in the context of threads).
Hi, On 2025-08-20 02:54:09 +1200, Thomas Munro wrote: > > On linux - the primary OS with OOM killer troubles - I'm pretty sure'll lwlock > > waiters would get killed due to the postmaster death signal we've configured > > (c.f. PostmasterDeathSignalInit()). > > No, that has a handler that just sets a global variable. That was > done because recovery used to try to read() from the postmaster pipe > after replaying every record. Also we currently have some places that > don't want to be summarily killed (off the top of my head, syncrep > wants to send a special error message, and the logger wants to survive > longer than everyone else to catch as much output as possible, things > I've been thinking about in the context of threads). That makes no sense. We should just _exit(). If postmaster has been killed, trying to stay up longer just makes everything more fragile. Waiting for the logger is *exactly* what we should *not* do - what if the logger also crashed? There's no postmaster around to start it. Greetings, Andres Freund
On Wed, Aug 20, 2025 at 2:57 AM Andres Freund <andres@anarazel.de> wrote: > On 2025-08-20 02:54:09 +1200, Thomas Munro wrote: > > > On linux - the primary OS with OOM killer troubles - I'm pretty sure'll lwlock > > > waiters would get killed due to the postmaster death signal we've configured > > > (c.f. PostmasterDeathSignalInit()). > > > > No, that has a handler that just sets a global variable. That was > > done because recovery used to try to read() from the postmaster pipe > > after replaying every record. Also we currently have some places that > > don't want to be summarily killed (off the top of my head, syncrep > > wants to send a special error message, and the logger wants to survive > > longer than everyone else to catch as much output as possible, things > > I've been thinking about in the context of threads). > > That makes no sense. We should just _exit(). If postmaster has been killed, > trying to stay up longer just makes everything more fragile. Waiting for the > logger is *exactly* what we should *not* do - what if the logger also crashed? > There's no postmaster around to start it. Nobody is waiting for the logger. The logger waits for everyone else to exit first to collect forensics: * Unlike all other postmaster child processes, we'll ignore postmaster * death because we want to collect final log output from all backends and * then exit last. We'll do that by running until we see EOF on the * syslog pipe, which implies that all other backends have exited * (including the postmaster). The syncrep case is a bit weirder: it wants to tell the user that syncrep is broken, so its own WaitEventSetWait() has WL_POSTMASTER_DEATH, but that's basically bogus because the backend can reach WaitEventSetWait(WL_EXIT_ON_PM_DEATH) in many other code paths. I've proposed nuking that before.
Hi, On 2025-08-20 03:19:38 +1200, Thomas Munro wrote: > On Wed, Aug 20, 2025 at 2:57 AM Andres Freund <andres@anarazel.de> wrote: > > On 2025-08-20 02:54:09 +1200, Thomas Munro wrote: > > > > On linux - the primary OS with OOM killer troubles - I'm pretty sure'll lwlock > > > > waiters would get killed due to the postmaster death signal we've configured > > > > (c.f. PostmasterDeathSignalInit()). > > > > > > No, that has a handler that just sets a global variable. That was > > > done because recovery used to try to read() from the postmaster pipe > > > after replaying every record. Also we currently have some places that > > > don't want to be summarily killed (off the top of my head, syncrep > > > wants to send a special error message, and the logger wants to survive > > > longer than everyone else to catch as much output as possible, things > > > I've been thinking about in the context of threads). > > > > That makes no sense. We should just _exit(). If postmaster has been killed, > > trying to stay up longer just makes everything more fragile. Waiting for the > > logger is *exactly* what we should *not* do - what if the logger also crashed? > > There's no postmaster around to start it. > > Nobody is waiting for the logger. Error messages that we might be printing will wait for logger if the pipe is full, no? > The logger waits for everyone else to exit first to collect forensics: > > * Unlike all other postmaster child processes, we'll ignore postmaster > * death because we want to collect final log output from all backends and > * then exit last. We'll do that by running until we see EOF on the > * syslog pipe, which implies that all other backends have exited > * (including the postmaster). > The syncrep case is a bit weirder: it wants to tell the user that > syncrep is broken, so its own WaitEventSetWait() has > WL_POSTMASTER_DEATH, but that's basically bogus because the backend > can reach WaitEventSetWait(WL_EXIT_ON_PM_DEATH) in many other code > paths. 
I've proposed nuking that before. Yea, that's just bogus. I think this is one more instance of "let's try hard to continue limping along" making things way more fragile than the simpler "let's just do crash-restart in the most normal way possible". Greetings, Andres Freund
19.08.2025 16:43, Kirill Reshke writes: > On Tue, 19 Aug 2025 at 18:29, Yura Sokolov <y.sokolov@postgrespro.ru> wrote: > > >> Latch and ConditionVariable (which uses Latch) are among the basic >> synchronization primitives in PostgreSQL. > > Sure > >> Therefore they have to work correctly anywhere: in a critical section, in >> WAL logging, etc. > > No. Before bc22dc0e0ddc2dcb6043a732415019cc6b6bf683 the ConditionVariable > code path was never exercised in critical sections. After > bc22dc0e0ddc2dcb6043a732415019cc6b6bf683 it is exercised in almost > every one (if the system is highly loaded). This is a crucial change > with corruption as a drawback (until we fix this). > > Replacing proc_exit(1) with _exit(2) is not a cure either: if we exit > inside a CRIT section without any message to LWLock contenders, they > will never do the same (never exit), because they are waiting on the > semaphore and do not respond to signals (so the only way to stop them > is to kill -9). Before bc22dc0e0ddc2dcb6043a732415019cc6b6bf683 lwlock > holders did not exit inside crit sections (unless kill -9-ed) That is not true. elog(PANIC) doesn't clear LWLocks. And XLogWrite, which could be called from AdvanceXLInsertBuffer, may call elog(PANIC) from several places. It doesn't lead to any error, because usually the postmaster is alive and it will kill -9 all its children if any one of them dies in a critical section. So the problem is that the postmaster is already killed with SIGKILL by definition of the issue. Documentation says [0]: > If at all possible, do not use SIGKILL to kill the main postgres server. > Doing so will prevent postgres from freeing the system resources (e.g., shared memory and semaphores) that it holds before terminating. Therefore if the postmaster is SIGKILL-ed, the administrator already has to take some action. 
So the issue Andrey Borodin raised is that, without any fix, Pg18 provides an inconsistency between the data and the WAL log: data could be written to the main fork and the vm fork although the WAL is not written yet IF POSTMASTER IS SIGKILL-ed. `if (CritSectionCount != 0) _exit(2); else proc_exit(1);` in WaitEventSetWaitBlock() solves the issue of inconsistency IF POSTMASTER IS SIGKILLED, and doesn't lead to any problem if the postmaster is not SIGKILL-ed (since the postmaster will SIGKILL its children). > I had one suggestion about what can be done [0]. However there is > little to no time until the pg18 release for a change this scary and big > (my own understanding), so the safest option is to revert. > > [0] https://www.postgresql.org/message-id/CALdSSPgDAyqt%3DORyLMWMpotb9V4Jk1Am%2Bhe39mNtBA8%2Ba8TQDw%40mail.gmail.com [0] https://www.postgresql.org/docs/17/app-postgres.html -- regards Yura Sokolov aka funny-falcon
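In context, this proposal amounts to something like the following in the postmaster-death branch (a simplified sketch; the real WaitEventSetWaitBlock code differs between the epoll/kqueue/poll implementations, and the exact field and function names around the WL_EXIT_ON_PM_DEATH handling may not match):

```c
/* sketch: postmaster-death handling in WaitEventSetWaitBlock() */
if (!PostmasterIsAliveInternal())
{
    if (set->exit_on_postmaster_death)
    {
        if (CritSectionCount != 0)
            _exit(2);   /* don't run proc_exit callbacks or
                         * LWLockReleaseAll with unlogged changes */
        proc_exit(1);   /* current behavior */
    }
}
```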
On Tue, 19 Aug 2025 at 21:16, Yura Sokolov <y.sokolov@postgrespro.ru> wrote: > > That is not true. > elog(PANIC) doesn't clear LWLocks. And XLogWrite, which could be called > from AdvanceXLInsertBuffer, may call elog(PANIC) from several places. > > It doesn't lead to any error, because usually the postmaster is alive and it > will kill -9 all its children if any one of them dies in a critical section. > > So the problem is that the postmaster is already killed with SIGKILL by definition > of the issue. > > Documentation says [0]: > > If at all possible, do not use SIGKILL to kill the main postgres server. > > Doing so will prevent postgres from freeing the system resources (e.g., > shared memory and semaphores) that it holds before terminating. > > Therefore if the postmaster is SIGKILL-ed, the administrator already has to take > some action. > There are surely many cases when a system reaches a state which can only be fixed by admin action. An elog(PANIC) in a CRIT section is very rare (and very probably means corruption already). A simpler example is to kill -9 the postmaster and then immediately kill -9 someone who holds an LWLock. The problem in pgv18 is that the probability of this state is much higher due to the aforementioned commit. It can happen with almost any OOM on highly loaded systems. -- Best regards, Kirill Reshke
On Tue, 19 Aug 2025 at 21:16, Yura Sokolov <y.sokolov@postgrespro.ru> wrote: > > > `if (CritSectionCount != 0) _exit(2); else proc_exit(1);` in > WaitEventSetWaitBlock() solves the issue of inconsistency IF POSTMASTER IS > SIGKILLED, and doesn't lead to any problem if the postmaster is not SIGKILL-ed > (since the postmaster will SIGKILL its children). > This fix was proposed in this thread. It fixes the inconsistency but it replaces one set of problems with another set, namely systems that fail to shut down. -- Best regards, Kirill Reshke
On Tue, 19 Aug 2025 at 20:24, Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2025-08-20 03:19:38 +1200, Thomas Munro wrote: > > On Wed, Aug 20, 2025 at 2:57 AM Andres Freund <andres@anarazel.de> wrote: > > > On 2025-08-20 02:54:09 +1200, Thomas Munro wrote: > > > > > On linux - the primary OS with OOM killer troubles - I'm pretty sure lwlock > > > > > waiters would get killed due to the postmaster death signal we've configured > > > > > (c.f. PostmasterDeathSignalInit()). > > > > > > > > No, that has a handler that just sets a global variable. That was > > > > done because recovery used to try to read() from the postmaster pipe > > > > after replaying every record. Also we currently have some places that > > > > don't want to be summarily killed (off the top of my head, syncrep > > > > wants to send a special error message, and the logger wants to survive > > > > longer than everyone else to catch as much output as possible, things > > > > I've been thinking about in the context of threads). > > > > > > That makes no sense. We should just _exit(). If postmaster has been killed, > > > trying to stay up longer just makes everything more fragile. Waiting for the > > > logger is *exactly* what we should *not* do - what if the logger also crashed? > > > There's no postmaster around to start it. > > > > Nobody is waiting for the logger. > > Error messages that we might be printing will wait for logger if the pipe is > full, no? I did some checking of crit sections for elog usage, and I did not really find any crit section that uses elog(elevel) with elevel < ERROR. But surely there are cases when we might be printing messages inside crit sections, or there can be such sections in backbranches/future branches. Anyway, this case, when we are hung indefinitely while waiting for an (already dead) logger, might be a rare one. The problem is, even without a logger, on current HEAD, we will fail to stop the system when the PM dies, and there is no simple fix. 
It would be very helpful if the LWLock implementation were done using latch waits; there would be no problem then. But we are where we are, so I can see there is sense in trying to notify other processes that we are going to die soon and that they need to do the same (through shared memory), and then _exit(1). -- Best regards, Kirill Reshke
Kirill Reshke <reshkekirill@gmail.com> writes: > On Tue, 19 Aug 2025 at 21:16, Yura Sokolov <y.sokolov@postgrespro.ru> wrote: >> `if (CritSectionCount != 0) _exit(2) else proc_exit(1)` in >> WaitEventSetWaitBlock () solves the issue of inconsistency IF POSTMASTER IS >> SIGKILLED, and doesn't lead to any problem, if postmaster is not SIGKILL-ed >> (since postmaster will SIGKILL its children). > This fix was proposed in this thread. It fixes inconsistency but it > replaces one set of problems with another set, namely systems that > fail to shut down. I think a bigger objection is that it'd result in two separate shutdown behaviors in what's already an extremely under-tested (and hard to test) scenario. I don't want to have to deal with the ensuing state-space explosion. I still think that proc_exit(1) is fundamentally the wrong thing to do if the postmaster is gone: that code path assumes that the cluster is still functional, which is at best shaky. I concur though that we'd have to do some more engineering work before _exit(2) would be a practical solution. In the meantime, it seems like this discussion point arises only because the presented test case is doing something that seems pretty unsafe, namely invoking WaitEventSet inside a critical section. We'd probably be best off to get back to the actual bug the thread started with, namely whether we aren't doing the wrong thing with VM-update order of operations. regards, tom lane
On Tue, 19 Aug 2025 at 23:08, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Kirill Reshke <reshkekirill@gmail.com> writes: > > On Tue, 19 Aug 2025 at 21:16, Yura Sokolov <y.sokolov@postgrespro.ru> wrote: > >> `if (CritSectionCount != 0) _exit(2) else proc_exit(1)` in > >> WaitEventSetWaitBlock () solves the issue of inconsistency IF POSTMASTER IS > >> SIGKILLED, and doesn't lead to any problem, if postmaster is not SIGKILL-ed > >> (since postmaster will SIGKILL its children). > > > This fix was proposed in this thread. It fixes inconsistency but it > > replaces one set of problems with another set, namely systems that > > fail to shut down. > > I think a bigger objection is that it'd result in two separate > shutdown behaviors in what's already an extremely under-tested > (and hard to test) scenario. I don't want to have to deal with > the ensuing state-space explosion. Agreed. > I still think that proc_exit(1) is fundamentally the wrong thing > to do if the postmaster is gone: that code path assumes that > the cluster is still functional, which is at best shaky. > I concur though that we'd have to do some more engineering work > before _exit(2) would be a practical solution. Agreed. > In the meantime, it seems like this discussion point arises > only because the presented test case is doing something that > seems pretty unsafe, namely invoking WaitEventSet inside a > critical section. Agreed. > We'd probably be best off to get back to the actual bug the > thread started with, namely whether we aren't doing the wrong > thing with VM-update order of operations. > > regards, tom lane My understanding is that there is no bug in the VM. At least not in the [0] test, because it uses an injection point in a CRIT section, making the server exit too early. So the behaviour with an inj point and without one are very different. The corruption we are looking for has no reproducer (see [1]). 
[0] https://www.postgresql.org/message-id/B3C69B86-7F82-4111-B97F-0005497BB745%40yandex-team.ru [1] https://www.postgresql.org/message-id/CALdSSPhGQ1xx10c2NaZgce8qmi%2BSuKFp6T1uWG_aZvPpvoJRkQ%40mail.gmail.com -- Best regards, Kirill Reshke
> On 19 Aug 2025, at 23:23, Kirill Reshke <reshkekirill@gmail.com> wrote:
>
>> We'd probably be best off to get back to the actual bug the
>> thread started with, namely whether we aren't doing the wrong
>> thing with VM-update order of operations.
>>
>> regards, tom lane
>
> My understanding is that there is no bug in the VM. At least not in
> [0] test, because it uses an injection point in the CRIT section,
> making the server exit too early.
> So, behaviour with inj point and without are very different.
> The corruption we are looking for has no reproducer (see [1]).

I believe there is a bug with PageIsAllVisible(page) && visibilitymap_clear(). But I cannot prove it with an injection point test. Because injection points rely on CondVar, that per se creates corruption in critical section. So I'm reading this discussion and wonder if CondVar will be fixed in some clever way or I'd better invent a new injection point wait mechanism.

Best regards, Andrey Borodin.
Kirill Reshke <reshkekirill@gmail.com> writes: > On Tue, 19 Aug 2025 at 10:32, Thomas Munro <thomas.munro@gmail.com> wrote: >> Any idea involving deferring the handling of PM death from here >> doesn't seem right: you'd keep waiting for the CV, but the backend >> that would wake you might have exited. Yeah. Taking the check for PM death out of here seems just about as hazardous as leaving it in :-(. Not something I want to mess with so late in the v18 cycle. > I revert this commit (these were conflicts but i resolved them) and > added assert for crit sections in WaitEventSetWait. Your patch still contains some conflict markers :-(. Attached is a corrected version, just to save other people the effort of fixing the diffs themselves. > make check passes (without v2-0001 it fails) While 'make check' is okay with this assertion, 'make check-world' still falls over if you have injection points enabled, because src/test/modules/test_slru/t/001_multixact.pl also had the cute idea of putting an injection-point wait inside a critical section. I did not find any other failures though. I'm inclined to think that we do want to prohibit WaitEventSetWait inside a critical section --- it just seems like a bad idea all around, even without considering this specific failure mode. Therefore, I vote for reverting bc22dc0e0. Hopefully only temporarily, but it's too late to figure out another way for v18, and I don't think that bc22dc0e0 is such an essential improvement that we can't afford to give it up for v18. However, we can't install the proposed assertion until we do something about that test_slru test. It seems like in general it would be sad if we can't put injection points inside critical sections, so I'm wondering if there's a way to re-implement the injection point "wait" functionality without depending on WaitEventSetWait. I would be willing to accept weaker safety guarantees in this context, since we only anticipate such cases being used in test scaffolding. 
regards, tom lane diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index e8909406686..7ffb2179151 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -303,6 +303,11 @@ static bool doPageWrites; * so it's a plain spinlock. The other locks are held longer (potentially * over I/O operations), so we use LWLocks for them. These locks are: * + * WALBufMappingLock: must be held to replace a page in the WAL buffer cache. + * It is only held while initializing and changing the mapping. If the + * contents of the buffer being replaced haven't been written yet, the mapping + * lock is released while the write is done, and reacquired afterwards. + * * WALWriteLock: must be held to write WAL buffers to disk (XLogWrite or * XLogFlush). * @@ -468,37 +473,21 @@ typedef struct XLogCtlData pg_atomic_uint64 logFlushResult; /* last byte + 1 flushed */ /* - * First initialized page in the cache (first byte position). - */ - XLogRecPtr InitializedFrom; - - /* - * Latest reserved for initialization page in the cache (last byte - * position + 1). + * Latest initialized page in the cache (last byte position + 1). * - * To change the identity of a buffer, you need to advance - * InitializeReserved first. To change the identity of a buffer that's + * To change the identity of a buffer (and InitializedUpTo), you need to + * hold WALBufMappingLock. To change the identity of a buffer that's * still dirty, the old page needs to be written out first, and for that * you need WALWriteLock, and you need to ensure that there are no * in-progress insertions to the page by calling * WaitXLogInsertionsToFinish(). */ - pg_atomic_uint64 InitializeReserved; - - /* - * Latest initialized page in the cache (last byte position + 1). - * - * InitializedUpTo is updated after the buffer initialization. After - * update, waiters got notification using InitializedUpToCondVar. 
- */ - pg_atomic_uint64 InitializedUpTo; - ConditionVariable InitializedUpToCondVar; + XLogRecPtr InitializedUpTo; /* * These values do not change after startup, although the pointed-to pages - * and xlblocks values certainly do. xlblocks values are changed - * lock-free according to the check for the xlog write position and are - * accompanied by changes of InitializeReserved and InitializedUpTo. + * and xlblocks values certainly do. xlblocks values are protected by + * WALBufMappingLock. */ char *pages; /* buffers for unwritten XLOG pages */ pg_atomic_uint64 *xlblocks; /* 1st byte ptr-s + XLOG_BLCKSZ */ @@ -821,9 +810,9 @@ XLogInsertRecord(XLogRecData *rdata, * fullPageWrites from changing until the insertion is finished. * * Step 2 can usually be done completely in parallel. If the required WAL - * page is not initialized yet, you have to go through AdvanceXLInsertBuffer, - * which will ensure it is initialized. But the WAL writer tries to do that - * ahead of insertions to avoid that from happening in the critical path. + * page is not initialized yet, you have to grab WALBufMappingLock to + * initialize it, but the WAL writer tries to do that ahead of insertions + * to avoid that from happening in the critical path. * *---------- */ @@ -2005,79 +1994,32 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic) XLogRecPtr NewPageEndPtr = InvalidXLogRecPtr; XLogRecPtr NewPageBeginPtr; XLogPageHeader NewPage; - XLogRecPtr ReservedPtr; int npages pg_attribute_unused() = 0; - /* - * We must run the loop below inside the critical section as we expect - * XLogCtl->InitializedUpTo to eventually keep up. The most of callers - * already run inside the critical section. Except for WAL writer, which - * passed 'opportunistic == true', and therefore we don't perform - * operations that could error out. - * - * Start an explicit critical section anyway though. 
- */ - Assert(CritSectionCount > 0 || opportunistic); - START_CRIT_SECTION(); + LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE); - /*-- - * Loop till we get all the pages in WAL buffer before 'upto' reserved for - * initialization. Multiple process can initialize different buffers with - * this loop in parallel as following. - * - * 1. Reserve page for initialization using XLogCtl->InitializeReserved. - * 2. Initialize the reserved page. - * 3. Attempt to advance XLogCtl->InitializedUpTo, + /* + * Now that we have the lock, check if someone initialized the page + * already. */ - ReservedPtr = pg_atomic_read_u64(&XLogCtl->InitializeReserved); - while (upto >= ReservedPtr || opportunistic) + while (upto >= XLogCtl->InitializedUpTo || opportunistic) { - Assert(ReservedPtr % XLOG_BLCKSZ == 0); + nextidx = XLogRecPtrToBufIdx(XLogCtl->InitializedUpTo); /* - * Get ending-offset of the buffer page we need to replace. - * - * We don't lookup into xlblocks, but rather calculate position we - * must wait to be written. If it was written, xlblocks will have this - * position (or uninitialized) + * Get ending-offset of the buffer page we need to replace (this may + * be zero if the buffer hasn't been used yet). Fall through if it's + * already written out. */ - if (ReservedPtr + XLOG_BLCKSZ > XLogCtl->InitializedFrom + XLOG_BLCKSZ * XLOGbuffers) - OldPageRqstPtr = ReservedPtr + XLOG_BLCKSZ - (XLogRecPtr) XLOG_BLCKSZ * XLOGbuffers; - else - OldPageRqstPtr = InvalidXLogRecPtr; - - if (LogwrtResult.Write < OldPageRqstPtr && opportunistic) + OldPageRqstPtr = pg_atomic_read_u64(&XLogCtl->xlblocks[nextidx]); + if (LogwrtResult.Write < OldPageRqstPtr) { /* - * If we just want to pre-initialize as much as we can without - * flushing, give up now. + * Nope, got work to do. If we just want to pre-initialize as much + * as we can without flushing, give up now. */ - upto = ReservedPtr - 1; - break; - } - - /* - * Attempt to reserve the page for initialization. 
Failure means that - * this page got reserved by another process. - */ - if (!pg_atomic_compare_exchange_u64(&XLogCtl->InitializeReserved, - &ReservedPtr, - ReservedPtr + XLOG_BLCKSZ)) - continue; - - /* - * Wait till page gets correctly initialized up to OldPageRqstPtr. - */ - nextidx = XLogRecPtrToBufIdx(ReservedPtr); - while (pg_atomic_read_u64(&XLogCtl->InitializedUpTo) < OldPageRqstPtr) - ConditionVariableSleep(&XLogCtl->InitializedUpToCondVar, WAIT_EVENT_WAL_BUFFER_INIT); - ConditionVariableCancelSleep(); - Assert(pg_atomic_read_u64(&XLogCtl->xlblocks[nextidx]) == OldPageRqstPtr); - - /* Fall through if it's already written out. */ - if (LogwrtResult.Write < OldPageRqstPtr) - { - /* Nope, got work to do. */ + if (opportunistic) + break; /* Advance shared memory write request position */ SpinLockAcquire(&XLogCtl->info_lck); @@ -2092,6 +2034,14 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic) RefreshXLogWriteResult(LogwrtResult); if (LogwrtResult.Write < OldPageRqstPtr) { + /* + * Must acquire write lock. Release WALBufMappingLock first, + * to make sure that all insertions that we need to wait for + * can finish (up to this same position). Otherwise we risk + * deadlock. + */ + LWLockRelease(WALBufMappingLock); + WaitXLogInsertionsToFinish(OldPageRqstPtr); LWLockAcquire(WALWriteLock, LW_EXCLUSIVE); @@ -2119,6 +2069,9 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic) */ pgstat_report_fixed = true; } + /* Re-acquire WALBufMappingLock and retry */ + LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE); + continue; } } @@ -2126,9 +2079,11 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic) * Now the next buffer slot is free and we can set it up to be the * next output page. 
*/ - NewPageBeginPtr = ReservedPtr; + NewPageBeginPtr = XLogCtl->InitializedUpTo; NewPageEndPtr = NewPageBeginPtr + XLOG_BLCKSZ; + Assert(XLogRecPtrToBufIdx(NewPageBeginPtr) == nextidx); + NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size) XLOG_BLCKSZ); /* @@ -2192,100 +2147,12 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic) */ pg_write_barrier(); - /*----- - * Update the value of XLogCtl->xlblocks[nextidx] and try to advance - * XLogCtl->InitializedUpTo in a lock-less manner. - * - * First, let's provide a formal proof of the algorithm. Let it be 'n' - * process with the following variables in shared memory: - * f - an array of 'n' boolean flags, - * v - atomic integer variable. - * - * Also, let - * i - a number of a process, - * j - local integer variable, - * CAS(var, oldval, newval) - compare-and-swap atomic operation - * returning true on success, - * write_barrier()/read_barrier() - memory barriers. - * - * The pseudocode for each process is the following. - * - * j := i - * f[i] := true - * write_barrier() - * while CAS(v, j, j + 1): - * j := j + 1 - * read_barrier() - * if not f[j]: - * break - * - * Let's prove that v eventually reaches the value of n. - * 1. Prove by contradiction. Assume v doesn't reach n and stucks - * on k, where k < n. - * 2. Process k attempts CAS(v, k, k + 1). 1). If, as we assumed, v - * gets stuck at k, then this CAS operation must fail. Therefore, - * v < k when process k attempts CAS(v, k, k + 1). - * 3. If, as we assumed, v gets stuck at k, then the value k of v - * must be achieved by some process m, where m < k. The process - * m must observe f[k] == false. Otherwise, it will later attempt - * CAS(v, k, k + 1) with success. - * 4. Therefore, corresponding read_barrier() (while j == k) on - * process m reached before write_barrier() of process k. But then - * process k attempts CAS(v, k, k + 1) after process m successfully - * incremented v to k, and that CAS operation must succeed. 
- * That leads to a contradiction. So, there is no such k (k < n) - * where v gets stuck. Q.E.D. - * - * To apply this proof to the code below, we assume - * XLogCtl->InitializedUpTo will play the role of v with XLOG_BLCKSZ - * granularity. We also assume setting XLogCtl->xlblocks[nextidx] to - * NewPageEndPtr to play the role of setting f[i] to true. Also, note - * that processes can't concurrently map different xlog locations to - * the same nextidx because we previously requested that - * XLogCtl->InitializedUpTo >= OldPageRqstPtr. So, a xlog buffer can - * be taken for initialization only once the previous initialization - * takes effect on XLogCtl->InitializedUpTo. - */ - pg_atomic_write_u64(&XLogCtl->xlblocks[nextidx], NewPageEndPtr); - - pg_write_barrier(); - - while (pg_atomic_compare_exchange_u64(&XLogCtl->InitializedUpTo, &NewPageBeginPtr, NewPageEndPtr)) - { - NewPageBeginPtr = NewPageEndPtr; - NewPageEndPtr = NewPageBeginPtr + XLOG_BLCKSZ; - nextidx = XLogRecPtrToBufIdx(NewPageBeginPtr); - - pg_read_barrier(); - - if (pg_atomic_read_u64(&XLogCtl->xlblocks[nextidx]) != NewPageEndPtr) - { - /* - * Page at nextidx wasn't initialized yet, so we can't move - * InitializedUpto further. It will be moved by backend which - * will initialize nextidx. - */ - ConditionVariableBroadcast(&XLogCtl->InitializedUpToCondVar); - break; - } - } + XLogCtl->InitializedUpTo = NewPageEndPtr; npages++; } - - END_CRIT_SECTION(); - - /* - * All the pages in WAL buffer before 'upto' were reserved for - * initialization. However, some pages might be reserved by concurrent - * processes. Wait till they finish initialization. 
- */ - while (upto >= pg_atomic_read_u64(&XLogCtl->InitializedUpTo)) - ConditionVariableSleep(&XLogCtl->InitializedUpToCondVar, WAIT_EVENT_WAL_BUFFER_INIT); - ConditionVariableCancelSleep(); - - pg_read_barrier(); + LWLockRelease(WALBufMappingLock); #ifdef WAL_DEBUG if (XLOG_DEBUG && npages > 0) @@ -5178,10 +5045,6 @@ XLOGShmemInit(void) pg_atomic_init_u64(&XLogCtl->logWriteResult, InvalidXLogRecPtr); pg_atomic_init_u64(&XLogCtl->logFlushResult, InvalidXLogRecPtr); pg_atomic_init_u64(&XLogCtl->unloggedLSN, InvalidXLogRecPtr); - - pg_atomic_init_u64(&XLogCtl->InitializeReserved, InvalidXLogRecPtr); - pg_atomic_init_u64(&XLogCtl->InitializedUpTo, InvalidXLogRecPtr); - ConditionVariableInit(&XLogCtl->InitializedUpToCondVar); } /* @@ -6205,8 +6068,7 @@ StartupXLOG(void) memset(page + len, 0, XLOG_BLCKSZ - len); pg_atomic_write_u64(&XLogCtl->xlblocks[firstIdx], endOfRecoveryInfo->lastPageBeginPtr + XLOG_BLCKSZ); - pg_atomic_write_u64(&XLogCtl->InitializedUpTo, endOfRecoveryInfo->lastPageBeginPtr + XLOG_BLCKSZ); - XLogCtl->InitializedFrom = endOfRecoveryInfo->lastPageBeginPtr; + XLogCtl->InitializedUpTo = endOfRecoveryInfo->lastPageBeginPtr + XLOG_BLCKSZ; } else { @@ -6215,10 +6077,8 @@ StartupXLOG(void) * let the first attempt to insert a log record to initialize the next * buffer. */ - pg_atomic_write_u64(&XLogCtl->InitializedUpTo, EndOfLog); - XLogCtl->InitializedFrom = EndOfLog; + XLogCtl->InitializedUpTo = EndOfLog; } - pg_atomic_write_u64(&XLogCtl->InitializeReserved, pg_atomic_read_u64(&XLogCtl->InitializedUpTo)); /* * Update local and shared status. 
This is OK to do without any locks diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt index 0be307d2ca0..5427da5bc1b 100644 --- a/src/backend/utils/activity/wait_event_names.txt +++ b/src/backend/utils/activity/wait_event_names.txt @@ -156,7 +156,6 @@ REPLICATION_SLOT_DROP "Waiting for a replication slot to become inactive so it c RESTORE_COMMAND "Waiting for <xref linkend="guc-restore-command"/> to complete." SAFE_SNAPSHOT "Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFERRABLE</literal> transaction." SYNC_REP "Waiting for confirmation from a remote server during synchronous replication." -WAL_BUFFER_INIT "Waiting on WAL buffer to be initialized." WAL_RECEIVER_EXIT "Waiting for the WAL receiver to exit." WAL_RECEIVER_WAIT_START "Waiting for startup process to send initial data for streaming replication." WAL_SUMMARY_READY "Waiting for a new WAL summary to be generated." @@ -316,6 +315,7 @@ XidGen "Waiting to allocate a new transaction ID." ProcArray "Waiting to access the shared per-process data structures (typically, to get a snapshot or report a session's transaction ID)." SInvalRead "Waiting to retrieve messages from the shared catalog invalidation queue." SInvalWrite "Waiting to add a message to the shared catalog invalidation queue." +WALBufMapping "Waiting to replace a page in WAL buffers." WALWrite "Waiting for WAL buffers to be written to disk." ControlFile "Waiting to read or update the <filename>pg_control</filename> file or create a new WAL file." MultiXactGen "Waiting to read or update shared multixact state."
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h index 208d2e3a8ed..06a1ffd4b08 100644 --- a/src/include/storage/lwlocklist.h +++ b/src/include/storage/lwlocklist.h @@ -38,7 +38,7 @@ PG_LWLOCK(3, XidGen) PG_LWLOCK(4, ProcArray) PG_LWLOCK(5, SInvalRead) PG_LWLOCK(6, SInvalWrite) -/* 7 was WALBufMapping */ +PG_LWLOCK(7, WALBufMapping) PG_LWLOCK(8, WALWrite) PG_LWLOCK(9, ControlFile) /* 10 was CheckpointLock */
Andrey Borodin <x4mmm@yandex-team.ru> writes:
> I believe there is a bug with PageIsAllVisible(page) && visibilitymap_clear(). But I cannot prove it with an injection point test. Because injection points rely on CondVar, that per se creates corruption in critical section. So I'm reading this discussion and wonder if CondVar will be fixed in some clever way or I'd better invent a new injection point wait mechanism.

Yeah, I was coming to similar conclusions in the reply I just sent: we don't really want a policy that we can't put injection-point-based delays inside critical sections. So that infrastructure is leaving something to be desired.

Having said that, the test script is also doing something we tell people not to do, namely SIGKILL the postmaster. Could we use SIGQUIT (immediate shutdown) instead?

regards, tom lane
On Tue, Aug 19, 2025 at 03:55:41PM -0400, Tom Lane wrote:
> Yeah, I was coming to similar conclusions in the reply I just sent:
> we don't really want a policy that we can't put injection-point-based
> delays inside critical sections. So that infrastructure is leaving
> something to be desired.

Yes, it doesn't make sense to restrict the use of injection point waits within critical sections. A simple switch that we could do is to rely on a clock-based check in the wait() routine, removing the condition variable part. It costs in responsiveness because the wakeup() routine would not be able to ping the wait() part to recheck the shmem counter immediately. But we could reduce the delay with a variable recheck timing, say doubling the time to recheck the counter after each loop, capped at a maximum of a couple of hundred ms so that it's still good enough on fast machines and does not stress slow machines too much. That costs a bit more in terms of clock calls and delay checks, but it would be low-level enough that the internal interrupts would not matter if we rely on syscalls, I guess? And we don't care about efficiency in this case.

--
Michael
Attachments
On Wed, Aug 20, 2025 at 7:50 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm inclined to think that we do want to prohibit WaitEventSetWait
> inside a critical section --- it just seems like a bad idea all
> around, even without considering this specific failure mode.

FWIW aio/README.md describes a case where we'd need to wait for an IO, which might involve a CV to wait for an IO worker to do something, in order to start writing WAL, which is in a CS. Starting IO involves calling pgaio_io_acquire(), and if it can't find a handle it calls pgaio_io_wait_for_free(). That's all hypothetical for now as v18 is only doing reads, but it's an important architectural principle. That makes me suspect this new edict can't be the final policy, even if v18 uses it to solve the immediate problem.

For v19 I think we should probably attack the original sin and make this work. Several mechanisms for unwedging LWLockAcquire() have been mentioned: (1) the existing LWLockReleaseAll(), which clearly makes bogus assumptions about system state and cannot stay, (2) some new thing that would sem_post() all the waiters having set flags that cause LWLockAcquire() to exit (ie a sort of multiplexing, making our semaphore-based locks inch towards latch-nature), (3) moving LWLock over to latches, so the wait would already be multiplexed with PM death detection, (4) having the parent death signal handler exit directly (unfortunately Linux and FreeBSD only*), (5) in multi-threaded prototype work, the whole process exits anyway taking all backend threads with it** which is a strong motivation to make multi-process mode act as much like that as possible, eg something that exits a lot more eagerly and hopefully preemptively than today.

* That's an IRIX invention picked up by Linux and FreeBSD, a sort of reverse SIGCHLD, and I've tried to recreate it for pure POSIX systems before.
(1) Maybe it's enough for any backend that learns of postmaster death to signal everyone else since they can't all be blocked in sig_wait() unless there is already a deadlock. (2) I once tried making the postmaster death pipe O_ASYNC so that the "owner" gets a signal when it becomes readable, but it turned out to require a separate postmaster pipe for every backend (not just a dup'd descriptor); perhaps this would be plausible if we already had a bidirectional postmaster control socket protocol and choose to give every backend process its own socket pair in MP mode, something I've been looking into for various other reasons.

** I've been studying the unusual logger case in this context and contemplated running it as a separate process even in MT mode, as its stated aim didn't sound crazy to me and I was humbly attempting to preserve that characteristic in MT mode. Another way to achieve MP/MT consistency is to decide that the MP design already isn't robust enough on full pipe and just nuke the logger like everything else. Reading Andres's earlier comments, I briefly wondered about a compromise where log senders would make a best effort to send nonblockingly when they know the postmaster is gone, but that's neither as reliable as whoever wrote that had in mind (and in their defence, the logger is basically independent of shared memory state so whether it should be exiting ASAP or draining final log statements is at least debatable; barring bugs, it's only going to block progress if your system is really hosed), nor free entirely of "limping along" syndrome as Andres argues quite compellingly, so I cancelled that thought.
On Wed, Aug 20, 2025 at 11:59 AM Thomas Munro <thomas.munro@gmail.com> wrote: > they can't all be > blocked in sig_wait() unless there is already a deadlock. s/sig_wait()/sem_wait()/
Thomas Munro <thomas.munro@gmail.com> writes: > On Wed, Aug 20, 2025 at 7:50 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I'm inclined to think that we do want to prohibit WaitEventSetWait >> inside a critical section --- it just seems like a bad idea all >> around, even without considering this specific failure mode. > FWIW aio/README.md describes a case where we'd need to wait for an IO, > which might involve a CV to wait for an IO worker to do something, in > order to start writing WAL, which is in a CS. Hm. It still makes me mighty uncomfortable, because the point of a critical section is "crash the database if anything goes wrong during this bit". Waiting for another process --- or thread --- greatly increases the scope of ways for things to go wrong. So I'm not exactly convinced that this aspect of the AIO architecture is well-thought-out. Having said that, we should in any case have a better story on what WaitEventSetWait should do after detecting postmaster death. So I'm all for trying to avoid the proc_exit path if we can design a better answer. regards, tom lane
On Wed, 20 Aug 2025 at 00:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Kirill Reshke <reshkekirill@gmail.com> writes:
> > I revert this commit (these were conflicts but i resolved them) and
> > added assert for crit sections in WaitEventSetWait.
>
> Your patch still contains some conflict markers :-(. Attached is
> a corrected version, just to save other people the effort of fixing
> the diffs themselves.

Oh, yes, it indeed contains some conflict markers, but surprisingly compiles without errors; that's why I missed it. Attached looks good to me.

> While 'make check' is okay with this assertion, 'make check-world'
> still falls over if you have injection points enabled, because
> src/test/modules/test_slru/t/001_multixact.pl also had the
> cute idea of putting an injection-point wait inside a critical
> section. I did not find any other failures though.

Okay, we probably need to reimplement the wait function in injection points without a latch, got it.

> I'm inclined to think that we do want to prohibit WaitEventSetWait
> inside a critical section --- it just seems like a bad idea all
> around, even without considering this specific failure mode.
> Therefore, I vote for reverting bc22dc0e0. Hopefully only
> temporarily, but it's too late to figure out another way for v18,
> and I don't think that bc22dc0e0 is such an essential improvement
> that we can't afford to give it up for v18.

+1

--
Best regards,
Kirill Reshke
Hi,

On 2025-08-19 23:47:21 -0400, Tom Lane wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > On Wed, Aug 20, 2025 at 7:50 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> I'm inclined to think that we do want to prohibit WaitEventSetWait
> >> inside a critical section --- it just seems like a bad idea all
> >> around, even without considering this specific failure mode.
>
> > FWIW aio/README.md describes a case where we'd need to wait for an IO,
> > which might involve a CV to wait for an IO worker to do something, in
> > order to start writing WAL, which is in a CS.
>
> Hm. It still makes me mighty uncomfortable, because the point of a
> critical section is "crash the database if anything goes wrong during
> this bit". Waiting for another process --- or thread --- greatly
> increases the scope of ways for things to go wrong. So I'm not
> exactly convinced that this aspect of the AIO architecture is
> well-thought-out.

I don't see the alternative:

1) Some IO is done in critical sections (e.g. WAL writes / flushes)

2) Sometimes we need to wait for already started IO in critical sections (also WAL)

3) With some ways of doing AIO the IO is offloaded to other processes, and thus waiting for the IO to complete always requires waiting for another process

How could we avoid the need to wait for another process in critical sections given these points?

Greetings,

Andres Freund
On Wed, Aug 20, 2025 at 09:14:04AM -0400, Andres Freund wrote:
> On 2025-08-19 23:47:21 -0400, Tom Lane wrote:
>> Hm. It still makes me mighty uncomfortable, because the point of a
>> critical section is "crash the database if anything goes wrong during
>> this bit". Waiting for another process --- or thread --- greatly
>> increases the scope of ways for things to go wrong. So I'm not
>> exactly convinced that this aspect of the AIO architecture is
>> well-thought-out.
>
> I don't see the alternative:
>
> 1) Some IO is done in critical sections (e.g. WAL writes / flushes)
>
> 2) Sometimes we need to wait for already started IO in critical sections
> (also WAL)
>
> 3) With some ways of doing AIO the IO is offloaded to other processes, and
> thus waiting for the IO to complete always requires waiting for another
> process
>
> How could we avoid the need to wait for another process in critical sections
> given these points?

Yes, it comes down to the point that for some code paths we just cannot accept a soft failure: some IOs are critical enough that if they fail, the only thing we should and can do is to recover and replay based on the past IOs that we know did succeed and made it durably to disk. Having something that can be qualified as safe and efficient to use in a critical section for event broadcasting and waits would be really, really nice.

--
Michael
Attachments
> On 20 Aug 2025, at 00:55, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Andrey Borodin <x4mmm@yandex-team.ru> writes:
>> I believe there is a bug with PageIsAllVisible(page) && visibilitymap_clear(). But I cannot prove it with an injection point test. Because injection points rely on CondVar, that per se creates corruption in critical section. So I'm reading this discussion and wonder if CondVar will be fixed in some clever way or I'd better invent a new injection point wait mechanism.
>
> Yeah, I was coming to similar conclusions in the reply I just sent:
> we don't really want a policy that we can't put injection-point-based
> delays inside critical sections. So that infrastructure is leaving
> something to be desired.
>
> Having said that, the test script is also doing something we tell
> people not to do, namely SIGKILL the postmaster. Could we use
> SIGQUIT (immediate shutdown) instead?

I'm working backwards from corruptions I see on our production. And almost always I see stormbringers like OOM, power outage, or Debian scripts that (I think) do kill -9 when `service postgresql stop` takes too long.

Best regards, Andrey Borodin.
On Wed, Aug 20, 2025 at 3:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Having said that, we should in any case have a better story on > what WaitEventSetWait should do after detecting postmaster death. > So I'm all for trying to avoid the proc_exit path if we can > design a better answer. Yeah. I've posted a concept patch in a new thread: https://www.postgresql.org/message-id/flat/CA%2BhUKGKp0kTpummCPa97%2BWFJTm%2BuYzQ9Ex8UMdH8ZXkLwO0QgA%40mail.gmail.com
Hi, Tom! On Tue, Aug 19, 2025 at 10:50 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I'm inclined to think that we do want to prohibit WaitEventSetWait > inside a critical section --- it just seems like a bad idea all > around, even without considering this specific failure mode. > Therefore, I vote for reverting bc22dc0e0. Hopefully only > temporarily, but it's too late to figure out another way for v18, > and I don't think that bc22dc0e0 is such an essential improvement > that we can't afford to give it up for v18. I'm OK about this. Do you mind if I revert bc22dc0e0 myself? And let's retry it for v19. ------ Regards, Alexander Korotkov Supabase
On Fri, Aug 22, 2025 at 01:27:17AM +0300, Alexander Korotkov wrote: > I'm OK about this. Do you mind if I revert bc22dc0e0 myself? > And let's retry it for v19. Yes, agreed that it may be the best thing to do for v18 based on the information we have gathered until now. -- Michael
Attachments
Alexander Korotkov <aekorotkov@gmail.com> writes: > On Tue, Aug 19, 2025 at 10:50 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Therefore, I vote for reverting bc22dc0e0. Hopefully only >> temporarily, but it's too late to figure out another way for v18, >> and I don't think that bc22dc0e0 is such an essential improvement >> that we can't afford to give it up for v18. > I'm OK about this. Do you mind if I revert bc22dc0e0 myself? Not at all, go for it. > And let's retry it for v19. +1 regards, tom lane
On Fri, Aug 22, 2025 at 10:27 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> And let's retry it for v19.

+1

I'm hoping we can fix PM death handling soon, and then I assume this
can go straight back in without modification.  CVs are an essential
low-level synchronisation component that really should work in lots of
environments, but we need to straighten out some historical mistakes
and rough edges.  Commit cfdf4dc4 removed open-coded exits to fix a
lot of bugs of omission, but it failed to fully consider all the
consequences of "composition", ie hiding that behaviour out of sight.
We really need a sort of postmasterless PANIC here, and I am happy to
work on that (see new thread), not least because it aligns better with
the behaviour of a multithreaded server.  There, the answer is "what
other backends?" so I'd already been looking sideways at system states
including the lingering logger, lingering query execution and the
special but inherently flaky error reporting in a few spots.
[RMT hat]

On Thu, Aug 21, 2025 at 06:42:48PM -0400, Tom Lane wrote:
> Alexander Korotkov <aekorotkov@gmail.com> writes:
>> On Tue, Aug 19, 2025 at 10:50 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Therefore, I vote for reverting bc22dc0e0.  Hopefully only
>>> temporarily, but it's too late to figure out another way for v18,
>>> and I don't think that bc22dc0e0 is such an essential improvement
>>> that we can't afford to give it up for v18.
>
>> I'm OK about this.  Do you mind if I revert bc22dc0e0 myself?
>
> Not at all, go for it.

Now that this is reverted, can the related open item be marked as resolved?

--
nathan
On Mon, Aug 25, 2025 at 10:07:26AM -0500, Nathan Bossart wrote:
> Now that this is reverted, can the related open item be marked as resolved?

Since there has been no further discussion, I will go ahead and
resolve the open item.

--
nathan
On Tue, Aug 12, 2025 at 8:38 AM Kirill Reshke <reshkekirill@gmail.com> wrote:
> On Wed, 6 Aug 2025 at 20:00, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> >
> > Hi hackers!
> >
> > I was reviewing the patch about removing xl_heap_visible and found the VM\WAL machinery very interesting.
> > At Yandex we had several incidents with corrupted VM and on pgconf.dev colleagues from AWS confirmed that they saw something similar too.
> > So I toyed around and accidentally wrote a test that reproduces $subj.
> >
> > I think the corruption happens as follows:
> > 0. we create a table with one frozen tuple
> > 1. next heap_insert() clears VM bit and hangs immediately, nothing was logged yet
> > 2. VM buffer is flushed on disk with checkpointer or bgwriter
> > 3. primary is killed with -9
> > now we have a page that is ALL_VISIBLE\ALL_FROZEN on standby, but clear VM bits on primary
> > 4. subsequent insert does not set XLH_LOCK_ALL_FROZEN_CLEARED in its WAL record
> > 5. pg_visibility detects corruption
> >
> > Interestingly, in an off-list conversation Melanie explained me how ALL_VISIBLE is protected from this: WAL-logging depends on PD_ALL_VISIBLE heap page bit, not a state of the VM. But for ALL_FROZEN this is not a case:
> >
> > /* Clear only the all-frozen bit on visibility map if needed */
> > if (PageIsAllVisible(page) &&
> > visibilitymap_clear(relation, block, vmbuffer,
> > VISIBILITYMAP_ALL_FROZEN))
> > cleared_all_frozen = true; // this won't happen due to flushed VM buffer before a crash
> >
> > Anyway, the test reproduces corruption of both bits. And also reproduces selecting deleted data on standby.
> >
> > The test is not intended to be committed when we fix the problem, so some waits are simulated with sleep(1) and test is placed at modules/test_slru where it was easier to write. But if we ever want something like this - I can design a less hacky version. And, probably, more generic.
> >
> > Thanks!
> >
> >
> > Best regards, Andrey Borodin.
> >
> >
> >
>
> Attached reproduces the same but without any standby node. CHECKPOINT
> somehow manages to flush the heap page when the instance is kill-9-ed.
> As a result, we have inconsistency between heap and VM pages:
>
> ```
> reshke=# select * from pg_visibility('x');
> blkno | all_visible | all_frozen | pd_all_visible
> -------+-------------+------------+----------------
> 0 | t | t | f
> (1 row)
> ```
>
> Notice I moved INJECTION point one line above visibilitymap_clear.
> Without this change, such behaviour also reproduced, but with much
> less frequency.
BTW, I've tried this patch on the current master, where bc22dc0e0d was reverted. And it fails for me.
t/001_multixact.pl .. 1/?
# Failed test 'pg_check_frozen() observes corruption'
# at t/001_multixact.pl line 102.
# got: '(0,2)
# (0,3)
# (0,4)'
# expected: ''
# Failed test 'pg_check_visible() observes corruption'
# at t/001_multixact.pl line 103.
# got: '(0,2)
# (0,4)'
# expected: ''
# Failed test 'deleted data returned by select'
# at t/001_multixact.pl line 104.
# got: '2'
# expected: ''
# Looks like you failed 3 tests of 3.
t/001_multixact.pl .. Dubious, test returned 3 (wstat 768, 0x300)
Failed 3/3 subtests
Test Summary Report
-------------------
t/001_multixact.pl (Wstat: 768 Tests: 3 Failed: 3)
Failed tests: 1-3
Non-zero exit status: 3
Files=1, Tests=3, 2 wallclock secs ( 0.01 usr 0.00 sys + 0.09 cusr 0.27 csys = 0.37 CPU)
Result: FAIL
make: *** [../../../../src/makefiles/pgxs.mk:452: check] Error 1
Could you, please, recheck?
------
Regards,
Alexander Korotkov
Supabase
> On 3 Sep 2025, at 11:37, Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> Could you, please, recheck?

That patch also adds a CondVar sleep in a critical section. That patch is how we understood that such a sleep is dangerous.

The actual patch to detect the problem is much simpler:
```
diff --git a/src/backend/storage/ipc/waiteventset.c b/src/backend/storage/ipc/waiteventset.c
index 7c0e66900f9..e89e1d115cb 100644
--- a/src/backend/storage/ipc/waiteventset.c
+++ b/src/backend/storage/ipc/waiteventset.c
@@ -1044,6 +1044,7 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
 	long		cur_timeout = -1;

 	Assert(nevents > 0);
+	Assert(CritSectionCount == 0);

 	/*
 	 * Initialize timeout if requested.  We must record the current time so
```

Though it will fail in the multixact test.

Best regards, Andrey Borodin.
On Wed, Sep 3, 2025 at 9:47 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> > On 3 Sep 2025, at 11:37, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> >
> > Could you, please, recheck?
>
> That patch also adds a CondVar sleep in a critical section. That patch is how we understood that such a sleep is dangerous.
>
> The actual patch to detect the problem is much simpler:
> ```
> diff --git a/src/backend/storage/ipc/waiteventset.c
> b/src/backend/storage/ipc/waiteventset.c
> index 7c0e66900f9..e89e1d115cb 100644
> --- a/src/backend/storage/ipc/waiteventset.c
> +++ b/src/backend/storage/ipc/waiteventset.c
> @@ -1044,6 +1044,7 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
>  	long		cur_timeout = -1;
>
>  	Assert(nevents > 0);
> +	Assert(CritSectionCount == 0);
>
>  	/*
>  	 * Initialize timeout if requested.  We must record the current time so
> ```
>
> Though it will fail in the multixact test.

I thought that patch would allow reproducing the problem of bc22dc0e0d.
Sorry for the noise.

------
Regards,
Alexander Korotkov
Supabase
On Wed, Sep 3, 2025 at 11:28 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Wed, Sep 3, 2025 at 9:47 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> > > On 3 Sep 2025, at 11:37, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > >
> > > Could you, please, recheck?
> >
> > That patch also adds a CondVar sleep in a critical section. That patch is how we understood that such a sleep is dangerous.
> >
> > The actual patch to detect the problem is much simpler:
> > ```
> > diff --git a/src/backend/storage/ipc/waiteventset.c
> > b/src/backend/storage/ipc/waiteventset.c
> > index 7c0e66900f9..e89e1d115cb 100644
> > --- a/src/backend/storage/ipc/waiteventset.c
> > +++ b/src/backend/storage/ipc/waiteventset.c
> > @@ -1044,6 +1044,7 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
> >  	long		cur_timeout = -1;
> >
> >  	Assert(nevents > 0);
> > +	Assert(CritSectionCount == 0);
> >
> >  	/*
> >  	 * Initialize timeout if requested.  We must record the current time so
> > ```
> >
> > Though it will fail in the multixact test.
>
> I thought that patch would allow reproducing the problem of bc22dc0e0d.
> Sorry for the noise.

The attached is my attempt to rework bc22dc0e0d.

The problem discussed in this thread occurs because a condition variable ends the process with proc_exit(1) in the case of postmaster death, even in a critical section. However, the shared memory cleanup that happens after proc_exit(1) is called in a critical section leaves shared memory in an inconsistent state. Generally, I see two possible behaviors in this situation.

1) Exit without the shared memory cleanup (e.g., proc_exit(2) as proposed upthread). That would save us from data corruption caused by the inconsistent shared memory state. However, if our process has occupied shared resources (e.g., locks), they wouldn't be released. That might cause other processes to hang waiting for those resources.

2) Continue to wait after postmaster death. While bc22dc0e0d was the first case where we waited on a condition variable in a critical section, there are a lot of cases where we take LWLocks in a critical section. Unlike a condition variable, an LWLock will continue to wait in the case of postmaster death. The critical section will continue its operation until completion, and postmaster death will be handled afterwards. If we applied the same logic to an arbitrary condition variable, however, it might cause our process to get stuck if we're waiting for something from another process that would exit upon seeing postmaster death.

I think approach #2 is more appropriate for bc22dc0e0d, because in the critical section we only wait for other processes that are also in critical sections (so there is no risk they will exit immediately after postmaster death, leaving us stuck). I've implemented a patch where waiting on a condition variable is replaced with LWLock-style waiting on a semaphore. However, custom waiting code just for AdvanceXLInsertBuffer() doesn't look good. I believe we need some general solution. We might have a special kind of condition variable, a critical-section condition variable, where both waiting and signaling must be invoked only in a critical section. However, I dug into our Latch and WaitEventSet; it seems there are too many assumptions about postmaster death. So, a critical-section condition variable should probably be implemented on top of a semaphore. Any thoughts?

------
Regards,
Alexander Korotkov
Supabase
Attachments
> On 10 Sep 2025, at 15:25, Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> I think the approach #2 is more appropriate for bc22dc0e0d, because in
> the critical section we only wait for other processes also in the
> critical section (so, there is no risk they will exit immediately
> after postmaster death making us stuck). I've implemented a patch,
> where waiting on conditional variable is replaced with LWLock-style
> waiting on semaphore. However, custom waiting code just for
> AdvanceXLInsertBuffer() doesn't look good.

Well, at least I'd like to see a corruption-free solution for the injection point wait too.

> I believe we need some
> general solution. We might have a special kind of condition variable,
> a critical section condition variable, where both waiting and
> signaling must be invoked only in a critical section. However, I dig
> into our Latch and WaitEventSet, it seems there are too many
> assumptions about postmaster death. So, a critical section condition
> variable probably should be implemented on top of semaphore. Any
> thoughts?

We want Latch\WaitEventSet, but for critical sections. Is it easier to implement from scratch (from semaphores), or is it easier to fix and maintain the existing Latch\WaitEventSet?

Best regards, Andrey Borodin.
On Thu, Sep 11, 2025 at 12:00 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> > On 10 Sep 2025, at 15:25, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > I believe we need some
> > general solution. We might have a special kind of condition variable,
> > a critical section condition variable, where both waiting and
> > signaling must be invoked only in a critical section. However, I dig
> > into our Latch and WaitEventSet, it seems there are too many
> > assumptions about postmaster death. So, a critical section condition
> > variable probably should be implemented on top of semaphore. Any
> > thoughts?
>
> We want Latch\WaitEventSet, but for critical sections. Is it easier to implement from scratch (from semaphores), or is it easier to fix and maintain the existing Latch\WaitEventSet?

FWIW I'm working on a patch set that kills all backends without
releasing any locks when the postmaster exits.  Then CVs and other
latch-based stuff should be safe in this context.  Work was
interrupted by a vacation, but I hope to post something in the next
couple of days, over on that other thread I started...
On Thu, Sep 11, 2025 at 10:59:19AM +1200, Thomas Munro wrote:
> FWIW I'm working on a patch set that kills all backends without
> releasing any locks when the postmaster exits.  Then CVs and other
> latch-based stuff should be safe in this context.  Work was
> interrupted by a vacation, but I hope to post something in the next
> couple of days, over on that other thread I started...

Wow.  Thanks!
--
Michael
Attachments
On Thu, Sep 11, 2025 at 1:59 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Thu, Sep 11, 2025 at 12:00 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> > We want Latch\WaitEventSet, but for critical sections. Is it easier to implement from scratch (from semaphores), or is it easier to fix and maintain the existing Latch\WaitEventSet?
>
> FWIW I'm working on a patch set that kills all backends without
> releasing any locks when the postmaster exits.  Then CVs and other
> latch-based stuff should be safe in this context.  Work was
> interrupted by a vacation, but I hope to post something in the next
> couple of days, over on that other thread I started...

Thank you!  I'm looking forward to seeing it!

------
Regards,
Alexander Korotkov
Supabase
Hi, Thomas!

On Thu, Sep 11, 2025 at 1:59 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Thu, Sep 11, 2025 at 12:00 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> > We want Latch\WaitEventSet, but for critical sections. Is it easier to implement from scratch (from semaphores), or is it easier to fix and maintain the existing Latch\WaitEventSet?
>
> FWIW I'm working on a patch set that kills all backends without
> releasing any locks when the postmaster exits.  Then CVs and other
> latch-based stuff should be safe in this context.  Work was
> interrupted by a vacation, but I hope to post something in the next
> couple of days, over on that other thread I started...

How is it going?

------
Regards,
Alexander Korotkov
Supabase