Обсуждение: processes stuck in shutdown following OOM/recovery

Поиск
Список
Период
Сортировка

processes stuck in shutdown following OOM/recovery

От
Justin Pryzby
Дата:
If postgres starts, and one of its children is immediately killed, and
the cluster is also told to stop, then, instead, the whole system gets
wedged.

$ initdb -D ./pgdev.dat1
$ pg_ctl -D ./pgdev.dat1 start -o '-c port=5678'
$ kill -9 2524495; sleep 0.05; pg_ctl -D ./pgdev.dat1 stop -m fast # 2524495 is a child's pid
.......................................................... failed
pg_ctl: server does not shut down

$ ps -wwwf --ppid 2524494
UID          PID    PPID  C STIME TTY          TIME CMD
pryzbyj  2524552 2524494  0 20:47 ?        00:00:00 postgres: checkpointer 

(gdb) bt
#0  0x00007f0ce2d08c03 in epoll_wait (epfd=10, events=0x55cb4cbaac28, maxevents=1, timeout=timeout@entry=156481) at
../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x000055cb4c219208 in WaitEventSetWaitBlock (set=set@entry=0x55cb4cbaabc0, cur_timeout=cur_timeout@entry=156481,
occurred_events=occurred_events@entry=0x7ffd80130410,
 
    nevents=nevents@entry=1) at ../src/backend/storage/ipc/latch.c:1583
#2  0x000055cb4c219e02 in WaitEventSetWait (set=0x55cb4cbaabc0, timeout=timeout@entry=300000,
occurred_events=occurred_events@entry=0x7ffd80130410,nevents=nevents@entry=1, 
 
    wait_event_info=wait_event_info@entry=83886084) at ../src/backend/storage/ipc/latch.c:1529
#3  0x000055cb4c219f87 in WaitLatch (latch=<optimized out>, wakeEvents=wakeEvents@entry=41,
timeout=timeout@entry=300000,wait_event_info=wait_event_info@entry=83886084)
 
    at ../src/backend/storage/ipc/latch.c:539
#4  0x000055cb4c1aabc2 in CheckpointerMain () at ../src/backend/postmaster/checkpointer.c:523
#5  0x000055cb4c1a8207 in AuxiliaryProcessMain (auxtype=auxtype@entry=CheckpointerProcess) at
../src/backend/postmaster/auxprocess.c:153
#6  0x000055cb4c1ae63d in StartChildProcess (type=type@entry=CheckpointerProcess) at
../src/backend/postmaster/postmaster.c:5331
#7  0x000055cb4c1b07f3 in ServerLoop () at ../src/backend/postmaster/postmaster.c:1792
#8  0x000055cb4c1b1c56 in PostmasterMain (argc=argc@entry=5, argv=argv@entry=0x55cb4cbaa380) at
../src/backend/postmaster/postmaster.c:1466
#9  0x000055cb4c0f4c1b in main (argc=5, argv=0x55cb4cbaa380) at ../src/backend/main/main.c:198

I noticed this because of the counter-effective behavior of systemd+PGDG
unit files to run "pg_ctl stop" whenever a backend is killed for OOM:
https://www.postgresql.org/message-id/ZVI112aVNCHOQgfF@pryzbyj2023

This affects v15, and fails at 7ff23c6d27 but not its parent.

commit 7ff23c6d277d1d90478a51f0dd81414d343f3850 (HEAD)
Author: Thomas Munro <tmunro@postgresql.org>
Date:   Mon Aug 2 17:32:20 2021 +1200

    Run checkpointer and bgwriter in crash recovery.

-- 
Justin



Re: processes stuck in shutdown following OOM/recovery

От
Thomas Munro
Дата:
On Fri, Dec 1, 2023 at 6:13 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> $ kill -9 2524495; sleep 0.05; pg_ctl -D ./pgdev.dat1 stop -m fast # 2524495 is a child's pid

> This affects v15, and fails at 7ff23c6d27 but not its parent.

Repro'd here.  I had to make the sleep shorter on my system.  Looking...



Re: processes stuck in shutdown following OOM/recovery

От
Thomas Munro
Дата:
On Sat, Dec 2, 2023 at 2:18 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Fri, Dec 1, 2023 at 6:13 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > $ kill -9 2524495; sleep 0.05; pg_ctl -D ./pgdev.dat1 stop -m fast # 2524495 is a child's pid
>
> > This affects v15, and fails at ) but not its parent.
>
> Repro'd here.  I had to make the sleep shorter on my system.  Looking...

The PostmasterStateMachine() case for PM_WAIT_BACKENDS doesn't tell
the checkpointer to shut down in this race case.  We have
CheckpointerPID != 0 (because 7ff23c6d27 starts it earlier than
before), and FatalError is true because a child recently crashed and
we haven't yet received the PMSIGNAL_RECOVERY_STARTED handler that
would clear it.  Hmm.



Re: processes stuck in shutdown following OOM/recovery

От
Thomas Munro
Дата:
On Sat, Dec 2, 2023 at 3:30 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Sat, Dec 2, 2023 at 2:18 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Fri, Dec 1, 2023 at 6:13 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > > $ kill -9 2524495; sleep 0.05; pg_ctl -D ./pgdev.dat1 stop -m fast # 2524495 is a child's pid
> >
> > > This affects v15, and fails at ) but not its parent.
> >
> > Repro'd here.  I had to make the sleep shorter on my system.  Looking...
>
> The PostmasterStateMachine() case for PM_WAIT_BACKENDS doesn't tell
> the checkpointer to shut down in this race case.  We have
> CheckpointerPID != 0 (because 7ff23c6d27 starts it earlier than
> before), and FatalError is true because a child recently crashed and
> we haven't yet received the PMSIGNAL_RECOVERY_STARTED handler that
> would clear it.  Hmm.

Here is a first attempt at fixing this.  I am not yet 100% sure if it
is right, and there may be a nicer/simpler way to express the
conditions.  It passes the test suite, and it fixes the repro that
Justin posted.  FYI on my machine I had to use sleep 0.005 where he
had 0.05, as an FYI if someone else is trying to reproduce the issue.

Вложения

Re: processes stuck in shutdown following OOM/recovery

От
Martijn Wallet
Дата:
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:       not tested
Spec compliant:           not tested
Documentation:            not tested

Hi, I somehow fail to be able to mark all checkboxes on this review page...
However, build and tested with all passed successfully on Rocky Linux release 8.9 (Green Obsidian).
Not sure of more reviewing is needed on other Operating Systems since this is only my second review.

Cheers, Martijn.

The new status of this patch is: Ready for Committer

Re: processes stuck in shutdown following OOM/recovery

От
Martijn Wallet
Дата:
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:       not tested
Spec compliant:           not tested
Documentation:            not tested

Hi, I somehow fail to be able to mark all checkboxes on this review page...
However, build and tested with all passed successfully on Rocky Linux release 8.9 (Green Obsidian).
Not sure of more reviewing is needed on other Operating Systems since this is only my second review.

Cheers, Martijn.

nb: second mail to see spf is fixed and Thomas receives this message.

Re: processes stuck in shutdown following OOM/recovery

От
Thomas Munro
Дата:
On Thu, May 23, 2024 at 9:58 AM Martijn Wallet <martijn@dbre.nu> wrote:
> The following review has been posted through the commitfest application:
> make installcheck-world:  not tested
> Implements feature:       not tested
> Spec compliant:           not tested
> Documentation:            not tested
>
> Hi, I somehow fail to be able to mark all checkboxes on this review page...
> However, build and tested with all passed successfully on Rocky Linux release 8.9 (Green Obsidian).
> Not sure of more reviewing is needed on other Operating Systems since this is only my second review.

Thanks!

I'm also hoping to get review of the rather finickity state machine
logic involved from people familiar with that; I think it's right, but
I'd hate to break some other edge case...

> nb: second mail to see spf is fixed and Thomas receives this message.

FTR 171641337152.1103.7326466732639994038.pgcf@coridan.postgresql.org
and 171641505305.1105.9868637944637520353.pgcf@coridan.postgresql.org
both showed up in my inbox, and they both have headers "Received-SPF:
pass ...".



Re: processes stuck in shutdown following OOM/recovery

От
Martijn Wallet
Дата:
Great, thanks for the feedback. It was probably the DKIM.

Re: processes stuck in shutdown following OOM/recovery

От
Martijn Wallet
Дата:
Last test to have a verified mail, added lists.postgresql.org to spf record. Cheers.