Discussion: kevent latch paths don't handle postmaster death well
I noticed that chipmunk failed [1] with a rather interesting log:

2020-10-14 08:57:01.661 EEST [27048:6] pg_regress/prepared_xacts LOG:  statement: UPDATE pxtest1 SET foobar = 'bbb' WHERE foobar = 'aaa';
2020-10-14 08:57:01.721 EEST [27048:7] pg_regress/prepared_xacts LOG:  statement: SELECT * FROM pxtest1;
2020-10-14 08:57:01.823 EEST [27048:8] pg_regress/prepared_xacts FATAL:  postmaster exited during a parallel transaction
TRAP: FailedAssertion("entry->trans == NULL", File: "pgstat.c", Line: 909, PID: 27048)
2020-10-14 08:57:01.861 EEST [27051:1] ERROR:  could not attach to dynamic shared area
2020-10-14 08:57:01.861 EEST [27051:2] STATEMENT:  SELECT * FROM pxtest1;

I do not know what happened to the postmaster, but seeing that chipmunk is a very small machine running a pretty old Linux kernel, it's plausible to guess that the OOM killer decided to pick on the postmaster.  (I wonder whether Heikki has taken any steps to prevent that on that machine.)

My concern today is not that the postmaster died, but that the subsequent response was an Assert failure.  Not good.

I tried to reproduce this by dint of manually kill -9'ing the postmaster during the select_parallel regression test.  Figuring that a slower machine would give me a better chance of success at that, I used an old Mac laptop that wasn't doing anything else.  I did not succeed yet, but what I did reproducibly get (in five out of five tries) was a leader process that was permanently stuck on a latch, waiting for its worker(s) to die.  Needless to say, there were no workers and never would be.
The stack trace varied a bit, but here's an interesting case:

(gdb) bt
#0  0x90b267ac in kevent ()
#1  0x003b76e8 in WaitEventSetWaitBlock [inlined] () at latch.c:1506
#2  0x003b76e8 in WaitEventSetWait (set=0x10003a8, timeout=-1, occurred_events=<value temporarily unavailable, due to optimizations>, nevents=1, wait_event_info=<value temporarily unavailable, due to optimizations>) at latch.c:1309
#3  0x003b814c in WaitLatch (latch=<value temporarily unavailable, due to optimizations>, wakeEvents=17, timeout=-1, wait_event_info=134217729) at latch.c:411
#4  0x0032e77c in WaitForBackgroundWorkerShutdown (handle=0x10280a4) at bgworker.c:1139
#5  0x000bc6fc in WaitForParallelWorkersToExit (pcxt=0xc6f84c) at parallel.c:876
#6  0x000bc99c in DestroyParallelContext (pcxt=0xc6f84c) at parallel.c:958
#7  0x000bdc48 in dlist_is_empty [inlined] () at lib/ilist.h:1231
#8  0x000bdc48 in AtEOXact_Parallel (isCommit=4 '\004') at parallel.c:1224
#9  0x000ccf24 in AbortTransaction () at xact.c:2702
#10 0x000cd534 in AbortOutOfAnyTransaction () at xact.c:4623
#11 0x00550b54 in ShutdownPostgres (code=<value temporarily unavailable, due to optimizations>, arg=<value temporarily unavailable, due to optimizations>) at postinit.c:1195
#12 0x003b5ff0 in shmem_exit (code=1) at ipc.c:239
#13 0x003b6168 in proc_exit_prepare (code=1) at ipc.c:194
#14 0x003b6240 in proc_exit (code=1) at ipc.c:107
#15 0x00541f6c in errfinish (filename=<value temporarily unavailable, due to optimizations>, lineno=<value temporarily unavailable, due to optimizations>, funcname=0x5bd55c "WaitForParallelWorkersToExit") at elog.c:578
#16 0x000bc748 in WaitForParallelWorkersToExit (pcxt=0xc6f84c) at parallel.c:885
#17 0x000bc7c8 in ReinitializeParallelDSM (pcxt=0xc6f84c) at parallel.c:471
#18 0x0021e468 in ExecParallelReinitialize (planstate=0x109a9a0, pei=0xc3a09c, sendParams=0x0) at execParallel.c:906
#19 0x00239f6c in ExecGather (pstate=0x109a848) at nodeGather.c:177
#20 0x00221cd8 in ExecProcNodeInstr (node=0x109a848) at execProcnode.c:466
#21 0x0024f7a8 in ExecNestLoop (pstate=0x1099700) at executor/executor.h:244
#22 0x00221cd8 in ExecProcNodeInstr (node=0x1099700) at execProcnode.c:466
#23 0x0022edbc in ExecProcNode [inlined] () at executor.h:244
#24 0x0022edbc in fetch_input_tuple (aggstate=0x10993d0) at executor.h:589

We appear to have already realized that the postmaster died, since we're inside proc_exit.  WaitForBackgroundWorkerShutdown is doing this:

        rc = WaitLatch(MyLatch,
                       WL_LATCH_SET | WL_POSTMASTER_DEATH, 0,
                       WAIT_EVENT_BGWORKER_SHUTDOWN);

which one would certainly hope would not block at all if the postmaster is already dead, yet it's doing so.  I guess that the kevent stuff is failing to handle the case of another WaitLatch call after the postmaster is already known dead.

In case it helps, I checked the contents of the WaitEventSet:

(gdb) p *LatchWaitSet
$2 = {
  nevents = 2,
  nevents_space = 2,
  events = 0x10003cc,
  latch = 0xa1b673c,
  latch_pos = 0,
  exit_on_postmaster_death = 0 '\0',
  kqueue_fd = 12,
  kqueue_ret_events = 0x10003ec,
  report_postmaster_not_running = 0 '\0'
}
(gdb) p LatchWaitSet->events[0]
$3 = {
  pos = 0,
  events = 1,
  fd = 10,
  user_data = 0x0
}
(gdb) p LatchWaitSet->events[1]
$4 = {
  pos = 1,
  events = 16,
  fd = 3,
  user_data = 0x0
}

I thought possibly this was an ancient-macOS problem, but I've now reproduced substantially the same behavior on an up-to-date Catalina machine (10.15.7), so I do not think we can write it off that way.

			regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2020-10-14%2000%3A04%3A08
On Thu, Oct 15, 2020 at 7:58 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> We appear to have already realized that the postmaster died, since we're
> inside proc_exit.  WaitForBackgroundWorkerShutdown is doing this:
>
>         rc = WaitLatch(MyLatch,
>                        WL_LATCH_SET | WL_POSTMASTER_DEATH, 0,
>                        WAIT_EVENT_BGWORKER_SHUTDOWN);
>
> which one would certainly hope would not block at all if the postmaster
> is already dead, yet it's doing so.  I guess that the kevent stuff is
> failing to handle the case of another WaitLatch call after the postmaster
> is already known dead.

The process exit event is like an 'edge', not a 'level'... hmm.  It might be enough to set report_postmaster_not_running = true the first time it tells us, so that if we try to wait again we'll treat it like a level.  I will look into it later today.
Thomas Munro <thomas.munro@gmail.com> writes:
> The process exit event is like an 'edge', not a 'level'... hmm.  It
> might be enough to set report_postmaster_not_running = true the first
> time it tells us so if we try to wait again we'll treat it like a
> level.  I will look into it later today.

Seems like having that be per-WaitEventSet state is also not a great idea --- if we detect PM death while waiting on one WES and then wait on another one, it won't work.  A plain process-wide static variable would be a better way, I bet.

			regards, tom lane
On 14/10/2020 21:58, Tom Lane wrote:
> I noticed that chipmunk failed [1] with a rather interesting log:
>
> 2020-10-14 08:57:01.661 EEST [27048:6] pg_regress/prepared_xacts LOG:  statement: UPDATE pxtest1 SET foobar = 'bbb' WHERE foobar = 'aaa';
> 2020-10-14 08:57:01.721 EEST [27048:7] pg_regress/prepared_xacts LOG:  statement: SELECT * FROM pxtest1;
> 2020-10-14 08:57:01.823 EEST [27048:8] pg_regress/prepared_xacts FATAL:  postmaster exited during a parallel transaction
> TRAP: FailedAssertion("entry->trans == NULL", File: "pgstat.c", Line: 909, PID: 27048)
> 2020-10-14 08:57:01.861 EEST [27051:1] ERROR:  could not attach to dynamic shared area
> 2020-10-14 08:57:01.861 EEST [27051:2] STATEMENT:  SELECT * FROM pxtest1;
>
> I do not know what happened to the postmaster, but seeing that chipmunk
> is a very small machine running a pretty old Linux kernel, it's plausible
> to guess that the OOM killer decided to pick on the postmaster.  (I wonder
> whether Heikki has taken any steps to prevent that on that machine.)

For the record, it was not the OOM killer.  It was the buildfarm cron job that did it:

Oct 14 08:57:01 raspberrypi /USR/SBIN/CRON[27050]: (pgbfarm) CMD (killall -q -9 postgres; cd /home/pgbfarm/build-farm-client/ && ./run_branches.pl --run-all)

Apparently building and testing all the branches is now taking slightly more than 24 h on that system, so the next day's cron job kills the previous tests.  I'm going to change the cron schedule so that it runs only every other day.

- Heikki
On Thu, Oct 15, 2020 at 8:40 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > The process exit event is like an 'edge', not a 'level'... hmm.  It
> > might be enough to set report_postmaster_not_running = true the first
> > time it tells us so if we try to wait again we'll treat it like a
> > level.  I will look into it later today.
>
> Seems like having that be per-WaitEventSet state is also not a great
> idea --- if we detect PM death while waiting on one WES, and then
> wait on another one, it won't work.  A plain process-wide static
> variable would be a better way I bet.

I don't think that's a problem -- the kernel will report the event to each interested kqueue object.  The attached fixes the problem for me.
Attachments
Thomas Munro <thomas.munro@gmail.com> writes:
> On Thu, Oct 15, 2020 at 8:40 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Seems like having that be per-WaitEventSet state is also not a great
>> idea --- if we detect PM death while waiting on one WES, and then
>> wait on another one, it won't work.  A plain process-wide static
>> variable would be a better way I bet.

> I don't think that's a problem -- the kernel will report the event to
> each interested kqueue object.  The attached fixes the problem for me.

Oh, OK.  I confirm this makes the kqueue path work like the EPOLL and POLL paths.  (I can't test the WIN32 path.)

			regards, tom lane
On Thu, Oct 15, 2020 at 11:18 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > On Thu, Oct 15, 2020 at 8:40 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Seems like having that be per-WaitEventSet state is also not a great
> >> idea --- if we detect PM death while waiting on one WES, and then
> >> wait on another one, it won't work.  A plain process-wide static
> >> variable would be a better way I bet.
>
> > I don't think that's a problem -- the kernel will report the event to
> > each interested kqueue object.  The attached fixes the problem for me.
>
> Oh, OK.  I confirm this makes the kqueue path work like the EPOLL and POLL
> paths.  (I can't test the WIN32 path.)

Thanks.  Pushed.  (Hmm, I wonder about that Windows process exit event.)
Thomas Munro <thomas.munro@gmail.com> writes:
> (Hmm, I wonder about that Windows process exit event.)

If anyone wants to test that, I can save you a little time building infrastructure, perhaps.  I used the attached program built into a .so.  After creating the function, invoke it, and once it's blocked kill -9 the postmaster.  If it successfully reports multiple WL_POSTMASTER_DEATH results then it's good.

			regards, tom lane

/*
create function wait_on_process_latch() returns void
  strict volatile language c as '.../whatever.so';
select wait_on_process_latch();
*/

#include "postgres.h"

#include "fmgr.h"
#include "miscadmin.h"
#include "storage/latch.h"
#include "pgstat.h"

PG_MODULE_MAGIC;

/*
 * wait_on_process_latch() returns void
 */
PG_FUNCTION_INFO_V1(wait_on_process_latch);

Datum
wait_on_process_latch(PG_FUNCTION_ARGS)
{
	int			count = 10;

	while (count-- > 0)
	{
		int			rc;

		CHECK_FOR_INTERRUPTS();
		elog(INFO, "waiting on latch");
		rc = WaitLatch(MyLatch,
					   WL_LATCH_SET | WL_POSTMASTER_DEATH, 0,
					   WAIT_EVENT_BGWORKER_SHUTDOWN);
		if (rc & WL_POSTMASTER_DEATH)
			elog(INFO, "got WL_POSTMASTER_DEATH report");
		ResetLatch(MyLatch);
	}

	PG_RETURN_VOID();
}
Hi,

On 2020-10-15 11:10:28 +1300, Thomas Munro wrote:
> I don't think that's a problem -- the kernel will report the event to
> each interested kqueue object.  The attached fixes the problem for me.

Will it do so even if the kqueue is created after postmaster death?

- Andres
Andres Freund <andres@anarazel.de> writes:
> On 2020-10-15 11:10:28 +1300, Thomas Munro wrote:
>> I don't think that's a problem -- the kernel will report the event to
>> each interested kqueue object.  The attached fixes the problem for me.

> Will it do so even if the kqueue is created after postmaster death?

I did not try to test it, but there's code that purports to handle that in latch.c, ~ line 1150, and the behavior it's expecting mostly agrees with what I read in the macOS kevent man page.  One thing I'd suggest is that EACCES probably needs to be treated as "postmaster already dead", too, in case the PID is now owned by another user ID.

			regards, tom lane
On Thu, Oct 15, 2020 at 12:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2020-10-15 11:10:28 +1300, Thomas Munro wrote:
> >> I don't think that's a problem -- the kernel will report the event to
> >> each interested kqueue object.  The attached fixes the problem for me.
>
> > Will it do so even if the kqueue is created after postmaster death?
>
> I did not try to test it, but there's code that purports to handle that
> in latch.c, ~ line 1150, and the behavior it's expecting mostly agrees
> with what I read in the macOS kevent man page.  One thing I'd suggest

Yep, I did handle the obvious races here.

> is that EACCES probably needs to be treated as "postmaster already dead",
> too, in case the PID is now owned by another user ID.

Good point.  I'll push that change later today.
On Thu, Oct 15, 2020 at 12:55 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Thu, Oct 15, 2020 at 12:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I did not try to test it, but there's code that purports to handle that
> > in latch.c, ~ line 1150, and the behavior it's expecting mostly agrees
> > with what I read in the macOS kevent man page.  One thing I'd suggest
> > is that EACCES probably needs to be treated as "postmaster already dead",
> > too, in case the PID is now owned by another user ID.
>
> Good point.  I'll push that change later today.

I tried to test this on my system, but it seems like FreeBSD can't really report EACCES for EVFILT_PROC.  From the man page and a quick inspection of the source, you only have to be able to "see" the process, and if you can't, I think you'll get ESRCH; EACCES may be for other kinds of filters.  I don't currently have any Apple gear to hand, and its man page uses the same language, but I do see EACCES in filt_procattach() in the darwin-xnu sources on GitHub, so I guess you can reach this case and get an ugly ereport (hopefully followed swiftly by a proc_exit() from the next wait on one of the long-lived WESs, or a FATAL if this was the creation of one of those).

Fixed.  Thanks!
On Thu, Oct 15, 2020 at 6:42 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> I tried to test this on my system but it seems like maybe FreeBSD
> can't really report EACCES for EVFILT_PROC.  From the man page and a
> quick inspection of the source, you only have to be able to "see" the
> process, and if you can't I think you'll get ESRCH, so EACCES may be
> for other kinds of filters.  I don't currently have any Apple gear to
> hand, but its man page uses the same language, but on the other hand I
> do see EACCES in filt_procattach() in the darwin-xnu sources on github
> so I guess you can reach this case and get an ugly ereport (hopefully
> followed swiftly by a proc_exit() from the next wait on one of the
> long lived WESs, or a FATAL if this was the creation of one of those).

I couldn't resist digging further into the Apple sources to figure out what was going on there, and I realised that the code path I was looking at can only report EACCES if you asked for NOTE_EXITSTATUS, which appears to be an Apple extension to the original FreeBSD kqueue system designed to let you receive the exit status of the monitored process.  That is indeed much more privileged information, and it's only allowed for your own children.

So it's possible that commit 70516a17 was a waste of electrons, but I don't think it can hurt; either way, our system is toast if we get that error, so it's mostly just a question of what sort of noises we make as we fail, if indeed any system really can produce EACCES for NOTE_EXIT (maybe in some other code path I haven't found, or some other cousin BSD).
Thomas Munro <thomas.munro@gmail.com> writes:
> I couldn't resist digging further into the Apple sources to figure out
> what was going on there, and I realised that the code path I was
> looking at can only report EACCES if you asked for NOTE_EXITSTATUS,
> which appears to be an Apple extension to the original FreeBSD kqueue
> system designed to let you receive the exit status of the monitored
> process.  That is indeed much more privileged information, and it's
> only allowed for your own children.

Ah.

> So it's possible that commit
> 70516a17 was a waste of electrons, but I don't think it can hurt;

Yeah, I'm not inclined to revert it.  If we did get that errno, it'd be hard to interpret it in any way that didn't involve the postmaster being gone.

			regards, tom lane