Обсуждение: Something is fairly whacko about shutdown in CVS HEAD
I'm seeing two sets of shutdown messages, and apparently a second shutdown checkpoint being forced, during a normal database stop: 2007-06-30 14:21:00 EDT 9644 LOG: received smart shutdown request 2007-06-30 14:21:00 EDT 9647 LOG: shutting down 2007-06-30 14:21:00 EDT 9647 LOG: checkpoint starting: shutdown immediate 2007-06-30 14:21:00 EDT 9647 LOG: checkpoint complete: wrote 23 buffers (0.6%); 0 transaction log file(s) added, 0 removed,0 recycled; write=0.003 s, sync=0.178 s, total=0.237 s 2007-06-30 14:21:00 EDT 9647 LOG: database system is shut down 2007-06-30 14:21:00 EDT 9644 LOG: background writer process (PID 9647) exited with exit code 0 2007-06-30 14:21:00 EDT 9644 LOG: terminating any other active server processes 2007-06-30 14:21:00 EDT 9644 LOG: all server processes terminated; reinitializing 2007-06-30 14:21:00 EDT 9654 LOG: database system was shut down at 2007-06-30 14:21:00 EDT 2007-06-30 14:21:01 EDT 9655 LOG: shutting down 2007-06-30 14:21:01 EDT 9655 LOG: checkpoint starting: shutdown immediate 2007-06-30 14:21:01 EDT 9655 LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed,0 recycled; write=0.000 s, sync=0.000 s, total=0.041 s 2007-06-30 14:21:01 EDT 9655 LOG: database system is shut down I haven't looked for the cause yet but I suspect it's got something to do with this patch: http://archives.postgresql.org/pgsql-committers/2007-06/msg00285.php Either that or somebody else broke it recently. regards, tom lane
Tom Lane wrote: > I'm seeing two sets of shutdown messages, and apparently a second > shutdown checkpoint being forced, during a normal database stop: > > 2007-06-30 14:21:00 EDT 9644 LOG: received smart shutdown request > 2007-06-30 14:21:00 EDT 9647 LOG: shutting down > 2007-06-30 14:21:00 EDT 9647 LOG: checkpoint starting: shutdown immediate > 2007-06-30 14:21:00 EDT 9647 LOG: checkpoint complete: wrote 23 buffers (0.6%); 0 transaction log file(s) added, 0 removed,0 recycled; write=0.003 s, sync=0.178 s, total=0.237 s > 2007-06-30 14:21:00 EDT 9647 LOG: database system is shut down > 2007-06-30 14:21:00 EDT 9644 LOG: background writer process (PID 9647) exited with exit code 0 > 2007-06-30 14:21:00 EDT 9644 LOG: terminating any other active server processes > 2007-06-30 14:21:00 EDT 9644 LOG: all server processes terminated; reinitializing > 2007-06-30 14:21:00 EDT 9654 LOG: database system was shut down at 2007-06-30 14:21:00 EDT > 2007-06-30 14:21:01 EDT 9655 LOG: shutting down > 2007-06-30 14:21:01 EDT 9655 LOG: checkpoint starting: shutdown immediate > 2007-06-30 14:21:01 EDT 9655 LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed,0 recycled; write=0.000 s, sync=0.000 s, total=0.041 s > 2007-06-30 14:21:01 EDT 9655 LOG: database system is shut down > > I haven't looked for the cause yet but I suspect it's got something to > do with this patch: > http://archives.postgresql.org/pgsql-committers/2007-06/msg00285.php > Either that or somebody else broke it recently. Huh, I can't reproduce it here. 2007-06-30 20:51:27 CLT 23221 LOG: received smart shutdown request 2007-06-30 20:51:27 CLT 23223 LOG: shutting down 2007-06-30 20:51:27 CLT 23225 LOG: autovacuum launcher shutting down 2007-06-30 20:51:27 CLT 23223 LOG: checkpoint starting: shutdown immediate 2007-06-30 20:51:27 CLT 23223 LOG: checkpoint complete: wrote 474 buffers (15.4%); 0 transaction log file(s) added, 0 removed,1 recycled; write=0.016 s, sync=0.084 s, total=0.124 s 2007-06-30 20:51:27 CLT 23223 LOG: database system is shut down In your test output, it looks like the bgwriter is exiting early with code 0 which makes postmaster upset, so it restarts it. Or maybe bgwriter exits as expected but the postmaster is denying reality. -- Alvaro Herrera http://www.amazon.com/gp/registry/DXLWNGRJD34J "Investigación es lo que hago cuando no sé lo que estoy haciendo" (Wernher von Braun)
Alvaro Herrera <alvherre@commandprompt.com> writes: > Tom Lane wrote: >> I'm seeing two sets of shutdown messages, and apparently a second >> shutdown checkpoint being forced, during a normal database stop: > Huh, I can't reproduce it here. It looks to me like this is a race condition induced by the autovacuum-launcher patches. Observe the following chunk of postmaster.c, which responds to exit of the bgwriter child: /* * Was it the bgwriter? */ if (BgWriterPID != 0 && pid == BgWriterPID) { BgWriterPID= 0; if (EXIT_STATUS_0(exitstatus) && Shutdown > NoShutdown && !FatalError && !DLGetHead(BackendList) && AutoVacPID == 0) { /* * Normal postmaster exit ishere: we've seen normal exit of * the bgwriter after it's been told to shut down. We expect * that it wrote a shutdown checkpoint. (If for some reason * it didn't, recovery will occur on next postmasterstart.) * * Note: we do not wait around for exit of the archiver or * stats processes. They've been sent SIGQUIT by this point, * and in any case contain logic to commit hara-kiriif they * notice the postmaster is gone. */ ExitPostmaster(0); } /* * Any unexpected exit of the bgwriter (including FATAL exit) * is treated as a crash. */ HandleChildCrash(pid, exitstatus, _("background writer process")); If AutoVacPID is still nonzero when bgwriter exit is detected, then we think we've seen a crash. I'm not clear why it happens reliably for me and not for you, but this is certainly a bug. To resolve this I think we need a clearer definition of the autovac launcher's role in life. I see that it is attached to shared memory; is it supposed to be able to execute transactions or otherwise do anything the bgwriter might have to clean up after? If so we need to fix things so that we don't tell the bgwriter to exit until after the launcher is gone. If not, we could possibly allow these things to happen asynchronously, though I wonder whether it wouldn't be best to force the ordering anyway. regards, tom lane
Tom Lane wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: > > Tom Lane wrote: > >> I'm seeing two sets of shutdown messages, and apparently a second > >> shutdown checkpoint being forced, during a normal database stop: > > > Huh, I can't reproduce it here. > > It looks to me like this is a race condition induced by the > autovacuum-launcher patches. Observe the following chunk of > postmaster.c, which responds to exit of the bgwriter child: > If AutoVacPID is still nonzero when bgwriter exit is detected, > then we think we've seen a crash. I'm not clear why it happens > reliably for me and not for you, but this is certainly a bug. Oops, you are right. So the problem predates the latest autovac patch; it goes all the way back to when the launcher was introduced. I can't imagine the reason, but I haven't ever seen the double shutdown checkpoint (not that I've been looking too closely, but I almost always run postmasters with stderr to a console in front of my eyes when I'm testing autovac). > To resolve this I think we need a clearer definition of the autovac > launcher's role in life. I see that it is attached to shared memory; > is it supposed to be able to execute transactions or otherwise do > anything the bgwriter might have to clean up after? If so we need > to fix things so that we don't tell the bgwriter to exit until after > the launcher is gone. If not, we could possibly allow these things > to happen asynchronously, though I wonder whether it wouldn't be best > to force the ordering anyway. No; the launcher is attached to shared memory, but it doesn't connect to databases and it doesn't execute transactions either. In fact, the launcher cannot do anything when the postmaster is not running (or when it is not in a state where it doesn't want to start processes, anyway), because it (the launcher) is only capable of sending signals to postmaster (apart from that, it ocasionally grabs lwlocks, reads the pg_database flat file, pgstats, and the Xid counter in shared memory). It doesn't write anything. I think it is perfectly acceptable to have the launcher shut down in parallel with bgwriter. -- Alvaro Herrera http://www.amazon.com/gp/registry/5ZYLFMCVHXC "Amanece. (Ignacio Reyes)El Cerro San Cristóbal me mira, cínicamente, con ojosde virgen"
Alvaro Herrera <alvherre@commandprompt.com> writes: > Tom Lane wrote: >> To resolve this I think we need a clearer definition of the autovac >> launcher's role in life. I see that it is attached to shared memory; >> is it supposed to be able to execute transactions or otherwise do >> anything the bgwriter might have to clean up after? > No; the launcher is attached to shared memory, but it doesn't connect to > databases and it doesn't execute transactions either. In fact, the > launcher cannot do anything when the postmaster is not running (or when > it is not in a state where it doesn't want to start processes, anyway), > because it (the launcher) is only capable of sending signals to > postmaster (apart from that, it ocasionally grabs lwlocks, reads the > pg_database flat file, pgstats, and the Xid counter in shared memory). > It doesn't write anything. I think it is perfectly acceptable to have > the launcher shut down in parallel with bgwriter. [ after sleeping on it... ] That's probably okay as far as bgwriter goes, but I think it's a bad idea for anything connected to shared memory to outlive the postmaster. To do so opens risks that some platforms might treat the postmaster's shmctl(IPC_RMID) differently than others. So I want the postmaster to not exit until it's seen the launcher quit. It's quite easy to make this happen if we treat the launcher more like a backend, and don't start bgwriter shutdown till it's gone. (I have a patch I've been testing that does it that way.) Allowing the two to shut down in parallel seems more complex, though if you see a reasonable way to make it work, propose a patch. regards, tom lane