Thread: assertion at postmaster start
Once in a blue moon I get this assertion failure on server start:

2019-06-15 12:00:29.650 -04 [30080] LOG:  iniciando PostgreSQL 12beta1 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.4.0-1ubuntu1~18.04) 7.4.0, 64-bit
2019-06-15 12:00:29.650 -04 [30080] LOG:  escuchando en la dirección IPv4 «127.0.0.1», port 55432
2019-06-15 12:00:29.650 -04 [30080] LOG:  escuchando en el socket Unix «/tmp/.s.PGSQL.55432»
2019-06-15 12:00:29.658 -04 [30956] LOG:  el sistema de bases de datos fue apagado en 2019-06-15 12:00:24 -04
2019-06-15 12:00:29.659 -04 [30080] LOG:  proceso de servidor (PID 30107) terminó con código de salida 15
2019-06-15 12:00:29.659 -04 [30080] LOG:  terminando todos los otros procesos de servidor activos
TRAP: FailedAssertion(«!(AbortStartTime == 0)», Archivo: «/pgsql/source/master/src/backend/postmaster/postmaster.c», Línea: 2957)
Aborted (core dumped)

Apologies for the Spanish -- I cannot readily reproduce this.  In
essence, this shows a normal startup, until suddenly process 30107
terminates with exit code 15, and then while shutting everything down,
postmaster hits the aforementioned assertion and terminates.

One problem with debugging this is that I don't know what process 30107
is, since the logs don't mention it.

No idea what is going on.  But I'm going to set my script to start the
server with log_min_messages=debug1, in case I hit it again ...

Has anybody else seen this?

-- 
Álvaro Herrera
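[Editor's note: Alvaro's plan of raising log_min_messages can be captured as a small postgresql.conf fragment. Both settings are standard; the %p escape in log_line_prefix (already the default in recent releases) makes every child's PID appear in the log, so a mystery process like 30107 can be matched to its startup message. This is a suggested configuration, not part of the thread.]

```
# postgresql.conf -- capture child process start/exit at debug level
log_min_messages = debug1
# timestamp (%m) and PID (%p) on every line, so children can be identified
log_line_prefix = '%m [%p] '
```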
Hi,

On 2019-06-15 12:09:50 -0400, Alvaro Herrera wrote:
> Once in a blue moon I get this assertion failure on server start:
>
> 2019-06-15 12:00:29.650 -04 [30080] LOG:  iniciando PostgreSQL 12beta1 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.4.0-1ubuntu1~18.04) 7.4.0, 64-bit
> 2019-06-15 12:00:29.650 -04 [30080] LOG:  escuchando en la dirección IPv4 «127.0.0.1», port 55432
> 2019-06-15 12:00:29.650 -04 [30080] LOG:  escuchando en el socket Unix «/tmp/.s.PGSQL.55432»
> 2019-06-15 12:00:29.658 -04 [30956] LOG:  el sistema de bases de datos fue apagado en 2019-06-15 12:00:24 -04
> 2019-06-15 12:00:29.659 -04 [30080] LOG:  proceso de servidor (PID 30107) terminó con código de salida 15
> 2019-06-15 12:00:29.659 -04 [30080] LOG:  terminando todos los otros procesos de servidor activos
> TRAP: FailedAssertion(«!(AbortStartTime == 0)», Archivo: «/pgsql/source/master/src/backend/postmaster/postmaster.c», Línea: 2957)
> Aborted (core dumped)
>
> Apologies for the Spanish -- I cannot readily reproduce this.  In
> essence, this shows a normal startup, until suddenly process 30107
> terminates with exit code 15, and then while shutting everything down,
> postmaster hits the aforementioned assertion and terminates.

I assume this is on master as of a few days ago?  This doesn't even
look to be *after* a crash-restart?  And I assume core files weren't
enabled?

> One problem with debugging this is that I don't know what process 30107
> is, since the logs don't mention it.

Hm - it probably can't be that many processes.  It looks like 30107 has
to have started pretty soon after the startup process (which IIRC is
the one emitting "el sistema de bases de datos fue apagado en"), and as
soon as that's done, 30107 is noticed as having crashed.
Unfortunately, as this appears to be a start of a clean database, we
don't really know which phase of startup this is; IIRC there are no
messages to be expected before "database system is ready to accept
connections" in a clean start.
What is a bit odd is that:

> 2019-06-15 12:00:29.659 -04 [30080] LOG:  proceso de servidor (PID 30107) terminó con código de salida 15

comes from:

#. translator: %s is a noun phrase describing a child process, such as
#. "server process"
#: postmaster/postmaster.c:3656
#, c-format
msgid "%s (PID %d) exited with exit code %d"
msgstr "%s (PID %d) terminó con código de salida %d"

#: postmaster/postmaster.c:3301 postmaster/postmaster.c:3321
#: postmaster/postmaster.c:3328 postmaster/postmaster.c:3346
msgid "server process"
msgstr "proceso de servidor"

And "server process" is afaict only used for actual backends, not
other types of processes.  But we've not yet seen "database system is
ready to accept connections", so IIRC it could only be a "dead_end"
type backend?  But we didn't yet see an error from that...

> No idea what is going on.

Seems to indicate a logic error in postmaster's state machine.
Perhaps something related to dead_end processes?

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2019-06-15 12:09:50 -0400, Alvaro Herrera wrote:
>> Once in a blue moon I get this assertion failure on server start:
>> TRAP: FailedAssertion(«!(AbortStartTime == 0)», Archivo: «/pgsql/source/master/src/backend/postmaster/postmaster.c», Línea: 2957)

> And "server process" is afaict only used for actual backends, not
> other types of processes.  But we've not yet seen "database system is
> ready to accept connections", so IIRC it could only be a "dead_end"
> type backend?  But we didn't yet see an error from that...

> Seems to indicate a logic error in postmaster's state machine.
> Perhaps something related to dead_end processes?

So if Andres is guessing right, this must be from something trying to
connect before the postmaster is ready?  Seems like that could be
tested for ...

			regards, tom lane
I wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On 2019-06-15 12:09:50 -0400, Alvaro Herrera wrote:
>>> Once in a blue moon I get this assertion failure on server start:
>>> TRAP: FailedAssertion("!(AbortStartTime == 0)", Archivo: "/pgsql/source/master/src/backend/postmaster/postmaster.c", Linea: 2957)

>> And "server process" is afaict only used for actual backends, not
>> other types of processes.  But we've not yet seen "database system is
>> ready to accept connections", so IIRC it could only be a "dead_end"
>> type backend?  But we didn't yet see an error from that...

>> Seems to indicate a logic error in postmaster's state machine.
>> Perhaps something related to dead_end processes?

> So if Andres is guessing right, this must be from something trying to
> connect before the postmaster is ready?  Seems like that could be
> tested for ...

I got around to trying to test for this, and I find that I can
reproduce the symptom exactly by applying the attached hack and then
trying to connect a couple of seconds after starting the postmaster.
Basically what seems to be happening in Alvaro's report is that

(1) before the startup process is done, something tries to connect,
causing a dead_end child to be launched;

(2) for reasons mysterious, that child exits with exit code 15 rather
than the expected 0 or 1;

(3) the postmaster therefore initiates a system-wide restart cycle;

(4) the startup process completes normally anyway, indicating that the
SIGQUIT arrived too late to affect it;

(5) then we hit the Assert, since we reach the transition-to-normal-run
code even though HandleChildCrash set AbortStartTime in step (3).

The timing window for (4) to happen is extremely tight normally.  The
attached patch makes it wider by the expedient of just not sending the
SIGQUIT to the startup process ;-).  Then you just need enough of a
delay in startup to perform a manual connection, plus some hack to
make the dead_end child exit with an unexpected exit code.
I think what this demonstrates is that that Assert is just wrong:
we *can* reach PM_RUN with the flag still set, so we should do

 			StartupStatus = STARTUP_NOT_RUNNING;
 			FatalError = false;
-			Assert(AbortStartTime == 0);
+			AbortStartTime = 0;
 			ReachedNormalRunning = true;
 			pmState = PM_RUN;

Probably likewise for the similar Assert in sigusr1_handler.

A larger question is whether we should modify the postmaster logic so
that crashes of dead_end children aren't treated as reasons to perform
a system restart.  I'm dubious about this, because frankly, such
crashes shouldn't be happening.  There is very little code that a
dead_end child will traverse before exiting ... so how the devil did
it reach an exit(15)?  Alvaro, are you running any nonstandard code in
the postmaster (shared_preload_libraries, maybe)?

			regards, tom lane

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 3339804..ef53522 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2260,6 +2260,7 @@ retry1:
 	switch (port->canAcceptConnections)
 	{
 		case CAC_STARTUP:
+			exit(15);
 			ereport(FATAL,
 					(errcode(ERRCODE_CANNOT_CONNECT_NOW),
 					 errmsg("the database system is starting up")));
@@ -3512,7 +3513,7 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 				(errmsg_internal("sending %s to process %d",
 								 (SendStop ? "SIGSTOP" : "SIGQUIT"),
 								 (int) StartupPID)));
-		signal_child(StartupPID, (SendStop ? SIGSTOP : SIGQUIT));
+//		signal_child(StartupPID, (SendStop ? SIGSTOP : SIGQUIT));
 		StartupStatus = STARTUP_SIGNALED;
 	}
@@ -5383,6 +5384,9 @@ StartChildProcess(AuxProcType type)
 		MemoryContextDelete(PostmasterContext);
 		PostmasterContext = NULL;

+		if (type == StartupProcess)
+			pg_usleep(5000000);
+
 		AuxiliaryProcessMain(ac, av);
 		ExitPostmaster(0);
 	}
I wrote:
> I think what this demonstrates is that that Assert is just wrong:
> we *can* reach PM_RUN with the flag still set, so we should do
>  			StartupStatus = STARTUP_NOT_RUNNING;
>  			FatalError = false;
> -			Assert(AbortStartTime == 0);
> +			AbortStartTime = 0;
>  			ReachedNormalRunning = true;
>  			pmState = PM_RUN;
> Probably likewise for the similar Assert in sigusr1_handler.

Poking further at this, I noticed that the code just above here
completely fails to do what the comments say it intends to do, which
is restart the startup process after we've SIGQUIT'd it.  That's
because the careful manipulation of StartupStatus in reaper (lines
2943ff in HEAD) is stomped on by HandleChildCrash, which will just
unconditionally reset it to STARTUP_CRASHED (line 3507).  So we end up
shutting down the database after all, which is not what the intention
seems to be.  Hence, commit 45811be94 was still a few bricks shy of a
load :-(.

I propose the attached.  I'm inclined to think that the risk/benefit
of back-patching this is not very good, so I just want to stick it in
HEAD, unless somebody can explain why dead_end children are likely to
crash in the field.

			regards, tom lane

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 3339804..ecb7892 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2920,7 +2920,9 @@ reaper(SIGNAL_ARGS)
 			 * during PM_STARTUP is treated as catastrophic. There are no
 			 * other processes running yet, so we can just exit.
 			 */
-			if (pmState == PM_STARTUP && !EXIT_STATUS_0(exitstatus))
+			if (pmState == PM_STARTUP &&
+				StartupStatus != STARTUP_SIGNALED &&
+				!EXIT_STATUS_0(exitstatus))
 			{
 				LogChildExit(LOG, _("startup process"),
 							 pid, exitstatus);
@@ -2937,11 +2939,24 @@ reaper(SIGNAL_ARGS)
 			 * then we previously sent the startup process a SIGQUIT; so
 			 * that's probably the reason it died, and we do want to try to
 			 * restart in that case.
+			 *
+			 * This stanza also handles the case where we sent a SIGQUIT
+			 * during PM_STARTUP due to some dead_end child crashing: in that
+			 * situation, if the startup process dies on the SIGQUIT, we need
+			 * to transition to PM_WAIT_BACKENDS state which will allow
+			 * PostmasterStateMachine to restart the startup process. (On the
+			 * other hand, the startup process might complete normally, if we
+			 * were too late with the SIGQUIT. In that case we'll fall
+			 * through and commence normal operations.)
 			 */
 			if (!EXIT_STATUS_0(exitstatus))
 			{
 				if (StartupStatus == STARTUP_SIGNALED)
+				{
 					StartupStatus = STARTUP_NOT_RUNNING;
+					if (pmState == PM_STARTUP && Shutdown == NoShutdown)
+						pmState = PM_WAIT_BACKENDS;
+				}
 				else
 					StartupStatus = STARTUP_CRASHED;
 				HandleChildCrash(pid, exitstatus,
@@ -2954,7 +2969,7 @@ reaper(SIGNAL_ARGS)
 			 */
 			StartupStatus = STARTUP_NOT_RUNNING;
 			FatalError = false;
-			Assert(AbortStartTime == 0);
+			AbortStartTime = 0;
 			ReachedNormalRunning = true;
 			pmState = PM_RUN;
@@ -3504,7 +3519,7 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 	if (pid == StartupPID)
 	{
 		StartupPID = 0;
-		StartupStatus = STARTUP_CRASHED;
+		/* Caller adjusts StartupStatus, so don't touch it here */
 	}
 	else if (StartupPID != 0 && take_action)
 	{
@@ -5100,7 +5115,7 @@ sigusr1_handler(SIGNAL_ARGS)
 	{
 		/* WAL redo has started. We're out of reinitialization. */
 		FatalError = false;
-		Assert(AbortStartTime == 0);
+		AbortStartTime = 0;

 		/*
 		 * Crank up the background tasks. It doesn't matter if this fails,
I wrote:
> I propose the attached.  I'm inclined to think that the risk/benefit
> of back-patching this is not very good, so I just want to stick it in
> HEAD, unless somebody can explain why dead_end children are likely to
> crash in the field.

Pushed at ee3278239.

I'm still curious as to the explanation for a dead_end child exiting
with code 15, but I have no way to pursue the point.

			regards, tom lane
On 2019-Aug-26, Tom Lane wrote:

> I wrote:
> > I propose the attached.  I'm inclined to think that the risk/benefit
> > of back-patching this is not very good, so I just want to stick it in
> > HEAD, unless somebody can explain why dead_end children are likely to
> > crash in the field.
>
> Pushed at ee3278239.
>
> I'm still curious as to the explanation for a dead_end child exiting
> with code 15, but I have no way to pursue the point.

Many thanks for all the investigation and fix!

Sadly, I have *no* idea what could have happened that would have caused
a connection at that point (my start scripts don't do it).  It is
possible that I had a terminal running some shell loop on psql ("watch
psql -c something" perhaps), but I'm sure I didn't notice that when I
reported this, or I would have mentioned it.  However, I have no idea
why it would have died with code 15.  From my notes of what I was doing
that day, I can't find any evidence that I would have had anything in
shared_preload_libraries.  (I don't have Frost's complete timestamped
shell history, however.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services