Обсуждение: Parallel worker hangs while handling errors.
Hi, Parallel worker hangs while handling errors. Analysis: When there is an error in the parallel worker process, we will call ereport/elog with the error message. Worker will then jump from errfinish to setjmp in StartBackgroundWorker function which was set earlier. Then the worker process will then send the error message through the shared memory to the leader process. Shared memory size is ok 16K, if the error message is less than 16K it works fine. If there is a bigger error message, the worker process will wait for the leader process to read the message, free up some memory in shared memory and set the latch. The worker will be waiting at the below back trace: #4 0x000000000090480c in WaitLatch (latch=0x7f2b39f6b454, wakeEvents=33, timeout=0, wait_event_info=134217753) at latch.c:368 #5 0x0000000000787c7f in mq_putmessage (msgtype=69 'E', s=0x2f24350 "SERROR", len=230015) at pqmq.c:171 #6 0x000000000078712e in pq_endmessage (buf=0x7ffe721c4370) at pqformat.c:301 #7 0x0000000000ac1749 in send_message_to_frontend (edata=0xfe91a0 <errordata>) at elog.c:3327 #8 0x0000000000abdf5b in EmitErrorReport () at elog.c:1460 Leader process then identifies that there are some messages that need to be processed, it copies the messages and sets the latch so that the worker process can copy the remaining message from the below function: shm_mq_inc_bytes_read -> SetLatch(&sender->procLatch);, Worker is not able to receive any signal at this point of time & hangs infinitely Worker hangs in this case because when the worker is started the signals will be masked using sigprocmask. Unblocking of signals is done by calling BackgroundWorkerUnblockSignals in ParallelWorkerMain. Now due to error handling the worker has jumped to setjmp in StartBackgroundWorker function. Here the signals are in a blocked state, hence the signal is not received by the worker process. One of the fixes could be to call BackgroundWorkerUnblockSignals just after sigsetjmp. I'm not sure if this is the best solution. Robert & myself had a discussion about the problem yesterday. We felt this is a genuine problem with the parallel worker error handling and need to be fixed. I could reproduce this issue when there is an error during copy of toast data using parallel copy, this project is an in-progress project. I don't have a test case to reproduce on the head. Any suggestions for a test case on head? The Attached patch has the fix for the same. Thoughts? Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
Вложения
On Fri, Jul 3, 2020 at 2:40 PM vignesh C <vignesh21@gmail.com> wrote:
>
> The Attached patch has the fix for the same.
>
I have added a commitfest entry for this bug:
https://commitfest.postgresql.org/29/2636/
>
> The Attached patch has the fix for the same.
>
I have added a commitfest entry for this bug:
https://commitfest.postgresql.org/29/2636/
> > Parallel worker hangs while handling errors. > > When there is an error in the parallel worker process, we will call > ereport/elog with the error message. Worker will then jump from > errfinish to setjmp in StartBackgroundWorker function which was set > earlier. Then the worker process will then send the error message > through the shared memory to the leader process. Shared memory size is > ok 16K, if the error message is less than 16K it works fine. I reproduced the hang issue with the parallel copy patches[1]. The use case is as follows - one of the parallel workers tries to report error to the leader process and as part of the error context it also tries to send the entire row/tuple data(which is a lot more than 16KB). The fix provided here solves the above problem, i.e. no hang occurs, and the entire tuple/row data in the error from worker to leader gets transferred, see the attachment "testcase.text" for the output. Apart from that, I also executed the regression tests (make check and make check-world) on the patch, no issues are observed. [1] - https://www.postgresql.org/message-id/CALDaNm2-wMYO68vtDuuWO5h4FQCsfm4Pcg5XrzEPtRty1bEM7w%40mail.gmail.com With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Вложения
> > Leader process then identifies that there are some messages that need > to be processed, it copies the messages and sets the latch so that the > worker process can copy the remaining message from the below function: > shm_mq_inc_bytes_read -> SetLatch(&sender->procLatch);, Worker is not > able to receive any signal at this point of time & hangs infinitely > Worker hangs in this case because when the worker is started the > signals will be masked using sigprocmask. Unblocking of signals is > done by calling BackgroundWorkerUnblockSignals in ParallelWorkerMain. > Now due to error handling the worker has jumped to setjmp in > StartBackgroundWorker function. Here the signals are in a blocked > state, hence the signal is not received by the worker process. > Your analysis looks fine to me. Adding some more info: The worker uses SIGUSR1 (with a special shared memory flag PROCSIG_PARALLEL_MESSAGE) both for error message sharing(from mq_putmessage) and for parallel worker shutdown(from ParallelWorkerShutdown). And yes, the postmaster blocks SIGUSR1 before forking bgworker(in PostmasterMain with pqinitmask() and PG_SETMASK(&BlockSig)), bgworker receives the same blocked signal mask for itself and enters StartBackgroundWorker(), uses sigsetjmp for error handling, and then goes to ParallelWorkerMain() where few of the signal handers are set and then BackgroundWorkerUnblockSignals() is called to not block any signals. But observe when we did sigsetjmp the state of the signal mask is that of we received from postmaster which is all signals blocked. So, now in error cases when the control is jumped to sigsetjmp we still have the signals blocked(along with SIGUSR1) mask and in the code path of EmitErrorReport, we do send SIGUSR1 with flag PROCSIG_PARALLEL_MESSAGE to the leader/backend and wait for the latch to be set, this happens only if the worker is able to receive back SIGUSR1 from the leader/backend. In this reported issue, since SIGUSR1 is blocked at sigsetjmp in StartBackgroundWorker(), there is no way that the worker process receiving it from the leader and the latch cannot be set and hence the hang occurs. The same hang issue can occur(though I'm not able to back it up with a use case), in the cases from wherever the EmitErrorReport() is called from "if (sigsetjmp(local_sigjmp_buf, 1) != 0)" block, such as autovacuum.c, bgwriter.c, bgworker.c, checkpointer.c, walwriter.c, and postgres.c. > > One of the fixes could be to call BackgroundWorkerUnblockSignals just > after sigsetjmp. I'm not sure if this is the best solution. > Robert & myself had a discussion about the problem yesterday. We felt > this is a genuine problem with the parallel worker error handling and > need to be fixed. > Note that, in all sigsetjmp blocks, we intentionally HOLD_INTERRUPTS(), to not cause any issues while performing error handling, I'm concerned here that now, if we directly call BackgroundWorkerUnblockSignals() which will open up all the signals and our main intention of holding interrupts or block signals may go away. Since the main problem for this hang issue is because of blocking SIGUSR1, in sigsetjmp, can't we just only unblock only the SIGUSR1, instead of unblocking all signals? I tried this with parallel copy hang, the issue is resolved. Something like below - if (sigsetjmp(local_sigjmp_buf, 1) != 0) { sigset_t x; sigemptyset (&x); sigaddset(&x, SIGUSR1); sigprocmask(SIG_UNBLOCK, &x, NULL); /* Since not using PG_TRY, must reset error stack by hand */ error_context_stack = NULL; /* Prevent interrupts while cleaning up */ HOLD_INTERRUPTS(); If okay, with the above approach, we can put the above sigprocmask(SIG_UNBLOCK,..) piece of code(of course generically to unblock any given signal) in a macro similar to PG_SETMASK() and use that in all the places wherever EmitErrorReport() is called from sigsetjmp. We should mind the portability of sigprocmask as well. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Thanks for reviewing and adding your thoughts, My comments are inline. On Fri, Jul 17, 2020 at 1:21 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > The same hang issue can occur(though I'm not able to back it up with a > use case), in the cases from wherever the EmitErrorReport() is called > from "if (sigsetjmp(local_sigjmp_buf, 1) != 0)" block, such as > autovacuum.c, bgwriter.c, bgworker.c, checkpointer.c, walwriter.c, and > postgres.c. > I'm not sure if this can occur in other cases. > > > > One of the fixes could be to call BackgroundWorkerUnblockSignals just > > after sigsetjmp. I'm not sure if this is the best solution. > > Robert & myself had a discussion about the problem yesterday. We felt > > this is a genuine problem with the parallel worker error handling and > > need to be fixed. > > > > Note that, in all sigsetjmp blocks, we intentionally > HOLD_INTERRUPTS(), to not cause any issues while performing error > handling, I'm concerned here that now, if we directly call > BackgroundWorkerUnblockSignals() which will open up all the signals > and our main intention of holding interrupts or block signals may go > away. > > Since the main problem for this hang issue is because of blocking > SIGUSR1, in sigsetjmp, can't we just only unblock only the SIGUSR1, > instead of unblocking all signals? I tried this with parallel copy > hang, the issue is resolved. > On putting further thoughts on this, I feel just unblocking SIGUSR1 would be the right approach in this case. I'm attaching a new patch which unblocks SIGUSR1 signal. I have verified that the original issue with WIP parallel copy patch gets fixed. I have made changes only in bgworker.c as we require the parallel worker to receive this signal and continue processing. I have not included the changes for other processes as I'm not sure if this scenario is applicable for other processes. Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
Вложения
On Thu, Jul 23, 2020 at 12:54 PM vignesh C <vignesh21@gmail.com> wrote: > > Thanks for reviewing and adding your thoughts, My comments are inline. > > On Fri, Jul 17, 2020 at 1:21 PM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > The same hang issue can occur(though I'm not able to back it up with a > > use case), in the cases from wherever the EmitErrorReport() is called > > from "if (sigsetjmp(local_sigjmp_buf, 1) != 0)" block, such as > > autovacuum.c, bgwriter.c, bgworker.c, checkpointer.c, walwriter.c, and > > postgres.c. > > > > I'm not sure if this can occur in other cases. > I checked further on this point: Yes, it can't occur for the other cases, as mq_putmessage() gets only used for parallel workers(ParallelWorkerMain() --> pq_redirect_to_shm_mq()). > > > Note that, in all sigsetjmp blocks, we intentionally > > HOLD_INTERRUPTS(), to not cause any issues while performing error > > handling, I'm concerned here that now, if we directly call > > BackgroundWorkerUnblockSignals() which will open up all the signals > > and our main intention of holding interrupts or block signals may go > > away. > > > > Since the main problem for this hang issue is because of blocking > > SIGUSR1, in sigsetjmp, can't we just only unblock only the SIGUSR1, > > instead of unblocking all signals? I tried this with parallel copy > > hang, the issue is resolved. > > > > On putting further thoughts on this, I feel just unblocking SIGUSR1 > would be the right approach in this case. I'm attaching a new patch > which unblocks SIGUSR1 signal. I have verified that the original issue > with WIP parallel copy patch gets fixed. I have made changes only in > bgworker.c as we require the parallel worker to receive this signal > and continue processing. I have not included the changes for other > processes as I'm not sure if this scenario is applicable for other > processes. > Since we are sure that this hang issue can occur only for parallel workers, and the change in StartBackgroundWorker's sigsetjmp's block should only be made for parallel worker cases. And also there can be a lot of other callbacks execution and processing in proc_exit() for which we might not need the SIGUSR1 unblocked. So, let's undo the unblocking right after EmitErrorReport() to not cause any new issues. Attaching a modified v2 patch: it has the unblocking for only for parallel workers, undoing it after EmitErrorReport(), and some adjustments in the comment. I verified this fix for the parallel copy hang issue. And also make check and make check-world passes. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Вложения
On Fri, Jul 24, 2020 at 12:41 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Thu, Jul 23, 2020 at 12:54 PM vignesh C <vignesh21@gmail.com> wrote: > > > > Thanks for reviewing and adding your thoughts, My comments are inline. > > > > On Fri, Jul 17, 2020 at 1:21 PM Bharath Rupireddy > > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > > > The same hang issue can occur(though I'm not able to back it up with a > > > use case), in the cases from wherever the EmitErrorReport() is called > > > from "if (sigsetjmp(local_sigjmp_buf, 1) != 0)" block, such as > > > autovacuum.c, bgwriter.c, bgworker.c, checkpointer.c, walwriter.c, and > > > postgres.c. > > > > > > > I'm not sure if this can occur in other cases. > > > > I checked further on this point: Yes, it can't occur for the other > cases, as mq_putmessage() gets only used for parallel > workers(ParallelWorkerMain() --> pq_redirect_to_shm_mq()). > > > > > > Note that, in all sigsetjmp blocks, we intentionally > > > HOLD_INTERRUPTS(), to not cause any issues while performing error > > > handling, I'm concerned here that now, if we directly call > > > BackgroundWorkerUnblockSignals() which will open up all the signals > > > and our main intention of holding interrupts or block signals may go > > > away. > > > > > > Since the main problem for this hang issue is because of blocking > > > SIGUSR1, in sigsetjmp, can't we just only unblock only the SIGUSR1, > > > instead of unblocking all signals? I tried this with parallel copy > > > hang, the issue is resolved. > > > > > > > On putting further thoughts on this, I feel just unblocking SIGUSR1 > > would be the right approach in this case. I'm attaching a new patch > > which unblocks SIGUSR1 signal. I have verified that the original issue > > with WIP parallel copy patch gets fixed. I have made changes only in > > bgworker.c as we require the parallel worker to receive this signal > > and continue processing. I have not included the changes for other > > processes as I'm not sure if this scenario is applicable for other > > processes. > > > > Since we are sure that this hang issue can occur only for parallel > workers, and the change in StartBackgroundWorker's sigsetjmp's block > should only be made for parallel worker cases. And also there can be a > lot of other callbacks execution and processing in proc_exit() for > which we might not need the SIGUSR1 unblocked. So, let's undo the > unblocking right after EmitErrorReport() to not cause any new issues. > > Attaching a modified v2 patch: it has the unblocking for only for > parallel workers, undoing it after EmitErrorReport(), and some > adjustments in the comment. > I have made slight changes on top of the patch to remove duplicate code, attached v3 patch for the same. The parallel worker hang issue gets resolved, make check & make check-world passes. Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
Вложения
On Sat, Jul 25, 2020 at 7:02 AM vignesh C <vignesh21@gmail.com> wrote: > > I have made slight changes on top of the patch to remove duplicate > code, attached v3 patch for the same. > The parallel worker hang issue gets resolved, make check & make > check-world passes. > Having a function to unblock selective signals for a bg worker looks good to me. Few comments: 1. Do we need "worker" as a function argument in update_parallel_worker_sigmask(BackgroundWorker *worker,.... ? Since MyBgworkerEntry is a global variable, can't we have a local variable instead? 2. Instead of update_parallel_worker_sigmask() serving only for parallel workers, can we make it generic, so that for any bgworker, given a signal it unblocks it, although there's no current use case for a bg worker unblocking a single signal other than a parallel worker doing it for SIGUSR1 for this hang issue. Please note that we have BackgroundWorkerBlockSignals() and BackgroundWorkerUnblockSignals(). I slightly modified your function, something like below? void BackgroundWorkerUpdateSignalMask(int signum, bool toblock) { if (toblock) sigaddset(&BlockSig, signum); else sigdelset(&BlockSig, signum); PG_SETMASK(&BlockSig); } /*to unblock SIGUSR1*/ if ((worker->bgw_flags & BGWORKER_CLASS_PARALLEL) != 0) BackgroundWorkerUpdateSignalMask(SIGUSR1, false); /*to block SIGUSR1*/ if ((worker->bgw_flags & BGWORKER_CLASS_PARALLEL) != 0) BackgroundWorkerUpdateSignalMask(SIGUSR1, true); If okay, with the BackgroundWorkerUpdateSignalMask() function, please note that we might have to add it in bgworker.sgml as well. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Thanks for your comments Bharath. On Mon, Jul 27, 2020 at 10:13 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > 1. Do we need "worker" as a function argument in > update_parallel_worker_sigmask(BackgroundWorker *worker,.... ? Since > MyBgworkerEntry is a global variable, can't we have a local variable > instead? Fixed, I have moved the worker check to the caller function. > 2. Instead of update_parallel_worker_sigmask() serving only for > parallel workers, can we make it generic, so that for any bgworker, > given a signal it unblocks it, although there's no current use case > for a bg worker unblocking a single signal other than a parallel > worker doing it for SIGUSR1 for this hang issue. Please note that we > have BackgroundWorkerBlockSignals() and > BackgroundWorkerUnblockSignals(). Fixed. I have slightly modified the changes to break into BackgroundWorkerRemoveBlockSignal & BackgroundWorkerAddBlockSignal. This maintains the consistency similar to BackgroundWorkerBlockSignals() and BackgroundWorkerUnblockSignals(). > If okay, with the BackgroundWorkerUpdateSignalMask() function, please > note that we might have to add it in bgworker.sgml as well. Included the documentation. Attached the updated patch for the same. Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
Вложения
On Tue, Jul 28, 2020 at 11:05 AM vignesh C <vignesh21@gmail.com> wrote: > > Thanks for your comments Bharath. > > On Mon, Jul 27, 2020 at 10:13 AM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > 1. Do we need "worker" as a function argument in > > update_parallel_worker_sigmask(BackgroundWorker *worker,.... ? Since > > MyBgworkerEntry is a global variable, can't we have a local variable > > instead? > > Fixed, I have moved the worker check to the caller function. > > > 2. Instead of update_parallel_worker_sigmask() serving only for > > parallel workers, can we make it generic, so that for any bgworker, > > given a signal it unblocks it, although there's no current use case > > for a bg worker unblocking a single signal other than a parallel > > worker doing it for SIGUSR1 for this hang issue. Please note that we > > have BackgroundWorkerBlockSignals() and > > BackgroundWorkerUnblockSignals(). > > Fixed. I have slightly modified the changes to break into > BackgroundWorkerRemoveBlockSignal & BackgroundWorkerAddBlockSignal. > This maintains the consistency similar to > BackgroundWorkerBlockSignals() and BackgroundWorkerUnblockSignals(). > > > If okay, with the BackgroundWorkerUpdateSignalMask() function, please > > note that we might have to add it in bgworker.sgml as well. > > Included the documentation. > > Attached the updated patch for the same. > The v4 patch looks good to me. Hang is not seen, make check and make check-world passes. I moved this to the committer for further review in https://commitfest.postgresql.org/29/2636/. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
On Tue, Jul 28, 2020 at 5:35 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > The v4 patch looks good to me. Hang is not seen, make check and make > check-world passes. I moved this to the committer for further review > in https://commitfest.postgresql.org/29/2636/. I don't think I agree with this approach. In particular, I don't understand the rationale for unblocking only SIGUSR1. Above, Vignesh says that he feels that unblocking only that signal would be the right approach, but no reason is given. I have two reasons why I suspect it's not the right approach. One, it doesn't seem to be what we do elsewhere; the only existing cases where we have special handling for particular signals are SIGQUIT and SIGPIPE, and those places have comments explaining the reason why they are handled in a special way. Two, SIGUSR1 is used for a LOT of things: look at all the different cases procsignal_sigusr1_handler() checks. If the intention is to only allow the things we know are safe, rather than all the signals there are, I think this coding utterly fails to achieve that - and for reasons that I don't think are really fixable. My first idea about how to fix this was just to call BackgroundWorkerUnblockSignals() before sigsetjmp(), but that doesn't really work, because ParallelWorkerMain() needs to set the handler for SIGTERM before unblocking signals. When you really look at it, the code that does sigsetjmp() in StartBackgroundWorker() is entirely bogus. The comment says "See notes in postgres.c about the design of this coding," but if you go read that comment, it says that the point of using sigsetjmp() is to make sure that signals are unblocked within the if-block that follows, but the use in bgworker.c actually achieves exactly the opposite, because signals have not yet been unblocked at this point. So, whereas the postgres.c code unblocks signals if they are blocked, this code blocks signals if they are unblocked. Given that, maybe the right thing to do is to just start the if-block with a call to BackgroundWorkerUnblockSignals(). Perhaps there's some reason that would be unsafe, if the failure occurs too early: postgres.c doesn't unblock signals until after BaseInit() and InitProcess() have been called, but here an error in those functions would unblock signals while it's being handled. Off-hand, I don't see why that would matter, though. In the postgres.c case, there wouldn't be a PG_exception_stack yet, so we'd end up in the long part of pg_re_throw(), which basically promotes the ERROR to FATAL but otherwise does pretty similar things to what this handler does anyway. So I'm not really sure there's any reason not to just go this way. Another approach would be to establish a new PG_exception_stack() with a free sigsetjmp() call and fresh local buffer, inside ParallelWorkerMain(). It would do the same thing as the existing handler, but because it would be established after unblocking signals, sigsetjmp() would behave as desired. This doesn't seem quite as good to me because I think this pattern might end up getting copied into many background workers, and it's a bunch of extra code for which I don't really see a clear need. So at the moment I think a one line fix in StartBackgroundWorker(), to just unblock signals while handling errors, looks better. Adding Alvaro to the CC line, since I think he wrote this code originally. Not sure if he or anyone else might have an opinion. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Aug 7, 2020 at 1:34 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Jul 28, 2020 at 5:35 AM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > The v4 patch looks good to me. Hang is not seen, make check and make > > check-world passes. I moved this to the committer for further review > > in https://commitfest.postgresql.org/29/2636/. > > I don't think I agree with this approach. In particular, I don't > understand the rationale for unblocking only SIGUSR1. Above, Vignesh > says that he feels that unblocking only that signal would be the right > approach, but no reason is given. I have two reasons why I suspect > it's not the right approach. One, it doesn't seem to be what we do > elsewhere; the only existing cases where we have special handling for > particular signals are SIGQUIT and SIGPIPE, and those places have > comments explaining the reason why they are handled in a special way. > We intend to unblock SIGQUIT before sigsetjmp() in places like bgwriter, checkpointer, walwriter and walreceiver, but we only call sigdelset(&BlockSig, SIGQUIT);, Without PG_SETMASK(&BlockSig);, seems like we are not actually unblocking SIQUIT and quickdie() will never get called in these processes if (sigsetjmp(local_sigjmp_buf, 1) != 0){....}, if postmaster sends a SIGQUIT while these processes are doing clean up tasks in sigsetjmp(), it will not be received, and the postmaster later sends SIGKLL to kill from below places. /* * If we already sent SIGQUIT to children and they are slow to shut * down, it's time to send them SIGKILL. This doesn't happen * normally, but under certain conditions backends can get stuck while * shutting down. This is a last measure to get them unwedged. * * Note we also do this during recovery from a process crash. */ if ((Shutdown >= ImmediateShutdown || (FatalError && !SendStop)) && AbortStartTime != 0 && (now - AbortStartTime) >= SIGKILL_CHILDREN_AFTER_SECS) { /* We were gentle with them before. Not anymore */ TerminateChildren(SIGKILL); /* reset flag so we don't SIGKILL again */ AbortStartTime = 0; } Shouldn't we call PG_SETMASK(&BlockSig); to make it effective? Am I missing anything here? With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
On Fri, Aug 7, 2020 at 5:05 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > We intend to unblock SIGQUIT before sigsetjmp() in places like > bgwriter, checkpointer, walwriter and walreceiver, but we only call > sigdelset(&BlockSig, SIGQUIT);, Without PG_SETMASK(&BlockSig);, seems > like we are not actually unblocking SIQUIT and quickdie() will never > get called in these processes if (sigsetjmp(local_sigjmp_buf, 1) != > 0){....} Yeah, maybe so. This code has been around for a long time and I'm not sure what the thought process behind it was, but I don't see a flaw in your analysis here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Aug 7, 2020 at 5:05 AM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: >> We intend to unblock SIGQUIT before sigsetjmp() in places like >> bgwriter, checkpointer, walwriter and walreceiver, but we only call >> sigdelset(&BlockSig, SIGQUIT);, Without PG_SETMASK(&BlockSig);, seems >> like we are not actually unblocking SIQUIT and quickdie() will never >> get called in these processes if (sigsetjmp(local_sigjmp_buf, 1) != >> 0){....} > Yeah, maybe so. This code has been around for a long time and I'm not > sure what the thought process behind it was, but I don't see a flaw in > your analysis here. I think that code is the way it is intentionally: the idea is to not accept any signals until we reach the explicit "PG_SETMASK(&UnBlockSig);" call further down, between the sigsetjmp stanza and the main loop. The sigdelset call, just like the adjacent pqsignal calls, is preparatory setup; it does not intend to allow anything to happen immediately. In general, you don't want to accept signals in that area because the process state may not be fully set up yet. You could argue that the SIGQUIT handler has no state dependencies, making it safe to accept SIGQUIT earlier during startup of one of these processes, and likewise for them to accept SIGQUIT during error recovery. But barring actual evidence of a problem with slow SIGQUIT response in these areas I'm more inclined to leave well enough alone. Changing this would add hazards, e.g. if somebody ever changes the behavior of the SIGQUIT handler, so I'd want some concrete evidence of a benefit. It seems fairly irrelevant to the problem at hand with bgworkers, anyway. As for said problem, I concur with Robert that the v4 patch seems pretty dubious; it's adding a lot of barely-thought-out mechanism for no convincing gain in safety. I think the v1 patch was more nearly the right thing, except that the unblock needs to happen a tad further down, as attached (this is untested but certainly it should pass any test that v1 passed). I didn't do anything about rewriting the bogus comment just above the sigsetjmp call, but I agree that that should happen too. regards, tom lane diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c index beb5e85434..dd22241600 100644 --- a/src/backend/postmaster/bgworker.c +++ b/src/backend/postmaster/bgworker.c @@ -753,7 +753,14 @@ StartBackgroundWorker(void) /* Prevent interrupts while cleaning up */ HOLD_INTERRUPTS(); - /* Report the error to the server log */ + /* + * sigsetjmp will have blocked all signals, but we may need to accept + * signals while communicating with our parallel leader. Once we've + * done HOLD_INTERRUPTS() it should be safe to unblock signals. + */ + BackgroundWorkerUnblockSignals(); + + /* Report the error to the parallel leader and the server log */ EmitErrorReport(); /*
On Fri, Aug 7, 2020 at 11:36 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I think that code is the way it is intentionally: the idea is to not > accept any signals until we reach the explicit "PG_SETMASK(&UnBlockSig);" > call further down, between the sigsetjmp stanza and the main loop. > The sigdelset call, just like the adjacent pqsignal calls, is > preparatory setup; it does not intend to allow anything to happen > immediately. I don't think that your analysis here is correct. The sigdelset call is manipulating BlockSig, and the subsequent PG_SETMASK call is working with UnblockSig, so it doesn't make sense to view one as a preparatory step for the other. It could be correct to interpret the sigdelset call as preparatory work for a future call to PG_SETMASK(&BlockSig), but AFAICS there are no such calls in the processes where this incantation exists, so really it just seems to be a no-op. Furthermore, the comment says that the point of the sigdelset() is to allow SIGQUIT "at all times," which doesn't square well with your suggestion that we intended it to take effect only later. > In general, you don't want to accept signals in that area because the > process state may not be fully set up yet. You could argue that the > SIGQUIT handler has no state dependencies, making it safe to accept > SIGQUIT earlier during startup of one of these processes, and likewise > for them to accept SIGQUIT during error recovery. But barring actual > evidence of a problem with slow SIGQUIT response in these areas I'm more > inclined to leave well enough alone. Changing this would add hazards, > e.g. if somebody ever changes the behavior of the SIGQUIT handler, so > I'd want some concrete evidence of a benefit. It seems fairly > irrelevant to the problem at hand with bgworkers, anyway. The SIGQUIT handler in question contains nothing than a call to _exit(2) and a long comment explaining why we don't do anything else, so I think the argument that it has no state dependencies is pretty well water-tight. Whether or not we've got a problem with timely SIGQUIT acceptance is much less clear. So it seems to me that the safer thing to do here would be to unblock the signal. It might gain something, and it can't really lose anything. Now it's true that the calculus might change if someone were to modify the behavior of the SIGQUIT handler in the future, but if they do then it's their job to think about this stuff. It doesn't seem especially likely for that to change, anyway. The only reason that the handler for regular backends does anything other than _exit(2) is that we want to try to let the client know what happened before we croak, and that concern is irrelevant for background workers. Doing any other cleanup here is unsafe and unnecessary. > As for said problem, I concur with Robert that the v4 patch seems pretty > dubious; it's adding a lot of barely-thought-out mechanism for no > convincing gain in safety. I think the v1 patch was more nearly the right > thing, except that the unblock needs to happen a tad further down, as > attached (this is untested but certainly it should pass any test that v1 > passed). I didn't do anything about rewriting the bogus comment just > above the sigsetjmp call, but I agree that that should happen too. I am not sure whether the difference between this and v1 matters, because in postgres.c it's effectively happening inside sigsetjmp, so the earlier unblock must not be that bad. But I don't mind putting it in the place you suggest. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Aug 7, 2020 at 11:36 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> The sigdelset call, just like the adjacent pqsignal calls, is >> preparatory setup; it does not intend to allow anything to happen >> immediately. > I don't think that your analysis here is correct. The sigdelset call > is manipulating BlockSig, and the subsequent PG_SETMASK call is > working with UnblockSig, so it doesn't make sense to view one as a > preparatory step for the other. That SETMASK call will certainly unblock SIGQUIT, so I don't see what your point is. Anyway, the bottom line is that that code's been like that for a decade or two without complaints, so I'm disinclined to mess with it on the strength of nothing much. regards, tom lane
On Fri, Aug 7, 2020 at 12:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I don't think that your analysis here is correct. The sigdelset call > > is manipulating BlockSig, and the subsequent PG_SETMASK call is > > working with UnblockSig, so it doesn't make sense to view one as a > > preparatory step for the other. > > That SETMASK call will certainly unblock SIGQUIT, so I don't see what > your point is. I can't figure out if you're trolling me here or what. It's true that the PG_SETMASK() call will certainly unblock SIGQUIT, but that would also be true if the sigdelset() call were absent. > Anyway, the bottom line is that that code's been like > that for a decade or two without complaints, so I'm disinclined to > mess with it on the strength of nothing much. Really? Have you reversed your policy of wanting the comments to accurately describe what the code does? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Aug 7, 2020 at 12:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> That SETMASK call will certainly unblock SIGQUIT, so I don't see what >> your point is. > I can't figure out if you're trolling me here or what. It's true that > the PG_SETMASK() call will certainly unblock SIGQUIT, but that would > also be true if the sigdelset() call were absent. The point of the sigdelset is that if somewhere later on, we install the BlockSig mask, then SIGQUIT will remain unblocked. You asserted upthread that noplace in these processes ever does so; maybe that's true today, or maybe not, but the intent of this code is that *once we get through initialization* SIGQUIT will remain unblocked. I'll concede that it's not 100% clear whether or not these processes need to re-block SIGQUIT during error recovery. I repeat, though, that I'm disinclined to change that without some evidence that there's actually a problem with the way it works now. regards, tom lane
On Fri, Aug 7, 2020 at 2:00 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > The point of the sigdelset is that if somewhere later on, we install > the BlockSig mask, then SIGQUIT will remain unblocked. I mean, we're just repeating the same points here, but that's not what the comment says. > You asserted > upthread that noplace in these processes ever does so; maybe that's > true today, or maybe not, It's easily checked using 'git grep'. > but the intent of this code is that *once > we get through initialization* SIGQUIT will remain unblocked. I can't speak to the intent, but I can speak to what the comment says. > I'll concede that it's not 100% clear whether or not these processes > need to re-block SIGQUIT during error recovery. I think it's entirely clear that they do not, and I have explained my reasoning already. > I repeat, though, > that I'm disinclined to change that without some evidence that there's > actually a problem with the way it works now. I've also already explained why I don't agree with this perspective. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Aug 7, 2020 at 11:30 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Robert Haas <robertmhaas@gmail.com> writes: > > On Fri, Aug 7, 2020 at 12:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> That SETMASK call will certainly unblock SIGQUIT, so I don't see what > >> your point is. > > > I can't figure out if you're trolling me here or what. It's true that > > the PG_SETMASK() call will certainly unblock SIGQUIT, but that would > > also be true if the sigdelset() call were absent. > > The point of the sigdelset is that if somewhere later on, we install > the BlockSig mask, then SIGQUIT will remain unblocked. You asserted > upthread that noplace in these processes ever does so; maybe that's > true today, or maybe not, but the intent of this code is that *once > we get through initialization* SIGQUIT will remain unblocked. > > I'll concede that it's not 100% clear whether or not these processes > need to re-block SIGQUIT during error recovery. I repeat, though, > that I'm disinclined to change that without some evidence that there's > actually a problem with the way it works now. > I think the main point that needs to be thought is that: will any of the bgwriter, checkpointer, walwriter and walreceiver processes need to unblock SIGQUIT during their error recovery code paths i.e. in their respective if (sigsetjmp(local_sigjmp_buf, 1) != 0){....} stanzas? Currently, SIGQUIT is blocked in the sigsetjmp() stanza. If the answer is yes: then we must do PG_SETMASK(&BlockSig); :either right after sigdelset(&BlockSig, SIGQUIT); to allow quickdie() even before the sigsetjmp() stanza and also in the sigsetjmp() stanza or do PG_SETMASK(&BlockSig); only inside the sigsetjmp() stanza. The postmaster sends SIGQUIT in immediate shutdown mode and it gives children a chance to exit safely, but if the children take longer time, then it anyways kills them with SIGKILL(note that SIGKILL can not be handled or ignored by any process). If the answer is no: let these processes perform clean ups in their respective sigsetjmp() stanzas, until the postmaster sends SIGKILL if the clean ups take time. We could have some elaborated comments before sigdelset(&BlockSig, SIGQUIT); instead of "/* We allow SIGQUIT (quickdie) at all times */" to avoid confusion. We must not worry about blocking or unblocking SIGQUIT in these processes after the sigsetjmp() stanza, as it anyways gets unblocked by PG_SETMASK(&UnBlockSig); and also no problem if somebody does PG_SETMASK(&BlockSig); in future as we have already done sigdelset(&BlockSig, SIGQUIT);. Can we start a separate thread to discuss this SIGQUIT point to not sidetrack the main issue "Parallel worker hangs while handling errors."? With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
On Fri, Aug 7, 2020 at 1:34 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Jul 28, 2020 at 5:35 AM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > The v4 patch looks good to me. Hang is not seen, make check and make > > check-world passes. I moved this to the committer for further review > > in https://commitfest.postgresql.org/29/2636/. > > I don't think I agree with this approach. In particular, I don't > understand the rationale for unblocking only SIGUSR1. Above, Vignesh > says that he feels that unblocking only that signal would be the right > approach, but no reason is given. I have two reasons why I suspect > it's not the right approach. One, it doesn't seem to be what we do > elsewhere; the only existing cases where we have special handling for > particular signals are SIGQUIT and SIGPIPE, and those places have > comments explaining the reason why they are handled in a special way. > Two, SIGUSR1 is used for a LOT of things: look at all the different > cases procsignal_sigusr1_handler() checks. If the intention is to only > allow the things we know are safe, rather than all the signals there > are, I think this coding utterly fails to achieve that - and for > reasons that I don't think are really fixable. > My intention of blocking only SIGUSR1 over unblocking all signals mainly because we are already in the error path and we are about to exit after emitting the error report. I was not sure if we intended to receive any other signal just before exiting. The Solution Robert & Tom are suggesting by Calling BackgroundWorkerUnblockSignals fixes the actual problem. Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
vignesh C <vignesh21@gmail.com> writes: > The Solution Robert & Tom are suggesting by Calling > BackgroundWorkerUnblockSignals fixes the actual problem. I've gone ahead and pushed the bgworker fix, since everyone seems to agree that that's okay, and it is provably fixing a problem. As for the question of SIGQUIT handling, I see that postgres.c does "PG_SETMASK(&BlockSig)" immediately after applying the sigdelset change, so there probably isn't any harm in having the background processes do likewise. I wonder though why bgworkers are not applying the same policy. (I remain of the opinion that any changes in this area should not be back-patched without evidence of a concrete problem; it's at least as likely that we'll introduce a problem as fix one.) regards, tom lane
On 2020-Sep-03, Tom Lane wrote: > As for the question of SIGQUIT handling, I see that postgres.c > does "PG_SETMASK(&BlockSig)" immediately after applying the sigdelset > change, so there probably isn't any harm in having the background > processes do likewise. I wonder though why bgworkers are not > applying the same policy. It's quite likely that it's the way it is more by accident than because I was thinking extremely carefully about signal handling when originally writing that code. Some parts of that code I was copying from others' patches, and I could easily have missed a detail like this. (I didn't "git blame" to verify that these parts are mine, though). -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I wrote: > As for the question of SIGQUIT handling, I see that postgres.c > does "PG_SETMASK(&BlockSig)" immediately after applying the sigdelset > change, so there probably isn't any harm in having the background > processes do likewise. Concretely, something about like this (I just did the bgwriter, but we'd want the same in all the background processes). I tried to respond to Robert's complaint about the inaccurate comment just above sigsetjmp, too. This passes check-world, for what little that's worth. regards, tom lane diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c index 069e27e427..3ae7901bf6 100644 --- a/src/backend/postmaster/bgwriter.c +++ b/src/backend/postmaster/bgwriter.c @@ -117,6 +117,7 @@ BackgroundWriterMain(void) /* We allow SIGQUIT (quickdie) at all times */ sigdelset(&BlockSig, SIGQUIT); + PG_SETMASK(&BlockSig); /* * We just started, assume there has been either a shutdown or @@ -140,7 +141,19 @@ BackgroundWriterMain(void) /* * If an exception is encountered, processing resumes here. * - * See notes in postgres.c about the design of this coding. + * You might wonder why this isn't coded as an infinite loop around a + * PG_TRY construct. The reason is that this is the bottom of the + * exception stack, and so with PG_TRY there would be no exception handler + * in force at all during the CATCH part. By leaving the outermost setjmp + * always active, we have at least some chance of recovering from an error + * during error recovery. (If we get into an infinite loop thereby, it + * will soon be stopped by overflow of elog.c's internal state stack.) + * + * Note that we use sigsetjmp(..., 1), so that the prevailing signal mask + * (to wit, BlockSig) will be restored when longjmp'ing to here. Thus, + * signals will be blocked until we complete error recovery. It might + * seem that this policy makes the HOLD_INTERRUPTS() call redundant, but + * it is not since InterruptPending might be set already. */ if (sigsetjmp(local_sigjmp_buf, 1) != 0) {
On Thu, Sep 3, 2020 at 5:29 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Concretely, something about like this (I just did the bgwriter, but > we'd want the same in all the background processes). I tried to > respond to Robert's complaint about the inaccurate comment just above > sigsetjmp, too. > > This passes check-world, for what little that's worth. Seems totally reasonable from here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Sep 3, 2020 at 5:29 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Concretely, something about like this (I just did the bgwriter, but >> we'd want the same in all the background processes). I tried to >> respond to Robert's complaint about the inaccurate comment just above >> sigsetjmp, too. >> This passes check-world, for what little that's worth. > Seems totally reasonable from here. OK, I did the same in other relevant places and pushed it. It's not clear to me whether we want to institute the "accepting SIGQUIT is always okay" rule in processes that didn't already have code to change BlockSig. The relevant processes are pgarch.c, startup.c, bgworker.c, autovacuum.c (launcher and workers both), and walsender.c. In the first two of these I doubt it matters, because I don't think they'll ever block signals again anyway -- they certainly don't have outer sigsetjmp blocks. And I'm a bit hesitant to mess with bgworker given that we seem to expect that to be heavily used by extension code, and we're exposing code to allow extensions to mess with the signal blocking state. On the other hand, as long as SIGQUIT is pointing at SignalHandlerForCrashExit, it's hard to see a reason why holding it off could be necessary. So maybe having a uniform rule would be good. Any thoughts about that? regards, tom lane
On Fri, Sep 11, 2020 at 4:20 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > It's not clear to me whether we want to institute the "accepting SIGQUIT > is always okay" rule in processes that didn't already have code to change > BlockSig. The relevant processes are pgarch.c, startup.c, bgworker.c, > autovacuum.c (launcher and workers both), and walsender.c. In the first > two of these I doubt it matters, because I don't think they'll ever block > signals again anyway -- they certainly don't have outer sigsetjmp blocks. > And I'm a bit hesitant to mess with bgworker given that we seem to expect > that to be heavily used by extension code, and we're exposing code to > allow extensions to mess with the signal blocking state. On the other > hand, as long as SIGQUIT is pointing at SignalHandlerForCrashExit, it's > hard to see a reason why holding it off could be necessary. So maybe > having a uniform rule would be good. > > Any thoughts about that? I think a backend process that isn't timely handling SIGQUIT is by that very fact buggy. "pg_ctl stop -mi" isn't a friendly suggestion. So I favor trying to make it uniform. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Sep 11, 2020 at 4:20 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> It's not clear to me whether we want to institute the "accepting SIGQUIT >> is always okay" rule in processes that didn't already have code to change >> BlockSig. > I think a backend process that isn't timely handling SIGQUIT is by > that very fact buggy. "pg_ctl stop -mi" isn't a friendly suggestion. > So I favor trying to make it uniform. Well, if we want to take a hard line about that, we should centralize the setup of SIGQUIT. The attached makes InitPostmasterChild do it, and removes the duplicative code from elsewhere. I also flipped autovacuum and walsender from using quickdie to using SignalHandlerForCrashExit. Whatever you think about the safety or unsafety of quickdie, there seems no reason for autovacuum to be trying to tell its nonexistent client about a shutdown. I don't think it's terribly useful for a walsender either, though maybe somebody has a different opinion about that? regards, tom lane diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c index 19ba26b914..2cef56f115 100644 --- a/src/backend/postmaster/autovacuum.c +++ b/src/backend/postmaster/autovacuum.c @@ -454,8 +454,8 @@ AutoVacLauncherMain(int argc, char *argv[]) pqsignal(SIGHUP, SignalHandlerForConfigReload); pqsignal(SIGINT, StatementCancelHandler); pqsignal(SIGTERM, SignalHandlerForShutdownRequest); + /* SIGQUIT handler was already set up by InitPostmasterChild */ - pqsignal(SIGQUIT, quickdie); InitializeTimeouts(); /* establishes SIGALRM handler */ pqsignal(SIGPIPE, SIG_IGN); @@ -498,9 +498,10 @@ AutoVacLauncherMain(int argc, char *argv[]) * * Note that we use sigsetjmp(..., 1), so that the prevailing signal mask * (to wit, BlockSig) will be restored when longjmp'ing to here. Thus, - * signals will be blocked until we complete error recovery. It might - * seem that this policy makes the HOLD_INTERRUPTS() call redundant, but - * it is not since InterruptPending might be set already. + * signals other than SIGQUIT will be blocked until we complete error + * recovery. It might seem that this policy makes the HOLD_INTERRUPTS() + * call redundant, but it is not since InterruptPending might be set + * already. */ if (sigsetjmp(local_sigjmp_buf, 1) != 0) { @@ -1531,7 +1532,8 @@ AutoVacWorkerMain(int argc, char *argv[]) */ pqsignal(SIGINT, StatementCancelHandler); pqsignal(SIGTERM, die); - pqsignal(SIGQUIT, quickdie); + /* SIGQUIT handler was already set up by InitPostmasterChild */ + InitializeTimeouts(); /* establishes SIGALRM handler */ pqsignal(SIGPIPE, SIG_IGN); @@ -1562,9 +1564,9 @@ AutoVacWorkerMain(int argc, char *argv[]) * * Note that we use sigsetjmp(..., 1), so that the prevailing signal mask * (to wit, BlockSig) will be restored when longjmp'ing to here. Thus, - * signals will be blocked until we exit. It might seem that this policy - * makes the HOLD_INTERRUPTS() call redundant, but it is not since - * InterruptPending might be set already. + * signals other than SIGQUIT will be blocked until we exit. It might + * seem that this policy makes the HOLD_INTERRUPTS() call redundant, but + * it is not since InterruptPending might be set already. */ if (sigsetjmp(local_sigjmp_buf, 1) != 0) { diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c index d043ced686..5a9a0e3435 100644 --- a/src/backend/postmaster/bgworker.c +++ b/src/backend/postmaster/bgworker.c @@ -731,9 +731,9 @@ StartBackgroundWorker(void) pqsignal(SIGFPE, SIG_IGN); } pqsignal(SIGTERM, bgworker_die); + /* SIGQUIT handler was already set up by InitPostmasterChild */ pqsignal(SIGHUP, SIG_IGN); - pqsignal(SIGQUIT, SignalHandlerForCrashExit); InitializeTimeouts(); /* establishes SIGALRM handler */ pqsignal(SIGPIPE, SIG_IGN); diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c index c96568149f..a7afa758b6 100644 --- a/src/backend/postmaster/bgwriter.c +++ b/src/backend/postmaster/bgwriter.c @@ -104,7 +104,7 @@ BackgroundWriterMain(void) pqsignal(SIGHUP, SignalHandlerForConfigReload); pqsignal(SIGINT, SIG_IGN); pqsignal(SIGTERM, SignalHandlerForShutdownRequest); - pqsignal(SIGQUIT, SignalHandlerForCrashExit); + /* SIGQUIT handler was already set up by InitPostmasterChild */ pqsignal(SIGALRM, SIG_IGN); pqsignal(SIGPIPE, SIG_IGN); pqsignal(SIGUSR1, procsignal_sigusr1_handler); @@ -115,10 +115,6 @@ BackgroundWriterMain(void) */ pqsignal(SIGCHLD, SIG_DFL); - /* We allow SIGQUIT (SignalHandlerForCrashExit) at all times */ - sigdelset(&BlockSig, SIGQUIT); - PG_SETMASK(&BlockSig); - /* * We just started, assume there has been either a shutdown or * end-of-recovery snapshot. diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c index 45f5deca72..3e7dcd4f76 100644 --- a/src/backend/postmaster/checkpointer.c +++ b/src/backend/postmaster/checkpointer.c @@ -198,7 +198,7 @@ CheckpointerMain(void) pqsignal(SIGHUP, SignalHandlerForConfigReload); pqsignal(SIGINT, ReqCheckpointHandler); /* request checkpoint */ pqsignal(SIGTERM, SIG_IGN); /* ignore SIGTERM */ - pqsignal(SIGQUIT, SignalHandlerForCrashExit); + /* SIGQUIT handler was already set up by InitPostmasterChild */ pqsignal(SIGALRM, SIG_IGN); pqsignal(SIGPIPE, SIG_IGN); pqsignal(SIGUSR1, procsignal_sigusr1_handler); @@ -209,10 +209,6 @@ CheckpointerMain(void) */ pqsignal(SIGCHLD, SIG_DFL); - /* We allow SIGQUIT (SignalHandlerForCrashExit) at all times */ - sigdelset(&BlockSig, SIGQUIT); - PG_SETMASK(&BlockSig); - /* * Initialize so that first time-driven event happens at the correct time. */ diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c index 37be0e2bbb..ed1b65358d 100644 --- a/src/backend/postmaster/pgarch.c +++ b/src/backend/postmaster/pgarch.c @@ -228,7 +228,7 @@ PgArchiverMain(int argc, char *argv[]) pqsignal(SIGHUP, SignalHandlerForConfigReload); pqsignal(SIGINT, SIG_IGN); pqsignal(SIGTERM, SignalHandlerForShutdownRequest); - pqsignal(SIGQUIT, SignalHandlerForCrashExit); + /* SIGQUIT handler was already set up by InitPostmasterChild */ pqsignal(SIGALRM, SIG_IGN); pqsignal(SIGPIPE, SIG_IGN); pqsignal(SIGUSR1, pgarch_waken); diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index 3cd6fa30eb..959e3b8873 100644 --- a/src/backend/postmaster/postmaster.c +++ b/src/backend/postmaster/postmaster.c @@ -4355,7 +4355,7 @@ BackendInitialize(Port *port) * cleaned up. */ pqsignal(SIGTERM, process_startup_packet_die); - pqsignal(SIGQUIT, SignalHandlerForCrashExit); + /* SIGQUIT handler was already set up by InitPostmasterChild */ InitializeTimeouts(); /* establishes SIGALRM handler */ PG_SETMASK(&StartupBlockSig); @@ -4435,7 +4435,7 @@ BackendInitialize(Port *port) status = ProcessStartupPacket(port, false, false); /* - * Disable the timeout, and prevent SIGTERM/SIGQUIT again. + * Disable the timeout, and prevent SIGTERM again. */ disable_timeout(STARTUP_PACKET_TIMEOUT, false); PG_SETMASK(&BlockSig); @@ -4983,10 +4983,6 @@ SubPostmasterMain(int argc, char *argv[]) if (strcmp(argv[1], "--forkavworker") == 0) AutovacuumWorkerIAm(); - /* In EXEC_BACKEND case we will not have inherited these settings */ - pqinitmask(); - PG_SETMASK(&BlockSig); - /* Read in remaining GUC variables */ read_nondefault_variables(); diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c index fd9ac35dac..64af7b8707 100644 --- a/src/backend/postmaster/startup.c +++ b/src/backend/postmaster/startup.c @@ -175,7 +175,7 @@ StartupProcessMain(void) pqsignal(SIGHUP, StartupProcSigHupHandler); /* reload config file */ pqsignal(SIGINT, SIG_IGN); /* ignore query cancel */ pqsignal(SIGTERM, StartupProcShutdownHandler); /* request shutdown */ - pqsignal(SIGQUIT, SignalHandlerForCrashExit); + /* SIGQUIT handler was already set up by InitPostmasterChild */ InitializeTimeouts(); /* establishes SIGALRM handler */ pqsignal(SIGPIPE, SIG_IGN); pqsignal(SIGUSR1, procsignal_sigusr1_handler); diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c index 358c0916ac..a52832fe90 100644 --- a/src/backend/postmaster/walwriter.c +++ b/src/backend/postmaster/walwriter.c @@ -101,7 +101,7 @@ WalWriterMain(void) pqsignal(SIGHUP, SignalHandlerForConfigReload); pqsignal(SIGINT, SignalHandlerForShutdownRequest); pqsignal(SIGTERM, SignalHandlerForShutdownRequest); - pqsignal(SIGQUIT, SignalHandlerForCrashExit); + /* SIGQUIT handler was already set up by InitPostmasterChild */ pqsignal(SIGALRM, SIG_IGN); pqsignal(SIGPIPE, SIG_IGN); pqsignal(SIGUSR1, procsignal_sigusr1_handler); @@ -112,10 +112,6 @@ WalWriterMain(void) */ pqsignal(SIGCHLD, SIG_DFL); - /* We allow SIGQUIT (SignalHandlerForCrashExit) at all times */ - sigdelset(&BlockSig, SIGQUIT); - PG_SETMASK(&BlockSig); - /* * Create a memory context that we will do all our work in. We do this so * that we can reset the context during error recovery and thereby avoid diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c index b180598507..17f1a49f87 100644 --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -270,7 +270,7 @@ WalReceiverMain(void) pqsignal(SIGHUP, WalRcvSigHupHandler); /* set flag to read config file */ pqsignal(SIGINT, SIG_IGN); pqsignal(SIGTERM, WalRcvShutdownHandler); /* request shutdown */ - pqsignal(SIGQUIT, SignalHandlerForCrashExit); + /* SIGQUIT handler was already set up by InitPostmasterChild */ pqsignal(SIGALRM, SIG_IGN); pqsignal(SIGPIPE, SIG_IGN); pqsignal(SIGUSR1, procsignal_sigusr1_handler); @@ -279,10 +279,6 @@ WalReceiverMain(void) /* Reset some signals that are accepted by postmaster but not here */ pqsignal(SIGCHLD, SIG_DFL); - /* We allow SIGQUIT (SignalHandlerForCrashExit) at all times */ - sigdelset(&BlockSig, SIGQUIT); - PG_SETMASK(&BlockSig); - /* Load the libpq-specific functions */ load_file("libpqwalreceiver", false); if (WalReceiverFunctions == NULL) diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c index 3f756b470a..5e6ae6fde5 100644 --- a/src/backend/replication/walsender.c +++ b/src/backend/replication/walsender.c @@ -3035,7 +3035,7 @@ WalSndSignals(void) pqsignal(SIGHUP, SignalHandlerForConfigReload); pqsignal(SIGINT, StatementCancelHandler); /* query cancel */ pqsignal(SIGTERM, die); /* request shutdown */ - pqsignal(SIGQUIT, quickdie); /* hard crash time */ + /* SIGQUIT handler was already set up by InitPostmasterChild */ InitializeTimeouts(); /* establishes SIGALRM handler */ pqsignal(SIGPIPE, SIG_IGN); pqsignal(SIGUSR1, procsignal_sigusr1_handler); diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index c9424f167c..411cfadbff 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -3820,7 +3820,8 @@ PostgresMain(int argc, char *argv[], } /* - * Set up signal handlers and masks. + * Set up signal handlers. (InitPostmasterChild or InitStandaloneProcess + * has already set up BlockSig and made that the active signal mask.) * * Note that postmaster blocked all signals before forking child process, * so there is no race condition whereby we might receive a signal before @@ -3842,6 +3843,9 @@ PostgresMain(int argc, char *argv[], pqsignal(SIGTERM, die); /* cancel current query and exit */ /* + * In a postmaster child backend, replace SignalHandlerForCrashExit + * with quickdie, so we can tell the client we're dying. + * * In a standalone backend, SIGQUIT can be generated from the keyboard * easily, while SIGTERM cannot, so we make both signals do die() * rather than quickdie(). @@ -3871,16 +3875,6 @@ PostgresMain(int argc, char *argv[], * platforms */ } - pqinitmask(); - - if (IsUnderPostmaster) - { - /* We allow SIGQUIT (quickdie) at all times */ - sigdelset(&BlockSig, SIGQUIT); - } - - PG_SETMASK(&BlockSig); /* block everything except SIGQUIT */ - if (!IsUnderPostmaster) { /* diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c index cf8f9579c3..ed2ab4b5b2 100644 --- a/src/backend/utils/init/miscinit.c +++ b/src/backend/utils/init/miscinit.c @@ -32,10 +32,12 @@ #include "catalog/pg_authid.h" #include "common/file_perm.h" #include "libpq/libpq.h" +#include "libpq/pqsignal.h" #include "mb/pg_wchar.h" #include "miscadmin.h" #include "pgstat.h" #include "postmaster/autovacuum.h" +#include "postmaster/interrupt.h" #include "postmaster/postmaster.h" #include "storage/fd.h" #include "storage/ipc.h" @@ -133,6 +135,23 @@ InitPostmasterChild(void) elog(FATAL, "setsid() failed: %m"); #endif + /* In EXEC_BACKEND case we will not have inherited BlockSig etc values */ +#ifdef EXEC_BACKEND + pqinitmask(); +#endif + + /* + * Every postmaster child process is expected to respond promptly to + * SIGQUIT at all times. Therefore we centrally remove SIGQUIT from + * BlockSig and install a suitable signal handler. (Client-facing + * processes may choose to replace this default choice of handler with + * quickdie().) All other blockable signals remain blocked for now. + */ + pqsignal(SIGQUIT, SignalHandlerForCrashExit); + + sigdelset(&BlockSig, SIGQUIT); + PG_SETMASK(&BlockSig); + /* Request a signal if the postmaster dies, if possible. */ PostmasterDeathSignalInit(); } @@ -155,6 +174,13 @@ InitStandaloneProcess(const char *argv0) InitLatch(MyLatch); InitializeLatchWaitSet(); + /* + * For consistency with InitPostmasterChild, initialize signal mask here. + * But we don't unblock SIGQUIT or provide a default handler for it. + */ + pqinitmask(); + PG_SETMASK(&BlockSig); + /* Compute paths, no postmaster to inherit from */ if (my_exec_path[0] == '\0') {