Обсуждение: Parallel worker hangs while handling errors.

Поиск

Список

Период

Сортировка

Parallel worker hangs while handling errors.

От

vignesh C

Дата:

03 июля 2020 г., 12:10:56

Hi,

Parallel worker hangs while handling errors.

Analysis:
When there is an error in the parallel worker process, we will call
ereport/elog with the error message. Worker will then jump from
errfinish to setjmp in StartBackgroundWorker function which was set
earlier. Then the worker process will then send the error message
through the shared memory to the leader process. Shared memory size is
ok 16K, if the error message is less than 16K it works fine. If there
is a bigger error message, the worker process will wait for the leader
process to read the message, free up some memory in shared memory and
set the latch. The worker will be waiting at the below back trace:
#4  0x000000000090480c in WaitLatch (latch=0x7f2b39f6b454,
wakeEvents=33, timeout=0, wait_event_info=134217753) at latch.c:368
#5  0x0000000000787c7f in mq_putmessage (msgtype=69 'E', s=0x2f24350
"SERROR", len=230015) at pqmq.c:171
#6  0x000000000078712e in pq_endmessage (buf=0x7ffe721c4370) at pqformat.c:301
#7  0x0000000000ac1749 in send_message_to_frontend (edata=0xfe91a0
<errordata>) at elog.c:3327
#8  0x0000000000abdf5b in EmitErrorReport () at elog.c:1460

Leader process then identifies that there are some messages that need
to be processed, it copies the messages and sets the latch so that the
worker process can copy the remaining message from the below function:
shm_mq_inc_bytes_read -> SetLatch(&sender->procLatch);, Worker is not
able to receive any signal at this point of time & hangs infinitely
Worker hangs in this case because when the worker is started the
signals will be masked using sigprocmask. Unblocking of signals is
done by calling BackgroundWorkerUnblockSignals in ParallelWorkerMain.
Now due to error handling the worker has jumped to setjmp in
StartBackgroundWorker function. Here the signals are in a blocked
state, hence the signal is not received by the worker process.

One of the fixes could be to call BackgroundWorkerUnblockSignals just
after sigsetjmp. I'm not sure if this is the best solution.
Robert & myself had a discussion about the problem yesterday. We felt
this is a genuine problem with the parallel worker error handling and
need to be fixed.
I could reproduce this issue when there is an error during copy of
toast data using parallel copy, this project is an in-progress
project. I don't have a test case to reproduce on the head. Any
suggestions for a test case on head?
The Attached patch has the fix for the same.

Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Вложения

0001-Fix-for-Parallel-worker-hangs-while-handling-errors.patch

Re: Parallel worker hangs while handling errors.

От

vignesh C

Дата:

07 июля 2020 г., 04:26:29

On Fri, Jul 3, 2020 at 2:40 PM vignesh C <vignesh21@gmail.com> wrote:
>
> The Attached patch has the fix for the same.
>
I have added a commitfest entry for this bug:
https://commitfest.postgresql.org/29/2636/

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Re: Parallel worker hangs while handling errors.

От

Bharath Rupireddy

Дата:

09 июля 2020 г., 12:42:57

>
> Parallel worker hangs while handling errors.
>
> When there is an error in the parallel worker process, we will call
> ereport/elog with the error message. Worker will then jump from
> errfinish to setjmp in StartBackgroundWorker function which was set
> earlier. Then the worker process will then send the error message
> through the shared memory to the leader process. Shared memory size is
> ok 16K, if the error message is less than 16K it works fine.

I reproduced the hang issue with the parallel copy patches[1]. The use
case is as follows - one of the parallel workers tries to report error
to the leader process and as part of the error context it also tries
to send the entire row/tuple data(which is a lot more than 16KB).

The fix provided here solves the above problem, i.e. no hang occurs,
and the entire tuple/row data in the error from worker to leader gets
transferred, see the attachment "testcase.text" for the output.

Apart from that, I also executed the regression tests (make check and
make check-world) on the patch, no issues are observed.

[1] - https://www.postgresql.org/message-id/CALDaNm2-wMYO68vtDuuWO5h4FQCsfm4Pcg5XrzEPtRty1bEM7w%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Вложения

testcase.txt

Re: Parallel worker hangs while handling errors.

От

Bharath Rupireddy

Дата:

17 июля 2020 г., 10:51:26

>
> Leader process then identifies that there are some messages that need
> to be processed, it copies the messages and sets the latch so that the
> worker process can copy the remaining message from the below function:
> shm_mq_inc_bytes_read -> SetLatch(&sender->procLatch);, Worker is not
> able to receive any signal at this point of time & hangs infinitely
> Worker hangs in this case because when the worker is started the
> signals will be masked using sigprocmask. Unblocking of signals is
> done by calling BackgroundWorkerUnblockSignals in ParallelWorkerMain.
> Now due to error handling the worker has jumped to setjmp in
> StartBackgroundWorker function. Here the signals are in a blocked
> state, hence the signal is not received by the worker process.
>

Your analysis looks fine to me.

Adding some more info:

The worker uses SIGUSR1 (with a special shared memory flag
PROCSIG_PARALLEL_MESSAGE) both for error message sharing(from
mq_putmessage) and for parallel worker shutdown(from
ParallelWorkerShutdown).

And yes, the postmaster blocks SIGUSR1 before forking bgworker(in
PostmasterMain with pqinitmask() and PG_SETMASK(&BlockSig)), bgworker
receives the same blocked signal mask for itself and enters
StartBackgroundWorker(), uses sigsetjmp for error handling, and then
goes to ParallelWorkerMain() where few of the signal handers are set
and then BackgroundWorkerUnblockSignals() is called to not block any
signals.

But observe when we did sigsetjmp the state of the signal mask is that
of we received from postmaster which is all signals blocked.

So, now in error cases when the control is jumped to sigsetjmp we
still have the signals blocked(along with SIGUSR1) mask and in the
code path of EmitErrorReport, we do send SIGUSR1 with flag
PROCSIG_PARALLEL_MESSAGE to the leader/backend and wait for the latch
to be set, this happens only if the worker is able to receive back
SIGUSR1 from the leader/backend.

In this reported issue, since SIGUSR1 is blocked at sigsetjmp in
StartBackgroundWorker(), there is no way that the worker process
receiving it from the leader and the latch cannot be set and hence the
hang occurs.

The same hang issue can occur(though I'm not able to back it up with a
use case), in the cases from wherever the EmitErrorReport() is called
from "if (sigsetjmp(local_sigjmp_buf, 1) != 0)" block, such as
autovacuum.c, bgwriter.c, bgworker.c, checkpointer.c, walwriter.c, and
postgres.c.

>
> One of the fixes could be to call BackgroundWorkerUnblockSignals just
> after sigsetjmp. I'm not sure if this is the best solution.
> Robert & myself had a discussion about the problem yesterday. We felt
> this is a genuine problem with the parallel worker error handling and
> need to be fixed.
>

Note that, in all sigsetjmp blocks, we intentionally
HOLD_INTERRUPTS(), to not cause any issues while performing error
handling, I'm concerned here that now, if we directly call
BackgroundWorkerUnblockSignals() which will open up all the signals
and our main intention of holding interrupts or block signals may go
away.

Since the main problem for this hang issue is because of blocking
SIGUSR1, in sigsetjmp, can't we just only unblock only the SIGUSR1,
instead of unblocking all signals? I tried this with parallel copy
hang, the issue is resolved.

Something like below -

    if (sigsetjmp(local_sigjmp_buf, 1) != 0)
    {
        sigset_t x;
        sigemptyset (&x);
        sigaddset(&x, SIGUSR1);
        sigprocmask(SIG_UNBLOCK, &x, NULL);

        /* Since not using PG_TRY, must reset error stack by hand */
        error_context_stack = NULL;

        /* Prevent interrupts while cleaning up */
        HOLD_INTERRUPTS();

If okay, with the above approach, we can put the above
sigprocmask(SIG_UNBLOCK,..) piece of code(of course generically to
unblock any given signal) in a macro similar to PG_SETMASK() and use
that in all the places wherever EmitErrorReport() is called from
sigsetjmp. We should mind the portability of sigprocmask as well.


With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Re: Parallel worker hangs while handling errors.

От

vignesh C

Дата:

23 июля 2020 г., 10:24:05

Thanks for reviewing and adding your thoughts, My comments are inline.

On Fri, Jul 17, 2020 at 1:21 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> The same hang issue can occur(though I'm not able to back it up with a
> use case), in the cases from wherever the EmitErrorReport() is called
> from "if (sigsetjmp(local_sigjmp_buf, 1) != 0)" block, such as
> autovacuum.c, bgwriter.c, bgworker.c, checkpointer.c, walwriter.c, and
> postgres.c.
>

I'm not sure if this can occur in other cases.

> >
> > One of the fixes could be to call BackgroundWorkerUnblockSignals just
> > after sigsetjmp. I'm not sure if this is the best solution.
> > Robert & myself had a discussion about the problem yesterday. We felt
> > this is a genuine problem with the parallel worker error handling and
> > need to be fixed.
> >
>
> Note that, in all sigsetjmp blocks, we intentionally
> HOLD_INTERRUPTS(), to not cause any issues while performing error
> handling, I'm concerned here that now, if we directly call
> BackgroundWorkerUnblockSignals() which will open up all the signals
> and our main intention of holding interrupts or block signals may go
> away.
>
> Since the main problem for this hang issue is because of blocking
> SIGUSR1, in sigsetjmp, can't we just only unblock only the SIGUSR1,
> instead of unblocking all signals? I tried this with parallel copy
> hang, the issue is resolved.
>

On putting further thoughts on this, I feel just unblocking SIGUSR1
would be the right approach in this case. I'm attaching a new patch
which unblocks SIGUSR1 signal. I have verified that the original issue
with WIP parallel copy patch gets fixed. I have made changes only in
bgworker.c as we require the parallel worker to receive this signal
and continue processing. I have not included the changes for other
processes as I'm not sure if this scenario is applicable for other
processes.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Вложения

0001-Fix-for-Parallel-worker-hangs-while-handling-errors.patch

Re: Parallel worker hangs while handling errors.

От

Bharath Rupireddy

Дата:

24 июля 2020 г., 10:10:52

On Thu, Jul 23, 2020 at 12:54 PM vignesh C <vignesh21@gmail.com> wrote:
>
> Thanks for reviewing and adding your thoughts, My comments are inline.
>
> On Fri, Jul 17, 2020 at 1:21 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > The same hang issue can occur(though I'm not able to back it up with a
> > use case), in the cases from wherever the EmitErrorReport() is called
> > from "if (sigsetjmp(local_sigjmp_buf, 1) != 0)" block, such as
> > autovacuum.c, bgwriter.c, bgworker.c, checkpointer.c, walwriter.c, and
> > postgres.c.
> >
>
> I'm not sure if this can occur in other cases.
>

I checked further on this point: Yes, it can't occur for the other
cases, as mq_putmessage() gets only used for parallel
workers(ParallelWorkerMain()  --> pq_redirect_to_shm_mq()).

>
> > Note that, in all sigsetjmp blocks, we intentionally
> > HOLD_INTERRUPTS(), to not cause any issues while performing error
> > handling, I'm concerned here that now, if we directly call
> > BackgroundWorkerUnblockSignals() which will open up all the signals
> > and our main intention of holding interrupts or block signals may go
> > away.
> >
> > Since the main problem for this hang issue is because of blocking
> > SIGUSR1, in sigsetjmp, can't we just only unblock only the SIGUSR1,
> > instead of unblocking all signals? I tried this with parallel copy
> > hang, the issue is resolved.
> >
>
> On putting further thoughts on this, I feel just unblocking SIGUSR1
> would be the right approach in this case. I'm attaching a new patch
> which unblocks SIGUSR1 signal. I have verified that the original issue
> with WIP parallel copy patch gets fixed. I have made changes only in
> bgworker.c as we require the parallel worker to receive this signal
> and continue processing. I have not included the changes for other
> processes as I'm not sure if this scenario is applicable for other
> processes.
>

Since we are sure that this hang issue can occur only for parallel
workers, and the change in StartBackgroundWorker's sigsetjmp's block
should only be made for parallel worker cases. And also there can be a
lot of other callbacks execution and processing in proc_exit() for
which we might not need the SIGUSR1 unblocked. So, let's undo the
unblocking right after EmitErrorReport() to not cause any new issues.

Attaching a modified v2 patch: it has the unblocking for only for
parallel workers, undoing it after EmitErrorReport(), and some
adjustments in the comment.

I verified this fix for the parallel copy hang issue. And also make
check and make check-world passes.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Вложения

v2-0001-Fix-for-Parallel-worker-hangs-while-handling-erro.patch

Re: Parallel worker hangs while handling errors.

От

vignesh C

Дата:

25 июля 2020 г., 04:32:32

On Fri, Jul 24, 2020 at 12:41 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Thu, Jul 23, 2020 at 12:54 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > Thanks for reviewing and adding your thoughts, My comments are inline.
> >
> > On Fri, Jul 17, 2020 at 1:21 PM Bharath Rupireddy
> > <bharath.rupireddyforpostgres@gmail.com> wrote:
> > >
> > > The same hang issue can occur(though I'm not able to back it up with a
> > > use case), in the cases from wherever the EmitErrorReport() is called
> > > from "if (sigsetjmp(local_sigjmp_buf, 1) != 0)" block, such as
> > > autovacuum.c, bgwriter.c, bgworker.c, checkpointer.c, walwriter.c, and
> > > postgres.c.
> > >
> >
> > I'm not sure if this can occur in other cases.
> >
>
> I checked further on this point: Yes, it can't occur for the other
> cases, as mq_putmessage() gets only used for parallel
> workers(ParallelWorkerMain()  --> pq_redirect_to_shm_mq()).
>
> >
> > > Note that, in all sigsetjmp blocks, we intentionally
> > > HOLD_INTERRUPTS(), to not cause any issues while performing error
> > > handling, I'm concerned here that now, if we directly call
> > > BackgroundWorkerUnblockSignals() which will open up all the signals
> > > and our main intention of holding interrupts or block signals may go
> > > away.
> > >
> > > Since the main problem for this hang issue is because of blocking
> > > SIGUSR1, in sigsetjmp, can't we just only unblock only the SIGUSR1,
> > > instead of unblocking all signals? I tried this with parallel copy
> > > hang, the issue is resolved.
> > >
> >
> > On putting further thoughts on this, I feel just unblocking SIGUSR1
> > would be the right approach in this case. I'm attaching a new patch
> > which unblocks SIGUSR1 signal. I have verified that the original issue
> > with WIP parallel copy patch gets fixed. I have made changes only in
> > bgworker.c as we require the parallel worker to receive this signal
> > and continue processing. I have not included the changes for other
> > processes as I'm not sure if this scenario is applicable for other
> > processes.
> >
>
> Since we are sure that this hang issue can occur only for parallel
> workers, and the change in StartBackgroundWorker's sigsetjmp's block
> should only be made for parallel worker cases. And also there can be a
> lot of other callbacks execution and processing in proc_exit() for
> which we might not need the SIGUSR1 unblocked. So, let's undo the
> unblocking right after EmitErrorReport() to not cause any new issues.
>
> Attaching a modified v2 patch: it has the unblocking for only for
> parallel workers, undoing it after EmitErrorReport(), and some
> adjustments in the comment.
>

I have made slight changes on top of the patch to remove duplicate
code, attached v3 patch for the same.
The parallel worker hang issue gets resolved, make check & make
check-world passes.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Вложения

v3-0001-Fix-for-Parallel-worker-hangs-while-handling-erro.patch

Re: Parallel worker hangs while handling errors.

От

Bharath Rupireddy

Дата:

27 июля 2020 г., 07:43:43

On Sat, Jul 25, 2020 at 7:02 AM vignesh C <vignesh21@gmail.com> wrote:
>
> I have made slight changes on top of the patch to remove duplicate
> code, attached v3 patch for the same.
> The parallel worker hang issue gets resolved, make check & make
> check-world passes.
>

Having a function to unblock selective signals for a bg worker looks good to me.

Few comments:
1. Do we need "worker" as a function argument in
update_parallel_worker_sigmask(BackgroundWorker *worker,.... ? Since
MyBgworkerEntry is a global variable, can't we have a local variable
instead?
2. Instead of update_parallel_worker_sigmask() serving only for
parallel workers, can we make it generic, so that for any bgworker,
given a signal it unblocks it, although there's no current use case
for a bg worker unblocking a single signal other than a parallel
worker doing it for SIGUSR1 for this hang issue. Please note that we
have BackgroundWorkerBlockSignals() and
BackgroundWorkerUnblockSignals().
I slightly modified your function, something like below?

void
BackgroundWorkerUpdateSignalMask(int signum, bool toblock)
{
    if (toblock)
        sigaddset(&BlockSig, signum);
    else
        sigdelset(&BlockSig, signum);

    PG_SETMASK(&BlockSig);
}

/*to unblock SIGUSR1*/
if ((worker->bgw_flags & BGWORKER_CLASS_PARALLEL) != 0)
    BackgroundWorkerUpdateSignalMask(SIGUSR1, false);

/*to block SIGUSR1*/
if ((worker->bgw_flags & BGWORKER_CLASS_PARALLEL) != 0)
    BackgroundWorkerUpdateSignalMask(SIGUSR1, true);

If okay, with the BackgroundWorkerUpdateSignalMask() function, please
note that we might have to add it in bgworker.sgml as well.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Re: Parallel worker hangs while handling errors.

От

vignesh C

Дата:

28 июля 2020 г., 08:35:05

Thanks for your comments Bharath.

On Mon, Jul 27, 2020 at 10:13 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
> 1. Do we need "worker" as a function argument in
> update_parallel_worker_sigmask(BackgroundWorker *worker,.... ? Since
> MyBgworkerEntry is a global variable, can't we have a local variable
> instead?

Fixed, I have moved the worker check to the caller function.

> 2. Instead of update_parallel_worker_sigmask() serving only for
> parallel workers, can we make it generic, so that for any bgworker,
> given a signal it unblocks it, although there's no current use case
> for a bg worker unblocking a single signal other than a parallel
> worker doing it for SIGUSR1 for this hang issue. Please note that we
> have BackgroundWorkerBlockSignals() and
> BackgroundWorkerUnblockSignals().

Fixed. I have slightly modified the changes to break into
BackgroundWorkerRemoveBlockSignal & BackgroundWorkerAddBlockSignal.
This maintains the consistency similar to
BackgroundWorkerBlockSignals() and BackgroundWorkerUnblockSignals().

> If okay, with the BackgroundWorkerUpdateSignalMask() function, please
> note that we might have to add it in bgworker.sgml as well.

Included the documentation.

Attached the updated patch for the same.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Вложения

v4-0001-Fix-for-Parallel-worker-hangs-while-handling-erro.patch

Re: Parallel worker hangs while handling errors.

От

Bharath Rupireddy

Дата:

28 июля 2020 г., 12:34:51

On Tue, Jul 28, 2020 at 11:05 AM vignesh C <vignesh21@gmail.com> wrote:
>
> Thanks for your comments Bharath.
>
> On Mon, Jul 27, 2020 at 10:13 AM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> > 1. Do we need "worker" as a function argument in
> > update_parallel_worker_sigmask(BackgroundWorker *worker,.... ? Since
> > MyBgworkerEntry is a global variable, can't we have a local variable
> > instead?
>
> Fixed, I have moved the worker check to the caller function.
>
> > 2. Instead of update_parallel_worker_sigmask() serving only for
> > parallel workers, can we make it generic, so that for any bgworker,
> > given a signal it unblocks it, although there's no current use case
> > for a bg worker unblocking a single signal other than a parallel
> > worker doing it for SIGUSR1 for this hang issue. Please note that we
> > have BackgroundWorkerBlockSignals() and
> > BackgroundWorkerUnblockSignals().
>
> Fixed. I have slightly modified the changes to break into
> BackgroundWorkerRemoveBlockSignal & BackgroundWorkerAddBlockSignal.
> This maintains the consistency similar to
> BackgroundWorkerBlockSignals() and BackgroundWorkerUnblockSignals().
>
> > If okay, with the BackgroundWorkerUpdateSignalMask() function, please
> > note that we might have to add it in bgworker.sgml as well.
>
> Included the documentation.
>
> Attached the updated patch for the same.
>

The v4 patch looks good to me. Hang is not seen, make check and make
check-world passes. I moved this to the committer for further review
in https://commitfest.postgresql.org/29/2636/.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Re: Parallel worker hangs while handling errors.

От

Robert Haas

Дата:

06 августа 2020 г., 23:03:48

On Tue, Jul 28, 2020 at 5:35 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
> The v4 patch looks good to me. Hang is not seen, make check and make
> check-world passes. I moved this to the committer for further review
> in https://commitfest.postgresql.org/29/2636/.

I don't think I agree with this approach. In particular, I don't
understand the rationale for unblocking only SIGUSR1. Above, Vignesh
says that he feels that unblocking only that signal would be the right
approach, but no reason is given. I have two reasons why I suspect
it's not the right approach. One, it doesn't seem to be what we do
elsewhere; the only existing cases where we have special handling for
particular signals are SIGQUIT and SIGPIPE, and those places have
comments explaining the reason why they are handled in a special way.
Two, SIGUSR1 is used for a LOT of things: look at all the different
cases procsignal_sigusr1_handler() checks. If the intention is to only
allow the things we know are safe, rather than all the signals there
are, I think this coding utterly fails to achieve that - and for
reasons that I don't think are really fixable.

My first idea about how to fix this was just to call
BackgroundWorkerUnblockSignals() before sigsetjmp(), but that doesn't
really work, because ParallelWorkerMain() needs to set the handler for
SIGTERM before unblocking signals. When you really look at it, the
code that does sigsetjmp() in StartBackgroundWorker() is entirely
bogus. The comment says "See notes in postgres.c about the design of
this coding," but if you go read that comment, it says that the point
of using sigsetjmp() is to make sure that signals are unblocked within
the if-block that follows, but the use in bgworker.c actually achieves
exactly the opposite, because signals have not yet been unblocked at
this point. So, whereas the postgres.c code unblocks signals if they
are blocked, this code blocks signals if they are unblocked. Given
that, maybe the right thing to do is to just start the if-block with a
call to BackgroundWorkerUnblockSignals(). Perhaps there's some reason
that would be unsafe, if the failure occurs too early: postgres.c
doesn't unblock signals until after BaseInit() and InitProcess() have
been called, but here an error in those functions would unblock
signals while it's being handled. Off-hand, I don't see why that would
matter, though. In the postgres.c case, there wouldn't be a
PG_exception_stack yet, so we'd end up in the long part of
pg_re_throw(), which basically promotes the ERROR to FATAL but
otherwise does pretty similar things to what this handler does anyway.
So I'm not really sure there's any reason not to just go this way.

Another approach would be to establish a new PG_exception_stack() with
a free sigsetjmp() call and fresh local buffer, inside
ParallelWorkerMain(). It would do the same thing as the existing
handler, but because it would be established after unblocking signals,
sigsetjmp() would behave as desired. This doesn't seem quite as good
to me because I think this pattern might end up getting copied into
many background workers, and it's a bunch of extra code for which I
don't really see a clear need. So at the moment I think a one line fix
in StartBackgroundWorker(), to just unblock signals while handling
errors, looks better.

Adding Alvaro to the CC line, since I think he wrote this code
originally. Not sure if he or anyone else might have an opinion.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Parallel worker hangs while handling errors.

От

Bharath Rupireddy

Дата:

07 августа 2020 г., 12:04:39

On Fri, Aug 7, 2020 at 1:34 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Jul 28, 2020 at 5:35 AM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> > The v4 patch looks good to me. Hang is not seen, make check and make
> > check-world passes. I moved this to the committer for further review
> > in https://commitfest.postgresql.org/29/2636/.
>
> I don't think I agree with this approach. In particular, I don't
> understand the rationale for unblocking only SIGUSR1. Above, Vignesh
> says that he feels that unblocking only that signal would be the right
> approach, but no reason is given. I have two reasons why I suspect
> it's not the right approach. One, it doesn't seem to be what we do
> elsewhere; the only existing cases where we have special handling for
> particular signals are SIGQUIT and SIGPIPE, and those places have
> comments explaining the reason why they are handled in a special way.
>

We intend to unblock SIGQUIT before sigsetjmp() in places like
bgwriter, checkpointer, walwriter and walreceiver, but we only call
sigdelset(&BlockSig, SIGQUIT);, Without  PG_SETMASK(&BlockSig);, seems
like we are not actually unblocking SIQUIT and quickdie() will never
get called in these processes if (sigsetjmp(local_sigjmp_buf, 1) !=
0){....}, if postmaster sends a SIGQUIT while these processes are
doing clean up tasks in sigsetjmp(), it will not be received, and the
postmaster later sends SIGKLL to kill from below places.

        /*
         * If we already sent SIGQUIT to children and they are slow to shut
         * down, it's time to send them SIGKILL.  This doesn't happen
         * normally, but under certain conditions backends can get stuck while
         * shutting down.  This is a last measure to get them unwedged.
         *
         * Note we also do this during recovery from a process crash.
         */
        if ((Shutdown >= ImmediateShutdown || (FatalError && !SendStop)) &&
            AbortStartTime != 0 &&
            (now - AbortStartTime) >= SIGKILL_CHILDREN_AFTER_SECS)
        {
            /* We were gentle with them before. Not anymore */
            TerminateChildren(SIGKILL);
            /* reset flag so we don't SIGKILL again */
            AbortStartTime = 0;
        }

Shouldn't we call PG_SETMASK(&BlockSig); to make it effective?

Am I missing anything here?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Re: Parallel worker hangs while handling errors.

От

Robert Haas

Дата:

07 августа 2020 г., 16:52:59

On Fri, Aug 7, 2020 at 5:05 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
> We intend to unblock SIGQUIT before sigsetjmp() in places like
> bgwriter, checkpointer, walwriter and walreceiver, but we only call
> sigdelset(&BlockSig, SIGQUIT);, Without  PG_SETMASK(&BlockSig);, seems
> like we are not actually unblocking SIQUIT and quickdie() will never
> get called in these processes if (sigsetjmp(local_sigjmp_buf, 1) !=
> 0){....}

Yeah, maybe so. This code has been around for a long time and I'm not
sure what the thought process behind it was, but I don't see a flaw in
your analysis here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Parallel worker hangs while handling errors.

От

Tom Lane

Дата:

07 августа 2020 г., 18:36:45

Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Aug 7, 2020 at 5:05 AM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
>> We intend to unblock SIGQUIT before sigsetjmp() in places like
>> bgwriter, checkpointer, walwriter and walreceiver, but we only call
>> sigdelset(&BlockSig, SIGQUIT);, Without  PG_SETMASK(&BlockSig);, seems
>> like we are not actually unblocking SIQUIT and quickdie() will never
>> get called in these processes if (sigsetjmp(local_sigjmp_buf, 1) !=
>> 0){....}

> Yeah, maybe so. This code has been around for a long time and I'm not
> sure what the thought process behind it was, but I don't see a flaw in
> your analysis here.

I think that code is the way it is intentionally: the idea is to not
accept any signals until we reach the explicit "PG_SETMASK(&UnBlockSig);"
call further down, between the sigsetjmp stanza and the main loop.
The sigdelset call, just like the adjacent pqsignal calls, is
preparatory setup; it does not intend to allow anything to happen
immediately.

In general, you don't want to accept signals in that area because the
process state may not be fully set up yet.  You could argue that the
SIGQUIT handler has no state dependencies, making it safe to accept
SIGQUIT earlier during startup of one of these processes, and likewise
for them to accept SIGQUIT during error recovery.  But barring actual
evidence of a problem with slow SIGQUIT response in these areas I'm more
inclined to leave well enough alone.  Changing this would add hazards,
e.g. if somebody ever changes the behavior of the SIGQUIT handler, so
I'd want some concrete evidence of a benefit.  It seems fairly
irrelevant to the problem at hand with bgworkers, anyway.

As for said problem, I concur with Robert that the v4 patch seems pretty
dubious; it's adding a lot of barely-thought-out mechanism for no
convincing gain in safety.  I think the v1 patch was more nearly the right
thing, except that the unblock needs to happen a tad further down, as
attached (this is untested but certainly it should pass any test that v1
passed).  I didn't do anything about rewriting the bogus comment just
above the sigsetjmp call, but I agree that that should happen too.

            regards, tom lane

diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index beb5e85434..dd22241600 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -753,7 +753,14 @@ StartBackgroundWorker(void)
         /* Prevent interrupts while cleaning up */
         HOLD_INTERRUPTS();

-        /* Report the error to the server log */
+        /*
+         * sigsetjmp will have blocked all signals, but we may need to accept
+         * signals while communicating with our parallel leader.  Once we've
+         * done HOLD_INTERRUPTS() it should be safe to unblock signals.
+         */
+        BackgroundWorkerUnblockSignals();
+
+        /* Report the error to the parallel leader and the server log */
         EmitErrorReport();

         /*

Re: Parallel worker hangs while handling errors.

От

Robert Haas

Дата:

07 августа 2020 г., 19:13:55

On Fri, Aug 7, 2020 at 11:36 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think that code is the way it is intentionally: the idea is to not
> accept any signals until we reach the explicit "PG_SETMASK(&UnBlockSig);"
> call further down, between the sigsetjmp stanza and the main loop.
> The sigdelset call, just like the adjacent pqsignal calls, is
> preparatory setup; it does not intend to allow anything to happen
> immediately.

I don't think that your analysis here is correct. The sigdelset call
is manipulating BlockSig, and the subsequent PG_SETMASK call is
working with UnblockSig, so it doesn't make sense to view one as a
preparatory step for the other. It could be correct to interpret the
sigdelset call as preparatory work for a future call to
PG_SETMASK(&BlockSig), but AFAICS there are no such calls in the
processes where this incantation exists, so really it just seems to be
a no-op. Furthermore, the comment says that the point of the
sigdelset() is to allow SIGQUIT "at all times," which doesn't square
well with your suggestion that we intended it to take effect only
later.

> In general, you don't want to accept signals in that area because the
> process state may not be fully set up yet.  You could argue that the
> SIGQUIT handler has no state dependencies, making it safe to accept
> SIGQUIT earlier during startup of one of these processes, and likewise
> for them to accept SIGQUIT during error recovery.  But barring actual
> evidence of a problem with slow SIGQUIT response in these areas I'm more
> inclined to leave well enough alone.  Changing this would add hazards,
> e.g. if somebody ever changes the behavior of the SIGQUIT handler, so
> I'd want some concrete evidence of a benefit.  It seems fairly
> irrelevant to the problem at hand with bgworkers, anyway.

The SIGQUIT handler in question contains nothing than a call to
_exit(2) and a long comment explaining why we don't do anything else,
so I think the argument that it has no state dependencies is pretty
well water-tight. Whether or not we've got a problem with timely
SIGQUIT acceptance is much less clear. So it seems to me that the
safer thing to do here would be to unblock the signal. It might gain
something, and it can't really lose anything. Now it's true that the
calculus might change if someone were to modify the behavior of the
SIGQUIT handler in the future, but if they do then it's their job to
think about this stuff. It doesn't seem especially likely for that to
change, anyway. The only reason that the handler for regular backends
does anything other than _exit(2) is that we want to try to let the
client know what happened before we croak, and that concern is
irrelevant for background workers. Doing any other cleanup here is
unsafe and unnecessary.

> As for said problem, I concur with Robert that the v4 patch seems pretty
> dubious; it's adding a lot of barely-thought-out mechanism for no
> convincing gain in safety.  I think the v1 patch was more nearly the right
> thing, except that the unblock needs to happen a tad further down, as
> attached (this is untested but certainly it should pass any test that v1
> passed).  I didn't do anything about rewriting the bogus comment just
> above the sigsetjmp call, but I agree that that should happen too.

I am not sure whether the difference between this and v1 matters,
because in postgres.c it's effectively happening inside sigsetjmp, so
the earlier unblock must not be that bad. But I don't mind putting it
in the place you suggest.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Parallel worker hangs while handling errors.

От

Tom Lane

Дата:

07 августа 2020 г., 19:56:01

Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Aug 7, 2020 at 11:36 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The sigdelset call, just like the adjacent pqsignal calls, is
>> preparatory setup; it does not intend to allow anything to happen
>> immediately.

> I don't think that your analysis here is correct. The sigdelset call
> is manipulating BlockSig, and the subsequent PG_SETMASK call is
> working with UnblockSig, so it doesn't make sense to view one as a
> preparatory step for the other.

That SETMASK call will certainly unblock SIGQUIT, so I don't see what
your point is.  Anyway, the bottom line is that that code's been like
that for a decade or two without complaints, so I'm disinclined to
mess with it on the strength of nothing much.

            regards, tom lane

Re: Parallel worker hangs while handling errors.

От

Robert Haas

Дата:

07 августа 2020 г., 20:37:59

On Fri, Aug 7, 2020 at 12:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I don't think that your analysis here is correct. The sigdelset call
> > is manipulating BlockSig, and the subsequent PG_SETMASK call is
> > working with UnblockSig, so it doesn't make sense to view one as a
> > preparatory step for the other.
>
> That SETMASK call will certainly unblock SIGQUIT, so I don't see what
> your point is.

I can't figure out if you're trolling me here or what. It's true that
the PG_SETMASK() call will certainly unblock SIGQUIT, but that would
also be true if the sigdelset() call were absent.

> Anyway, the bottom line is that that code's been like
> that for a decade or two without complaints, so I'm disinclined to
> mess with it on the strength of nothing much.

Really? Have you reversed your policy of wanting the comments to
accurately describe what the code does?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Parallel worker hangs while handling errors.

От

Tom Lane

Дата:

07 августа 2020 г., 21:00:34

Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Aug 7, 2020 at 12:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> That SETMASK call will certainly unblock SIGQUIT, so I don't see what
>> your point is.

> I can't figure out if you're trolling me here or what. It's true that
> the PG_SETMASK() call will certainly unblock SIGQUIT, but that would
> also be true if the sigdelset() call were absent.

The point of the sigdelset is that if somewhere later on, we install
the BlockSig mask, then SIGQUIT will remain unblocked.  You asserted
upthread that noplace in these processes ever does so; maybe that's
true today, or maybe not, but the intent of this code is that *once
we get through initialization* SIGQUIT will remain unblocked.

I'll concede that it's not 100% clear whether or not these processes
need to re-block SIGQUIT during error recovery.  I repeat, though,
that I'm disinclined to change that without some evidence that there's
actually a problem with the way it works now.

            regards, tom lane

Re: Parallel worker hangs while handling errors.

От

Robert Haas

Дата:

07 августа 2020 г., 21:15:35

On Fri, Aug 7, 2020 at 2:00 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The point of the sigdelset is that if somewhere later on, we install
> the BlockSig mask, then SIGQUIT will remain unblocked.

I mean, we're just repeating the same points here, but that's not what
the comment says.

> You asserted
> upthread that noplace in these processes ever does so; maybe that's
> true today, or maybe not,

It's easily checked using 'git grep'.

> but the intent of this code is that *once
> we get through initialization* SIGQUIT will remain unblocked.

I can't speak to the intent, but I can speak to what the comment says.

> I'll concede that it's not 100% clear whether or not these processes
> need to re-block SIGQUIT during error recovery.

I think it's entirely clear that they do not, and I have explained my
reasoning already.

> I repeat, though,
> that I'm disinclined to change that without some evidence that there's
> actually a problem with the way it works now.

I've also already explained why I don't agree with this perspective.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Parallel worker hangs while handling errors.

От

Bharath Rupireddy

Дата:

10 августа 2020 г., 13:54:07

On Fri, Aug 7, 2020 at 11:30 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Fri, Aug 7, 2020 at 12:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> That SETMASK call will certainly unblock SIGQUIT, so I don't see what
> >> your point is.
>
> > I can't figure out if you're trolling me here or what. It's true that
> > the PG_SETMASK() call will certainly unblock SIGQUIT, but that would
> > also be true if the sigdelset() call were absent.
>
> The point of the sigdelset is that if somewhere later on, we install
> the BlockSig mask, then SIGQUIT will remain unblocked.  You asserted
> upthread that noplace in these processes ever does so; maybe that's
> true today, or maybe not, but the intent of this code is that *once
> we get through initialization* SIGQUIT will remain unblocked.
>
> I'll concede that it's not 100% clear whether or not these processes
> need to re-block SIGQUIT during error recovery.  I repeat, though,
> that I'm disinclined to change that without some evidence that there's
> actually a problem with the way it works now.
>

I think the main point that needs to be thought is that: will any of
the bgwriter, checkpointer, walwriter and walreceiver processes need
to unblock SIGQUIT during their error recovery code paths i.e. in
their respective if (sigsetjmp(local_sigjmp_buf, 1) != 0){....}
stanzas? Currently, SIGQUIT is blocked in the sigsetjmp() stanza.

If the answer is yes: then we must do PG_SETMASK(&BlockSig); :either
right after sigdelset(&BlockSig, SIGQUIT); to allow quickdie() even
before the sigsetjmp() stanza and also in the sigsetjmp() stanza or do
PG_SETMASK(&BlockSig); only inside the sigsetjmp() stanza.  The
postmaster sends SIGQUIT in immediate shutdown mode and it gives
children a chance to exit safely, but if the children take longer
time, then it anyways kills them with SIGKILL(note that SIGKILL can
not be handled or ignored by any process).

If the answer is no: let these processes perform clean ups in their
respective sigsetjmp() stanzas, until the postmaster sends SIGKILL if
the clean ups take time. We could have some elaborated comments before
sigdelset(&BlockSig, SIGQUIT); instead of "/* We allow SIGQUIT
(quickdie) at all times */" to avoid confusion.

We must not worry about blocking or unblocking SIGQUIT in these
processes after the sigsetjmp() stanza, as it anyways gets unblocked
by PG_SETMASK(&UnBlockSig); and also no problem if somebody does
PG_SETMASK(&BlockSig); in future as we have already done
sigdelset(&BlockSig, SIGQUIT);.

Can we start a separate thread to discuss this SIGQUIT point to not
sidetrack the main issue "Parallel worker hangs while handling
errors."?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Re: Parallel worker hangs while handling errors.

От

vignesh C

Дата:

11 августа 2020 г., 17:38:11

On Fri, Aug 7, 2020 at 1:34 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Jul 28, 2020 at 5:35 AM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> > The v4 patch looks good to me. Hang is not seen, make check and make
> > check-world passes. I moved this to the committer for further review
> > in https://commitfest.postgresql.org/29/2636/.
>
> I don't think I agree with this approach. In particular, I don't
> understand the rationale for unblocking only SIGUSR1. Above, Vignesh
> says that he feels that unblocking only that signal would be the right
> approach, but no reason is given. I have two reasons why I suspect
> it's not the right approach. One, it doesn't seem to be what we do
> elsewhere; the only existing cases where we have special handling for
> particular signals are SIGQUIT and SIGPIPE, and those places have
> comments explaining the reason why they are handled in a special way.
> Two, SIGUSR1 is used for a LOT of things: look at all the different
> cases procsignal_sigusr1_handler() checks. If the intention is to only
> allow the things we know are safe, rather than all the signals there
> are, I think this coding utterly fails to achieve that - and for
> reasons that I don't think are really fixable.
>

My intention of blocking only SIGUSR1 over unblocking all signals
mainly because we are already in the error path and we are about to
exit after emitting the error report. I was not sure if we intended to
receive any other signal just before exiting.
The Solution Robert & Tom are suggesting by Calling
BackgroundWorkerUnblockSignals fixes the actual problem.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Re: Parallel worker hangs while handling errors.

От

Tom Lane

Дата:

04 сентября 2020 г., 00:01:43

vignesh C <vignesh21@gmail.com> writes:
> The Solution Robert & Tom are suggesting by Calling
> BackgroundWorkerUnblockSignals fixes the actual problem.

I've gone ahead and pushed the bgworker fix, since everyone seems
to agree that that's okay, and it is provably fixing a problem.

As for the question of SIGQUIT handling, I see that postgres.c
does "PG_SETMASK(&BlockSig)" immediately after applying the sigdelset
change, so there probably isn't any harm in having the background
processes do likewise.  I wonder though why bgworkers are not
applying the same policy.  (I remain of the opinion that any
changes in this area should not be back-patched without evidence
of a concrete problem; it's at least as likely that we'll introduce
a problem as fix one.)

            regards, tom lane

Re: Parallel worker hangs while handling errors.

От

Alvaro Herrera

Дата:

04 сентября 2020 г., 00:07:10

On 2020-Sep-03, Tom Lane wrote:

> As for the question of SIGQUIT handling, I see that postgres.c
> does "PG_SETMASK(&BlockSig)" immediately after applying the sigdelset
> change, so there probably isn't any harm in having the background
> processes do likewise.  I wonder though why bgworkers are not
> applying the same policy.

It's quite likely that it's the way it is more by accident than because
I was thinking extremely carefully about signal handling when originally
writing that code.  Some parts of that code I was copying from others'
patches, and I could easily have missed a detail like this.  (I didn't
"git blame" to verify that these parts are mine, though).

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Parallel worker hangs while handling errors.

От

Tom Lane

Дата:

04 сентября 2020 г., 00:29:35

I wrote:
> As for the question of SIGQUIT handling, I see that postgres.c
> does "PG_SETMASK(&BlockSig)" immediately after applying the sigdelset
> change, so there probably isn't any harm in having the background
> processes do likewise.

Concretely, something about like this (I just did the bgwriter, but
we'd want the same in all the background processes).  I tried to
respond to Robert's complaint about the inaccurate comment just above
sigsetjmp, too.

This passes check-world, for what little that's worth.

            regards, tom lane

diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 069e27e427..3ae7901bf6 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -117,6 +117,7 @@ BackgroundWriterMain(void)

     /* We allow SIGQUIT (quickdie) at all times */
     sigdelset(&BlockSig, SIGQUIT);
+    PG_SETMASK(&BlockSig);

     /*
      * We just started, assume there has been either a shutdown or
@@ -140,7 +141,19 @@ BackgroundWriterMain(void)
     /*
      * If an exception is encountered, processing resumes here.
      *
-     * See notes in postgres.c about the design of this coding.
+     * You might wonder why this isn't coded as an infinite loop around a
+     * PG_TRY construct.  The reason is that this is the bottom of the
+     * exception stack, and so with PG_TRY there would be no exception handler
+     * in force at all during the CATCH part.  By leaving the outermost setjmp
+     * always active, we have at least some chance of recovering from an error
+     * during error recovery.  (If we get into an infinite loop thereby, it
+     * will soon be stopped by overflow of elog.c's internal state stack.)
+     *
+     * Note that we use sigsetjmp(..., 1), so that the prevailing signal mask
+     * (to wit, BlockSig) will be restored when longjmp'ing to here.  Thus,
+     * signals will be blocked until we complete error recovery.  It might
+     * seem that this policy makes the HOLD_INTERRUPTS() call redundant, but
+     * it is not since InterruptPending might be set already.
      */
     if (sigsetjmp(local_sigjmp_buf, 1) != 0)
     {

Re: Parallel worker hangs while handling errors.

От

Robert Haas

Дата:

08 сентября 2020 г., 14:54:37

On Thu, Sep 3, 2020 at 5:29 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Concretely, something about like this (I just did the bgwriter, but
> we'd want the same in all the background processes).  I tried to
> respond to Robert's complaint about the inaccurate comment just above
> sigsetjmp, too.
>
> This passes check-world, for what little that's worth.

Seems totally reasonable from here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Parallel worker hangs while handling errors.

От

Tom Lane

Дата:

11 сентября 2020 г., 23:20:37

Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Sep 3, 2020 at 5:29 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Concretely, something about like this (I just did the bgwriter, but
>> we'd want the same in all the background processes).  I tried to
>> respond to Robert's complaint about the inaccurate comment just above
>> sigsetjmp, too.
>> This passes check-world, for what little that's worth.

> Seems totally reasonable from here.

OK, I did the same in other relevant places and pushed it.

It's not clear to me whether we want to institute the "accepting SIGQUIT
is always okay" rule in processes that didn't already have code to change
BlockSig.  The relevant processes are pgarch.c, startup.c, bgworker.c,
autovacuum.c (launcher and workers both), and walsender.c.  In the first
two of these I doubt it matters, because I don't think they'll ever block
signals again anyway -- they certainly don't have outer sigsetjmp blocks.
And I'm a bit hesitant to mess with bgworker given that we seem to expect
that to be heavily used by extension code, and we're exposing code to
allow extensions to mess with the signal blocking state.  On the other
hand, as long as SIGQUIT is pointing at SignalHandlerForCrashExit, it's
hard to see a reason why holding it off could be necessary.  So maybe
having a uniform rule would be good.

Any thoughts about that?

            regards, tom lane

Re: Parallel worker hangs while handling errors.

От

Robert Haas

Дата:

11 сентября 2020 г., 23:23:21

On Fri, Sep 11, 2020 at 4:20 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> It's not clear to me whether we want to institute the "accepting SIGQUIT
> is always okay" rule in processes that didn't already have code to change
> BlockSig.  The relevant processes are pgarch.c, startup.c, bgworker.c,
> autovacuum.c (launcher and workers both), and walsender.c.  In the first
> two of these I doubt it matters, because I don't think they'll ever block
> signals again anyway -- they certainly don't have outer sigsetjmp blocks.
> And I'm a bit hesitant to mess with bgworker given that we seem to expect
> that to be heavily used by extension code, and we're exposing code to
> allow extensions to mess with the signal blocking state.  On the other
> hand, as long as SIGQUIT is pointing at SignalHandlerForCrashExit, it's
> hard to see a reason why holding it off could be necessary.  So maybe
> having a uniform rule would be good.
>
> Any thoughts about that?

I think a backend process that isn't timely handling SIGQUIT is by
that very fact buggy. "pg_ctl stop -mi" isn't a friendly suggestion.
So I favor trying to make it uniform.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Parallel worker hangs while handling errors.

От

Tom Lane

Дата:

12 сентября 2020 г., 20:57:21

Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Sep 11, 2020 at 4:20 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> It's not clear to me whether we want to institute the "accepting SIGQUIT
>> is always okay" rule in processes that didn't already have code to change
>> BlockSig.

> I think a backend process that isn't timely handling SIGQUIT is by
> that very fact buggy. "pg_ctl stop -mi" isn't a friendly suggestion.
> So I favor trying to make it uniform.

Well, if we want to take a hard line about that, we should centralize
the setup of SIGQUIT.  The attached makes InitPostmasterChild do it,
and removes the duplicative code from elsewhere.

I also flipped autovacuum and walsender from using quickdie to using
SignalHandlerForCrashExit.  Whatever you think about the safety or
unsafety of quickdie, there seems no reason for autovacuum to be trying
to tell its nonexistent client about a shutdown.  I don't think it's
terribly useful for a walsender either, though maybe somebody has a
different opinion about that?

            regards, tom lane

diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 19ba26b914..2cef56f115 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -454,8 +454,8 @@ AutoVacLauncherMain(int argc, char *argv[])
     pqsignal(SIGHUP, SignalHandlerForConfigReload);
     pqsignal(SIGINT, StatementCancelHandler);
     pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+    /* SIGQUIT handler was already set up by InitPostmasterChild */

-    pqsignal(SIGQUIT, quickdie);
     InitializeTimeouts();        /* establishes SIGALRM handler */

     pqsignal(SIGPIPE, SIG_IGN);
@@ -498,9 +498,10 @@ AutoVacLauncherMain(int argc, char *argv[])
      *
      * Note that we use sigsetjmp(..., 1), so that the prevailing signal mask
      * (to wit, BlockSig) will be restored when longjmp'ing to here.  Thus,
-     * signals will be blocked until we complete error recovery.  It might
-     * seem that this policy makes the HOLD_INTERRUPTS() call redundant, but
-     * it is not since InterruptPending might be set already.
+     * signals other than SIGQUIT will be blocked until we complete error
+     * recovery.  It might seem that this policy makes the HOLD_INTERRUPTS()
+     * call redundant, but it is not since InterruptPending might be set
+     * already.
      */
     if (sigsetjmp(local_sigjmp_buf, 1) != 0)
     {
@@ -1531,7 +1532,8 @@ AutoVacWorkerMain(int argc, char *argv[])
      */
     pqsignal(SIGINT, StatementCancelHandler);
     pqsignal(SIGTERM, die);
-    pqsignal(SIGQUIT, quickdie);
+    /* SIGQUIT handler was already set up by InitPostmasterChild */
+
     InitializeTimeouts();        /* establishes SIGALRM handler */

     pqsignal(SIGPIPE, SIG_IGN);
@@ -1562,9 +1564,9 @@ AutoVacWorkerMain(int argc, char *argv[])
      *
      * Note that we use sigsetjmp(..., 1), so that the prevailing signal mask
      * (to wit, BlockSig) will be restored when longjmp'ing to here.  Thus,
-     * signals will be blocked until we exit.  It might seem that this policy
-     * makes the HOLD_INTERRUPTS() call redundant, but it is not since
-     * InterruptPending might be set already.
+     * signals other than SIGQUIT will be blocked until we exit.  It might
+     * seem that this policy makes the HOLD_INTERRUPTS() call redundant, but
+     * it is not since InterruptPending might be set already.
      */
     if (sigsetjmp(local_sigjmp_buf, 1) != 0)
     {
diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index d043ced686..5a9a0e3435 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -731,9 +731,9 @@ StartBackgroundWorker(void)
         pqsignal(SIGFPE, SIG_IGN);
     }
     pqsignal(SIGTERM, bgworker_die);
+    /* SIGQUIT handler was already set up by InitPostmasterChild */
     pqsignal(SIGHUP, SIG_IGN);

-    pqsignal(SIGQUIT, SignalHandlerForCrashExit);
     InitializeTimeouts();        /* establishes SIGALRM handler */

     pqsignal(SIGPIPE, SIG_IGN);
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index c96568149f..a7afa758b6 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -104,7 +104,7 @@ BackgroundWriterMain(void)
     pqsignal(SIGHUP, SignalHandlerForConfigReload);
     pqsignal(SIGINT, SIG_IGN);
     pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
-    pqsignal(SIGQUIT, SignalHandlerForCrashExit);
+    /* SIGQUIT handler was already set up by InitPostmasterChild */
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
     pqsignal(SIGUSR1, procsignal_sigusr1_handler);
@@ -115,10 +115,6 @@ BackgroundWriterMain(void)
      */
     pqsignal(SIGCHLD, SIG_DFL);

-    /* We allow SIGQUIT (SignalHandlerForCrashExit) at all times */
-    sigdelset(&BlockSig, SIGQUIT);
-    PG_SETMASK(&BlockSig);
-
     /*
      * We just started, assume there has been either a shutdown or
      * end-of-recovery snapshot.
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 45f5deca72..3e7dcd4f76 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -198,7 +198,7 @@ CheckpointerMain(void)
     pqsignal(SIGHUP, SignalHandlerForConfigReload);
     pqsignal(SIGINT, ReqCheckpointHandler); /* request checkpoint */
     pqsignal(SIGTERM, SIG_IGN); /* ignore SIGTERM */
-    pqsignal(SIGQUIT, SignalHandlerForCrashExit);
+    /* SIGQUIT handler was already set up by InitPostmasterChild */
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
     pqsignal(SIGUSR1, procsignal_sigusr1_handler);
@@ -209,10 +209,6 @@ CheckpointerMain(void)
      */
     pqsignal(SIGCHLD, SIG_DFL);

-    /* We allow SIGQUIT (SignalHandlerForCrashExit) at all times */
-    sigdelset(&BlockSig, SIGQUIT);
-    PG_SETMASK(&BlockSig);
-
     /*
      * Initialize so that first time-driven event happens at the correct time.
      */
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 37be0e2bbb..ed1b65358d 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -228,7 +228,7 @@ PgArchiverMain(int argc, char *argv[])
     pqsignal(SIGHUP, SignalHandlerForConfigReload);
     pqsignal(SIGINT, SIG_IGN);
     pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
-    pqsignal(SIGQUIT, SignalHandlerForCrashExit);
+    /* SIGQUIT handler was already set up by InitPostmasterChild */
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
     pqsignal(SIGUSR1, pgarch_waken);
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 3cd6fa30eb..959e3b8873 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -4355,7 +4355,7 @@ BackendInitialize(Port *port)
      * cleaned up.
      */
     pqsignal(SIGTERM, process_startup_packet_die);
-    pqsignal(SIGQUIT, SignalHandlerForCrashExit);
+    /* SIGQUIT handler was already set up by InitPostmasterChild */
     InitializeTimeouts();        /* establishes SIGALRM handler */
     PG_SETMASK(&StartupBlockSig);

@@ -4435,7 +4435,7 @@ BackendInitialize(Port *port)
     status = ProcessStartupPacket(port, false, false);

     /*
-     * Disable the timeout, and prevent SIGTERM/SIGQUIT again.
+     * Disable the timeout, and prevent SIGTERM again.
      */
     disable_timeout(STARTUP_PACKET_TIMEOUT, false);
     PG_SETMASK(&BlockSig);
@@ -4983,10 +4983,6 @@ SubPostmasterMain(int argc, char *argv[])
     if (strcmp(argv[1], "--forkavworker") == 0)
         AutovacuumWorkerIAm();

-    /* In EXEC_BACKEND case we will not have inherited these settings */
-    pqinitmask();
-    PG_SETMASK(&BlockSig);
-
     /* Read in remaining GUC variables */
     read_nondefault_variables();

diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
index fd9ac35dac..64af7b8707 100644
--- a/src/backend/postmaster/startup.c
+++ b/src/backend/postmaster/startup.c
@@ -175,7 +175,7 @@ StartupProcessMain(void)
     pqsignal(SIGHUP, StartupProcSigHupHandler); /* reload config file */
     pqsignal(SIGINT, SIG_IGN);    /* ignore query cancel */
     pqsignal(SIGTERM, StartupProcShutdownHandler);    /* request shutdown */
-    pqsignal(SIGQUIT, SignalHandlerForCrashExit);
+    /* SIGQUIT handler was already set up by InitPostmasterChild */
     InitializeTimeouts();        /* establishes SIGALRM handler */
     pqsignal(SIGPIPE, SIG_IGN);
     pqsignal(SIGUSR1, procsignal_sigusr1_handler);
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index 358c0916ac..a52832fe90 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -101,7 +101,7 @@ WalWriterMain(void)
     pqsignal(SIGHUP, SignalHandlerForConfigReload);
     pqsignal(SIGINT, SignalHandlerForShutdownRequest);
     pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
-    pqsignal(SIGQUIT, SignalHandlerForCrashExit);
+    /* SIGQUIT handler was already set up by InitPostmasterChild */
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
     pqsignal(SIGUSR1, procsignal_sigusr1_handler);
@@ -112,10 +112,6 @@ WalWriterMain(void)
      */
     pqsignal(SIGCHLD, SIG_DFL);

-    /* We allow SIGQUIT (SignalHandlerForCrashExit) at all times */
-    sigdelset(&BlockSig, SIGQUIT);
-    PG_SETMASK(&BlockSig);
-
     /*
      * Create a memory context that we will do all our work in.  We do this so
      * that we can reset the context during error recovery and thereby avoid
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index b180598507..17f1a49f87 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -270,7 +270,7 @@ WalReceiverMain(void)
     pqsignal(SIGHUP, WalRcvSigHupHandler);    /* set flag to read config file */
     pqsignal(SIGINT, SIG_IGN);
     pqsignal(SIGTERM, WalRcvShutdownHandler);    /* request shutdown */
-    pqsignal(SIGQUIT, SignalHandlerForCrashExit);
+    /* SIGQUIT handler was already set up by InitPostmasterChild */
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
     pqsignal(SIGUSR1, procsignal_sigusr1_handler);
@@ -279,10 +279,6 @@ WalReceiverMain(void)
     /* Reset some signals that are accepted by postmaster but not here */
     pqsignal(SIGCHLD, SIG_DFL);

-    /* We allow SIGQUIT (SignalHandlerForCrashExit) at all times */
-    sigdelset(&BlockSig, SIGQUIT);
-    PG_SETMASK(&BlockSig);
-
     /* Load the libpq-specific functions */
     load_file("libpqwalreceiver", false);
     if (WalReceiverFunctions == NULL)
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 3f756b470a..5e6ae6fde5 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -3035,7 +3035,7 @@ WalSndSignals(void)
     pqsignal(SIGHUP, SignalHandlerForConfigReload);
     pqsignal(SIGINT, StatementCancelHandler);    /* query cancel */
     pqsignal(SIGTERM, die);        /* request shutdown */
-    pqsignal(SIGQUIT, quickdie);    /* hard crash time */
+    /* SIGQUIT handler was already set up by InitPostmasterChild */
     InitializeTimeouts();        /* establishes SIGALRM handler */
     pqsignal(SIGPIPE, SIG_IGN);
     pqsignal(SIGUSR1, procsignal_sigusr1_handler);
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index c9424f167c..411cfadbff 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3820,7 +3820,8 @@ PostgresMain(int argc, char *argv[],
     }

     /*
-     * Set up signal handlers and masks.
+     * Set up signal handlers.  (InitPostmasterChild or InitStandaloneProcess
+     * has already set up BlockSig and made that the active signal mask.)
      *
      * Note that postmaster blocked all signals before forking child process,
      * so there is no race condition whereby we might receive a signal before
@@ -3842,6 +3843,9 @@ PostgresMain(int argc, char *argv[],
         pqsignal(SIGTERM, die); /* cancel current query and exit */

         /*
+         * In a postmaster child backend, replace SignalHandlerForCrashExit
+         * with quickdie, so we can tell the client we're dying.
+         *
          * In a standalone backend, SIGQUIT can be generated from the keyboard
          * easily, while SIGTERM cannot, so we make both signals do die()
          * rather than quickdie().
@@ -3871,16 +3875,6 @@ PostgresMain(int argc, char *argv[],
                                      * platforms */
     }

-    pqinitmask();
-
-    if (IsUnderPostmaster)
-    {
-        /* We allow SIGQUIT (quickdie) at all times */
-        sigdelset(&BlockSig, SIGQUIT);
-    }
-
-    PG_SETMASK(&BlockSig);        /* block everything except SIGQUIT */
-
     if (!IsUnderPostmaster)
     {
         /*
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index cf8f9579c3..ed2ab4b5b2 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -32,10 +32,12 @@
 #include "catalog/pg_authid.h"
 #include "common/file_perm.h"
 #include "libpq/libpq.h"
+#include "libpq/pqsignal.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
+#include "postmaster/interrupt.h"
 #include "postmaster/postmaster.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
@@ -133,6 +135,23 @@ InitPostmasterChild(void)
         elog(FATAL, "setsid() failed: %m");
 #endif

+    /* In EXEC_BACKEND case we will not have inherited BlockSig etc values */
+#ifdef EXEC_BACKEND
+    pqinitmask();
+#endif
+
+    /*
+     * Every postmaster child process is expected to respond promptly to
+     * SIGQUIT at all times.  Therefore we centrally remove SIGQUIT from
+     * BlockSig and install a suitable signal handler.  (Client-facing
+     * processes may choose to replace this default choice of handler with
+     * quickdie().)  All other blockable signals remain blocked for now.
+     */
+    pqsignal(SIGQUIT, SignalHandlerForCrashExit);
+
+    sigdelset(&BlockSig, SIGQUIT);
+    PG_SETMASK(&BlockSig);
+
     /* Request a signal if the postmaster dies, if possible. */
     PostmasterDeathSignalInit();
 }
@@ -155,6 +174,13 @@ InitStandaloneProcess(const char *argv0)
     InitLatch(MyLatch);
     InitializeLatchWaitSet();

+    /*
+     * For consistency with InitPostmasterChild, initialize signal mask here.
+     * But we don't unblock SIGQUIT or provide a default handler for it.
+     */
+    pqinitmask();
+    PG_SETMASK(&BlockSig);
+
     /* Compute paths, no postmaster to inherit from */
     if (my_exec_path[0] == '\0')
     {

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Parallel worker hangs while handling errors.

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения