Re: Parallel worker hangs while handling errors.
От | vignesh C |
---|---|
Тема | Re: Parallel worker hangs while handling errors. |
Дата | |
Msg-id | CALDaNm04E=fbixNSqEeb4PWGdeYY26OZ-4bXNayKR2QoV6i=Yg@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: Parallel worker hangs while handling errors. (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>) |
Ответы |
Re: Parallel worker hangs while handling errors.
|
Список | pgsql-hackers |
On Fri, Jul 24, 2020 at 12:41 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Thu, Jul 23, 2020 at 12:54 PM vignesh C <vignesh21@gmail.com> wrote: > > > > Thanks for reviewing and adding your thoughts, My comments are inline. > > > > On Fri, Jul 17, 2020 at 1:21 PM Bharath Rupireddy > > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > > > The same hang issue can occur(though I'm not able to back it up with a > > > use case), in the cases from wherever the EmitErrorReport() is called > > > from "if (sigsetjmp(local_sigjmp_buf, 1) != 0)" block, such as > > > autovacuum.c, bgwriter.c, bgworker.c, checkpointer.c, walwriter.c, and > > > postgres.c. > > > > > > > I'm not sure if this can occur in other cases. > > > > I checked further on this point: Yes, it can't occur for the other > cases, as mq_putmessage() gets only used for parallel > workers(ParallelWorkerMain() --> pq_redirect_to_shm_mq()). > > > > > > Note that, in all sigsetjmp blocks, we intentionally > > > HOLD_INTERRUPTS(), to not cause any issues while performing error > > > handling, I'm concerned here that now, if we directly call > > > BackgroundWorkerUnblockSignals() which will open up all the signals > > > and our main intention of holding interrupts or block signals may go > > > away. > > > > > > Since the main problem for this hang issue is because of blocking > > > SIGUSR1, in sigsetjmp, can't we just only unblock only the SIGUSR1, > > > instead of unblocking all signals? I tried this with parallel copy > > > hang, the issue is resolved. > > > > > > > On putting further thoughts on this, I feel just unblocking SIGUSR1 > > would be the right approach in this case. I'm attaching a new patch > > which unblocks SIGUSR1 signal. I have verified that the original issue > > with WIP parallel copy patch gets fixed. I have made changes only in > > bgworker.c as we require the parallel worker to receive this signal > > and continue processing. I have not included the changes for other > > processes as I'm not sure if this scenario is applicable for other > > processes. > > > > Since we are sure that this hang issue can occur only for parallel > workers, and the change in StartBackgroundWorker's sigsetjmp's block > should only be made for parallel worker cases. And also there can be a > lot of other callbacks execution and processing in proc_exit() for > which we might not need the SIGUSR1 unblocked. So, let's undo the > unblocking right after EmitErrorReport() to not cause any new issues. > > Attaching a modified v2 patch: it has the unblocking for only for > parallel workers, undoing it after EmitErrorReport(), and some > adjustments in the comment. > I have made slight changes on top of the patch to remove duplicate code, attached v3 patch for the same. The parallel worker hang issue gets resolved, make check & make check-world passes. Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
Вложения
В списке pgsql-hackers по дате отправления: