Re: [HACKERS] parallel.c oblivion of worker-startup failures
From | Amit Kapila
---|---
Subject | Re: [HACKERS] parallel.c oblivion of worker-startup failures
Date |
Msg-id | CAA4eK1KD=bz6mfA3p0-p7=FGF6DfH2A_HGV_ffPDDz0AnH6cRQ@mail.gmail.com
In response to | Re: [HACKERS] parallel.c oblivion of worker-startup failures (Peter Geoghegan <pg@bowt.ie>)
List | pgsql-hackers
On Wed, Jan 24, 2018 at 10:03 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Jan 23, 2018 at 8:25 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>>> Hmm, I think that case will be addressed because tuple queues can
>>> detect if the leader is not attached.  It does in code path
>>> shm_mq_receive->shm_mq_counterparty_gone.  In
>>> shm_mq_counterparty_gone, it can detect if the worker is gone by using
>>> GetBackgroundWorkerPid.  Moreover, I have manually tested this
>>> particular case before saying your patch is fine.  Do you have some
>>> other case in mind which I am missing?
>>
>> Hmm.  Yeah.  I can't seem to reach a stuck case and was probably just
>> confused and managed to confuse Robert too.  If you make
>> fork_process() fail randomly (see attached), I see that there are a
>> couple of easily reachable failure modes (example session at bottom of
>> message):
>>
>> 1.  HandleParallelMessages() is reached and raises a "lost connection
>> to parallel worker" error because shm_mq_receive() returns
>> SHM_MQ_DETACHED, I think because shm_mq_counterparty_gone() checked
>> GetBackgroundWorkerPid() just as you said.  I guess that's happening
>> because some other process is (coincidentally) sending
>> PROCSIG_PARALLEL_MESSAGE at shutdown, causing us to notice that a
>> process is unexpectedly stopped.
>>
>> 2.  WaitForParallelWorkersToFinish() is reached and raises a "parallel
>> worker failed to initialize" error.  TupleQueueReaderNext() set done
>> to true, because shm_mq_receive() returned SHM_MQ_DETACHED.  Once
>> again, that is because shm_mq_counterparty_gone() returned true.  This
>> is the bit Robert and I missed in our off-list discussion.
>>
>> As long as we always get our latch set by the postmaster after a fork
>> failure (ie kill SIGUSR1) and after GetBackgroundWorkerPid() is
>> guaranteed to return BGWH_STOPPED after that, and as long as we only
>> ever use latch/CFI loops to wait, and as long as we try to read from a
>> shm_mq, then I don't see a failure mode that hangs.
>
> What about the parallel_leader_participation=off case?
>

There is nothing special about that case; there shouldn't be any problem as long as we can detect the worker failures appropriately.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
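[Editor's note: for readers unfamiliar with the pattern Thomas describes, the "latch/CFI loop" looks roughly like the sketch below. This is an illustrative simplification against PostgreSQL server internals, not compilable on its own; the surrounding error message, the `done` condition, and the exact wait-event flags are placeholders, though GetBackgroundWorkerPid(), WaitLatch(), ResetLatch(), and CHECK_FOR_INTERRUPTS() are the real APIs involved.]

    /* Sketch: leader waiting on a parallel worker, robust against
     * a fork() failure in the postmaster. */
    for (;;)
    {
        pid_t   pid;

        CHECK_FOR_INTERRUPTS();     /* the "CFI" half of the loop */

        /*
         * After a failed fork, the postmaster sets our latch (via
         * SIGUSR1), and from then on GetBackgroundWorkerPid() is
         * guaranteed to report BGWH_STOPPED, so we cannot hang here.
         */
        if (GetBackgroundWorkerPid(handle, &pid) == BGWH_STOPPED)
            ereport(ERROR,
                    (errmsg("parallel worker failed to initialize")));

        if (done)                   /* e.g. shm_mq_receive() finished
                                     * or returned SHM_MQ_DETACHED */
            break;

        /* Sleep until the latch is set, then clear it and re-check. */
        WaitLatch(MyLatch, WL_LATCH_SET, 0, 0);
        ResetLatch(MyLatch);
    }

The key property is that every wakeup path (worker message, worker death, fork failure) sets the leader's latch, so the loop always re-checks worker status rather than blocking indefinitely.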