Re: delay starting WAL receiver
От | Thomas Munro |
---|---|
Тема | Re: delay starting WAL receiver |
Дата | |
Msg-id | CA+hUKGLWv2PfMQR3FSo=M65+MGCGp_ZiYiWfS42+4VNqrrA+ig@mail.gmail.com обсуждение исходный текст |
Ответ на | delay starting WAL receiver (Nathan Bossart <nathandbossart@gmail.com>) |
Ответы |
Re: delay starting WAL receiver
Re: delay starting WAL receiver |
Список | pgsql-hackers |
On Wed, Jan 11, 2023 at 2:08 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > I discussed this a bit in a different thread [0], but I thought it deserved > its own thread. > > After setting wal_retrieve_retry_interval to 1ms in the tests, I noticed > that the recovery tests consistently take much longer. Upon further > inspection, it looks like a similar race condition to the one described in > e5d494d's commit message. With some added debug logs, I see that all of > the callers of MaybeStartWalReceiver() complete before SIGCHLD is > processed, so ServerLoop() waits for a minute before starting the WAL > receiver. > > The attached patch fixes this by adjusting DetermineSleepTime() to limit > the sleep to at most 100ms when WalReceiverRequested is set, similar to how > the sleep is limited when background workers must be restarted. Is the problem here that SIGCHLD is processed ... PG_SETMASK(&UnBlockSig); <--- here? selres = select(nSockets, &rmask, NULL, NULL, &timeout); Meanwhile the SIGCHLD handler code says: * Was it the wal receiver? If exit status is zero (normal) or one * (FATAL exit), we assume everything is all right just like normal * backends. (If we need a new wal receiver, we'll start one at the * next iteration of the postmaster's main loop.) ... which is true, but that won't be reached for a while in this case if the timeout has already been set to 60s. Your patch makes that 100ms, in that case, a time delay that by now attracts my attention like a red rag to a bull (I don't know why you didn't make it 0). I'm not sure, but if I got that right, then I think the whole problem might automatically go away with CF #4032. The SIGCHLD processing code will run not when signals are unblocked before select() (that is gone), but instead *after* the event loop wakes up with WL_LATCH_SET, and runs the handler code in the regular user context before dropping through to the rest of the main loop.
В списке pgsql-hackers по дате отправления: