Re: [HACKERS] Another reason why the recovery tests take a long time
От | Andres Freund |
---|---|
Тема | Re: [HACKERS] Another reason why the recovery tests take a long time |
Дата | |
Msg-id | 20170626164659.ea43donv7tqmwl5a@alap3.anarazel.de обсуждение исходный текст |
Ответ на | [HACKERS] Another reason why the recovery tests take a long time (Tom Lane <tgl@sss.pgh.pa.us>) |
Ответы |
Re: [HACKERS] Another reason why the recovery tests take a long time
|
Список | pgsql-hackers |
Hi, On 2017-06-26 12:32:00 -0400, Tom Lane wrote: > I've found another edge-case bug through investigation of unexpectedly > slow recovery test runs. It goes like this: > > * While streaming from master to slave, test script shuts down master > while slave is left running. We soon restart the master, but meanwhile: > > * slave's walreceiver process fails, reporting > > 2017-06-26 16:06:50.209 UTC [13209] LOG: replication terminated by primary server > 2017-06-26 16:06:50.209 UTC [13209] DETAIL: End of WAL reached on timeline 1 at 0/3000098. > 2017-06-26 16:06:50.209 UTC [13209] FATAL: could not send end-of-streaming message to primary: no COPY in progress > > * slave's startup process observes that walreceiver is gone and sends > PMSIGNAL_START_WALRECEIVER to ask for a new one > > * more often than you would guess, in fact nearly 100% reproducibly for > me, the postmaster receives/services the PMSIGNAL before it receives > SIGCHLD for the walreceiver. In this situation sigusr1_handler just > throws away the walreceiver start request, reasoning that the walreceiver > is already running. Yuck. I've recently seen a bunch of symptoms vaguely around this, e.g. I can atm frequently reconnect to a new backend after a PANIC/segfault/whatnot, before postmastre gets the message. I've not yet figured out whether that's a kernel change, or whether some of the more recent tinkering in postmaster.c changed this. > * eventually, it dawns on the startup process that the walreceiver > isn't starting, and it asks for a new one. But that takes ten seconds > (WALRCV_STARTUP_TIMEOUT). > > So this looks like a pretty obvious race condition in the postmaster, > which should be resolved by having it set a flag on receipt of > PMSIGNAL_START_WALRECEIVER that's cleared only when it does start a > new walreceiver. But I wonder whether it's intentional that the old > walreceiver dies in the first place. That FATAL exit looks suspiciously > like it wasn't originally-designed-in behavior. It's quite intentional afaik - I've complained about the bad error message recently (we really shouldn't say "no COPY in progress), but exiting seems quite reasonable. Otherwise we'd have add a separate retry logic into the walsender, that reconnects without a new walsender being started. - Andres
В списке pgsql-hackers по дате отправления: