[HACKERS] Another reason why the recovery tests take a long time
From | Tom Lane |
---|---|
Subject | [HACKERS] Another reason why the recovery tests take a long time |
Date | |
Msg-id | 21344.1498494720@sss.pgh.pa.us |
Responses |
Re: [HACKERS] Another reason why the recovery tests take a long time
Re: [HACKERS] Another reason why the recovery tests take a long time |
List | pgsql-hackers |
I've found another edge-case bug through investigation of unexpectedly slow recovery test runs. It goes like this:

* While streaming from master to slave, test script shuts down master while slave is left running. We soon restart the master, but meanwhile:

* slave's walreceiver process fails, reporting

    2017-06-26 16:06:50.209 UTC [13209] LOG: replication terminated by primary server
    2017-06-26 16:06:50.209 UTC [13209] DETAIL: End of WAL reached on timeline 1 at 0/3000098.
    2017-06-26 16:06:50.209 UTC [13209] FATAL: could not send end-of-streaming message to primary: no COPY in progress

* slave's startup process observes that walreceiver is gone and sends PMSIGNAL_START_WALRECEIVER to ask for a new one

* more often than you would guess, in fact nearly 100% reproducibly for me, the postmaster receives/services the PMSIGNAL before it receives SIGCHLD for the walreceiver. In this situation sigusr1_handler just throws away the walreceiver start request, reasoning that the walreceiver is already running.

* eventually, it dawns on the startup process that the walreceiver isn't starting, and it asks for a new one. But that takes ten seconds (WALRCV_STARTUP_TIMEOUT).

So this looks like a pretty obvious race condition in the postmaster, which should be resolved by having it set a flag on receipt of PMSIGNAL_START_WALRECEIVER that's cleared only when it does start a new walreceiver.

But I wonder whether it's intentional that the old walreceiver dies in the first place. That FATAL exit looks suspiciously like it wasn't originally-designed-in behavior.

regards, tom lane
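The flag-based fix proposed above could be modeled roughly as follows. This is a minimal, self-contained sketch and not the actual postmaster code: the `PostmasterModel` struct, its fields, and all three functions are hypothetical names invented for illustration. Only the idea comes from the message: remember the start request when the signal arrives, and clear it only when a new walreceiver really is started.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified model of the postmaster state involved in the race. */
typedef struct PostmasterModel
{
    bool walreceiver_running;    /* postmaster believes a walreceiver exists */
    bool walreceiver_requested;  /* proposed flag: a start request is pending */
    int  walreceivers_started;   /* number of launches, for illustration only */
} PostmasterModel;

/* Launch a walreceiver if one has been requested and none is running. */
void
maybe_start_walreceiver(PostmasterModel *pm)
{
    if (pm->walreceiver_requested && !pm->walreceiver_running)
    {
        pm->walreceiver_running = true;
        pm->walreceivers_started++;
        /* The flag is cleared only here, when a new process is started. */
        pm->walreceiver_requested = false;
        assert(pm->walreceiver_running);
    }
}

/* Models receipt of PMSIGNAL_START_WALRECEIVER from the startup process. */
void
on_start_walreceiver_signal(PostmasterModel *pm)
{
    /* Remember the request even if the old walreceiver has not been reaped. */
    pm->walreceiver_requested = true;
    maybe_start_walreceiver(pm);
}

/* Models the postmaster noticing (via SIGCHLD) that the walreceiver exited. */
void
on_walreceiver_exit(PostmasterModel *pm)
{
    pm->walreceiver_running = false;
    /* A request that arrived before the exit was noticed is honored now. */
    maybe_start_walreceiver(pm);
}
```

In the problematic ordering, `on_start_walreceiver_signal()` runs while `walreceiver_running` is still true, so nothing is launched, but the request stays pending. When `on_walreceiver_exit()` runs afterwards, the replacement starts immediately instead of waiting for the startup process to ask again after WALRCV_STARTUP_TIMEOUT.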