Re: PANIC during crash recovery of a recently promoted standby
От | Michael Paquier |
---|---|
Тема | Re: PANIC during crash recovery of a recently promoted standby |
Дата | |
Msg-id | 20180514033836.GD1528@paquier.xyz обсуждение исходный текст |
Ответ на | Re: PANIC during crash recovery of a recently promoted standby (Michael Paquier <michael@paquier.xyz>) |
Список | pgsql-hackers |
On Sat, May 12, 2018 at 07:41:33AM +0900, Michael Paquier wrote: > pg_ctl promote would wait for the control file to be updated, so you > cannot use it in the TAP tests to trigger the promotion. Still I think > I found one after waking up? Please note I have not tested it: > - Use a custom trigger file and then trigger promotion with a signal. > - Use a sleep command in recovery_end_command to increase the window, as > what matters is sleeping after CreateEndOfRecoveryRecord updates the > control file. > - Issue a restart point on the standby, which will update the control > file. > - Stop the standby with immediate mode. > - Start the standby, it should see unreferenced pages. I have been looking at that this morning, and actually I have been able to create a test case where minRecoveryPoint goes backwards using Pavan's patch. Using a sleep in recovery_end_command has proved to not be enough so I had to patch the backend with a couple of sleeps and some processing, mainly: - One sleep in CreateRestartPoint to make a restart point wait before updating the control file, which I set at 5s. - One sleep just after calling CreateEndOfRecoveryRecord, which has been set at 20s. - Trigger an asynchronous checkpoint using IPC::Run::start. - Trigger a promotion with a trigger file and SIGUSR2 to the postmaster. The rest of the test is crafted with some magic wait number and adds some logs to ease the monitoring of the issue. In order to get a crash, I think that you would need to crash the backend after creating the last WAL records which generate the invalid page references, and also slow down the last restart point which makes minRecoveryPoint go backwards, which is err... Complicated to make deterministic. Still if you apply the patch attached you would see log entries on the standby as follows: 2018-05-14 12:24:15.065 JST [17352] LOG: selected new timeline ID: 2 2018-05-14 12:24:15.074 JST [17352] LOG: archive recovery complete 2018-05-14 12:24:15.074 JST [17352] WARNING: CreateEndOfRecoveryRecord: minRecoveryPoint is 0/32C4258 before update 2018-05-14 12:24:15.074 JST [17352] WARNING: CreateEndOfRecoveryRecord: minRecoveryPoint is 0/0 after update 2018-05-14 12:24:17.067 JST [17353] WARNING: checkPointCopy.redo =0/30B3D70, lastCheckPoint.redo = 0/31BC208 2018-05-14 12:24:17.067 JST [17353] WARNING: CreateRestartPoint: minRecoveryPoint is 0/0 before update, lastCheckPointEndPtris 0/31BC2B0 2018-05-14 12:24:17.067 JST [17353] WARNING: CreateRestartPoint: minRecoveryPoint is 0/31BC2B0 after update So minRecoveryRecord can go definitely go backwards here, which is not good. Attached is a patch which includes Pavan's fix on top of the test case I crafted with what is in upthread. You just need to apply it on HEAD. -- Michael
Вложения
В списке pgsql-hackers по дате отправления: