Error restoring from a base backup taken from standby
From | Heikki Linnakangas |
---|---|
Subject | Error restoring from a base backup taken from standby |
Date | |
Msg-id | 50CF58D1.2060903@vmware.com |
Responses |
Re: Error restoring from a base backup taken from standby
Re: Error restoring from a base backup taken from standby |
List | pgsql-hackers |
(This is different from the other issue related to timeline switches I just posted about. There's no timeline switch involved in this one.)

If you do "pg_basebackup -x" against a standby server, in some circumstances the backup fails to restore with an error like this:

C 2012-12-17 19:09:44.042 EET 7832 LOG: database system was not properly shut down; automatic recovery in progress
C 2012-12-17 19:09:44.091 EET 7832 LOG: record with zero length at 0/1764F48
C 2012-12-17 19:09:44.091 EET 7832 LOG: redo is not required
C 2012-12-17 19:09:44.091 EET 7832 FATAL: WAL ends before end of online backup
C 2012-12-17 19:09:44.091 EET 7832 HINT: All WAL generated while online backup was taken must be available at recovery.
C 2012-12-17 19:09:44.092 EET 7831 LOG: startup process (PID 7832) exited with exit code 1
C 2012-12-17 19:09:44.092 EET 7831 LOG: aborting startup due to startup process failure

I spotted this bug while reading the code, and it took me quite a while to actually construct a test case to reproduce the bug, so let me begin by discussing the code where the bug is.

You get the above error, "WAL ends before end of online backup", when you reach the end of WAL before reaching the backupEndPoint stored in the control file, which originally comes from the backup_label file. backupEndPoint is only used in a base backup taken from a standby; in a base backup taken from the master, the end-of-backup WAL record is used instead to mark the end of the backup.

In the xlog redo loop, after replaying each record, we check if we've just reached backupEndPoint, and clear it from the control file if we have. Now the problem is, if there are no WAL records after the checkpoint redo point, we never even enter the redo loop, so backupEndPoint is not cleared even though it's reached immediately after reading the initial checkpoint record.

To deal with the similar situation wrt. reaching consistency for hot standby purposes, we call CheckRecoveryConsistency() before the redo loop.
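
The control flow described above can be sketched as a toy model. This is not PostgreSQL source code: the Lsn type, the function name recover_check_only_in_loop, and the use of plain integers for WAL positions are all assumptions made for illustration, and "reached" is simplified to a >= comparison.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a WAL position; not PostgreSQL's own pointer type. */
typedef uint64_t Lsn;

/*
 * Hypothetical model of recovery from a base backup taken on a standby.
 * record_ends[] holds the end position of each WAL record that follows
 * the initial checkpoint record.
 *
 * Returns true if the backup end point was reached, false for the
 * "WAL ends before end of online backup" outcome.
 */
bool
recover_check_only_in_loop(Lsn checkpoint_end, Lsn backup_end_point,
                           const Lsn *record_ends, size_t nrecords)
{
    bool backup_end_pending = true;

    /* The position reached by reading the checkpoint is never compared. */
    (void) checkpoint_end;

    for (size_t i = 0; i < nrecords; i++)
    {
        Lsn replayed_upto = record_ends[i];

        /* The only place where the backup end point is checked. */
        if (backup_end_pending && replayed_upto >= backup_end_point)
            backup_end_pending = false;
    }

    return !backup_end_pending;
}
```

In this model, calling recover_check_only_in_loop(100, 100, NULL, 0) returns false: the backup end point coincides with the end of the checkpoint record, but with no further records the loop body never runs and the marker is never cleared.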

The straightforward fix is to copy-paste the check for backupEndPoint to just before the redo loop, next to the CheckRecoveryConsistency() call. Even better, I think we should move the backupEndPoint check into CheckRecoveryConsistency(). It's already responsible for keeping track of whether minRecoveryPoint has been reached, so it seems like a good idea to do this check there as well.

Attached is a patch for that (for 9.2), as well as a script I used to reproduce the bug. The script is a bit messy, and requires tweaking the paths at the top.

Anyone spot a problem with this?

- Heikki
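
The shape of the proposed fix can be sketched in the same toy style. This is not the attached patch and not PostgreSQL source: RecoveryState, check_consistency and recover_with_shared_check are hypothetical names, WAL positions are plain integers, and "reached" is again simplified to a >= comparison. The only point is that one shared check runs both before the loop and after each replayed record.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a WAL position; not PostgreSQL's own pointer type. */
typedef uint64_t Lsn;

/* Hypothetical recovery state; field names are illustrative only. */
typedef struct
{
    Lsn  replayed_upto;       /* how far replay has progressed */
    Lsn  backup_end_point;    /* position that marks the end of the backup */
    bool backup_end_pending;  /* true until the end point has been reached */
} RecoveryState;

/* Single shared check, standing in for a consistency-check routine. */
void
check_consistency(RecoveryState *st)
{
    if (st->backup_end_pending && st->replayed_upto >= st->backup_end_point)
        st->backup_end_pending = false;
}

/* Returns true if the backup end point was reached during recovery. */
bool
recover_with_shared_check(Lsn checkpoint_end, Lsn backup_end_point,
                          const Lsn *record_ends, size_t nrecords)
{
    RecoveryState st = { checkpoint_end, backup_end_point, true };

    /* Runs even when no records follow the checkpoint. */
    check_consistency(&st);

    for (size_t i = 0; i < nrecords; i++)
    {
        st.replayed_upto = record_ends[i];
        check_consistency(&st);
    }

    return !st.backup_end_pending;
}
```

Here recover_with_shared_check(100, 100, NULL, 0) returns true, because the check before the loop already sees that the end point has been reached.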