Re: Error restoring from a base backup taken from standby
From | Fujii Masao |
---|---|
Subject | Re: Error restoring from a base backup taken from standby |
Date | |
Msg-id | CAHGQGwGr+U1xujAaQwDqOk0eXH3ZD6iE_JWDV5u-vxJT8ocX=g@mail.gmail.com |
In response to | Error restoring from a base backup taken from standby (Heikki Linnakangas <hlinnakangas@vmware.com>) |
List | pgsql-hackers |

On Tue, Dec 18, 2012 at 2:39 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> (This is different from the other issue related to timeline switches I just
> posted about. There's no timeline switch involved in this one.)
>
> If you do "pg_basebackup -x" against a standby server, in some circumstances
> the backup fails to restore with error like this:
>
> C 2012-12-17 19:09:44.042 EET 7832 LOG: database system was not properly
> shut down; automatic recovery in progress
> C 2012-12-17 19:09:44.091 EET 7832 LOG: record with zero length at
> 0/1764F48
> C 2012-12-17 19:09:44.091 EET 7832 LOG: redo is not required
> C 2012-12-17 19:09:44.091 EET 7832 FATAL: WAL ends before end of online
> backup
> C 2012-12-17 19:09:44.091 EET 7832 HINT: All WAL generated while online
> backup was taken must be available at recovery.
> C 2012-12-17 19:09:44.092 EET 7831 LOG: startup process (PID 7832) exited
> with exit code 1
> C 2012-12-17 19:09:44.092 EET 7831 LOG: aborting startup due to startup
> process failure
>
> I spotted this bug while reading the code, and it took me quite a while to
> actually construct a test case to reproduce the bug, so let me begin by
> discussing the code where the bug is. You get the above error, "WAL ends
> before end of online backup", when you reach the end of WAL before reaching
> the backupEndPoint stored in the control file, which originally comes from
> the backup_label file. backupEndPoint is only used in a base backup taken
> from a standby, in a base backup taken from the master, the end-of-backup
> WAL record is used instead to mark the end of backup. In the xlog redo loop,
> after replaying each record, we check if we've just reached backupEndPoint,
> and clear it from the control file if we have. Now the problem is, if there
> are no WAL records after the checkpoint redo point, we never even enter the
> redo loop, so backupEndPoint is not cleared even though it's reached
> immediately after reading the initial checkpoint record.

Good catch!

> To deal with the similar situation wrt. reaching consistency for hot standby
> purposes, we call CheckRecoveryConsistency() before the redo loop. The
> straightforward fix is to copy-paste the check for backupEndPoint to just
> before the redo loop, next to the CheckRecoveryConsistency() call. Even
> better, I think we should move the backupEndPoint check into
> CheckRecoveryConsistency(). It's already responsible for keeping track of
> whether minRecoveryPoint has been reached, so it seems like a good idea to
> do this check there as well.
>
> Attached is a patch for that (for 9.2), as well as a script I used to
> reproduce the bug.

The patch looks good to me.

Regards,

-- 
Fujii Masao
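
For readers who want to see the control flow discussed in this message, here is a minimal standalone sketch in C. It is a toy model, not PostgreSQL's code and not the attached patch: `ControlFileModel`, `check_backup_end()` and `recovery_can_finish()` are hypothetical names invented for illustration, WAL positions are reduced to plain integers, and the "reached" test is assumed to be a simple `>=` comparison. Only `backupEndPoint` and the idea of running the check once before the redo loop come from the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified WAL position; an assumption of this sketch. */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/* Hypothetical stand-in for the one control file field that matters here. */
typedef struct
{
    XLogRecPtr backupEndPoint;  /* nonzero while the end of backup is pending */
} ControlFileModel;

/* Clear backupEndPoint once replay has reached it (assumed comparison). */
static void
check_backup_end(ControlFileModel *cf, XLogRecPtr replayed_upto)
{
    if (cf->backupEndPoint != InvalidXLogRecPtr &&
        replayed_upto >= cf->backupEndPoint)
        cf->backupEndPoint = InvalidXLogRecPtr;
}

/*
 * Model of recovery. after_checkpoint is the position just past the initial
 * checkpoint record; record_ends[] holds the end position of each WAL record
 * that follows it. Returns false where the real server would fail with
 * "WAL ends before end of online backup".
 */
static bool
recovery_can_finish(ControlFileModel cf, XLogRecPtr after_checkpoint,
                    const XLogRecPtr *record_ends, size_t nrecords,
                    bool check_before_loop)
{
    assert(record_ends != NULL || nrecords == 0);

    /* The proposed change: run the check once before entering the loop. */
    if (check_before_loop)
        check_backup_end(&cf, after_checkpoint);

    /* Redo loop: never entered when no records follow the checkpoint. */
    for (size_t i = 0; i < nrecords; i++)
        check_backup_end(&cf, record_ends[i]);

    return cf.backupEndPoint == InvalidXLogRecPtr;
}
```

In this model, with `backupEndPoint` equal to `after_checkpoint` and no further records, `recovery_can_finish(cf, pos, NULL, 0, false)` returns false (the reported failure), while the same call with `check_before_loop` set to true returns true.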