Re: Hot Backup with rsync fails at pg_clog if under load
От | Heikki Linnakangas |
---|---|
Тема | Re: Hot Backup with rsync fails at pg_clog if under load |
Дата | |
Msg-id | 4E7C4BE0.4060701@enterprisedb.com обсуждение исходный текст |
Ответ на | Re: Hot Backup with rsync fails at pg_clog if under load (Florian Pflug <fgp@phlo.org>) |
Ответы |
Re: Hot Backup with rsync fails at pg_clog if under load
|
Список | pgsql-hackers |
On 23.09.2011 11:48, Florian Pflug wrote: > On Sep23, 2011, at 10:41 , Heikki Linnakangas wrote: >> On 23.09.2011 11:02, Linas Virbalas wrote: >>> On 9/22/11 6:59 PM, "Euler Taveira de Oliveira"<euler@timbira.com> wrote: >>> >>>>> If needed, I could do that, if I had the exact procedure... Currently, >>>>> during the start of the backup I take the following information: >>>>> >>>> Just show us the output of pg_start_backup and part of the standby log with >>>> the following message 'redo starts at' and the subsequent messages up to the >>>> failure. >>> >>> Unfortunately, it's impossible, because the error message "Could not read >>> from file "pg_clog/0001" at offset 32768: Success" is shown (and startup >>> aborted) before the turn for "redo starts at" message arrives. >> >> It looks to me that pg_clog/0001 exists, but it shorter than recovery expects. Which shouldn't happen, of course, becausethe start-backup checkpoint should flush all the clog that's needed by recovery to disk before the backup procedurebegins to them. > > Yeah. What confuses me though is that we fail while trying to read from the clog. When do we do that during normal (non-standby)recovery? I believe the OP is setting up a standby. He didn't say if it's a hot standby, though. That changes the startup sequence a bit. At the end of recovery, we read the "last" clog page, containing the next XID that will be assigned to a transaction, and zero out the future part of that clog file (StartupCLOG). In hot standby, we do that earlier, after reading the checkpoint record but before we start replaying records. > One other possibility would be that the problem is somehow triggered by vacuum running while the start-backup checkpointis commencing. That'd explain why the problem goes away with immediate checkpoints, because those make the windowsmach smaller. But I haven't got a concrete theory of whats happening.. Hmm, that's a good theory. The logic in SimpleLruPhysicakReadPage() tolerates non-existent files, because the old clog files might've been truncated away by a vacuum after the backup started. That's OK, because they will be recreated, and later deleted again, during the WAL replay. But what if something like this happens: 0. pg_start_backup(). 1. rsync starts. It gets a list of all files. It notes that pg_clog/0001 exists. 2. vacuum runs, and deletes pg_clog/0001 3. rsync starts to copy pg_clog/0001. Oops, it's gone. Instead of leaving it out altogether, it includes it as an empty file. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
В списке pgsql-hackers по дате отправления: