Re: BUG #8686: Standby could not restart.
From | Heikki Linnakangas |
---|---|
Subject | Re: BUG #8686: Standby could not restart. |
Date | |
Msg-id | 52B46D0D.2070505@vmware.com |
In response to | BUG #8686: Standby could not restart. (katsumata.tomonari@po.ntts.co.jp) |
Responses | Re: BUG #8686: Standby could not restart. |
List | pgsql-bugs |
On 12/19/2013 04:57 AM, katsumata.tomonari@po.ntts.co.jp wrote:
> At first, I doubted the recovery state reached "consistent" before redo
> starts.
> And then I checked pg_control and related WAL.
> The WAL sequence is like below.
>
> WAL--(a)--(b)--(c)--(d)--(e)-->
> ================================================
> (a) Latest checkpoint's REDO location
> 1/783B230
>
> (b) hot_update
> 1/7842010
>
> (c) truncate
> 1/8E7E5C8
>
> (d) Latest checkpoint location
> 1/8E7F0B0
>
> (e) Minimum recovery ending location
> 1/8E7F110
> ================================================
>
> From these things, I found it has happened with this scenario.
> ----------
> (1) standby starting
> (2) seeking checkpoint location 1/8E7F0B0 because backup_label is not
> absent
> (3) reachedConsistency is set to true at 1/8E7F110 in
> CheckRecoveryConsistent
> (4) redo start from 1/783B230
> (5) PANIC at 1/7842010 because reachedConsistency has set already and
> operating against a block which will be truncated at (c).
> ----------
>
> At step(2), EndRecPtr is set to 1/8E7F110(next to 1/8E7F0B0),
> so reachedConsistency is set to true at step(3).

Yep. Thanks for a good explanation.

> I think it's not need to increase EndRecPtr while seeking checkpoint
> location.
> I tried to revise it and this worked fine.

Hmm. There's this comment in StartupXLOG, after reading the checkpoint record, but before reading the first record at REDO point:

>     /*
>      * Initialize shared replayEndRecPtr, lastReplayedEndRecPtr, and
>      * recoveryLastXTime.
>      *
>      * This is slightly confusing if we're starting from an online
>      * checkpoint; we've just read and replayed the checkpoint record, but
>      * we're going to start replay from its redo pointer, which precedes
>      * the location of the checkpoint record itself. So even though the
>      * last record we've replayed is indeed ReadRecPtr, we haven't
>      * replayed all the preceding records yet. That's OK for the current
>      * use of these variables.
>      */
>     SpinLockAcquire(&xlogctl->info_lck);
>     xlogctl->replayEndRecPtr = ReadRecPtr;
>     xlogctl->lastReplayedEndRecPtr = EndRecPtr;
>     xlogctl->recoveryLastXTime = 0;
>     xlogctl->currentChunkStartTime = 0;
>     xlogctl->recoveryPause = false;
>     SpinLockRelease(&xlogctl->info_lck);

I think we need to fix that confusion. Your patch will do it by not setting EndRecPtr yet; that fixes the bug, but leaves those variables in a slightly strange state. I'm not sure what EndRecPtr points to in that case (0?), but ReadRecPtr would be set, I guess.

Perhaps we should reset replayEndRecPtr and lastReplayedEndRecPtr to the REDO point here, instead of ReadRecPtr/EndRecPtr.

- Heikki
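
To make the scenario above concrete, here is a minimal sketch in C using the WAL locations from the report. It is not PostgreSQL code: `RecPtr`, `ToyControl`, and every function name are hypothetical, and the consistency test is an assumed simplification (it only compares the last replayed end position against the minimum recovery point and ignores any backup-related conditions). It compares initializing from the checkpoint's EndRecPtr with initializing from the REDO point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a WAL location written as "hi/lo" in hex. */
typedef uint64_t RecPtr;

RecPtr make_ptr(uint32_t hi, uint32_t lo)
{
    return ((RecPtr) hi << 32) | lo;
}

/* WAL locations from the report, named after the labels in the diagram. */
typedef struct
{
    RecPtr redo;            /* (a) latest checkpoint's REDO location */
    RecPtr checkpoint_end;  /* EndRecPtr after reading the checkpoint at (d) */
    RecPtr min_recovery;    /* (e) minimum recovery ending location */
} ToyControl;

ToyControl bug_8686_example(void)
{
    ToyControl c;

    c.redo = make_ptr(1, 0x783B230);
    c.checkpoint_end = make_ptr(1, 0x8E7F110);
    c.min_recovery = make_ptr(1, 0x8E7F110);

    /* Online checkpoint: the REDO point precedes the checkpoint record. */
    assert(c.redo < c.checkpoint_end);
    return c;
}

/*
 * Simplified consistency test (an assumption of this sketch): recovery is
 * treated as consistent once the last replayed end position has reached
 * the minimum recovery point.
 */
bool reached_consistency(RecPtr last_replayed_end, RecPtr min_recovery)
{
    return min_recovery <= last_replayed_end;
}

/* Behaviour described in the report: start from the checkpoint's EndRecPtr. */
bool consistent_before_redo_current(const ToyControl *c)
{
    return reached_consistency(c->checkpoint_end, c->min_recovery);
}

/* Suggested alternative: start from the REDO point instead. */
bool consistent_before_redo_proposed(const ToyControl *c)
{
    return reached_consistency(c->redo, c->min_recovery);
}
```

With the reported values, `consistent_before_redo_current` returns true, so consistency is declared before any record from 1/783B230 has been replayed. `consistent_before_redo_proposed` returns false, so in this model consistency would only be reached once replay actually advances to 1/8E7F110.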