Re: Standby corruption after master is restarted

Поиск
Список
Период
Сортировка
От Emre Hasegeli
Тема Re: Standby corruption after master is restarted
Дата
Msg-id CAE2gYzya6ewQ_KF8pTg6ctT-H_O1Q_+Z7Nipvd0PODm04hoB4w@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Standby corruption after master is restarted  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Ответы Re: Standby corruption after master is restarted  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Список pgsql-bugs
> OK, and on the standby? I suppose it breaks right after the
> CHECKPOINT_SHUTDOWN record, right?

Actually pg_xlogdump fails right at the SHUTDOWN record:

> rmgr: Heap2       len (rec/tot):     58/    58, tx:          0, lsn: 1DFB/D6BDDEA0, prev 1DFB/D6BDDE60, desc: CLEAN
remxid0, blkref #0: rel 1663/16385/141744 blk 331538
 
> rmgr: Heap2       len (rec/tot):     58/    58, tx:          0, lsn: 1DFB/D6BDDEE0, prev 1DFB/D6BDDEA0, desc: CLEAN
remxid0, blkref #0: rel 1663/16385/141744 blk 331539
 
> rmgr: Heap2       len (rec/tot):    110/   110, tx:          0, lsn: 1DFB/D6BDDF20, prev 1DFB/D6BDDEE0, desc: CLEAN
remxid0, blkref #0: rel 1663/16385/141744 blk 331540
 
> rmgr: Heap2       len (rec/tot):    100/   100, tx:          0, lsn: 1DFB/D6BDDF90, prev 1DFB/D6BDDF20, desc: CLEAN
remxid0, blkref #0: rel 1663/16385/141744 blk 331541
 
> pg_xlogdump: FATAL:  error in WAL record at 1DFB/D6BDDF90: unexpected pageaddr 1DFB/66BDE000 in log segment
0000000000001DFB000000D6,offset 12443648
 

> Hmmm. I'm not sure what exactly debian upgrade script (and IIRC it
> depends on packages). I wonder if it does fast/smart/immediate.

I couldn't find it quickly.

> And can you determine how the broken WAL got to the standby? Was it by
> directly fetching from the primary, or from the barman archive? Is it
> broken in the archive too?

It must be streamed from the primary.  The log was on on the archive
during the incident.  The good log file came there later.  Such logs
are repeated on the standby before the error:

> 2018-03-28 14:00:30 UTC [3693-67] LOG:  invalid resource manager ID 39 at 1DFB/D43BE688
> 2018-03-28 14:00:30 UTC [25347-1] LOG:  started streaming WAL from primary at 1DFB/D4000000 on timeline 5
> 2018-03-28 14:00:59 UTC [3748-357177] LOG:  restartpoint starting: time
> WARNING:  terminating connection because of crash of another server process
> DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because
anotherserver process exited abnormally and possibly corrupted shared m
 
emory.
> HINT:  In a moment you should be able to reconnect to the database and repeat your command.
> 2018-03-28 14:01:23 UTC [25347-2] FATAL:  could not receive data from WAL stream: SSL SYSCALL error: EOF detected

> I think one hypothesis is that the standby starts fetching a WAL segment
> from the primary, and while doing that it gets recycled and overwritten
> (so the standby gets bogus somewhere half-way through). That should not
> be possible, of course, but perhaps there's something broken right after
> a restart? A race condition or something like that?

If I haven't missed anything on the hexdumps I posted, the logs of the
master and the standby are the same until the position on
pg_controldata.  I cannot understand how it can fail for something
later.  Is it possible that this is triggered if the replication is
interrupted on a certain point?


В списке pgsql-bugs по дате отправления:

Предыдущее
От: Tomas Vondra
Дата:
Сообщение: Re: Standby corruption after master is restarted
Следующее
От: Emre Hasegeli
Дата:
Сообщение: Re: Wrong Results from SP-GiST with Collations