Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders

From: Tom Lane
Subject: Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders
Msg-id: 4219.1492875661@sss.pgh.pa.us
In response to: Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders  (Thomas Munro <thomas.munro@enterprisedb.com>)
Responses: Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders  (Thomas Munro <thomas.munro@enterprisedb.com>)
Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders  (Simon Riggs <simon@2ndquadrant.com>)
List: pgsql-hackers
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> The assertion fails reliably for me, because standby2's reported write
> LSN jumps backwards after the timeline changes: for example I see
> 3020000 then 3028470 then 3020000 followed by a normal progression.
> Surprisingly, 004_timeline_switch.pl reports success anyway.  I'm not
> sure why the test fails sometimes on tern, but you can see that even
> when it passed on tern the assertion had failed.
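
For illustration only, the backward jump in the quoted sequence can be detected with a few lines of Python. This is a sketch, not the server's own check: the helper name `backward_jumps` is hypothetical, and it assumes the reported values are hexadecimal LSN offsets (the ordering comes out the same if they are read as decimal).

```python
def backward_jumps(lsns):
    """Return (index, previous, current) for each position where the LSN decreases."""
    # Assumption: the reported values are hexadecimal offsets.
    values = [int(s, 16) for s in lsns]
    return [
        (i, lsns[i - 1], lsns[i])
        for i in range(1, len(values))
        if values[i] < values[i - 1]
    ]


# The three write LSNs quoted above.
reported = ["3020000", "3028470", "3020000"]
print(backward_jumps(reported))  # [(2, '3028470', '3020000')]
```

An assertion that the write LSN never moves backwards would fire at exactly that third report.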

Whoa.  This just turned into a much larger can of worms than I expected.
How can it be that processes are getting assertion crashes and yet the
test framework reports success anyway?  That's impossibly
broken/unacceptable.

Looking closer at the tern report we started the thread with, there
are actually TWO assertion trap reports, the one Alvaro noted and
another one in 009_twophase_master.log:

TRAP: FailedAssertion("!(*ptr == ((TransactionId) 0) || (*ptr == parent && overwriteOK))", File: "subtrans.c", Line: 92)

When I run the recovery test on my own machine, it reports success
(quite reliably, I tried a bunch of times yesterday), but now that
I know to look:

$ grep TRAP tmp_check/log/*
tmp_check/log/009_twophase_master.log:TRAP: FailedAssertion("!(*ptr == ((TransactionId) 0) || (*ptr == parent && overwriteOK))",File: "subtrans.c", Line: 92)
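
The manual grep above could be automated as a post-run check that turns any TRAP line into a failing exit status. The following is a minimal sketch, not part of the PostgreSQL test framework: the function names `find_traps` and `check_logs` are hypothetical, and it assumes the server logs sit as regular files directly under one directory such as tmp_check/log.

```python
from pathlib import Path


def find_traps(log_dir):
    """Return (file name, line) pairs for every log line starting with 'TRAP:'."""
    hits = []
    for path in sorted(Path(log_dir).glob("*")):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a log has odd bytes.
        with open(path, errors="replace") as f:
            for line in f:
                if line.startswith("TRAP:"):
                    hits.append((path.name, line.rstrip("\n")))
    return hits


def check_logs(log_dir):
    """Print each TRAP line found and return 1 if there were any, else 0."""
    traps = find_traps(log_dir)
    for name, line in traps:
        print(f"{name}:{line}")
    return 1 if traps else 0
```

A wrapper script could end with `raise SystemExit(check_logs("tmp_check/log"))` so that a run with assertion crashes cannot report success.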

So we now have three problems, not just one:

* How is it that the TAP tests aren't noticing the failure?  This one,
to my mind, is a code-red situation, as it basically invalidates every
TAP test we've ever run.

* If Thomas's explanation for the timeline-switch assertion is correct,
why isn't it reproducible everywhere?

* What's with that second TRAP?

> Here is a fix for the assertion failure.

As for this patch itself, is it reasonable to try to assert that the
timeline has in fact changed?
        regards, tom lane


