Re: Completely broken replica after PANIC: WAL contains references to invalid pages
From | Andres Freund |
---|---|
Subject | Re: Completely broken replica after PANIC: WAL contains references to invalid pages |
Date | |
Msg-id | 20130402182644.GJ2415@alap2.anarazel.de |
In response to | Re: Completely broken replica after PANIC: WAL contains references to invalid pages (Andres Freund <andres@2ndquadrant.com>) |
Responses | Re: Completely broken replica after PANIC: WAL contains references to invalid pages |
List | pgsql-bugs |
On 2013-04-02 12:10:12 +0200, Andres Freund wrote:
> On 2013-04-01 08:49:16 +0100, Simon Riggs wrote:
> > On 30 March 2013 17:21, Andres Freund <andres@2ndquadrant.com> wrote:
> >
> > > So if the xid is later than latestObservedXid we extend subtrans one by
> > > one. So far so good. But we initialize it in
> > > ProcArrayApplyRecoveryInfo() when consistency is initially reached:
> > >     latestObservedXid = running->nextXid;
> > >     TransactionIdRetreat(latestObservedXid);
> > > Before that subtrans has initially been started up with:
> > >     if (wasShutdown)
> > >         oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
> > >     else
> > >         oldestActiveXID = checkPoint.oldestActiveXid;
> > >     ...
> > >     StartupSUBTRANS(oldestActiveXID);
> > >
> > > That means it's only initialized up to checkPoint.oldestActiveXid. As it
> > > can take some time till we reach consistency it seems rather plausible
> > > that there now will be a gap in initialized pages. From
> > > checkPoint.oldestActiveXid to running->nextXid if there are pages
> > > in between.
> >
> > That was an old bug.
> >
> > StartupSUBTRANS() now explicitly fills that gap. Are you saying it
> > does that incorrectly? How?
>
> Well, no. I think StartupSUBTRANS does this correctly, but there's a gap
> between the call to Startup* and the first call to ExtendSUBTRANS. The
> latter is only called *after* we reached STANDBY_INITIALIZED via
> ProcArrayApplyRecoveryInfo(). The problem is that we StartupSUBTRANS to
> checkPoint.oldestActiveXid while we start to ExtendSUBTRANS from
> running->nextXid - 1. There very well can be a gap in between.
> The window isn't terribly big but if you use subtransactions as heavily
> as Sergey seems to be it doesn't seem unlikely to hit it.
>
> Let me come up with a testcase and patch.
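The gap described in the quoted text can be illustrated with a small standalone model. This is only a sketch under simplifying assumptions, not PostgreSQL code: `XACTS_PER_PAGE`, `xid_to_page()` and `uninitialized_pages()` are hypothetical names, xid wraparound is ignored, startup is assumed to zero only the page holding `checkPoint.oldestActiveXid`, and one-by-one extension is assumed to zero a page only when it reaches the first xid on that page.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed: 8192-byte pages holding 4-byte xids, so 2048 entries per page. */
#define XACTS_PER_PAGE 2048u

/* Hypothetical helper: page of the subtrans area that holds a given xid. */
static uint32_t xid_to_page(uint32_t xid)
{
    return xid / XACTS_PER_PAGE;
}

/*
 * Count pages that neither step zeroes in this simplified model:
 *   - startup zeroes only the page containing oldest_active_xid;
 *   - extension starts at next_xid (latestObservedXid = next_xid - 1)
 *     and zeroes a page only when it sees the first xid of that page.
 * Wraparound is deliberately ignored.
 */
uint32_t uninitialized_pages(uint32_t oldest_active_xid, uint32_t next_xid)
{
    assert(oldest_active_xid <= next_xid);

    uint32_t startup_page = xid_to_page(oldest_active_xid);

    /* First page whose first xid is >= next_xid, i.e. ceil(next_xid / per page). */
    uint32_t first_extended_page =
        next_xid / XACTS_PER_PAGE + (next_xid % XACTS_PER_PAGE != 0);

    if (first_extended_page <= startup_page + 1)
        return 0;               /* no page lies strictly in between */

    return first_extended_page - startup_page - 1;
}
```

In this model the gap is empty while `oldest_active_xid` and `next_xid` fall on the same page, and it grows by one page for every 2048 xids consumed between the checkpoint and the point where consistency is reached, which fits the observation that heavy subtransaction use makes the window easier to hit.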
Developing a testcase was trivial, pgbench running the following function:

CREATE OR REPLACE FUNCTION recurse_and_assign_txid(level bigint DEFAULT 0)
RETURNS bigint
LANGUAGE plpgsql AS $b$
BEGIN
    IF level < 500 THEN
        RETURN recurse_and_assign_txid(level + 1);
    ELSE
        -- assign xid in subtxn and parents
        CREATE TEMPORARY TABLE foo();
        DROP TABLE foo;
        RETURN txid_current()::bigint;
    END IF;
EXCEPTION WHEN others THEN
    RAISE NOTICE 'unexpected';
END
$b$;

When now restarting a standby (so it restarts from another checkpoint) it
frequently crashed with various errors:
* pg_subtrans/xxx does not exist
* (warning) pg_subtrans page does not exist, assuming zero
* xid overwritten in SubTransSetParent

So I think my theory is correct.

The attached patch fixes this, although I don't like the way knowledge of
the point up to which StartupSUBTRANS zeroes pages is handled.

Makes sense?

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services