Race condition in HEAD, possibly due to PGPROC splitup

From	Tom Lane
Subject	Race condition in HEAD, possibly due to PGPROC splitup
Date	14 December 2011 00:15:52
Msg-id	27187.1323836130@sss.pgh.pa.us
Replies	Re: Race condition in HEAD, possibly due to PGPROC splitup
List	pgsql-hackers

If you add this Assert to lock.c:

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 3ba4671..d9c15e0 100644
*** a/src/backend/storage/lmgr/lock.c
--- b/src/backend/storage/lmgr/lock.c
*************** GetRunningTransactionLocks(int *nlocks)
*** 3195,3200 ****
--- 3195,3202 ----
              accessExclusiveLocks[index].dbOid = lock->tag.locktag_field1;
              accessExclusiveLocks[index].relOid = lock->tag.locktag_field2;
  
+             Assert(TransactionIdIsNormal(accessExclusiveLocks[index].xid));
+ 
              index++;
          }
      }

then set wal_level = hot_standby and run the regression tests
repeatedly, the Assert will eventually trigger.  For me, it happens
within a dozen or so parallel iterations, or somewhat longer if I run
the tests serially.  The stack trace is unsurprising, since AFAIK
GetRunningTransactionLocks is only called in the checkpointer:

#2  0x000000000073461d in ExceptionalCondition (
    conditionName=<value optimized out>, errorType=<value optimized out>,
    fileName=<value optimized out>, lineNumber=<value optimized out>)
    at assert.c:57
#3  0x000000000065eca1 in GetRunningTransactionLocks (nlocks=0x7fffa997de8c)
    at lock.c:3198
#4  0x00000000006582b8 in LogStandbySnapshot (nextXid=0x7fffa997dee0)
    at standby.c:835
#5  0x00000000004b0b97 in CreateCheckPoint (flags=32) at xlog.c:7761
#6  0x000000000062bf92 in CheckpointerMain () at checkpointer.c:488
#7  0x00000000004cf465 in AuxiliaryProcessMain (argc=2, argv=0x7fffa997e110)
    at bootstrap.c:424
#8  0x00000000006261f5 in StartChildProcess (type=CheckpointerProcess)
    at postmaster.c:4487

The actual value of the bogus xid (which was pulled from
allPgXact[proc->pgprocno]->xid just above here) is zero.  What I believe
is happening is that somebody is clearing his pgxact->xid entry
asynchronously with respect to GetRunningTransactionLocks, and since that
clearly ought to be impossible, something is broken.

Without the added Assert, you'd only notice this if you were running a
standby slave --- the zero xid results in an assert failure in WAL
replay on the slave end, which is how I found out about this to start
with.  But since we've not heard such reports before, I suspect that
this is a recently introduced bug; and personally I'd bet money that it
was the PGXACT patch that broke it.

I have other things to do right now, so I won't be looking into this myself.
        regards, tom lane
