Re: Usage of epoch in txid_current

Поиск
Список
Период
Сортировка
От Thomas Munro
Тема Re: Usage of epoch in txid_current
Дата
Msg-id CAEepm=03nb0xU+Q725OhQwRri2Emyb4raBhkouuWAs+xg7+5Yw@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Usage of epoch in txid_current  (Amit Kapila <amit.kapila16@gmail.com>)
Ответы Re: Usage of epoch in txid_current  (Andres Freund <andres@anarazel.de>)
Re: Usage of epoch in txid_current  (Craig Ringer <craig@2ndquadrant.com>)
Список pgsql-hackers
On Fri, Dec 8, 2017 at 3:36 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Dec 6, 2017 at 11:26 PM, Andres Freund <andres@anarazel.de> wrote:
>>> Either way, it is not clear to me how we will keep it
>>> updated after recovery.  Right now, the mechanism is quite simple, at
>>> the beginning of a recovery we take the value of nextXid from
>>> checkpoint record and then if any xlog record indicates xid that
>>> follows nextXid, we advance it.  Here, the point to note is that we
>>> take the xid from the WAL record (which means that it assumes xids are
>>> non-contiguous or some xids are consumed without being logged) and
>>> increment it.  Unless we plan to change something in that logic (like
>>> storing 64-bit xids in WAL records), it is not clear to me how to make
>>> it work.  OTOH, recovering value of epoch which increments only at
>>> wraparound seems fairly straightforward as described in my previous
>>> email.
>>
>> I think it should be fairly simple if simply add the 64bit xid to the
>> existing clog extension WAL records.
>
> IIUC, you mean to say that we should log the 64bit xid value in
> CLOG_ZEROPAGE record while extending clog and that too we can do only
> at wraparound.  Now, maybe doing it every time also doesn't hurt, but
> I think doing it at wraparound should be sufficient.

Can you please elaborate on why clog redo in particular would need to
use 64 bit xids?  Is the 64 bit xid not derivable from the 32 bit xid
in the WAL + the current value of a new 64 bit next xid?

> Just to be clear, I am not planning to pursue writing a patch for this
> at the moment.  So, if anybody else is interested or if Andres wants
> to write it, I can help in the review.

I played around with this idea yesterday.  Experiment-grade patch
attached.  Approach:

1.  Introduce a new type BigTransactionId (better names welcome).
2.  Change ShmemVariableCache->nextXid to ShmemVariableCache->nextBigXid.
3.  Change checkpoints to use nextBigXid too.
4.  Change ReadNewTransactionId() to ReadNextBigTransactionId().
5.  Remove GetNextXidAndEpoch() as it's now redundant.
6.  Everywhere that was reading ShmemVariableCache->nextXid or calling
ReadNewTransactionId() but actually needs an xid now uses an explicit
conversion macro XidFromBigTransactionId().
7.  Everywhere that was writing ShmemVariableCache->nextXid but only
has an xid now goes through a new function
AdvanceNextBigTransactionIdPast(xid), to handle recovery (since WAL
records have xids in them as mentioned by Amit).

I think it's probably a good idea to make it very explicit when moving
between big and small transaction IDs, hence the including of the word
'big' in variable and function names and the use of a function-like
macro (rather than implicit conversion, which C doesn't give me a good
way to prevent).  Otherwise there is a class of bug that is hidden for
the first 2^32 transactions.

The logic in CreateCheckPoint() that assumed that there could be only
one wraparound per checkpoint is gone (the problem reported in this
thread).  Note that AdvanceNextBigTransactionIdPast() contains logic
that infers an epoch, which is in some ways similar to what the
removed code was doing, but in this case I think all callers must have
an xid from the same or next epoch.

I'm probably missing a few details... this is only a first swing at
the problem.  It does pass check-world and various xid wraparound
tests I came up with.  Clearly big xids could spread to more places
than I show here.  Do you see problems with this approach or have
better ideas?

-- 
Thomas Munro
http://www.enterprisedb.com

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: [HACKERS] Possibly too stringent Assert() in b-tree code
Следующее
От: Andres Freund
Дата:
Сообщение: small development tip: Consider using the gold linker