Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

Поиск
Список
Период
Сортировка
От Kyotaro Horiguchi
Тема Re: [BUG] Panic due to incorrect missingContrecPtr after promotion
Дата
Msg-id 20220621.103533.1567183025826185124.horikyota.ntt@gmail.com
обсуждение исходный текст
Ответ на Re: [BUG] Panic due to incorrect missingContrecPtr after promotion  (Robert Haas <robertmhaas@gmail.com>)
Ответы Re: [BUG] Panic due to incorrect missingContrecPtr after promotion  (Robert Haas <robertmhaas@gmail.com>)
Re: [BUG] Panic due to incorrect missingContrecPtr after promotion  (Michael Paquier <michael@paquier.xyz>)
Список pgsql-hackers
At Mon, 20 Jun 2022 11:57:20 -0400, Robert Haas <robertmhaas@gmail.com> wrote in 
> It seems to me that what we want to do is: if we're about to start
> allowing WAL writes, then consider whether to insert an aborted
> contrecord record. Now, if we are about to start allowing WAL write,
> we must determine the LSN at which the first such record will be
> written. Then there are two possibilities: either that LSN is on an
> existing record boundary, or it isn't. In the former case, no aborted
> contrecord record is required. In the latter case, we're writing at
> LSN which would have been in the middle of a previous record, except
> that the record in question was never finished or at least we don't
> have all of it. So the first WAL we insert at our chosen starting LSN
> must be an aborted contrecord record.

Right.

> I'm not quite sure I understand the nature of the remaining bug here,
> but this logic seems quite sketchy to me:
> 
>             /*
>              * When not in standby mode we find that WAL ends in an incomplete
>              * record, keep track of that record.  After recovery is done,
>              * we'll write a record to indicate to downstream WAL readers that
>              * that portion is to be ignored.
>              */
>             if (!StandbyMode &&
>                 !XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))
>             {
>                 abortedRecPtr = xlogreader->abortedRecPtr;
>                 missingContrecPtr = xlogreader->missingContrecPtr;
>             }
> 
> I don't see what StandbyMode has to do with anything here. The
> question is whether the place where we're going to begin writing WAL
> is on an existing record boundary or not. If it so happens that when
> StandbyMode is turned on we can never decide to promote in the middle
> of a record, then maybe there's no need to track this information when
> StandbyMode = true, but I don't see a good reason why we should make
> that assumption. What if in the future somebody added a command that

Right.  I came to the same conclusion before reading this section. It
is rearer than other cases but surely possible.

> says "don't keep trying to read more WAL, just promote RIGHT HERE?". I
> think this logic would surely be incorrect in that case. It feels to
> me like the right thing to do is to always keep track if WAL ends in
> an incomplete record, and then when we promote, we write an aborted
> contrecord record if WAL ended in an incomplete record, full stop.

Agreed. Actually, with the second patch applied, removing !StandbyMode
from the condition makes no difference in behavior.

> Now, we do need to keep in mind that, while in StandbyMode, we might
> reach the end of WAL more than once, because we have a polling loop
> that looks for more WAL. So it does not work to just set the values
> once and then assume that's the whole truth forever. But why not
> handle that by storing the abortedRecPtr and missingContrecPtr
> unconditionally after every call to XLogPrefetcherReadRecord()? They
> will go back and forth between XLogRecPtrIsInvalid and other values
> many times but the last value should -- I think -- be however things
> ended up at the point where we decided to stop reading WAL.

Unfortunately it doesn't work because we read a record already known
to be complete again at the end of recovery.  It is the reason of
"abortedRecPtr < xlogreader->EndRecPtr" in my PoC patch.  Without it,
abrotedRecPtr is erased when it is actually needed.  I don't like that
expedient-looking condition, but the strict condition for resetting
abortedRecPtr is iff "we have read a complete record at the same or
grater LSN ofabortedRecPtr"...

Come to think of this, I noticed that we can get rid of the
file-global abortedRecPtr by letting FinishWalRecovery copy
abortedRecPtr *before* reading the last record. This also allows us to
remove the "expedient" condition.


> I haven't really dived into this issue much so I might be totally off
> base, but it just doesn't feel right to me that this should depend on
> whether we're in standby mode. That seems at best incidentally related
> to whether an aborted contrecord record is required.
> 
> P.S. "aborted contrecord record" is really quite an awkward turn of phrase!

Thats true!  Always open for better phrasings:(

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Thomas Munro
Дата:
Сообщение: Re: [PATCH] Completed unaccent dictionary with many missing characters
Следующее
От: Peter Smith
Дата:
Сообщение: Re: Perform streaming logical transactions by background workers and parallel apply