Re: BUG #17928: Standby fails to decode WAL on termination of primary

From: Michael Paquier
Subject: Re: BUG #17928: Standby fails to decode WAL on termination of primary
Date:
Msg-id: ZPVU9FwENbnO6Q++@paquier.xyz
In reply to: Re: BUG #17928: Standby fails to decode WAL on termination of primary  (Thomas Munro <thomas.munro@gmail.com>)
Responses: Re: BUG #17928: Standby fails to decode WAL on termination of primary  (Thomas Munro <thomas.munro@gmail.com>)
List: pgsql-bugs

On Mon, Sep 04, 2023 at 03:20:31PM +1200, Thomas Munro wrote:
> 1.  In the place where we fail to allocate memory for an oversized
> record, I copied the comment about treating that as a "bogus data"
> condition.  I suspect that we will soon be converting that to a FATAL
> error[1], and that'll need to be done in both places.

You mean for the two callers of XLogReadRecordAlloc(), even for the
case where !allow_oversized?  Using a FATAL on non-FRONTEND would be
the quickest fix, indeed, but there are arguments for letting standbys
continue as well.  That would be an improvement over
the always-FATAL on OOM, of course.
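
To illustrate the distinction, here is a rough standalone sketch (not
the actual xlogreader.c code; read_record_alloc() and the RR_* status
codes are made up for the example).  The point is only that an
allocation failure gets its own status, so a backend caller can choose
FATAL or a retry instead of folding it into the bogus-data path:

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical status codes a record reader could hand back to its caller. */
typedef enum
{
	RR_OK,				/* record buffer allocated, decoding can continue */
	RR_BOGUS_DATA,		/* header failed validation: treated as end of WAL */
	RR_OUT_OF_MEMORY	/* allocation failed: caller decides what to do */
} read_result;

/*
 * Sketch of an allocation helper: an allocation failure is reported as
 * its own status instead of being folded into the "bogus data" path, so
 * a backend caller can FATAL (or retry) while a frontend can bail out.
 */
static read_result
read_record_alloc(size_t record_len, void **out)
{
	*out = malloc(record_len);
	if (*out == NULL)
		return RR_OUT_OF_MEMORY;
	return RR_OK;
}

int
main(void)
{
	void	   *buf;

	switch (read_record_alloc(1024, &buf))
	{
		case RR_OK:
			printf("record buffer allocated\n");
			free(buf);
			break;
		case RR_OUT_OF_MEMORY:
			fprintf(stderr, "out of memory: FATAL or retry, not end of WAL\n");
			return 1;
		case RR_BOGUS_DATA:
			fprintf(stderr, "invalid data: end of WAL\n");
			break;
	}
	return 0;
}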

> 2.  In this version of the commit message I said we'd only back-patch
> to 15 for now.  After sleeping on this for a week, I realised that the
> reason I keep vacillating on that point is that I am not sure what
> your plan is for the malloc-failure-means-end-of-wal policy ([1],
> ancient code from 0ffe11abd3a).  If we're going to fix that in master
> only but let sleeping dogs lie in the back-branches, then it becomes
> less important to go back further than 15 with THIS patch.

For HEAD, my plan was to give more control to the callers of the WAL
reader, which is not something that could happen in back-branches
because it requires a few structural changes in the error reporting
the reader is able to do.  That's the standby vs crash-recovery
behavior, where we can make the startup process not FATAL all the
time, because it should just loop forever if it cannot process a
record.  It does not fix the root issue discussed here, of course, but
it makes the OOM handling friendlier in some replay cases.
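
As a rough illustration of that replay behavior (again a made-up,
self-contained sketch, reusing the RR_* names from the sketch above;
try_read_next_record() just simulates a transient OOM), a standby-style
loop would retry on OOM rather than treat it as the end of WAL:

#include <stdio.h>
#include <unistd.h>

/* Same kind of made-up status codes as in the previous sketch. */
typedef enum
{
	RR_OK,
	RR_BOGUS_DATA,
	RR_OUT_OF_MEMORY
} read_result;

/* Stand-in for reading the next record; pretends to hit OOM twice. */
static read_result
try_read_next_record(void)
{
	static int	attempts = 0;

	return (attempts++ < 2) ? RR_OUT_OF_MEMORY : RR_OK;
}

int
main(void)
{
	/*
	 * Standby-style replay loop: an OOM is transient, so back off and
	 * retry rather than concluding that the end of WAL was reached,
	 * which would end recovery too early.
	 */
	for (;;)
	{
		read_result r = try_read_next_record();

		if (r == RR_OUT_OF_MEMORY)
		{
			fprintf(stderr, "could not allocate record buffer, retrying\n");
			sleep(1);
			continue;
		}
		if (r == RR_BOGUS_DATA)
		{
			fprintf(stderr, "invalid record: stopping replay here\n");
			break;
		}
		printf("record replayed\n");
		break;
	}
	return 0;
}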

> But if you
> want to be able to distinguish garbage from out-of-memory, and thereby
> end-of-wal from a FATAL please-insert-more-RAM condition, I think
> you'd really need this industrial strength validation in all affected
> branches, and I'd have more work to do, right?  The weak validation we
> are fixing here is the *real* underlying problem going back many
> years, right?

Getting the same validation checks for all the branches would be nice.
FATAL-ing on OOM to force recovery to happen again is a better option
than assuming that it is the end of recovery.  I am OK with providing
patches for all the branches for the sake of this thread, if that
helps.  Switching to a hard FATAL on OOM for the WAL reader in the
backend is backpatchable, but I'd rather consider that on a different
thread once the better checks for the record header are in place.
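
For the record-header checks, the idea is simply to validate the length
field before trusting it for an allocation, so that a later allocation
failure really means out of memory.  A simplified, standalone sketch
(rec_header, MAX_SANE_RECORD_LEN and header_looks_valid() are invented
for the example; the real structure is XLogRecord):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Invented, simplified record header; the real one is XLogRecord. */
typedef struct
{
	uint32_t	tot_len;	/* total record length, header included */
	uint8_t		info;
} rec_header;

#define MAX_SANE_RECORD_LEN (1024 * 1024 * 1024)	/* arbitrary cap */

/*
 * Validate what can be validated from the header *before* trusting
 * tot_len enough to allocate memory for the whole record.  With weak
 * validation, a garbage length only surfaces later as an allocation
 * failure, which is then indistinguishable from a real OOM.
 */
static bool
header_looks_valid(const rec_header *hdr)
{
	if (hdr->tot_len < sizeof(rec_header))
		return false;		/* shorter than its own header: garbage */
	if (hdr->tot_len > MAX_SANE_RECORD_LEN)
		return false;		/* implausibly large: garbage */
	return true;
}

int
main(void)
{
	rec_header	hdr = {.tot_len = UINT32_MAX, .info = 0};

	if (!header_looks_valid(&hdr))
	{
		fprintf(stderr, "bogus record header: treat as end of WAL\n");
		return 0;
	}

	/* Only now does an allocation failure really mean out of memory. */
	void	   *buf = malloc(hdr.tot_len);

	if (buf == NULL)
	{
		fprintf(stderr, "out of memory while decoding a record\n");
		return 1;
	}
	free(buf);
	return 0;
}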

> I also wondered about strengthening the validation of various things
> like redo begin/end LSNs etc in these tests.  But we can always
> continue to improve all this later...

Agreed.
--
Michael
