Timeline switching with partial WAL records can break replica recovery

From	D Laaren
Subject	Timeline switching with partial WAL records can break replica recovery
Date	15 June 08:11:24
Msg-id	CAGWv16KJg=SHKxwTWxGFo9zibmEompk=bfK0MRHTEh8NipoyHQ@mail.gmail.com
List	pgsql-hackers

Hi Hackers,

I've encountered an issue where physical replicas may enter an infinite WAL
request loop under the following conditions:
1. A promotion occurs while a multi-block WAL record is being written.
2. The resulting timeline is very short (it begins and ends in the same WAL
   block).

== Current behavior ==
When a WAL record crosses a block boundary and its write is interrupted by a
promotion:
1. The partial record remains in the original timeline's WAL and is not copied
   to the next timeline.
2. The new timeline begins at the end of the last valid completed record.
3. A replica consuming the WAL stream may then:
   a) encounter the partial record in timeline N,
   b) fetch the incomplete record from walrcv during recovery and ask for
      more data,
   c) fetch the timeline history, which includes timeline N+1,
   d) request the same record from timeline N+1,
   e) receive data up to the end of timeline N+1 (which, notably, does not
      contain this record),
   f) repeat indefinitely for the incomplete record on timeline N+1.
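
The loop in steps (a) to (f) can be sketched as a toy model. This is not
PostgreSQL code: the tuple layout, the function names and the LSN values are
all hypothetical, and the model only assumes what is described above, namely
that the replica keeps asking for the same record start and that each timeline
can only serve data up to a fixed LSN.

```python
from typing import List, NamedTuple, Tuple


class Timeline(NamedTuple):
    tli: int             # timeline ID
    start: int           # LSN at which this timeline begins
    available_upto: int  # how far WAL can be streamed from this timeline


def owning_timeline(history: List[Timeline], lsn: int) -> Timeline:
    """Newest timeline whose start point is at or before lsn."""
    candidates = [tl for tl in history if tl.start <= lsn]
    return max(candidates, key=lambda tl: tl.tli)


def replay_attempts(history: List[Timeline], current_tli: int,
                    record_start: int, record_end: int,
                    max_rounds: int) -> Tuple[List[Tuple[int, int]], bool]:
    """Return the (tli, received_upto) pair of every round and whether the
    record was ever received in full."""
    attempts = []
    current = next(tl for tl in history if tl.tli == current_tli)
    for _ in range(max_rounds):
        attempts.append((current.tli, current.available_upto))
        if current.available_upto >= record_end:
            return attempts, True
        # Incomplete record: consult the history and re-request the same
        # record start from the timeline that owns that LSN.
        current = owning_timeline(history, record_start)
    return attempts, False


# Hypothetical LSNs: the record starts at 8000 and needs data up to 8500,
# timeline 1 was written up to the block boundary at 8192, and timeline 2
# begins at 8000 and ends at 8100 (inside the same block).
history = [Timeline(1, 0, 8192), Timeline(2, 8000, 8100),
           Timeline(3, 8100, 9000)]
attempts, done = replay_attempts(history, 1, 8000, 8500, max_rounds=4)
```

In this model `done` stays false however large `max_rounds` is: after the
first round every request lands on timeline 2, which never reaches the end
LSN the replica is waiting for.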

I hope this illustration helps clarify the issue:

  block                         block, the record is written                block
  |                             | up to that point                          |
  v                             v                                           v
--+-----------------------------+-------------------------------------------+---
: |        I        I           |            I                              | :
--+-----------------------------+-------------------------------------------+---
           ^        ^                        ^
           |        └ end of TLI = N + 1,    └ the point up to which the record
           |          start of TLI = N + 2     must be fully written;
           |                                   recovery waits for data
           |                                   until this LSN.
           |
           └ end of TLI = N,
             start of TLI = N + 1,
             start of the record.

== Background theory ==
1. When a record crosses a page boundary, it is logically divided into parts
   called "contrecords".
2. If an incomplete record is detected during crash recovery:
   a) a special XLOG_OVERWRITE_CONTRECORD record is written after the
      incomplete record,
   b) the page header where the rest of the record would reside is flagged
      with XLP_FIRST_IS_OVERWRITTEN_CONTRECORD,
   c) this ensures PostgreSQL gracefully ignores the incomplete record in
      subsequent recovery attempts.
3. If an incomplete record is detected during promotion recovery:
   a) the new timeline starts at the end of the last valid record (before the
      incomplete one),
   b) the last block containing the record is copied to the new timeline, but
      the record itself is zeroed out,
   c) the new timeline contains neither the XLOG_OVERWRITE_CONTRECORD marker
      nor the XLP_FIRST_IS_OVERWRITTEN_CONTRECORD page header flag.
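
To make point 1 concrete, here is a rough sketch of how a record is cut into
pieces at page boundaries. It is a simplification under stated assumptions:
the default 8192-byte WAL block, a 24-byte short page header on every page
(the longer header at the start of a segment and record alignment are
ignored), and a hypothetical helper name.

```python
from typing import List, Tuple

XLOG_BLCKSZ = 8192   # default WAL block size
PAGE_HEADER = 24     # assumed short page header size; long headers ignored


def split_record(start_lsn: int, total_len: int) -> List[Tuple[int, int]]:
    """Return the (lsn, nbytes) pieces a record of total_len bytes occupies.

    Every piece after the first one is a continuation on a following page.
    """
    pieces = []
    lsn = start_lsn
    remaining = total_len
    while remaining > 0:
        page_start = lsn - lsn % XLOG_BLCKSZ
        if lsn == page_start:
            lsn += PAGE_HEADER          # data resumes after the page header
        room = page_start + XLOG_BLCKSZ - lsn
        nbytes = min(room, remaining)
        pieces.append((lsn, nbytes))
        lsn += nbytes
        remaining -= nbytes
    return pieces


# A 500-byte record that starts 192 bytes before the end of the first block.
pieces = split_record(8000, 500)
```

Here the record is split into 192 bytes at the end of the first block and 308
bytes after the header of the second block. If only the first piece reaches
disk before a crash or promotion, the record is incomplete in the sense used
in points 2 and 3.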

== Historical context ==
1. The initial contrecord implementation [1] proposed:
   a) copying the last block containing the last valid record to the new
      timeline, and zeroing the partially written record,
   b) writing XLOG_OVERWRITE_CONTRECORD in the new timeline.
2. Later, XLOG_OVERWRITE_CONTRECORD writing was avoided due to concerns about
   zero-gaps in WAL [2].

== Proposed Solution ==
I propose preserving WAL's append-only, linear nature by gracefully handling
incomplete records during timeline switches:
1. Timeline finalization:
   Before switching timelines, write an XLOG_OVERWRITE_CONTRECORD record to
   mark the incomplete record in the current timeline. Only then initialize
   the new timeline and continue recovery. Since no concurrent WAL writes can
   occur during this phase, the operation is safe.
2. Concurrent readers of InsertTLI:
   Other processes may read InsertTLI from shared memory during the switch and
   so see the current timeline instead of the new one we're switching to.
   This can happen in the walsummarizer process. If the walsummarizer fetches
   the InsertTLI of the current timeline (which is set while the
   XLOG_OVERWRITE_CONTRECORD record is being written), the result can be
   latest_lsn < read_upto (which becomes walrcv->flushedUpto in this case),
   triggering an assertion failure. The assertion exists to ensure that the
   read-up-to LSN is updated correctly. We can safely handle this edge case
   by replacing the assertion with an if-statement.
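
The patch itself is not reproduced in this message, so the following is only
one plausible reading of "replacing the assertion with an if-statement",
written as a Python model rather than as the C code it would be in
PostgreSQL. The function name is hypothetical; latest_lsn and read_upto are
the values named above.

```python
def advance_read_upto(read_upto: int, latest_lsn: int) -> int:
    """Move read_upto forward to latest_lsn, but never backwards.

    An assertion would abort when latest_lsn < read_upto; this guard keeps
    the previous limit in that case (assumed behaviour, not taken from the
    patch).
    """
    if latest_lsn >= read_upto:
        return latest_lsn
    return read_upto


# Normal case: the limit advances to the newer LSN.
normal = advance_read_upto(read_upto=8100, latest_lsn=9000)
# Edge case from the message: latest_lsn is behind read_upto.
edge = advance_read_upto(read_upto=9000, latest_lsn=8100)
```

In the edge case the guard leaves the limit where it was instead of failing,
which is the behaviour the proposal appears to be after.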

I wrote a TAP test to reproduce the bug, but it doesn't trigger the issue
consistently. To help investigate, I've also added extra logging with some
debug output.

I would appreciate review and feedback.

[1] https://postgr.es/m/202108232252.dh7uxf6oxwcy@alvherre.pgsql
[2] http://postgr.es/m/CAFiTN-t7umki=PK8dT1tcPV=mOUe2vNhHML6b3T7W7qqvvajjg@mail.gmail.com

--
Regards,
Alyona Vinter
