Обсуждение: Improve read_local_xlog_page_guts by replacing polling with latch-based waiting

Поиск
Список
Период
Сортировка

Improve read_local_xlog_page_guts by replacing polling with latch-based waiting

От
Xuneng Zhou
Дата:
Hi hackers,

During a performance run [1], I observed heavy polling in
read_local_xlog_page_guts(). Heikki’s comment from a few months ago
suggests replacing the current check–sleep–repeat loop with the
condition-variable (CV) infrastructure used by the walsender:

1) Problem and Background
/*
 * Loop waiting for xlog to be available if necessary
 *
 * TODO: The walsender has its own version of this function, which uses a
 * condition variable to wake up whenever WAL is flushed. We could use the
 * same infrastructure here, instead of the check/sleep/repeat style of
 * loop.
 */

Because read_local_xlog_page_guts() waits for a specific flush or
replay LSN, polling becomes inefficient when waits are long. I built a
POC patch that swaps polling for CVs, but a single global CV (or even
separate “flush” and “replay” CVs) isn’t ideal:
• The wake-up routines don’t know which LSN each waiter cares about,
so they would need to broadcast on every flush/replay.

• Caching the minimum outstanding target LSN could reduce spurious
wake-ups but won’t eliminate them when multiple backends wait for
different LSNs simultaneously.

• The walsender accepts some broadcast overhead via two CVs for
different waiters. A more precise approach would require a request
queue that maps waiters to target LSNs and issues targeted
wake-ups—adding complexity.

2) Proposal
I came across the thread “Implement waiting for WAL LSN replay:
reloaded” [2] by Alexander. The “Implement WAIT FOR” patch in that
thread provides a well-established infrastructure for waiting on WAL
replay in backends. With modest adjustments, it could be generalized.

Main changes in patch v1 Improve read_local_xlog_page_guts by replacing polling
with latch-based waiting:
• Introduce WaitForLSNFlush, analogous to WaitForLSNReplay from the
“WAIT FOR” work.

• Replace the busy-wait in read_local_xlog_page_guts() with
WaitForLSNFlush and WaitForLSNReplay.

• Add wake-up calls in XLogFlush and XLogBackgroundFlush.

Edge Case: Timeline Switch During Wait
/*
* Check which timeline to get the record from.
*
* We have to do it each time through the loop because if we're in
* recovery as a cascading standby, the current timeline might've
* become historical. We can't rely on RecoveryInProgress() because in
* a standby configuration like
*
* A => B => C
*
* if we're a logical decoding session on C, and B gets promoted, our
* timeline will change while we remain in recovery.
*
* We can't just keep reading from the old timeline as the last WAL
* archive in the timeline will get renamed to .partial by
* StartupXLOG().

read_local_xlog_page_guts() re-evaluates the active timeline on each
loop iteration because, on a cascading standby, the current timeline
can become historical. Once that happens, there’s no need to keep
waiting for that timeline. A timeline switch could therefore render an
in-progress wait unnecessary.

One option is to add a wake-up at the point where the timeline switch
occurs, so waiting processes exit promptly. The current approach
chooses not to do this, given that most waits are short and timeline
changes in cascading standby are rare.  Supporting timeline-switch
wake-ups would also require additional handling in both
WaitForLSNFlush and WaitForLSNReplay, increasing complexity.

Comments and suggestions are welcome.

[1] https://www.postgresql.org/message-id/CABPTF7VuFYm9TtA9vY8ZtS77qsT+yL_HtSDxUFnW3XsdB5b9ew@mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO%2BBBjcirozJ6nYbOW8Q%40mail.gmail.com

Best,
Xuneng

Вложения

Re: Improve read_local_xlog_page_guts by replacing polling with latch-based waiting

От
Xuneng Zhou
Дата:
Hi,

Attached the wrong patch
v1-0001-Improve-read_local_xlog_page_guts-by-replacing-po.patch. The
correct one is attached again.

On Wed, Aug 27, 2025 at 11:23 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> Hi hackers,
>
> During a performance run [1], I observed heavy polling in
> read_local_xlog_page_guts(). Heikki’s comment from a few months ago
> suggests replacing the current check–sleep–repeat loop with the
> condition-variable (CV) infrastructure used by the walsender:
>
> 1) Problem and Background
> /*
>  * Loop waiting for xlog to be available if necessary
>  *
>  * TODO: The walsender has its own version of this function, which uses a
>  * condition variable to wake up whenever WAL is flushed. We could use the
>  * same infrastructure here, instead of the check/sleep/repeat style of
>  * loop.
>  */
>
> Because read_local_xlog_page_guts() waits for a specific flush or
> replay LSN, polling becomes inefficient when waits are long. I built a
> POC patch that swaps polling for CVs, but a single global CV (or even
> separate “flush” and “replay” CVs) isn’t ideal:
> • The wake-up routines don’t know which LSN each waiter cares about,
> so they would need to broadcast on every flush/replay.
>
> • Caching the minimum outstanding target LSN could reduce spurious
> wake-ups but won’t eliminate them when multiple backends wait for
> different LSNs simultaneously.
>
> • The walsender accepts some broadcast overhead via two CVs for
> different waiters. A more precise approach would require a request
> queue that maps waiters to target LSNs and issues targeted
> wake-ups—adding complexity.
>
> 2) Proposal
> I came across the thread “Implement waiting for WAL LSN replay:
> reloaded” [2] by Alexander. The “Implement WAIT FOR” patch in that
> thread provides a well-established infrastructure for waiting on WAL
> replay in backends. With modest adjustments, it could be generalized.
>
> Main changes in patch v1 Improve read_local_xlog_page_guts by replacing polling
> with latch-based waiting:
> • Introduce WaitForLSNFlush, analogous to WaitForLSNReplay from the
> “WAIT FOR” work.
>
> • Replace the busy-wait in read_local_xlog_page_guts() with
> WaitForLSNFlush and WaitForLSNReplay.
>
> • Add wake-up calls in XLogFlush and XLogBackgroundFlush.
>
> Edge Case: Timeline Switch During Wait
> /*
> * Check which timeline to get the record from.
> *
> * We have to do it each time through the loop because if we're in
> * recovery as a cascading standby, the current timeline might've
> * become historical. We can't rely on RecoveryInProgress() because in
> * a standby configuration like
> *
> * A => B => C
> *
> * if we're a logical decoding session on C, and B gets promoted, our
> * timeline will change while we remain in recovery.
> *
> * We can't just keep reading from the old timeline as the last WAL
> * archive in the timeline will get renamed to .partial by
> * StartupXLOG().
>
> read_local_xlog_page_guts() re-evaluates the active timeline on each
> loop iteration because, on a cascading standby, the current timeline
> can become historical. Once that happens, there’s no need to keep
> waiting for that timeline. A timeline switch could therefore render an
> in-progress wait unnecessary.
>
> One option is to add a wake-up at the point where the timeline switch
> occurs, so waiting processes exit promptly. The current approach
> chooses not to do this, given that most waits are short and timeline
> changes in cascading standby are rare.  Supporting timeline-switch
> wake-ups would also require additional handling in both
> WaitForLSNFlush and WaitForLSNReplay, increasing complexity.
>
> Comments and suggestions are welcome.
>
> [1] https://www.postgresql.org/message-id/CABPTF7VuFYm9TtA9vY8ZtS77qsT+yL_HtSDxUFnW3XsdB5b9ew@mail.gmail.com
> [2] https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO%2BBBjcirozJ6nYbOW8Q%40mail.gmail.com
>
> Best,
> Xuneng

Вложения

Re: Improve read_local_xlog_page_guts by replacing polling with latch-based waiting

От
Xuneng Zhou
Дата:
Hi,

Some changes in v3:
1) Update the note of xlogwait.c to reflect the extending of its use
for flush waiting and internal use for both flush and replay waiting.
2) Update the comment above logical_read_xlog_page which describes the
prior-change behavior of read_local_xlog_page.

Best,
Xuneng

Вложения

Re: Improve read_local_xlog_page_guts by replacing polling with latch-based waiting

От
Xuneng Zhou
Дата:
Hi,

On Thu, Aug 28, 2025 at 4:22 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> Hi,
>
> Some changes in v3:
> 1) Update the note of xlogwait.c to reflect the extending of its use
> for flush waiting and internal use for both flush and replay waiting.
> 2) Update the comment above logical_read_xlog_page which describes the
> prior-change behavior of read_local_xlog_page.

In an off-list discussion, Alexander pointed out potential issues with
the current single-heap design for replay and flush when promotion
occurs concurrently with WAIT FOR. The following is a simple example
illustrating the problem:

During promotion, there's a window where we can have mixed waiter
types in the same heap:

  T1: Process A calls read_local_xlog_page_guts on standby
  T2: RecoveryInProgress() = TRUE, adds to heap as replay waiter
  T3: Promotion begins
  T4: EndRecovery() calls WaitLSNWakeup(InvalidXLogRecPtr)
  T5: SharedRecoveryState = RECOVERY_STATE_DONE
  T6: Process B calls read_local_xlog_page_guts
  T7: RecoveryInProgress() = FALSE, adds to SAME heap as flush waiter

The problem is that replay LSNs and flush LSNs represent different
positions in the WAL stream. Having both types in the same heap can
lead to:
  - Incorrect wakeup logic (comparing incomparable LSNs)
  - Processes waiting forever
  - Wrong waiters being woken up

To avoid this problem, patch v4 is updated to utilize two separate
heaps for flush and replay like Alexander suggested earlier.  It also
introduces a new separate min LSN tracking field for flushing.

Best,
Xuneng

Вложения

Re: Improve read_local_xlog_page_guts by replacing polling with latch-based waiting

От
Xuneng Zhou
Дата:
Hi,

On Sun, Sep 28, 2025 at 9:47 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> Hi,
>
> On Thu, Aug 28, 2025 at 4:22 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> >
> > Hi,
> >
> > Some changes in v3:
> > 1) Update the note of xlogwait.c to reflect the extending of its use
> > for flush waiting and internal use for both flush and replay waiting.
> > 2) Update the comment above logical_read_xlog_page which describes the
> > prior-change behavior of read_local_xlog_page.
>
> In an off-list discussion, Alexander pointed out potential issues with
> the current single-heap design for replay and flush when promotion
> occurs concurrently with WAIT FOR. The following is a simple example
> illustrating the problem:
>
> During promotion, there's a window where we can have mixed waiter
> types in the same heap:
>
>   T1: Process A calls read_local_xlog_page_guts on standby
>   T2: RecoveryInProgress() = TRUE, adds to heap as replay waiter
>   T3: Promotion begins
>   T4: EndRecovery() calls WaitLSNWakeup(InvalidXLogRecPtr)
>   T5: SharedRecoveryState = RECOVERY_STATE_DONE
>   T6: Process B calls read_local_xlog_page_guts
>   T7: RecoveryInProgress() = FALSE, adds to SAME heap as flush waiter
>
> The problem is that replay LSNs and flush LSNs represent different
> positions in the WAL stream. Having both types in the same heap can
> lead to:
>   - Incorrect wakeup logic (comparing incomparable LSNs)
>   - Processes waiting forever
>   - Wrong waiters being woken up
>
> To avoid this problem, patch v4 is updated to utilize two separate
> heaps for flush and replay like Alexander suggested earlier.  It also
> introduces a new separate min LSN tracking field for flushing.
>

v5-0002 separates the waitlsn_cmp() comparator function into two distinct
functions (waitlsn_replay_cmp and waitlsn_flush_cmp) for the replay
and flush heaps, respectively.

Best,
Xuneng

Вложения