Thread: Improve read_local_xlog_page_guts by replacing polling with latch-based waiting
Improve read_local_xlog_page_guts by replacing polling with latch-based waiting
From
Xuneng Zhou
Date:
Hi hackers,

During a performance run [1], I observed heavy polling in read_local_xlog_page_guts(). Heikki’s comment from a few months ago suggests replacing the current check/sleep/repeat loop with the condition-variable (CV) infrastructure used by the walsender:

1) Problem and Background

/*
 * Loop waiting for xlog to be available if necessary
 *
 * TODO: The walsender has its own version of this function, which uses a
 * condition variable to wake up whenever WAL is flushed. We could use the
 * same infrastructure here, instead of the check/sleep/repeat style of
 * loop.
 */

Because read_local_xlog_page_guts() waits for a specific flush or replay LSN, polling becomes inefficient when waits are long. I built a POC patch that swaps polling for CVs, but a single global CV (or even separate “flush” and “replay” CVs) isn’t ideal:

• The wake-up routines don’t know which LSN each waiter cares about, so they would need to broadcast on every flush/replay.
• Caching the minimum outstanding target LSN could reduce spurious wake-ups but won’t eliminate them when multiple backends wait for different LSNs simultaneously.
• The walsender accepts some broadcast overhead via two CVs for different waiters. A more precise approach would require a request queue that maps waiters to target LSNs and issues targeted wake-ups, which adds complexity.

2) Proposal

I came across the thread “Implement waiting for WAL LSN replay: reloaded” [2] by Alexander. The “Implement WAIT FOR” patch in that thread provides a well-established infrastructure for waiting on WAL replay in backends. With modest adjustments, it could be generalized.

Main changes in patch v1 (Improve read_local_xlog_page_guts by replacing polling with latch-based waiting):

• Introduce WaitForLSNFlush, analogous to WaitForLSNReplay from the “WAIT FOR” work.
• Replace the busy-wait in read_local_xlog_page_guts() with WaitForLSNFlush and WaitForLSNReplay.
• Add wake-up calls in XLogFlush and XLogBackgroundFlush.
Edge Case: Timeline Switch During Wait

/*
 * Check which timeline to get the record from.
 *
 * We have to do it each time through the loop because if we're in
 * recovery as a cascading standby, the current timeline might've
 * become historical. We can't rely on RecoveryInProgress() because in
 * a standby configuration like
 *
 * A => B => C
 *
 * if we're a logical decoding session on C, and B gets promoted, our
 * timeline will change while we remain in recovery.
 *
 * We can't just keep reading from the old timeline as the last WAL
 * archive in the timeline will get renamed to .partial by
 * StartupXLOG().

read_local_xlog_page_guts() re-evaluates the active timeline on each loop iteration because, on a cascading standby, the current timeline can become historical. Once that happens, there’s no need to keep waiting for that timeline. A timeline switch could therefore render an in-progress wait unnecessary.

One option is to add a wake-up at the point where the timeline switch occurs, so waiting processes exit promptly. The current approach chooses not to do this, given that most waits are short and timeline changes in cascading standby are rare. Supporting timeline-switch wake-ups would also require additional handling in both WaitForLSNFlush and WaitForLSNReplay, increasing complexity.

Comments and suggestions are welcome.

[1] https://www.postgresql.org/message-id/CABPTF7VuFYm9TtA9vY8ZtS77qsT+yL_HtSDxUFnW3XsdB5b9ew@mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO%2BBBjcirozJ6nYbOW8Q%40mail.gmail.com

Best,
Xuneng
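For illustration, the "request queue that maps waiters to target LSNs" idea from the message above can be sketched in plain C. This is a minimal sketch under stated assumptions, not the patch's code: Lsn, Waiter, WaitQueue, queue_add and queue_wakeup are hypothetical names, the queue is a fixed-size binary min-heap (not necessarily the heap implementation the patch uses), and there is no locking or latch handling. A real implementation would set each waiter's latch instead of returning IDs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Lsn;           /* hypothetical stand-in for XLogRecPtr */

#define MAX_WAITERS 16

typedef struct Waiter
{
    int id;                     /* hypothetical backend identifier */
    Lsn target;                 /* LSN this waiter needs to be reached */
} Waiter;

typedef struct WaitQueue
{
    Waiter items[MAX_WAITERS];  /* binary min-heap ordered by target */
    size_t count;
} WaitQueue;

static void
swap_waiters(Waiter *a, Waiter *b)
{
    Waiter tmp = *a;

    *a = *b;
    *b = tmp;
}

/* Register a waiter; returns 0 on success, -1 if the queue is full. */
static int
queue_add(WaitQueue *q, int id, Lsn target)
{
    size_t i;

    if (q->count >= MAX_WAITERS)
        return -1;
    i = q->count++;
    q->items[i].id = id;
    q->items[i].target = target;

    /* Sift up until the parent's target is not larger. */
    while (i > 0)
    {
        size_t parent = (i - 1) / 2;

        if (q->items[parent].target <= q->items[i].target)
            break;
        swap_waiters(&q->items[parent], &q->items[i]);
        i = parent;
    }
    return 0;
}

/* Remove and return the waiter with the smallest target LSN. */
static Waiter
queue_pop_min(WaitQueue *q)
{
    Waiter top;
    size_t i = 0;

    assert(q->count > 0);
    top = q->items[0];
    q->items[0] = q->items[--q->count];

    /* Sift down to restore the heap property. */
    for (;;)
    {
        size_t left = 2 * i + 1;
        size_t right = 2 * i + 2;
        size_t smallest = i;

        if (left < q->count && q->items[left].target < q->items[smallest].target)
            smallest = left;
        if (right < q->count && q->items[right].target < q->items[smallest].target)
            smallest = right;
        if (smallest == i)
            break;
        swap_waiters(&q->items[i], &q->items[smallest]);
        i = smallest;
    }
    return top;
}

/*
 * Wake only the waiters whose target has been reached. Their IDs are
 * written to woken[]; waiters with a later target stay queued untouched.
 */
static size_t
queue_wakeup(WaitQueue *q, Lsn reached, int *woken, size_t woken_cap)
{
    size_t n = 0;

    while (q->count > 0 && q->items[0].target <= reached && n < woken_cap)
        woken[n++] = queue_pop_min(q).id;
    return n;
}
```

Because the heap root always holds the smallest outstanding target, a flush or replay that has not yet reached it costs one comparison and wakes nobody, which is the property a broadcast on a shared CV cannot provide.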
Attachments
Re: Improve read_local_xlog_page_guts by replacing polling with latch-based waiting
From
Xuneng Zhou
Date:
Hi,

I attached the wrong patch v1-0001-Improve-read_local_xlog_page_guts-by-replacing-po.patch. The correct one is attached again.

On Wed, Aug 27, 2025 at 11:23 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> Hi hackers,
> [...]
Attachments
Re: Improve read_local_xlog_page_guts by replacing polling with latch-based waiting
From
Xuneng Zhou
Date:
Hi,

Some changes in v3:
1) Update the note in xlogwait.c to reflect that it now also covers flush waiting and is used internally for both flush and replay waiting.
2) Update the comment above logical_read_xlog_page, which described the pre-patch behavior of read_local_xlog_page.

Best,
Xuneng
Attachments
Re: Improve read_local_xlog_page_guts by replacing polling with latch-based waiting
From
Xuneng Zhou
Date:
Hi,

On Thu, Aug 28, 2025 at 4:22 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> Some changes in v3:
> 1) Update the note of xlogwait.c to reflect the extending of its use
> for flush waiting and internal use for both flush and replay waiting.
> 2) Update the comment above logical_read_xlog_page which describes the
> prior-change behavior of read_local_xlog_page.

In an off-list discussion, Alexander pointed out potential issues with the current single-heap design for replay and flush when promotion occurs concurrently with WAIT FOR. The following simple example illustrates the problem.

During promotion, there is a window in which waiters of both types can end up in the same heap:

T1: Process A calls read_local_xlog_page_guts on standby
T2: RecoveryInProgress() = TRUE, adds to heap as replay waiter
T3: Promotion begins
T4: EndRecovery() calls WaitLSNWakeup(InvalidXLogRecPtr)
T5: SharedRecoveryState = RECOVERY_STATE_DONE
T6: Process B calls read_local_xlog_page_guts
T7: RecoveryInProgress() = FALSE, adds to SAME heap as flush waiter

The problem is that replay LSNs and flush LSNs represent different positions in the WAL stream. Having both types in the same heap can lead to:
- Incorrect wakeup logic (comparing incomparable LSNs)
- Processes waiting forever
- Wrong waiters being woken up

To avoid this problem, patch v4 now uses two separate heaps for flush and replay, as Alexander suggested earlier. It also introduces a separate min LSN tracking field for flushing.

Best,
Xuneng
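To make the separation concrete, here is a minimal sketch in plain C of per-type wait lists, each with its own cached minimum target. It is an illustration under stated assumptions, not the v4 patch: WaitKind, WaitList, WaitState, wait_add and wait_wakeup are hypothetical names, unsorted arrays stand in for the heaps, and locking and latches are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Lsn;           /* hypothetical stand-in for XLogRecPtr */

#define LSN_MAX UINT64_MAX
#define MAX_WAITERS 8

typedef enum WaitKind
{
    WAIT_REPLAY = 0,
    WAIT_FLUSH = 1,
    WAIT_KIND_COUNT
} WaitKind;

typedef struct WaitList
{
    Lsn targets[MAX_WAITERS];   /* outstanding target LSNs of one kind */
    size_t count;
    Lsn min_target;             /* cached minimum; LSN_MAX when empty */
} WaitList;

typedef struct WaitState
{
    WaitList lists[WAIT_KIND_COUNT];    /* one list per waiter kind */
} WaitState;

static void
wait_state_init(WaitState *s)
{
    for (int k = 0; k < WAIT_KIND_COUNT; k++)
    {
        s->lists[k].count = 0;
        s->lists[k].min_target = LSN_MAX;
    }
}

/* Register a waiter of the given kind; returns -1 if that list is full. */
static int
wait_add(WaitState *s, WaitKind kind, Lsn target)
{
    WaitList *l = &s->lists[kind];

    if (l->count >= MAX_WAITERS)
        return -1;
    l->targets[l->count++] = target;
    if (target < l->min_target)
        l->min_target = target;
    return 0;
}

/*
 * Wake waiters of one kind whose target has been reached and return how
 * many were woken. Waiters of the other kind are never examined, so a
 * flush position is never compared against a replay target.
 */
static size_t
wait_wakeup(WaitState *s, WaitKind kind, Lsn reached)
{
    WaitList *l = &s->lists[kind];
    size_t woken = 0;
    size_t kept = 0;

    /* Fast path: nothing of this kind can be satisfied yet. */
    if (reached < l->min_target)
        return 0;

    l->min_target = LSN_MAX;
    for (size_t i = 0; i < l->count; i++)
    {
        Lsn t = l->targets[i];

        if (t <= reached)
            woken++;
        else
        {
            /* Keep the waiter and recompute the cached minimum. */
            l->targets[kept++] = t;
            if (t < l->min_target)
                l->min_target = t;
        }
    }
    l->count = kept;
    return woken;
}
```

With this layout, a flush wake-up at LSN 600 leaves a replay waiter at LSN 500 untouched, which is the behavior the single-heap design could not guarantee during the promotion window described above.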
Attachments
Re: Improve read_local_xlog_page_guts by replacing polling with latch-based waiting
From
Xuneng Zhou
Date:
Hi,

On Sun, Sep 28, 2025 at 9:47 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> [...]
> To avoid this problem, patch v4 is updated to utilize two separate
> heaps for flush and replay like Alexander suggested earlier. It also
> introduces a new separate min LSN tracking field for flushing.

v5-0002 separates the waitlsn_cmp() comparator function into two distinct functions (waitlsn_replay_cmp and waitlsn_flush_cmp) for the replay and flush heaps, respectively.

Best,
Xuneng