Обсуждение: Use WALReadFromBuffers in more places
Hi, Commit 91f2cae7a4e that introduced WALReadFromBuffers only used it for physical walsenders. It can also be used in more places benefitting logical walsenders, backends running pg_walinspect and logical decoding functions if the WAL is available in WAL buffers. I'm attaching a 0001 patch for this. While at it, I've also added a test module in 0002 patch to demonstrate 2 things: 1) how the caller can ensure the requested WAL is fully copied to WAL buffers using WaitXLogInsertionsToFinish before reading from WAL buffers. 2) how one can implement an xlogreader page_read callback to read unflushed/not-yet-flushed WAL directly from WAL buffers. FWIW, a separate test module to explicitly test the new function is suggested here - https://www.postgresql.org/message-id/CAFiTN-sE7CJn-ZFj%2B-0Wv6TNytv_fp4n%2BeCszspxJ3mt77t5ig%40mail.gmail.com. Please have a look at the attached patches. I will register this for the next commit fest. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Вложения
Hi, Bharath. I've been testing this. It's cool. Is there any way we could
monitor the hit rate about directly reading from WAL buffers by exporting
to some views?
---
Regards, Jingtang
On Wed, May 8, 2024 at 9:51 AM Jingtang Zhang <mrdrivingduck@gmail.com> wrote: > > Hi, Bharath. I've been testing this. It's cool. Is there any way we could > monitor the hit rate about directly reading from WAL buffers by exporting > to some views? Thanks for looking into this. I used purpose-built patches for verifying the WAL buffers hit ratio, please check USE-ON-HEAD-Collect-WAL-read-from-file-stats.txt and USE-ON-PATCH-Collect-WAL-read-from-buffers-and-file-stats.txt at https://www.postgresql.org/message-id/CALj2ACU9cfAcfVsGwUqXMace_7rfSBJ7%2BhXVJfVV1jnspTDGHQ%40mail.gmail.com. In the long run, it's better to extend what's proposed in the thread https://www.postgresql.org/message-id/CALj2ACU_f5_c8F%2BxyNR4HURjG%3DJziiz07wCpQc%3DAqAJUFh7%2B8w%40mail.gmail.com. I haven't had a chance to dive deep into that thread yet, but a quick glance over v8 patch tells me that it hasn't yet implemented the idea of collecting WAL read stats for both walsenders and the backends reading WAL. If that's done, we can extend it for WAL read from WAL buffers. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi Bharath, I spent some time examining the patch. Here are my observations from the review. - I believe there’s no need for an extra variable ‘nbytes’ in this context. We can repurpose the ‘count’ variable for the same function. If necessary, we might think about renaming ‘count’ to ‘nbytes’. - The operations performed by logical_read_xlog_page() and read_local_xlog_page_guts() are identical. It might be beneficial to create a shared function to minimize code repetition. Best Regards, Nitin Jadhav Azure Database for PostgreSQL Microsoft On Mon, May 13, 2024 at 12:17 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Wed, May 8, 2024 at 9:51 AM Jingtang Zhang <mrdrivingduck@gmail.com> wrote: > > > > Hi, Bharath. I've been testing this. It's cool. Is there any way we could > > monitor the hit rate about directly reading from WAL buffers by exporting > > to some views? > > Thanks for looking into this. I used purpose-built patches for > verifying the WAL buffers hit ratio, please check > USE-ON-HEAD-Collect-WAL-read-from-file-stats.txt and > USE-ON-PATCH-Collect-WAL-read-from-buffers-and-file-stats.txt at > https://www.postgresql.org/message-id/CALj2ACU9cfAcfVsGwUqXMace_7rfSBJ7%2BhXVJfVV1jnspTDGHQ%40mail.gmail.com. > In the long run, it's better to extend what's proposed in the thread > https://www.postgresql.org/message-id/CALj2ACU_f5_c8F%2BxyNR4HURjG%3DJziiz07wCpQc%3DAqAJUFh7%2B8w%40mail.gmail.com. > I haven't had a chance to dive deep into that thread yet, but a quick > glance over v8 patch tells me that it hasn't yet implemented the idea > of collecting WAL read stats for both walsenders and the backends > reading WAL. If that's done, we can extend it for WAL read from WAL > buffers. > > -- > Bharath Rupireddy > PostgreSQL Contributors Team > RDS Open Source Databases > Amazon Web Services: https://aws.amazon.com > >
Hi,
On Sat, Jun 8, 2024 at 5:24 PM Nitin Jadhav <nitinjadhavpostgres@gmail.com> wrote:
>
> I spent some time examining the patch. Here are my observations from the review.
Thanks.
> - I believe there’s no need for an extra variable ‘nbytes’ in this
> context. We can repurpose the ‘count’ variable for the same function.
> If necessary, we might think about renaming ‘count’ to ‘nbytes’.
'count' variable can't be altered once determined as the page_read callbacks need to return the total number of bytes read. However, I ended up removing 'nbytes' like in the attached v2 patch.
> - The operations performed by logical_read_xlog_page() and
> read_local_xlog_page_guts() are identical. It might be beneficial to
> create a shared function to minimize code repetition.
IMO, creating another function to just wrap two other functions doesn't seem good to me.
I attached v2 patches for further review. No changes in 0002 patch.
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Sat, Jun 8, 2024 at 5:24 PM Nitin Jadhav <nitinjadhavpostgres@gmail.com> wrote:
>
> I spent some time examining the patch. Here are my observations from the review.
Thanks.
> - I believe there’s no need for an extra variable ‘nbytes’ in this
> context. We can repurpose the ‘count’ variable for the same function.
> If necessary, we might think about renaming ‘count’ to ‘nbytes’.
'count' variable can't be altered once determined as the page_read callbacks need to return the total number of bytes read. However, I ended up removing 'nbytes' like in the attached v2 patch.
> - The operations performed by logical_read_xlog_page() and
> read_local_xlog_page_guts() are identical. It might be beneficial to
> create a shared function to minimize code repetition.
IMO, creating another function to just wrap two other functions doesn't seem good to me.
I attached v2 patches for further review. No changes in 0002 patch.
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Вложения
Hi all.
I've been back to this patch for a while recently. I witness that if a WAL
writer works fast, the already flushed WAL buffers will be zeroed out and
re-initialized for future use by AdvanceXLInsertBuffer in
XLogBackgroundFlush, so that WALReadFromBuffers will miss even though the
space of WAL buffer is enough. It is much more unfriendly for logical
walsenders than physical walsenders, because logical ones consume WAL
slower than physical ones due to the extra decoding phase. Seems that the aim
of AdvanceXLInsertBuffer in WAL writer contradicts with our reading from
WAL buffer. Any thoughts?
Hi, On Tue, Oct 15, 2024 at 1:22 AM Jingtang Zhang <mrdrivingduck@gmail.com> wrote: > > I've been back to this patch for a while recently. I witness that if a WAL > writer works fast, the already flushed WAL buffers will be zeroed out and > re-initialized for future use by AdvanceXLInsertBuffer in > XLogBackgroundFlush, so that WALReadFromBuffers will miss even though the > space of WAL buffer is enough. It is much more unfriendly for logical > walsenders than physical walsenders, because logical ones consume WAL > slower than physical ones due to the extra decoding phase. Seems that the aim > of AdvanceXLInsertBuffer in WAL writer contradicts with our reading from > WAL buffer. Any thoughts? Thanks for looking at this. Yes, the WAL writers can zero out flushed buffers before WALReadFromBuffers gets to them. However, WALReadFromBuffers was intentionally designed as an opportunistic optimization - it's a "try this first, quickly" approach before falling back to reading from WAL files. The no-locks design ensures it never gets in the way of backends generating WAL, which is critical for overall system performance. I rebased and attached the v3 patch. I discarded the test extension patch that demonstrated WALReadFromBuffers' behavior (i.e., waiting for WAL to be fully copied to WAL buffers with WaitXLogInsertionsToFinish), as I believe the comment at the top of WALReadFromBuffers is sufficient documentation. I can reintroduce the test extension if there's interest. Thoughts? -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Вложения
Hi Bharath,
Hi,
Commit 91f2cae7a4e that introduced WALReadFromBuffers only used it for
physical walsenders. It can also be used in more places benefitting
logical walsenders, backends running pg_walinspect and logical
decoding functions if the WAL is available in WAL buffers. I'm
attaching a 0001 patch for this.
Thank you for working on this. It seems like a useful optimization. Do you have any information
on how much this improves the performance of the new callers of WALReadFromBuffers?
on how much this improves the performance of the new callers of WALReadFromBuffers?
Regarding the v3 version of the patch, do you also intend to include other callers of WALRead,
such as walsummarizer.c and pg_waldump.c?
In WALReadFromBuffers, the buffer scan currently stops when it encounters a buffer that
doesn't have the needed WAL page. However, it's possible to continue scanning past the
missing page and find other relevant pages further along. By keeping track of which pages
are missing, we could read only those specific pages from files, instead of reading everything
after the first missing page. I am wondering if this was taken into account during the design
of the function.
Thank you,
missing page and find other relevant pages further along. By keeping track of which pages
are missing, we could read only those specific pages from files, instead of reading everything
after the first missing page. I am wondering if this was taken into account during the design
of the function.
Thank you,
Rahila Syed
Hi~ > Thanks for looking at this. Yes, the WAL writers can zero out flushed > buffers before WALReadFromBuffers gets to them. However, > WALReadFromBuffers was intentionally designed as an opportunistic > optimization - it's a "try this first, quickly" approach before > falling back to reading from WAL files. The no-locks design ensures it > never gets in the way of backends generating WAL, which is critical > for overall system performance. Yes, it is actually an interesting thing, beyond current topic. Since we are using buffered I/O, even though we cannot read from WAL buffer due to the opportunistic AdvanceXLInsertBuffer by WAL writer, later WALRead may still find the page inside OS page cache, with high probability, because the page has just been written out. So WALRead will be fast, too. But if we are moving forward to direct I/O some day in the future, the cost of WALReadFromBuffers and WALRead might be obvious. Maybe the opportunistic WAL buffer initialization could keep a small ratio of old pages inside WAL buffer so these pages can still be hit by WALReadFromBuffers. > I rebased and attached the v3 patch. The v3 patch LGTM. — Regards, Jingtang Alibaba Cloud
On Sat, 2025-09-13 at 22:04 -0700, Bharath Rupireddy wrote: > Thanks for looking at this. Yes, the WAL writers can zero out flushed > buffers before WALReadFromBuffers gets to them. However, > WALReadFromBuffers was intentionally designed as an opportunistic > optimization - it's a "try this first, quickly" approach before > falling back to reading from WAL files. IIRC, one motivation (perhaps the primary motivation?) was to make it possible to read buffers before they are flushed. It was always possible to read already-flushed buffers. The benefit of reading unflushed buffers is that we can replicate the WAL sooner (though it can't be replayed until the primary flushes it). Is that right? Regards, Jeff Davis
Hi,
On Mon, Sep 22, 2025 at 8:56 PM Jeff Davis <pgsql@j-davis.com> wrote:
On Sat, 2025-09-13 at 22:04 -0700, Bharath Rupireddy wrote:
> Thanks for looking at this. Yes, the WAL writers can zero out flushed
> buffers before WALReadFromBuffers gets to them. However,
> WALReadFromBuffers was intentionally designed as an opportunistic
> optimization - it's a "try this first, quickly" approach before
> falling back to reading from WAL files.
IIRC, one motivation (perhaps the primary motivation?) was to make it
possible to read buffers before they are flushed. It was always
possible to read already-flushed buffers.
The benefit of reading unflushed buffers is that we can replicate the
WAL sooner (though it can't be replayed until the primary flushes it).
Is that right?
I'm not certain about the primary motivation, but as it stands, WALReadFromBuffers only reads WAL records present in buffers up to the flush pointer. This is because XLogSendPhysical currently sends records only up to the flush pointer, not beyond.
I am currently testing a patch developed by Melih Mutlu that implements the functionality you described, sending unflushed buffers during physical replication. After some tuning, the patch has shown a 5 percent improvement in TPS for synchronous replication with remote_write. I am working on further improving the patch before sharing it on the hackers mailing list.
Thank you,
Rahila Syed
Hi, On Mon, Sep 22, 2025 at 8:26 AM Jeff Davis <pgsql@j-davis.com> wrote: > > On Sat, 2025-09-13 at 22:04 -0700, Bharath Rupireddy wrote: > > Thanks for looking at this. Yes, the WAL writers can zero out flushed > > buffers before WALReadFromBuffers gets to them. However, > > WALReadFromBuffers was intentionally designed as an opportunistic > > optimization - it's a "try this first, quickly" approach before > > falling back to reading from WAL files. > > IIRC, one motivation (perhaps the primary motivation?) was to make it > possible to read buffers before they are flushed. It was always > possible to read already-flushed buffers. > > The benefit of reading unflushed buffers is that we can replicate the > WAL sooner (though it can't be replayed until the primary flushes it). > Is that right? Right. Reading unflushed WAL buffers for replication was one of the motivations. But, in general, WALReadFromBuffers has more benefits since it lets WAL buffers act as a cache for reads, avoiding the need to re-read WAL from disk for (both physical and logical) replication. For example, it makes the use of direct I/O for WAL more realistic and can provide significant performance benefits [1]. [1] https://www.postgresql.org/message-id/20230114203403.4zpg72fw2qb34awf%40awork3.anarazel.de -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Wed, 2025-09-24 at 07:26 -0700, Bharath Rupireddy wrote: > Right. Reading unflushed WAL buffers for replication was one of the > motivations. But, in general, WALReadFromBuffers has more benefits > since it lets WAL buffers act as a cache for reads, avoiding the need > to re-read WAL from disk for (both physical and logical) replication. > For example, it makes the use of direct I/O for WAL more realistic > and > can provide significant performance benefits [1]. Is it possible to do a POC that shows the potential benefit, or are we still too far away? Regards, Jeff Davis
Hi,
On Wed, Sep 24, 2025 at 11:27 AM Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Wed, 2025-09-24 at 07:26 -0700, Bharath Rupireddy wrote:
> > Right. Reading unflushed WAL buffers for replication was one of the
> > motivations. But, in general, WALReadFromBuffers has more benefits
> > since it lets WAL buffers act as a cache for reads, avoiding the need
> > to re-read WAL from disk for (both physical and logical) replication.
> > For example, it makes the use of direct I/O for WAL more realistic
> > and
> > can provide significant performance benefits [1].
>
> Is it possible to do a POC that shows the potential benefit, or are we
> still too far away?
Thanks for looking into this. I did performance analysis with WAL directo I/O to see how reading from WAL buffers affects walsenders: https://www.postgresql.org/message-id/CALj2ACV6rS%2B7iZx5%2BoAvyXJaN4AG-djAQeM1mrM%3DYSDkVrUs7g%40mail.gmail.com. Following is from that thread. Please let me know if you have any specific cases in mind. I'm happy to run the same test for logical replication.
It helps WAL DIO; since there's no OS
page cache, using WAL buffers as read cache helps a lot. It is clearly
evident from my experiment with WAL DIO patch [1], see the results [2]
and attached graph. As expected, WAL DIO brings down the TPS, whereas
WAL buffers read i.e. this patch brings it up.
[2] Test case is an insert pgbench workload.
clients HEAD | WAL DIO | WAL DIO & WAL BUFFERS READ | WAL BUFFERS READ
1 1404 1070 1424 1375
2 1487 796 1454 1517
4 3064 1743 3011 3019
8 6114 3556 6026 5954
16 11560 7051 12216 12132
32 23181 13079 23449 23561
64 43607 26983 43997 45636
128 80723 45169 81515 81911
256 110925 90185 107332 114046
512 119354 109817 110287 117506
768 112435 105795 106853 111605
1024 107554 105541 105942 109370
2048 88552 79024 80699 90555
4096 61323 54814 58704 61743
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Wed, Sep 24, 2025 at 11:27 AM Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Wed, 2025-09-24 at 07:26 -0700, Bharath Rupireddy wrote:
> > Right. Reading unflushed WAL buffers for replication was one of the
> > motivations. But, in general, WALReadFromBuffers has more benefits
> > since it lets WAL buffers act as a cache for reads, avoiding the need
> > to re-read WAL from disk for (both physical and logical) replication.
> > For example, it makes the use of direct I/O for WAL more realistic
> > and
> > can provide significant performance benefits [1].
>
> Is it possible to do a POC that shows the potential benefit, or are we
> still too far away?
Thanks for looking into this. I did performance analysis with WAL directo I/O to see how reading from WAL buffers affects walsenders: https://www.postgresql.org/message-id/CALj2ACV6rS%2B7iZx5%2BoAvyXJaN4AG-djAQeM1mrM%3DYSDkVrUs7g%40mail.gmail.com. Following is from that thread. Please let me know if you have any specific cases in mind. I'm happy to run the same test for logical replication.
It helps WAL DIO; since there's no OS
page cache, using WAL buffers as read cache helps a lot. It is clearly
evident from my experiment with WAL DIO patch [1], see the results [2]
and attached graph. As expected, WAL DIO brings down the TPS, whereas
WAL buffers read i.e. this patch brings it up.
[2] Test case is an insert pgbench workload.
clients HEAD | WAL DIO | WAL DIO & WAL BUFFERS READ | WAL BUFFERS READ
1 1404 1070 1424 1375
2 1487 796 1454 1517
4 3064 1743 3011 3019
8 6114 3556 6026 5954
16 11560 7051 12216 12132
32 23181 13079 23449 23561
64 43607 26983 43997 45636
128 80723 45169 81515 81911
256 110925 90185 107332 114046
512 119354 109817 110287 117506
768 112435 105795 106853 111605
1024 107554 105541 105942 109370
2048 88552 79024 80699 90555
4096 61323 54814 58704 61743
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com