Thread: Eagerly evict bulkwrite strategy ring

Eagerly evict bulkwrite strategy ring

From:
Melanie Plageman
Date:
Hi,

While discussing the next steps for AIO writes in Postgres, Andres
suggested that a good starting point would be to begin evicting more
than one buffer at a time in some of the buffer access strategies that
perform writes. This would make it easier to later combine these
writes and, eventually, issue them asynchronously.

The attached patch implements this behavior for the BAS_BULKWRITE
strategy. With the patch applied, I observe average performance
improvements of about 15-20% for parallel COPY FROM operations on the
same table.

After some analysis, this improvement appears to be primarily due to
reduced time spent by each backend waiting on the lock to flush WAL.

Since backends now issue more data file writes before each WAL flush
(using a heuristic that avoids eviction when it would require flushing
WAL), there is less interleaving between WAL flushes and data file
writes. With the patch applied, I observe client backends waiting
significantly less on the WALWriteLock. I also see lower f_await times
in iostat, suggesting reduced flush-related waiting at the kernel
level as well.
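
In code terms, the heuristic amounts to roughly this sketch
(BufferGetLSNAtomic() and XLogNeedsFlush() are the existing
bufmgr/xlog APIs; the helper itself, its name, and the elided
pin/lock handling are just for illustration):

#include "postgres.h"

#include "access/xlog.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"

/*
 * Illustrative only: would writing out this dirty buffer force a WAL
 * flush first? Pages whose LSN is already durable can be written
 * cheaply, making them candidates for eager flushing.
 */
static bool
CanFlushWithoutWALFlush(BufferDesc *bufdesc)
{
    XLogRecPtr  lsn = BufferGetLSNAtomic(BufferDescriptorGetBuffer(bufdesc));

    return !XLogNeedsFlush(lsn);
}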

It's worth noting that for the serial COPY case (a single COPY FROM),
performance remains essentially unchanged with the patch. The benefit
seems to emerge only when multiple backends are concurrently writing
data and flushing WAL. In fact, the benefit shrinks as fewer
parallel COPY FROM operations run at a time.

The benchmark I did was simple:

-- make 16 source data files that are >= 1GB each

initdb
pg_ctl start
createdb

sudo fstrim -v /mnt/data

psql -c "drop table foo; create table foo(a int, b int) with
(autovacuum_enabled = off);"

time pgbench \
  --no-vacuum \
  -c 16 \
  -j 16 \
  -t 4 \
-f- <<EOF
COPY foo FROM '/mnt/data/foo:client_id.data';
EOF

master -> patch
6.2 minutes -> 5 minutes : ~20% reduction

A 15% improvement can be seen with the same benchmark but with 4 workers.

- Melanie

Attachments

Re: Eagerly evict bulkwrite strategy ring

From:
Kirill Reshke
Date:
On Wed, 16 Jul 2025 at 04:51, Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> Hi,
>
> While discussing the next steps for AIO writes in Postgres, Andres
> suggested that a good starting point would be to begin evicting more
> than one buffer at a time in some of the buffer access strategies that
> perform writes. This would make it easier to later combine these
> writes and, eventually, issue them asynchronously.
>
> The attached patch implements this behavior for the BAS_BULKWRITE
> strategy. With the patch applied, I observe average performance
> improvements of about 15-20% for parallel COPY FROM operations on the
> same table.
>
> After some analysis, this improvement appears to be primarily due to
> reduced time spent by each backend waiting on the lock to flush WAL.
>
> Since backends now issue more data file writes before each WAL flush
> (using a heuristic that avoids eviction when it would require flushing
> WAL), there is less interleaving between WAL flushes and data file
> writes. With the patch applied, I observe client backends waiting
> significantly less on the WALWriteLock. I also see lower f_await times
> in iostat, suggesting reduced flush-related waiting at the kernel
> level as well.
>
> It's worth noting that for the serial COPY case (a single COPY FROM),
> performance remains essentially unchanged with the patch. The benefit
> seems to emerge only when multiple backends are concurrently writing
> data and flushing WAL. In fact, the benefit shrinks as fewer
> parallel COPY FROM operations run at a time.
>


Hi!


1) In EvictStrategyRing we look up the IO context for the strategy:

> + io_context = IOContextForStrategy(strategy);

but the caller of this function (GetVictimBuffer) already has one.
Should we reuse its context and pass it as a function parameter to
EvictStrategyRing?

2) The QuickCleanBuffer function has a return value that is never
checked. Should we change the signature to `void
QuickCleanBuffer(...)`?

3) In QuickCleanBuffer, we have `buffer =
BufferDescriptorGetBuffer(bufdesc);`, while the caller does the
opposite conversion right before the QuickCleanBuffer call. Should we
pass `bufnum` as a parameter?
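
To illustrate all three suggestions, the adjusted declarations could
look roughly like this (a hypothetical sketch; other parameters and
the bodies are elided):

/* (1) reuse the IOContext the caller already computed: */
static void EvictStrategyRing(BufferAccessStrategy strategy,
                              IOContext io_context);

/* (2) drop the unchecked return value, (3) take the Buffer too: */
static void QuickCleanBuffer(BufferDesc *bufdesc, Buffer bufnum);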

> The benchmark I did was simple:
>
> -- make 16 source data files that are >= 1GB each
>
> initdb
> pg_ctl start
> createdb
>
> sudo fstrim -v /mnt/data
>
> psql -c "drop table foo; create table foo(a int, b int) with
> (autovacuum_enabled = off);"
>
> time pgbench \
>   --no-vacuum \
>   -c 16 \
>   -j 16 \
>   -t 4 \
> -f- <<EOF
> COPY foo FROM '/mnt/data/foo:client_id.data';
> EOF
>
> master -> patch
> 6.2 minutes -> 5 minutes : ~20% reduction
>
> A 15% improvement can be seen with the same benchmark but with 4 workers.
>
> - Melanie
I only get 5-10% improvements.

I ran this benchmark as well. For 16 source data files of 150 MB
each, I get a 5-10% speedup (5% with small shared_buffers and 10%
with large shared_buffers).


-- 
Best regards,
Kirill Reshke



Re: Eagerly evict bulkwrite strategy ring

From:
Melanie Plageman
Date:
On Mon, Aug 18, 2025 at 2:54 AM Kirill Reshke <reshkekirill@gmail.com> wrote:
> Hi!

Thanks for the review!

> 1) In EvictStrategyRing we look up the IO context for the strategy:
>
> > + io_context = IOContextForStrategy(strategy);
>
> but the caller of this function (GetVictimBuffer) already has one.
> Should we reuse its context and pass it as a function parameter to
> EvictStrategyRing?

Done in attached v2.

> 2) The QuickCleanBuffer function has a return value that is never
> checked. Should we change the signature to `void
> QuickCleanBuffer(...)`?

Done in attached v2.

> 3) In QuickCleanBuffer, we have `buffer =
> BufferDescriptorGetBuffer(bufdesc);`, while the caller does the
> opposite conversion right before the QuickCleanBuffer call. Should we
> pass `bufnum` as a parameter?

Done in attached v2.

v2 also changes instances of the word "evict" to simply "flush"
because we don't actually invalidate the contents of the buffer -- we
just flush it so that it can be cheaply evicted when it is needed.

Also, I noticed that v1 had an issue -- it went through the buffers
from 0 to nbuffers and flushed them, instead of starting at the buffer
just after the one we flushed and doing a single sweep. I've fixed
that in the attached; it makes it more likely we'll flush a buffer
we're about to use.
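
Roughly, the fixed sweep looks like this (a sketch only -- the field
names follow the BufferAccessStrategyData struct in freelist.c, and
the flush decision itself is elided):

/* One pass around the ring, starting just after the current slot,
 * rather than always scanning from slot 0. */
for (int i = 1; i <= strategy->nbuffers; i++)
{
    int     slot = (strategy->current + i) % strategy->nbuffers;
    Buffer  bufnum = strategy->buffers[slot];

    if (bufnum == InvalidBuffer)
        continue;           /* slot not populated yet */

    /* ... apply the WAL-flush heuristic and maybe flush bufnum ... */
}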

> > A 15% improvement can be seen with the same benchmark but with 4 workers.
> >
> > - Melanie
> I only get 5-10% improvements.
>
> I ran this benchmark as well. For 16 source data files of 150 MB
> each, I get a 5-10% speedup (5% with small shared_buffers and 10%
> with large shared_buffers).

Yes, one possible reason is that with smaller source files (150 MB vs
1 GB), each COPY makes fewer trips around the ring relative to the
setup and one-time costs. And there are fewer sustained periods of
time during which the different COPYs are doing the same operations.

However, the more likely reason you see a smaller improvement is just
variance from one SSD and CPU to the next. I tried three SSD models
and two CPU models (all local [not cloud]) and saw variation in the
improvement -- on some models the improvement was less. For all cases,
I had to turn off CPU turbo boosting and idling to see consistent
results.

It's possible that the performance improvement from this patch by
itself isn't large enough to merit committing it.

I just finished a draft of a patch set that does write combining for
COPY FROM, using the same heuristic as this patch to decide when to
eagerly flush, but then combining multiple buffers into a single IO.
That has larger performance gains, so one could argue for waiting to
do that instead.

- Melanie

Attachments

Re: Eagerly evict bulkwrite strategy ring

From:
Melanie Plageman
Date:
On Mon, Aug 25, 2025 at 3:03 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> I just finished a draft of a patch set that does write combining for
> COPY FROM, using the same heuristic as this patch to decide when to
> eagerly flush, but then combining multiple buffers into a single IO.
> That has larger performance gains, so one could argue for waiting to
> do that instead.

Attached v3 uses the same code structure as in the checkpointer write
combining thread [1]. I need the refactoring of FlushBuffer() now
included in this set to do the write combining in the checkpointer.
Upon testing, I did notice that write combining seemed to have little
effect on COPY FROM beyond what eagerly flushing the buffers in the
ring already provides. The bottleneck is WAL IO and CPU. While we will
need the COPY FROM write combining to use direct IO, perhaps it is not
worth committing it just yet. For now, this thread remains limited to
eagerly flushing buffers in the BAS_BULKWRITE strategy ring.

- Melanie

[1] https://www.postgresql.org/message-id/flat/CAAKRu_bcWRvRwZUop_d9vzF9nHAiT%2B-uPzkJ%3DS3ShZ1GqeAYOw%40mail.gmail.com

Attachments

Re: Eagerly evict bulkwrite strategy ring

From:
Nazir Bilal Yavuz
Date:
Hi,

On Tue, 9 Sept 2025 at 20:37, Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Mon, Aug 25, 2025 at 3:03 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > I just finished a draft of a patch set that does write combining for
> > COPY FROM, using the same heuristic as this patch to decide when to
> > eagerly flush, but then combining multiple buffers into a single IO.
> > That has larger performance gains, so one could argue for waiting to
> > do that instead.
>
> Attached v3 uses the same code structure as in the checkpointer write
> combining thread [1]. I need the refactoring of FlushBuffer() now
> included in this set to do the write combining in the checkpointer.
> Upon testing, I did notice that write combining seemed to have little
> effect on COPY FROM beyond what eagerly flushing the buffers in the
> ring already provides. The bottleneck is WAL IO and CPU. While we will
> need the COPY FROM write combining to use direct IO, perhaps it is not
> worth committing it just yet. For now, this thread remains limited to
> eagerly flushing buffers in the BAS_BULKWRITE strategy ring.

Reviewing patches here instead of the checkpointer write combining thread.

From dae7c82146c2d73729fc12a742d84b660e6db2ad Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 11:00:44 -0400
Subject: [PATCH v3 1/4] Refactor goto into for loop in GetVictimBuffer()

LGTM.

From 2c8aafe30fb58516654e7d0cfdbfbb15a6a00498 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 11:32:24 -0400
Subject: [PATCH v3 2/4] Split FlushBuffer() into two parts

Codewise LGTM. Two minor points:

1- The commit message only mentions PrepareFlushBuffer() and
DoFlushBuffer(); I did not expect to see CleanVictimBuffer(). Maybe it
is worth mentioning it in the commit message.

2-
+/*
+ * Prepare to write and write a dirty victim buffer.

Although this comment is correct, it is a bit complicated for me. How
about 'Prepare to write and then write a dirty victim buffer'?

From 32f8dbe2c885ce45ef7b217c2693333525fb9b89 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 12:43:24 -0400
Subject: [PATCH v3 3/4] Eagerly flush bulkwrite strategy ring

+/*
+ * Returns the buffer descriptor of the buffer containing the next block we
+ * should eagerly flush or or NULL when there are no further buffers to

s/or or/ or/.

+ * consider writing out.
+ */
+static BufferDesc *
+next_strat_buf_to_flush(BufferAccessStrategy strategy,
+                        XLogRecPtr *lsn)
+{
+    Buffer        bufnum;
+    BufferDesc *bufdesc;
+
+    while ((bufnum = StrategySweepNextBuffer(strategy)) != InvalidBuffer)
+    {

StrategySweepNextBuffer() returns InvalidBuffer when we reach the
start, but can strategy->buffers[strategy->sweep_current] itself be
InvalidBuffer? I mean, is the following case possible:
strategy->buffers[strategy->sweep_current] is InvalidBuffer but
strategy->buffers[strategy->sweep_current + 1] is not. If so, we exit
early from next_strat_buf_to_flush() although there are more buffers
to consider writing out.

+/*
+ * Start a sweep of the strategy ring.
+ */
+void
+StartStrategySweep(BufferAccessStrategy strategy)
+{
+    if (!strategy)
+        return;

I think we will always use this function together with
strategy_supports_eager_flush(), right? If yes, then we do not need to
check if the strategy is NULL. If not, then I think this function
should return a boolean to make it explicit that we cannot do a sweep.

+extern bool strategy_supports_eager_flush(BufferAccessStrategy strategy);

All the functions in buf_internals.h are pascal case; should we make
this one pascal case too?

+    /* Must do this before taking the buffer header spinlock. */
+    /* Try to start an I/O operation. */

These one-line comments end with a dot.

--
Regards,
Nazir Bilal Yavuz
Microsoft



Re: Eagerly evict bulkwrite strategy ring

From:
Melanie Plageman
Date:
On Wed, Sep 10, 2025 at 6:14 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>

Thanks so much for the review! I've only included inline comments on
the points that still might need discussion. Otherwise, I've
incorporated your suggested changes.

> From 2c8aafe30fb58516654e7d0cfdbfbb15a6a00498 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Tue, 2 Sep 2025 11:32:24 -0400
> Subject: [PATCH v3 2/4] Split FlushBuffer() into two parts
> 2-
> +/*
> + * Prepare to write and write a dirty victim buffer.
>
> Although this comment is correct, it is a bit complicated for me. How
> about 'Prepare to write and then write a dirty victim buffer'?

I've gone with "Prepare and write out a dirty victim buffer."

> From 32f8dbe2c885ce45ef7b217c2693333525fb9b89 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Tue, 2 Sep 2025 12:43:24 -0400
> Subject: [PATCH v3 3/4] Eagerly flush bulkwrite strategy ring
>
> + * consider writing out.
> + */
> +static BufferDesc *
> +next_strat_buf_to_flush(BufferAccessStrategy strategy,
> +                        XLogRecPtr *lsn)
> +{
> +    Buffer        bufnum;
> +    BufferDesc *bufdesc;
> +
> +    while ((bufnum = StrategySweepNextBuffer(strategy)) != InvalidBuffer)
> +    {
>
> StrategySweepNextBuffer() returns InvalidBuffer when we reach the
> start, but can strategy->buffers[strategy->sweep_current] itself be
> InvalidBuffer? I mean, is the following case possible:
> strategy->buffers[strategy->sweep_current] is InvalidBuffer but
> strategy->buffers[strategy->sweep_current + 1] is not. If so, we exit
> early from next_strat_buf_to_flush() although there are more buffers
> to consider writing out.

Yes, good thought. Actually, for BAS_BULKWRITE this cannot happen,
because when a buffer is not reused we overwrite its slot in the
buffers array with the shared buffer that replaces it. It can happen
for BAS_BULKREAD. Since we are only concerned with writing, I think we
can terminate after we hit an InvalidBuffer in the ring.

While looking at this, I decided it didn't make sense to have sweep
variables in the strategy object, so I've actually changed the way
StrategySweepNextBuffer() works. There was also an issue with the
sweep -- it could run into and past the starting buffer. So, I had to
change it. Take a look at the new method and let me know what you
think.
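
Roughly, the new shape is as follows (an illustrative signature and
body, not the exact patch code -- the point is that the caller owns
the sweep position instead of the strategy object):

/*
 * Return the next buffer in a single pass around the ring starting at
 * sweep_start; return InvalidBuffer once the pass completes, so the
 * sweep can never run past its starting point.
 */
Buffer
StrategySweepNextBuffer(BufferAccessStrategy strategy,
                        int sweep_start, int *sweep_index)
{
    if (*sweep_index >= strategy->nbuffers)
        return InvalidBuffer;   /* completed one full pass */

    return strategy->buffers[(sweep_start + (*sweep_index)++) %
                             strategy->nbuffers];
}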

> +/*
> + * Start a sweep of the strategy ring.
> + */
> +void
> +StartStrategySweep(BufferAccessStrategy strategy)
> +{
> +    if (!strategy)
> +        return;
>
> I think we will always use this function together with
> strategy_supports_eager_flush(), right? If yes, then we do not need to
> check if the strategy is NULL. If not, then I think this function
> should return a boolean to make it explicit that we cannot do a sweep.

Yes, I just removed this check.

> +extern bool strategy_supports_eager_flush(BufferAccessStrategy strategy);
>
> All the functions in buf_internals.h are pascal case; should we make
> this one pascal case too?

I thought maybe I'd go a different way because it's sort of
informational and not a function that does stuff -- but anyway you're
right. I've given up and made all my helpers pascal case.

- Melanie

Attachments

Re: Eagerly evict bulkwrite strategy ring

From:
Nazir Bilal Yavuz
Date:
Hi,

On Fri, 12 Sept 2025 at 02:17, Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> While looking at this, I decided it didn't make sense to have sweep
> variables in the strategy object, so I've actually changed the way
> StrategySweepNextBuffer() works. There was also an issue with the
> sweep -- it could run into and past the starting buffer. So, I had to
> change it. Take a look at the new method and let me know what you
> think.

It looks good to me. StrategySweepNextBuffer()'s comment is no longer
correct, though.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft