Обсуждение: Eagerly evict bulkwrite strategy ring
Hi, While discussing the next steps for AIO writes in Postgres, Andres suggested that a good starting point would be to begin evicting more than one buffer at a time in some of the buffer access strategies that perform writes. This would make it easier to later combine these writes and, eventually, issue them asynchronously. The attached patch implements this behavior for the BAS_BULKWRITE strategy. With the patch applied, I observe average performance improvements of about 15-20% for parallel COPY FROM operations on the same table. After some analysis, this improvement appears to be primarily due to reduced time spent by each backend waiting on the lock to flush WAL. Since backends now issue more data file writes before each WAL flush (using a heuristic that avoids eviction when it would require flushing WAL), there is less interleaving between WAL flushes and data file writes. With the patch applied, I observe client backends waiting significantly less on the WALWriteLock. I also see lower f_await times in iostat, suggesting reduced flush-related waiting at the kernel level as well. It's worth noting that for the serial COPY case (a single COPY FROM), performance remains essentially unchanged with the patch. The benefit seems to emerge only when multiple backends are concurrently writing data and flushing WAL. In fact, the benefits go down the fewer parallel COPY FROM operations are performed at a time. The benchmark I did was simple: -- make 16 source data files that are >= 1GB each initdb pg_ctl start createdb sudo fstrim -v /mnt/data psql -c "drop table foo; create table foo(a int, b int) with (autovacuum_enabled = off);" time pgbench \ --no-vacuum \ -c 16 \ -j 16 \ -t 4 \ -f- <<EOF COPY foo FROM '/mnt/data/foo:client_id.data'; EOF master -> patch 6.2 minutes -> 5 minutes : ~20% reduction A 15% improvement can be noticed with the same benchmark but 4 workers. - Melanie
Вложения
On Wed, 16 Jul 2025 at 04:51, Melanie Plageman <melanieplageman@gmail.com> wrote: > > Hi, > > While discussing the next steps for AIO writes in Postgres, Andres > suggested that a good starting point would be to begin evicting more > than one buffer at a time in some of the buffer access strategies that > perform writes. This would make it easier to later combine these > writes and, eventually, issue them asynchronously. > > The attached patch implements this behavior for the BAS_BULKWRITE > strategy. With the patch applied, I observe average performance > improvements of about 15-20% for parallel COPY FROM operations on the > same table. > > After some analysis, this improvement appears to be primarily due to > reduced time spent by each backend waiting on the lock to flush WAL. > > Since backends now issue more data file writes before each WAL flush > (using a heuristic that avoids eviction when it would require flushing > WAL), there is less interleaving between WAL flushes and data file > writes. With the patch applied, I observe client backends waiting > significantly less on the WALWriteLock. I also see lower f_await times > in iostat, suggesting reduced flush-related waiting at the kernel > level as well. > > It's worth noting that for the serial COPY case (a single COPY FROM), > performance remains essentially unchanged with the patch. The benefit > seems to emerge only when multiple backends are concurrently writing > data and flushing WAL. In fact, the benefits go down the fewer > parallel COPY FROM operations are performed at a time. > Hi! 1) In EvictStrategyRing we find io context for strategy: > + io_context = IOContextForStrategy(strategy); but the caller of this function (GetVictimBuffer) already has one. Should we reuse its context, pass it as function param to EvictStrategyRing? 2) QuickCleanBuffer function has a return value which is never checked. Should we change the signature to `void QuickCleanBuffer (...)` ? 3) In QuickCleanBuffer, we have `buffer = BufferDescriptorGetBuffer(bufdesc);`, while in QuickCleanBuffer we do the opposite right before QuickCleanBuffer call. Should we pass `bufnum` as a parameter? > The benchmark I did was simple: > > -- make 16 source data files that are >= 1GB each > > initdb > pg_ctl start > createdb > > sudo fstrim -v /mnt/data > > psql -c "drop table foo; create table foo(a int, b int) with > (autovacuum_enabled = off);" > > time pgbench \ > --no-vacuum \ > -c 16 \ > -j 16 \ > -t 4 \ > -f- <<EOF > COPY foo FROM '/mnt/data/foo:client_id.data'; > EOF > > master -> patch > 6.2 minutes -> 5 minutes : ~20% reduction > > A 15% improvement can be noticed with the same benchmark but 4 workers. > > - Melanie In only get 5-10% improvements I did this benchmark also. For 16 source data files that are 150MB each I get 5-10 % speedup (5% with small shared_buffers and 10 % with big shared_buffers). -- Best regards, Kirill Reshke
On Mon, Aug 18, 2025 at 2:54 AM Kirill Reshke <reshkekirill@gmail.com> wrote: > Hi! Thanks for the review! > 1) In EvictStrategyRing we find io context for strategy: > > > + io_context = IOContextForStrategy(strategy); > > but the caller of this function (GetVictimBuffer) already has one. > Should we reuse its context, pass it as function param to > EvictStrategyRing? Done in attached v2. > 2) QuickCleanBuffer function has a return value which is never > checked. Should we change the signature to `void QuickCleanBuffer > (...)` ? Done in attached v2. > 3) In QuickCleanBuffer, we have `buffer = > BufferDescriptorGetBuffer(bufdesc);`, while in QuickCleanBuffer we do > the opposite right before QuickCleanBuffer call. Should we pass > `bufnum` as a parameter? Done in attached v2. v2 also changes instances of the word "evict" to simply "flush" because we don't actually invalidate the contents of the buffer -- we just flush it so that it can be cheaply evicted when it is needed. Also, I noticed that v1 had an issue -- it goes through the buffers from 0 to nbuffers and flushes them instead of starting at the buffer just after the one we flushed and doing a single sweep. I've fixed that in the attached. It makes it more likely we'll flush a buffer we're about to use. > > A 15% improvement can be noticed with the same benchmark but 4 workers. > > > > - Melanie > In only get 5-10% improvements > > I did this benchmark also. For 16 source data files that are 150MB > each I get 5-10 % speedup (5% with small shared_buffers and 10 % with > big shared_buffers). Yes, one possible reason is that with smaller source files (150 MB vs 1 GB) is that the times around the ring for each COPY are less vs. the setup and one time costs. And there are fewer sustained periods of time where the different COPYies are ongoing doing the same operations. However, the more likely reason you see a smaller improvement is just variance from one SSD and CPU to the next. I tried on three models of SSD and two of CPU (all local [not cloud]) and saw variation in the improvement -- on some models the improvement was less. For all cases, I had to turn off CPU turbo boosting and idling to see consistent results. It's possible that this performance improvement of this patch by itself isn't large enough to merit committing it. I just finished a draft of a patch set to do write combining for COPY FROM using this same heuristic as this patch for deciding to eagerly flush but then combining multiple buffers into a single IO. That has larger performance gains, so one could argue to wait to do that. - Melanie
Вложения
On Mon, Aug 25, 2025 at 3:03 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > I just finished a draft of a patch set to do write combining for COPY > FROM using this same heuristic as this patch for deciding to eagerly > flush but then combining multiple buffers into a single IO. That has > larger performance gains, so one could argue to wait to do that. Attached v3 uses the same code structure as in the checkpointer write combining thread [1]. I need the refactoring of FlushBuffer() now included in this set to do the write combining in checkpointer. Upon testing, I did notice that write combining seemed to have little effect on COPY FROM beyond what eagerly flushing the buffers in the ring has. The bottleneck is WAL IO and CPU. While we will need the COPY FROM write combining to use direct IO, perhaps it is not worth committing it just yet. For now, this thread remains limited to eagerly flushing buffers in the BAS_BULKWRITE strategy ring. - Melanie [1] https://www.postgresql.org/message-id/flat/CAAKRu_bcWRvRwZUop_d9vzF9nHAiT%2B-uPzkJ%3DS3ShZ1GqeAYOw%40mail.gmail.com
Вложения
Hi, On Tue, 9 Sept 2025 at 20:37, Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Mon, Aug 25, 2025 at 3:03 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > > > I just finished a draft of a patch set to do write combining for COPY > > FROM using this same heuristic as this patch for deciding to eagerly > > flush but then combining multiple buffers into a single IO. That has > > larger performance gains, so one could argue to wait to do that. > > Attached v3 uses the same code structure as in the checkpointer write > combining thread [1]. I need the refactoring of FlushBuffer() now > included in this set to do the write combining in checkpointer. Upon > testing, I did notice that write combining seemed to have little > effect on COPY FROM beyond what eagerly flushing the buffers in the > ring has. The bottleneck is WAL IO and CPU. While we will need the > COPY FROM write combining to use direct IO, perhaps it is not worth > committing it just yet. For now, this thread remains limited to > eagerly flushing buffers in the BAS_BULKWRITE strategy ring. Reviewing patches here instead of the checkpointer write combining thread. From dae7c82146c2d73729fc12a742d84b660e6db2ad Mon Sep 17 00:00:00 2001 From: Melanie Plageman <melanieplageman@gmail.com> Date: Tue, 2 Sep 2025 11:00:44 -0400 Subject: [PATCH v3 1/4] Refactor goto into for loop in GetVictimBuffer() LGTM. From 2c8aafe30fb58516654e7d0cfdbfbb15a6a00498 Mon Sep 17 00:00:00 2001 From: Melanie Plageman <melanieplageman@gmail.com> Date: Tue, 2 Sep 2025 11:32:24 -0400 Subject: [PATCH v3 2/4] Split FlushBuffer() into two parts Codewise LGTM. Two minor points: 1- Commit message only mentions PrepareFlushBuffer() and DoFlushBuffer(), I did not expect to see the CleanVictimBuffer(). Maybe it is worth mentioning it in the commit message. 2- +/* + * Prepare to write and write a dirty victim buffer. Although this comment is correct, it is a bit complicated for me. How about 'Prepare to write and then write a dirty victim buffer'? From 32f8dbe2c885ce45ef7b217c2693333525fb9b89 Mon Sep 17 00:00:00 2001 From: Melanie Plageman <melanieplageman@gmail.com> Date: Tue, 2 Sep 2025 12:43:24 -0400 Subject: [PATCH v3 3/4] Eagerly flush bulkwrite strategy ring +/* + * Returns the buffer descriptor of the buffer containing the next block we + * should eagerly flush or or NULL when there are no further buffers to s/or or/ or/. + * consider writing out. + */ +static BufferDesc * +next_strat_buf_to_flush(BufferAccessStrategy strategy, + XLogRecPtr *lsn) +{ + Buffer bufnum; + BufferDesc *bufdesc; + + while ((bufnum = StrategySweepNextBuffer(strategy)) != InvalidBuffer) + { StrategySweepNextBuffer() returns InvalidBuffer when we reach the start but can strategy->buffers[strategy->sweep_current] be an InvalidBuffer? I mean, is the following case possible: strategy->buffers[strategy->sweep_current] is an InvalidBuffer but strategy->buffers[strategy->sweep_current + 1] is not. So, we exit early from the next_strat_buf_to_flush() although there are more buffers to consider writing out. +/* + * Start a sweep of the strategy ring. + */ +void +StartStrategySweep(BufferAccessStrategy strategy) +{ + if (!strategy) + return; I think we will always use this function together with strategy_supports_eager_flush(), right? If yes, then we do not need to check if the strategy is NULL. If not, then I think this function should return boolean to make it explicit that we can not do sweep. +extern bool strategy_supports_eager_flush(BufferAccessStrategy strategy); All the functions in the buf_internals.h are pascal case, should we make this too? + /* Must do this before taking the buffer header spinlock. */ + /* Try to start an I/O operation. */ These one line comments end with a dot. -- Regards, Nazir Bilal Yavuz Microsoft
On Wed, Sep 10, 2025 at 6:14 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote: > Thanks so much for the review! I've only included inline comments to things that still might need discussion. Otherwise, I've incorporated your suggested changes. > From 2c8aafe30fb58516654e7d0cfdbfbb15a6a00498 Mon Sep 17 00:00:00 2001 > From: Melanie Plageman <melanieplageman@gmail.com> > Date: Tue, 2 Sep 2025 11:32:24 -0400 > Subject: [PATCH v3 2/4] Split FlushBuffer() into two parts > 2- > +/* > + * Prepare to write and write a dirty victim buffer. > > Although this comment is correct, it is a bit complicated for me. How > about 'Prepare to write and then write a dirty victim buffer'? I've gone with * Prepare and write out a dirty victim buffer. > From 32f8dbe2c885ce45ef7b217c2693333525fb9b89 Mon Sep 17 00:00:00 2001 > From: Melanie Plageman <melanieplageman@gmail.com> > Date: Tue, 2 Sep 2025 12:43:24 -0400 > Subject: [PATCH v3 3/4] Eagerly flush bulkwrite strategy ring > > + * consider writing out. > + */ > +static BufferDesc * > +next_strat_buf_to_flush(BufferAccessStrategy strategy, > + XLogRecPtr *lsn) > +{ > + Buffer bufnum; > + BufferDesc *bufdesc; > + > + while ((bufnum = StrategySweepNextBuffer(strategy)) != InvalidBuffer) > + { > > StrategySweepNextBuffer() returns InvalidBuffer when we reach the > start but can strategy->buffers[strategy->sweep_current] be an > InvalidBuffer? I mean, is the following case possible: > strategy->buffers[strategy->sweep_current] is an InvalidBuffer but > strategy->buffers[strategy->sweep_current + 1] is not. So, we exit > early from the next_strat_buf_to_flush() although there are more > buffers to consider writing out. Yes, good thought. Actually for BAS_BULKWRITE this cannot happen because when a buffer is not reused we overwrite its place in the buffers array with the shared buffer we then replace it with. It can happen for BAS_BULKREAD. Since we are only concerned with writing, I think we can terminate after we hit an InvalidBuffer in the ring. While looking at this, I decided it didn't make sense to have sweep variables in the strategy object, so I've actually changed the way StrategySweepNextBuffer() works. There was also an issue with the sweep -- it could run into and past the starting buffer. So, I had to change it. Take a look at the new method and let me know what you think. > +/* > + * Start a sweep of the strategy ring. > + */ > +void > +StartStrategySweep(BufferAccessStrategy strategy) > +{ > + if (!strategy) > + return; > > I think we will always use this function together with > strategy_supports_eager_flush(), right? If yes, then we do not need to > check if the strategy is NULL. If not, then I think this function > should return boolean to make it explicit that we can not do sweep. Yes, I just removed this check. > +extern bool strategy_supports_eager_flush(BufferAccessStrategy strategy); > > All the functions in the buf_internals.h are pascal case, should we > make this too? I thought maybe I'd go a different way because it's sort of informational and not a function that does stuff -- but anyway you're right. I've given up and made all my helpers pascal case. - Melanie
Вложения
Hi, On Fri, 12 Sept 2025 at 02:17, Melanie Plageman <melanieplageman@gmail.com> wrote: > > While looking at this, I decided it didn't make sense to have sweep > variables in the strategy object, so I've actually changed the way > StrategySweepNextBuffer() works. There was also an issue with the > sweep -- it could run into and past the starting buffer. So, I had to > change it. Take a look at the new method and let me know what you > think. It looks good to me. StrategySweepNextBuffer()'s comment is no longer correct, though. -- Regards, Nazir Bilal Yavuz Microsoft