Thread: Checkpointer write combining
Hi,

The attached patchset implements checkpointer write combining -- which makes immediate checkpoints at least 20% faster in my tests. Checkpointer achieves higher write throughput and higher write IOPs with the patch.

Besides the immediate performance gain with the patchset, we will eventually need all writers to do write combining if we want to use direct IO. Additionally, I think the general shape I refactored BufferSync() into will be useful for AIO-ifying checkpointer.

The patch set has preliminary patches (0001-0004) that implement eager flushing and write combining for bulkwrites (like COPY FROM). The functions used to flush a batch of writes for bulkwrites (see 0004) are reused for the checkpointer. The eager flushing component of this patch set has been discussed elsewhere [1].

0005 implements a fix for XLogNeedsFlush() when called by checkpointer during an end-of-crash-recovery checkpoint. I've already started another thread about this [2], but the patch is required for the patch set to pass tests.

One outstanding action item is to test whether spread checkpoints benefit as well.

More on how I measured the performance benefit for immediate checkpoints: I tuned checkpoint_completion_target, checkpoint_timeout, and min_wal_size and max_wal_size to ensure no other checkpoints were initiated during the test. With 16 GB of shared buffers and io_combine_limit 128, I created a 15 GB table. To get consistent results, I used pg_prewarm to read the table into shared buffers, issued a checkpoint, then used Bilal's patch [3] to mark all the buffers as dirty again and issued another checkpoint. On a fast local SSD, this showed a consistent 20%+ speedup (~6.5 seconds down to ~5 seconds).

- Melanie

[1] https://www.postgresql.org/message-id/CAAKRu_Yjn4mvN9NBxtmsCQSGwup45CoA4e05nhR7ADP-v0WCig@mail.gmail.com
[2] https://www.postgresql.org/message-id/CAAKRu_a1vZRZRWO3_jv_X13RYoqLRVipGO0237g5PKzPa2YX6g%40mail.gmail.com
[3] https://www.postgresql.org/message-id/flat/CAN55FZ0h_YoSqqutxV6DES1RW8ig6wcA8CR9rJk358YRMxZFmw%40mail.gmail.com
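[Editor's note: for readers new to the technique, a minimal, generic sketch of what write combining means at the I/O layer. This is illustrative only -- the patchset operates on PostgreSQL buffer descriptors and the smgr layer, and every name below is invented:]

```
/*
 * Generic sketch of write combining (not the patchset's code): instead of
 * one pwrite() per dirty 8 kB block, adjacent blocks are gathered into a
 * single vectored write.
 */
#define _GNU_SOURCE             /* for pwritev() on Linux */
#include <sys/uio.h>
#include <unistd.h>

#define BLCKSZ 8192
#define MAX_COMBINE 16          /* stand-in for io_combine_limit */

static ssize_t
write_combined(int fd, unsigned start_block,
               char pages[][BLCKSZ], int nblocks)
{
    struct iovec iov[MAX_COMBINE];

    if (nblocks > MAX_COMBINE)
        nblocks = MAX_COMBINE;

    for (int i = 0; i < nblocks; i++)
    {
        iov[i].iov_base = pages[i];
        iov[i].iov_len = BLCKSZ;
    }

    /* One syscall covers the whole adjacent run of blocks. */
    return pwritev(fd, iov, nblocks, (off_t) start_block * BLCKSZ);
}
```

Adjacent runs like this form naturally for the checkpointer because it already sorts dirty buffers before writing them (see the CkptSortItem patches below).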
Attachments
- v1-0005-Fix-XLogNeedsFlush-for-checkpointer.patch
- v1-0001-Refactor-goto-into-for-loop-in-GetVictimBuffer.patch
- v1-0004-Write-combining-for-BAS_BULKWRITE.patch
- v1-0003-Eagerly-flush-bulkwrite-strategy-ring.patch
- v1-0002-Split-FlushBuffer-into-two-parts.patch
- v1-0006-Add-database-Oid-to-CkptSortItem.patch
- v1-0007-Implement-checkpointer-data-write-combining.patch
On Tue, Sep 2, 2025 at 5:10 PM Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> The attached patchset implements checkpointer write combining -- which
> makes immediate checkpoints at least 20% faster in my tests.
> Checkpointer achieves higher write throughput and higher write IOPs
> with the patch.

These needed a rebase. Attached v2.

- Melanie
Attachments
- v2-0002-Split-FlushBuffer-into-two-parts.patch
- v2-0003-Eagerly-flush-bulkwrite-strategy-ring.patch
- v2-0004-Write-combining-for-BAS_BULKWRITE.patch
- v2-0001-Refactor-goto-into-for-loop-in-GetVictimBuffer.patch
- v2-0005-Fix-XLogNeedsFlush-for-checkpointer.patch
- v2-0006-Add-database-Oid-to-CkptSortItem.patch
- v2-0007-Implement-checkpointer-data-write-combining.patch
Hi,

Thank you for working on this!

On Tue, 9 Sept 2025 at 02:44, Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> On Tue, Sep 2, 2025 at 5:10 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > The attached patchset implements checkpointer write combining -- which
> > makes immediate checkpoints at least 20% faster in my tests.
> > Checkpointer achieves higher write throughput and higher write IOPs
> > with the patch.

I ran the same benchmark you did and found it is 50% faster (16 seconds to 8 seconds).

From 053dd9d15416d76ce4b95044d848f51ba13a2d20 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 11:00:44 -0400
Subject: [PATCH v2 1/9] Refactor goto into for loop in GetVictimBuffer()

@@ -731,6 +741,13 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_r
 		strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
 		return false;

+	buf_state = LockBufHdr(buf);
+	lsn = BufferGetLSN(buf);
+	UnlockBufHdr(buf, buf_state);
+
+	if (!XLogNeedsFlush(lsn))
+		return true;

I think this should return false.

I am planning to review the other patches later; this is for the first patch only.

--
Regards,
Nazir Bilal Yavuz
Microsoft
On Tue, Sep 9, 2025 at 9:27 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

Thanks for the review!

> From 053dd9d15416d76ce4b95044d848f51ba13a2d20 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Tue, 2 Sep 2025 11:00:44 -0400
> Subject: [PATCH v2 1/9] Refactor goto into for loop in GetVictimBuffer()
>
> @@ -731,6 +741,13 @@ StrategyRejectBuffer(BufferAccessStrategy
> strategy, BufferDesc *buf, bool from_r
>  		strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
>  		return false;
>
> +	buf_state = LockBufHdr(buf);
> +	lsn = BufferGetLSN(buf);
> +	UnlockBufHdr(buf, buf_state);
> +
> +	if (!XLogNeedsFlush(lsn))
> +		return true;
>
> I think this should return false.

Oops, you're right. v3 attached with that mistake fixed.

- Melanie
Attachments
- v3-0003-Eagerly-flush-bulkwrite-strategy-ring.patch
- v3-0001-Refactor-goto-into-for-loop-in-GetVictimBuffer.patch
- v3-0002-Split-FlushBuffer-into-two-parts.patch
- v3-0005-Fix-XLogNeedsFlush-for-checkpointer.patch
- v3-0004-Write-combining-for-BAS_BULKWRITE.patch
- v3-0007-Implement-checkpointer-data-write-combining.patch
- v3-0006-Add-database-Oid-to-CkptSortItem.patch
On Tue, Sep 9, 2025 at 9:39 AM Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> Oops, you're right. v3 attached with that mistake fixed.

One more fix and a bit more cleanup in attached v4.

- Melanie
Attachments
- v4-0003-Eagerly-flush-bulkwrite-strategy-ring.patch
- v4-0005-Fix-XLogNeedsFlush-for-checkpointer.patch
- v4-0002-Split-FlushBuffer-into-two-parts.patch
- v4-0004-Write-combining-for-BAS_BULKWRITE.patch
- v4-0001-Refactor-goto-into-for-loop-in-GetVictimBuffer.patch
- v4-0007-Implement-checkpointer-data-write-combining.patch
- v4-0006-Add-database-Oid-to-CkptSortItem.patch
On Tue, Sep 9, 2025 at 11:16 AM Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> One more fix and a bit more cleanup in attached v4.

Okay, one more version: I updated the thread on eager flushing the
bulkwrite ring [1], and some updates were needed here.

- Melanie

[1] https://www.postgresql.org/message-id/flat/CAAKRu_Yjn4mvN9NBxtmsCQSGwup45CoA4e05nhR7ADP-v0WCig%40mail.gmail.com
Attachments
- v5-0004-Write-combining-for-BAS_BULKWRITE.patch
- v5-0001-Refactor-goto-into-for-loop-in-GetVictimBuffer.patch
- v5-0005-Fix-XLogNeedsFlush-for-checkpointer.patch
- v5-0002-Split-FlushBuffer-into-two-parts.patch
- v5-0003-Eagerly-flush-bulkwrite-strategy-ring.patch
- v5-0006-Add-database-Oid-to-CkptSortItem.patch
- v5-0007-Implement-checkpointer-data-write-combining.patch
On Tue, 2025-09-09 at 13:55 -0400, Melanie Plageman wrote:
> On Tue, Sep 9, 2025 at 11:16 AM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > One more fix and a bit more cleanup in attached v4.
>
> Okay one more version: I updated the thread on eager flushing the
> bulkwrite ring [1], and some updates were needed here.

v5-0005 comments:

* Please update the comment above the code change.

* The last paragraph in the commit message has a typo: "potentially update the local copy of min recovery point, when xlog inserts are *not* allowed", right?

* Shouldn't the code be consistent between XLogNeedsFlush() and XLogFlush()? The latter only checks for !XLogInsertAllowed(), whereas the former also checks for RecoveryInProgress().

I'm still not sure I understand the problem situation this is fixing, but that's being discussed in another thread.

Regards,
Jeff Davis
On Sep 10, 2025, at 01:55, Melanie Plageman <melanieplageman@gmail.com> wrote:
> [1] https://www.postgresql.org/message-id/flat/CAAKRu_Yjn4mvN9NBxtmsCQSGwup45CoA4e05nhR7ADP-v0WCig%40mail.gmail.com
1 - 0001
```
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
+ * The buffer must pinned and content locked and the buffer header spinlock
```
“Must pinned” -> “must be pinned”
2 - 0001
```
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
+ if (XLogNeedsFlush(lsn))
+ {
+ /*
+ * Remove the dirty buffer from the ring; necessary to prevent an
+ * infinite loop if all ring members are dirty.
+ */
+ strategy->buffers[strategy->current] = InvalidBuffer;
+ return true;
+ }
- return true;
+ return false;
}
```
We can do:
```
if (!XLogNeedsFlush(lsn))
    return false;

/* Remove the dirty buffer ….
 */
return true;
}
```
This way produces a smaller diff.
3 - 0002
```
+ * Prepare to write and write a dirty victim buffer.
```
Prepare to write a dirty victim buffer.
4 - 0002
```
- /* OK, do the I/O */
- FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
- LWLockRelease(content_lock);
-
- ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
- &buf_hdr->tag);
+ CleanVictimBuffer(buf_hdr, &buf_state, from_ring, io_context);
```
I see that CleanVictimBuffer() gets content_lock from the bufdesc and releases it, but that makes the code hard to understand. Readers might be confused about why content_lock is not released after CleanVictimBuffer() returns, unless they go read CleanVictimBuffer() itself.
I'd suggest passing content_lock to CleanVictimBuffer() as a parameter, which gives a clear hint that CleanVictimBuffer() will release the lock.
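[Editor's note: a sketch of what the suggested signature might look like. This is hypothetical -- the parameter order and names are illustrative, not the patch's actual code:]

```
/*
 * Hypothetical sketch: the explicit parameter documents that the callee
 * releases content_lock before returning.
 */
static void
CleanVictimBuffer(BufferDesc *buf_hdr, uint32 *buf_state, bool from_ring,
                  IOContext io_context, LWLock *content_lock);
```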
5 - 0002
```
* disastrous system-wide consequences. To make sure that can't happen,
* skip the flush if the buffer isn't permanent.
*/
- if (buf_state & BM_PERMANENT)
- XLogFlush(recptr);
+ if (!XLogRecPtrIsInvalid(buffer_lsn))
+ XLogFlush(buffer_lsn);
```
Why is this check changed? And should the comment be updated accordingly? It still says "skip the flush if the buffer isn't permanent", which reflects the old code.
6 - 0003
```
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
+ * patterns than lazily flushing buffers directly before reusing them.
+ */
```
Here "directly" is kind of ambiguous: it could mean "immediately before" or "without going through something else". My understanding is "immediately". If that is true, please change "directly" to "immediately" or just remove it.
7 - 0003
```
+void
+StartStrategySweep(BufferAccessStrategy strategy)
+{
+ if (!strategy)
+ return;
```
I doubt that this "strategy" null check is needed, because the function is only called when strategy_supports_eager_flush() is true, and strategy_supports_eager_flush() already asserts that "strategy" is non-null.
Also, its pair function StrategySweepNextBuffer() neither null-checks nor asserts strategy.
8 - 0003
```
bool
+strategy_supports_eager_flush(BufferAccessStrategy strategy)
```
This function is only used in bufmgr.c; can we move it there and make it static?
9 - 0004
```
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
+ batch->start = batch->bufdescs[0]->tag.blockNum;
+ batch->forkno = BufTagGetForkNum(&batch->bufdescs[0]->tag);
+ batch->rlocator = BufTagGetRelFileLocator(&batch->bufdescs[0]->tag);
+ batch->reln = smgropen(batch->rlocator, INVALID_PROC_NUMBER);
+
+ Assert(BlockNumberIsValid(batch->start));
```
Why not assert immediately after batch->start is assigned? That way, smgropen() is not called if the assertion fails.
11 - 0004
```
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
+ for (batch->n = 1; batch->n < limit; batch->n++)
+ {
+ Buffer bufnum;
+
+ if ((bufnum = StrategySweepNextBuffer(strategy)) == InvalidBuffer)
+ break;
```
Is the sweep's next buffer right next to start? If yes, can we assert that? My guess is no; and if so, is it possible for bufnum to wrap around and meet start again? If that can happen, we should check that the next buffer does not equal start.
12 - 0004
```
@@ -4306,19 +4370,22 @@ CleanVictimBuffer(BufferAccessStrategy strategy,
if (from_ring && strategy_supports_eager_flush(strategy))
{
+ uint32 max_batch_size = max_write_batch_size_for_strategy(strategy);
```
I think max_batch_size could be an attribute of the strategy, set when the strategy is created, so that we don't need to recalculate it in every round of cleaning.
13 - 0004
```
+void
+CompleteWriteBatchIO(BufWriteBatch *batch, IOContext io_context,
+ WritebackContext *wb_context)
+{
+ ErrorContextCallback errcallback =
+ {
+ .callback = shared_buffer_write_error_callback,
+ .previous = error_context_stack,
+ };
+
+ error_context_stack = &errcallback;
+ pgBufferUsage.shared_blks_written += batch->n;
```
Should we increase shared_blks_written only after the write-back loop is done?
14 - 0004
```
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
+uint32
+max_write_batch_size_for_strategy(BufferAccessStrategy strategy)
```
I think this function can be moved to bufmgr.c and made static.
15 - 0004
```
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1546,3 +1546,23 @@ PageSetChecksumInplace(Page page, BlockNumber blkno)
((PageHeader) page)->pd_checksum = pg_checksum_page(page, blkno);
}
+
+/*
+ * A helper to set multiple block's checksums.
+ */
+void
+PageSetBatchChecksumInplace(Page *pages, BlockNumber *blknos, uint32 length)
```
We should mark blknos as const to indicate it is read-only: const BlockNumber *blknos. That will also prevent incidental modification of blknos within the function.
16 - 0005
```
* instead. So "needs flush" is taken to mean whether minRecoveryPoint
* would need to be updated.
*/
- if (RecoveryInProgress())
+ if (RecoveryInProgress() && !XLogInsertAllowed())
```
As a new check is added, the comment should be updated accordingly.
17 - 0006
```
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -382,6 +382,7 @@ UnlockBufHdr(BufferDesc *desc, uint32 buf_state)
typedef struct CkptSortItem
{
Oid tsId;
+ Oid db_id;
```
I think "db_id" should be named "dbId" or "dbOid". Let's keep the naming convention consistent.
18 - 0007
```
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
+ max_batch_size = checkpointer_max_batch_size();
```
Looks like we don't need to calculate max_batch_size inside the for loop.
19 - 0007
```
+ * Each batch will have exactly one start and one max lsn and one
+ * length.
*/
```
I don't get what this comment is meant to explain; it seems unnecessary.
Best regards,
—
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Wed, Sep 10, 2025 at 4:24 AM Chao Li <li.evan.chao@gmail.com> wrote:

Thanks for the review! For any of your feedback that I simply
implemented, I omitted an inline comment about it. Those changes are
included in the attached v6. My inline replies below are only for
feedback requiring more discussion.

> On Sep 10, 2025, at 01:55, Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> 2 - 0001
> ```
> --- a/src/backend/storage/buffer/freelist.c
> +++ b/src/backend/storage/buffer/freelist.c
>
> + if (XLogNeedsFlush(lsn))
> + {
> + /*
> + * Remove the dirty buffer from the ring; necessary to prevent an
> + * infinite loop if all ring members are dirty.
> + */
> + strategy->buffers[strategy->current] = InvalidBuffer;
> + return true;
> + }
>
> - return true;
> + return false;
> }
> ```
>
> We can do:
> ```
> if (!XLogNeedsFlush(lsn))
>     return false;
>
> /* Remove the dirty buffer ….
>  */
> return true;
> }
> ```

This would make the order of evaluation the same as master, but I
actually prefer it this way because then we only take the buffer
header spinlock if there is a chance we will reject the buffer (e.g.
we don't need to examine it for strategies other than BAS_BULKREAD).

> 4 - 0002
> ```
> - /* OK, do the I/O */
> - FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
> - LWLockRelease(content_lock);
> -
> - ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
> - &buf_hdr->tag);
> + CleanVictimBuffer(buf_hdr, &buf_state, from_ring, io_context);
> ```
> I see that CleanVictimBuffer() gets content_lock from the bufdesc and releases it, but that makes the code hard to understand. Readers might be confused about why content_lock is not released after CleanVictimBuffer() returns.
>
> I'd suggest passing content_lock to CleanVictimBuffer() as a parameter, which gives a clear hint that CleanVictimBuffer() will release the lock.

I think for this specific patch in the set your idea makes sense.
However, in the later patch to do write combining, I release the
content locks for the batch in CompleteWriteBatchIO(), and having the
start buffer's lock separate as a parameter would force me to have a
special case handling it. I've added a comment to both
CleanVictimBuffer() and its caller specifying that the lock must be
held and that it will be released inside CleanVictimBuffer().

> 5 - 0002
> ```
> * disastrous system-wide consequences. To make sure that can't happen,
> * skip the flush if the buffer isn't permanent.
> */
> - if (buf_state & BM_PERMANENT)
> - XLogFlush(recptr);
> + if (!XLogRecPtrIsInvalid(buffer_lsn))
> + XLogFlush(buffer_lsn);
> ```
>
> Why is this check changed? Should the comment be updated accordingly? It says "if the buffer isn't permanent", which reflects the old code.

It's changed because I split the logic for flushing to that LSN and
determining the LSN across the Prepare and Do functions. This is
needed because when we do batches, we want to flush to the max LSN
across all buffers in the batch. I check if the buffer is BM_PERMANENT
in PrepareFlushBuffer(). You make a good point about my comment,
though. I've moved it to PrepareFlushBuffer() and updated it.

> 8 - 0003
> ```
> bool
> +strategy_supports_eager_flush(BufferAccessStrategy strategy)
> ```
>
> This function is only used in bufmgr.c; can we move it there and make it static?

BufferAccessStrategyData is opaque to bufmgr.c. Only freelist.c can
access it. I agree it is gross that I have these helpers and functions
that would otherwise be static, though.
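[Editor's note: to make the max-LSN point above concrete, a minimal generic sketch with simplified types; batch_max_lsn and page_lsns are invented names, not the patch's:]

```
/*
 * Generic sketch: one WAL flush up to the highest page LSN in the batch
 * covers every buffer about to be written.
 */
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* simplified stand-in for the real type */

static XLogRecPtr
batch_max_lsn(const XLogRecPtr *page_lsns, int n)
{
    XLogRecPtr max = 0;         /* 0 plays the role of InvalidXLogRecPtr */

    for (int i = 0; i < n; i++)
        if (page_lsns[i] > max)
            max = page_lsns[i];

    return max;                 /* caller issues a single flush to here */
}
```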
> 10 - 0004
> ```
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
>
> + limit = Min(max_batch_size, limit);
> ```
>
> Do we need to check that max_batch_size is less than (MAX_IO_COMBINE_LIMIT - 1)? Because BufWriteBatch.bufdescs is defined with length MAX_IO_COMBINE_LIMIT, and the first slot has been used to store "start".

I assert that in StrategyMaxWriteBatchSize(). io_combine_limit is not
allowed to exceed MAX_IO_COMBINE_LIMIT, so it shouldn't happen anyway,
since we are capping ourselves at io_combine_limit. Or is that your
point?

> 11 - 0004
> ```
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
>
> + for (batch->n = 1; batch->n < limit; batch->n++)
> + {
> + Buffer bufnum;
> +
> + if ((bufnum = StrategySweepNextBuffer(strategy)) == InvalidBuffer)
> + break;
> ```
>
> Is the sweep's next buffer right next to start? If yes, can we assert that? My guess is no; and if so, is it possible for bufnum to wrap around and meet start again? If that can happen, we should check that the next buffer does not equal start.

Ah, great point. I didn't think about this. Our sweep will always
start right after the start buffer, but then if it goes all the way
around, it will "lap" the start buffer. Because of this, and because I
think it is weird to have the sweep variables in the
BufferAccessStrategy object, I've changed my approach in attached v6.
I set sweep_end to be the start block in the batch and then pass
around a sweep cursor variable. Hitting sweep_end is the termination
condition.

> 12 - 0004
> ```
> @@ -4306,19 +4370,22 @@ CleanVictimBuffer(BufferAccessStrategy strategy,
>
> if (from_ring && strategy_supports_eager_flush(strategy))
> {
> + uint32 max_batch_size = max_write_batch_size_for_strategy(strategy);
> ```
>
> I think max_batch_size could be an attribute of the strategy, set when the strategy is created, so that we don't need to recalculate it in every round of cleaning.

Actually, the max pin limit can change quite frequently. See
GetAdditionalPinLimit()'s usage in read stream code. If the query is
pinning other buffers in another part of the query, it can change our
limit. I'm not sure if I should call GetAdditionalPinLimit() for each
batch or for each run of batches (like in StrategyMaxWriteBatchSize()).
Currently, I call it for each batch (in FindFlushAdjacents()). The
read stream calls it pretty frequently (each
read_stream_start_pending_read()). But, in the batch flush case,
nothing could change between batches in a run of batches. So maybe I
should move it up and out and make it per run of batches...

> 13 - 0004
> ```
> +void
> +CompleteWriteBatchIO(BufWriteBatch *batch, IOContext io_context,
> + WritebackContext *wb_context)
> +{
> + ErrorContextCallback errcallback =
> + {
> + .callback = shared_buffer_write_error_callback,
> + .previous = error_context_stack,
> + };
> +
> + error_context_stack = &errcallback;
> + pgBufferUsage.shared_blks_written += batch->n;
> ```
>
> Should we increase shared_blks_written only after the write-back loop is done?

On master, FlushBuffer() does it after smgrwrite() (before writeback).
I think pgBufferUsage is mainly used in EXPLAIN (also
pg_stat_statements), which won't be used until the end of the query
and won't be displayed if we error out.
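[Editor's note: a minimal generic sketch of the sweep termination described above, with invented names -- the real code walks the strategy ring's slots. The cursor starts just after the start buffer, and reaching sweep_end means the sweep has gone all the way around without lapping the start:]

```
/*
 * Generic sketch: advance a cursor around a ring of ring_size slots,
 * stopping when it comes back to sweep_end (the start buffer's slot).
 */
static int
sweep_next(int cursor, int ring_size, int sweep_end)
{
    cursor = (cursor + 1) % ring_size;
    return (cursor == sweep_end) ? -1 : cursor;    /* -1: sweep complete */
}
```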
> 14 - 0004
> ```
> --- a/src/backend/storage/buffer/freelist.c
> +++ b/src/backend/storage/buffer/freelist.c
>
> +uint32
> +max_write_batch_size_for_strategy(BufferAccessStrategy strategy)
> ```
>
> I think this function can be moved to bufmgr.c and made static.

This technically could be moved, but it is a function giving you
information about a strategy, which seemed to fit better in
freelist.c.

> 18 - 0007
> ```
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
>
> + max_batch_size = checkpointer_max_batch_size();
> ```
>
> Looks like we don't need to calculate max_batch_size inside the for loop.

I don't think it's in the for loop.

- Melanie
Attachments
- v6-0002-Split-FlushBuffer-into-two-parts.patch
- v6-0003-Eagerly-flush-bulkwrite-strategy-ring.patch
- v6-0005-Fix-XLogNeedsFlush-for-checkpointer.patch
- v6-0001-Refactor-goto-into-for-loop-in-GetVictimBuffer.patch
- v6-0004-Write-combining-for-BAS_BULKWRITE.patch
- v6-0006-Add-database-Oid-to-CkptSortItem.patch
- v6-0007-Implement-checkpointer-data-write-combining.patch
On Sep 12, 2025, at 07:11, Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> On Wed, Sep 10, 2025 at 4:24 AM Chao Li <li.evan.chao@gmail.com> wrote:
>
> Thanks for the review! For any of your feedback that I simply
> implemented, I omitted an inline comment about it. Those changes are
> included in the attached v6. My inline replies below are only for
> feedback requiring more discussion.
>
>> On Sep 10, 2025, at 01:55, Melanie Plageman <melanieplageman@gmail.com> wrote:
>>
>> 2 - 0001
>> ```
>> --- a/src/backend/storage/buffer/freelist.c
>> +++ b/src/backend/storage/buffer/freelist.c
>>
>> + if (XLogNeedsFlush(lsn))
>> + {
>> + /*
>> + * Remove the dirty buffer from the ring; necessary to prevent an
>> + * infinite loop if all ring members are dirty.
>> + */
>> + strategy->buffers[strategy->current] = InvalidBuffer;
>> + return true;
>> + }
>>
>> - return true;
>> + return false;
>> }
>> ```
>>
>> We can do:
>> ```
>> if (!XLogNeedsFlush(lsn))
>>     return false;
>>
>> /* Remove the dirty buffer ….
>>  */
>> return true;
>> }
>> ```
>
> This would make the order of evaluation the same as master, but I
> actually prefer it this way because then we only take the buffer
> header spinlock if there is a chance we will reject the buffer (e.g.
> we don't need to examine it for strategies other than BAS_BULKREAD).

I don't understand why the two versions are different:
I don’t understand why the two versions are different:
```
if (XLogNeedsFlush(lsn))
{
    /*
     * Remove the dirty buffer from the ring; necessary to prevent an
     * infinite loop if all ring members are dirty.
     */
    strategy->buffers[strategy->current] = InvalidBuffer;
    return true;
}

return false;
```

VS

```
if (!XLogNeedsFlush(lsn))
    return false;

/*
 * Remove the dirty buffer from the ring; necessary to prevent an
 * infinite loop if all ring members are dirty.
 */
strategy->buffers[strategy->current] = InvalidBuffer;
return true;
```
>> 10 - 0004
>> ```
>> --- a/src/backend/storage/buffer/bufmgr.c
>> +++ b/src/backend/storage/buffer/bufmgr.c
>>
>> + limit = Min(max_batch_size, limit);
>> ```
>>
>> Do we need to check that max_batch_size is less than (MAX_IO_COMBINE_LIMIT - 1)? Because BufWriteBatch.bufdescs is defined with length MAX_IO_COMBINE_LIMIT, and the first slot has been used to store "start".
>
> I assert that in StrategyMaxWriteBatchSize(). io_combine_limit is not
> allowed to exceed MAX_IO_COMBINE_LIMIT, so it shouldn't happen anyway,
> since we are capping ourselves at io_combine_limit. Or is that your
> point?
Please ignore comment 10. I thought I had crossed it out in my original email: I added the comment, then later found you had already checked MAX_IO_COMBINE_LIMIT in the other function, so I tried to delete it by striking it through.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/