Thread: Checkpointer write combining


Checkpointer write combining

From
Melanie Plageman
Date:
Hi,

The attached patchset implements checkpointer write combining -- which
makes immediate checkpoints at least 20% faster in my tests.
The checkpointer achieves higher write throughput and higher write IOPS
with the patch.

Besides the immediate performance gain with the patchset, we will
eventually need all writers to do write combining if we want to use
direct IO. Additionally, I think the general shape I refactored
BufferSync() into will be useful for AIO-ifying checkpointer.

The patch set has preliminary patches (0001-0004) that implement eager
flushing and write combining for bulkwrites (like COPY FROM). The
functions used to flush a batch of writes for bulkwrites (see 0004)
are reused for the checkpointer. The eager flushing component of this
patch set has been discussed elsewhere [1].
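The core idea of write combining -- turning runs of adjacent dirty blocks into single larger writes -- can be illustrated with a standalone sketch. The names and structures below are invented for illustration; they are not the patchset's actual API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct WriteRun
{
    uint32_t    start;          /* first block number in the run */
    uint32_t    nblocks;        /* number of consecutive blocks */
} WriteRun;

/*
 * Group a sorted array of dirty block numbers into runs of consecutive
 * blocks, each capped at combine_limit, the way write combining batches
 * adjacent buffers into one I/O. Returns the number of runs emitted.
 */
static size_t
combine_writes(const uint32_t *blocks, size_t n, uint32_t combine_limit,
               WriteRun *runs)
{
    size_t      nruns = 0;

    for (size_t i = 0; i < n;)
    {
        size_t      j = i + 1;

        while (j < n &&
               blocks[j] == blocks[j - 1] + 1 &&
               (uint32_t) (j - i) < combine_limit)
            j++;

        runs[nruns].start = blocks[i];
        runs[nruns].nblocks = (uint32_t) (j - i);
        nruns++;
        i = j;
    }
    return nruns;
}
```

With checkpointer's sorted CkptSortItem array, adjacent dirty blocks of the same relation fork naturally form such runs, which is what makes the sort-then-combine approach effective.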

0005 implements a fix for XLogNeedsFlush() when called by checkpointer
during an end-of-crash-recovery checkpoint. I've already started
another thread about this [2], but the patch is required for the patch
set to pass tests.

One outstanding action item is to test whether there are any benefits
for spread checkpoints.

More on how I measured the performance benefit to immediate checkpoints:

I tuned checkpoint_completion_target, checkpoint_timeout, and min and
max_wal_size to ensure no other checkpoints were initiated.

With 16 GB shared buffers and io_combine_limit 128, I created a 15 GB
table. To get consistent results, I used pg_prewarm to read the table
into shared buffers, issued a checkpoint, then used Bilal's patch [3]
to mark all the buffers as dirty again and issue another checkpoint.
On a fast local SSD, this proved to be a consistent 20%+ speed up
(~6.5 seconds to ~5 seconds).
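As a rough illustration, the measurement procedure above corresponds to a psql session along these lines (the table name is a placeholder, and the re-dirtying step relies on Bilal's patch [3], which is not in stock PostgreSQL):

```sql
-- one-time setup: 15 GB table, read into 16 GB shared buffers
SELECT pg_prewarm('big_table');
CHECKPOINT;                 -- write everything out once

-- ... mark all shared buffers dirty again using the patch from [3] ...

\timing on
CHECKPOINT;                 -- timed: the immediate checkpoint under test
```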

- Melanie

[1] https://www.postgresql.org/message-id/CAAKRu_Yjn4mvN9NBxtmsCQSGwup45CoA4e05nhR7ADP-v0WCig@mail.gmail.com
[2] https://www.postgresql.org/message-id/CAAKRu_a1vZRZRWO3_jv_X13RYoqLRVipGO0237g5PKzPa2YX6g%40mail.gmail.com
[3] https://www.postgresql.org/message-id/flat/CAN55FZ0h_YoSqqutxV6DES1RW8ig6wcA8CR9rJk358YRMxZFmw%40mail.gmail.com

Attachments

Re: Checkpointer write combining

From
Melanie Plageman
Date:
On Tue, Sep 2, 2025 at 5:10 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> The attached patchset implements checkpointer write combining -- which
> makes immediate checkpoints at least 20% faster in my tests.
> Checkpointer achieves higher write throughput and higher write IOPs
> with the patch.

These needed a rebase. Attached v2.

- Melanie

Attachments

Re: Checkpointer write combining

From
Nazir Bilal Yavuz
Date:
Hi,

Thank you for working on this!

On Tue, 9 Sept 2025 at 02:44, Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Tue, Sep 2, 2025 at 5:10 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > The attached patchset implements checkpointer write combining -- which
> > makes immediate checkpoints at least 20% faster in my tests.
> > Checkpointer achieves higher write throughput and higher write IOPs
> > with the patch.

I ran the same benchmark you did and found it is 50% faster (16
seconds to 8 seconds).

From 053dd9d15416d76ce4b95044d848f51ba13a2d20 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 11:00:44 -0400
Subject: [PATCH v2 1/9] Refactor goto into for loop in GetVictimBuffer()

@@ -731,6 +741,13 @@ StrategyRejectBuffer(BufferAccessStrategy
strategy, BufferDesc *buf, bool from_r
         strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
         return false;

+    buf_state = LockBufHdr(buf);
+    lsn = BufferGetLSN(buf);
+    UnlockBufHdr(buf, buf_state);
+
+    if (!XLogNeedsFlush(lsn))
+        return true;

I think this should return false.

I am planning to review the other patches later and this is for the
first patch only.

--
Regards,
Nazir Bilal Yavuz
Microsoft



Re: Checkpointer write combining

From
Melanie Plageman
Date:
On Tue, Sep 9, 2025 at 9:27 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>

Thanks for the review!

> From 053dd9d15416d76ce4b95044d848f51ba13a2d20 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Tue, 2 Sep 2025 11:00:44 -0400
> Subject: [PATCH v2 1/9] Refactor goto into for loop in GetVictimBuffer()
>
> @@ -731,6 +741,13 @@ StrategyRejectBuffer(BufferAccessStrategy
> strategy, BufferDesc *buf, bool from_r
>          strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
>          return false;
>
> +    buf_state = LockBufHdr(buf);
> +    lsn = BufferGetLSN(buf);
> +    UnlockBufHdr(buf, buf_state);
> +
> +    if (!XLogNeedsFlush(lsn))
> +        return true;
>
> I think this should return false.

Oops, you're right. v3 attached with that mistake fixed.
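For reference, the corrected control flow can be sketched in isolation, with stub types standing in for PostgreSQL's buffer-manager internals (this is a simplified model, not the actual patch code): the function rejects the buffer (returns true) only when flushing it would require a WAL flush, and otherwise keeps it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
#define InvalidBuffer 0

/* Stub: pretend WAL up to LSN 100 is already flushed. */
static bool
XLogNeedsFlush(XLogRecPtr lsn)
{
    return lsn > 100;
}

typedef struct
{
    int         buffers[16];
    int         current;
} StrategyStub;

static bool
strategy_reject_buffer(StrategyStub *strategy, XLogRecPtr lsn)
{
    if (!XLogNeedsFlush(lsn))
        return false;           /* WAL already flushed: reuse the buffer */

    /*
     * Remove the dirty buffer from the ring; necessary to prevent an
     * infinite loop if all ring members are dirty.
     */
    strategy->buffers[strategy->current] = InvalidBuffer;
    return true;
}
```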

- Melanie

Attachments

Re: Checkpointer write combining

From
Melanie Plageman
Date:

Re: Checkpointer write combining

From
Melanie Plageman
Date:

Re: Checkpointer write combining

From
Jeff Davis
Date:
On Tue, 2025-09-09 at 13:55 -0400, Melanie Plageman wrote:
> On Tue, Sep 9, 2025 at 11:16 AM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > One more fix and a bit more cleanup in attached v4.
>
> Okay one more version: I updated the thread on eager flushing the
> bulkwrite ring [1], and some updates were needed here.

v5-0005 comments:

  * Please update the comment above the code change.
  * The last paragraph in the commit message has a typo: "potentially
update the local copy of min recovery point, when xlog inserts are
*not* allowed", right?
  * Shouldn't the code be consistent between XLogNeedsFlush() and
XLogFlush()? The latter only checks for !XLogInsertAllowed(), whereas
the former also checks for RecoveryInProgress().

I'm still not sure I understand the problem situation this is fixing,
but that's being discussed in another thread.

Regards,
    Jeff Davis




Re: Checkpointer write combining

From
Chao Li
Date:


On Sep 10, 2025, at 01:55, Melanie Plageman <melanieplageman@gmail.com> wrote:


[1] https://www.postgresql.org/message-id/flat/CAAKRu_Yjn4mvN9NBxtmsCQSGwup45CoA4e05nhR7ADP-v0WCig%40mail.gmail.com
<v5-0004-Write-combining-for-BAS_BULKWRITE.patch><v5-0001-Refactor-goto-into-for-loop-in-GetVictimBuffer.patch><v5-0005-Fix-XLogNeedsFlush-for-checkpointer.patch><v5-0002-Split-FlushBuffer-into-two-parts.patch><v5-0003-Eagerly-flush-bulkwrite-strategy-ring.patch><v5-0006-Add-database-Oid-to-CkptSortItem.patch><v5-0007-Implement-checkpointer-data-write-combining.patch>

1 - 0001
```
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c

+ * The buffer must pinned and content locked and the buffer header spinlock
```

“Must pinned” -> “must be pinned"

2 - 0001
```
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c

+ if (XLogNeedsFlush(lsn))
+ {
+ /*
+ * Remove the dirty buffer from the ring; necessary to prevent an
+ * infinite loop if all ring members are dirty.
+ */
+ strategy->buffers[strategy->current] = InvalidBuffer;
+ return true;
+ }
 
- return true;
+ return false;
 }
```

We can do:
```
    if (!XLogNeedsFlush(lsn))
        return false;

    /* Remove the dirty buffer ….
     */
    return true;
}
```

This way makes less diff.

3 - 0002
```
+ * Prepare to write and write a dirty victim buffer.
```

Prepare to write a dirty victim buffer.

4 - 0002
```
- /* OK, do the I/O */
- FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
- LWLockRelease(content_lock);
-
- ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-  &buf_hdr->tag);
+ CleanVictimBuffer(buf_hdr, &buf_state, from_ring, io_context);
```
I saw CleanVictimBuffer() will get content_lock from bufdesc and release it, but it makes the code hard to understand. Readers might be confused about why content_lock is not released after CleanVictimBuffer() without further reading CleanVictimBuffer().

I’d suggest pass content_lock to CleanVictimBuffer() as a parameter, which gives a clear hint that CleanVictimBuffer() will release the lock.

5 - 0002
```
  * disastrous system-wide consequences.  To make sure that can't happen,
  * skip the flush if the buffer isn't permanent.
  */
- if (buf_state & BM_PERMANENT)
- XLogFlush(recptr);
+ if (!XLogRecPtrIsInvalid(buffer_lsn))
+ XLogFlush(buffer_lsn);
```

Why is this check changed? Shouldn't the comment be updated accordingly, since it says "if the buffer isn't permanent", which reflects the old code?

6 - 0003
```
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c

+ * patterns than lazily flushing buffers directly before reusing them.
+ */
```

Here "directly" is kind of ambiguous. It could mean "immediately before" or "without going through something else". My understanding is "immediately". If that is true, please change "directly" to "immediately" or just remove it.

7 - 0003
```
+void
+StartStrategySweep(BufferAccessStrategy strategy)
+{
+ if (!strategy)
+ return;
```

I doubt this "strategy" null check is needed, because the function is only called when strategy_supports_eager_flush() is true, and strategy_supports_eager_flush() has already asserted "strategy".

Also, its pair function StrategySweepNextBuffer() neither null-checks nor asserts strategy.

8 - 0003
```
bool
+strategy_supports_eager_flush(BufferAccessStrategy strategy)
```

This function is only used in bufmgr.c, can we move it there and make it static?

9 - 0004
```
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c

+ batch->start = batch->bufdescs[0]->tag.blockNum;
+ batch->forkno = BufTagGetForkNum(&batch->bufdescs[0]->tag);
+ batch->rlocator = BufTagGetRelFileLocator(&batch->bufdescs[0]->tag);
+ batch->reln = smgropen(batch->rlocator, INVALID_PROC_NUMBER);
+
+ Assert(BlockNumberIsValid(batch->start));
```

Why not assert immediately after batch->start is assigned? That way, upon error, smgropen() will not be called.

10 - 0004
```
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c

+ limit = Min(max_batch_size, limit);
```

Do we need to check that max_batch_size is less than (MAX_IO_COMBINE_LIMIT-1)? Because BufWriteBatch.bufdescs is defined with length MAX_IO_COMBINE_LIMIT, and the first slot has been used to store "start".

11 - 0004
```
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c

+ for (batch->n = 1; batch->n < limit; batch->n++)
+ {
+ Buffer bufnum;
+
+ if ((bufnum = StrategySweepNextBuffer(strategy)) == InvalidBuffer)
+ break;
```

Is the sweep's next buffer right after start? If yes, can we assert that? But my guess is no; and if so, is it possible that bufnum wraps around and meets start? If that's true, then we should check that the next buffer doesn't equal start.

12 - 0004
```
@@ -4306,19 +4370,22 @@ CleanVictimBuffer(BufferAccessStrategy strategy,
 
  if (from_ring && strategy_supports_eager_flush(strategy))
  {
+ uint32 max_batch_size = max_write_batch_size_for_strategy(strategy);
```

I think max_batch_size can be an attribute of strategy, set when creating the strategy, so that we don't need to calculate it in every round of cleaning.

13 - 0004
```
+void
+CompleteWriteBatchIO(BufWriteBatch *batch, IOContext io_context,
+ WritebackContext *wb_context)
+{
+ ErrorContextCallback errcallback =
+ {
+ .callback = shared_buffer_write_error_callback,
+ .previous = error_context_stack,
+ };
+
+ error_context_stack = &errcallback;
+ pgBufferUsage.shared_blks_written += batch->n;
```

Should we increase shared_blks_written only after the loop of write-back is done?

14 - 0004
```
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c

+uint32
+max_write_batch_size_for_strategy(BufferAccessStrategy strategy)
```

I think this function can be moved to bufmgr.c and make it static.

15 - 0004
```
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1546,3 +1546,23 @@ PageSetChecksumInplace(Page page, BlockNumber blkno)
 
  ((PageHeader) page)->pd_checksum = pg_checksum_page(page, blkno);
 }
+
+/*
+ * A helper to set multiple block's checksums.
+ */
+void
+PageSetBatchChecksumInplace(Page *pages, BlockNumber *blknos, uint32 length)
```

We should mark blknos as const to indicate it is read-only: const BlockNumber *blknos. That will also prevent accidental changes to blknos within the function.

16 - 0005
```
  * instead. So "needs flush" is taken to mean whether minRecoveryPoint
  * would need to be updated.
  */
- if (RecoveryInProgress())
+ if (RecoveryInProgress() && !XLogInsertAllowed())
```

As a new check is added, the comment should be updated accordingly.

17 - 0006
```
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -382,6 +382,7 @@ UnlockBufHdr(BufferDesc *desc, uint32 buf_state)
 typedef struct CkptSortItem
 {
  Oid tsId;
+ Oid db_id;
```

I think "db_id" should be named "dbId" or "dbOid". Let's keep the naming convention consistent.

18 - 0007
```
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c

+ max_batch_size = checkpointer_max_batch_size();
```

Looks like we don't need to calculate max_batch_size in the for loop.

19 - 0007
```
+ * Each batch will have exactly one start and one max lsn and one
+ * length.
  */
```

I don't get what this comment is meant to explain. It sounds quite unnecessary.

Best regards,
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/




Re: Checkpointer write combining

From
Melanie Plageman
Date:
On Wed, Sep 10, 2025 at 4:24 AM Chao Li <li.evan.chao@gmail.com> wrote:
>

Thanks for the review!

For any of your feedback that I simply implemented, I omitted an
inline comment about it. Those changes are included in the attached
v6. My inline replies below are only for feedback requiring more
discussion.

> On Sep 10, 2025, at 01:55, Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> 2 - 0001
> ```
> --- a/src/backend/storage/buffer/freelist.c
> +++ b/src/backend/storage/buffer/freelist.c
>
> + if (XLogNeedsFlush(lsn))
> + {
> + /*
> + * Remove the dirty buffer from the ring; necessary to prevent an
> + * infinite loop if all ring members are dirty.
> + */
> + strategy->buffers[strategy->current] = InvalidBuffer;
> + return true;
> + }
>
> - return true;
> + return false;
>  }
> ```
>
> We can do:
> ```
>     if (!XLogNeedsFlush(lsn))
>         return false;
>
>     /* Remove the dirty buffer ….
>      */
>     return true;
> }
> ```

This would make the order of evaluation the same as master, but I
actually prefer it this way because then we only take the buffer
header spinlock if there is a chance we will reject the buffer (e.g.
we don't need to examine it for strategies other than BAS_BULKREAD).

> 4 - 0002
> ```
> - /* OK, do the I/O */
> - FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
> - LWLockRelease(content_lock);
> -
> - ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
> -  &buf_hdr->tag);
> + CleanVictimBuffer(buf_hdr, &buf_state, from_ring, io_context);
> ```
> I saw CleanVictimBuffer() will get content_lock from bufdesc and release it, but it makes the code hard to
> understand. Readers might be confused about why content_lock is not released after CleanVictimBuffer() without
> further reading CleanVictimBuffer().
>
> I'd suggest passing content_lock to CleanVictimBuffer() as a parameter, which gives a clear hint that
> CleanVictimBuffer() will release the lock.

I think for this specific patch in the set your idea makes sense.
However, in the later patch to do write combining, I release the
content locks for the batch in CompleteWriteBatchIO() and having the
start buffer's lock separate as a parameter would force me to have a
special case handling this.

I've added a comment to both CleanVictimBuffer() and its caller
specifying that the lock must be held and that it will be released
inside CleanVictimBuffer.

> 5 - 0002
> ```
>   * disastrous system-wide consequences.  To make sure that can't happen,
>   * skip the flush if the buffer isn't permanent.
>   */
> - if (buf_state & BM_PERMANENT)
> - XLogFlush(recptr);
> + if (!XLogRecPtrIsInvalid(buffer_lsn))
> + XLogFlush(buffer_lsn);
> ```
>
> Why is this check changed? Shouldn't the comment be updated accordingly, since it says "if the buffer isn't
> permanent", which reflects the old code?

It's changed because I split the logic for flushing to that LSN and
determining the LSN across the Prepare and Do functions. This is
needed because when we do batches, we want to flush to the max LSN
across all buffers in the batch.

I check if the buffer is BM_PERMANENT in PrepareFlushBuffer(). You
make a good point about my comment, though. I've moved it to
PrepareFlushBuffer() and updated it.

> 8 - 0003
> ```
> bool
> +strategy_supports_eager_flush(BufferAccessStrategy strategy)
> ```
>
> This function is only used in bufmgr.c, can we move it there and make it static?

BufferAccessStrategyData is opaque to bufmgr.c. Only freelist.c can
access it. I agree it is gross that I have these helpers and functions
that would otherwise be static, though.

> 10 - 0004
> ```
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
>
> + limit = Min(max_batch_size, limit);
> ```
>
> Do we need to check that max_batch_size is less than (MAX_IO_COMBINE_LIMIT-1)? Because BufWriteBatch.bufdescs is
> defined with length MAX_IO_COMBINE_LIMIT, and the first slot has been used to store "start".

I assert that in StrategyMaxWriteBatchSize(). io_combine_limit is not
allowed to exceed MAX_IO_COMBINE_LIMIT, so it shouldn't happen anyway,
since we are capping ourselves at io_combine_limit. Or is that your
point?

> 11 - 0004
> ```
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
>
> + for (batch->n = 1; batch->n < limit; batch->n++)
> + {
> + Buffer bufnum;
> +
> + if ((bufnum = StrategySweepNextBuffer(strategy)) == InvalidBuffer)
> + break;
> ```
>
> Is the sweep's next buffer right after start? If yes, can we assert that? But my guess is no; and if so, is it
> possible that bufnum wraps around and meets start? If that's true, then we should check that the next buffer
> doesn't equal start.

Ah, great point. I didn't think about this. Our sweep will always
start right after the start buffer, but then if it goes all the way
around, it will "lap" the start buffer. Because of this and because I
think it is weird to have the sweep variables in the
BufferAccessStrategy object, I've changed my approach in attached v6.
I set sweep_end to be the start block in the batch and then pass
around a sweep cursor variable. Hitting sweep_end is the termination
condition.
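The termination condition described here -- a cursor that starts just after the batch's start slot and stops when it comes back to sweep_end, so it can never "lap" the starting buffer -- can be sketched with hypothetical names (this is a model of the idea, not the v6 code):

```c
#include <assert.h>

/*
 * Advance the cursor one slot around a ring of nbuffers entries.
 * Returns -1 once the cursor reaches sweep_end, meaning the sweep has
 * come all the way around and is complete.
 */
static int
sweep_next(int cursor, int sweep_end, int nbuffers)
{
    cursor = (cursor + 1) % nbuffers;
    return (cursor == sweep_end) ? -1 : cursor;
}
```

Because sweep_end is the slot the sweep started from, every other slot is visited exactly once before the sweep terminates.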

> 12 - 0004
> ```
> @@ -4306,19 +4370,22 @@ CleanVictimBuffer(BufferAccessStrategy strategy,
>
>   if (from_ring && strategy_supports_eager_flush(strategy))
>   {
> + uint32 max_batch_size = max_write_batch_size_for_strategy(strategy);
> ```
>
> I think max_batch_size can be an attribute of strategy, set when creating the strategy, so that we don't need to
> calculate it in every round of cleaning.

Actually, the max pin limit can change quite frequently. See
GetAdditionalPinLimit()'s usage in read stream code. If the query is
pinning other buffers in another part of the query, it can change our
limit.

I'm not sure if I should call GetAdditionalPinLimit() for each batch
or for each run of batches (like in StrategyMaxWriteBatchSize()).
Currently, I call it for each batch (in FindFlushAdjacents()). The
read stream calls it pretty frequently (each
read_stream_start_pending_read()). But, in the batch flush case,
nothing could change between batches in a run of batches. So maybe I
should move it up and out and make it per run of batches...

> 13 - 0004
> ```
> +void
> +CompleteWriteBatchIO(BufWriteBatch *batch, IOContext io_context,
> + WritebackContext *wb_context)
> +{
> + ErrorContextCallback errcallback =
> + {
> + .callback = shared_buffer_write_error_callback,
> + .previous = error_context_stack,
> + };
> +
> + error_context_stack = &errcallback;
> + pgBufferUsage.shared_blks_written += batch->n;
> ```
>
> Should we only increase shared_blks_written only after the loop of write-back is done?

On master, FlushBuffer() does it after smgrwrite() (before writeback).
I think pgBufferUsage is mainly used in EXPLAIN (also
pg_stat_statements) which won't be used until the end of the query and
won't be displayed if we error out.

> 14 - 0004
> ```
> --- a/src/backend/storage/buffer/freelist.c
> +++ b/src/backend/storage/buffer/freelist.c
>
> +uint32
> +max_write_batch_size_for_strategy(BufferAccessStrategy strategy)
> ```
>
> I think this function can be moved to bufmgr.c and make it static.

This technically could be moved, but it is a function giving you
information about a strategy which seemed to fit better in freelist.c.

> 18 - 0007
> ```
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
>
> + max_batch_size = checkpointer_max_batch_size();
> ```
>
> Look like we don’t need to calculate max_batch_size in the for loop.

I don't think it's in the for loop.

- Melanie

Attachments

Re: Checkpointer write combining

From
Chao Li
Date:


On Sep 12, 2025, at 07:11, Melanie Plageman <melanieplageman@gmail.com> wrote:

On Wed, Sep 10, 2025 at 4:24 AM Chao Li <li.evan.chao@gmail.com> wrote:


Thanks for the review!

For any of your feedback that I simply implemented, I omitted an
inline comment about it. Those changes are included in the attached
v6. My inline replies below are only for feedback requiring more
discussion.

On Sep 10, 2025, at 01:55, Melanie Plageman <melanieplageman@gmail.com> wrote:

2 - 0001
```
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c

+ if (XLogNeedsFlush(lsn))
+ {
+ /*
+ * Remove the dirty buffer from the ring; necessary to prevent an
+ * infinite loop if all ring members are dirty.
+ */
+ strategy->buffers[strategy->current] = InvalidBuffer;
+ return true;
+ }

- return true;
+ return false;
}
```

We can do:
```
    if (!XLogNeedsFlush(lsn))
        return false;

    /* Remove the dirty buffer ….
     */
    return true;
}
```

This would make the order of evaluation the same as master but I
actually prefer it this way because then we only take the buffer
header spinlock if there is a chance we will reject the buffer (e.g.
we don't need to examine it for strategies except BAS_BULKREAD)



I don’t understand why the two versions are different:

if (XLogNeedsFlush(lsn))
{
/*
* Remove the dirty buffer from the ring; necessary to prevent an
* infinite loop if all ring members are dirty.
*/
strategy->buffers[strategy->current] = InvalidBuffer;
return true;
}

return false;

VS

if (XLogNeedsFlush(lsn))
return false;

/*
* Remove the dirty buffer from the ring; necessary to prevent an
* infinite loop if all ring members are dirty.
*/
strategy->buffers[strategy->current] = InvalidBuffer;
return true;


10 - 0004
```
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c

+ limit = Min(max_batch_size, limit);
```

Do we need to check max_batch_size should be less than (MAX_IO_COMBINE_LIMIT-1)? Because BufWriteBatch.bufdescs is defined with length of MAX_IO_COMBINE_LIMIT, and the first place has been used to store “start”.

I assert that in StrategyMaxWriteBatchSize(). io_combine_limit is not
allowed to exceed MAX_IO_COMBINE_LIMIT, so it shouldn't happen anyway,
since we are capping ourselves at io_combine_limit. Or is that your
point?


Please ignore comment 10. I crossed it out in my original email: I added the comment, then later found you had already checked MAX_IO_COMBINE_LIMIT in the other function, so I tried to delete it by striking through the comment.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/




Re: Checkpointer write combining

From
Melanie Plageman
Date:
On Thu, Sep 11, 2025 at 11:33 PM Chao Li <li.evan.chao@gmail.com> wrote:
>
> I don’t understand why the two versions are different:
>
> if (XLogNeedsFlush(lsn))
> {
> /*
> * Remove the dirty buffer from the ring; necessary to prevent an
> * infinite loop if all ring members are dirty.
> */
> strategy->buffers[strategy->current] = InvalidBuffer;
> return true;
> }
>
> return false;
>
> VS
>
> if (XLogNeedsFlush(lsn))
> return false;

I think you mean
if (!XLogNeedsFlush(lsn))
{
   return false;
}
// remove buffer
return true

is the same as

if (XLogNeedsFlush(lsn))
{
//remove dirty buffer
return true
}
return false;

Which is true. I've changed it to be like that.

Attached version 7 is rebased and has some bug fixes.

I also added a bonus batch on the end (0007) that refactors
SyncOneBuffer() to use the CAS loop pattern for pinning the buffer
that Andres introduced in 5e89985928795f243. bgwriter is now the only
user of SyncOneBuffer() and it rejects writing out buffers that are
used, so it seemed like a decent use case for this.
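The CAS (compare-and-swap) pin pattern referred to here can be modeled in a few lines: read the buffer state, decide based on it, and only install the pin if the state has not changed underneath us. The field layout below is invented for the sketch and is not PostgreSQL's real buf_state encoding:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define USAGE_MASK   0x0000FFFFu    /* toy usage-count field */
#define REFCOUNT_ONE 0x00010000u    /* toy refcount increment */

/*
 * Try to pin the buffer only if its usage count is zero. On CAS failure
 * old_state is reloaded automatically and the usage count is re-checked,
 * so a concurrently-bumped buffer is rejected without ever being pinned.
 */
static bool
try_pin_if_unused(_Atomic uint32_t *buf_state)
{
    uint32_t    old_state = atomic_load(buf_state);

    for (;;)
    {
        if ((old_state & USAGE_MASK) != 0)
            return false;       /* recently used: reject without pinning */

        if (atomic_compare_exchange_weak(buf_state, &old_state,
                                         old_state + REFCOUNT_ONE))
            return true;        /* pinned atomically */
    }
}
```

The appeal of this pattern for SyncOneBuffer()-style callers is exactly this: the "should I write this buffer?" check and the pin happen in one atomic step, avoiding a pin/unpin round trip on buffers that will be rejected anyway.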

- Melanie

Attachments

Re: Checkpointer write combining

From
Chao Li
Date:
Hi Melanie,

Thanks for updating the patch set. I reviewed 1-6 and have a few more small comments. I didn't review 0007 as it's
marked as WIP.

>
> - Melanie
>
<v7-0001-Refactor-goto-into-for-loop-in-GetVictimBuffer.patch><v7-0002-Split-FlushBuffer-into-two-parts.patch><v7-0003-Eagerly-flush-bulkwrite-strategy-ring.patch><v7-0004-Write-combining-for-BAS_BULKWRITE.patch><v7-0005-Add-database-Oid-to-CkptSortItem.patch><v7-0006-Implement-checkpointer-data-write-combining.patch><v7-0007-WIP-Refactor-SyncOneBuffer-for-bgwriter-only.patch>

1 - 0001
```
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -421,6 +421,12 @@ ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer)
 /*
  * Internal buffer management routines
  */
+
+
+/* Note: these two macros only work on shared buffers, not local ones! */
```

Nit: here you added two empty lines, I think we need only 1.

2 - 0002
```
+static void
+CleanVictimBuffer(BufferDesc *bufdesc,
+                  bool from_ring, IOContext io_context)
+{

-    /* Find smgr relation for buffer */
-    if (reln == NULL)
-        reln = smgropen(BufTagGetRelFileLocator(&buf->tag), INVALID_PROC_NUMBER);
+    XLogRecPtr    max_lsn = InvalidXLogRecPtr;
+    LWLock       *content_lock;
```

Nit: the empty line after “{“ should be removed.

3 - 0003
```
+/*
+ * Return the next buffer in the ring or InvalidBuffer if the current sweep is
+ * over.
+ */
+Buffer
+StrategySweepNextBuffer(BufferAccessStrategy strategy, int *sweep_cursor)
+{
+    if (++(*sweep_cursor) >= strategy->nbuffers)
+        *sweep_cursor = 0;
+
+    return strategy->buffers[*sweep_cursor];
+}
```

Feels like the function comment is a bit confusing, because the function code doesn't really perform a sweep; the
function is just a getter. InvalidBuffer just implies the current sweep is over.

Maybe rephrase to something like: "Return the next buffer in the ring. If InvalidBuffer is returned, that implies
the current sweep is done."

4 - 0003
```
static BufferDesc *
NextStratBufToFlush(BufferAccessStrategy strategy,
Buffer sweep_end,
XLogRecPtr *lsn, int *sweep_cursor)
```

"Strat" is confusing. I think it's short for "Strategy". As this is a static function, and the other function names
all spell out "strategy", why not also use the whole word in this function name?

5 - 0004
```
+uint32
+StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
+{
+    uint32        max_possible_buffer_limit;
+    uint32        max_write_batch_size;
+    int            strategy_pin_limit;
+
+    max_write_batch_size = io_combine_limit;
+
+    strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
+    max_possible_buffer_limit = GetPinLimit();
+
+    max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
+    max_write_batch_size = Min(max_possible_buffer_limit, max_write_batch_size);
+    max_write_batch_size = Max(1, max_write_batch_size);
+    max_write_batch_size = Min(max_write_batch_size, io_combine_limit);
+    Assert(max_write_batch_size < MAX_IO_COMBINE_LIMIT);
+    return max_write_batch_size;
+}
```

This implementation is hard to understand. I tried to simplify it:
```
uint32
StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
{
    int strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
    uint32 max_write_batch_size = Min(GetPinLimit(), (uint32)strategy_pin_limit);

    /* Clamp to io_combine_limit and enforce minimum of 1 */
    if (max_write_batch_size > io_combine_limit)
        max_write_batch_size = io_combine_limit;
    if (max_write_batch_size == 0)
        max_write_batch_size = 1;

    Assert(max_write_batch_size < MAX_IO_COMBINE_LIMIT);
    return max_write_batch_size;
}
```
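Both the original and the simplified version implement the same clamp: the minimum of the strategy pin limit, the global pin limit, and io_combine_limit, floored at 1 so progress is always possible. The essential logic, with the limits passed as plain parameters in place of GetPinLimit(), GetAccessStrategyPinLimit(), and the io_combine_limit GUC (a sketch, not either version verbatim):

```c
#include <assert.h>
#include <stdint.h>

static uint32_t
u32min(uint32_t a, uint32_t b)
{
    return a < b ? a : b;
}

/*
 * Clamp the write batch size to the smallest of the three limits,
 * but never below 1.
 */
static uint32_t
max_write_batch_size(uint32_t strategy_pin_limit,
                     uint32_t global_pin_limit,
                     uint32_t io_combine_limit)
{
    uint32_t    size = u32min(strategy_pin_limit, global_pin_limit);

    size = u32min(size, io_combine_limit);
    return size < 1 ? 1 : size;
}
```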

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/







Re: Checkpointer write combining

From
BharatDB
Date:


On Thu, Oct 16, 2025 at 9:55 AM Chao Li <li.evan.chao@gmail.com> wrote:

Hello All,
 
Following up on the previous mails, I understood the changes and applied the patches to the source code; they did not surface any significant bug. I also ran some verification tests for the recent changes related to batched write statistics during checkpoints.
Below are my observations and results:

Test Setup
  • PostgreSQL version: 19devel (custom build)

  • OS: Ubuntu Linux

  • Port: 55432

  • Database: postgres

  • Test tool: pgbench

  • Duration: 120 seconds

  • Command used: pgbench -c 4 -j 4 -T 120 -p 55432 -d postgres

Log Output

After running the workload, I triggered a manual checkpoint and checked the latest log entry:

2025-10-28 16:53:05.696 IST [11422] LOG:  checkpoint complete:
wrote 1383 buffers (8.4%), wrote 3 SLRU buffers;
write=0.023 s, sync=0.017 s, total=0.071 s;
sync files=8, longest=0.004 s, average=0.003 s;
distance=33437 kB, estimate=308790 kB; 

Observations:

Metric                     | Value   | Source   | Interpretation
---------------------------+---------+----------+-----------------------------------------
Buffers written            | 1383    | From log | Consistent with moderate workload
Checkpoint write time      | 0.023 s | From log | Realistic for ~11 MB write
Checkpoint sync time       | 0.017 s | From log | Reasonable
Total checkpoint time      | 0.071 s | From log | ≈ write + sync + small overhead
CHECKPOINT runtime (psql)  | -       | -        | Fast, confirms idle background activity

The total time closely matches the sum of write and sync times, with only a small overhead (expected for control file updates).
Checkpoint stats in pg_stat_checkpointer also updated correctly, with no missing or duplicated values.

Expected behavior observed:

  • write + sync ≈ total (no zero, truncation, or aggregation bug)

  • Buffer counts and timing scale realistically with workload

  • No evidence of under- or over-counted times

  • Checkpoint stats properly recorded in log and pg_stat_checkpointer

Math check:
0.023 s + 0.017 s = 0.040 s, total = 0.071 s => difference ≈ 0.03 s overhead => normal for control file + metadata writes.

Comparison to Prior Reports:


Test                 | Pre-Patch | Post-Patch | Difference
---------------------+-----------+------------+-----------
Checkpoint duration  | 6.5 s     | 5.0 s      | −23%
Heavy workload test  | 16 s      | 8 s        | −50%

Result: It shows consistent and stable timing even under moderate pgbench load — confirming the patch is integrated and functioning correctly.

Final Status:

Category                          | Result
----------------------------------+---------------
Batched Write Stats Accuracy      | Verified OK
Timing Aggregation Correctness    | Verified OK
pg_stat_checkpointer Consistency  | Verified OK
Log Formatting                    | Normal
Performance Regression            | None detected

Inference:

  • The reported write, sync, and total timings are consistent and realistic.

  • No anomalies or timing mismatches were seen.

  • Checkpoint performance seems stable and accurate under moderate load.

  • No regression or counter accumulation issues detected.


Also, I verified the output using manual queries, recorded the observations, and found no significant delays. The observations are recorded below and the screenshots are attached.

Observations:

Metric          | Before Activity | After Activity | Notes
----------------+-----------------+----------------+-----------------------------------------
Buffers written | 27949           | 27972          | Matches "wrote 23 buffers" from log
Write time      | 206882          | 206887         | Matches write=0.005 s from log
Sync time       | 840             | 877            | Matches sync=0.037 s from log
num_done        | 6               | 7              | Completed successfully
slru_written    | 8               | 9              | Matches "wrote 1 SLRU buffer" from log
Stats reset     | yes             | yes            | successful
num_requested   | 7               | 8              | manual checkpoint recorded


For further review, I have attached screenshots of my terminal.
Kindly review the observations and let me know if any additional details need attention.

Thanking you.

Regards,
Soumya

Attachments

Re: Checkpointer write combining

From
Melanie Plageman
Date:
Thanks for continuing to review! I've revised the patches to
incorporate all of your feedback except for where I mention below.

There were failures in CI due to issues with max batch size, so
attached v8 also seeks to fix those.

- Melanie

On Thu, Oct 16, 2025 at 12:25 AM Chao Li <li.evan.chao@gmail.com> wrote:
>
> 3 - 0003
> ```
> +/*
> + * Return the next buffer in the ring or InvalidBuffer if the current sweep is
> + * over.
> + */
> +Buffer
> +StrategySweepNextBuffer(BufferAccessStrategy strategy, int *sweep_cursor)
> +{
> +       if (++(*sweep_cursor) >= strategy->nbuffers)
> +               *sweep_cursor = 0;
> +
> +       return strategy->buffers[*sweep_cursor];
> +}
> ```
>
> Feels the function comment is a bit confusing, because the function code doesn’t really perform sweep, the function is just a getter. InvalidBuffer just implies the current sweep is over.
>
> Maybe rephrase to something like: “Return the next buffer in the range. If InvalidBuffer is returned, that implies the current sweep is done."

Yes, actually I think having these helpers mention the sweep is more
confusing than anything else. I've revised them to be named more
generically and updated the comments accordingly.

> 5 - 0004
> ```
> +uint32
> +StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
> +{
> +       uint32          max_possible_buffer_limit;
> +       uint32          max_write_batch_size;
> +       int                     strategy_pin_limit;
> +
> +       max_write_batch_size = io_combine_limit;
> +
> +       strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
> +       max_possible_buffer_limit = GetPinLimit();
> +
> +       max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
> +       max_write_batch_size = Min(max_possible_buffer_limit, max_write_batch_size);
> +       max_write_batch_size = Max(1, max_write_batch_size);
> +       max_write_batch_size = Min(max_write_batch_size, io_combine_limit);
> +       Assert(max_write_batch_size < MAX_IO_COMBINE_LIMIT);
> +       return max_write_batch_size;
> +}
> ```
>
> This implementation is hard to understand. I tried to simplify it:
> ```
> uint32
> StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
> {
>         int strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
>         uint32 max_write_batch_size = Min(GetPinLimit(), (uint32)strategy_pin_limit);
>
>         /* Clamp to io_combine_limit and enforce minimum of 1 */
>         if (max_write_batch_size > io_combine_limit)
>                 max_write_batch_size = io_combine_limit;
>         if (max_write_batch_size == 0)
>                 max_write_batch_size = 1;
>
>         Assert(max_write_batch_size < MAX_IO_COMBINE_LIMIT);
>         return max_write_batch_size;
> }
> ```

I agree that the implementation was hard to understand. I've not quite
gone with your version but I have rewritten it like this:

uint32
StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
{
    uint32        max_write_batch_size = Min(io_combine_limit, MAX_IO_COMBINE_LIMIT);
    int            strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
    uint32        max_possible_buffer_limit = GetPinLimit();

    /* Identify the minimum of the above */
    max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
    max_write_batch_size = Min(max_possible_buffer_limit, max_write_batch_size);

    /* Must allow at least 1 IO for forward progress */
    max_write_batch_size = Max(1, max_write_batch_size);

    return max_write_batch_size;
}

Is this better?

- Melanie

Attachments

Re: Checkpointer write combining

From
BharatDB
Date:

Hi all,

Considering the CI failures in earlier patch versions around “max batch size”, I observed that the failures arise either from boundary conditions when io_combine_limit (the GUC) is set larger than the compile-time MAX_IO_COMBINE_LIMIT, or when the pin limits return small or zero values, which produces out-of-range batch sizes or assertion failures in CI.

Comparing the approaches suggested in the thread, I think applying the GUC and compile-time cap first, and then the pin limits, could be more effective in avoiding CI failures. We should also consider the following conditions:

  1. Handle io_combine_limit == 0 explicitly (fall back to 1 for forward progress).

  2. Cap early to a conservative compile_cap (MAX_IO_COMBINE_LIMIT - 1) to avoid array overflow; if we later confirm all slots are usable, this can change to MAX_IO_COMBINE_LIMIT.

  3. Apply per-strategy pin and global pin limits only if they are positive.

  4. Use explicit typed comparisons to avoid signed/unsigned pitfalls and add a final Assert() to capture assumptions in CI.

Implementation logic:

uint32
StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
{
    uint32        max_write_batch_size;
    uint32        compile_cap = MAX_IO_COMBINE_LIMIT - 1;    /* conservative cap */
    int           strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
    uint32        max_possible_buffer_limit = GetPinLimit();

    /* Handle a zero GUC and apply the compile-time cap first */
    max_write_batch_size = (io_combine_limit == 0) ? 1 : io_combine_limit;
    if (max_write_batch_size > compile_cap)
        max_write_batch_size = compile_cap;

    /* Apply the pin limits only if they are positive */
    if (strategy_pin_limit > 0 &&
        (uint32) strategy_pin_limit < max_write_batch_size)
        max_write_batch_size = (uint32) strategy_pin_limit;
    if (max_possible_buffer_limit > 0 &&
        max_possible_buffer_limit < max_write_batch_size)
        max_write_batch_size = max_possible_buffer_limit;

    /* Must allow at least 1 IO for forward progress */
    if (max_write_batch_size == 0)
        max_write_batch_size = 1;

    Assert(max_write_batch_size <= compile_cap);
    return max_write_batch_size;
}

I hope this will be helpful for proceeding further. Looking forward to more feedback.
Thanking you.
Regards,
Soumya

On Tue, Nov 4, 2025 at 5:04 AM Melanie Plageman <melanieplageman@gmail.com> wrote:
Thanks for continuing to review! I've revised the patches to
incorporate all of your feedback except for where I mention below.

There were failures in CI due to issues with max batch size, so
attached v8 also seeks to fix those.

- Melanie

On Thu, Oct 16, 2025 at 12:25 AM Chao Li <li.evan.chao@gmail.com> wrote:
>
> 3 - 0003
> ```
> +/*
> + * Return the next buffer in the ring or InvalidBuffer if the current sweep is
> + * over.
> + */
> +Buffer
> +StrategySweepNextBuffer(BufferAccessStrategy strategy, int *sweep_cursor)
> +{
> +       if (++(*sweep_cursor) >= strategy->nbuffers)
> +               *sweep_cursor = 0;
> +
> +       return strategy->buffers[*sweep_cursor];
> +}
> ```
>
> Feels the function comment is a bit confusing, because the function code doesn’t really perform sweep, the function is just a getter. InvalidBuffer just implies the current sweep is over.
>
> Maybe rephrase to something like: “Return the next buffer in the range. If InvalidBuffer is returned, that implies the current sweep is done."

Yes, actually I think having these helpers mention the sweep is more
confusing than anything else. I've revised them to be named more
generically and updated the comments accordingly.

> 5 - 0004
> ```
> +uint32
> +StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
> +{
> +       uint32          max_possible_buffer_limit;
> +       uint32          max_write_batch_size;
> +       int                     strategy_pin_limit;
> +
> +       max_write_batch_size = io_combine_limit;
> +
> +       strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
> +       max_possible_buffer_limit = GetPinLimit();
> +
> +       max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
> +       max_write_batch_size = Min(max_possible_buffer_limit, max_write_batch_size);
> +       max_write_batch_size = Max(1, max_write_batch_size);
> +       max_write_batch_size = Min(max_write_batch_size, io_combine_limit);
> +       Assert(max_write_batch_size < MAX_IO_COMBINE_LIMIT);
> +       return max_write_batch_size;
> +}
> ```
>
> This implementation is hard to understand. I tried to simplify it:
> ```
> uint32
> StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
> {
>         int strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
>         uint32 max_write_batch_size = Min(GetPinLimit(), (uint32)strategy_pin_limit);
>
>         /* Clamp to io_combine_limit and enforce minimum of 1 */
>         if (max_write_batch_size > io_combine_limit)
>                 max_write_batch_size = io_combine_limit;
>         if (max_write_batch_size == 0)
>                 max_write_batch_size = 1;
>
>         Assert(max_write_batch_size < MAX_IO_COMBINE_LIMIT);
>         return max_write_batch_size;
> }
> ```

I agree that the implementation was hard to understand. I've not quite
gone with your version but I have rewritten it like this:

uint32
StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
{
    uint32        max_write_batch_size = Min(io_combine_limit,
MAX_IO_COMBINE_LIMIT);
    int            strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
    uint32        max_possible_buffer_limit = GetPinLimit();

    /* Identify the minimum of the above */
    max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
    max_write_batch_size = Min(max_possible_buffer_limit, max_write_batch_size);

    /* Must allow at least 1 IO for forward progress */
    max_write_batch_size = Max(1, max_write_batch_size);

    return max_write_batch_size;
}

Is this better?

- Melanie

Re: Checkpointer write combining

From
Chao Li
Date:

> On Nov 4, 2025, at 07:34, Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> Thanks for continuing to review! I've revised the patches to
> incorporate all of your feedback except for where I mention below.
>
> There were failures in CI due to issues with max batch size, so
> attached v8 also seeks to fix those.
>
> - Melanie
>
> On Thu, Oct 16, 2025 at 12:25 AM Chao Li <li.evan.chao@gmail.com> wrote:
>>
>> 3 - 0003
>> ```
>> +/*
>> + * Return the next buffer in the ring or InvalidBuffer if the current sweep is
>> + * over.
>> + */
>> +Buffer
>> +StrategySweepNextBuffer(BufferAccessStrategy strategy, int *sweep_cursor)
>> +{
>> +       if (++(*sweep_cursor) >= strategy->nbuffers)
>> +               *sweep_cursor = 0;
>> +
>> +       return strategy->buffers[*sweep_cursor];
>> +}
>> ```
>>
>> Feels the function comment is a bit confusing, because the function code doesn’t really perform sweep, the function is just a getter. InvalidBuffer just implies the current sweep is over.
>>
>> Maybe rephrase to something like: “Return the next buffer in the range. If InvalidBuffer is returned, that implies the current sweep is done."
>
> Yes, actually I think having these helpers mention the sweep is more
> confusing than anything else. I've revised them to be named more
> generically and updated the comments accordingly.
>
>> 5 - 0004
>> ```
>> +uint32
>> +StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
>> +{
>> +       uint32          max_possible_buffer_limit;
>> +       uint32          max_write_batch_size;
>> +       int                     strategy_pin_limit;
>> +
>> +       max_write_batch_size = io_combine_limit;
>> +
>> +       strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
>> +       max_possible_buffer_limit = GetPinLimit();
>> +
>> +       max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
>> +       max_write_batch_size = Min(max_possible_buffer_limit, max_write_batch_size);
>> +       max_write_batch_size = Max(1, max_write_batch_size);
>> +       max_write_batch_size = Min(max_write_batch_size, io_combine_limit);
>> +       Assert(max_write_batch_size < MAX_IO_COMBINE_LIMIT);
>> +       return max_write_batch_size;
>> +}
>> ```
>>
>> This implementation is hard to understand. I tried to simplify it:
>> ```
>> uint32
>> StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
>> {
>>        int strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
>>        uint32 max_write_batch_size = Min(GetPinLimit(), (uint32)strategy_pin_limit);
>>
>>        /* Clamp to io_combine_limit and enforce minimum of 1 */
>>        if (max_write_batch_size > io_combine_limit)
>>                max_write_batch_size = io_combine_limit;
>>        if (max_write_batch_size == 0)
>>                max_write_batch_size = 1;
>>
>>        Assert(max_write_batch_size < MAX_IO_COMBINE_LIMIT);
>>        return max_write_batch_size;
>> }
>> ```
>
> I agree that the implementation was hard to understand. I've not quite
> gone with your version but I have rewritten it like this:
>
> uint32
> StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
> {
>    uint32        max_write_batch_size = Min(io_combine_limit, MAX_IO_COMBINE_LIMIT);
>    int            strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
>    uint32        max_possible_buffer_limit = GetPinLimit();
>
>    /* Identify the minimum of the above */
>    max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
>    max_write_batch_size = Min(max_possible_buffer_limit, max_write_batch_size);
>
>    /* Must allow at least 1 IO for forward progress */
>    max_write_batch_size = Max(1, max_write_batch_size);
>
>    return max_write_batch_size;
> }
>
> Is this better?

Yes, I think your version is safer because it enforces the max limit at runtime instead of only asserting it in debug
builds.

>
> - Melanie
>
<v8-0001-Refactor-goto-into-for-loop-in-GetVictimBuffer.patch><v8-0002-Split-FlushBuffer-into-two-parts.patch><v8-0003-Eagerly-flush-bulkwrite-strategy-ring.patch><v8-0004-Write-combining-for-BAS_BULKWRITE.patch><v8-0005-Add-database-Oid-to-CkptSortItem.patch><v8-0006-Implement-checkpointer-data-write-combining.patch><v8-0007-WIP-Refactor-SyncOneBuffer-for-bgwriter-only.patch>

I quickly went through 0001-0006, looks good to me now. As 0007 has WIP in the subject, I skipped it.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/