Thread: Streamify more code paths


Streamify more code paths

From
Xuneng Zhou
Date:
Hi Hackers,

I noticed several additional paths in contrib modules, beyond [1],
that are potentially suitable for streamification:

1) pgstattuple — pgstatapprox.c and parts of pgstattuple_approx_internal
2) Bloom — scan paths in blgetbitmap() and maintenance paths in blbulkdelete()

The following patches streamify those code paths. No benchmarks have
been run yet.

[1] https://www.postgresql.org/message-id/flat/CABPTF7UeN2o-trr9r7K76rZExnO2M4SLfvTfbUY2CwQjCekgnQ%40mail.gmail.com
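
For readers unfamiliar with the read stream API, the transformation in these patches follows the usual pattern. The following is a schematic sketch only, not the literal patch code; the variable names are illustrative, while the API names (read_stream_begin_relation, block_range_read_stream_cb, etc.) are the real ones from read_stream.h:

```c
/* Before: one synchronous read per block. */
for (blkno = 0; blkno < nblocks; blkno++)
{
    Buffer      buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno,
                                         RBM_NORMAL, bstrategy);

    /* ... process buf ... */
    ReleaseBuffer(buf);
}

/* After: a read stream issues lookahead and can combine neighboring
 * blocks into larger I/Os; the callback supplies block numbers. */
p.current_blocknum = 0;
p.last_exclusive = nblocks;
stream = read_stream_begin_relation(READ_STREAM_FULL, bstrategy,
                                    rel, MAIN_FORKNUM,
                                    block_range_read_stream_cb,
                                    &p, 0);
while ((buf = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
{
    /* ... process buf ... */
    ReleaseBuffer(buf);
}
read_stream_end(stream);
```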

Feedback welcome.

--
Best,
Xuneng

Attachments

Re: Streamify more code paths

From
Xuneng Zhou
Date:
Hi,

On Thu, Dec 25, 2025 at 1:51 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> Hi Hackers,
>
> I noticed several additional paths in contrib modules, beyond [1],
> that are potentially suitable for streamification:
>
> 1) pgstattuple — pgstatapprox.c and parts of pgstattuple_approx_internal
> 2) Bloom — scan paths in blgetbitmap() and maintenance paths in blbulkdelete()
>
> The following patches streamify those code paths. No benchmarks have
> been run yet.
>
> [1] https://www.postgresql.org/message-id/flat/CABPTF7UeN2o-trr9r7K76rZExnO2M4SLfvTfbUY2CwQjCekgnQ%40mail.gmail.com
>
> Feedbacks welcome.
>

One more in ginvacuumcleanup().

--
Best,
Xuneng

Attachments

Re: Streamify more code paths

From
Nazir Bilal Yavuz
Date:
Hi,

Thank you for working on this!

On Thu, 25 Dec 2025 at 09:34, Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> Hi,
>
> On Thu, Dec 25, 2025 at 1:51 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> >
> > Hi Hackers,
> >
> > I noticed several additional paths in contrib modules, beyond [1],
> > that are potentially suitable for streamification:
> >
> > 1) pgstattuple — pgstatapprox.c and parts of pgstattuple_approx_internal
> > 2) Bloom — scan paths in blgetbitmap() and maintenance paths in blbulkdelete()
> >
> > The following patches streamify those code paths. No benchmarks have
> > been run yet.
> >
> > [1] https://www.postgresql.org/message-id/flat/CABPTF7UeN2o-trr9r7K76rZExnO2M4SLfvTfbUY2CwQjCekgnQ%40mail.gmail.com
> >
> > Feedbacks welcome.
> >
>
> One more in ginvacuumcleanup().

0001, 0002 and 0004 LGTM.

0003:

+        buf = read_stream_next_buffer(stream, NULL);
+        if (buf == InvalidBuffer)
+            break;

I think we are loosening the check here. The old code was sure there
were no invalid buffers before reaching nblocks. The streamified
version does not have this check; it exits the loop the first time it
sees an InvalidBuffer, which may be wrong. You might want to add
'Assert(p.current_blocknum == nblocks);' before read_stream_end() to
have a similar check.

--
Regards,
Nazir Bilal Yavuz
Microsoft



Re: Streamify more code paths

From
Xuneng Zhou
Date:
Hi Bilal,

Thanks for your review!

On Fri, Dec 26, 2025 at 6:59 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> Hi,
>
> Thank you for working on this!
>
> On Thu, 25 Dec 2025 at 09:34, Xuneng Zhou <xunengzhou@gmail.com> wrote:
> >
> > Hi,
> >
> > On Thu, Dec 25, 2025 at 1:51 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > >
> > > Hi Hackers,
> > >
> > > I noticed several additional paths in contrib modules, beyond [1],
> > > that are potentially suitable for streamification:
> > >
> > > 1) pgstattuple — pgstatapprox.c and parts of pgstattuple_approx_internal
> > > 2) Bloom — scan paths in blgetbitmap() and maintenance paths in blbulkdelete()
> > >
> > > The following patches streamify those code paths. No benchmarks have
> > > been run yet.
> > >
> > > [1]
https://www.postgresql.org/message-id/flat/CABPTF7UeN2o-trr9r7K76rZExnO2M4SLfvTfbUY2CwQjCekgnQ%40mail.gmail.com
> > >
> > > Feedbacks welcome.
> > >
> >
> > One more in ginvacuumcleanup().
>
> 0001, 0002 and 0004 LGTM.
>
> 0003:
>
> +        buf = read_stream_next_buffer(stream, NULL);
> +        if (buf == InvalidBuffer)
> +            break;
>
> I think we are loosening the check here. We were sure that there were
> no InvalidBuffers until the nblocks. Streamified version does not have
> this check, it exits from the loop the first time it sees an
> InvalidBuffer, which may be wrong. You might want to add
> 'Assert(p.current_blocknum == nblocks);' before read_stream_end() to
> have a similar check.
>

Agreed. The check has been added in v2 per your suggestion.

--
Best,
Xuneng

Attachments

Re: Streamify more code paths

From
Xuneng Zhou
Date:
Hi,

On Sat, Dec 27, 2025 at 12:41 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> Hi Bilal,
>
> Thanks for your review!
>
> On Fri, Dec 26, 2025 at 6:59 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> >
> > Hi,
> >
> > Thank you for working on this!
> >
> > On Thu, 25 Dec 2025 at 09:34, Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > On Thu, Dec 25, 2025 at 1:51 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > > >
> > > > Hi Hackers,
> > > >
> > > > I noticed several additional paths in contrib modules, beyond [1],
> > > > that are potentially suitable for streamification:
> > > >
> > > > 1) pgstattuple — pgstatapprox.c and parts of pgstattuple_approx_internal
> > > > 2) Bloom — scan paths in blgetbitmap() and maintenance paths in blbulkdelete()
> > > >
> > > > The following patches streamify those code paths. No benchmarks have
> > > > been run yet.
> > > >
> > > > [1]
https://www.postgresql.org/message-id/flat/CABPTF7UeN2o-trr9r7K76rZExnO2M4SLfvTfbUY2CwQjCekgnQ%40mail.gmail.com
> > > >
> > > > Feedbacks welcome.
> > > >
> > >
> > > One more in ginvacuumcleanup().
> >
> > 0001, 0002 and 0004 LGTM.
> >
> > 0003:
> >
> > +        buf = read_stream_next_buffer(stream, NULL);
> > +        if (buf == InvalidBuffer)
> > +            break;
> >
> > I think we are loosening the check here. We were sure that there were
> > no InvalidBuffers until the nblocks. Streamified version does not have
> > this check, it exits from the loop the first time it sees an
> > InvalidBuffer, which may be wrong. You might want to add
> > 'Assert(p.current_blocknum == nblocks);' before read_stream_end() to
> > have a similar check.
> >
>
> Agree. The check has been added in v2 per your suggestion.
>

Two more to go:
patch 5: Streamify log_newpage_range() WAL logging path
patch 6: Streamify hash index VACUUM primary bucket page reads

Benchmarks will be conducted soon.


--
Best,
Xuneng

Attachments

Re: Streamify more code paths

From
Xuneng Zhou
Date:
Hi,

On Sun, Dec 28, 2025 at 7:41 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> Hi,
>
> On Sat, Dec 27, 2025 at 12:41 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> >
> > Hi Bilal,
> >
> > Thanks for your review!
> >
> > On Fri, Dec 26, 2025 at 6:59 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > Thank you for working on this!
> > >
> > > On Thu, 25 Dec 2025 at 09:34, Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > On Thu, Dec 25, 2025 at 1:51 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > > > >
> > > > > Hi Hackers,
> > > > >
> > > > > I noticed several additional paths in contrib modules, beyond [1],
> > > > > that are potentially suitable for streamification:
> > > > >
> > > > > 1) pgstattuple — pgstatapprox.c and parts of pgstattuple_approx_internal
> > > > > 2) Bloom — scan paths in blgetbitmap() and maintenance paths in blbulkdelete()
> > > > >
> > > > > The following patches streamify those code paths. No benchmarks have
> > > > > been run yet.
> > > > >
> > > > > [1]
https://www.postgresql.org/message-id/flat/CABPTF7UeN2o-trr9r7K76rZExnO2M4SLfvTfbUY2CwQjCekgnQ%40mail.gmail.com
> > > > >
> > > > > Feedbacks welcome.
> > > > >
> > > >
> > > > One more in ginvacuumcleanup().
> > >
> > > 0001, 0002 and 0004 LGTM.
> > >
> > > 0003:
> > >
> > > +        buf = read_stream_next_buffer(stream, NULL);
> > > +        if (buf == InvalidBuffer)
> > > +            break;
> > >
> > > I think we are loosening the check here. We were sure that there were
> > > no InvalidBuffers until the nblocks. Streamified version does not have
> > > this check, it exits from the loop the first time it sees an
> > > InvalidBuffer, which may be wrong. You might want to add
> > > 'Assert(p.current_blocknum == nblocks);' before read_stream_end() to
> > > have a similar check.
> > >
> >
> > Agree. The check has been added in v2 per your suggestion.
> >
>
> Two more to go:
> patch 5: Streamify log_newpage_range() WAL logging path
> patch 6: Streamify hash index VACUUM primary bucket page reads
>
> Benchmarks will be conducted soon.
>

v6 in the last message has a problem and was not actually updated.
Attaching the right one again. Sorry for the noise.

--
Best,
Xuneng

Attachments

Re: Streamify more code paths

From
Nazir Bilal Yavuz
Date:
Hi,

On Sun, 28 Dec 2025 at 14:46, Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> Hi,
> >
> > Two more to go:
> > patch 5: Streamify log_newpage_range() WAL logging path
> > patch 6: Streamify hash index VACUUM primary bucket page reads
> >
> > Benchmarks will be conducted soon.
> >
>
> v6 in the last message has a problem and has not been updated. Attach
> the right one again. Sorry for the noise.

0003 and 0006:

You need to add 'StatApproxReadStreamPrivate' and
'HashBulkDeleteStreamPrivate' to the typedefs.list.

0005:

@@ -1321,8 +1341,10 @@ log_newpage_range(Relation rel, ForkNumber forknum,
         nbufs = 0;
         while (nbufs < XLR_MAX_BLOCK_ID && blkno < endblk)
         {
-            Buffer        buf = ReadBufferExtended(rel, forknum, blkno,
-                                                 RBM_NORMAL, NULL);
+            Buffer        buf = read_stream_next_buffer(stream, NULL);
+
+            if (!BufferIsValid(buf))
+                break;

We are loosening a check here; there should not be an invalid buffer in
the stream until endblk. I think you can remove this BufferIsValid()
check; then we will learn if something goes wrong.

0006:

You can use read_stream_reset() instead of read_stream_end(); then you
can reuse the same stream with different variables. I believe this is
the preferred way.
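
As a hedged sketch of that approach (simplified, not the actual patch; the range variables are illustrative), one stream can serve two block ranges by resetting it and repointing the callback's private state:

```c
/* First pass over [first_start, first_end). */
p.current_blocknum = first_start;
p.last_exclusive = first_end;
stream = read_stream_begin_relation(READ_STREAM_FULL, bstrategy,
                                    rel, MAIN_FORKNUM,
                                    block_range_read_stream_cb,
                                    &p, 0);
while ((buf = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
{
    /* ... first pass ... */
}

read_stream_reset(stream);          /* rewind instead of read_stream_end() */
p.current_blocknum = second_start;  /* same stream, new range */
p.last_exclusive = second_end;
while ((buf = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
{
    /* ... second pass ... */
}

read_stream_end(stream);            /* release once, when fully done */
```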

Rest LGTM!

-- 
Regards,
Nazir Bilal Yavuz
Microsoft



Re: Streamify more code paths

From
Xuneng Zhou
Date:
Hi,

Thanks for looking into this.

On Mon, Dec 29, 2025 at 6:58 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> Hi,
>
> On Sun, 28 Dec 2025 at 14:46, Xuneng Zhou <xunengzhou@gmail.com> wrote:
> >
> > Hi,
> > >
> > > Two more to go:
> > > patch 5: Streamify log_newpage_range() WAL logging path
> > > patch 6: Streamify hash index VACUUM primary bucket page reads
> > >
> > > Benchmarks will be conducted soon.
> > >
> >
> > v6 in the last message has a problem and has not been updated. Attach
> > the right one again. Sorry for the noise.
>
> 0003 and 0006:
>
> You need to add 'StatApproxReadStreamPrivate' and
> 'HashBulkDeleteStreamPrivate' to the typedefs.list.

Done.

> 0005:
>
> @@ -1321,8 +1341,10 @@ log_newpage_range(Relation rel, ForkNumber forknum,
>          nbufs = 0;
>          while (nbufs < XLR_MAX_BLOCK_ID && blkno < endblk)
>          {
> -            Buffer        buf = ReadBufferExtended(rel, forknum, blkno,
> -                                                 RBM_NORMAL, NULL);
> +            Buffer        buf = read_stream_next_buffer(stream, NULL);
> +
> +            if (!BufferIsValid(buf))
> +                break;
>
> We are loosening a check here, there should not be a invalid buffer in
> the stream until the endblk. I think you can remove this
> BufferIsValid() check, then we can learn if something goes wrong.

My earlier concern about not adding an assert at the end of streaming
was the potential early break here:

/* Nothing more to do if all remaining blocks were empty. */
if (nbufs == 0)
    break;

After looking more closely, it turned out to be a misunderstanding of the logic on my part.

> 0006:
>
> You can use read_stream_reset() instead of read_stream_end(), then you
> can use the same stream with different variables, I believe this is
> the preferred way.
>
> Rest LGTM!
>

Yeah, reset seems the more appropriate way here.

--
Best,
Xuneng

Attachments

Re: Streamify more code paths

From
Xuneng Zhou
Date:
Hi,

On Tue, Dec 30, 2025 at 9:51 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> Hi,
>
> Thanks for looking into this.
>
> On Mon, Dec 29, 2025 at 6:58 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> >
> > Hi,
> >
> > On Sun, 28 Dec 2025 at 14:46, Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > >
> > > Hi,
> > > >
> > > > Two more to go:
> > > > patch 5: Streamify log_newpage_range() WAL logging path
> > > > patch 6: Streamify hash index VACUUM primary bucket page reads
> > > >
> > > > Benchmarks will be conducted soon.
> > > >
> > >
> > > v6 in the last message has a problem and has not been updated. Attach
> > > the right one again. Sorry for the noise.
> >
> > 0003 and 0006:
> >
> > You need to add 'StatApproxReadStreamPrivate' and
> > 'HashBulkDeleteStreamPrivate' to the typedefs.list.
>
> Done.
>
> > 0005:
> >
> > @@ -1321,8 +1341,10 @@ log_newpage_range(Relation rel, ForkNumber forknum,
> >          nbufs = 0;
> >          while (nbufs < XLR_MAX_BLOCK_ID && blkno < endblk)
> >          {
> > -            Buffer        buf = ReadBufferExtended(rel, forknum, blkno,
> > -                                                 RBM_NORMAL, NULL);
> > +            Buffer        buf = read_stream_next_buffer(stream, NULL);
> > +
> > +            if (!BufferIsValid(buf))
> > +                break;
> >
> > We are loosening a check here, there should not be a invalid buffer in
> > the stream until the endblk. I think you can remove this
> > BufferIsValid() check, then we can learn if something goes wrong.
>
> My concern before for not adding assert at the end of streaming is the
> potential early break in here:
>
> /* Nothing more to do if all remaining blocks were empty. */
> if (nbufs == 0)
>     break;
>
> After looking more closely, it turns out to be a misunderstanding of the logic.
>
> > 0006:
> >
> > You can use read_stream_reset() instead of read_stream_end(), then you
> > can use the same stream with different variables, I believe this is
> > the preferred way.
> >
> > Rest LGTM!
> >
>
> Yeah, reset seems a more proper way here.
>

Ran pgindent using the updated typedefs.list.

--
Best,
Xuneng

Attachments

Re: Streamify more code paths

From
Xuneng Zhou
Date:
Hi,

On Tue, Dec 30, 2025 at 10:43 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> Hi,
>
> On Tue, Dec 30, 2025 at 9:51 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> >
> > Hi,
> >
> > Thanks for looking into this.
> >
> > On Mon, Dec 29, 2025 at 6:58 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > On Sun, 28 Dec 2025 at 14:46, Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > > >
> > > > Hi,
> > > > >
> > > > > Two more to go:
> > > > > patch 5: Streamify log_newpage_range() WAL logging path
> > > > > patch 6: Streamify hash index VACUUM primary bucket page reads
> > > > >
> > > > > Benchmarks will be conducted soon.
> > > > >
> > > >
> > > > v6 in the last message has a problem and has not been updated. Attach
> > > > the right one again. Sorry for the noise.
> > >
> > > 0003 and 0006:
> > >
> > > You need to add 'StatApproxReadStreamPrivate' and
> > > 'HashBulkDeleteStreamPrivate' to the typedefs.list.
> >
> > Done.
> >
> > > 0005:
> > >
> > > @@ -1321,8 +1341,10 @@ log_newpage_range(Relation rel, ForkNumber forknum,
> > >          nbufs = 0;
> > >          while (nbufs < XLR_MAX_BLOCK_ID && blkno < endblk)
> > >          {
> > > -            Buffer        buf = ReadBufferExtended(rel, forknum, blkno,
> > > -                                                 RBM_NORMAL, NULL);
> > > +            Buffer        buf = read_stream_next_buffer(stream, NULL);
> > > +
> > > +            if (!BufferIsValid(buf))
> > > +                break;
> > >
> > > We are loosening a check here, there should not be a invalid buffer in
> > > the stream until the endblk. I think you can remove this
> > > BufferIsValid() check, then we can learn if something goes wrong.
> >
> > My concern before for not adding assert at the end of streaming is the
> > potential early break in here:
> >
> > /* Nothing more to do if all remaining blocks were empty. */
> > if (nbufs == 0)
> >     break;
> >
> > After looking more closely, it turns out to be a misunderstanding of the logic.
> >
> > > 0006:
> > >
> > > You can use read_stream_reset() instead of read_stream_end(), then you
> > > can use the same stream with different variables, I believe this is
> > > the preferred way.
> > >
> > > Rest LGTM!
> > >
> >
> > Yeah, reset seems a more proper way here.
> >
>
> Run pgindent using the updated typedefs.list.
>

I've completed benchmarking of the v4 streaming read patches across
three I/O methods (io_uring, sync, worker). Tests were run with cold
cache on large datasets.

--- Settings ---

shared_buffers = '8GB'
effective_io_concurrency = 200
io_method = $IO_METHOD
io_workers = $IO_WORKERS
io_max_concurrency = $IO_MAX_CONCURRENCY
track_io_timing = on
autovacuum = off
checkpoint_timeout = 1h
max_wal_size = 10GB
max_parallel_workers_per_gather = 0

--- Machine ---
CPU: 48-core
RAM: 256 GB DDR5
Disk: 2 x 1.92 TB NVMe SSD

--- Executive Summary ---

The patches provide significant benefits for I/O-bound sequential
operations, with the greatest improvements seen when using
asynchronous I/O methods (io_uring and worker). The synchronous I/O
mode shows reduced but still meaningful gains.

--- Results by I/O Method ---

Best Results: io_method=worker

bloom_scan: 4.14x (75.9% faster); 93% fewer reads
pgstattuple: 1.59x (37.1% faster); 94% fewer reads
hash_vacuum: 1.05x (4.4% faster); 80% fewer reads
gin_vacuum: 1.06x (5.6% faster); 15% fewer reads
bloom_vacuum: 1.04x (3.9% faster); 76% fewer reads
wal_logging: 0.98x (-2.5%, neutral/slightly slower); no change in reads

io_method=io_uring

bloom_scan: 3.12x (68.0% faster); 93% fewer reads
pgstattuple: 1.50x (33.2% faster); 94% fewer reads
hash_vacuum: 1.03x (3.3% faster); 80% fewer reads
gin_vacuum: 1.02x (2.1% faster); 15% fewer reads
bloom_vacuum: 1.03x (3.4% faster); 76% fewer reads
wal_logging: 1.00x (-0.5%, neutral); no change in reads

io_method=sync (baseline comparison)

bloom_scan: 1.20x (16.4% faster); 93% fewer reads
pgstattuple: 1.10x (9.0% faster); 94% fewer reads
hash_vacuum: 1.01x (0.8% faster); 80% fewer reads
gin_vacuum: 1.02x (1.7% faster); 15% fewer reads
bloom_vacuum: 1.03x (2.8% faster); 76% fewer reads
wal_logging: 0.99x (-0.7%, neutral); no change in reads

--- Observations ---

Async I/O amplifies streaming benefits: The same patches show 3-4x
improvement with worker/io_uring vs 1.2x with sync.

I/O operation reduction is consistent: All modes show the same ~93-94%
reduction in I/O operations for bloom_scan and pgstattuple.

VACUUM operations show modest gains: Despite large I/O reductions
(76-80%), wall-clock improvements are smaller (3-15%) since VACUUM has
larger CPU overhead (tuple processing, index maintenance, WAL
logging).

log_newpage_range shows no benefit: The patch provides no improvement (~0.97x).

--
Best,
Xuneng

Attachments

Re: Streamify more code paths

From
Xuneng Zhou
Date:
Hi,

On Thu, Feb 5, 2026 at 12:01 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> Hi,
>
> On Tue, Dec 30, 2025 at 10:43 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> >
> > Hi,
> >
> > On Tue, Dec 30, 2025 at 9:51 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > Thanks for looking into this.
> > >
> > > On Mon, Dec 29, 2025 at 6:58 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > On Sun, 28 Dec 2025 at 14:46, Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > > > >
> > > > > Hi,
> > > > > >
> > > > > > Two more to go:
> > > > > > patch 5: Streamify log_newpage_range() WAL logging path
> > > > > > patch 6: Streamify hash index VACUUM primary bucket page reads
> > > > > >
> > > > > > Benchmarks will be conducted soon.
> > > > > >
> > > > >
> > > > > v6 in the last message has a problem and has not been updated. Attach
> > > > > the right one again. Sorry for the noise.
> > > >
> > > > 0003 and 0006:
> > > >
> > > > You need to add 'StatApproxReadStreamPrivate' and
> > > > 'HashBulkDeleteStreamPrivate' to the typedefs.list.
> > >
> > > Done.
> > >
> > > > 0005:
> > > >
> > > > @@ -1321,8 +1341,10 @@ log_newpage_range(Relation rel, ForkNumber forknum,
> > > >          nbufs = 0;
> > > >          while (nbufs < XLR_MAX_BLOCK_ID && blkno < endblk)
> > > >          {
> > > > -            Buffer        buf = ReadBufferExtended(rel, forknum, blkno,
> > > > -                                                 RBM_NORMAL, NULL);
> > > > +            Buffer        buf = read_stream_next_buffer(stream, NULL);
> > > > +
> > > > +            if (!BufferIsValid(buf))
> > > > +                break;
> > > >
> > > > We are loosening a check here, there should not be a invalid buffer in
> > > > the stream until the endblk. I think you can remove this
> > > > BufferIsValid() check, then we can learn if something goes wrong.
> > >
> > > My concern before for not adding assert at the end of streaming is the
> > > potential early break in here:
> > >
> > > /* Nothing more to do if all remaining blocks were empty. */
> > > if (nbufs == 0)
> > >     break;
> > >
> > > After looking more closely, it turns out to be a misunderstanding of the logic.
> > >
> > > > 0006:
> > > >
> > > > You can use read_stream_reset() instead of read_stream_end(), then you
> > > > can use the same stream with different variables, I believe this is
> > > > the preferred way.
> > > >
> > > > Rest LGTM!
> > > >
> > >
> > > Yeah, reset seems a more proper way here.
> > >
> >
> > Run pgindent using the updated typedefs.list.
> >
>
> I've completed benchmarking of the v4 streaming read patches across
> three I/O methods (io_uring, sync, worker). Tests were run with cold
> cache on large datasets.
>
> --- Settings ---
>
> shared_buffers = '8GB'
> effective_io_concurrency = 200
> io_method = $IO_METHOD
> io_workers = $IO_WORKERS
> io_max_concurrency = $IO_MAX_CONCURRENCY
> track_io_timing = on
> autovacuum = off
> checkpoint_timeout = 1h
> max_wal_size = 10GB
> max_parallel_workers_per_gather = 0
>
> --- Machine ---
> CPU: 48-core
> RAM: 256 GB DDR5
> Disk: 2 x 1.92 TB NVMe SSD
>
> --- Executive Summary ---
>
> The patches provide significant benefits for I/O-bound sequential
> operations, with the greatest improvements seen when using
> asynchronous I/O methods (io_uring and worker). The synchronous I/O
> mode shows reduced but still meaningful gains.
>
> --- Results by I/O Method
>
> Best Results: io_method=worker
>
> bloom_scan: 4.14x (75.9% faster); 93% fewer reads
> pgstattuple: 1.59x (37.1% faster); 94% fewer reads
> hash_vacuum: 1.05x (4.4% faster); 80% fewer reads
> gin_vacuum: 1.06x (5.6% faster); 15% fewer reads
> bloom_vacuum: 1.04x (3.9% faster); 76% fewer reads
> wal_logging: 0.98x (-2.5%, neutral/slightly slower); no change in reads
>
> io_method=io_uring
>
> bloom_scan: 3.12x (68.0% faster); 93% fewer reads
> pgstattuple: 1.50x (33.2% faster); 94% fewer reads
> hash_vacuum: 1.03x (3.3% faster); 80% fewer reads
> gin_vacuum: 1.02x (2.1% faster); 15% fewer reads
> bloom_vacuum: 1.03x (3.4% faster); 76% fewer reads
> wal_logging: 1.00x (-0.5%, neutral); no change in reads
>
> io_method=sync (baseline comparison)
>
> bloom_scan: 1.20x (16.4% faster); 93% fewer reads
> pgstattuple: 1.10x (9.0% faster); 94% fewer reads
> hash_vacuum: 1.01x (0.8% faster); 80% fewer reads
> gin_vacuum: 1.02x (1.7% faster); 15% fewer reads
> bloom_vacuum: 1.03x (2.8% faster); 76% fewer reads
> wal_logging: 0.99x (-0.7%, neutral); no change in reads
>
> --- Observations ---
>
> Async I/O amplifies streaming benefits: The same patches show 3-4x
> improvement with worker/io_uring vs 1.2x with sync.
>
> I/O operation reduction is consistent: All modes show the same ~93-94%
> reduction in I/O operations for bloom_scan and pgstattuple.
>
> VACUUM operations show modest gains: Despite large I/O reductions
> (76-80%), wall-clock improvements are smaller (3-15%) since VACUUM has
> larger CPU overhead (tuple processing, index maintenance, WAL
> logging).
>
> log_newpage_range shows no benefit: The patch provides no improvement (~0.97x).
>
> --
> Best,
> Xuneng

There was an issue in the wal_log test of the original script.

--- The original benchmark used:
ALTER TABLE ... SET LOGGED

This path performs a full table rewrite via ATRewriteTable()
(tablecmds.c). It creates a new relfilenode and copies tuples into it.
It does not call log_newpage_range() on rewritten pages.

log_newpage_range() may only appear indirectly through the
pending-sync logic in storage.c, and only when:

wal_level = minimal, and
relation size < wal_skip_threshold (default 2MB).

Our test tables (1M–20M rows) are far larger than 2MB. In that case,
PostgreSQL fsyncs the file instead of WAL-logging it. Therefore, the
previous benchmark measured table rewrite I/O, not the
log_newpage_range() path.

--- Current design: GIN index build

The benchmark now uses:
CREATE INDEX ... USING gin (doc_tsv)

This reliably exercises log_newpage_range() because:
- ginbuild() constructs the index and WAL-logs all new index pages
using log_newpage_range().
- This is part of the normal GIN build path, independent of wal_skip_threshold.
- The streaming-read patch modifies the WAL logging path inside
log_newpage_range(), which this test directly targets.

--- Results (wal_logging_large)
worker: 1.00x (+0.5%); no meaningful change in reads
io_uring: 1.01x (+1.3%); no meaningful change in reads
sync: 1.01x (+1.1%); no meaningful change in reads

--
Best,
Xuneng

Attachments

Re: Streamify more code paths

From
Xuneng Zhou
Date:
Hi,

On Mon, Feb 9, 2026 at 6:40 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> Hi,
>
> On Thu, Feb 5, 2026 at 12:01 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> >
> > Hi,
> >
> > On Tue, Dec 30, 2025 at 10:43 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > On Tue, Dec 30, 2025 at 9:51 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > Thanks for looking into this.
> > > >
> > > > On Mon, Dec 29, 2025 at 6:58 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > On Sun, 28 Dec 2025 at 14:46, Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > > > > >
> > > > > > Hi,
> > > > > > >
> > > > > > > Two more to go:
> > > > > > > patch 5: Streamify log_newpage_range() WAL logging path
> > > > > > > patch 6: Streamify hash index VACUUM primary bucket page reads
> > > > > > >
> > > > > > > Benchmarks will be conducted soon.
> > > > > > >
> > > > > >
> > > > > > v6 in the last message has a problem and has not been updated. Attach
> > > > > > the right one again. Sorry for the noise.
> > > > >
> > > > > 0003 and 0006:
> > > > >
> > > > > You need to add 'StatApproxReadStreamPrivate' and
> > > > > 'HashBulkDeleteStreamPrivate' to the typedefs.list.
> > > >
> > > > Done.
> > > >
> > > > > 0005:
> > > > >
> > > > > @@ -1321,8 +1341,10 @@ log_newpage_range(Relation rel, ForkNumber forknum,
> > > > >          nbufs = 0;
> > > > >          while (nbufs < XLR_MAX_BLOCK_ID && blkno < endblk)
> > > > >          {
> > > > > -            Buffer        buf = ReadBufferExtended(rel, forknum, blkno,
> > > > > -                                                 RBM_NORMAL, NULL);
> > > > > +            Buffer        buf = read_stream_next_buffer(stream, NULL);
> > > > > +
> > > > > +            if (!BufferIsValid(buf))
> > > > > +                break;
> > > > >
> > > > > We are loosening a check here, there should not be a invalid buffer in
> > > > > the stream until the endblk. I think you can remove this
> > > > > BufferIsValid() check, then we can learn if something goes wrong.
> > > >
> > > > My concern before for not adding assert at the end of streaming is the
> > > > potential early break in here:
> > > >
> > > > /* Nothing more to do if all remaining blocks were empty. */
> > > > if (nbufs == 0)
> > > >     break;
> > > >
> > > > After looking more closely, it turns out to be a misunderstanding of the logic.
> > > >
> > > > > 0006:
> > > > >
> > > > > You can use read_stream_reset() instead of read_stream_end(), then you
> > > > > can use the same stream with different variables, I believe this is
> > > > > the preferred way.
> > > > >
> > > > > Rest LGTM!
> > > > >
> > > >
> > > > Yeah, reset seems a more proper way here.
> > > >
> > >
> > > Run pgindent using the updated typedefs.list.
> > >
> >
> > I've completed benchmarking of the v4 streaming read patches across
> > three I/O methods (io_uring, sync, worker). Tests were run with cold
> > cache on large datasets.
> >
> > --- Settings ---
> >
> > shared_buffers = '8GB'
> > effective_io_concurrency = 200
> > io_method = $IO_METHOD
> > io_workers = $IO_WORKERS
> > io_max_concurrency = $IO_MAX_CONCURRENCY
> > track_io_timing = on
> > autovacuum = off
> > checkpoint_timeout = 1h
> > max_wal_size = 10GB
> > max_parallel_workers_per_gather = 0
> >
> > --- Machine ---
> > CPU: 48-core
> > RAM: 256 GB DDR5
> > Disk: 2 x 1.92 TB NVMe SSD
> >
> > --- Executive Summary ---
> >
> > The patches provide significant benefits for I/O-bound sequential
> > operations, with the greatest improvements seen when using
> > asynchronous I/O methods (io_uring and worker). The synchronous I/O
> > mode shows reduced but still meaningful gains.
> >
> > --- Results by I/O Method
> >
> > Best Results: io_method=worker
> >
> > bloom_scan: 4.14x (75.9% faster); 93% fewer reads
> > pgstattuple: 1.59x (37.1% faster); 94% fewer reads
> > hash_vacuum: 1.05x (4.4% faster); 80% fewer reads
> > gin_vacuum: 1.06x (5.6% faster); 15% fewer reads
> > bloom_vacuum: 1.04x (3.9% faster); 76% fewer reads
> > wal_logging: 0.98x (-2.5%, neutral/slightly slower); no change in reads
> >
> > io_method=io_uring
> >
> > bloom_scan: 3.12x (68.0% faster); 93% fewer reads
> > pgstattuple: 1.50x (33.2% faster); 94% fewer reads
> > hash_vacuum: 1.03x (3.3% faster); 80% fewer reads
> > gin_vacuum: 1.02x (2.1% faster); 15% fewer reads
> > bloom_vacuum: 1.03x (3.4% faster); 76% fewer reads
> > wal_logging: 1.00x (-0.5%, neutral); no change in reads
> >
> > io_method=sync (baseline comparison)
> >
> > bloom_scan: 1.20x (16.4% faster); 93% fewer reads
> > pgstattuple: 1.10x (9.0% faster); 94% fewer reads
> > hash_vacuum: 1.01x (0.8% faster); 80% fewer reads
> > gin_vacuum: 1.02x (1.7% faster); 15% fewer reads
> > bloom_vacuum: 1.03x (2.8% faster); 76% fewer reads
> > wal_logging: 0.99x (-0.7%, neutral); no change in reads
> >
> > --- Observations ---
> >
> > Async I/O amplifies streaming benefits: The same patches show 3-4x
> > improvement with worker/io_uring vs 1.2x with sync.
> >
> > I/O operation reduction is consistent: All modes show the same ~93-94%
> > reduction in I/O operations for bloom_scan and pgstattuple.
> >
> > VACUUM operations show modest gains: Despite large I/O reductions
> > (76-80%), wall-clock improvements are smaller (3-15%) since VACUUM has
> > larger CPU overhead (tuple processing, index maintenance, WAL
> > logging).
> >
> > log_newpage_range shows no benefit: The patch provides no improvement (~0.97x).
> >
> > --
> > Best,
> > Xuneng
>
> There was an issue in the wal_log test of the original script.
>
> --- The original benchmark used:
> ALTER TABLE ... SET LOGGED
>
> This path performs a full table rewrite via ATRewriteTable()
> (tablecmds.c). It creates a new relfilenode and copies tuples into it.
> It does not call log_newpage_range() on rewritten pages.
>
> log_newpage_range() may only appear indirectly through the
> pending-sync logic in storage.c, and only when:
>
> wal_level = minimal, and
> relation size < wal_skip_threshold (default 2MB).
>
> Our test tables (1M–20M rows) are far larger than 2MB. In that case,
> PostgreSQL fsyncs the file instead of WAL-logging it. Therefore, the
> previous benchmark measured table rewrite I/O, not the
> log_newpage_range() path.
>
> --- Current design: GIN index build
>
> The benchmark now uses:
> CREATE INDEX ... USING gin (doc_tsv)
>
> This reliably exercises log_newpage_range() because:
> - ginbuild() constructs the index and WAL-logs all new index pages
> using log_newpage_range().
> - This is part of the normal GIN build path, independent of wal_skip_threshold.
> - The streaming-read patch modifies the WAL logging path inside
> log_newpage_range(), which this test directly targets.
>
> --- Results (wal_logging_large)
> worker: 1.00x (+0.5%); no meaningful change in reads
> io_uring: 1.01x (+1.3%); no meaningful change in reads
> sync: 1.01x (+1.1%); no meaningful change in reads
>
> --
> Best,
> Xuneng

Here’s v5 of the patchset. The wal_logging_large patch has been
removed, as no performance gains were observed in the benchmark runs.

--
Best,
Xuneng

Attachments

Re: Streamify more code paths

From
Michael Paquier
Date:
On Tue, Mar 10, 2026 at 02:06:12PM +0800, Xuneng Zhou wrote:
> Here’s v5 of the patchset. The wal_logging_large patch has been
> removed, as no performance gains were observed in the benchmark runs.

Looking at the numbers you are posting, it is harder to get excited
about the hash, gin, bloom_vacuum and wal_logging.  The worker method
seems more efficient, may show that we are out of noise level.

The results associated to pgstattuple and the bloom scans are on a
different level for the three methods.

Saying that, it is really nice that you have sent the benchmark.  The
measurement method looks in line with the goal here after review (IO
stats, calculations), and I have taken some time to run it to get an
idea of the difference for these five code paths, as of (slightly
edited the script for my own environment, result is the same):
./run_streaming_benchmark --baseline --io-method=io_uring/worker

I am not much interested in the sync case, so I have tested the two
other methods:

1) method=IO-uring
bloom_scan_large           base=   725.3ms  patch=    99.9ms   7.26x ( 86.2%)  (reads=19676->1294, io_time=688.36->33.69ms)
bloom_vacuum_large         base=  7414.9ms  patch=  7455.2ms   0.99x ( -0.5%)  (reads=48361->11597, io_time=459.02->257.51ms)
pgstattuple_large          base= 12642.9ms  patch= 11873.5ms   1.06x (  6.1%)  (reads=206945->12983, io_time=6516.70->143.46ms)
gin_vacuum_large           base=  3546.8ms  patch=  2317.9ms   1.53x ( 34.6%)  (reads=20734->17735, io_time=3244.40->2021.53ms)
hash_vacuum_large          base= 12268.5ms  patch= 11751.1ms   1.04x (  4.2%)  (reads=76677->15606, io_time=1483.10->315.03ms)
wal_logging_large          base= 33713.0ms  patch= 32773.9ms   1.03x (  2.8%)  (reads=21641->21641, io_time=81.18->77.25ms)

2) method=worker io-workers=3
bloom_scan_large           base=   725.0ms  patch=   465.7ms   1.56x ( 35.8%)  (reads=19676->1294, io_time=688.70->52.20ms)
bloom_vacuum_large         base=  7138.3ms  patch=  7156.0ms   1.00x ( -0.2%)  (reads=48361->11597, io_time=284.56->64.37ms)
pgstattuple_large          base= 12429.3ms  patch= 11916.8ms   1.04x (  4.1%)  (reads=206945->12983, io_time=6501.91->32.24ms)
gin_vacuum_large           base=  3769.4ms  patch=  3716.7ms   1.01x (  1.4%)  (reads=20775->17684, io_time=3562.21->3528.14ms)
hash_vacuum_large          base= 11750.1ms  patch= 11289.0ms   1.04x (  3.9%)  (reads=76677->15606, io_time=1296.03->98.72ms)
wal_logging_large          base= 32862.3ms  patch= 33179.7ms   0.99x ( -1.0%)  (reads=21641->21641, io_time=91.42->90.59ms)

The bloom scan case is a winner in runtime for both cases, and in
terms of stats we get much better numbers for all of them.  These feel
rather in line with what you have, except for pgstattuple's runtime,
still its IO numbers feel good.  That's just to say that I'll review
them and try to do something about at least some of the pieces for
this release.
--
Michael

Attachments

Re: Streamify more code paths

From
Xuneng Zhou
Date:
Hi Michael,

On Tue, Mar 10, 2026 at 6:28 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Tue, Mar 10, 2026 at 02:06:12PM +0800, Xuneng Zhou wrote:
> > Here’s v5 of the patchset. The wal_logging_large patch has been
> > removed, as no performance gains were observed in the benchmark runs.
>
> Looking at the numbers you are posting, it is harder to get excited
> about the hash, gin, bloom_vacuum and wal_logging.  The worker method
> seems more efficient, may show that we are out of noise level.
> The results associated to pgstattuple and the bloom scans are on a
> different level for the three methods.
>
> Saying that, it is really nice that you have sent the benchmark.  The
> measurement method looks in line with the goal here after review (IO
> stats, calculations), and I have taken some time to run it to get an
> idea of the difference for these five code paths, as of (slightly
> edited the script for my own environment, result is the same):
> ./run_streaming_benchmark --baseline --io-method=io_uring/worker
>
> I am not much interested in the sync case, so I have tested the two
> other methods:
>
> 1) method=IO-uring
> bloom_scan_large           base=   725.3ms  patch=    99.9ms   7.26x
> ( 86.2%)  (reads=19676->1294, io_time=688.36->33.69ms)
> bloom_vacuum_large         base=  7414.9ms  patch=  7455.2ms   0.99x
> ( -0.5%)  (reads=48361->11597, io_time=459.02->257.51ms)
> pgstattuple_large          base= 12642.9ms  patch= 11873.5ms   1.06x
> (  6.1%)  (reads=206945->12983, io_time=6516.70->143.46ms)
> gin_vacuum_large           base=  3546.8ms  patch=  2317.9ms   1.53x
> ( 34.6%)  (reads=20734->17735, io_time=3244.40->2021.53ms)
> hash_vacuum_large          base= 12268.5ms  patch= 11751.1ms   1.04x
> (  4.2%)  (reads=76677->15606, io_time=1483.10->315.03ms)
> wal_logging_large          base= 33713.0ms  patch= 32773.9ms   1.03x
> (  2.8%)  (reads=21641->21641, io_time=81.18->77.25ms)
>
> 2) method=worker io-workers=3
> bloom_scan_large           base=   725.0ms  patch=   465.7ms   1.56x
> ( 35.8%)  (reads=19676->1294, io_time=688.70->52.20ms)
> bloom_vacuum_large         base=  7138.3ms  patch=  7156.0ms   1.00x
> ( -0.2%)  (reads=48361->11597, io_time=284.56->64.37ms)
> pgstattuple_large          base= 12429.3ms  patch= 11916.8ms   1.04x
> (  4.1%)  (reads=206945->12983, io_time=6501.91->32.24ms)
> gin_vacuum_large           base=  3769.4ms  patch=  3716.7ms   1.01x
> (  1.4%)  (reads=20775->17684, io_time=3562.21->3528.14ms)
> hash_vacuum_large          base= 11750.1ms  patch= 11289.0ms   1.04x
> (  3.9%)  (reads=76677->15606, io_time=1296.03->98.72ms)
> wal_logging_large          base= 32862.3ms  patch= 33179.7ms   0.99x
> ( -1.0%)  (reads=21641->21641, io_time=91.42->90.59ms)
>
> The bloom scan case is a winner in runtime for both cases, and in
> terms of stats we get much better numbers for all of them.  These feel
> rather in line with what you have, except for pgstattuple's runtime,
> still its IO numbers feel good.

Thanks for running the benchmarks! The performance gains for hash,
gin, bloom_vacuum, and wal_logging are insignificant, likely because
these workloads are not I/O-bound. The default number of I/O workers
is three, which is fairly conservative. When I ran the benchmark
script with a higher number of I/O workers, some runs showed improved
performance.

> pgstattuple_large          base= 12429.3ms  patch= 11916.8ms   1.04x
> (  4.1%)  (reads=206945->12983, io_time=6501.91->32.24ms)

> pgstattuple_large          base= 12642.9ms  patch= 11873.5ms   1.06x
> (  6.1%)  (reads=206945->12983, io_time=6516.70->143.46ms)

Yeah, this looks somewhat strange. The io_time has been reduced
significantly, which should also lead to a substantial reduction in
runtime.

method=io_uring
pgstattuple_large          base=  5551.5ms  patch=  3498.2ms   1.59x ( 37.0%)  (reads=206945→12983, io_time=2323.49→207.14ms)

I ran the benchmark for this test again with io_uring, and the result
is consistent with previous runs. I’m not sure what might be
contributing to this behavior.

Another code path that showed significant performance improvement is
pgstatindex [1]. I've incorporated the test into the script too. Here
are the results from my testing:

method=worker io-workers=12
pgstatindex_large          base=   233.8ms  patch=    54.1ms   4.32x ( 76.8%)  (reads=27460→1757, io_time=213.94→6.31ms)

method=io_uring
pgstatindex_large          base=   224.2ms  patch=    56.4ms   3.98x ( 74.9%)  (reads=27460→1757, io_time=204.41→4.88ms)

>That's just to say that I'll review
> them and try to do something about at least some of the pieces for
> this release.

Thanks for that.

[1] https://www.postgresql.org/message-id/flat/CABPTF7UeN2o-trr9r7K76rZExnO2M4SLfvTfbUY2CwQjCekgnQ%40mail.gmail.com

--
Best,
Xuneng

Attachments

Re: Streamify more code paths

From
Andres Freund
Date:
Hi,

On 2026-03-10 19:28:29 +0900, Michael Paquier wrote:
> On Tue, Mar 10, 2026 at 02:06:12PM +0800, Xuneng Zhou wrote:
> > Here’s v5 of the patchset. The wal_logging_large patch has been
> > removed, as no performance gains were observed in the benchmark runs.
>
> Looking at the numbers you are posting, it is harder to get excited
> about the hash, gin, bloom_vacuum and wal_logging.

It's perhaps worth emphasizing that, to allow real world usage of direct IO,
we'll need streaming implementation for most of these. Also, on windows the OS
provided readahead is ... not aggressive, so you'll hit IO stalls much more
frequently than you'd on linux (and some of the BSDs).

It might be a good idea to run the benchmarks with debug_io_direct=data.
That'll make them very slow, since the write side doesn't yet use AIO and thus
will do a lot of synchronous writes, but it should still allow to evaluate the
gains from using read stream.


The other thing that's kinda important to evaluate read streams is to test on
higher latency storage, even without direct IO.  Many workloads are not at all
benefiting from AIO when run on a local NVMe SSD with < 10us latency, but are
severely IO bound when run on a cloud storage disk with 0.5ms - 4ms latency.


To be able to test such higher latencies locally, I've found it quite useful
to use dm_delay above a fast disk. See [1].


> The worker method seems more efficient, may show that we are out of noise
> level.

I think that's more likely to show that memory bandwidth, probably due to
checksum computations, is a factor. The memory copy (from the kernel page
cache, with buffered IO) and the checksum computations (when checksums are
enabled) are parallelized by worker, but not by io_uring.


Greetings,

Andres Freund


[1]

  https://docs.kernel.org/admin-guide/device-mapper/delay.html

  Assuming /dev/md0 is mounted to /srv, and a delay of 1ms should be
  introduced for it:

  umount /srv && dmsetup create delayed --table "0 $(blockdev --getsz /dev/md0) delay /dev/md0 0 1" /dev/md0  && mount /dev/mapper/delayed /srv/
 

  To update the amount of delay to 3ms the following can be used:
  dmsetup suspend delayed && dmsetup reload delayed --table "0 $(blockdev --getsz /dev/md0) delay /dev/md0 0 3" /dev/md0 && dmsetup resume delayed
 

  (I will often just update the delay to 0 for comparison runs, as that
  doesn't require remounting)



Re: Streamify more code paths

From
Andres Freund
Date:
Hi,

On 2026-03-10 21:23:26 +0800, Xuneng Zhou wrote:
> On Tue, Mar 10, 2026 at 6:28 PM Michael Paquier <michael@paquier.xyz> wrote:
> Thanks for running the benchmarks! The performance gains for hash,
> gin, bloom_vacuum, and wal_logging is insignificant, likely because
> these workloads are not I/O-bound. The default number of I/O workers
> is three, which is fairly conservative. When I ran the benchmark
> script with a higher number of I/O workers, some runs showed improved
> performance.

FWIW, another thing that may be an issue is that you're restarting postgres
all the time, as part of drop_caches().  That means we'll spend time reloading
catalog metadata and initializing shared buffers (the first write to a shared
buffers page is considerably more expensive than later ones, as the backing
memory needs to be initialized first).

I found it useful to use the pg_buffercache extension (specifically
pg_buffercache_evict_relation()) to just drop the relation that is going to be
tested from shared_buffers.



> > pgstattuple_large          base= 12429.3ms  patch= 11916.8ms   1.04x
> > (  4.1%)  (reads=206945->12983, io_time=6501.91->32.24ms)
> 
> > pgstattuple_large          base= 12642.9ms  patch= 11873.5ms   1.06x
> > (  6.1%)  (reads=206945->12983, io_time=6516.70->143.46ms)
> 
> Yeah, this looks somewhat strange. The io_time has been reduced
> significantly, which should also lead to a substantial reduction in
> runtime.

It's possible that the bottleneck just moved, e.g to the checksum computation,
if you have data checksums enabled.

It's also worth noting that likely each of the test reps measures
something different, as likely
  psql_run "$ROOT" "$PORT" -c "UPDATE heap_test SET data = data || '!' WHERE id % 5 = 0;"

leads to some out-of-page updates.

You're probably better off deleting some of the data in a transaction that is
then rolled back. That will also unset all-visible, but won't otherwise change
the layout, no matter how many test iterations you run.


I'd also guess that you're seeing a relatively small win because you're
updating every page. When reading every page from disk, the OS can do
efficient readahead.  If there are only occasional misses, that does not work.



> method=io_uring
> pgstattuple_large          base=  5551.5ms  patch=  3498.2ms   1.59x
> ( 37.0%)  (reads=206945→12983, io_time=2323.49→207.14ms)
> 
> I ran the benchmark for this test again with io_uring, and the result
> is consistent with previous runs. I’m not sure what might be
> contributing to this behavior.

What does a perf profile show?  Is the query CPU bound?


> Another code path that showed significant performance improvement is
> pgstatindex [1]. I've incorporated the test into the script too. Here
> are the results from my testing:
> 
> method=worker io-workers=12
> pgstatindex_large          base=   233.8ms  patch=    54.1ms   4.32x
> ( 76.8%)  (reads=27460→1757, io_time=213.94→6.31ms)
> 
> method=io_uring
> pgstatindex_large          base=   224.2ms  patch=    56.4ms   3.98x
> ( 74.9%)  (reads=27460→1757, io_time=204.41→4.88ms)

Nice!


Greetings,

Andres Freund



Re: Streamify more code paths

From
Michael Paquier
Date:
On Tue, Mar 10, 2026 at 07:04:37PM -0400, Andres Freund wrote:
> It might be a good idea to run the benchmarks with debug_io_direct=data.
> That'll make them very slow, since the write side doesn't yet use AIO and thus
> will do a lot of synchronous writes, but it should still allow to evaluate the
> gains from using read stream.

Ah, thanks for the tip.  I'll go try that.

> The other thing that's kinda important to evaluate read streams is to test on
> higher latency storage, even without direct IO.  Many workloads are not at all
> benefiting from AIO when run on a local NVMe SSD with < 10us latency, but are
> severely IO bound when run on a cloud storage disk with 0.5ms - 4ms latency.

My previous run was on a cloud instance, I don't have access to a SSD
with this amount of latency locally.

One thing that was standing out is the bloom bitmap case that was
looking really nice for a large number of rows, so I have applied
this part.  The rest is going to need a bit more testing to build more
confidence, as far as I can see.
--
Michael

Attachments

Re: Streamify more code paths

From
Andres Freund
Date:
Hi,

On 2026-03-10 19:27:59 -0400, Andres Freund wrote:
> > > pgstattuple_large          base= 12429.3ms  patch= 11916.8ms   1.04x
> > > (  4.1%)  (reads=206945->12983, io_time=6501.91->32.24ms)
> > 
> > > pgstattuple_large          base= 12642.9ms  patch= 11873.5ms   1.06x
> > > (  6.1%)  (reads=206945->12983, io_time=6516.70->143.46ms)
> > 
> > Yeah, this looks somewhat strange. The io_time has been reduced
> > significantly, which should also lead to a substantial reduction in
> > runtime.
> 
> It's possible that the bottleneck just moved, e.g to the checksum computation,
> if you have data checksums enabled.
> 
> It's also worth noting that likely each of the test reps measures
> something different, as likely
>   psql_run "$ROOT" "$PORT" -c "UPDATE heap_test SET data = data || '!' WHERE id % 5 = 0;"
> 
> leads to some out-of-page updates.
> 
> You're probably better off deleting some of the data in a transaction that is
> then rolled back. That will also unset all-visible, but won't otherwise change
> the layout, no matter how many test iterations you run.
> 
> 
> I'd also guess that you're seeing a relatively small win because you're
> updating every page. When reading every page from disk, the OS can do
> efficient readahead.  If there are only occasional misses, that does not work.

I think that last one is a big part - if I use
  BEGIN; DELETE FROM heap_test WHERE id % 500 = 0; ROLLBACK;
(which leaves a lot of 

I see much bigger wins due to the pgstattuple changes.

                       time buffered          time DIO
w/o read stream        2222.078 ms            2090.239 ms
w   read stream         299.455 ms             155.124 ms

That's with local storage. io_uring, but numbers with worker are similar.


Greetings,

Andres Freund



Re: Streamify more code paths

From
Xuneng Zhou
Date:
Hi Andres,

On Wed, Mar 11, 2026 at 7:04 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2026-03-10 19:28:29 +0900, Michael Paquier wrote:
> > On Tue, Mar 10, 2026 at 02:06:12PM +0800, Xuneng Zhou wrote:
> > > Here’s v5 of the patchset. The wal_logging_large patch has been
> > > removed, as no performance gains were observed in the benchmark runs.
> >
> > Looking at the numbers you are posting, it is harder to get excited
> > about the hash, gin, bloom_vacuum and wal_logging.
>
> It's perhaps worth emphasizing that, to allow real world usage of direct IO,
> we'll need streaming implementation for most of these. Also, on windows the OS
> provided readahead is ... not aggressive, so you'll hit IO stalls much more
> frequently than you'd on linux (and some of the BSDs).
>
> It might be a good idea to run the benchmarks with debug_io_direct=data.
> That'll make them very slow, since the write side doesn't yet use AIO and thus
> will do a lot of synchronous writes, but it should still allow to evaluate the
> gains from using read stream.
>
>
> The other thing that's kinda important to evaluate read streams is to test on
> higher latency storage, even without direct IO.  Many workloads are not at all
> benefiting from AIO when run on a local NVMe SSD with < 10us latency, but are
> severely IO bound when run on a cloud storage disk with 0.5ms - 4ms latency.
>
>
> To be able to test such higher latencies locally, I've found it quite useful
> to use dm_delay above a fast disk. See [1].

Thanks for the tips! I currently don’t have access to a machine or
cloud instance with slower SSDs or HDDs that have higher latency. I’ll
try running the benchmark with debug_io_direct=data and dm_delay, as
you suggested, to see if the results vary.

>
> > The worker method seems more efficient, may show that we are out of noise
> > level.
>
> I think that's more likely to show that memory bandwidth, probably due to
> checksum computations, is a factor. The memory copy (from the kernel page
> cache, with buffered IO) and the checksum computations (when checksums are
> enabled) are parallelized by worker, but not by io_uring.
>
>
> Greetings,
>
> Andres Freund
>
>
> [1]
>
>   https://docs.kernel.org/admin-guide/device-mapper/delay.html
>
>   Assuming /dev/md0 is mounted to /srv, and a delay of 1ms should be
>   introduced for it:
>
>   umount /srv && dmsetup create delayed --table "0 $(blockdev --getsz /dev/md0) delay /dev/md0 0 1" /dev/md0  && mount /dev/mapper/delayed /srv/
>
>   To update the amount of delay to 3ms the following can be used:
>   dmsetup suspend delayed && dmsetup reload delayed --table "0 $(blockdev --getsz /dev/md0) delay /dev/md0 0 3" /dev/md0 && dmsetup resume delayed
>
>   (I will often just update the delay to 0 for comparison runs, as that
>   doesn't require remounting)



--
Best,
Xuneng



Re: Streamify more code paths

From
Xuneng Zhou
Date:
Hi,

On Wed, Mar 11, 2026 at 7:28 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2026-03-10 21:23:26 +0800, Xuneng Zhou wrote:
> > On Tue, Mar 10, 2026 at 6:28 PM Michael Paquier <michael@paquier.xyz> wrote:
> > Thanks for running the benchmarks! The performance gains for hash,
> > gin, bloom_vacuum, and wal_logging is insignificant, likely because
> > these workloads are not I/O-bound. The default number of I/O workers
> > is three, which is fairly conservative. When I ran the benchmark
> > script with a higher number of I/O workers, some runs showed improved
> > performance.
>
> FWIW, another thing that may be an issue is that you're restarting postgres
> all the time, as part of drop_caches().  That means we'll spend time reloading
> catalog metadata and initializing shared buffers (the first write to a shared
> buffers page is considerably more expensive than later ones, as the backing
> memory needs to be initialized first).
>
> I found it useful to use the pg_buffercache extension (specifically
> pg_buffercache_evict_relation()) to just drop the relation that is going to be
> tested from shared_buffers.

Good point. I'll switch to using pg_buffercache_evict_relation() to
evict only the target relation, keeping the cluster running. That
should reduce measurement noise to some extent.

>
> > > pgstattuple_large          base= 12429.3ms  patch= 11916.8ms   1.04x
> > > (  4.1%)  (reads=206945->12983, io_time=6501.91->32.24ms)
> >
> > > pgstattuple_large          base= 12642.9ms  patch= 11873.5ms   1.06x
> > > (  6.1%)  (reads=206945->12983, io_time=6516.70->143.46ms)
> >
> > Yeah, this looks somewhat strange. The io_time has been reduced
> > significantly, which should also lead to a substantial reduction in
> > runtime.
>
> It's possible that the bottleneck just moved, e.g to the checksum computation,
> if you have data checksums enabled.
>
> It's also worth noting that likely each of the test reps measures
> something different, as likely
>   psql_run "$ROOT" "$PORT" -c "UPDATE heap_test SET data = data || '!' WHERE id % 5 = 0;"
>
> leads to some out-of-page updates.
>
> You're probably better off deleting some of the data in a transaction that is
> then rolled back. That will also unset all-visible, but won't otherwise change
> the layout, no matter how many test iterations you run.
>
>
> I'd also guess that you're seeing a relatively small win because you're
> updating every page. When reading every page from disk, the OS can do
> efficient readahead.  If there are only occasional misses, that does not work.
>

Yeah, the repeated UPDATE changes the table layout across reps. I'll switch to:

BEGIN;
DELETE FROM heap_test WHERE id % N = 0;
ROLLBACK;

This clears the visibility map bits without altering the physical
layout, so every rep measures the same table state.

>
> > method=io_uring
> > pgstattuple_large          base=  5551.5ms  patch=  3498.2ms   1.59x
> > ( 37.0%)  (reads=206945→12983, io_time=2323.49→207.14ms)
> >
> > I ran the benchmark for this test again with io_uring, and the result
> > is consistent with previous runs. I’m not sure what might be
> > contributing to this behavior.
>
> What does a perf profile show?  Is the query CPU bound?

The runtime in my run of pgstattuple was reduced significantly due to
the reduction in I/O time. I don’t think running perf on my setup
would reveal anything particularly meaningful. The script has an
option to run with perf, so perhaps Michael could try it to see
whether the query becomes CPU-bound, if he’s interested and has time.

> > Another code path that showed significant performance improvement is
> > pgstatindex [1]. I've incorporated the test into the script too. Here
> > are the results from my testing:
> >
> > method=worker io-workers=12
> > pgstatindex_large          base=   233.8ms  patch=    54.1ms   4.32x
> > ( 76.8%)  (reads=27460→1757, io_time=213.94→6.31ms)
> >
> > method=io_uring
> > pgstatindex_large          base=   224.2ms  patch=    56.4ms   3.98x
> > ( 74.9%)  (reads=27460→1757, io_time=204.41→4.88ms)
>
> Nice!
>
>
> Greetings,
>
> Andres Freund



--
Best,
Xuneng



Re: Streamify more code paths

From
Xuneng Zhou
Date:
Hi,

On Wed, Mar 11, 2026 at 7:29 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Tue, Mar 10, 2026 at 07:04:37PM -0400, Andres Freund wrote:
> > It might be a good idea to run the benchmarks with debug_io_direct=data.
> > That'll make them very slow, since the write side doesn't yet use AIO and thus
> > will do a lot of synchronous writes, but it should still allow to evaluate the
> > gains from using read stream.
>
> Ah, thanks for the tip.  I'll go try that.
>
> > The other thing that's kinda important to evaluate read streams is to test on
> > higher latency storage, even without direct IO.  Many workloads are not at all
> > benefiting from AIO when run on a local NVMe SSD with < 10us latency, but are
> > severely IO bound when run on a cloud storage disk with 0.5ms - 4ms latency.
>
> My previous run was on a cloud instance, I don't have access to a SSD
> with this amount of latency locally.
>
> One thing that was standing on is the bloom bitmap case that was
> looking really nice for a large number of rows, so I have applied
> this part.  The rest is going to need a bit more testing to build more
> confidence, as far as I can see.
> --
> Michael

Thanks for pushing that. I’ll update the script with Andres’
suggestions and share it shortly.

--
Best,
Xuneng



Re: Streamify more code paths

From
Xuneng Zhou
Date:
Hi,

On Wed, Mar 11, 2026 at 8:16 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2026-03-10 19:27:59 -0400, Andres Freund wrote:
> > > > pgstattuple_large          base= 12429.3ms  patch= 11916.8ms   1.04x
> > > > (  4.1%)  (reads=206945->12983, io_time=6501.91->32.24ms)
> > >
> > > > pgstattuple_large          base= 12642.9ms  patch= 11873.5ms   1.06x
> > > > (  6.1%)  (reads=206945->12983, io_time=6516.70->143.46ms)
> > >
> > > Yeah, this looks somewhat strange. The io_time has been reduced
> > > significantly, which should also lead to a substantial reduction in
> > > runtime.
> >
> > It's possible that the bottleneck just moved, e.g to the checksum computation,
> > if you have data checksums enabled.
> >
> > It's also worth noting that likely each of the test reps measures
> > something different, as likely
> >   psql_run "$ROOT" "$PORT" -c "UPDATE heap_test SET data = data || '!' WHERE id % 5 = 0;"
> >
> > leads to some out-of-page updates.
> >
> > You're probably better off deleting some of the data in a transaction that is
> > then rolled back. That will also unset all-visible, but won't otherwise change
> > the layout, no matter how many test iterations you run.
> >
> >
> > I'd also guess that you're seeing a relatively small win because you're
> > updating every page. When reading every page from disk, the OS can do
> > efficient readahead.  If there are only occasional misses, that does not work.
>
> I think that last one is a big part - if I use
>   BEGIN; DELETE FROM heap_test WHERE id % 500 = 0; ROLLBACK;
> (which leaves a lot of
>
> I see much bigger wins due to the pgstattuple changes.
>
>                        time buffered          time DIO
> w/o read stream        2222.078 ms            2090.239 ms
> w   read stream         299.455 ms             155.124 ms
>
> That's with local storage. io_uring, but numbers with worker are similar.
>

The results look great and interesting. This looks far better than
what I observed in my earlier tests. I’ll run perf for pgstattuple
without those changes to see what is keeping the CPU busy.

--
Best,
Xuneng



Re: Streamify more code paths

From
Nazir Bilal Yavuz
Date:
Hi,

On Tue, 10 Mar 2026 at 16:23, Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> Another code path that showed significant performance improvement is
> pgstatindex [1]. I've incorporated the test into the script too. Here
> are the results from my testing:
>
> method=worker io-workers=12
> pgstatindex_large          base=   233.8ms  patch=    54.1ms   4.32x
> ( 76.8%)  (reads=27460→1757, io_time=213.94→6.31ms)
>
> method=io_uring
> pgstatindex_large          base=   224.2ms  patch=    56.4ms   3.98x
> ( 74.9%)  (reads=27460→1757, io_time=204.41→4.88ms)

I didn't run the benchmark yet but here is a small suggestion for the
pgstatindex patch:

+    p.current_blocknum = BTREE_METAPAGE + 1;
+    p.last_exclusive = nblocks;

     for (blkno = 1; blkno < nblocks; blkno++)

...

+    p.current_blocknum = HASH_METAPAGE + 1;
+    p.last_exclusive = nblocks;

     for (blkno = 1; blkno < nblocks; blkno++)

Could you move 'BTREE_METAPAGE + 1' and 'HASH_METAPAGE + 1' into
variables and then set p.current_blocknum and blkno using those
variables? p.current_blocknum and blkno should have the same initial
values; this change makes the code less error-prone and easier to
read, in my opinion.

Other than the comment above, LGTM.

--
Regards,
Nazir Bilal Yavuz
Microsoft



Re: Streamify more code paths

From
Xuneng Zhou
Date:
Hi,

On Wed, Mar 11, 2026 at 10:23 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> Hi,
>
> On Wed, Mar 11, 2026 at 8:16 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > On 2026-03-10 19:27:59 -0400, Andres Freund wrote:
> > > > > pgstattuple_large          base= 12429.3ms  patch= 11916.8ms   1.04x
> > > > > (  4.1%)  (reads=206945->12983, io_time=6501.91->32.24ms)
> > > >
> > > > > pgstattuple_large          base= 12642.9ms  patch= 11873.5ms   1.06x
> > > > > (  6.1%)  (reads=206945->12983, io_time=6516.70->143.46ms)
> > > >
> > > > Yeah, this looks somewhat strange. The io_time has been reduced
> > > > significantly, which should also lead to a substantial reduction in
> > > > runtime.
> > >
> > > It's possible that the bottleneck just moved, e.g to the checksum computation,
> > > if you have data checksums enabled.
> > >
> > > It's also worth noting that likely each of the test reps measures
> > > something different, as likely
> > >   psql_run "$ROOT" "$PORT" -c "UPDATE heap_test SET data = data || '!' WHERE id % 5 = 0;"
> > >
> > > leads to some out-of-page updates.
> > >
> > > You're probably better off deleting some of the data in a transaction that is
> > > then rolled back. That will also unset all-visible, but won't otherwise change
> > > the layout, no matter how many test iterations you run.
> > >
> > >
> > > I'd also guess that you're seeing a relatively small win because you're
> > > updating every page. When reading every page from disk, the OS can do
> > > efficient readahead.  If there are only occasional misses, that does not work.
> >
> > I think that last one is a big part - if I use
> >   BEGIN; DELETE FROM heap_test WHERE id % 500 = 0; ROLLBACK;
> > (which leaves a lot of
> >
> > I see much bigger wins due to the pgstattuple changes.
> >
> >                        time buffered          time DIO
> > w/o read stream        2222.078 ms            2090.239 ms
> > w   read stream         299.455 ms             155.124 ms
> >
> > That's with local storage. io_uring, but numbers with worker are similar.
> >
>
> The results look great and interesting. This looks far better than
> what I observed in my earlier tests. I’ll run perf for pgstattuple
> without the switching to see what is keeping the CPU busy.
>
> --
> Best,
> Xuneng

io_uring
pgstattuple_large          base=  1090.6ms  patch=   143.3ms   7.61x ( 86.9%)  (reads=20049→20049, io_time=1040.80→46.91ms)

I observed a similar magnitude of runtime reduction after switching to
pg_buffercache_evict_relation() and using BEGIN; DELETE FROM heap_test
WHERE id % 500 = 0; ROLLBACK. However, I lost the original flame
graphs after running many performance tests. I will regenerate them
and post them later.

--
Best,
Xuneng

Attachments

Re: Streamify more code paths

From
Xuneng Zhou
Date:
On Wed, Mar 11, 2026 at 9:37 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> Hi Andres,
>
> On Wed, Mar 11, 2026 at 7:04 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > On 2026-03-10 19:28:29 +0900, Michael Paquier wrote:
> > > On Tue, Mar 10, 2026 at 02:06:12PM +0800, Xuneng Zhou wrote:
> > > > Here’s v5 of the patchset. The wal_logging_large patch has been
> > > > removed, as no performance gains were observed in the benchmark runs.
> > >
> > > Looking at the numbers you are posting, it is harder to get excited
> > > about the hash, gin, bloom_vacuum and wal_logging.
> >
> > It's perhaps worth emphasizing that, to allow real world usage of direct IO,
> > we'll need streaming implementation for most of these. Also, on windows the OS
> > provided readahead is ... not aggressive, so you'll hit IO stalls much more
> > frequently than you would on Linux (and some of the BSDs).
> >
> > It might be a good idea to run the benchmarks with debug_io_direct=data.
> > That'll make them very slow, since the write side doesn't yet use AIO and thus
> > will do a lot of synchronous writes, but it should still allow to evaluate the
> > gains from using read stream.
> >
> >
> > The other thing that's kinda important to evaluate read streams is to test on
> > higher latency storage, even without direct IO.  Many workloads are not at all
> > benefiting from AIO when run on a local NVMe SSD with < 10us latency, but are
> > severely IO bound when run on a cloud storage disk with 0.5ms - 4ms latency.
> >
> >
> > To be able to test such higher latencies locally, I've found it quite useful
> > to use dm_delay above a fast disk. See [1].
>
> Thanks for the tips! I currently don’t have access to a machine or
> cloud instance with slower SSDs or HDDs that have higher latency. I’ll
> try running the benchmark with debug_io_direct=data and dm_delay, as
> you suggested, to see if the results vary.
>
> >
> > > The worker method seems more efficient, which may show that we are out of the
> > > noise level.
> >
> > I think that's more likely to show that memory bandwidth, probably due to
> > checksum computations, is a factor. The memory copy (from the kernel page
> > cache, with buffered IO) and the checksum computations (when checksums are
> > enabled) are parallelized by worker, but not by io_uring.
> >
> >
> > Greetings,
> >
> > Andres Freund
> >
> >
> > [1]
> >
> >   https://docs.kernel.org/admin-guide/device-mapper/delay.html
> >
> >   Assuming /dev/md0 is mounted to /srv, and a delay of 1ms should be
> >   introduced for it:
> >
> >   umount /srv && dmsetup create delayed --table "0 $(blockdev --getsz /dev/md0) delay /dev/md0 0 1" /dev/md0 && mount /dev/mapper/delayed /srv/
> >
> >   To update the amount of delay to 3ms the following can be used:
> >   dmsetup suspend delayed && dmsetup reload delayed --table "0 $(blockdev --getsz /dev/md0) delay /dev/md0 0 3" /dev/md0 && dmsetup resume delayed
> >
> >   (I will often just update the delay to 0 for comparison runs, as that
> >   doesn't require remounting)
>

With debug_io_direct=data and dm_delay, the results look quite promising!

medium size / io_uring
gin_vacuum_medium          base=  1619.9ms  patch=   301.8ms   5.37x
( 81.4%)  (reads=1571→947, io_time=1524.86→207.48ms)

The average runtime increases significantly after adding the manual
device delay, so it will take some time to complete all the test runs.
I was also busy with something else today... Once the runs are
finished, I’ll share the results and the script to reproduce them.

--
Best,
Xuneng



Re: Streamify more code paths

From
Michael Paquier
Date:
On Wed, Mar 11, 2026 at 11:11:23PM +0800, Xuneng Zhou wrote:
> The average runtime increases significantly after adding the manual
> device delay, so it will take some time to complete all the test runs.
> I was also busy with something else today... Once the runs are
> finished, I’ll share the results and the script to reproduce them.

Thanks for doing that.  On my side, I am going to look at the gin and
hash vacuum paths first with more testing as these don't use a custom
callback.  I don't think that I am going to need a lot of convincing,
but I'd rather produce some numbers myself before doing something.
I'll tweak a mounting point with the delay trick, as well.
--
Michael

Attachments

Re: Streamify more code paths

From
Xuneng Zhou
Date:
Hi,

On Wed, Mar 11, 2026 at 3:53 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> Hi,
>
> On Tue, 10 Mar 2026 at 16:23, Xuneng Zhou <xunengzhou@gmail.com> wrote:
> >
> > Another code path that showed significant performance improvement is
> > pgstatindex [1]. I've incorporated the test into the script too. Here
> > are the results from my testing:
> >
> > method=worker io-workers=12
> > pgstatindex_large          base=   233.8ms  patch=    54.1ms   4.32x
> > ( 76.8%)  (reads=27460→1757, io_time=213.94→6.31ms)
> >
> > method=io_uring
> > pgstatindex_large          base=   224.2ms  patch=    56.4ms   3.98x
> > ( 74.9%)  (reads=27460→1757, io_time=204.41→4.88ms)
>
> I didn't run the benchmark yet but here is a small suggestion for the
> pgstatindex patch:
>
> +    p.current_blocknum = BTREE_METAPAGE + 1;
> +    p.last_exclusive = nblocks;
>
>      for (blkno = 1; blkno < nblocks; blkno++)
>
> ...
>
> +    p.current_blocknum = HASH_METAPAGE + 1;
> +    p.last_exclusive = nblocks;
>
>      for (blkno = 1; blkno < nblocks; blkno++)
>
> Could you move 'BTREE_METAPAGE + 1' and 'HASH_METAPAGE + 1' into
> variables and then set p.current_blocknum and blkno using those
> variables? p.current_blocknum and blkno should have the same initial
> values, this change makes code less error prone and easier to read in
> my opinion.
>
> Other than the comment above, LGTM.
>

Thanks! That makes sense to me. Please see the patch I’ll post later.

--
Best,
Xuneng



Re: Streamify more code paths

From
Michael Paquier
Date:
On Thu, Mar 12, 2026 at 06:33:08AM +0900, Michael Paquier wrote:
> Thanks for doing that.  On my side, I am going to look at the gin and
> hash vacuum paths first with more testing as these don't use a custom
> callback.  I don't think that I am going to need a lot of convincing,
> but I'd rather produce some numbers myself before doing something.
> I'll tweak a mounting point with the delay trick, as well.

While debug_io_direct has been helping a bit, the trick for the delay
to throttle the IO activity has helped much more with my runtime
numbers.  I have mounted a separate partition with a delay of 5ms,
disabled checksums (this part did not make a real difference), and
evicted shared buffers for the relation and indexes before the VACUUM.

Then I got better numbers.  Here is an extract:
- worker=3:
gin_vacuum (100k tuples)   base=  1448.2ms  patch=   572.5ms   2.53x
( 60.5%)  (reads=175→104, io_time=1382.70→506.64ms)
gin_vacuum (300k tuples)   base=  3728.0ms  patch=  1332.0ms   2.80x
( 64.3%)  (reads=486→293, io_time=3669.89→1266.27ms)
bloom_vacuum (100k tuples) base= 21826.8ms  patch= 17220.3ms   1.27x
( 21.1%)  (reads=485→117, io_time=4773.33→270.56ms)
bloom_vacuum (300k tuples) base= 67054.0ms  patch= 53164.7ms   1.26x
( 20.7%)  (reads=1431.5→327.5, io_time=13880.2→381.395ms)
- io_uring:
gin_vacuum (100k tuples)   base=  1240.3ms  patch=   360.5ms   3.44x
( 70.9%)  (reads=175→104, io_time=1175.35→299.75ms)
gin_vacuum (300k tuples)   base=  2829.9ms  patch=   642.0ms   4.41x
( 77.3%)  (reads=465.5→293, io_time=2768.46→579.04ms)
bloom_vacuum (100k tuples) base= 22121.7ms  patch= 17532.3ms   1.26x
( 20.7%)  (reads=485→117, io_time=4850.46→285.28ms)
bloom_vacuum (300k tuples) base= 67058.0ms  patch= 53118.0ms   1.26x
( 20.8%)  (reads=1431.5→327.5, io_time=13870.9→305.44ms)

The higher the number of tuples, the better the performance for each
individual operation, but the tests take a much longer time (tens of
seconds vs tens of minutes).  For GIN, the numbers can be quite good
once these reads are pushed.  For bloom, the runtime is improved, and
the IO numbers are much better.

In the end, I have applied these two parts.  What remains now is the
hash vacuum and the two pgstattuple parts.
--
Michael

Attachments

Re: Streamify more code paths

From
Xuneng Zhou
Date:
On Thu, Mar 12, 2026 at 11:42 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Mar 12, 2026 at 06:33:08AM +0900, Michael Paquier wrote:
> > Thanks for doing that.  On my side, I am going to look at the gin and
> > hash vacuum paths first with more testing as these don't use a custom
> > callback.  I don't think that I am going to need a lot of convincing,
> > but I'd rather produce some numbers myself before doing something.
> > I'll tweak a mounting point with the delay trick, as well.
>
> While debug_io_direct has been helping a bit, the trick for the delay
> to throttle the IO activity has helped much more with my runtime
> numbers.  I have mounted a separate partition with a delay of 5ms,
> disabled checksums (this part did not make a real difference), and
> evicted shared buffers for relation and indexes before the VACUUM.
>
> Then I got better numbers.  Here is an extract:
> - worker=3:
> gin_vacuum (100k tuples)   base=  1448.2ms  patch=   572.5ms   2.53x
> ( 60.5%)  (reads=175→104, io_time=1382.70→506.64ms)
> gin_vacuum (300k tuples)   base=  3728.0ms  patch=  1332.0ms   2.80x
> ( 64.3%)  (reads=486→293, io_time=3669.89→1266.27ms)
> bloom_vacuum (100k tuples) base= 21826.8ms  patch= 17220.3ms   1.27x
> ( 21.1%)  (reads=485→117, io_time=4773.33→270.56ms)
> bloom_vacuum (300k tuples) base= 67054.0ms  patch= 53164.7ms   1.26x
> ( 20.7%)  (reads=1431.5→327.5, io_time=13880.2→381.395ms)
> - io_uring:
> gin_vacuum (100k tuples)   base=  1240.3ms  patch=   360.5ms   3.44x
> ( 70.9%)  (reads=175→104, io_time=1175.35→299.75ms)
> gin_vacuum (300k tuples)   base=  2829.9ms  patch=   642.0ms   4.41x
> ( 77.3%)  (reads=465.5→293, io_time=2768.46→579.04ms)
> bloom_vacuum (100k tuples) base= 22121.7ms  patch= 17532.3ms   1.26x
> ( 20.7%)  (reads=485→117, io_time=4850.46→285.28ms)
> bloom_vacuum (300k tuples) base= 67058.0ms  patch= 53118.0ms   1.26x
> ( 20.8%)  (reads=1431.5→327.5, io_time=13870.9→305.44ms)
>
> The higher the number of tuples, the better the performance for each
> individual operation, but the tests take a much longer time (tens of
> seconds vs tens of minutes).  For GIN, the numbers can be quite good
> once these reads are pushed.  For bloom, the runtime is improved, and
> the IO numbers are much better.
>
> At the end, I have applied these two parts.  Remains now the hash
> vacuum and the two parts for pgstattuple.
> --
> Michael

Thanks for running the benchmarks and pushing!

Here are the results of my test with debug_io_direct and delay:

-- io_uring, medium size

bloom_vacuum_medium        base=  8355.2ms  patch=   715.0ms  11.68x
( 91.4%)  (reads=4732→1056, io_time=7699.47→86.52ms)
pgstattuple_medium         base=  4012.8ms  patch=   213.7ms  18.78x
( 94.7%)  (reads=2006→2006, io_time=4001.66→200.24ms)
pgstatindex_medium         base=  5490.6ms  patch=    37.9ms  144.88x
( 99.3%)  (reads=2745→173, io_time=5481.54→7.82ms)
hash_vacuum_medium         base= 34483.4ms  patch=  2703.5ms  12.75x
( 92.2%)  (reads=19166→3901, io_time=31948.33→308.05ms)
wal_logging_medium         base=  7778.6ms  patch=  7814.5ms   1.00x
( -0.5%)  (reads=2857→2845, io_time=11.84→11.45ms)

-- worker, medium size
bloom_vacuum_medium        base=  8376.2ms  patch=   747.7ms  11.20x
( 91.1%)  (reads=4732→1056, io_time=7688.91→65.49ms)
pgstattuple_medium         base=  4012.7ms  patch=   339.0ms  11.84x
( 91.6%)  (reads=2006→2006, io_time=4002.23→49.99ms)
pgstatindex_medium         base=  5490.3ms  patch=    38.3ms  143.23x
( 99.3%)  (reads=2745→173, io_time=5480.60→16.24ms)
hash_vacuum_medium         base= 34638.4ms  patch=  2940.2ms  11.78x
( 91.5%)  (reads=19166→3901, io_time=31881.61→242.01ms)
wal_logging_medium         base=  7440.1ms  patch=  7434.0ms   1.00x
(  0.1%)  (reads=2861→2825, io_time=10.62→10.71ms)

-- Setting read delay only
sudo dmsetup reload "$DM_DELAY_DEV" --table "0 $size delay $dev 0 $ms $dev 0 0"
Setting dm_delay on delayed to 2ms read / 0ms write

After setting the write delay to 0ms, I observe more pronounced
speedups overall. Since the vacuum operation is write-intensive,
delaying writes can dominate the runtime and mask the read-path
improvement we're measuring. Dropping the write delay also shortens
the test runs.

-- wal_logging
The wal_logging patch does not seem to benefit from streamification in
this configuration either.

-- Delay setup
For anyone wanting to reproduce the results with a simulated-latency
device, here is the setup I used.

1. Create a 50GB file-backed block device (enough for PG data + indexes)

sudo dd if=/dev/zero of=/srv/delay_disk.img bs=1M count=50000 status=progress
sudo losetup /dev/loop0 /srv/delay_disk.img

2. Create the dm_delay device with 2ms delay
sudo dmsetup create delayed --table "0 $(sudo blockdev --getsz
/dev/loop0) delay /dev/loop0 0 2"

3. Format and mount it

sudo mkfs.ext4 /dev/mapper/delayed
sudo mkdir -p /srv/pg_delayed
sudo mount /dev/mapper/delayed /srv/pg_delayed
sudo chown $(whoami) /srv/pg_delayed

4. Run benchmark with WORKROOT pointing to the delayed device

WORKROOT=/srv/pg_delayed SIZES=medium REPS=3 \
  ./run_streaming_benchmark.sh --baseline --io-method io_uring \
    --test gin_vacuum --direct-io --io-delay 2 \
     <the targeted patch>
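For reference, step 2 above builds a dm-delay table line of the form
"<start> <sectors> delay <device> <offset> <delay_ms>". Here is a tiny
helper (the delay_table name is mine, not from the benchmark script)
that only formats such a line, taking the sector count as an argument
so it can be checked without root or a real device:

```shell
# delay_table SECTORS DEV READ_MS
# Print a dm-delay table line: "<start> <sectors> delay <dev> <offset> <ms>".
delay_table() {
    local sectors=$1 dev=$2 ms=$3
    echo "0 $sectors delay $dev 0 $ms"
}

# With a real device, the sector count would come from:
#   blockdev --getsz /dev/loop0
delay_table 102400000 /dev/loop0 2
```

The printed line is exactly what `dmsetup create delayed --table "..."`
expects in step 2.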


--
Best,
Xuneng

Attachments

Re: Streamify more code paths

От
Xuneng Zhou
Дата:
On Thu, Mar 12, 2026 at 12:39 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> On Thu, Mar 12, 2026 at 11:42 AM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Thu, Mar 12, 2026 at 06:33:08AM +0900, Michael Paquier wrote:
> > > Thanks for doing that.  On my side, I am going to look at the gin and
> > > hash vacuum paths first with more testing as these don't use a custom
> > > callback.  I don't think that I am going to need a lot of convincing,
> > > but I'd rather produce some numbers myself before doing something.
> > > I'll tweak a mounting point with the delay trick, as well.
> >
> > While debug_io_direct has been helping a bit, the trick for the delay
> > to throttle the IO activity has helped much more with my runtime
> > numbers.  I have mounted a separate partition with a delay of 5ms,
> > disabled checksums (this part did not make a real difference), and
> > evicted shared buffers for relation and indexes before the VACUUM.
> >
> > Then I got better numbers.  Here is an extract:
> > - worker=3:
> > gin_vacuum (100k tuples)   base=  1448.2ms  patch=   572.5ms   2.53x
> > ( 60.5%)  (reads=175→104, io_time=1382.70→506.64ms)
> > gin_vacuum (300k tuples)   base=  3728.0ms  patch=  1332.0ms   2.80x
> > ( 64.3%)  (reads=486→293, io_time=3669.89→1266.27ms)
> > bloom_vacuum (100k tuples) base= 21826.8ms  patch= 17220.3ms   1.27x
> > ( 21.1%)  (reads=485→117, io_time=4773.33→270.56ms)
> > bloom_vacuum (300k tuples) base= 67054.0ms  patch= 53164.7ms   1.26x
> > ( 20.7%)  (reads=1431.5→327.5, io_time=13880.2→381.395ms)
> > - io_uring:
> > gin_vacuum (100k tuples)   base=  1240.3ms  patch=   360.5ms   3.44x
> > ( 70.9%)  (reads=175→104, io_time=1175.35→299.75ms)
> > gin_vacuum (300k tuples)   base=  2829.9ms  patch=   642.0ms   4.41x
> > ( 77.3%)  (reads=465.5→293, io_time=2768.46→579.04ms)
> > bloom_vacuum (100k tuples) base= 22121.7ms  patch= 17532.3ms   1.26x
> > ( 20.7%)  (reads=485→117, io_time=4850.46→285.28ms)
> > bloom_vacuum (300k tuples) base= 67058.0ms  patch= 53118.0ms   1.26x
> > ( 20.8%)  (reads=1431.5→327.5, io_time=13870.9→305.44ms)
> >
> > The higher the number of tuples, the better the performance for each
> > individual operation, but the tests take a much longer time (tens of
> > seconds vs tens of minutes).  For GIN, the numbers can be quite good
> > once these reads are pushed.  For bloom, the runtime is improved, and
> > the IO numbers are much better.
> >
>
> -- io_uring, medium size
>
> bloom_vacuum_medium        base=  8355.2ms  patch=   715.0ms  11.68x
> ( 91.4%)  (reads=4732→1056, io_time=7699.47→86.52ms)
> pgstattuple_medium         base=  4012.8ms  patch=   213.7ms  18.78x
> ( 94.7%)  (reads=2006→2006, io_time=4001.66→200.24ms)
> pgstatindex_medium         base=  5490.6ms  patch=    37.9ms  144.88x
> ( 99.3%)  (reads=2745→173, io_time=5481.54→7.82ms)
> hash_vacuum_medium         base= 34483.4ms  patch=  2703.5ms  12.75x
> ( 92.2%)  (reads=19166→3901, io_time=31948.33→308.05ms)
> wal_logging_medium         base=  7778.6ms  patch=  7814.5ms   1.00x
> ( -0.5%)  (reads=2857→2845, io_time=11.84→11.45ms)
>
> -- worker, medium size
> bloom_vacuum_medium        base=  8376.2ms  patch=   747.7ms  11.20x
> ( 91.1%)  (reads=4732→1056, io_time=7688.91→65.49ms)
> pgstattuple_medium         base=  4012.7ms  patch=   339.0ms  11.84x
> ( 91.6%)  (reads=2006→2006, io_time=4002.23→49.99ms)
> pgstatindex_medium         base=  5490.3ms  patch=    38.3ms  143.23x
> ( 99.3%)  (reads=2745→173, io_time=5480.60→16.24ms)
> hash_vacuum_medium         base= 34638.4ms  patch=  2940.2ms  11.78x
> ( 91.5%)  (reads=19166→3901, io_time=31881.61→242.01ms)
> wal_logging_medium         base=  7440.1ms  patch=  7434.0ms   1.00x
> (  0.1%)  (reads=2861→2825, io_time=10.62→10.71ms)
>

Our io_time metric currently measures only read time and ignores write
I/O, which can be misleading. We now separate it into read_time and
write_time.

-- write-delay 2 ms
WORKROOT=/srv/pg_delayed SIZES=small REPS=3
./run_streaming_benchmark.sh --baseline --io-method worker
--io-workers 12 --test hash_vacuum --direct-io --read-delay 2
--write-delay 2
v6-0004-Streamify-hash-index-VACUUM-primary-bucket-page-r.patch

hash_vacuum_small          base= 16652.8ms  patch= 13493.2ms   1.23x
( 19.0%)  (reads=2338→815, read_time=4136.19→884.79ms,
writes=6218→6206, write_time=12313.81→12289.58ms)

-- write-delay 0 ms
WORKROOT=/srv/pg_delayed SIZES=small REPS=3
./run_streaming_benchmark.sh --baseline --io-method worker
--io-workers 12 --test hash_vacuum --direct-io --read-delay 2
--write-delay 0
v6-0004-Streamify-hash-index-VACUUM-primary-bucket-page-r.patch

hash_vacuum_small          base=  4310.2ms  patch=  1146.7ms   3.76x
( 73.4%)  (reads=2338→815, read_time=4002.24→833.47ms,
writes=6218→6206, write_time=186.69→140.96ms)
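For anyone reading the summary lines, the derived numbers are
straightforward: the multiplier is base/patch and the percentage is the
time saved relative to base. A sketch of the arithmetic (my guess at
what the script computes, not its actual code):

```python
def summarize(base_ms: float, patch_ms: float) -> str:
    """Format a result line like the tables above: speedup = base/patch,
    percent saved = (1 - patch/base) * 100."""
    speedup = base_ms / patch_ms
    saved = (1.0 - patch_ms / base_ms) * 100.0
    return (f"base={base_ms:>8.1f}ms  patch={patch_ms:>8.1f}ms  "
            f"{speedup:5.2f}x ({saved:5.1f}%)")

# Checked against the hash_vacuum_small row above (write-delay 0 ms):
print(summarize(4310.2, 1146.7))  # ...  3.76x ( 73.4%)
```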

--
Best,
Xuneng

Attachments

Re: Streamify more code paths

From
Michael Paquier
Date:
On Thu, Mar 12, 2026 at 11:35:48PM +0800, Xuneng Zhou wrote:
> Our io_time metric currently measures only read time and ignores write
> I/O, which can be misleading. We now separate it into read_time and
> write_time.

I had a look at the pgstatindex part this morning, running my own test
under conditions similar to 6c228755add8, and here's one extract with
io_uring:
pgstatindex (100k tuples) base=32938.2ms patch=83.3ms 395.60x ( 99.7%)
(reads=2745->173, io_time=32932.09->59.75ms)

There was one issue with a declaration put in the middle of the code,
which I have fixed.  This one is now done; 3 pieces remain to
evaluate.
--
Michael

Attachments

Re: Streamify more code paths

From
Xuneng Zhou
Date:
On Fri, Mar 13, 2026 at 9:50 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Mar 12, 2026 at 11:35:48PM +0800, Xuneng Zhou wrote:
> > Our io_time metric currently measures only read time and ignores write
> > I/O, which can be misleading. We now separate it into read_time and
> > write_time.
>
> I had a look at the pgstatindex part this morning, running my own test
> under conditions similar to 6c228755add8, and here's one extract with
> io_uring:
> pgstatindex (100k tuples) base=32938.2ms patch=83.3ms 395.60x ( 99.7%)
> (reads=2745->173, io_time=32932.09->59.75ms)

This result looks great!

> There was one issue with a declaration put in the middle of the code,
> which I have fixed.  This one is now done; 3 pieces remain to
> evaluate.

Thanks for fixing this and for taking the time to review and test the patches.

--
Best,
Xuneng