Thread: Streamify more code paths
Hi Hackers,

I noticed several additional paths in contrib modules, beyond [1],
that are potentially suitable for streamification:

1) pgstattuple — pgstatapprox.c and parts of pgstattuple_approx_internal
2) Bloom — scan paths in blgetbitmap() and maintenance paths in blbulkdelete()

The following patches streamify those code paths. No benchmarks have
been run yet.

[1] https://www.postgresql.org/message-id/flat/CABPTF7UeN2o-trr9r7K76rZExnO2M4SLfvTfbUY2CwQjCekgnQ%40mail.gmail.com

Feedback welcome.

--
Best,
Xuneng
Attachments
Hi,

On Thu, Dec 25, 2025 at 1:51 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> Hi Hackers,
>
> I noticed several additional paths in contrib modules, beyond [1],
> that are potentially suitable for streamification:
>
> 1) pgstattuple — pgstatapprox.c and parts of pgstattuple_approx_internal
> 2) Bloom — scan paths in blgetbitmap() and maintenance paths in blbulkdelete()
>
> The following patches streamify those code paths. No benchmarks have
> been run yet.
>
> [1] https://www.postgresql.org/message-id/flat/CABPTF7UeN2o-trr9r7K76rZExnO2M4SLfvTfbUY2CwQjCekgnQ%40mail.gmail.com
>
> Feedbacks welcome.
>

One more in ginvacuumcleanup().

--
Best,
Xuneng
Attachments
Hi,

Thank you for working on this!

On Thu, 25 Dec 2025 at 09:34, Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> Hi,
>
> On Thu, Dec 25, 2025 at 1:51 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> >
> > Hi Hackers,
> >
> > I noticed several additional paths in contrib modules, beyond [1],
> > that are potentially suitable for streamification:
> >
> > 1) pgstattuple — pgstatapprox.c and parts of pgstattuple_approx_internal
> > 2) Bloom — scan paths in blgetbitmap() and maintenance paths in blbulkdelete()
> >
> > The following patches streamify those code paths. No benchmarks have
> > been run yet.
> >
> > [1] https://www.postgresql.org/message-id/flat/CABPTF7UeN2o-trr9r7K76rZExnO2M4SLfvTfbUY2CwQjCekgnQ%40mail.gmail.com
> >
> > Feedbacks welcome.
> >
>
> One more in ginvacuumcleanup().

0001, 0002 and 0004 LGTM.

0003:

+ buf = read_stream_next_buffer(stream, NULL);
+ if (buf == InvalidBuffer)
+ break;

I think we are loosening the check here. We were sure that there were
no InvalidBuffers until nblocks. The streamified version does not have
this check; it exits from the loop the first time it sees an
InvalidBuffer, which may be wrong. You might want to add
'Assert(p.current_blocknum == nblocks);' before read_stream_end() to
have a similar check.

--
Regards,
Nazir Bilal Yavuz
Microsoft
Hi Bilal,

Thanks for your review!

On Fri, Dec 26, 2025 at 6:59 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> Hi,
>
> Thank you for working on this!
>
> On Thu, 25 Dec 2025 at 09:34, Xuneng Zhou <xunengzhou@gmail.com> wrote:
> >
> > Hi,
> >
> > On Thu, Dec 25, 2025 at 1:51 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > >
> > > Hi Hackers,
> > >
> > > I noticed several additional paths in contrib modules, beyond [1],
> > > that are potentially suitable for streamification:
> > >
> > > 1) pgstattuple — pgstatapprox.c and parts of pgstattuple_approx_internal
> > > 2) Bloom — scan paths in blgetbitmap() and maintenance paths in blbulkdelete()
> > >
> > > The following patches streamify those code paths. No benchmarks have
> > > been run yet.
> > >
> > > [1] https://www.postgresql.org/message-id/flat/CABPTF7UeN2o-trr9r7K76rZExnO2M4SLfvTfbUY2CwQjCekgnQ%40mail.gmail.com
> > >
> > > Feedbacks welcome.
> > >
> >
> > One more in ginvacuumcleanup().
>
> 0001, 0002 and 0004 LGTM.
>
> 0003:
>
> + buf = read_stream_next_buffer(stream, NULL);
> + if (buf == InvalidBuffer)
> + break;
>
> I think we are loosening the check here. We were sure that there were
> no InvalidBuffers until the nblocks. Streamified version does not have
> this check, it exits from the loop the first time it sees an
> InvalidBuffer, which may be wrong. You might want to add
> 'Assert(p.current_blocknum == nblocks);' before read_stream_end() to
> have a similar check.
>

Agree. The check has been added in v2 per your suggestion.

--
Best,
Xuneng
Attachments
Hi,

On Sat, Dec 27, 2025 at 12:41 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> Hi Bilal,
>
> Thanks for your review!
>
> On Fri, Dec 26, 2025 at 6:59 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> >
> > Hi,
> >
> > Thank you for working on this!
> >
> > On Thu, 25 Dec 2025 at 09:34, Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > On Thu, Dec 25, 2025 at 1:51 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > > >
> > > > Hi Hackers,
> > > >
> > > > I noticed several additional paths in contrib modules, beyond [1],
> > > > that are potentially suitable for streamification:
> > > >
> > > > 1) pgstattuple — pgstatapprox.c and parts of pgstattuple_approx_internal
> > > > 2) Bloom — scan paths in blgetbitmap() and maintenance paths in blbulkdelete()
> > > >
> > > > The following patches streamify those code paths. No benchmarks have
> > > > been run yet.
> > > >
> > > > [1] https://www.postgresql.org/message-id/flat/CABPTF7UeN2o-trr9r7K76rZExnO2M4SLfvTfbUY2CwQjCekgnQ%40mail.gmail.com
> > > >
> > > > Feedbacks welcome.
> > > >
> > >
> > > One more in ginvacuumcleanup().
> >
> > 0001, 0002 and 0004 LGTM.
> >
> > 0003:
> >
> > + buf = read_stream_next_buffer(stream, NULL);
> > + if (buf == InvalidBuffer)
> > + break;
> >
> > I think we are loosening the check here. We were sure that there were
> > no InvalidBuffers until the nblocks. Streamified version does not have
> > this check, it exits from the loop the first time it sees an
> > InvalidBuffer, which may be wrong. You might want to add
> > 'Assert(p.current_blocknum == nblocks);' before read_stream_end() to
> > have a similar check.
> >
>
> Agree. The check has been added in v2 per your suggestion.
>

Two more to go:
patch 5: Streamify log_newpage_range() WAL logging path
patch 6: Streamify hash index VACUUM primary bucket page reads

Benchmarks will be conducted soon.

--
Best,
Xuneng
Attachments
- v2-0002-Streamify-Bloom-VACUUM-paths-Use-streaming-re.patch
- v2-0004-Replace-synchronous-ReadBufferExtended-loop-with.patch
- v2-0001-Switch-Bloom-scan-paths-to-streaming-read.patch
- v2-0003-Streamify-heap-bloat-estimation-scan-Introduc.patch
- v2-0005-Streamify-log_newpage_range-WAL-logging-path.patch
- v2-0006-Streamify-hash-index-VACUUM-primary-bucket-page-r.patch
Hi,

On Sun, Dec 28, 2025 at 7:41 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> Hi,
>
> On Sat, Dec 27, 2025 at 12:41 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> >
> > Hi Bilal,
> >
> > Thanks for your review!
> >
> > On Fri, Dec 26, 2025 at 6:59 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > Thank you for working on this!
> > >
> > > On Thu, 25 Dec 2025 at 09:34, Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > On Thu, Dec 25, 2025 at 1:51 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > > > >
> > > > > Hi Hackers,
> > > > >
> > > > > I noticed several additional paths in contrib modules, beyond [1],
> > > > > that are potentially suitable for streamification:
> > > > >
> > > > > 1) pgstattuple — pgstatapprox.c and parts of pgstattuple_approx_internal
> > > > > 2) Bloom — scan paths in blgetbitmap() and maintenance paths in blbulkdelete()
> > > > >
> > > > > The following patches streamify those code paths. No benchmarks have
> > > > > been run yet.
> > > > >
> > > > > [1] https://www.postgresql.org/message-id/flat/CABPTF7UeN2o-trr9r7K76rZExnO2M4SLfvTfbUY2CwQjCekgnQ%40mail.gmail.com
> > > > >
> > > > > Feedbacks welcome.
> > > > >
> > > >
> > > > One more in ginvacuumcleanup().
> > >
> > > 0001, 0002 and 0004 LGTM.
> > >
> > > 0003:
> > >
> > > + buf = read_stream_next_buffer(stream, NULL);
> > > + if (buf == InvalidBuffer)
> > > + break;
> > >
> > > I think we are loosening the check here. We were sure that there were
> > > no InvalidBuffers until the nblocks. Streamified version does not have
> > > this check, it exits from the loop the first time it sees an
> > > InvalidBuffer, which may be wrong. You might want to add
> > > 'Assert(p.current_blocknum == nblocks);' before read_stream_end() to
> > > have a similar check.
> > >
> >
> > Agree. The check has been added in v2 per your suggestion.
> >
>
> Two more to go:
> patch 5: Streamify log_newpage_range() WAL logging path
> patch 6: Streamify hash index VACUUM primary bucket page reads
>
> Benchmarks will be conducted soon.
>

v6 in the last message has a problem and has not been updated.
Attaching the right one again. Sorry for the noise.

--
Best,
Xuneng
Attachments
- v2-0002-Streamify-Bloom-VACUUM-paths-Use-streaming-re.patch
- v2-0001-Switch-Bloom-scan-paths-to-streaming-read.patch
- v2-0004-Replace-synchronous-ReadBufferExtended-loop-with.patch
- v2-0003-Streamify-heap-bloat-estimation-scan-Introduc.patch
- v2-0005-Streamify-log_newpage_range-WAL-logging-path.patch
- v2-0006-Streamify-hash-index-VACUUM-primary-bucket-page-r.patch
Hi,
On Sun, 28 Dec 2025 at 14:46, Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> Hi,
> >
> > Two more to go:
> > patch 5: Streamify log_newpage_range() WAL logging path
> > patch 6: Streamify hash index VACUUM primary bucket page reads
> >
> > Benchmarks will be conducted soon.
> >
>
> v6 in the last message has a problem and has not been updated. Attach
> the right one again. Sorry for the noise.
0003 and 0006:
You need to add 'StatApproxReadStreamPrivate' and
'HashBulkDeleteStreamPrivate' to the typedefs.list.
0005:
@@ -1321,8 +1341,10 @@ log_newpage_range(Relation rel, ForkNumber forknum,
nbufs = 0;
while (nbufs < XLR_MAX_BLOCK_ID && blkno < endblk)
{
- Buffer buf = ReadBufferExtended(rel, forknum, blkno,
- RBM_NORMAL, NULL);
+ Buffer buf = read_stream_next_buffer(stream, NULL);
+
+ if (!BufferIsValid(buf))
+ break;
We are loosening a check here; there should not be an invalid buffer in
the stream until endblk. I think you can remove this
BufferIsValid() check, so that we can learn if something goes wrong.
0006:
You can use read_stream_reset() instead of read_stream_end(); then you
can reuse the same stream with different variables. I believe this is
the preferred way.
Rest LGTM!
--
Regards,
Nazir Bilal Yavuz
Microsoft
Hi,
Thanks for looking into this.
On Mon, Dec 29, 2025 at 6:58 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> Hi,
>
> On Sun, 28 Dec 2025 at 14:46, Xuneng Zhou <xunengzhou@gmail.com> wrote:
> >
> > Hi,
> > >
> > > Two more to go:
> > > patch 5: Streamify log_newpage_range() WAL logging path
> > > patch 6: Streamify hash index VACUUM primary bucket page reads
> > >
> > > Benchmarks will be conducted soon.
> > >
> >
> > v6 in the last message has a problem and has not been updated. Attach
> > the right one again. Sorry for the noise.
>
> 0003 and 0006:
>
> You need to add 'StatApproxReadStreamPrivate' and
> 'HashBulkDeleteStreamPrivate' to the typedefs.list.
Done.
> 0005:
>
> @@ -1321,8 +1341,10 @@ log_newpage_range(Relation rel, ForkNumber forknum,
> nbufs = 0;
> while (nbufs < XLR_MAX_BLOCK_ID && blkno < endblk)
> {
> - Buffer buf = ReadBufferExtended(rel, forknum, blkno,
> - RBM_NORMAL, NULL);
> + Buffer buf = read_stream_next_buffer(stream, NULL);
> +
> + if (!BufferIsValid(buf))
> + break;
>
> We are loosening a check here, there should not be a invalid buffer in
> the stream until the endblk. I think you can remove this
> BufferIsValid() check, then we can learn if something goes wrong.
My earlier concern about not adding an assert at the end of the stream
was the potential early break here:
/* Nothing more to do if all remaining blocks were empty. */
if (nbufs == 0)
break;
After looking more closely, that turned out to be a misunderstanding of the logic on my part.
> 0006:
>
> You can use read_stream_reset() instead of read_stream_end(), then you
> can use the same stream with different variables, I believe this is
> the preferred way.
>
> Rest LGTM!
>
Yeah, reset seems the more appropriate way here.
--
Best,
Xuneng
Attachments
- v3-0003-Streamify-heap-bloat-estimation-scan.-Introduce-a.patch
- v3-0005-Streamify-log_newpage_range-WAL-logging-path.patch
- v3-0002-Streamify-Bloom-VACUUM-paths.-n-nUse-streaming-re.patch
- v3-0001-Switch-Bloom-scan-paths-to-streaming-read.-n-nRep.patch
- v3-0004-Replace-synchronous-ReadBufferExtended-loop-with-.patch
- v3-0006-Streamify-hash-index-VACUUM-primary-bucket-page-r.patch
Hi,
On Tue, Dec 30, 2025 at 9:51 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> Hi,
>
> Thanks for looking into this.
>
> On Mon, Dec 29, 2025 at 6:58 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> >
> > Hi,
> >
> > On Sun, 28 Dec 2025 at 14:46, Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > >
> > > Hi,
> > > >
> > > > Two more to go:
> > > > patch 5: Streamify log_newpage_range() WAL logging path
> > > > patch 6: Streamify hash index VACUUM primary bucket page reads
> > > >
> > > > Benchmarks will be conducted soon.
> > > >
> > >
> > > v6 in the last message has a problem and has not been updated. Attach
> > > the right one again. Sorry for the noise.
> >
> > 0003 and 0006:
> >
> > You need to add 'StatApproxReadStreamPrivate' and
> > 'HashBulkDeleteStreamPrivate' to the typedefs.list.
>
> Done.
>
> > 0005:
> >
> > @@ -1321,8 +1341,10 @@ log_newpage_range(Relation rel, ForkNumber forknum,
> > nbufs = 0;
> > while (nbufs < XLR_MAX_BLOCK_ID && blkno < endblk)
> > {
> > - Buffer buf = ReadBufferExtended(rel, forknum, blkno,
> > - RBM_NORMAL, NULL);
> > + Buffer buf = read_stream_next_buffer(stream, NULL);
> > +
> > + if (!BufferIsValid(buf))
> > + break;
> >
> > We are loosening a check here, there should not be a invalid buffer in
> > the stream until the endblk. I think you can remove this
> > BufferIsValid() check, then we can learn if something goes wrong.
>
> My concern before for not adding assert at the end of streaming is the
> potential early break in here:
>
> /* Nothing more to do if all remaining blocks were empty. */
> if (nbufs == 0)
> break;
>
> After looking more closely, it turns out to be a misunderstanding of the logic.
>
> > 0006:
> >
> > You can use read_stream_reset() instead of read_stream_end(), then you
> > can use the same stream with different variables, I believe this is
> > the preferred way.
> >
> > Rest LGTM!
> >
>
> Yeah, reset seems a more proper way here.
>
Ran pgindent using the updated typedefs.list.
--
Best,
Xuneng
Attachments
- v4-0001-Switch-Bloom-scan-paths-to-streaming-read.-n-nRep.patch
- v4-0005-Streamify-log_newpage_range-WAL-logging-path.patch
- v4-0004-Replace-synchronous-ReadBufferExtended-loop-with-.patch
- v4-0002-Streamify-Bloom-VACUUM-paths.-n-nUse-streaming-re.patch
- v4-0006-Streamify-hash-index-VACUUM-primary-bucket-page-r.patch
- v4-0003-Streamify-heap-bloat-estimation-scan.-Introduce-a.patch
Hi,
On Tue, Dec 30, 2025 at 10:43 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> Hi,
>
> On Tue, Dec 30, 2025 at 9:51 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> >
> > Hi,
> >
> > Thanks for looking into this.
> >
> > On Mon, Dec 29, 2025 at 6:58 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > On Sun, 28 Dec 2025 at 14:46, Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > > >
> > > > Hi,
> > > > >
> > > > > Two more to go:
> > > > > patch 5: Streamify log_newpage_range() WAL logging path
> > > > > patch 6: Streamify hash index VACUUM primary bucket page reads
> > > > >
> > > > > Benchmarks will be conducted soon.
> > > > >
> > > >
> > > > v6 in the last message has a problem and has not been updated. Attach
> > > > the right one again. Sorry for the noise.
> > >
> > > 0003 and 0006:
> > >
> > > You need to add 'StatApproxReadStreamPrivate' and
> > > 'HashBulkDeleteStreamPrivate' to the typedefs.list.
> >
> > Done.
> >
> > > 0005:
> > >
> > > @@ -1321,8 +1341,10 @@ log_newpage_range(Relation rel, ForkNumber forknum,
> > > nbufs = 0;
> > > while (nbufs < XLR_MAX_BLOCK_ID && blkno < endblk)
> > > {
> > > - Buffer buf = ReadBufferExtended(rel, forknum, blkno,
> > > - RBM_NORMAL, NULL);
> > > + Buffer buf = read_stream_next_buffer(stream, NULL);
> > > +
> > > + if (!BufferIsValid(buf))
> > > + break;
> > >
> > > We are loosening a check here, there should not be a invalid buffer in
> > > the stream until the endblk. I think you can remove this
> > > BufferIsValid() check, then we can learn if something goes wrong.
> >
> > My concern before for not adding assert at the end of streaming is the
> > potential early break in here:
> >
> > /* Nothing more to do if all remaining blocks were empty. */
> > if (nbufs == 0)
> > break;
> >
> > After looking more closely, it turns out to be a misunderstanding of the logic.
> >
> > > 0006:
> > >
> > > You can use read_stream_reset() instead of read_stream_end(), then you
> > > can use the same stream with different variables, I believe this is
> > > the preferred way.
> > >
> > > Rest LGTM!
> > >
> >
> > Yeah, reset seems a more proper way here.
> >
>
> Run pgindent using the updated typedefs.list.
>
I've completed benchmarking of the v4 streaming read patches across
three I/O methods (io_uring, sync, worker). Tests were run with cold
cache on large datasets.
--- Settings ---
shared_buffers = '8GB'
effective_io_concurrency = 200
io_method = $IO_METHOD
io_workers = $IO_WORKERS
io_max_concurrency = $IO_MAX_CONCURRENCY
track_io_timing = on
autovacuum = off
checkpoint_timeout = 1h
max_wal_size = 10GB
max_parallel_workers_per_gather = 0
--- Machine ---
CPU: 48-core
RAM: 256 GB DDR5
Disk: 2 x 1.92 TB NVMe SSD
--- Executive Summary ---
The patches provide significant benefits for I/O-bound sequential
operations, with the greatest improvements seen when using
asynchronous I/O methods (io_uring and worker). The synchronous I/O
mode shows reduced but still meaningful gains.
--- Results by I/O Method ---
Best Results: io_method=worker
bloom_scan: 4.14x (75.9% faster); 93% fewer reads
pgstattuple: 1.59x (37.1% faster); 94% fewer reads
hash_vacuum: 1.05x (4.4% faster); 80% fewer reads
gin_vacuum: 1.06x (5.6% faster); 15% fewer reads
bloom_vacuum: 1.04x (3.9% faster); 76% fewer reads
wal_logging: 0.98x (-2.5%, neutral/slightly slower); no change in reads
io_method=io_uring
bloom_scan: 3.12x (68.0% faster); 93% fewer reads
pgstattuple: 1.50x (33.2% faster); 94% fewer reads
hash_vacuum: 1.03x (3.3% faster); 80% fewer reads
gin_vacuum: 1.02x (2.1% faster); 15% fewer reads
bloom_vacuum: 1.03x (3.4% faster); 76% fewer reads
wal_logging: 1.00x (-0.5%, neutral); no change in reads
io_method=sync (baseline comparison)
bloom_scan: 1.20x (16.4% faster); 93% fewer reads
pgstattuple: 1.10x (9.0% faster); 94% fewer reads
hash_vacuum: 1.01x (0.8% faster); 80% fewer reads
gin_vacuum: 1.02x (1.7% faster); 15% fewer reads
bloom_vacuum: 1.03x (2.8% faster); 76% fewer reads
wal_logging: 0.99x (-0.7%, neutral); no change in reads
--- Observations ---
Async I/O amplifies streaming benefits: The same patches show 3-4x
improvement with worker/io_uring vs 1.2x with sync.
I/O operation reduction is consistent: All modes show the same ~93-94%
reduction in I/O operations for bloom_scan and pgstattuple.
VACUUM operations show modest gains: Despite large I/O reductions
(76-80%), wall-clock improvements are smaller (3-15%) since VACUUM has
larger CPU overhead (tuple processing, index maintenance, WAL
logging).
log_newpage_range shows no benefit: The patch provides no improvement (~0.97x).
--
Best,
Xuneng
Attachments
Hi,
On Thu, Feb 5, 2026 at 12:01 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> Hi,
>
> On Tue, Dec 30, 2025 at 10:43 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> >
> > Hi,
> >
> > On Tue, Dec 30, 2025 at 9:51 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > Thanks for looking into this.
> > >
> > > On Mon, Dec 29, 2025 at 6:58 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > On Sun, 28 Dec 2025 at 14:46, Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > > > >
> > > > > Hi,
> > > > > >
> > > > > > Two more to go:
> > > > > > patch 5: Streamify log_newpage_range() WAL logging path
> > > > > > patch 6: Streamify hash index VACUUM primary bucket page reads
> > > > > >
> > > > > > Benchmarks will be conducted soon.
> > > > > >
> > > > >
> > > > > v6 in the last message has a problem and has not been updated. Attach
> > > > > the right one again. Sorry for the noise.
> > > >
> > > > 0003 and 0006:
> > > >
> > > > You need to add 'StatApproxReadStreamPrivate' and
> > > > 'HashBulkDeleteStreamPrivate' to the typedefs.list.
> > >
> > > Done.
> > >
> > > > 0005:
> > > >
> > > > @@ -1321,8 +1341,10 @@ log_newpage_range(Relation rel, ForkNumber forknum,
> > > > nbufs = 0;
> > > > while (nbufs < XLR_MAX_BLOCK_ID && blkno < endblk)
> > > > {
> > > > - Buffer buf = ReadBufferExtended(rel, forknum, blkno,
> > > > - RBM_NORMAL, NULL);
> > > > + Buffer buf = read_stream_next_buffer(stream, NULL);
> > > > +
> > > > + if (!BufferIsValid(buf))
> > > > + break;
> > > >
> > > > We are loosening a check here, there should not be a invalid buffer in
> > > > the stream until the endblk. I think you can remove this
> > > > BufferIsValid() check, then we can learn if something goes wrong.
> > >
> > > My concern before for not adding assert at the end of streaming is the
> > > potential early break in here:
> > >
> > > /* Nothing more to do if all remaining blocks were empty. */
> > > if (nbufs == 0)
> > > break;
> > >
> > > After looking more closely, it turns out to be a misunderstanding of the logic.
> > >
> > > > 0006:
> > > >
> > > > You can use read_stream_reset() instead of read_stream_end(), then you
> > > > can use the same stream with different variables, I believe this is
> > > > the preferred way.
> > > >
> > > > Rest LGTM!
> > > >
> > >
> > > Yeah, reset seems a more proper way here.
> > >
> >
> > Run pgindent using the updated typedefs.list.
> >
>
> I've completed benchmarking of the v4 streaming read patches across
> three I/O methods (io_uring, sync, worker). Tests were run with cold
> cache on large datasets.
>
> --- Settings ---
>
> shared_buffers = '8GB'
> effective_io_concurrency = 200
> io_method = $IO_METHOD
> io_workers = $IO_WORKERS
> io_max_concurrency = $IO_MAX_CONCURRENCY
> track_io_timing = on
> autovacuum = off
> checkpoint_timeout = 1h
> max_wal_size = 10GB
> max_parallel_workers_per_gather = 0
>
> --- Machine ---
> CPU: 48-core
> RAM: 256 GB DDR5
> Disk: 2 x 1.92 TB NVMe SSD
>
> --- Executive Summary ---
>
> The patches provide significant benefits for I/O-bound sequential
> operations, with the greatest improvements seen when using
> asynchronous I/O methods (io_uring and worker). The synchronous I/O
> mode shows reduced but still meaningful gains.
>
> --- Results by I/O Method
>
> Best Results: io_method=worker
>
> bloom_scan: 4.14x (75.9% faster); 93% fewer reads
> pgstattuple: 1.59x (37.1% faster); 94% fewer reads
> hash_vacuum: 1.05x (4.4% faster); 80% fewer reads
> gin_vacuum: 1.06x (5.6% faster); 15% fewer reads
> bloom_vacuum: 1.04x (3.9% faster); 76% fewer reads
> wal_logging: 0.98x (-2.5%, neutral/slightly slower); no change in reads
>
> io_method=io_uring
>
> bloom_scan: 3.12x (68.0% faster); 93% fewer reads
> pgstattuple: 1.50x (33.2% faster); 94% fewer reads
> hash_vacuum: 1.03x (3.3% faster); 80% fewer reads
> gin_vacuum: 1.02x (2.1% faster); 15% fewer reads
> bloom_vacuum: 1.03x (3.4% faster); 76% fewer reads
> wal_logging: 1.00x (-0.5%, neutral); no change in reads
>
> io_method=sync (baseline comparison)
>
> bloom_scan: 1.20x (16.4% faster); 93% fewer reads
> pgstattuple: 1.10x (9.0% faster); 94% fewer reads
> hash_vacuum: 1.01x (0.8% faster); 80% fewer reads
> gin_vacuum: 1.02x (1.7% faster); 15% fewer reads
> bloom_vacuum: 1.03x (2.8% faster); 76% fewer reads
> wal_logging: 0.99x (-0.7%, neutral); no change in reads
>
> --- Observations ---
>
> Async I/O amplifies streaming benefits: The same patches show 3-4x
> improvement with worker/io_uring vs 1.2x with sync.
>
> I/O operation reduction is consistent: All modes show the same ~93-94%
> reduction in I/O operations for bloom_scan and pgstattuple.
>
> VACUUM operations show modest gains: Despite large I/O reductions
> (76-80%), wall-clock improvements are smaller (3-15%) since VACUUM has
> larger CPU overhead (tuple processing, index maintenance, WAL
> logging).
>
> log_newpage_range shows no benefit: The patch provides no improvement (~0.97x).
>
> --
> Best,
> Xuneng
There was an issue in the wal_log test of the original script.
--- The original benchmark used:
ALTER TABLE ... SET LOGGED
This path performs a full table rewrite via ATRewriteTable()
(tablecmds.c). It creates a new relfilenode and copies tuples into it.
It does not call log_newpage_range() on rewritten pages.
log_newpage_range() may only appear indirectly through the
pending-sync logic in storage.c, and only when:
wal_level = minimal, and
relation size < wal_skip_threshold (default 2MB).
Our test tables (1M–20M rows) are far larger than 2MB. In that case,
PostgreSQL fsyncs the file instead of WAL-logging it. Therefore, the
previous benchmark measured table rewrite I/O, not the
log_newpage_range() path.
--- Current design: GIN index build
The benchmark now uses:
CREATE INDEX ... USING gin (doc_tsv)
This reliably exercises log_newpage_range() because:
- ginbuild() constructs the index and WAL-logs all new index pages
using log_newpage_range().
- This is part of the normal GIN build path, independent of wal_skip_threshold.
- The streaming-read patch modifies the WAL logging path inside
log_newpage_range(), which this test directly targets.
--- Results (wal_logging_large)
worker: 1.00x (+0.5%); no meaningful change in reads
io_uring: 1.01x (+1.3%); no meaningful change in reads
sync: 1.01x (+1.1%); no meaningful change in reads
--
Best,
Xuneng
Attachments
Hi,
On Mon, Feb 9, 2026 at 6:40 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> Hi,
>
> On Thu, Feb 5, 2026 at 12:01 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> >
> > Hi,
> >
> > On Tue, Dec 30, 2025 at 10:43 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > On Tue, Dec 30, 2025 at 9:51 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > Thanks for looking into this.
> > > >
> > > > On Mon, Dec 29, 2025 at 6:58 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > On Sun, 28 Dec 2025 at 14:46, Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > > > > >
> > > > > > Hi,
> > > > > > >
> > > > > > > Two more to go:
> > > > > > > patch 5: Streamify log_newpage_range() WAL logging path
> > > > > > > patch 6: Streamify hash index VACUUM primary bucket page reads
> > > > > > >
> > > > > > > Benchmarks will be conducted soon.
> > > > > > >
> > > > > >
> > > > > > v6 in the last message has a problem and has not been updated. Attach
> > > > > > the right one again. Sorry for the noise.
> > > > >
> > > > > 0003 and 0006:
> > > > >
> > > > > You need to add 'StatApproxReadStreamPrivate' and
> > > > > 'HashBulkDeleteStreamPrivate' to the typedefs.list.
> > > >
> > > > Done.
> > > >
> > > > > 0005:
> > > > >
> > > > > @@ -1321,8 +1341,10 @@ log_newpage_range(Relation rel, ForkNumber forknum,
> > > > > nbufs = 0;
> > > > > while (nbufs < XLR_MAX_BLOCK_ID && blkno < endblk)
> > > > > {
> > > > > - Buffer buf = ReadBufferExtended(rel, forknum, blkno,
> > > > > - RBM_NORMAL, NULL);
> > > > > + Buffer buf = read_stream_next_buffer(stream, NULL);
> > > > > +
> > > > > + if (!BufferIsValid(buf))
> > > > > + break;
> > > > >
> > > > > We are loosening a check here, there should not be a invalid buffer in
> > > > > the stream until the endblk. I think you can remove this
> > > > > BufferIsValid() check, then we can learn if something goes wrong.
> > > >
> > > > My concern before for not adding assert at the end of streaming is the
> > > > potential early break in here:
> > > >
> > > > /* Nothing more to do if all remaining blocks were empty. */
> > > > if (nbufs == 0)
> > > > break;
> > > >
> > > > After looking more closely, it turns out to be a misunderstanding of the logic.
> > > >
> > > > > 0006:
> > > > >
> > > > > You can use read_stream_reset() instead of read_stream_end(), then you
> > > > > can use the same stream with different variables, I believe this is
> > > > > the preferred way.
> > > > >
> > > > > Rest LGTM!
> > > > >
> > > >
> > > > Yeah, reset seems a more proper way here.
> > > >
> > >
> > > Run pgindent using the updated typedefs.list.
> > >
> >
> > I've completed benchmarking of the v4 streaming read patches across
> > three I/O methods (io_uring, sync, worker). Tests were run with cold
> > cache on large datasets.
> >
> > --- Settings ---
> >
> > shared_buffers = '8GB'
> > effective_io_concurrency = 200
> > io_method = $IO_METHOD
> > io_workers = $IO_WORKERS
> > io_max_concurrency = $IO_MAX_CONCURRENCY
> > track_io_timing = on
> > autovacuum = off
> > checkpoint_timeout = 1h
> > max_wal_size = 10GB
> > max_parallel_workers_per_gather = 0
> >
> > --- Machine ---
> > CPU: 48-core
> > RAM: 256 GB DDR5
> > Disk: 2 x 1.92 TB NVMe SSD
> >
> > --- Executive Summary ---
> >
> > The patches provide significant benefits for I/O-bound sequential
> > operations, with the greatest improvements seen when using
> > asynchronous I/O methods (io_uring and worker). The synchronous I/O
> > mode shows reduced but still meaningful gains.
> >
> > --- Results by I/O Method
> >
> > Best Results: io_method=worker
> >
> > bloom_scan: 4.14x (75.9% faster); 93% fewer reads
> > pgstattuple: 1.59x (37.1% faster); 94% fewer reads
> > hash_vacuum: 1.05x (4.4% faster); 80% fewer reads
> > gin_vacuum: 1.06x (5.6% faster); 15% fewer reads
> > bloom_vacuum: 1.04x (3.9% faster); 76% fewer reads
> > wal_logging: 0.98x (-2.5%, neutral/slightly slower); no change in reads
> >
> > io_method=io_uring
> >
> > bloom_scan: 3.12x (68.0% faster); 93% fewer reads
> > pgstattuple: 1.50x (33.2% faster); 94% fewer reads
> > hash_vacuum: 1.03x (3.3% faster); 80% fewer reads
> > gin_vacuum: 1.02x (2.1% faster); 15% fewer reads
> > bloom_vacuum: 1.03x (3.4% faster); 76% fewer reads
> > wal_logging: 1.00x (-0.5%, neutral); no change in reads
> >
> > io_method=sync (baseline comparison)
> >
> > bloom_scan: 1.20x (16.4% faster); 93% fewer reads
> > pgstattuple: 1.10x (9.0% faster); 94% fewer reads
> > hash_vacuum: 1.01x (0.8% faster); 80% fewer reads
> > gin_vacuum: 1.02x (1.7% faster); 15% fewer reads
> > bloom_vacuum: 1.03x (2.8% faster); 76% fewer reads
> > wal_logging: 0.99x (-0.7%, neutral); no change in reads
> >
> > --- Observations ---
> >
> > Async I/O amplifies streaming benefits: The same patches show 3-4x
> > improvement with worker/io_uring vs 1.2x with sync.
> >
> > I/O operation reduction is consistent: All modes show the same ~93-94%
> > reduction in I/O operations for bloom_scan and pgstattuple.
> >
> > VACUUM operations show modest gains: Despite large I/O reductions
> > (76-80%), wall-clock improvements are smaller (3-15%) since VACUUM has
> > larger CPU overhead (tuple processing, index maintenance, WAL
> > logging).
> >
> > log_newpage_range shows no benefit: The patch provides no improvement (~0.97x).
> >
> > --
> > Best,
> > Xuneng
>
> There was an issue in the wal_log test of the original script.
>
> --- The original benchmark used:
> ALTER TABLE ... SET LOGGED
>
> This path performs a full table rewrite via ATRewriteTable()
> (tablecmds.c). It creates a new relfilenode and copies tuples into it.
> It does not call log_newpage_range() on rewritten pages.
>
> log_newpage_range() may only appear indirectly through the
> pending-sync logic in storage.c, and only when:
>
> wal_level = minimal, and
> relation size < wal_skip_threshold (default 2MB).
>
> Our test tables (1M–20M rows) are far larger than 2MB. In that case,
> PostgreSQL fsyncs the file instead of WAL-logging it. Therefore, the
> previous benchmark measured table rewrite I/O, not the
> log_newpage_range() path.
>
> --- Current design: GIN index build
>
> The benchmark now uses:
> CREATE INDEX ... USING gin (doc_tsv)
>
> This reliably exercises log_newpage_range() because:
> - ginbuild() constructs the index and WAL-logs all new index pages
> using log_newpage_range().
> - This is part of the normal GIN build path, independent of wal_skip_threshold.
> - The streaming-read patch modifies the WAL logging path inside
> log_newpage_range(), which this test directly targets.
>
> --- Results (wal_logging_large)
> worker: 1.00x (+0.5%); no meaningful change in reads
> io_uring: 1.01x (+1.3%); no meaningful change in reads
> sync: 1.01x (+1.1%); no meaningful change in reads
>
> --
> Best,
> Xuneng
Here’s v5 of the patchset. The wal_logging_large patch has been
removed, as no performance gains were observed in the benchmark runs.
--
Best,
Xuneng
Attachments
On Tue, Mar 10, 2026 at 02:06:12PM +0800, Xuneng Zhou wrote:
> Here’s v5 of the patchset. The wal_logging_large patch has been
> removed, as no performance gains were observed in the benchmark runs.

Looking at the numbers you are posting, it is harder to get excited
about the hash, gin, bloom_vacuum and wal_logging. The worker method
seems more efficient, may show that we are out of noise level.
The results associated to pgstattuple and the bloom scans are on a
different level for the three methods.

Saying that, it is really nice that you have sent the benchmark. The
measurement method looks in line with the goal here after review (IO
stats, calculations), and I have taken some time to run it to get an
idea of the difference for these five code paths, as of (slightly
edited the script for my own environment, result is the same):
./run_streaming_benchmark --baseline --io-method=io_uring/worker

I am not much interested in the sync case, so I have tested the two
other methods:

1) method=io_uring
bloom_scan_large    base=   725.3ms  patch=    99.9ms  7.26x ( 86.2%)  (reads=19676->1294, io_time=688.36->33.69ms)
bloom_vacuum_large  base=  7414.9ms  patch=  7455.2ms  0.99x ( -0.5%)  (reads=48361->11597, io_time=459.02->257.51ms)
pgstattuple_large   base= 12642.9ms  patch= 11873.5ms  1.06x (  6.1%)  (reads=206945->12983, io_time=6516.70->143.46ms)
gin_vacuum_large    base=  3546.8ms  patch=  2317.9ms  1.53x ( 34.6%)  (reads=20734->17735, io_time=3244.40->2021.53ms)
hash_vacuum_large   base= 12268.5ms  patch= 11751.1ms  1.04x (  4.2%)  (reads=76677->15606, io_time=1483.10->315.03ms)
wal_logging_large   base= 33713.0ms  patch= 32773.9ms  1.03x (  2.8%)  (reads=21641->21641, io_time=81.18->77.25ms)

2) method=worker io-workers=3
bloom_scan_large    base=   725.0ms  patch=   465.7ms  1.56x ( 35.8%)  (reads=19676->1294, io_time=688.70->52.20ms)
bloom_vacuum_large  base=  7138.3ms  patch=  7156.0ms  1.00x ( -0.2%)  (reads=48361->11597, io_time=284.56->64.37ms)
pgstattuple_large   base= 12429.3ms  patch= 11916.8ms  1.04x (  4.1%)  (reads=206945->12983, io_time=6501.91->32.24ms)
gin_vacuum_large    base=  3769.4ms  patch=  3716.7ms  1.01x (  1.4%)  (reads=20775->17684, io_time=3562.21->3528.14ms)
hash_vacuum_large   base= 11750.1ms  patch= 11289.0ms  1.04x (  3.9%)  (reads=76677->15606, io_time=1296.03->98.72ms)
wal_logging_large   base= 32862.3ms  patch= 33179.7ms  0.99x ( -1.0%)  (reads=21641->21641, io_time=91.42->90.59ms)

The bloom scan case is a winner in runtime for both cases, and in
terms of stats we get much better numbers for all of them. These feel
rather in line with what you have, except for pgstattuple's runtime,
still its IO numbers feel good.

That's just to say that I'll review them and try to do something about
at least some of the pieces for this release.
--
Michael
Attachments
Hi Michael,

On Tue, Mar 10, 2026 at 6:28 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Tue, Mar 10, 2026 at 02:06:12PM +0800, Xuneng Zhou wrote:
> > Here’s v5 of the patchset. The wal_logging_large patch has been
> > removed, as no performance gains were observed in the benchmark runs.
>
> Looking at the numbers you are posting, it is harder to get excited
> about the hash, gin, bloom_vacuum and wal_logging. The worker method
> seems more efficient, may show that we are out of noise level.
> The results associated to pgstattuple and the bloom scans are on a
> different level for the three methods.
>
> [...]
>
> The bloom scan case is a winner in runtime for both cases, and in
> terms of stats we get much better numbers for all of them. These feel
> rather in line with what you have, except for pgstattuple's runtime,
> still its IO numbers feel good.

Thanks for running the benchmarks! The performance gains for hash,
gin, bloom_vacuum, and wal_logging are insignificant, likely because
these workloads are not I/O-bound. The default number of I/O workers
is three, which is fairly conservative. When I ran the benchmark
script with a higher number of I/O workers, some runs showed improved
performance.

> pgstattuple_large base= 12429.3ms patch= 11916.8ms 1.04x ( 4.1%) (reads=206945->12983, io_time=6501.91->32.24ms)
> pgstattuple_large base= 12642.9ms patch= 11873.5ms 1.06x ( 6.1%) (reads=206945->12983, io_time=6516.70->143.46ms)

Yeah, this looks somewhat strange. The io_time has been reduced
significantly, which should also lead to a substantial reduction in
runtime.

method=io_uring
pgstattuple_large base= 5551.5ms patch= 3498.2ms 1.59x ( 37.0%) (reads=206945->12983, io_time=2323.49->207.14ms)

I ran the benchmark for this test again with io_uring, and the result
is consistent with previous runs. I’m not sure what might be
contributing to this behavior.

Another code path that showed significant performance improvement is
pgstatindex [1]. I've incorporated the test into the script too. Here
are the results from my testing:

method=worker io-workers=12
pgstatindex_large base= 233.8ms patch= 54.1ms 4.32x ( 76.8%) (reads=27460->1757, io_time=213.94->6.31ms)

method=io_uring
pgstatindex_large base= 224.2ms patch= 56.4ms 3.98x ( 74.9%) (reads=27460->1757, io_time=204.41->4.88ms)

> That's just to say that I'll review them and try to do something about
> at least some of the pieces for this release.

Thanks for that.

[1] https://www.postgresql.org/message-id/flat/CABPTF7UeN2o-trr9r7K76rZExnO2M4SLfvTfbUY2CwQjCekgnQ%40mail.gmail.com

--
Best,
Xuneng

Attachments
Hi,

On 2026-03-10 19:28:29 +0900, Michael Paquier wrote:
> On Tue, Mar 10, 2026 at 02:06:12PM +0800, Xuneng Zhou wrote:
> > Here’s v5 of the patchset. The wal_logging_large patch has been
> > removed, as no performance gains were observed in the benchmark runs.
>
> Looking at the numbers you are posting, it is harder to get excited
> about the hash, gin, bloom_vacuum and wal_logging.

It's perhaps worth emphasizing that, to allow real world usage of direct IO,
we'll need streaming implementation for most of these. Also, on windows the OS
provided readahead is ... not aggressive, so you'll hit IO stalls much more
frequently than you'd on linux (and some of the BSDs).

It might be a good idea to run the benchmarks with debug_io_direct=data.
That'll make them very slow, since the write side doesn't yet use AIO and thus
will do a lot of synchronous writes, but it should still allow to evaluate the
gains from using read stream.

The other thing that's kinda important to evaluate read streams is to test on
higher latency storage, even without direct IO. Many workloads are not at all
benefiting from AIO when run on a local NVMe SSD with < 10us latency, but are
severely IO bound when run on a cloud storage disk with 0.5ms - 4ms latency.

To be able to test such higher latencies locally, I've found it quite useful
to use dm_delay above a fast disk. See [1].

> The worker method seems more efficient, may show that we are out of noise
> level.

I think that's more likely to show that memory bandwidth, probably due to
checksum computations, is a factor. The memory copy (from the kernel page
cache, with buffered IO) and the checksum computations (when checksums are
enabled) are parallelized by worker, but not by io_uring.

Greetings,

Andres Freund

[1] https://docs.kernel.org/admin-guide/device-mapper/delay.html

Assuming /dev/md0 is mounted to /srv, and a delay of 1ms should be
introduced for it:

umount /srv && dmsetup create delayed --table "0 $(blockdev --getsz /dev/md0) delay /dev/md0 0 1" /dev/md0 && mount /dev/mapper/delayed /srv/

To update the amount of delay to 3ms the following can be used:

dmsetup suspend delayed && dmsetup reload delayed --table "0 $(blockdev --getsz /dev/md0) delay /dev/md0 0 3" /dev/md0 && dmsetup resume delayed

(I will often just update the delay to 0 for comparison runs, as that
doesn't require remounting.)
Hi,

On 2026-03-10 21:23:26 +0800, Xuneng Zhou wrote:
> On Tue, Mar 10, 2026 at 6:28 PM Michael Paquier <michael@paquier.xyz> wrote:
> Thanks for running the benchmarks! The performance gains for hash,
> gin, bloom_vacuum, and wal_logging are insignificant, likely because
> these workloads are not I/O-bound. The default number of I/O workers
> is three, which is fairly conservative. When I ran the benchmark
> script with a higher number of I/O workers, some runs showed improved
> performance.

FWIW, another thing that may be an issue is that you're restarting postgres
all the time, as part of drop_caches(). That means we'll spend time reloading
catalog metadata and initializing shared buffers (the first write to a shared
buffers page is considerably more expensive than later ones, as the backing
memory needs to be initialized first).

I found it useful to use the pg_buffercache extension (specifically
pg_buffercache_evict_relation()) to just drop the relation that is going to be
tested from shared_buffers.

> > pgstattuple_large base= 12429.3ms patch= 11916.8ms 1.04x ( 4.1%) (reads=206945->12983, io_time=6501.91->32.24ms)
> >
> > pgstattuple_large base= 12642.9ms patch= 11873.5ms 1.06x ( 6.1%) (reads=206945->12983, io_time=6516.70->143.46ms)
>
> Yeah, this looks somewhat strange. The io_time has been reduced
> significantly, which should also lead to a substantial reduction in
> runtime.

It's possible that the bottleneck just moved, e.g. to the checksum computation,
if you have data checksums enabled.

It's also worth noting that likely each of the test reps measures
something different, as likely
psql_run "$ROOT" "$PORT" -c "UPDATE heap_test SET data = data || '!' WHERE id % 5 = 0;"
leads to some out-of-page updates.

You're probably better off deleting some of the data in a transaction that is
then rolled back. That will also unset all-visible, but won't otherwise change
the layout, no matter how many test iterations you run.

I'd also guess that you're seeing a relatively small win because you're
updating every page. When reading every page from disk, the OS can do
efficient readahead. If there are only occasional misses, that does not work.

> method=io_uring
> pgstattuple_large base= 5551.5ms patch= 3498.2ms 1.59x ( 37.0%) (reads=206945->12983, io_time=2323.49->207.14ms)
>
> I ran the benchmark for this test again with io_uring, and the result
> is consistent with previous runs. I'm not sure what might be
> contributing to this behavior.

What does a perf profile show? Is the query CPU bound?

> Another code path that showed significant performance improvement is
> pgstatindex [1]. I've incorporated the test into the script too. Here
> are the results from my testing:
>
> method=worker io-workers=12
> pgstatindex_large base= 233.8ms patch= 54.1ms 4.32x ( 76.8%) (reads=27460->1757, io_time=213.94->6.31ms)
>
> method=io_uring
> pgstatindex_large base= 224.2ms patch= 56.4ms 3.98x ( 74.9%) (reads=27460->1757, io_time=204.41->4.88ms)

Nice!

Greetings,

Andres Freund
On Tue, Mar 10, 2026 at 07:04:37PM -0400, Andres Freund wrote:
> It might be a good idea to run the benchmarks with debug_io_direct=data.
> That'll make them very slow, since the write side doesn't yet use AIO and thus
> will do a lot of synchronous writes, but it should still allow to evaluate the
> gains from using read stream.

Ah, thanks for the tip. I'll go try that.

> The other thing that's kinda important to evaluate read streams is to test on
> higher latency storage, even without direct IO. Many workloads are not at all
> benefiting from AIO when run on a local NVMe SSD with < 10us latency, but are
> severely IO bound when run on a cloud storage disk with 0.5ms - 4ms latency.

My previous run was on a cloud instance; I don't have access to an SSD
with this amount of latency locally.

One thing that was standing out is the bloom bitmap case, which was
looking really nice for a large number of rows, so I have applied
this part. The rest is going to need a bit more testing to build more
confidence, as far as I can see.
--
Michael
Attachments
Hi,
On 2026-03-10 19:27:59 -0400, Andres Freund wrote:
> > > pgstattuple_large base= 12429.3ms patch= 11916.8ms 1.04x
> > > ( 4.1%) (reads=206945->12983, io_time=6501.91->32.24ms)
> >
> > > pgstattuple_large base= 12642.9ms patch= 11873.5ms 1.06x
> > > ( 6.1%) (reads=206945->12983, io_time=6516.70->143.46ms)
> >
> > Yeah, this looks somewhat strange. The io_time has been reduced
> > significantly, which should also lead to a substantial reduction in
> > runtime.
>
> It's possible that the bottleneck just moved, e.g to the checksum computation,
> if you have data checksums enabled.
>
> It's also worth noting that likely each of the test reps measures
> something different, as likely
> psql_run "$ROOT" "$PORT" -c "UPDATE heap_test SET data = data || '!' WHERE id % 5 = 0;"
>
> leads to some out-of-page updates.
>
> You're probably better off deleting some of the data in a transaction that is
> then rolled back. That will also unset all-visible, but won't otherwise change
> the layout, no matter how many test iterations you run.
>
>
> I'd also guess that you're seeing a relatively small win because you're
> updating every page. When reading every page from disk, the OS can do
> efficient readahead. If there are only occasional misses, that does not work.
I think that last one is a big part - if I use
BEGIN; DELETE FROM heap_test WHERE id % 500 = 0; ROLLBACK;
(which leaves a lot of dead tuples behind)
I see much bigger wins due to the pgstattuple changes.
                 time buffered    time DIO
w/o read stream  2222.078 ms      2090.239 ms
w/ read stream    299.455 ms       155.124 ms

That's with local storage. io_uring, but numbers with worker are similar.
Greetings,
Andres Freund
Hi Andres,

On Wed, Mar 11, 2026 at 7:04 AM Andres Freund <andres@anarazel.de> wrote:
>
> It's perhaps worth emphasizing that, to allow real world usage of direct IO,
> we'll need streaming implementation for most of these. Also, on windows the OS
> provided readahead is ... not aggressive, so you'll hit IO stalls much more
> frequently than you'd on linux (and some of the BSDs).
>
> It might be a good idea to run the benchmarks with debug_io_direct=data.
> That'll make them very slow, since the write side doesn't yet use AIO and thus
> will do a lot of synchronous writes, but it should still allow to evaluate the
> gains from using read stream.
>
> The other thing that's kinda important to evaluate read streams is to test on
> higher latency storage, even without direct IO. Many workloads are not at all
> benefiting from AIO when run on a local NVMe SSD with < 10us latency, but are
> severely IO bound when run on a cloud storage disk with 0.5ms - 4ms latency.
>
> To be able to test such higher latencies locally, I've found it quite useful
> to use dm_delay above a fast disk. See [1].

Thanks for the tips! I currently don't have access to a machine or
cloud instance with slower SSDs or HDDs that have higher latency. I'll
try running the benchmark with debug_io_direct=data and dm_delay, as
you suggested, to see if the results vary.

> I think that's more likely to show that memory bandwidth, probably due to
> checksum computations, is a factor. The memory copy (from the kernel page
> cache, with buffered IO) and the checksum computations (when checksums are
> enabled) are parallelized by worker, but not by io_uring.
>
> [1] https://docs.kernel.org/admin-guide/device-mapper/delay.html
>
> Assuming /dev/md0 is mounted to /srv, and a delay of 1ms should be
> introduced for it:
>
> umount /srv && dmsetup create delayed --table "0 $(blockdev --getsz /dev/md0) delay /dev/md0 0 1" /dev/md0 && mount /dev/mapper/delayed /srv/
>
> To update the amount of delay to 3ms the following can be used:
> dmsetup suspend delayed && dmsetup reload delayed --table "0 $(blockdev --getsz /dev/md0) delay /dev/md0 0 3" /dev/md0 && dmsetup resume delayed
>
> (I will often just update the delay to 0 for comparison runs, as that
> doesn't require remounting)

--
Best,
Xuneng
Hi,

On Wed, Mar 11, 2026 at 7:28 AM Andres Freund <andres@anarazel.de> wrote:
>
> FWIW, another thing that may be an issue is that you're restarting postgres
> all the time, as part of drop_caches(). That means we'll spend time reloading
> catalog metadata and initializing shared buffers (the first write to a shared
> buffers page is considerably more expensive than later ones, as the backing
> memory needs to be initialized first).
>
> I found it useful to use the pg_buffercache extension (specifically
> pg_buffercache_evict_relation()) to just drop the relation that is going to be
> tested from shared_buffers.

Good point. I'll switch to using pg_buffercache_evict_relation() to
evict only the target relation, keeping the cluster running. That
should reduce measurement noise to some extent.

> It's possible that the bottleneck just moved, e.g. to the checksum computation,
> if you have data checksums enabled.
>
> It's also worth noting that likely each of the test reps measures
> something different, as likely
> psql_run "$ROOT" "$PORT" -c "UPDATE heap_test SET data = data || '!' WHERE id % 5 = 0;"
> leads to some out-of-page updates.
>
> You're probably better off deleting some of the data in a transaction that is
> then rolled back. That will also unset all-visible, but won't otherwise change
> the layout, no matter how many test iterations you run.
>
> I'd also guess that you're seeing a relatively small win because you're
> updating every page. When reading every page from disk, the OS can do
> efficient readahead. If there are only occasional misses, that does not work.

Yeah, the repeated UPDATE changes the table layout across reps. I'll switch to:

BEGIN; DELETE FROM heap_test WHERE id % N = 0; ROLLBACK;

This clears the visibility-map bits without altering the physical
layout, so every rep measures the same table state.

> What does a perf profile show? Is the query CPU bound?

The runtime in my run of pgstattuple was reduced significantly due to
the reduction in I/O time. I don't think running perf on my setup
would reveal anything particularly meaningful. The script has an
option to run with perf, so perhaps Michael could try it to see
whether the query becomes CPU-bound, if he's interested and has time.

> > Another code path that showed significant performance improvement is
> > pgstatindex [1]. I've incorporated the test into the script too. Here
> > are the results from my testing:
> >
> > method=worker io-workers=12
> > pgstatindex_large base= 233.8ms patch= 54.1ms 4.32x ( 76.8%) (reads=27460->1757, io_time=213.94->6.31ms)
> >
> > method=io_uring
> > pgstatindex_large base= 224.2ms patch= 56.4ms 3.98x ( 74.9%) (reads=27460->1757, io_time=204.41->4.88ms)
>
> Nice!

--
Best,
Xuneng
Hi,

On Wed, Mar 11, 2026 at 7:29 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> One thing that was standing out is the bloom bitmap case, which was
> looking really nice for a large number of rows, so I have applied
> this part. The rest is going to need a bit more testing to build more
> confidence, as far as I can see.

Thanks for pushing that. I'll update the script with Andres' suggestions
and share it shortly.

--
Best,
Xuneng
Hi,

On Wed, Mar 11, 2026 at 8:16 AM Andres Freund <andres@anarazel.de> wrote:
>
> I think that last one is a big part - if I use
> BEGIN; DELETE FROM heap_test WHERE id % 500 = 0; ROLLBACK;
> I see much bigger wins due to the pgstattuple changes.
>
>                  time buffered    time DIO
> w/o read stream  2222.078 ms      2090.239 ms
> w/ read stream    299.455 ms       155.124 ms
>
> That's with local storage. io_uring, but numbers with worker are similar.

The results look great and interesting. This looks far better than
what I observed in my earlier tests. I'll run perf for pgstattuple
without the switching to see what is keeping the CPU busy.

--
Best,
Xuneng
Hi,
On Tue, 10 Mar 2026 at 16:23, Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> Another code path that showed significant performance improvement is
> pgstatindex [1]. I've incorporated the test into the script too. Here
> are the results from my testing:
>
> method=worker io-workers=12
> pgstatindex_large base= 233.8ms patch= 54.1ms 4.32x
> ( 76.8%) (reads=27460→1757, io_time=213.94→6.31ms)
>
> method=io_uring
> pgstatindex_large base= 224.2ms patch= 56.4ms 3.98x
> ( 74.9%) (reads=27460→1757, io_time=204.41→4.88ms)
I didn't run the benchmark yet but here is a small suggestion for the
pgstatindex patch:
+ p.current_blocknum = BTREE_METAPAGE + 1;
+ p.last_exclusive = nblocks;
for (blkno = 1; blkno < nblocks; blkno++)
...
+ p.current_blocknum = HASH_METAPAGE + 1;
+ p.last_exclusive = nblocks;
for (blkno = 1; blkno < nblocks; blkno++)
Could you move 'BTREE_METAPAGE + 1' and 'HASH_METAPAGE + 1' into
variables and then set p.current_blocknum and blkno using those
variables? p.current_blocknum and blkno should have the same initial
values, this change makes code less error prone and easier to read in
my opinion.
Other than the comment above, LGTM.
--
Regards,
Nazir Bilal Yavuz
Microsoft
Hi,

On Wed, Mar 11, 2026 at 10:23 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> On Wed, Mar 11, 2026 at 8:16 AM Andres Freund <andres@anarazel.de> wrote:
> > I see much bigger wins due to the pgstattuple changes.
> >
> >                  time buffered    time DIO
> > w/o read stream  2222.078 ms      2090.239 ms
> > w/ read stream    299.455 ms       155.124 ms
> >
> > That's with local storage. io_uring, but numbers with worker are similar.
>
> The results look great and interesting. This looks far better than
> what I observed in my earlier tests. I'll run perf for pgstattuple
> without the switching to see what is keeping the CPU busy.

io_uring
pgstattuple_large base= 1090.6ms patch= 143.3ms 7.61x ( 86.9%) (reads=20049->20049, io_time=1040.80->46.91ms)

I observed a similar magnitude of runtime reduction after switching to
pg_buffercache_evict_relation() and using BEGIN; DELETE FROM heap_test
WHERE id % 500 = 0; ROLLBACK. However, I lost the original flame
graphs after running many performance tests. I will regenerate them
and post them later.

--
Best,
Xuneng

Attachments
On Wed, Mar 11, 2026 at 9:37 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> Thanks for the tips! I currently don't have access to a machine or
> cloud instance with slower SSDs or HDDs that have higher latency. I'll
> try running the benchmark with debug_io_direct=data and dm_delay, as
> you suggested, to see if the results vary.

With debug_io_direct=data and dm_delay, the results look quite promising!

medium size / io_uring
gin_vacuum_medium base= 1619.9ms patch= 301.8ms 5.37x ( 81.4%) (reads=1571->947, io_time=1524.86->207.48ms)

The average runtime increases significantly after adding the manual
device delay, so it will take some time to complete all the test runs.
I was also busy with something else today... Once the runs are
finished, I'll share the results and the script to reproduce them.

--
Best,
Xuneng
On Wed, Mar 11, 2026 at 11:11:23PM +0800, Xuneng Zhou wrote:
> The average runtime increases significantly after adding the manual
> device delay, so it will take some time to complete all the test runs.
> I was also busy with something else today... Once the runs are
> finished, I’ll share the results and the script to reproduce them.

Thanks for doing that. On my side, I am going to look at the gin and
hash vacuum paths first with more testing as these don't use a custom
callback. I don't think that I am going to need a lot of convincing,
but I'd rather produce some numbers myself before doing something.
I'll tweak a mounting point with the delay trick, as well.
--
Michael
Attachments
Hi,

On Wed, Mar 11, 2026 at 3:53 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> Hi,
>
> On Tue, 10 Mar 2026 at 16:23, Xuneng Zhou <xunengzhou@gmail.com> wrote:
> >
> > Another code path that showed significant performance improvement is
> > pgstatindex [1]. I've incorporated the test into the script too. Here
> > are the results from my testing:
> >
> > method=worker io-workers=12
> > pgstatindex_large base= 233.8ms patch= 54.1ms 4.32x
> > ( 76.8%) (reads=27460→1757, io_time=213.94→6.31ms)
> >
> > method=io_uring
> > pgstatindex_large base= 224.2ms patch= 56.4ms 3.98x
> > ( 74.9%) (reads=27460→1757, io_time=204.41→4.88ms)
>
> I didn't run the benchmark yet but here is a small suggestion for the
> pgstatindex patch:
>
> +       p.current_blocknum = BTREE_METAPAGE + 1;
> +       p.last_exclusive = nblocks;
>
>         for (blkno = 1; blkno < nblocks; blkno++)
>
> ...
>
> +       p.current_blocknum = HASH_METAPAGE + 1;
> +       p.last_exclusive = nblocks;
>
>         for (blkno = 1; blkno < nblocks; blkno++)
>
> Could you move 'BTREE_METAPAGE + 1' and 'HASH_METAPAGE + 1' into
> variables and then set p.current_blocknum and blkno using those
> variables? p.current_blocknum and blkno should have the same initial
> values, this change makes code less error prone and easier to read in
> my opinion.
>
> Other than the comment above, LGTM.

Thanks! That makes sense to me. Please see the patch I’ll post later.

--
Best,
Xuneng
On Thu, Mar 12, 2026 at 06:33:08AM +0900, Michael Paquier wrote:
> Thanks for doing that. On my side, I am going to look at the gin and
> hash vacuum paths first with more testing as these don't use a custom
> callback. I don't think that I am going to need a lot of convincing,
> but I'd rather produce some numbers myself before doing something.
> I'll tweak a mounting point with the delay trick, as well.

While debug_io_direct has been helping a bit, the trick for the delay
to throttle the IO activity has helped much more with my runtime
numbers. I have mounted a separate partition with a delay of 5ms,
disabled checksums (this part did not make a real difference), and
evicted shared buffers for relation and indexes before the VACUUM.

Then I got better numbers. Here is an extract:
- worker=3:
gin_vacuum (100k tuples) base= 1448.2ms patch= 572.5ms 2.53x
( 60.5%) (reads=175→104, io_time=1382.70→506.64ms)
gin_vacuum (300k tuples) base= 3728.0ms patch= 1332.0ms 2.80x
( 64.3%) (reads=486→293, io_time=3669.89→1266.27ms)
bloom_vacuum (100k tuples) base= 21826.8ms patch= 17220.3ms 1.27x
( 21.1%) (reads=485→117, io_time=4773.33→270.56ms)
bloom_vacuum (300k tuples) base= 67054.0ms patch= 53164.7ms 1.26x
( 20.7%) (reads=1431.5→327.5, io_time=13880.2→381.395ms)
- io_uring:
gin_vacuum (100k tuples) base= 1240.3ms patch= 360.5ms 3.44x
( 70.9%) (reads=175→104, io_time=1175.35→299.75ms)
gin_vacuum (300k tuples) base= 2829.9ms patch= 642.0ms 4.41x
( 77.3%) (reads=465.5→293, io_time=2768.46→579.04ms)
bloom_vacuum (100k tuples) base= 22121.7ms patch= 17532.3ms 1.26x
( 20.7%) (reads=485→117, io_time=4850.46→285.28ms)
bloom_vacuum (300k tuples) base= 67058.0ms patch= 53118.0ms 1.26x
( 20.8%) (reads=1431.5→327.5, io_time=13870.9→305.44ms)

The higher the number of tuples, the better the performance for each
individual operation, but the tests take a much longer time (tens of
seconds vs tens of minutes). For GIN, the numbers can be quite good
once these reads are pushed. For bloom, the runtime is improved, and
the IO numbers are much better.

At the end, I have applied these two parts. Remains now the hash
vacuum and the two parts for pgstattuple.
--
Michael
Attachments
On Thu, Mar 12, 2026 at 11:42 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Mar 12, 2026 at 06:33:08AM +0900, Michael Paquier wrote:
> > Thanks for doing that. On my side, I am going to look at the gin and
> > hash vacuum paths first with more testing as these don't use a custom
> > callback. I don't think that I am going to need a lot of convincing,
> > but I'd rather produce some numbers myself before doing something.
> > I'll tweak a mounting point with the delay trick, as well.
>
> While debug_io_direct has been helping a bit, the trick for the delay
> to throttle the IO activity has helped much more with my runtime
> numbers. I have mounted a separate partition with a delay of 5ms,
> disabled checksums (this part did not make a real difference), and
> evicted shared buffers for relation and indexes before the VACUUM.
>
> Then I got better numbers. Here is an extract:
> - worker=3:
> gin_vacuum (100k tuples) base= 1448.2ms patch= 572.5ms 2.53x
> ( 60.5%) (reads=175→104, io_time=1382.70→506.64ms)
> gin_vacuum (300k tuples) base= 3728.0ms patch= 1332.0ms 2.80x
> ( 64.3%) (reads=486→293, io_time=3669.89→1266.27ms)
> bloom_vacuum (100k tuples) base= 21826.8ms patch= 17220.3ms 1.27x
> ( 21.1%) (reads=485→117, io_time=4773.33→270.56ms)
> bloom_vacuum (300k tuples) base= 67054.0ms patch= 53164.7ms 1.26x
> ( 20.7%) (reads=1431.5→327.5, io_time=13880.2→381.395ms)
> - io_uring:
> gin_vacuum (100k tuples) base= 1240.3ms patch= 360.5ms 3.44x
> ( 70.9%) (reads=175→104, io_time=1175.35→299.75ms)
> gin_vacuum (300k tuples) base= 2829.9ms patch= 642.0ms 4.41x
> ( 77.3%) (reads=465.5→293, io_time=2768.46→579.04ms)
> bloom_vacuum (100k tuples) base= 22121.7ms patch= 17532.3ms 1.26x
> ( 20.7%) (reads=485→117, io_time=4850.46→285.28ms)
> bloom_vacuum (300k tuples) base= 67058.0ms patch= 53118.0ms 1.26x
> ( 20.8%) (reads=1431.5→327.5, io_time=13870.9→305.44ms)
>
> The higher the number of tuples, the better the performance for each
> individual operation, but the tests take a much longer time (tens of
> seconds vs tens of minutes). For GIN, the numbers can be quite good
> once these reads are pushed. For bloom, the runtime is improved, and
> the IO numbers are much better.
>
> At the end, I have applied these two parts. Remains now the hash
> vacuum and the two parts for pgstattuple.
> --
> Michael
Thanks for running the benchmarks and pushing!
Here are the results of my test with debug_io_direct and delay:
-- io_uring, medium size
bloom_vacuum_medium base= 8355.2ms patch= 715.0ms 11.68x
( 91.4%) (reads=4732→1056, io_time=7699.47→86.52ms)
pgstattuple_medium base= 4012.8ms patch= 213.7ms 18.78x
( 94.7%) (reads=2006→2006, io_time=4001.66→200.24ms)
pgstatindex_medium base= 5490.6ms patch= 37.9ms 144.88x
( 99.3%) (reads=2745→173, io_time=5481.54→7.82ms)
hash_vacuum_medium base= 34483.4ms patch= 2703.5ms 12.75x
( 92.2%) (reads=19166→3901, io_time=31948.33→308.05ms)
wal_logging_medium base= 7778.6ms patch= 7814.5ms 1.00x
( -0.5%) (reads=2857→2845, io_time=11.84→11.45ms)
-- worker, medium size
bloom_vacuum_medium base= 8376.2ms patch= 747.7ms 11.20x
( 91.1%) (reads=4732→1056, io_time=7688.91→65.49ms)
pgstattuple_medium base= 4012.7ms patch= 339.0ms 11.84x
( 91.6%) (reads=2006→2006, io_time=4002.23→49.99ms)
pgstatindex_medium base= 5490.3ms patch= 38.3ms 143.23x
( 99.3%) (reads=2745→173, io_time=5480.60→16.24ms)
hash_vacuum_medium base= 34638.4ms patch= 2940.2ms 11.78x
( 91.5%) (reads=19166→3901, io_time=31881.61→242.01ms)
wal_logging_medium base= 7440.1ms patch= 7434.0ms 1.00x
( 0.1%) (reads=2861→2825, io_time=10.62→10.71ms)
-- Setting read delay only
sudo dmsetup reload "$DM_DELAY_DEV" --table "0 $size delay $dev 0 $ms $dev 0 0"
Setting dm_delay on delayed to 2ms read / 0ms write
After setting the write delay to 0ms, I can observe more pronounced
speedups overall: since the vacuum operation is write-intensive,
delaying writes can dominate the runtime and mask the read-path
improvement we're measuring. It also shortens the runtime of the test.
-- wal_logging
The wal_logging patch does not seem to benefit from streamification in
this configuration either.
-- Delay setup
For anyone wanting to reproduce the results with a simulated-latency
device, here is the setup I used.
1. Create a 50GB file-backed block device (enough for PG data + indexes)
sudo dd if=/dev/zero of=/srv/delay_disk.img bs=1M count=50000 status=progress
sudo losetup /dev/loop0 /srv/delay_disk.img
2. Create the dm_delay device with 2ms delay
sudo dmsetup create delayed --table "0 $(sudo blockdev --getsz
/dev/loop0) delay /dev/loop0 0 2"
3. Format and mount it
sudo mkfs.ext4 /dev/mapper/delayed
sudo mkdir -p /srv/pg_delayed
sudo mount /dev/mapper/delayed /srv/pg_delayed
sudo chown $(whoami) /srv/pg_delayed
4. Run benchmark with WORKROOT pointing to the delayed device
WORKROOT=/srv/pg_delayed SIZES=medium REPS=3 \
./run_streaming_benchmark.sh --baseline --io-method io_uring \
--test gin_vacuum --direct-io --io-delay 2 \
the targeted patch
--
Best,
Xuneng
Attachments
On Thu, Mar 12, 2026 at 12:39 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> On Thu, Mar 12, 2026 at 11:42 AM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Thu, Mar 12, 2026 at 06:33:08AM +0900, Michael Paquier wrote:
> > > Thanks for doing that. On my side, I am going to look at the gin and
> > > hash vacuum paths first with more testing as these don't use a custom
> > > callback. I don't think that I am going to need a lot of convincing,
> > > but I'd rather produce some numbers myself before doing something.
> > > I'll tweak a mounting point with the delay trick, as well.
> >
> > While debug_io_direct has been helping a bit, the trick for the delay
> > to throttle the IO activity has helped much more with my runtime
> > numbers. I have mounted a separate partition with a delay of 5ms,
> > disabled checksums (this part did not make a real difference), and
> > evicted shared buffers for relation and indexes before the VACUUM.
> >
> > Then I got better numbers. Here is an extract:
> > - worker=3:
> > gin_vacuum (100k tuples) base= 1448.2ms patch= 572.5ms 2.53x
> > ( 60.5%) (reads=175→104, io_time=1382.70→506.64ms)
> > gin_vacuum (300k tuples) base= 3728.0ms patch= 1332.0ms 2.80x
> > ( 64.3%) (reads=486→293, io_time=3669.89→1266.27ms)
> > bloom_vacuum (100k tuples) base= 21826.8ms patch= 17220.3ms 1.27x
> > ( 21.1%) (reads=485→117, io_time=4773.33→270.56ms)
> > bloom_vacuum (300k tuples) base= 67054.0ms patch= 53164.7ms 1.26x
> > ( 20.7%) (reads=1431.5→327.5, io_time=13880.2→381.395ms)
> > - io_uring:
> > gin_vacuum (100k tuples) base= 1240.3ms patch= 360.5ms 3.44x
> > ( 70.9%) (reads=175→104, io_time=1175.35→299.75ms)
> > gin_vacuum (300k tuples) base= 2829.9ms patch= 642.0ms 4.41x
> > ( 77.3%) (reads=465.5→293, io_time=2768.46→579.04ms)
> > bloom_vacuum (100k tuples) base= 22121.7ms patch= 17532.3ms 1.26x
> > ( 20.7%) (reads=485→117, io_time=4850.46→285.28ms)
> > bloom_vacuum (300k tuples) base= 67058.0ms patch= 53118.0ms 1.26x
> > ( 20.8%) (reads=1431.5→327.5, io_time=13870.9→305.44ms)
> >
> > The higher the number of tuples, the better the performance for each
> > individual operation, but the tests take a much longer time (tens of
> > seconds vs tens of minutes). For GIN, the numbers can be quite good
> > once these reads are pushed. For bloom, the runtime is improved, and
> > the IO numbers are much better.
>
> -- io_uring, medium size
> bloom_vacuum_medium base= 8355.2ms patch= 715.0ms 11.68x
> ( 91.4%) (reads=4732→1056, io_time=7699.47→86.52ms)
> pgstattuple_medium base= 4012.8ms patch= 213.7ms 18.78x
> ( 94.7%) (reads=2006→2006, io_time=4001.66→200.24ms)
> pgstatindex_medium base= 5490.6ms patch= 37.9ms 144.88x
> ( 99.3%) (reads=2745→173, io_time=5481.54→7.82ms)
> hash_vacuum_medium base= 34483.4ms patch= 2703.5ms 12.75x
> ( 92.2%) (reads=19166→3901, io_time=31948.33→308.05ms)
> wal_logging_medium base= 7778.6ms patch= 7814.5ms 1.00x
> ( -0.5%) (reads=2857→2845, io_time=11.84→11.45ms)
>
> -- worker, medium size
> bloom_vacuum_medium base= 8376.2ms patch= 747.7ms 11.20x
> ( 91.1%) (reads=4732→1056, io_time=7688.91→65.49ms)
> pgstattuple_medium base= 4012.7ms patch= 339.0ms 11.84x
> ( 91.6%) (reads=2006→2006, io_time=4002.23→49.99ms)
> pgstatindex_medium base= 5490.3ms patch= 38.3ms 143.23x
> ( 99.3%) (reads=2745→173, io_time=5480.60→16.24ms)
> hash_vacuum_medium base= 34638.4ms patch= 2940.2ms 11.78x
> ( 91.5%) (reads=19166→3901, io_time=31881.61→242.01ms)
> wal_logging_medium base= 7440.1ms patch= 7434.0ms 1.00x
> ( 0.1%) (reads=2861→2825, io_time=10.62→10.71ms)

Our io_time metric currently measures only read time and ignores write
I/O, which can be misleading. We now separate it into read_time and
write_time.

-- write-delay 2 ms
WORKROOT=/srv/pg_delayed SIZES=small REPS=3 ./run_streaming_benchmark.sh --baseline --io-method worker --io-workers 12 --test hash_vacuum --direct-io --read-delay 2 --write-delay 2 v6-0004-Streamify-hash-index-VACUUM-primary-bucket-page-r.patch

hash_vacuum_small base= 16652.8ms patch= 13493.2ms 1.23x
( 19.0%) (reads=2338→815, read_time=4136.19→884.79ms,
writes=6218→6206, write_time=12313.81→12289.58ms)

-- write-delay 0 ms
WORKROOT=/srv/pg_delayed SIZES=small REPS=3 ./run_streaming_benchmark.sh --baseline --io-method worker --io-workers 12 --test hash_vacuum --direct-io --read-delay 2 --write-delay 0 v6-0004-Streamify-hash-index-VACUUM-primary-bucket-page-r.patch

hash_vacuum_small base= 4310.2ms patch= 1146.7ms 3.76x
( 73.4%) (reads=2338→815, read_time=4002.24→833.47ms,
writes=6218→6206, write_time=186.69→140.96ms)

--
Best,
Xuneng
Attachments
On Thu, Mar 12, 2026 at 11:35:48PM +0800, Xuneng Zhou wrote:
> Our io_time metric currently measures only read time and ignores write
> I/O, which can be misleading. We now separate it into read_time and
> write_time.

I had a look at the pgstatindex part this morning, running my own test
under conditions similar to 6c228755add8, and here's one extract with
io_uring:
pgstatindex (100k tuples) base=32938.2ms patch=83.3ms 395.60x ( 99.7%)
(reads=2745->173, io_time=32932.09->59.75ms)

There was one issue with a declaration put in the middle of the code,
that I have fixed. This one is now done, remains 3 pieces to
evaluate.
--
Michael
Attachments
On Fri, Mar 13, 2026 at 9:50 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Mar 12, 2026 at 11:35:48PM +0800, Xuneng Zhou wrote:
> > Our io_time metric currently measures only read time and ignores write
> > I/O, which can be misleading. We now separate it into read_time and
> > write_time.
>
> I had a look at the pgstatindex part this morning, running my own test
> under conditions similar to 6c228755add8, and here's one extract with
> io_uring:
> pgstatindex (100k tuples) base=32938.2ms patch=83.3ms 395.60x ( 99.7%)
> (reads=2745->173, io_time=32932.09->59.75ms)

This result looks great!

> There was one issue with a declaration put in the middle of the code,
> that I have fixed. This one is now done, remains 3 pieces to
> evaluate.

Thanks for fixing this and for taking the time to review and test the
patches.

--
Best,
Xuneng