Обсуждение: pgstattuple: Use streaming read API in pgstatindex functions

Поиск
Список
Период
Сортировка

pgstattuple: Use streaming read API in pgstatindex functions

От
Xuneng Zhou
Дата:
Hi hackers,

While reading the code related to streaming reads and their current
use cases, I noticed that pgstatindex could potentially benefit from
adopting the streaming read API. The required change is relatively
simple—similar to what has already been implemented in the pg_warm and
pg_visibility extensions. I also ran some performance tests on an
experimental patch to validate the improvement.

Summary
Cold cache performance (the typical use case for diagnostic tools):
  - Medium indexes (~21MB): 1.21x - 1.79x faster (20-44% speedup)
  - Large indexes (~214MB): 1.50x - 1.90x faster (30-47% speedup)
  - Xlarge indexes (~1351MB):1.4x–1.9x gains. (29–47% speedup)

Hardware: AX162-R from hetzner

Test matrix:
  - Index types: Primary key, timestamp, float, composite (3 columns)
  - Sizes: Medium (1M rows, ~21MB), Large (10M rows, ~214MB), XLarge
(50M rows, ~ 1351MB))
  - Layouts: Unfragmented (sequential) and Fragmented (random insert order)
  - Cache states: Cold (dropped OS cache) and Warm (pg_prewarm)

Xlarge fragmented example:
==> Creating secondary indexes on test_xlarge
     Created 3 secondary indexes: created_at, score, composite
     Created test_xlarge_pkey: 1351 MB
     Fragmentation stats (random insert order):
 leaf_frag_pct | avg_density_pct | leaf_pages |  size
---------------+-----------------+------------+---------
          49.9 |            71.5 |     172272 | 1351 MB
(1 row)

configuration:
  - shared_buffers = 16GB
  - effective_io_concurrency = 500
  - io_combine_limit = 16
  - autovacuum = off
  - checkpoint_timeout = 1h
  - bgwriter_delay = 10000ms (minimize background writes)
  - jit = off
  - max_parallel_workers_per_gather = 0

Unfragmented Indexes (Cold Cache)

Index Type    Size  Baseline  Patched  Speedup
Primary Key    Medium  31.5 ms  19.6 ms  1.58×
Primary Key    Large  184.0 ms  119.0 ms  1.54×
Timestamp     Medium  13.4 ms  10.5 ms  1.28×
Timestamp     Large  85.0 ms  45.6 ms  1.86×
Float (score)   Medium  13.7 ms  11.4 ms  1.21×
Float (score)   Large  84.0 ms  45.0 ms  1.86×
Composite (3 col) Medium  26.7 ms  17.2 ms  1.56×
Composite (3 col) Large  89.8 ms  51.2 ms  1.75×

⸻

Fragmented Indexes (Cold Cache)

To address concerns about filesystem fragmentation, I tested indexes built
with random inserts (ORDER BY random()) to trigger page splits and create
fragmented indexes:

Index Type    Size  Baseline  Patched  Speedup
Primary Key    Medium  41.9 ms  23.5 ms  1.79×
Primary Key    Large  236.0 ms  148.0 ms  1.58×
Primary Key    XLarge  953.4 ms  663.1 ms  1.43×
Timestamp     Medium  32.1 ms  18.8 ms  1.70×
Timestamp     Large  188.0 ms  117.0 ms  1.59×
Timestamp     XLarge  493.0 ms  518.6 ms  0.95×
Float (score)   Medium  14.0 ms  10.9 ms  1.28×
Float (score)   Large  85.8 ms  45.2 ms  1.89×
Float (score)   XLarge  263.2 ms  176.5 ms  1.49×
Composite (3 col) Medium  42.3 ms  24.1 ms  1.75×
Composite (3 col) Large  245.0 ms  162.0 ms  1.51×
Composite (3 col) XLarge  1052.5 ms  716.5 ms  1.46×

Summary: Fragmentation generally does not hurt streaming reads; most
fragmented cases still see 1.4×–1.9× gains. One outlier (XLarge
Timestamp) shows a slight regression (0.95×).

⸻

Warm Cache Results
When indexes are fully cached in shared_buffers:
Unfragmented:  infrequent little regression for small to medium size
index(single digit ms variance, barely noticeable); small gains for
large size index
Fragmented:   infrequent little regression for small to medium size
index(single digit ms variance, barely noticeable); small gains for
large size index

Best,
Xuneng

Вложения

Re: pgstattuple: Use streaming read API in pgstatindex functions

От
Xuneng Zhou
Дата:
Hi,

On Mon, Oct 13, 2025 at 10:07 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> Hi hackers,
>
> While reading the code related to streaming reads and their current
> use cases, I noticed that pgstatindex could potentially benefit from
> adopting the streaming read API. The required change is relatively
> simple—similar to what has already been implemented in the pg_warm and
> pg_visibility extensions. I also ran some performance tests on an
> experimental patch to validate the improvement.
>
> Summary
> Cold cache performance (the typical use case for diagnostic tools):
>   - Medium indexes (~21MB): 1.21x - 1.79x faster (20-44% speedup)
>   - Large indexes (~214MB): 1.50x - 1.90x faster (30-47% speedup)
>   - Xlarge indexes (~1351MB):1.4x–1.9x gains. (29–47% speedup)
>
> Hardware: AX162-R from hetzner
>
> Test matrix:
>   - Index types: Primary key, timestamp, float, composite (3 columns)
>   - Sizes: Medium (1M rows, ~21MB), Large (10M rows, ~214MB), XLarge
> (50M rows, ~ 1351MB))
>   - Layouts: Unfragmented (sequential) and Fragmented (random insert order)
>   - Cache states: Cold (dropped OS cache) and Warm (pg_prewarm)
>
> Xlarge fragmented example:
> ==> Creating secondary indexes on test_xlarge
>      Created 3 secondary indexes: created_at, score, composite
>      Created test_xlarge_pkey: 1351 MB
>      Fragmentation stats (random insert order):
>  leaf_frag_pct | avg_density_pct | leaf_pages |  size
> ---------------+-----------------+------------+---------
>           49.9 |            71.5 |     172272 | 1351 MB
> (1 row)
>
> configuration:
>   - shared_buffers = 16GB
>   - effective_io_concurrency = 500
>   - io_combine_limit = 16
>   - autovacuum = off
>   - checkpoint_timeout = 1h
>   - bgwriter_delay = 10000ms (minimize background writes)
>   - jit = off
>   - max_parallel_workers_per_gather = 0
>
> Unfragmented Indexes (Cold Cache)
>
> Index Type    Size  Baseline  Patched  Speedup
> Primary Key    Medium  31.5 ms  19.6 ms  1.58×
> Primary Key    Large  184.0 ms  119.0 ms  1.54×
> Timestamp     Medium  13.4 ms  10.5 ms  1.28×
> Timestamp     Large  85.0 ms  45.6 ms  1.86×
> Float (score)   Medium  13.7 ms  11.4 ms  1.21×
> Float (score)   Large  84.0 ms  45.0 ms  1.86×
> Composite (3 col) Medium  26.7 ms  17.2 ms  1.56×
> Composite (3 col) Large  89.8 ms  51.2 ms  1.75×
>
> ⸻
>
> Fragmented Indexes (Cold Cache)
>
> To address concerns about filesystem fragmentation, I tested indexes built
> with random inserts (ORDER BY random()) to trigger page splits and create
> fragmented indexes:
>
> Index Type    Size  Baseline  Patched  Speedup
> Primary Key    Medium  41.9 ms  23.5 ms  1.79×
> Primary Key    Large  236.0 ms  148.0 ms  1.58×
> Primary Key    XLarge  953.4 ms  663.1 ms  1.43×
> Timestamp     Medium  32.1 ms  18.8 ms  1.70×
> Timestamp     Large  188.0 ms  117.0 ms  1.59×
> Timestamp     XLarge  493.0 ms  518.6 ms  0.95×
> Float (score)   Medium  14.0 ms  10.9 ms  1.28×
> Float (score)   Large  85.8 ms  45.2 ms  1.89×
> Float (score)   XLarge  263.2 ms  176.5 ms  1.49×
> Composite (3 col) Medium  42.3 ms  24.1 ms  1.75×
> Composite (3 col) Large  245.0 ms  162.0 ms  1.51×
> Composite (3 col) XLarge  1052.5 ms  716.5 ms  1.46×
>
> Summary: Fragmentation generally does not hurt streaming reads; most
> fragmented cases still see 1.4×–1.9× gains. One outlier (XLarge
> Timestamp) shows a slight regression (0.95×).
>
> ⸻
>
> Warm Cache Results
> When indexes are fully cached in shared_buffers:
> Unfragmented:  infrequent little regression for small to medium size
> index(single digit ms variance, barely noticeable); small gains for
> large size index
> Fragmented:   infrequent little regression for small to medium size
> index(single digit ms variance, barely noticeable); small gains for
> large size index
>

Fix indentation issue in v1.

Best,
Xuneng

Вложения

Re: pgstattuple: Use streaming read API in pgstatindex functions

От
Nazir Bilal Yavuz
Дата:
Hi,

Thank you for working on this!

On Mon, 13 Oct 2025 at 06:20, Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> Fix indentation issue in v1.

I did not look at the benchmarks, so here are my code comments.

- I would avoid creating a new scope for the streaming read. While it
makes the streaming code easier to interpret, it introduces a large
diff due to indentation changes.

- I suggest adding the following comment about streaming batching, as
it is used in other places:

        /*
         * It is safe to use batchmode as block_range_read_stream_cb takes no
         * locks.
         */

- For the '/* Scan all blocks except the metapage using streaming
reads */' comments, it might be helpful to clarify that the 0th page
is the metapage. Something like: '/* Scan all blocks except the
metapage (0th page) using streaming reads */'.

Other than these comments, the code looks good to me.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft



Re: pgstattuple: Use streaming read API in pgstatindex functions

От
Xuneng Zhou
Дата:
Hi Bilal,

Thanks for looking into this.

On Mon, Oct 13, 2025 at 3:00 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> Hi,
>
> Thank you for working on this!
>
> On Mon, 13 Oct 2025 at 06:20, Xuneng Zhou <xunengzhou@gmail.com> wrote:
> >
> > Fix indentation issue in v1.
>
> I did not look at the benchmarks, so here are my code comments.
>
> - I would avoid creating a new scope for the streaming read. While it
> makes the streaming code easier to interpret, it introduces a large
> diff due to indentation changes.
>
> - I suggest adding the following comment about streaming batching, as
> it is used in other places:
>
>         /*
>          * It is safe to use batchmode as block_range_read_stream_cb takes no
>          * locks.
>          */
>
> - For the '/* Scan all blocks except the metapage using streaming
> reads */' comments, it might be helpful to clarify that the 0th page
> is the metapage. Something like: '/* Scan all blocks except the
> metapage (0th page) using streaming reads */'.
>
> Other than these comments, the code looks good to me.
>

Here is patch v3. The comments have been added, and the extra scope
({}) has been removed as suggested.

Best,
Xuneng

Вложения

Re: pgstattuple: Use streaming read API in pgstatindex functions

От
wenhui qiu
Дата:
Hi Xuneng Zhou
   

> - /* Unlock and release buffer */
> - LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
> - ReleaseBuffer(buffer);
> + UnlockReleaseBuffer(buffer);
>  }
 Thanks for your patch! Just to nitpick a bit — I think this comment is worth keeping, even though the function name already conveys its meaning.

On Mon, Oct 13, 2025 at 4:42 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi Bilal,

Thanks for looking into this.

On Mon, Oct 13, 2025 at 3:00 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> Hi,
>
> Thank you for working on this!
>
> On Mon, 13 Oct 2025 at 06:20, Xuneng Zhou <xunengzhou@gmail.com> wrote:
> >
> > Fix indentation issue in v1.
>
> I did not look at the benchmarks, so here are my code comments.
>
> - I would avoid creating a new scope for the streaming read. While it
> makes the streaming code easier to interpret, it introduces a large
> diff due to indentation changes.
>
> - I suggest adding the following comment about streaming batching, as
> it is used in other places:
>
>         /*
>          * It is safe to use batchmode as block_range_read_stream_cb takes no
>          * locks.
>          */
>
> - For the '/* Scan all blocks except the metapage using streaming
> reads */' comments, it might be helpful to clarify that the 0th page
> is the metapage. Something like: '/* Scan all blocks except the
> metapage (0th page) using streaming reads */'.
>
> Other than these comments, the code looks good to me.
>

Here is patch v3. The comments have been added, and the extra scope
({}) has been removed as suggested.

Best,
Xuneng

Re: pgstattuple: Use streaming read API in pgstatindex functions

От
Nazir Bilal Yavuz
Дата:
Hi,

On Mon, 13 Oct 2025 at 11:42, Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> Here is patch v3. The comments have been added, and the extra scope
> ({}) has been removed as suggested.

Thanks, the code looks good to me!

The benchmark results are nice. I have a quick question about the
setup, if you are benchmarking with the default io_method (worker),
then the results might show the combined benefit of async I/O and
streaming I/O, rather than just streaming I/O alone. If that is the
case, I would suggest also running a benchmark with io_method=sync to
isolate the performance impact of streaming I/O. You might get
interesting results.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft



Re: pgstattuple: Use streaming read API in pgstatindex functions

От
Xuneng Zhou
Дата:
Hi Wenhui,

Thanks for looking into this.

On Mon, Oct 13, 2025 at 5:41 PM wenhui qiu <qiuwenhuifx@gmail.com> wrote:
>
> Hi Xuneng Zhou
>
>
> > - /* Unlock and release buffer */
> > - LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
> > - ReleaseBuffer(buffer);
> > + UnlockReleaseBuffer(buffer);
> >  }
>  Thanks for your patch! Just to nitpick a bit — I think this comment is worth keeping, even though the function name
alreadyconveys its meaning. 
>

It makes sense. I'll keep it.

Best,
Xuneng



Re: pgstattuple: Use streaming read API in pgstatindex functions

От
Xuneng Zhou
Дата:
Hi,

On Mon, Oct 13, 2025 at 5:42 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> Hi,
>
> On Mon, 13 Oct 2025 at 11:42, Xuneng Zhou <xunengzhou@gmail.com> wrote:
> >
> > Here is patch v3. The comments have been added, and the extra scope
> > ({}) has been removed as suggested.
>
> Thanks, the code looks good to me!
>
> The benchmark results are nice. I have a quick question about the
> setup, if you are benchmarking with the default io_method (worker),
> then the results might show the combined benefit of async I/O and
> streaming I/O, rather than just streaming I/O alone. If that is the
> case, I would suggest also running a benchmark with io_method=sync to
> isolate the performance impact of streaming I/O. You might get
> interesting results.

The previous benchmark was conducted using the default
io_method=worker. The following results are from a new benchmark run
with io_method=sync. The performance gains from streaming reads are
still present, though smaller than those observed when asynchronous
I/O and streaming I/O are combined — which is expected. Thanks for the
suggestion!

Fragmented Indexes (Cold Cache)

Index Type    Size  Baseline  Patched  Speedup
Primary Key    Medium  43.80 ms  38.61 ms  1.13×
Primary Key    Large  238.24 ms  202.47 ms  1.17×
Primary Key    XLarge  962.90 ms  793.57 ms  1.21×
Timestamp     Medium  33.34 ms  29.98 ms  1.11×
Timestamp     Large  190.41 ms  161.34 ms  1.18×
Timestamp     XLarge  794.52 ms  647.82 ms  1.22×
Float (score)   Medium  14.46 ms  13.51 ms  1.06×
Float (score)   Large  87.38 ms  77.22 ms  1.13×
Float (score)   XLarge  278.47 ms  233.22 ms  1.19×
Composite (3 col) Medium  44.49 ms  40.10 ms  1.10×
Composite (3 col) Large  244.86 ms  211.01 ms  1.16×
Composite (3 col) XLarge  1073.32 ms  872.42 ms  1.23×

________________________________

Fragmented Indexes (Warm Cache)

Index Type    Size  Baseline  Patched  Speedup
Primary Key    Medium  7.91 ms  7.88 ms  1.00×
Primary Key    Large  35.58 ms  36.41 ms  0.97×
Primary Key    XLarge  126.29 ms  126.95 ms  0.99×
Timestamp     Medium  5.14 ms  6.82 ms  0.75×
Timestamp     Large  22.96 ms  29.55 ms  0.77×
Timestamp     XLarge  104.26 ms  106.01 ms  0.98×
Float (score)   Medium  3.76 ms  4.18 ms  0.90×
Float (score)   Large  13.65 ms  13.01 ms  1.04×
Float (score)   XLarge  40.58 ms  41.28 ms  0.98×
Composite (3 col) Medium  8.23 ms  8.25 ms  0.99×
Composite (3 col) Large  37.10 ms  37.59 ms  0.98×
Composite (3 col) XLarge  139.89 ms  138.21 ms  1.01×

Best,
Xuneng

Вложения

Re: pgstattuple: Use streaming read API in pgstatindex functions

От
Xuneng Zhou
Дата:
Hi,

Here’s the updated summary report(cold cache, fragmented index), now including results for the streaming I/O + io_uring configuration.

image.png


Best,
Xuneng

Вложения

Re: pgstattuple: Use streaming read API in pgstatindex functions

От
Shinya Kato
Дата:
Hi,

On Wed, Oct 15, 2025 at 10:25 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi,

Here’s the updated summary report(cold cache, fragmented index), now including results for the streaming I/O + io_uring configuration.

image.png


Thank you for the additional tests. I can see the image on Gmail, but I cannot on pgsql-hackers archive [0], so it might be a good idea to attach it and not to paste it on the body.

I saw the patch and have a few minor comments.

+ p.current_blocknum = 1;

To improve readability, how about using the following, which is consistent with nbtree.c [1]?
p.current_blocknum = BTREE_METAPAGE + 1;

Similarly, for hash index:
p.current_blocknum = HASH_METAPAGE + 1;

+ /* Unlock and release buffer */
  UnlockReleaseBuffer(buf);

I think this comment is redundant since the function name UnlockReleaseBuffer is self-explanatory. I suggest omitting it from pgstathashindex and removing the existing one from pgstatindex_impl.



--
Best regards,
Shinya Kato
NTT OSS Center
Вложения

Re: pgstattuple: Use streaming read API in pgstatindex functions

От
Xuneng Zhou
Дата:
Hi Kato-san,

Thanks for looking into this.

On Thu, Oct 16, 2025 at 4:21 PM Shinya Kato <shinya11.kato@gmail.com> wrote:
>
> Hi,
>
> On Wed, Oct 15, 2025 at 10:25 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
>>
>> Hi,
>>
>> Here’s the updated summary report(cold cache, fragmented index), now including results for the streaming I/O +
io_uringconfiguration. 
>>
>
> Thank you for the additional tests. I can see the image on Gmail, but I cannot on pgsql-hackers archive [0], so it
mightbe a good idea to attach it and not to paste it on the body. 

Please see the attachment.

>
>
> I saw the patch and have a few minor comments.
>
> + p.current_blocknum = 1;
>
> To improve readability, how about using the following, which is consistent with nbtree.c [1]?
> p.current_blocknum = BTREE_METAPAGE + 1;
>
> Similarly, for hash index:
> p.current_blocknum = HASH_METAPAGE + 1;

This is more readable. I made the replacements.

>
> + /* Unlock and release buffer */
>   UnlockReleaseBuffer(buf);
>
> I think this comment is redundant since the function name UnlockReleaseBuffer is self-explanatory. I suggest omitting
itfrom pgstathashindex and removing the existing one from pgstatindex_impl. 

UnlockReleaseBuffer seems clearer and simpler than the original

> LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
> ReleaseBuffer(buffer);

So the comment might be less meaningful for UnlockReleaseBuffer. I
removed it as you suggested.

Best,
Xuneng

Вложения

Re: pgstattuple: Use streaming read API in pgstatindex functions

От
wenhui qiu
Дата:
HI Xuneng Zhou 

> - /* Unlock and release buffer */
> - LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
> - ReleaseBuffer(buffer);
> + UnlockReleaseBuffer(buffer);
>  }
The previous suggestion to keep it was based on the fact that the original code already had a similar comment.
In fact, the code itself is quite easy to understand. My earlier email was simply following the style of the existing code when making the suggestion.
If anyone thinks the comment is unnecessary, it can certainly be removed.

On Thu, Oct 16, 2025 at 5:39 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi Kato-san,

Thanks for looking into this.

On Thu, Oct 16, 2025 at 4:21 PM Shinya Kato <shinya11.kato@gmail.com> wrote:
>
> Hi,
>
> On Wed, Oct 15, 2025 at 10:25 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
>>
>> Hi,
>>
>> Here’s the updated summary report(cold cache, fragmented index), now including results for the streaming I/O + io_uring configuration.
>>
>
> Thank you for the additional tests. I can see the image on Gmail, but I cannot on pgsql-hackers archive [0], so it might be a good idea to attach it and not to paste it on the body.

Please see the attachment.

>
>
> I saw the patch and have a few minor comments.
>
> + p.current_blocknum = 1;
>
> To improve readability, how about using the following, which is consistent with nbtree.c [1]?
> p.current_blocknum = BTREE_METAPAGE + 1;
>
> Similarly, for hash index:
> p.current_blocknum = HASH_METAPAGE + 1;

This is more readable. I made the replacements.

>
> + /* Unlock and release buffer */
>   UnlockReleaseBuffer(buf);
>
> I think this comment is redundant since the function name UnlockReleaseBuffer is self-explanatory. I suggest omitting it from pgstathashindex and removing the existing one from pgstatindex_impl.

UnlockReleaseBuffer seems clearer and simpler than the original

> LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
> ReleaseBuffer(buffer);

So the comment might be less meaningful for UnlockReleaseBuffer. I
removed it as you suggested.

Best,
Xuneng