Обсуждение: Metadata and record block access stats for indexes
Hi, For the purpose of writing a blog post I was checking the index stats recorded for a workload, but found them rather confusing. Following along the code with the debugger it eventually made sense, and I could eventually understand what's counted. Looking around a bit, I discovered an older discussion [1] in the mailing lists and learned that the issue is known. The proposal in that thread is to start counting separate metadata and record stats depending on what type of index block is retrieved. I realized those would have helped me better understand the collected index stats, so I started working on a patch to add these in the system views. Attached is a WIP patch file with partial coverage of the B-Tree index code. The implementation follows the existing stats collection approach and the naming convention proposed in [1]. Let me know if what I'm doing is feasible and if there's any concerns I could address. Next steps would be to replace all places where I currently pass in NULL with proper counting, as well as update tests and docs. Looking forward to your feedback! Thanks! Cheers, Mircea [1]: https://www.postgresql.org/message-id/flat/CAH2-WzmdZqxCS1widYzjDAM%2BZ-Jz%3DejJoaWXDVw9Qy1UsK0tLA%40mail.gmail.com
Вложения
Hi, Just attaching the complete patch which now covers all index types, docs and tests. You can run the following to see it in action: create table test (id serial primary key); insert into test select * from generate_series(0,30000); select pg_stat_reset(); select * from test where id=3000; select * from pg_statio_all_indexes where indexrelname = 'test_pkey'; This will show that there were 2 index blocks read from shared buffers (hit): 1 metadata and one record. Cheers, Mircea On 28/02/2025 21:58, Mircea Cadariu wrote: > Hi, > > For the purpose of writing a blog post I was checking the index stats > recorded for a workload, but found them rather confusing. Following > along the code with the debugger it eventually made sense, and I could > eventually understand what's counted. Looking around a bit, I > discovered an older discussion [1] in the mailing lists and learned > that the issue is known. The proposal in that thread is to start > counting separate metadata and record stats depending on what type of > index block is retrieved. > > I realized those would have helped me better understand the collected > index stats, so I started working on a patch to add these in the > system views. Attached is a WIP patch file with partial coverage of > the B-Tree index code. The implementation follows the existing stats > collection approach and the naming convention proposed in [1]. Let me > know if what I'm doing is feasible and if there's any concerns I could > address. Next steps would be to replace all places where I currently > pass in NULL with proper counting, as well as update tests and docs. > > Looking forward to your feedback! Thanks! > > Cheers, > Mircea > > [1]: > https://www.postgresql.org/message-id/flat/CAH2-WzmdZqxCS1widYzjDAM%2BZ-Jz%3DejJoaWXDVw9Qy1UsK0tLA%40mail.gmail.com >
Вложения
On 7/4/25 18:00, Mircea Cadariu wrote:
> Just attaching v2 of the patch.
Hi Mircea,
Your patch applies cleanly and seems to work well.
IIUC, the index hit ratio should be computed with the following formula:
(idx_blks_hit - idx_metadata_blks) / (idx_blks_hit - idx_metadata_blks +
idx_blks_read)
because most of the index non-leaf pages should be in the cache. Right?
This should probably be documented somewhere?
Here is my testing:
# select tree_level, internal_pages, leaf_pages from
pgstatindex('pgbench_accounts_pkey');
tree_level | internal_pages | leaf_pages
------------+----------------+------------
2 | 20 | 5465
(1 row)
# SELECT DISTINCT pg_buffercache_evict(bufferid)
FROM pg_buffercache
WHERE relfilenode = pg_relation_filenode('pgbench_accounts_pkey');
pg_buffercache_evict
----------------------
(t,f)
(1 row)
# SELECT pg_stat_reset();
pg_stat_reset
---------------
(1 row)
# SELECT max(abalance) FROM pgbench_accounts WHERE aid = 100;
max
-----
0
(1 row)
# select idx_blks_read, idx_blks_hit, idx_metadata_blks from
pg_statio_all_indexes where indexrelname = 'pgbench_accounts_pkey';
idx_blks_read | idx_blks_hit | idx_metadata_blks
---------------+--------------+-------------------
3 | 0 | 2
(1 row)
--> 3 pages: the root of the tree, one internal page and one leaf
#
\q
fyhuel@framework:~$ psql bench
psql (19devel)
Type "help" for help.
# SELECT max(abalance) FROM pgbench_accounts WHERE aid = 100;
max
-----
0
(1 row)
primary sleaf bench [42323] # select idx_blks_read, idx_blks_hit,
idx_metadata_blks from pg_statio_all_indexes where indexrelname =
'pgbench_accounts_pkey';
idx_blks_read | idx_blks_hit | idx_metadata_blks
---------------+--------------+-------------------
4 | 3 | 5
--> 4 more pages: same as before, already in cache, plus the index meta
page, read outside shared buffers because we started a new session?
Hi Frédéric, Thanks a lot for trying out my (first) patch! Much appreciated. On 20/07/2025 21:54, Frédéric Yhuel wrote: > Your patch applies cleanly and seems to work well. Cool! > because most of the index non-leaf pages should be in the cache. Right? Yes indeed, it's an assumption in the implementation, that the non-leaf pages will roughly always be in the cache. > This should probably be documented somewhere? I'm still familiarising myself about what to document where, whether things should be in the official docs or separate blog posts. In the patch I only documented the new column next to the existing ones for now. > --> 3 pages: the root of the tree, one internal page and one leaf Yes, this is correct. > primary sleaf bench [42323] # select idx_blks_read, idx_blks_hit, > idx_metadata_blks from pg_statio_all_indexes where indexrelname = > 'pgbench_accounts_pkey'; > idx_blks_read | idx_blks_hit | idx_metadata_blks > ---------------+--------------+------------------- > 4 | 3 | 5 > > --> 4 more pages: same as before, already in cache, plus the index > meta page, read outside shared buffers because we started a new session? Yes, that's my understanding too. Thanks! Kind regards, Mircea Cadariu
Rebased and dusted off this patch.
Вложения
Hi Mircea, > Rebased and dusted off this patch. Thanks for the patch. Here are my two cents. IMO it would be helpful if you could come up with a few more practical use cases. This change is going to affect pretty much everyone. If only a few users will benefit from it once in several years, the value of the patch is arguably low. As an example, can you think of how the new counters can be used for debugging, checking index integrity, writing new access methods or perhaps writing property-based tests? Just several examples that came to my mind first. Also I'm a bit concerned about the performance impact. It's probably next to nothing, but if you could measure it on a relatively large amount of data that would be great. Note that it's not uncommon to have dozens of different indexes for a single table. -- Best regards, Aleksander Alekseev
Hi, Aleksander! On 20/11/2025 14:56, Aleksander Alekseev wrote: > Hi Mircea, > >> Rebased and dusted off this patch. > Thanks for the patch. Here are my two cents. > > IMO it would be helpful if you could come up with a few more practical > use cases. This change is going to affect pretty much everyone. If > only a few users will benefit from it once in several years, the value > of the patch is arguably low. As an example, can you think of how the > new counters can be used for debugging, checking index integrity, > writing new access methods or perhaps writing property-based tests? > Just several examples that came to my mind first. > > Also I'm a bit concerned about the performance impact. It's probably > next to nothing, but if you could measure it on a relatively large > amount of data that would be great. Note that it's not uncommon to > have dozens of different indexes for a single table. > Thanks! It hasn't caught on, might just stay an educational exercise on my side on patch writing and we move on. Thanks for thinking out loud, your ideas lead me to this: when working on improving the performance of index traversals, it could be a direct way to show improvement and even write regression tests: lower amount of internal pages are read than before. But, one could also just use the existing stats and still be able to track this.. For the performance checks, indeed, I'm using this script, I observe no meaningful difference on my laptop: https://gist.github.com/mcadariu/fc4a6d4eccd56b4447d1d9d05f9b5d79. -- Thanks, Mircea Cadariu
Hi Aleksander,
Thanks for the patch. Here are my two cents.
This is a follow-up on my earlier answers to your questions. Below is a motivating example and a performance comparison with HEAD.
For database performance we want to ensure our working set fits in memory (shared buffers and OS page cache).
Index leaf-page hit ratios allows a DBA to detect when actual index lookups hit disk, which the overall ratio can mask, because the internal pages are typically cached.
To see this in action, we can do the following. After applying the patch:
CREATE TABLE test_data_large (
id SERIAL PRIMARY KEY,
search_key INTEGER,
data_col TEXT
);
INSERT INTO test_data_large (search_key, data_col)
SELECT i, 'Data-' || i
FROM generate_series(1, 3200000) i;
CREATE INDEX idx_search_key_large ON test_data_large(search_key);
ANALYZE test_data_large;
SELECT pg_stat_reset();
This is our workload we'll run and collect stats for:
DO $$
BEGIN
FOR i IN 1..20000 LOOP
PERFORM 1
FROM test_data_large
WHERE search_key = ((i * 160) % 3200000)
LIMIT 1;
END LOOP;
END $$;
SELECT pg_stat_force_next_flush();
Now we can inspect the stats:
SELECT
ROUND(100.0 * idx_blks_hit / (idx_blks_read + idx_blks_hit), 2) || '%' AS "Overall hit ratio",
ROUND(100.0 * (idx_blks_hit - idx_metadata_blks) / ((idx_blks_read + idx_blks_hit) - idx_metadata_blks), 2) || '%' AS "Leaf hit ratio"
FROM pg_statio_all_indexes
WHERE indexrelname = 'idx_search_key_large';
This is the result:
Cache hit ratio | Leaf hit ratio
-------------+-----------
85.37% | 56.11%
(1 row)
The overall cache hit ratio looks healthy at ~85%, but the leaf-page hit ratio is much lower at ~56%, indicating that a large share of index leaf blocks are actually being read from disk. This suggests that internal index pages remain cached (as expected, B-tree hierarchy) while the working set of leaf pages does not fit in memory, leading to more disk I/O during actual lookups.
Now showing the performance comparison of running this script [1] on my laptop on HEAD and patch.
HEAD
====
Point Queries
-----------
+------+--------+
| Run | TPS |
+------+--------+
| 1 | 30098 |
| 2 | 30164 |
| 3 | 29903 |
| 4 | 30024 |
| 5 | 29898 |
| 6 | 30023 |
| 7 | 29952 |
| 8 | 29413 |
| 9 | 30062 |
| 10 | 29821 |
+------+--------+
Median: 29988
Range Scans
----------
+------+------+
| Run | TPS |
+------+------+
| 1 | 586 |
| 2 | 584 |
| 3 | 584 |
| 4 | 584 |
| 5 | 584 |
| 6 | 587 |
| 7 | 578 |
| 8 | 562 |
| 9 | 583 |
| 10 | 586 |
+------+------+
Median: 584
Mixed Load
----------
+------+--------+
| Run | TPS |
+------+--------+
| 1 | 16446 |
| 2 | 15842 |
| 3 | 16701 |
| 4 | 16293 |
| 5 | 16633 |
| 6 | 16292 |
| 7 | 16753 |
| 8 | 17047 |
| 9 | 17094 |
| 10 | 17078 |
+------+--------+
Median: 16667
Patch
===
Point Queries
-----------
+------+--------+
| Run | TPS |
+------+--------+
| 1 | 30335 |
| 2 | 30448 |
| 3 | 30372 |
| 4 | 30447 |
| 5 | 30478 |
| 6 | 30482 |
| 7 | 30428 |
| 8 | 30443 |
| 9 | 30433 |
| 10 | 30478 |
+------+--------+
Median: 30445
Range Scans
----------
+------+------+
| Run | TPS |
+------+------+
| 1 | 578 |
| 2 | 586 |
| 3 | 585 |
| 4 | 586 |
| 5 | 587 |
| 6 | 585 |
| 7 | 586 |
| 8 | 586 |
| 9 | 586 |
| 10 | 586 |
+------+------+
Median: 586
Mixed Load
---------
+------+--------+
| Run | TPS |
+------+--------+
| 1 | 17002 |
| 2 | 17078 |
| 3 | 17042 |
| 4 | 17046 |
| 5 | 17007 |
| 6 | 17023 |
| 7 | 17056 |
| 8 | 17071 |
| 9 | 17084 |
| 10 | 17068 |
+------+--------+
Median: 17051
HEAD vs Patch Summary
===============
+----------------------+-----------+-----------+
| Test | HEAD TPS | Patch TPS |
+----------------------+-----------+-----------+
| Point Queries Median | 29988 | 30445 |
| Range Scans Median | 584 | 586 |
| Mixed Load Median | 16667 | 17051 |
+----------------------+-----------+-----------+
Any feedback appreciated on the performance methodology/patch as a whole welcome.
[1] https://gist.github.com/mcadariu/fc4a6d4eccd56b4447d1d9d05f9b5d79
-- Thanks, Mircea Cadariu