Discussion: Moving _bt_readpage and _bt_checkkeys into a new .c file
Attached patch v1-0001-* moves _bt_readpage (from nbtsearch.c) and
_bt_checkkeys (from nbtutils.c) into a new .c file -- nbtreadpage.c.
It also moves all of the functions that _bt_checkkeys itself calls
(either directly or indirectly) over to nbtreadpage.c. This first
patch is strictly mechanical, in that it only moves existing functions
around, without directly changing anything.
There's also a second patch, which adds a performance optimization
that builds on the first patch. This makes the functions moved to the
new file pass around the scan's direction, rather than requiring the
functions to fish it out of so->currPos as before.
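To make the shape of that change concrete, here is a minimal
standalone sketch of the idea. The type and function names are
invented stand-ins for illustration only -- they are not the real
IndexScanDesc/BTScanOpaque definitions, and this is not the patch's
actual code:

#include <stdbool.h>

typedef enum { BackwardScanDir = -1, ForwardScanDir = 1 } ScanDir;

typedef struct ScanPosSketch { ScanDir dir; } ScanPosSketch;
typedef struct ScanOpaqueSketch { ScanPosSketch currPos; } ScanOpaqueSketch;
typedef struct ScanDescSketch { void *opaque; } ScanDescSketch;

/* Before: the callee re-derives the direction on every call */
static bool
checkkeys_before(ScanDescSketch *scan, int key)
{
    ScanOpaqueSketch *so = (ScanOpaqueSketch *) scan->opaque;
    ScanDir dir = so->currPos.dir;      /* fished out of so->currPos */

    return (dir == ForwardScanDir) ? (key >= 0) : (key <= 0);
}

/* After: the caller (standing in for _bt_readpage here) passes the
 * direction down explicitly, so the callee needn't chase scan->opaque */
static bool
checkkeys_after(ScanDescSketch *scan, ScanDir dir, int key)
{
    (void) scan;                        /* still available for other scan state */

    return (dir == ForwardScanDir) ? (key >= 0) : (key <= 0);
}

In the real patch, the functions moved into nbtreadpage.c receive the
direction from their caller instead of fishing it out of so->currPos.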
With large range query variants of pgbench consisting of queries like
"SELECT abalance FROM pgbench_accounts WHERE aid between 1000 AND
3000", at scale 100, with a single client, I find that there's a ~4.5%
increase in transaction throughput with both patches applied (when
compared to master). The new module structure coupled with the
optimization in the second patch gives the compiler the ability to
produce more efficient code.
Here's a set of results from a series of interlaced 5 minute runs:
master: tps = 2682.600100, lat avg = 0.373 ms, lat stddev = 0.006 ms
cpu/txn = 0.011320, cs/txn = 8.1114, in/txn = 0.1832
patch: tps = 2827.020776, lat avg = 0.354 ms, lat stddev = 0.006 ms
cpu/txn = 0.010777, cs/txn = 8.1083, in/txn = 0.1768
master: tps = 2711.945605, lat avg = 0.369 ms, lat stddev = 0.006 ms
cpu/txn = 0.011173, cs/txn = 8.1109, in/txn = 0.1815
patch: tps = 2832.105849, lat avg = 0.353 ms, lat stddev = 0.006 ms
cpu/txn = 0.010793, cs/txn = 8.1090, in/txn = 0.1769
master: tps = 2699.348706, lat avg = 0.370 ms, lat stddev = 0.007 ms
cpu/txn = 0.011225, cs/txn = 8.1123, in/txn = 0.1813
patch: tps = 2822.511539, lat avg = 0.354 ms, lat stddev = 0.006 ms
cpu/txn = 0.010759, cs/txn = 8.1078, in/txn = 0.1760
master: tps = 2715.051289, lat avg = 0.368 ms, lat stddev = 0.006 ms
cpu/txn = 0.011160, cs/txn = 8.1116, in/txn = 0.1824
patch: tps = 2818.002045, lat avg = 0.355 ms, lat stddev = 0.007 ms
cpu/txn = 0.010752, cs/txn = 8.1078, in/txn = 0.1762
master: tps = 2720.488826, lat avg = 0.367 ms, lat stddev = 0.011 ms
cpu/txn = 0.011138, cs/txn = 8.1103, in/txn = 0.1815
patch: tps = 2818.035923, lat avg = 0.355 ms, lat stddev = 0.006 ms
cpu/txn = 0.010752, cs/txn = 8.1087, in/txn = 0.1762
Other kinds of queries benefit much less. I think that there's a
sub-1% improvement with standard pgbench SELECT, though it's hard to
distinguish from noise. Fewer of the code paths affected by the second
patch are exercised by such queries than by the favorable range scan
query case that I've shown detailed numbers for, which might explain
why simple equality lookups aren't improved by very much.
This seems like an enhancement that is pretty easy to justify. Note
that the changes in the second patch essentially restore things to how
they already were prior to my commit 763d65ae. I doubt that that
change caused a regression at the time, since the speedup that I see
now depends on the changes in the first patch (though I must admit
that I haven't benchmarked the changes made by the second patch in
isolation).
--
Peter Geoghegan
Attachments
On Sat, Dec 6, 2025 at 06:49, Peter Geoghegan <pg@bowt.ie> wrote:
> Attached patch v1-0001-* moves _bt_readpage (from nbtsearch.c) and
> _bt_checkkeys (from nbtutils.c) into a new .c file -- nbtreadpage.c.
> It also moves all of the functions that _bt_checkkeys itself calls
> (either directly or indirectly) over to nbtreadpage.c. This first
> patch is strictly mechanical, in that it only moves existing functions
> around, without directly changing anything.
> …
> This seems like an enhancement that is pretty easy to justify. Note
> that the changes in the second patch essentially restore things to how
> they already were prior to my commit 763d65ae. I doubt that that
> change caused a regression at the time, since the speedup that I see
> now depends on the changes in the first patch (though I must admit
> that I haven't benchmarked the changes made by the second patch in
> isolation).
Hey.
I like this change and I agree that it's both handy and gives an easy performance boost.
Patch applies and compiles cleanly. I can barely see a performance boost on my end (VM on a busy host), around 1%, but I still consider this change beneficial.
Victor Yegorov
On Sat, Dec 6, 2025 at 3:07 AM Victor Yegorov <vyegorov@gmail.com> wrote:
> I like this change and I agree that it's both handy and gives an easy performance boost.

There are a number of things that I find counterintuitive about the
performance impact of this patch:

* It doesn't make very much difference (under a 1% improvement) on
similar index-only scans, with queries such as "SELECT count(*) FROM
pgbench_accounts WHERE aid between :aid AND :endrange". One might
expect this case to be improved at least as much as the plain index
scan case, since the relative cost of _bt_readpage is higher (it's
much cheaper to access the visibility map than to access the heap).

* The patch makes an even larger difference when I make pgbench use a
larger range of values in the "between". For example, if I increase
the number of values/tuples returned from 2,000 (which is what I used
initially/reported on in the first email) to 15,000, I find that the
patch increases TPS by as much as 5.5%.

* These queries make maximum use of the _bt_set_startikey optimization
-- most individual leaf pages don't need to evaluate any scan key
(after an initial page-level check within _bt_set_startikey). So the
patch really helps in exactly those cases where we don't truly need to
access the scan direction at all -- the loop inside _bt_check_compare
always has 0 iterations with these queries, which means that scan
direction doesn't actually ever need to be considered at that point.

My best guess is that the benefits I see come from eliminating a
dependent load. Without the second patch applied, I see this
disassembly for _bt_checkkeys:

mov rax,QWORD PTR [rdi+0x38] ; Load scan->opaque
mov r15d,DWORD PTR [rax+0x70] ; Load so->dir

A version with the second patch applied still loads a pointer passed
by the _bt_checkkeys caller (_bt_readpage), but doesn't have to chase
another pointer to get to it. Maybe this significantly ameliorates
execution port pressure in the cases where I see a speedup?

> Patch applies and compiles cleanly. I can barely see a performance boost on my end (VM on a busy host), around 1%, but I still consider this change beneficial.

It seems to have no downsides, and some upside. I wouldn't be
surprised if the results I'm seeing are dependent on
microarchitectural details. I myself use a Zen 3 chip (a Ryzen 9
5950X).

--
Peter Geoghegan
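As a rough C-level picture of what those two instructions correspond
to, here is a sketch using invented stand-in types rather than the
real nbtree structs (the field layout and offsets are assumptions for
illustration only):

typedef struct OpaqueSketch { int dir; } OpaqueSketch;
typedef struct ScanSketch { void *opaque; } ScanSketch;

static int
load_dir(ScanSketch *scan)
{
    /* mov rax,QWORD PTR [rdi+0x38]  -- load scan->opaque */
    OpaqueSketch *so = (OpaqueSketch *) scan->opaque;

    /*
     * mov r15d,DWORD PTR [rax+0x70] -- load the direction.  This second
     * load's address comes from the first load's result, so it can't begin
     * until that load completes: the dependent load in question.
     */
    return so->dir;
}

With the second patch, the direction is reached through a pointer that
_bt_readpage passes in directly, so the extra hop through scan->opaque
goes away.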
On Sat, Dec 6, 2025 at 3:04 PM Peter Geoghegan <pg@bowt.ie> wrote:
> My best guess is that the benefits I see come from eliminating a
> dependent load. Without the second patch applied, I see this
> disassembly for _bt_checkkeys:
>
> mov rax,QWORD PTR [rdi+0x38] ; Load scan->opaque
> mov r15d,DWORD PTR [rax+0x70] ; Load so->dir
>
> A version with the second patch applied still loads a pointer passed
> by the _bt_checkkeys caller (_bt_readpage), but doesn't have to chase
> another pointer to get to it. Maybe this significantly ameliorates
> execution port pressure in the cases where I see a speedup?

I found a way to further speed up the queries that the second patch
already helped with, following profiling with perf: if _bt_readpage
takes a local copy of scan->ignore_killed_tuples when first called,
and then uses that local copy within its per-tuple loop (instead of
using scan->ignore_killed_tuples directly), it gives me an additional
1% speedup over what I reported earlier today. In other words, the
range/BETWEEN pgbench variant I summarized earlier today goes from
being about 4.5% faster than master to being about 5.5% faster than
master. Testing has also shown that the ignore_killed_tuples
enhancement doesn't significantly change the picture with other types
of queries (such as the default pgbench SELECT).

In short, this ignore_killed_tuples change makes the second patch from
v1 more effective, seemingly by further ameliorating the same
bottleneck. Apparently accessing scan->ignore_killed_tuples created
another load-use hazard in the same tight inner loop (the per-tuple
_bt_readpage loop). That matters with these queries, where we don't
need to do very much work per tuple (_bt_readpage's pstate.startikey
optimization is as effective as possible here) and where quite a few
tuples (2,000 tuples) need to be returned by each test query run.

Since this ignore_killed_tuples change is also very simple, and also
seems like an easy win, I think that it can be committed as part of
the second patch, without it needing to wait for too much more
performance validation.

Attached are 2 text files showing pgbench output/summary info,
generated by my test script (both are from runs that took place within
the last 2 hours). One of these result sets just confirms what I
reported earlier on, with an unmodified v1 patchset. The other file
shows detailed results for the v1 patchset with the
ignore_killed_tuples change also applied, for the same pgbench
config/workload. This second file gives full details to back up my
"~5.5% faster than master" claim. The pgbench script used for this is
as follows:

\set aid random_exponential(1, 100000 * :scale, 3.0)
\set endrange :aid + 2000
SELECT abalance FROM pgbench_accounts WHERE aid between :aid AND :endrange;

I'm deliberately not attaching a new v2 for this ignore_killed_tuples
change right now; the first patch is a few hundred KBs, and I don't
want this email to get held up in moderation. I have, however,
attached the ignore_killed_tuples change as its own patch (with a .txt
extension, just to avoid confusing CFTester).

--
Peter Geoghegan
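For concreteness, here is a minimal sketch of that local-copy pattern.
The types and the loop below are invented, deliberately simplified
stand-ins -- this is not the actual _bt_readpage code:

#include <stdbool.h>

typedef struct ScanSketch2
{
    bool ignore_killed_tuples;
    /* ... many other fields in the real scan descriptor ... */
} ScanSketch2;

typedef struct TupleSketch
{
    bool killed;
    int  value;
} TupleSketch;

/* Before: reads scan->ignore_killed_tuples inside the hot per-tuple loop */
static int
readpage_before(ScanSketch2 *scan, TupleSketch *tuples, int ntuples, int *out)
{
    int nout = 0;

    for (int i = 0; i < ntuples; i++)
    {
        if (scan->ignore_killed_tuples && tuples[i].killed)
            continue;
        out[nout++] = tuples[i].value;
    }
    return nout;
}

/* After: take a local copy once, when first called, and use that copy
 * inside the per-tuple loop */
static int
readpage_after(ScanSketch2 *scan, TupleSketch *tuples, int ntuples, int *out)
{
    bool ignore_killed = scan->ignore_killed_tuples;    /* hoisted by hand */
    int nout = 0;

    for (int i = 0; i < ntuples; i++)
    {
        if (ignore_killed && tuples[i].killed)
            continue;
        out[nout++] = tuples[i].value;
    }
    return nout;
}

In this toy loop a compiler could likely hoist the load by itself; in
the real per-tuple loop, which makes calls that could in principle
modify the scan descriptor, taking the copy by hand is presumably what
lets the flag stay in a register.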
Attachments
On Sat, Dec 6, 2025 at 9:44 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Since this ignore_killed_tuples change is also very simple, and also
> seems like an easy win, I think that it can be committed as part of
> the second patch. Without it needing to wait for too much more
> performance validation.

My plan is to commit the entire patch series (with a modified second
patch that includes the ignore_killed_tuples change) in the next
couple of days.

As far as I can determine through performance validation that tested a
variety of different scan types (simple point lookups, range scans,
and a variety of different SAOP scan patterns), the patch series is an
unambiguous win. It appears to be slightly faster even in
unsympathetic cases, such as standard pgbench SELECT.

--
Peter Geoghegan
On Mon, Dec 8, 2025 at 01:31, Peter Geoghegan <pg@bowt.ie> wrote:
> On Sat, Dec 6, 2025 at 9:44 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Since this ignore_killed_tuples change is also very simple, and also
> > seems like an easy win, I think that it can be committed as part of
> > the second patch. Without it needing to wait for too much more
> > performance validation.
>
> My plan is to commit the entire patch series (with a modified second
> patch that includes the ignore_killed_tuples change) in the next
> couple of days.
>
> As far as I can determine through performance validation that tested a
> variety of different scan types (simple point lookups, range scans,
> and a variety of different SAOP scan patterns), the patch series is an
> unambiguous win. It appears to be slightly faster even in
> unsympathetic cases, such as standard pgbench SELECT.
Even without the performance increase, I think this is a valuable code reorganization, worth committing.
I've compiled and run the tests (check and installcheck) with all 3 patches, and did a small standard pgbench run:
pgbench -s 100 -i
pgbench -P 60 -T 300
Results:
master: 569 TPS
patched: 590 TPS +3.7%
LGTM.
Victor Yegorov
On 08/12/2025 11:13, Victor Yegorov wrote:
> Even without the performance increase, I think this is a valuable code
> reorganization, worth committing.

+1

- Heikki
On Mon, Dec 8, 2025 at 4:13 AM Victor Yegorov <vyegorov@gmail.com> wrote:
> I've compiled and run the tests (check and installcheck) with all 3 patches, and did a small standard pgbench run:
> pgbench -s 100 -i
> pgbench -P 60 -T 300
>
> Results:
> master: 569 TPS
> patched: 590 TPS +3.7%

Pushed both patches just now. Thanks for the review!

--
Peter Geoghegan