Re: Postgres: Queries are too slow after upgrading to PG17 from PG15
Hi Andrei,
Thank you, and sorry for the delayed response.
We have found the same slowness in multiple other products in our portfolio, and it is blocking us from proceeding further.
Regarding your point below, it is true that the index-only scan is 2.5 times slower on PG17, and the primary difference is in the index-only scan of ui_stream_file_id_component at line 14. It takes 6 microseconds per row in PG 15, 2 microseconds per row in PG 16, and 14 microseconds per row in PG 17.
I found something in the PG17 release notes (http://postgresql.org/docs/release/17.0/) that might be related:
“ Allow btree indexes to more efficiently find a set of values, such as those supplied by IN clauses using constants.”
@Tom Lane, sorry I could not get you a reproducer yet, but please do comment on this insight in the meantime; I am working on it.
Thanks.
From: Andrei Lepikhov <lepihov@gmail.com>
Date: Thursday, 15 May 2025 at 7:56 PM
To: Sajith Prabhakar Shetty <ssajith@blackduck.com>, pgsql-bugs@lists.postgresql.org <pgsql-bugs@lists.postgresql.org>
Cc: Peter Geoghegan <pg@bowt.ie>
Subject: Re: Postgres: Queries are too slow after upgrading to PG17 from PG15
On 15/5/2025 07:33, Sajith Prabhakar Shetty wrote:
> Hi,
>
>
> Most of the queries got slower after upgrading our postgres from version
> 15 to 17 using pg_upgrade. I reconfirmed that "vacuum full, analyze"
> were all taken care of.
>
> To debug, instead of upgrade, I installed two instances one with
> postgres 15 and another postgres 17 with the same application dump restored.
>
> Now surprisingly, one of the queries I took from the application, which used to
> execute in 2s in PG15, is now taking 1min+ in PG17. I also observed that
> some of the operations involving DML slowed down too in PG17.
>
> Explain plan of the two queries almost same, all the joins and paths
> used are exactly same.
>
> Could anybody please provide some insights here?
Curiously, the difference is in the Index Only Scan node:
-> Index Scan using stream_file_pkey on stream_file sf
(cost=1.63..1.86 rows=1 width=8)
(actual time=0.006..0.006 rows=1 loops=598916)
Index Cond: (id = sdo.stream_file_id)
Filter: (component_id = ANY
-> Index Only Scan using ui_stream_file_id_component on stream_file sf
(cost=0.43..0.51 rows=1 width=8)
(actual time=0.014..0.014 rows=1 loops=598916)
Index Cond: ((id = sdo.stream_file_id) AND (component_id = ANY
Each time the index scan is 2.5 times slower on PG17. But:
PG 15:
Buffers: shared hit=2338397 read=57267
I/O Timings: shared read=3384.286
PG 17:
Buffers: shared hit=1909772 read=9933
I/O Timings: shared read=686.506
If I'm not mistaken, it seems like an insight.
--
regards, Andrei Lepikhov
On Thu, Jul 17, 2025 at 5:58 AM Sajith Prabhakar Shetty
<ssajith@blackduck.com> wrote:
> Regarding your point below, it is true that the index-only scan is 2.5 times slower on PG17, and the primary difference is in the index-only scan of ui_stream_file_id_component at line 14. It takes 6 microseconds per row in PG 15, 2 microseconds per row in PG 16, and 14 microseconds per row in PG 17.

The important difference is the choice of index for the outermost
nestloop join's inner index scan. A different index is used on
Postgres 17:

On Postgres 15, you're using the likely-single-column stream_file_pkey
index, which uses filter quals for the ScalarArrayOp/= ANY condition.

Whereas on Postgres 17, you're using the ui_stream_file_id_component
index instead (a multicolumn index), which uses a true index qual for
the "id = sdo.stream_file_id" as well as for the ScalarArrayOp/= ANY
condition.

--
Peter Geoghegan
Re: Postgres: Queries are too slow after upgrading to PG17 from PG15
Hi Peter,
Thanks for the response, but I don’t understand what you meant by “you are using a different index”; by any chance, did you mean the optimizer?
Because I have used exactly the same data dump for PG15, 16 and 17 in my tests, with no difference in data or schema structure.
From: Peter Geoghegan <pg@bowt.ie>
Date: Thursday, 17 July 2025 at 8:35 PM
To: Sajith Prabhakar Shetty <ssajith@blackduck.com>
Cc: Andrei Lepikhov <lepihov@gmail.com>, pgsql-bugs@lists.postgresql.org <pgsql-bugs@lists.postgresql.org>, Tom Lane <tgl@sss.pgh.pa.us>
Subject: Re: Postgres: Queries are too slow after upgrading to PG17 from PG15
On Thu, Jul 17, 2025 at 5:58 AM Sajith Prabhakar Shetty
<ssajith@blackduck.com> wrote:
> Regarding your point below, it is true that the index-only scan is 2.5 times slower on PG17, and the primary difference is in the index-only scan of ui_stream_file_id_component at line 14. It takes 6 microseconds per row in PG 15, 2 microseconds per row in PG 16, and 14 microseconds per row in PG 17.
The important difference is the choice of index for the outermost
nestloop join's inner index scan. A different index is used on
Postgres 17:
On Postgres 15, you're using the likely-single-column stream_file_pkey
index, which uses filter quals for the ScalarArrayOp/= ANY condition.
Whereas on Postgres 17, you're using the ui_stream_file_id_component
index instead (a multicolumn index), which uses a true index qual for
the "id = sdo.stream_file_id" as well as for the ScalarArrayOp/= ANY
condition.
--
Peter Geoghegan
On Thu, Jul 17, 2025 at 11:49 AM Sajith Prabhakar Shetty
<ssajith@blackduck.com> wrote:
> Thanks for the response, but I don’t understand what you meant by “you are using a different index”; by any chance, did you mean the optimizer?
> Because I have used exactly the same data dump for PG15, 16 and 17 in my tests, with no difference in data or schema structure.

I simply mean that the plan is substantially different, in that there
is an index scan node on 17 that uses a completely different index to
the corresponding index scan node on 15. While the plan looks almost
the same, this one detail is huge.

In other words, I disagree with your summary of the plan, when you
said "Explain plan of the two queries almost same, all the joins and
paths used are exactly same". The paths are not the same.

--
Peter Geoghegan
Re: Postgres: Queries are too slow after upgrading to PG17 from PG15
Hello,
We were able to get you a self-contained reproducer; please find attached the dump, SQL script, and readme files.
Sorry for the delay; our initial reproducer from the product was too large and too sensitive to share.
This clearly demonstrates the performance degradation of Postgres 17 versus Postgres 15.
NOTE: This degradation has blocked our PG upgrade on multiple products in our portfolio.
From: Peter Geoghegan <pg@bowt.ie>
Date: Thursday, 17 July 2025 at 9:25 PM
To: Sajith Prabhakar Shetty <ssajith@blackduck.com>
Cc: Andrei Lepikhov <lepihov@gmail.com>, pgsql-bugs@lists.postgresql.org <pgsql-bugs@lists.postgresql.org>, Tom Lane <tgl@sss.pgh.pa.us>
Subject: Re: Postgres: Queries are too slow after upgrading to PG17 from PG15
On Thu, Jul 17, 2025 at 11:49 AM Sajith Prabhakar Shetty
<ssajith@blackduck.com> wrote:
> Thanks for the response, but I don’t understand what you meant by “you are using a different index”; by any chance, did you mean the optimizer?
> Because I have used exactly the same data dump for PG15, 16 and 17 in my tests, with no difference in data or schema structure.
I simply mean that the plan is substantially different, in that there
is an index scan node on 17 that uses a completely different index to
the corresponding index scan node on 15. While the plan looks almost
the same, this one detail is huge.
In other words, I disagree with your summary of the plan, when you
said "Explain plan of the two queries almost same, all the joins and
paths used are exactly same". The paths are not the same.
--
Peter Geoghegan
On Mon, Jul 28, 2025 at 2:20 AM Sajith Prabhakar Shetty
<ssajith@blackduck.com> wrote:
> We were able to get you a self-contained reproducer; please find attached the dump, SQL script, and readme files.

I find that your test case spends a great deal of time on nbtree
preprocessing, which happens once per execution of the inner index
scan on "zsf". According to "perf top", most cycles are spent on these:

32.02% postgres [.] FunctionCall2Coll
22.01% postgres [.] qsort_arg
18.64% postgres [.] _bt_compare_array_elements
8.20% postgres [.] btint8cmp
3.97% postgres [.] _bt_preprocess_keys
...

The query takes ~1550ms on my local workstation. If I just comment out
the relevant qsort, it'll take only ~190 ms. That qsort might not be
the only problem here, but it is the immediate problem. Note that
commenting out the qsort should produce the same answer, at least for
this one query, since the constants that appear in the query are
already sorted (the EXPLAIN row counts match what they show with the
qsort in place).

In principle, we could limit the use of the qsort to the first inner
index scan, and safely skip each subsequent qsort -- at least in cases
where the array was a constant (which includes this case). Obviously,
we *do* need a qsort (what if the constants aren't exactly in sorted
order?), but we generally don't need to do it once per inner index
scan.

There is a separate question as to whether or not the planner should
pick this plan in the first place. I find that I can get a faster plan
(without commenting out anything) by tricking the planner into using
the single column "zsf_pkey", rather than the multi-column
"zsf_id_fpi_cid_key". Even then, I can only get a merge join with the
"zsf_pkey" index on the inner side -- not a nested loop join with the
"zsf_pkey" index on the inner side, as expected. The merge join plan
brings the execution time down to ~90ms, which is better, but
certainly still less than ideal.

--
Peter Geoghegan
Re: Postgres: Queries are too slow after upgrading to PG17 from PG15
Thanks, Peter, for the detailed response. We really appreciate your time and effort in this regard.
Please help me with the next steps required from our end. I understand you have found out the root cause; can you confirm whether a fix would be approved for this case in upcoming patch releases? Any ETA would be really helpful.
Also note that this has affected many of our SQL queries, and we cannot afford to modify our code to accommodate the PG17 planner changes.
Looping in my colleague Todd, who helped me get the reproducer steps.
From: Peter Geoghegan <pg@bowt.ie>
Date: Tuesday, 29 July 2025 at 2:12 AM
To: Sajith Prabhakar Shetty <ssajith@blackduck.com>
Cc: Andrei Lepikhov <lepihov@gmail.com>, pgsql-bugs@lists.postgresql.org <pgsql-bugs@lists.postgresql.org>, Tom Lane <tgl@sss.pgh.pa.us>
Subject: Re: Postgres: Queries are too slow after upgrading to PG17 from PG15
On Mon, Jul 28, 2025 at 2:20 AM Sajith Prabhakar Shetty
<ssajith@blackduck.com> wrote:
> We were able to get you a self-contained reproducer; please find attached the dump, SQL script, and readme files.
I find that your test case spends a great deal of time on nbtree
preprocessing, which happens once per execution of the inner index
scan on "zsf". According to "perf top", most cycles on spent on these:
32.02% postgres [.] FunctionCall2Coll
22.01% postgres [.] qsort_arg
18.64% postgres [.] _bt_compare_array_elements
8.20% postgres [.] btint8cmp
3.97% postgres [.] _bt_preprocess_keys
...
The query takes ~1550ms on my local workstation. If I just comment out
the relevant qsort, it'll take only ~190 ms. That qsort might not be
the only problem here, but it is the immediate problem. Note that
commenting out the qsort should produce the same answer, at least for
this one query, since the constants that appear in the query are
already sorted (the EXPLAIN row counts match what they show with the
qsort in place).
In principle, we could limit the use of the qsort to the first inner
index scan, and safely skip each subsequent qsort -- at least in cases
where the array was a constant (which includes this case). Obviously,
we *do* need a qsort (what if the constants aren't exactly in sorted
order?), but we generally don't need to do it once per inner index
scan.
There is a separate question as to whether or not the planner should
pick this plan in the first place. I find that I can get a faster plan
(without commenting out anything) by tricking the planner into using
the single column "zsf_pkey", rather than the multi-column
"zsf_id_fpi_cid_key". Even then, I can only get a merge join with the
"zsf_pkey" index on the inner side -- not a nested loop join with the
"zsf_pkey" index on the inner side, as expected. The merge join plan
brings the execution time down to ~90ms, which is better, but
certainly still less than ideal.
--
Peter Geoghegan
On Mon, Jul 28, 2025 at 4:41 PM Peter Geoghegan <pg@bowt.ie> wrote:
> The query takes ~1550ms on my local workstation. If I just comment out
> the relevant qsort, it'll take only ~190 ms. That qsort might not be
> the only problem here, but it is the immediate problem. Note that
> commenting out the qsort should produce the same answer, at least for
> this one query, since the constants that appear in the query are
> already sorted (the EXPLAIN row counts match what they show with the
> qsort in place).

Actually, that isn't quite true -- the constants weren't in sorted
order. I find that if I presort the elements within the query text
itself, the runtime goes down to only ~410ms. That's still not great,
but it is a vast improvement.

--
Peter Geoghegan
On Mon, Jul 28, 2025 at 2:42 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, Jul 28, 2025 at 2:20 AM Sajith Prabhakar Shetty
> <ssajith@blackduck.com> wrote:
> > We were able to get you a self-contained reproducer; please find attached the dump, SQL script, and readme files.
>
> I find that your test case spends a great deal of time on nbtree
> preprocessing, which happens once per execution of the inner index
> scan on "zsf". According to "perf top", most cycles are spent on these:
>
> 32.02% postgres [.] FunctionCall2Coll
> 22.01% postgres [.] qsort_arg
> 18.64% postgres [.] _bt_compare_array_elements
> 8.20% postgres [.] btint8cmp
> 3.97% postgres [.] _bt_preprocess_keys
> ...
>
> The query takes ~1550ms on my local workstation. If I just comment out
> the relevant qsort, it'll take only ~190 ms. That qsort might not be

side question: if 50% of time (per perf top) is spent in qsort and
subroutines, how come query time goes down ~85%? is this a limitation
of perf or some other issue?

side question #2: 32% of time spent on FunctionCall2Coll seems like a
lot -- is this inclusive of the routine under the function pointer?

just curious on both of these

merlin
On Tue, Jul 29, 2025 at 1:49 AM Sajith Prabhakar Shetty
<ssajith@blackduck.com> wrote:
> I understand you have found out the root cause

I wouldn't say that -- it isn't clear that this issue with qsorting
during preprocessing is the root cause. As I said, it is (at a
minimum) the immediate problem with your query. But the underlying
code path didn't change all that much in Postgres 17 -- it just
started to be hit a lot more with this particular query.

This is more or less a consequence of the fact that the new-to-17
query plan has an inner index scan with an inconvenient combination
of 2 things: it is very selective and fast (fast per individual
execution), even though it has a couple of SAOPs (a SAOP is an
= ANY() condition) that each have several hundred array elements.
That's what allows these startup costs (i.e. array preprocessing that
does these 2 qsorts on each execution) to dominate so much.

My Postgres 17 commit 5bf748b8 made the planner stop generating
distinct index paths that make lower-order SAOPs into "Filter:"
conditions, which are executed outside of the core B-Tree code, using
a completely different code path. Those index paths are used by the
Postgres 15 plan. I am very hesitant to add anything like that back,
though, because they're very unlikely to be faster than an index path
that makes the executor push down the SAOP condition into the B-Tree
code (your counterexample notwithstanding).

Perhaps Tom can weigh-in here. I removed code that generated these
alternative index paths from the planner because its original
justification (see bugfix commit a4523c5a, a follow-up to bugfix
commit 807a40c5) no longer applied. Perhaps this should be revisited
now, or perhaps the issue should be ameliorated on the nbtree side. Or
maybe we should just do nothing -- the issue can be worked around in
the application itself.

--
Peter Geoghegan
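[For readers following along: an IN() list of constants and an = ANY() array condition are the same thing once the query is parsed, and the SAOP form is what EXPLAIN prints. A minimal illustration against the reproducer's zsf table, with placeholder constant values:

SELECT id, cid, fpi FROM zsf WHERE cid IN (17, 23, 101);
-- the parser turns the IN() list into the equivalent ScalarArrayOpExpr (SAOP):
SELECT id, cid, fpi FROM zsf WHERE cid = ANY ('{17,23,101}'::bigint[]);
]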
On 7/30/25, 3:17 PM, "Peter Geoghegan" <pg@bowt.ie> wrote:
> Perhaps Tom can weigh-in here. I removed code that generated these
> alternative index paths from the planner because its original
> justification (see bugfix commit a4523c5a, a follow-up to bugfix
> commit 807a40c5) no longer applied. Perhaps this should be revisited
> now, or perhaps the issue should be ameliorated on the nbtree side. Or
> maybe we should just do nothing -- the issue can be worked around in
> the application itself.

I work at the same company as Sajith, but on a different product. The
reproducer he provided is just a sample; it's not the only problem.
Load testing in my team shows that PG 17 is about 4x slower than PG 15
across the board. It's bordering on unusable for production
deployments.

Unfortunately, the load testing setup doesn't really help isolate
individual, regressing queries. However, I'm more than willing to help
support any further investigation if needed or helpful.

--
todd
On Wed, Jul 30, 2025 at 2:59 PM Merlin Moncure <mmoncure@gmail.com> wrote:
> side question: if 50% of time (per perf top) is spent in qsort and
> subroutines, how come query time goes down ~85%? is this a limitation
> of perf or some other issue?

I used perf's default event type for this, which is "cycles". It's
well known that the relationship between cycles and wallclock time is
very complicated. In my experience "cycles" tends to be useful for a
first order pass, to get a very general sense of what's going on,
before considering possible solutions. That's all I needed here -- I
wasn't trying to be precise.

Separately, there's bound to be complicated nonlinear effects in play.
That's one of the reasons why it is so hard to apply profiling
information effectively, at least when optimizing a given piece of
code that is already reasonably well understood.

> side question #2: 32% of time spent on FunctionCall2Coll seems like a
> lot -- is this inclusive of the routine under the function pointer?
>
> just curious on both of these

I ran perf in a way that counted FunctionCall2Coll separately from
btint8cmp. It's not that surprising that FunctionCall2Coll dominated,
since btint8cmp is so simple.

--
Peter Geoghegan
On Wed, Jul 30, 2025 at 3:48 PM Todd Cook <cookt@blackduck.com> wrote:
> I work at the same company as Sajith, but on a different product. The reproducer he
> provided is just a sample; it's not the only problem. Load testing in my team shows
> that PG 17 is about 4x slower than PG 15 across the board. It's bordering on unusable
> for production deployments.

I suggest that you rewrite affected queries to make them join against
a VALUES() with the same constants as those currently used in the
larger IN() list. If you're not sure whether the set of constants from
the application will be reliably unique, you can use DISTINCT to make
sure.

--
Peter Geoghegan
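[As an illustration of the suggested rewrite, here is a sketch against the stream_file table from the original plan; the constants and the exact query shape are invented for the example:

-- before: a large, possibly unsorted IN() list
SELECT sf.id
FROM stream_file sf
WHERE sf.component_id IN (102, 57, 219);   -- imagine several hundred constants

-- after: join against VALUES(), with DISTINCT in case the constants repeat
SELECT sf.id
FROM stream_file sf
JOIN (SELECT DISTINCT cid
      FROM (VALUES (102), (57), (219)) AS v(cid)) AS ids
  ON ids.cid = sf.component_id;
]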
On Thu, 31 Jul 2025 at 08:14, Peter Geoghegan <pg@bowt.ie> wrote:
> I suggest that you rewrite affected queries to make them join against
> a VALUES() with the same constants as those currently used in the
> larger IN() list. If you're not sure whether the set of constants from
> the application will be reliably unique, you can use DISTINCT to make
> sure.

Even just presorting the IN list constants would make it better. If I
manually adjust the recreator query to sort the items in both IN
lists, I get:

Execution Time: 263.365 ms

vs:

Execution Time: 804.377 ms

Of course, the qsort_arg() call still happens, but it'll hit
qsort_arg's presorted short-circuit case and will do very little.
See the sketch below for what the adjustment looks like.

I've not looked in great detail, but I did wonder if it's worth
adjusting ExecIndexBuildScanKeys() to sort the array in a
ScalarArrayOpExpr when it's Const beforehand. That might be a bit of
wasted effort if there's just one scan, however.

David
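[The workaround David measured amounts to keeping each IN() list in ascending order in the query text, along these lines; the values are placeholders, while zsf and cid come from the posted reproducer:

-- unsorted constants: the per-scan qsort has real work to do
SELECT id, cid, fpi FROM zsf WHERE cid IN (456, 17, 789, 23, 101);

-- presorted constants: the same qsort hits its already-sorted fast path
SELECT id, cid, fpi FROM zsf WHERE cid IN (17, 23, 101, 456, 789);
]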
On Thu, 31 Jul 2025 at 07:49, Todd Cook <cookt@blackduck.com> wrote:
> I work at the same company as Sajith, but on a different product. The reproducer he
> provided is just a sample; it's not the only problem. Load testing in my team shows
> that PG 17 is about 4x slower than PG 15 across the board. It's bordering on unusable
> for production deployments.
>
> Unfortunately, the load testing setup doesn't really help isolate individual, regressing
> queries. However, I'm more than willing to help support any further investigation if
> needed or helpful.

Unfortunately, we can't really work with that much information. It's
not like we have a list of things we know are slower in newer versions
vs older versions. Changes generally go through a large amount of
testing to help ensure these regressions don't happen, so if you
report one, unless someone else beat you to it, there's a decent
chance we didn't know about it. There's nothing we can really do to
help you based on this much information. There's just no chance we'd
have shipped PG17 if it was known to be x4 slower than some previous
version.

You may be able to narrow down what's slower using pg_stat_statements.
If you can, then use EXPLAIN and compare the plans. Did PG17 choose a
different plan? Does EXPLAIN ANALYZE reveal any inaccurate statistics?
Are both instances configured the same way?

Once you find a specific query that's causing the issue, then a report
similar to what Sajith has done is a good way to get help.

David
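[A sketch of the kind of pg_stat_statements query that can surface the regressing statements; this assumes the extension is installed and preloaded, and uses the column names of recent releases:

CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

SELECT calls,
       round(mean_exec_time::numeric, 2)  AS mean_ms,
       round(total_exec_time::numeric, 2) AS total_ms,
       query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
]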
On Wed, Jul 30, 2025 at 10:39 PM David Rowley <dgrowleyml@gmail.com> wrote:
> I've not looked in great detail, but I did wonder if it's worth
> adjusting ExecIndexBuildScanKeys() to sort the array in a
> ScalarArrayOpExpr when it's Const beforehand. That might be a bit of
> wasted effort if there's just one scan, however.

If we were to do that, it wouldn't necessarily waste any effort. We're
sorting the array either way; sorting it a bit earlier seems unlikely
to hurt performance.

The main difficulty with such a scheme is finding a natural way to
structure everything. We'd have to look up the relevant ORDER proc
earlier, which might be awkward and non-modular (it has to use the
correct ORDER proc in cross-type scenarios, for example). We'd also
have to mark the relevant input scan key with a new flag indicating
that it is already sorted (and already has any NULL array elements
removed). That way, when nbtree preprocessing saw the flag, it'd know
that there was no need to sort the array. The sort would be skipped,
regardless of whether preprocessing took place during the first
btrescan/amrescan, or during some subsequent rescan.

--
Peter Geoghegan
On 7/30/25, 10:49 PM, "David Rowley" <dgrowleyml@gmail.com> wrote:
> On Thu, 31 Jul 2025 at 07:49, Todd Cook <cookt@blackduck.com> wrote:
>> I work at the same company as Sajith, but on a different product. The reproducer he
>> provided is just a sample; it's not the only problem. Load testing in my team shows
>> that PG 17 is about 4x slower than PG 15 across the board. It's bordering on unusable
>> for production deployments.
>>
>> Unfortunately, the load testing setup doesn't really help isolate individual, regressing
>> queries. However, I'm more than willing to help support any further investigation if
>> needed or helpful.
>
> Unfortunately, we can't really work with that much information. It's
> not like we have a list of things we know are slower in newer versions
> vs older versions. Changes generally go through a large amount of
> testing to help ensure these regressions don't happen, so if you
> report one, unless someone else beat you to it, there's a decent
> chance we didn't know about it. There's nothing we can really do to
> help you based on this much information. There's just no chance we'd
> have shipped PG17 if it was known to be x4 slower than some previous
> version.

Sorry, it was not my intent to cast aspersions on PG; I was simply
trying to indicate the scale of the problem we're facing. I've been
using PG for 21+ years now, and it always has been, and still is, the
most reliable component in our software stack.

> You may be able to narrow down what's slower using pg_stat_statements.
> If you can, then use EXPLAIN and compare the plans. Did PG17 choose a
> different plan? Does EXPLAIN ANALYZE reveal any inaccurate statistics?
> Are both instances configured the same way?

We make liberal use of "IN (unsorted_constant_list)" in our queries,
so a regression there could easily explain all the slowdown we're
seeing. However, I will confirm that just to be certain.

Also FWIW, I've bisected the slowdown to this commit:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=5bf748b86.
The reproducer that Sajith posted takes ~1.1 seconds when run on a
build at that commit (with the same index-only scan as 17.5), but only
~85ms on the prior commit (with a filtered index scan like with 15).

--
todd
Peter Geoghegan <pg@bowt.ie> writes:
> Perhaps Tom can weigh-in here. I removed code that generated these
> alternative index paths from the planner because its original
> justification (see bugfix commit a4523c5a, a follow-up to bugfix
> commit 807a40c5) no longer applied. Perhaps this should be revisited
> now, or perhaps the issue should be ameliorated on the nbtree side. Or
> maybe we should just do nothing -- the issue can be worked around in
> the application itself.

Well, maybe it was a mistake to no longer consider such plans, but
this example doesn't prove it. Quoting the submitted readme file,
we selected this plan in v15:

-> Index Scan using zsf_pkey on zsf sf (cost=1.49..1.51 rows=1 width=24) (actual time=0.001..0.001 rows=1 loops=47089)
   Index Cond: (id = sdo.sfi)
   Filter: (cid = ANY ('{...}'::bigint[]))

versus this in v17:

-> Index Only Scan using zsf_id_fpi_cid_key on zsf sf (cost=0.29..0.31 rows=1 width=24) (actual time=0.023..0.023 rows=1 loops=47089)
   Index Cond: ((id = sdo.sfi) AND (cid = ANY ('{...}'::bigint[])))

IIUC you're saying the planner no longer even considers the first
case --- but if it did, it'd surely still pick the second one,
because the estimated cost is a lot less. So undoing that choice
would not help the blackduck folks.

I do think we should do something about this, though. My suggestion
is that we should always presort in the planner if the SAOP argument
is a Const array, and then skip the run-time sort if the executor
sees the argument is a Const. Yes, there will be cases where the
plan-time sort is wasted effort, but not too darn many.

An alternative thought is that maybe the run-time sort is expensive
enough that the planner ought to account for it in its estimates.
However, that's a bit of a research project, and I don't think we'd
dare shove it into v17 at this point even if it turns out to fix this
particular case. But a pre-sort seems like a pretty safe change.

regards, tom lane
On Thu, Jul 31, 2025 at 4:24 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Well, maybe it was a mistake to no longer consider such plans, but
> this example doesn't prove it.

To be clear, I strongly suspect that it wasn't a mistake. I was just
laying out options.

> Quoting the submitted readme file,
> we selected this plan in v15:
>
> -> Index Scan using zsf_pkey on zsf sf (cost=1.49..1.51 rows=1 width=24) (actual time=0.001..0.001 rows=1 loops=47089)
>    Index Cond: (id = sdo.sfi)
>    Filter: (cid = ANY ('{...}'::bigint[]))
>
> versus this in v17:
>
> -> Index Only Scan using zsf_id_fpi_cid_key on zsf sf (cost=0.29..0.31 rows=1 width=24) (actual time=0.023..0.023 rows=1 loops=47089)
>    Index Cond: ((id = sdo.sfi) AND (cid = ANY ('{...}'::bigint[])))
>
> IIUC you're saying the planner no longer even considers the first
> case

Yes, exactly.

> but if it did, it'd surely still pick the second one,
> because the estimated cost is a lot less. So undoing that choice
> would not help the blackduck folks.

Right.

Theoretically, we could go back to generating these filter qual index
paths if we also revised the costing to account for every little
nuance. But I have little faith in our ability to come up with a cost
model that could ever be accurate enough to reliably pick between two
very similar alternatives like this. There are huge disadvantages to
using filter quals when index quals can be used instead. Weighing that
against the cost of the sort seems like it would be very brittle.

> I do think we should do something about this, though. My suggestion
> is that we should always presort in the planner if the SAOP argument
> is a Const array, and then skip the run-time sort if the executor
> sees the argument is a Const.

I agree.

Is there a convenient choke point for this in the planner?

I suspect that making this change will have a side-effect: it'll make
EXPLAIN show the array as sorted and deduplicated. That seems like a
minor positive to me, but it's something to consider.

> Yes, there will be cases where the
> plan-time sort is wasted effort, but not too darn many.

It's also only something that is noticeable with very large arrays,
which are themselves kind of rare.

> An alternative thought is that maybe the run-time sort is expensive
> enough that the planner ought to account for it in its estimates.

I tend to doubt that this will ever make much sense.

--
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
> On Thu, Jul 31, 2025 at 4:24 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I do think we should do something about this, though. My suggestion
>> is that we should always presort in the planner if the SAOP argument
>> is a Const array, and then skip the run-time sort if the executor
>> sees the argument is a Const.

> I agree.

Cool, will you do the legwork?

> Is there a convenient choke point for this in the planner?

I'd be inclined to do it as late as possible, in create_plan
(so that we don't expend the effort if we don't choose that
index path). So in or near fix_indexqual_references is
probably a good spot.

> I suspect that making this change will have a side-effect: it'll make
> EXPLAIN show the array as sorted and deduplicated. That seems like a
> minor positive to me, but it's something to consider.

Indeed. We can make use of that in test cases, perhaps.

>> An alternative thought is that maybe the run-time sort is expensive
>> enough that the planner ought to account for it in its estimates.

> I tend to doubt that this will ever make much sense.

As you say, getting the cost estimates accurate enough is daunting,
which is why I called it a research project.

regards, tom lane
On Thu, Jul 31, 2025 at 5:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Cool, will you do the legwork?

I'll give it a go.

> > Is there a convenient choke point for this in the planner?
>
> I'd be inclined to do it as late as possible, in create_plan
> (so that we don't expend the effort if we don't choose that
> index path). So in or near fix_indexqual_references is
> probably a good spot.

Understood.

> >> An alternative thought is that maybe the run-time sort is expensive
> >> enough that the planner ought to account for it in its estimates.
>
> > I tend to doubt that this will ever make much sense.
>
> As you say, getting the cost estimates accurate enough is daunting,
> which is why I called it a research project.

I think that it might make sense to add a little more startup cost to
plans with SAOP arrays. But that doesn't seem likely to help with this
particular problem.

--
Peter Geoghegan
On 7/31/25, 4:24 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
> versus this in v17:
>
> -> Index Only Scan using zsf_id_fpi_cid_key on zsf sf (cost=0.29..0.31 rows=1 width=24) (actual time=0.023..0.023 rows=1 loops=47089)
>    Index Cond: ((id = sdo.sfi) AND (cid = ANY ('{...}'::bigint[])))

I've been doing some experimenting with the indexes on the zsf table.
As a reminder, the definition of the zsf table is

 Column |  Type  | Collation | Nullable | Default
--------+--------+-----------+----------+---------
 id     | bigint |           | not null |
 cid    | bigint |           | not null |
 fpi    | bigint |           | not null |
Indexes:
    "zsf_pkey" PRIMARY KEY, btree (id)
    "zsf_cid_idx" btree (cid)
    "zsf_id_fpi_cid_key" UNIQUE CONSTRAINT, btree (id, fpi, cid)

As an experiment, I added another unique constraint that puts the two
columns referenced by the query first, giving

Indexes:
    "zsf_pkey" PRIMARY KEY, btree (id)
    "zsf_cid_idx" btree (cid)
    "zsf_id_cid_fpi_key" UNIQUE CONSTRAINT, btree (id, cid, fpi)
    "zsf_id_fpi_cid_key" UNIQUE CONSTRAINT, btree (id, fpi, cid)

Unexpectedly (to me), the planner still chose the same index as before
(with the unreferenced fpi in the middle).

Dropping the original unique constraint led the planner to switch to a
sequential scan of zsf:

Indexes:
    "zsf_pkey" PRIMARY KEY, btree (id)
    "zsf_cid_idx" btree (cid)
    "zsf_id_cid_fpi_key" UNIQUE CONSTRAINT, btree (id, cid, fpi)

Seq Scan on zsf sf (cost=1.20..751.52 rows=34211 width=24) (actual time=0.014..2.781 rows=34205 loops=1)
  Filter: (cid = ANY ('{...}'))

which I assume means that the planner estimated the seq scan to be
cheaper than using the unique index on (id, cid, fpi).

Adding a plain index on (id, cid) switches back to an index scan:

Indexes:
    "zsf_pkey" PRIMARY KEY, btree (id)
    "zsf_cid_idx" btree (cid)
    "zsf_id_cid_fpi_key" UNIQUE CONSTRAINT, btree (id, cid, fpi)
    "zsf_id_cid_idx" btree (id, cid)

Index Scan using zsf_id_cid_idx on zsf sf (cost=0.29..1010.86 rows=34211 width=24) (actual time=0.033..4.239 rows=34205 loops=1)
  Index Cond: (cid = ANY ('{...}'))

Every plan that didn't use the original unique index ran in ~60 to ~80 ms.

It seems to me (perhaps naively) that the planner is being inconsistent
about how it is costing the three possible index traversals. Why would
using zsf_id_cid_idx cost so much more than using the original index?

--
todd
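[For reference, the index changes described in that experiment could be reproduced with DDL along these lines. This is a sketch; the exact commands Todd ran were not shown, but the constraint and index names match the message:

-- add a unique constraint that puts the referenced columns first
ALTER TABLE zsf ADD CONSTRAINT zsf_id_cid_fpi_key UNIQUE (id, cid, fpi);

-- drop the original unique constraint backing zsf_id_fpi_cid_key
ALTER TABLE zsf DROP CONSTRAINT zsf_id_fpi_cid_key;

-- add the plain two-column index that switched the plan back to an index scan
CREATE INDEX zsf_id_cid_idx ON zsf (id, cid);
]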
On Thu, Jul 31, 2025 at 5:59 PM Todd Cook <cookt@blackduck.com> wrote:
> On 7/31/25, 4:24 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
> > versus this in v17:
> >
> > -> Index Only Scan using zsf_id_fpi_cid_key on zsf sf (cost=0.29..0.31 rows=1 width=24) (actual time=0.023..0.023 rows=1 loops=47089)
> >    Index Cond: ((id = sdo.sfi) AND (cid = ANY ('{...}'::bigint[])))
>
> I've been doing some experimenting with the indexes on the zsf table.

> Dropping the original unique constraint led the planner to switch to
> a sequential scan of zsf:

I noticed that too. The relevant zsf SAOP array has a unique match for
most of the rows in zsf, so it isn't particularly surprising that we
get a sequential scan when the details are tweaked/with a different
set of indexes.

> Adding a plain index on (id, cid) switches back to an index scan:
>
> Indexes:
>     "zsf_pkey" PRIMARY KEY, btree (id)
>     "zsf_cid_idx" btree (cid)
>     "zsf_id_cid_fpi_key" UNIQUE CONSTRAINT, btree (id, cid, fpi)
>     "zsf_id_cid_idx" btree (id, cid)
>
> Index Scan using zsf_id_cid_idx on zsf sf (cost=0.29..1010.86 rows=34211 width=24) (actual time=0.033..4.239 rows=34205 loops=1)
>   Index Cond: (cid = ANY ('{...}'))
>
> Every plan that didn't use the original unique index ran in ~60 to ~80 ms.

Notice that this plan shows "rows=34205 loops=1" for zsf_id_cid_idx --
not "rows=1 loops=47089", as in your original test case. It looks like
this revised plan makes its scan of zsf_id_cid_idx appear someplace
that isn't the inner side of a nested loop join.

In other words, it looks like the basic structure/join order of the
plan is significantly different to that of your original Postgres 15
plan (and that of your original Postgres 17 plan, which had the same
basic join order as the 15 one).

--
Peter Geoghegan
On Thu, Jul 31, 2025 at 5:40 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, Jul 31, 2025 at 5:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Cool, will you do the legwork?
>
> I'll give it a go.

Update on my progress:

I find that the test query takes ~95ms on Postgres 16. We're getting
"Inner Unique: true" on Postgres 16, but if I replace zsf_pkey with a
non-unique index it brings the execution time up slightly, to ~100ms.
Ideally, we'll be able to get Postgres 17+ at about parity with that.

I find that my WIP patch (which does the required sort in the planner
in the obvious way) brings the runtime down, from ~1500ms to ~165ms.
Obviously, this is a massive improvement -- but it's a bit
disappointing that we can't do better still. Getting closer to ~100ms
seems intuitively achievable to me, since the index scan we're getting
on 17 isn't supposed to be a whole lot different to the one we get on
16 (in spite of the fact that we're using a different index) -- all of
the other executor nodes from the plan are pretty much the same on
each version. Why the remaining shortfall?

One problem remains here: we're still doing more work than one would
hope at the start of btrescan, during array preprocessing. We're able
to skip the sort, of course, but just building a simple Datum array
via a call to deconstruct_array() is enough of a remaining bottleneck
to matter.

Ideally, we'd do *all* of the work once for each SAOP array, in the
planner (assuming a Const argument). In order to do that, we'd have to
make the planner/executor pass nbtree an array that has the exact
structure that it works with at runtime: a raw Datum array. I find
that once I make the planner and executor pass a raw Datum array,
we're much closer to my soft performance target: the query runtime
goes down to ~135ms. This isn't perfect, but it's much closer to the
theoretical ideal that I have in mind. We're still doing extra work in
the 17 index scan, compared to the one in 16, but I can't feel too bad
about that; looking up a separate ORDER proc for the lower-order
column isn't free, and being prepared to use a SAOP array necessitates
a little more memory allocation for preprocessing (we make
_bt_preprocess_keys use a partially-preprocessed copy of the original
input keys as its input keys when there's an SAOP array). So ~135ms is
roughly in line with what I expect.

The problem with this more ambitious approach is that it is also much
more invasive. It bleeds into things like the plan cache.
EXPLAIN/ruleutils.c would need its own built in way to show the qual
that can work with this alternative "raw datum array" representation,
which I haven't bothered adding. I doubt that that complexity will pay
for itself.

My inclination is to pursue the simpler approach, and just accept the
remaining performance shortfall. This is a rare enough case that I
think that that'll be acceptable. But input on how to make this
trade-off would be helpful.

--
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
> My inclination is to pursue the simpler approach, and just accept the
> remaining performance shortfall. This is a rare enough case that I
> think that that'll be acceptable. But input on how to make this
> trade-off would be helpful.

My thought here is that there is not a lot you can do in v17: it's
been out for a year and we dare not destabilize it. Take the easy
near-guaranteed win and be happy. Additional improvements can be
pursued at our leisure in the master branch.

regards, tom lane
I wrote:
> My thought here is that there is not a lot you can do in v17: it's
> been out for a year and we dare not destabilize it. Take the easy
> near-guaranteed win and be happy. Additional improvements can be
> pursued at our leisure in the master branch.

Also, the calendar is a factor here: we are only a week away from
release freeze for the August releases. Even if you're feeling
optimistic about what we could risk in v17 with some development
effort, I'd counsel first working on a minimal patch that would be
safe to apply on short notice, so we can get the easy win in this
release cycle.

regards, tom lane
On Fri, Aug 1, 2025 at 9:10 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> My thought here is that there is not a lot you can do in v17: it's
> been out for a year and we dare not destabilize it. Take the easy
> near-guaranteed win and be happy. Additional improvements can be
> pursued at our leisure in the master branch.

I agree. I always thought that it was unlikely that anybody would
seriously argue in favor of the more ambitious approach I laid out. I
nevertheless felt obligated to present it as an option.

I find it natural to approach fixing this regression (and any other
regression with a fairly crisp problem statement) by starting from the
premise that the query should be at least as fast as it was on a
previous release, and working backwards from there. Actually meeting
that standard might not be practically achievable, for whatever
reason. But even then there is still value in understanding and
clearly communicating why it is that we fell short of that standard.

--
Peter Geoghegan
On Fri, Aug 1, 2025 at 10:05 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Also, the calendar is a factor here: we are only a week away from
> release freeze for the August releases. Even if you're feeling
> optimistic about what we could risk in v17 with some development
> effort, I'd counsel first working on a minimal patch that would
> be safe to apply on short notice, so we can get the easy win
> in this release cycle.

Attached is what I came up with. It makes the problematic test case
query execute a little under 10x faster, compared to unpatched master
-- see my numbers from Friday (the numbers for the less ambitious
approach/the approach taken by this patch). This is still a WIP.

I'm unsure of how to assess my approach from a modularity perspective.
The patch simply assumes that any amsearcharray index AM is either
nbtree, or some other amsearcharray index AM that is okay with having
its array deduplicated based on nbtree rules/a B-Tree ORDER proc. So
there had better be a B-Tree opclass ORDER proc for us to work off of
(otherwise planning will throw an error). This is a little bit like
how we deal with RowCompareExpr within nodeIndexScan.c -- there are
similar hard-coded BTORDER_PROC lookups there. I don't *think* that I
need to use the new IndexAmTranslateCompareType facility (added by
commit c09e5a6a) for this, since most existing code doesn't bother
with any of that, either.

I haven't given much thought to the increase in planner cycles. I'll
stress-test that aspect of the patch soon.

Todd, Sajith: it would be helpful if you could test this patch.
Possibly by using the original problematic query, rather than the
minimized version that you posted to the list. The patch won't bring
performance up to parity with Postgres 15, but it should be a great
deal faster. Note that the patch that I've posted will only apply
against the current master branch (I'll prepare patches for earlier
branches once I have some buy-in). See
https://wiki.postgresql.org/wiki/Compile_and_Install_from_source_code
and https://www.postgresql.org/docs/current/installation.html for
instructions.

--
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
> Attached is what I came up with. It makes the problematic test case
> query execute a little under 10x faster, compared to unpatched
> master -- see my numbers from Friday (the numbers for the less
> ambitious approach/the approach taken by this patch). This is still a
> WIP.

I took a very quick look at this, and there are a couple of things
I don't like:

* I don't love the fact that _bt_presort_const_array shares no code
with _bt_sort_array_elements. Maybe there's nothing to be done
about it given that they're working on different representations.
But the comments could at least point out that the two functions
had better give equivalent results.

* I really don't like the fact that SK_PRESORTED gets set so far
away from the code that's actually doing the work to make that
valid (not to mention far from the code that consults that flag).
Even if it's bug-free today, that seems like an invitation to
maintenance problems. Can we factor that differently?

> I'm unsure of how to assess my approach from a modularity perspective.
> The patch simply assumes that any amsearcharray index AM is either
> nbtree, or some other amsearcharray index AM that is okay with having
> its array deduplicated based on nbtree rules/a B-Tree ORDER proc.

Well, it's worse than that: it's assuming that support procedure 1
is in fact a btree ordering proc in any opclass of an amsearcharray
AM. I don't think that's too safe. We have our hands on enough
info about the index that we could skip trying to do the sort if
it's not a btree, and I think we should do so, at least for today.

> This is a little bit like
> how we deal with RowCompareExpr within nodeIndexScan.c -- there are
> similar hard-coded BTORDER_PROC lookups there.

One big difference is that we *know* that the comparisons in a
RowCompareExpr are btree operators, because the parser constructed
it that way. We cannot make such assumptions about arbitrary
ScalarArrayOps, even if they are indexquals.

regards, tom lane
On Mon, Aug 4, 2025 at 6:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I took a very quick look at this

Thanks for the review.

> * I don't love the fact that _bt_presort_const_array shares no code
> with _bt_sort_array_elements. Maybe there's nothing to be done
> about it given that they're working on different representations.
> But the comments could at least point out that the two functions
> had better give equivalent results.

I'll try to improve matters here for the next revision. But it
probably won't look all that different.

> * I really don't like the fact that SK_PRESORTED gets set so far
> away from the code that's actually doing the work to make that
> valid (not to mention far from the code that consults that flag).
> Even if it's bug-free today, that seems like an invitation to
> maintenance problems. Can we factor that differently?

It sounds like you're not so much complaining about SK_PRESORTED
itself as you are about the v1 patch's assumption that all scans of
amsearcharray-index-AM indexes with a Const arg must already be
presorted. In other words, it seems as if this complaint is closely
related to (if not exactly the same as) your later complaint about
modularity issues.

If there *was* a separate "presorted" SAOP property/struct field, then
you might not find it odd to see a SK_PRESORTED flag, used to
represent that presortedness property at the scan key level. It'd be
fairly similar to the way that we already don't pass a
ScalarArrayOpExpr down to nbtree directly (we just pass down an
ArrayType and set the SK_SEARCHARRAY scan key flag). Note that
SK_SEARCHARRAY is set in precisely one place on the master branch,
which is also the only place where the new SK_PRESORTED flag gets set
with the patch applied: within ExecIndexBuildScanKeys.

> > I'm unsure of how to assess my approach from a modularity perspective.
> > The patch simply assumes that any amsearcharray index AM is either
> > nbtree, or some other amsearcharray index AM that is okay with having
> > its array deduplicated based on nbtree rules/a B-Tree ORDER proc.
>
> Well, it's worse than that: it's assuming that support procedure 1
> is in fact a btree ordering proc in any opclass of an amsearcharray
> AM. I don't think that's too safe.

Agreed that that's not safe (and wouldn't be even if it was fine to
make aggressive assumptions about amsearcharray index AMs being
nbtree-like).

> We have our hands on enough
> info about the index that we could skip trying to do the sort if
> it's not a btree, and I think we should do so, at least for today.

I think that this boils down to the following (correct me if I'm
wrong):

There is a need to remember that we've presorted, rather than making
any kind of hardcoded assumption that that has happened. In practice
we'll always presort with a Const-arg'd ScalarArrayOp nbtree index
qual, but ExecIndexBuildScanKeys has no business knowing about any of
that. This can be achieved by limiting application of the optimization
to nbtree from within createplan.c -- probably by testing
"IndexOptInfo.relam == BTREE_AM_OID" (much like
match_rowcompare_to_indexcol does it already).

Does that summary sound about right? And if it does, do you have any
thoughts on the best place for the new "presorted" struct field?
Should it go in the ScalarArrayOpExpr struct itself?

--
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
> On Mon, Aug 4, 2025 at 6:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> We have our hands on enough
>> info about the index that we could skip trying to do the sort if
>> it's not a btree, and I think we should do so, at least for today.

> I think that this boils down to the following (correct me if I'm wrong):
> There is a need to remember that we've presorted, rather than making
> any kind of hardcoded assumption that that has happened. In practice
> we'll always presort with a Const-arg'd ScalarArrayOp nbtree index
> qual, but ExecIndexBuildScanKeys has no business knowing about any of
> that. This can be achieved by limiting application of the optimization
> to nbtree from within createplan.c -- probably by testing
> "IndexOptInfo.relam == BTREE_AM_OID" (much like
> match_rowcompare_to_indexcol does it already).

I agree that ExecIndexBuildScanKeys should not be hard-wiring such
an assumption: it's too far away from both the producer and the
consumer of the knowledge. Having said that, if we're to produce
a back-patchable fix, I'm not seeing a way to have an end-to-end
confirmation that the planner did the sort. That would require some
sort of change to the plan tree. I'd be for doing that in master
only, but I think it's unworkable given ABI restrictions in v17.

I think the most expedient answer for v17 is to have a private
agreement between createplan.c and access/nbtree that if the
index type is btree then createplan.c will presort Const array
arguments for ScalarArrayOp indexquals. The nbtree code would
have to check for this by looking to see if the indexqual node
is a Const.

I do agree with the idea that a scankey flag could be a better
answer in HEAD. But it has to be fed forward from some planner
output, or else it's just a kluge. Not that what I said in the
previous para isn't a kluge ... but my kluge touches fewer bits
of code than your v1, so I think it's better. Ugliness is in
the eye of the beholder of course, so maybe you'll disagree.

Anyway, my points are (1) it's definitely not safe to call the
BTORDER_PROC unless the index is a btree, and (2) we should not
expect that what we do today to put into the back branches will
be a good long-term answer.

> And if it does, do you have any
> thoughts on the best place for the new "presorted" struct field?
> Should it go in the ScalarArrayOpExpr struct itself?

Doubt it. I suppose if we had an immediate application for
presorting ScalarArrayOpExpr arrays in other contexts, that
could start to look attractive. But for now I'd think more
about adding additional info to IndexScan plan nodes. Maybe
it'd be interesting to store an array of scankey flag bitmasks,
with an eye to precomputing some of those flags at plan time?

regards, tom lane
Re: Postgres: Queries are too slow after upgrading to PG17 from PG15
Hi Peter,
Thank you for sharing the patch in response to the bug I raised with the Postgres community. I appreciate your prompt support and the effort you’ve put into addressing the issue.
Following your email, I noticed that Tom has provided additional suggestions and raised some concerns regarding the proposed changes. Before proceeding with testing the patch against the original problematic query, I wanted to check with you — should I go ahead and apply the current patch, or would it be better to wait for any revised version based on Tom’s feedback?
Please let me know what you recommend. I’m happy to test whenever you feel the patch is ready.
Thanks again for your guidance and support.
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Tuesday, 5 August 2025 at 7:40 AM
To: Peter Geoghegan <pg@bowt.ie>
Cc: Sajith Prabhakar Shetty <ssajith@blackduck.com>, Andrei Lepikhov <lepihov@gmail.com>, pgsql-bugs@lists.postgresql.org <pgsql-bugs@lists.postgresql.org>, Todd Cook <cookt@blackduck.com>
Subject: Re: Postgres: Queries are too slow after upgrading to PG17 from PG15
Peter Geoghegan <pg@bowt.ie> writes:
> On Mon, Aug 4, 2025 at 6:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> We have our hands on enough
>> info about the index that we could skip trying to do the sort if
>> it's not a btree, and I think we should do so, at least for today.
> I think that this boils down to the following (correct me if I'm wrong):
> There is a need to remember that we've presorted, rather than making
> any kind of hardcoded assumption that that has happened. In practice
> we'll always presort with a Const-arg'd ScalarArrayOp nbtree index
> qual, but ExecIndexBuildScanKeys has no business knowing about any of
> that. This can be achieved by limiting application of the optimization
> to nbtree from within createplan.c -- probably by testing
> "IndexOptInfo.relam == BTREE_AM_OID" (much like
> match_rowcompare_to_indexcol does it already).
I agree that ExecIndexBuildScanKeys should not be hard-wiring such
an assumption: it's too far away from both the producer and the
consumer of the knowledge. Having said that, if we're to produce
a back-patchable fix, I'm not seeing a way to have an end-to-end
confirmation that the planner did the sort. That would require some
sort of change to the plan tree. I'd be for doing that in master
only, but I think it's unworkable given ABI restrictions in v17.
I think the most expedient answer for v17 is to have a private
agreement between createplan.c and access/nbtree that if the
index type is btree then createplan.c will presort Const array
arguments for ScalarArrayOp indexquals. The nbtree code would
have to check for this by looking to see if the indexqual node
is a Const.
I do agree with the idea that a scankey flag could be a better
answer in HEAD. But it has to be fed forward from some planner
output, or else it's just a kluge. Not that what I said in the
previous para isn't a kluge ... but my kluge touches fewer bits
of code than your v1, so I think it's better. Ugliness is in
the eye of the beholder of course, so maybe you'll disagree.
Anyway, my points are (1) it's definitely not safe to call the
BTORDER_PROC unless the index is a btree, and (2) we should not
expect that what we do today to put into the back branches will
be a good long-term answer.
> And if it does, do you have any
> thoughts on the best place for the new "presorted" struct field?
> Should it go in the ScalarArrayOpExpr struct itself?
Doubt it. I suppose if we had an immediate application for
presorting ScalarArrayOpExpr arrays in other contexts, that
could start to look attractive. But for now I'd think more
about adding additional info to IndexScan plan nodes. Maybe
it'd be interesting to store an array of scankey flag bitmasks,
with an eye to precomputing some of those flags at plan time?
regards, tom lane
On Tue, Aug 5, 2025 at 1:31 AM Sajith Prabhakar Shetty
<ssajith@blackduck.com> wrote:
> Before proceeding with testing the patch against the original problematic query, I wanted to check with you — should I go ahead and apply the current patch, or would it be better to wait for any revised version based on Tom’s feedback?

I doubt that Tom's feedback will affect the performance profile of the
patch. v1 is very likely to be representative of the final patch in
every way that your testing will consider, so I see no reason to wait.

--
Peter Geoghegan
On 8/4/25, 4:12 PM, "Peter Geoghegan" <pg@bowt.ie> wrote:
> Todd, Sajith: it would be helpful if you could test this patch.
> Possibly by using the original problematic query, rather than the
> minimized version that you posted to the list. The patch won't bring
> performance up to parity with Postgres 15, but it should be a great
> deal faster. Note that the patch that I've posted will only apply
> against the current master branch (I'll prepare patches for earlier
> branches once I have some buy-in).

I'm working with our performance testing team to rerun our load tests
with the patch applied. However, the current infrastructure doesn't
support deploying custom PG builds, so it's unlikely we'll be able to
provide results in time for the upcoming minor releases.

Also, is it significant effort to produce a patch for PG 17? Running
the load tests against master would make it chancy to compare the
results with the data we already have.

--
todd
On Thu, Aug 7, 2025 at 11:58 AM Todd Cook <cookt@blackduck.com> wrote:
>
> On 8/4/25, 4:12 PM, "Peter Geoghegan" <pg@bowt.ie> wrote:
> > Todd, Sajith: it would be helpful if you could test this patch.
> > Possibly by using the original problematic query, rather than the
> > minimized version that you posted to the list. The patch won't bring
> > performance up to parity with Postgres 15, but it should be a great
> > deal faster. Note that the patch that I've posted will only apply
> > against the current master branch (I'll prepare patches for earlier
> > branches once I have some buy-in).
>
> I'm working with our performance testing team to rerun our load tests
> with the patch applied. However, the current infrastructure doesn't
> support deploying custom PG builds, so it's unlikely we'll be able to
> provide results in time for the upcoming minor releases.
>
> Also, is it significant effort to produce a patch for PG 17? Running the
> load tests against master would make it chancy to compare the results
> with the data we already have.

Looks like it, _bt_preprocess_array_keys was whacked around quite a
bit in master vs 17 as part of the skip scan work, and a couple of the
bits the patch relies on are not there. I would also question the
point of that, since the patch vs master is much closer to what you
would ultimately be using as master is pretty close to the v18 release
(currently in beta, scheduled for sep/oct).

I think it is very likely going to address your issue, but
confirmation helps, and will make it more likely for this patch to be
pushed. After that, you will have to decide to downgrade to 15 or
suffer the status quo until 18 releases. Running a custom patched 17
is possible if a patch could be made to appear, but I do not recommend
it given where you are at in the release calendar.

merlin
Re: Postgres: Queries are too slow after upgrading to PG17 from PG15
Database Server | Execution Time | Execution Plan | Notes |
---|---|---|---|
PostgreSQL 17.5 (Stable) | 9–10 sec | Execution time remains consistent across runs. | |
PostgreSQL 16.9 (Stable) | 1–2 sec | First run takes ~5 sec; subsequent runs stabilize at 1–2 sec. | |
PostgreSQL Master (with patch) | 5–6 sec | Execution time is consistent across runs. | |
Date: Saturday, 9 August 2025 at 3:34 AM
To: Todd Cook <cookt@blackduck.com>
Cc: Peter Geoghegan <pg@bowt.ie>, Tom Lane <tgl@sss.pgh.pa.us>, Sajith Prabhakar Shetty <ssajith@blackduck.com>, Andrei Lepikhov <lepihov@gmail.com>, pgsql-bugs@lists.postgresql.org <pgsql-bugs@lists.postgresql.org>
Subject: Re: Postgres: Queries are too slow after upgrading to PG17 from PG15
>
> On 8/4/25, 4:12 PM, "Peter Geoghegan" <pg@bowt.ie> wrote:
> > Todd, Sajith: it would be helpful if you could test this patch.
> > Possibly by using the original problematic query, rather than the
> > minimized version that you posted to the list. The patch won't bring
> > performance up to parity with Postgres 15, but it should be a great
> > deal faster. Note that the patch that I've posted will only apply
> > against the current master branch (I'll prepare patches for earlier
> > branches once I have some buy-in).
>
> I'm working with our performance testing team to rerun our load tests
> with the patch applied. However, the current infrastructure doesn't
> support deploying custom PG builds, so it's unlikely we'll be able to
> provide results in time for the upcoming minor releases.
>
> Also, is it significant effort to produce a patch for PG 17? Running the
> load tests against master would make it chancy to compare the results
> with the data we already have.
Looks like it; _bt_preprocess_array_keys was whacked around quite a
bit in master vs 17 as part of the skip scan work, and a couple of the
bits the patch relies on are not there. I would also question the
point of that, since the patch against master is much closer to what
you would ultimately be using, as master is pretty close to the v18
release (currently in beta, scheduled for Sep/Oct).
I think it is very likely going to address your issue, but
confirmation helps, and will make it more likely for this patch to be
pushed. After that, you will have to decide to downgrade to 15 or
suffer the status quo until 18 releases. Running a custom patched 17
is possible if a patch could be made to appear, but I do not recommend
it given where you are at in the release calendar.
merlin
Attachments
On Mon, Aug 11, 2025 at 10:01 AM Sajith Prabhakar Shetty <ssajith@blackduck.com> wrote:
> While the patch improves performance compared to PG17, it still doesn't match the efficiency observed in PG16.9.
Postgres 17 shows "Buffers: shared hit=5071180 read=42049" for the top-level scan node/the plan as a whole, while patched master shows "Buffers: shared hit=5608415 read=103886 written=2". There are only minor differences in each plan. It looks like these results aren't representative.
I see lots of heap fetches for index-only scans on patched master. That factor alone could easily make a huge difference. I suggest running "VACUUM ANALYZE" on both setups to get more consistent results. I also suggest using pg_prewarm and/or repeated execution to make sure that the number of buffer misses/reads is kept to an absolute minimum. You should also make sure to use exactly the same settings for each test case -- ideally while using the same hardware for both.
> Is there any scope for further optimization to bring it closer to PG16's performance levels?
Probably not. At least not if it must be backpatched to Postgres 17. The patch needs to be reasonably non-invasive for that to happen.
--
Peter Geoghegan
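As a concrete illustration of that advice -- a minimal sketch only, using the stream_file table and ui_stream_file_id_component index from the problematic plan, and assuming the contrib modules are available:

VACUUM ANALYZE stream_file;   -- refresh the visibility map and statistics so
                              -- index-only scans can actually skip heap fetches

CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('ui_stream_file_id_component');   -- pull the index into shared buffers
SELECT pg_prewarm('stream_file');                    -- and the heap, so "read=" stays near zero

-- Then run the query several times and compare the steady-state numbers;
-- the problematic query itself is not reproduced here.
EXPLAIN (ANALYZE, BUFFERS) <problematic query>;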
Re: Postgres: Queries are too slow after upgrading to PG17 from PG15
Date: Tuesday, 12 August 2025 at 10:13 AM
To: Peter Geoghegan <pg@bowt.ie>
Cc: Merlin Moncure <mmoncure@gmail.com>, Tom Lane <tgl@sss.pgh.pa.us>, Andrei Lepikhov <lepihov@gmail.com>, pgsql-bugs@lists.postgresql.org <pgsql-bugs@lists.postgresql.org>, Todd Cook <cookt@blackduck.com>
Subject: Re: Postgres: Queries are too slow after upgrading to PG17 from PG15
VACUUM and ANALYZE were executed on all three instances whose results I shared. Additionally, each instance was freshly created on the same hardware and configured with the same postgresql.conf settings.
Date: Tuesday, 12 August 2025 at 12:15 AM
To: Sajith Prabhakar Shetty <ssajith@blackduck.com>
Cc: Merlin Moncure <mmoncure@gmail.com>, Tom Lane <tgl@sss.pgh.pa.us>, Andrei Lepikhov <lepihov@gmail.com>, pgsql-bugs@lists.postgresql.org <pgsql-bugs@lists.postgresql.org>, Todd Cook <cookt@blackduck.com>
Subject: Re: Postgres: Queries are too slow after upgrading to PG17 from PG15
<ssajith@blackduck.com> wrote:
> While the patch improves performance compared to PG17, it still doesn't match the efficiency observed in PG16.9.
Postgres 17 shows "Buffers: shared hit=5071180 read=42049" for the
top-level scan node/the plan as a whole, while patched master shows
"Buffers: shared hit=5608415 read=103886 written=2". There are only
minor differences in each plan. It looks like these results aren't
representative.
I see lots of heap fetches for index-only scans on patched master.
That factor alone could easily make a huge difference. I suggest
running "VACUUM ANALYZE" on both setups to get more consistent
results. I also suggest using pg_prewarm and/or repeated execution to
make sure that the number of buffer misses/reads is kept to an
absolute minimum. You should also make sure to use exactly the same
settings for each test case -- ideally while using the same hardware
for both.
> Is there any scope for further optimization to bring it closer to PG16's performance levels?
Probably not. At least not if it must be backpatched to Postgres 17.
The patch needs to be reasonably non-invasive for that to happen.
--
Peter Geoghegan
Attachments
On Tue, Aug 12, 2025 at 12:46 AM Sajith Prabhakar Shetty <ssajith@blackduck.com> wrote:
> I can confirm with full certainty that both VACUUM and ANALYZE were executed on all three instances whose results I shared.
The results that you shared showed "Heap Fetches: 598,916" -- one heap fetch per index search (for the problematic index scan).
Perhaps VACUUM ran without being truly effective, due to the removable cutoff/horizon being held back by a long running query. Though that generally won't happen on a quiescent system -- so it really does look like VACUUM hasn't been run.
--
Peter Geoghegan
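For what it's worth, a quick way to check for that condition -- a sketch only; it lists the sessions whose snapshots could be holding back the removable horizon:

SELECT pid, state, xact_start, age(backend_xmin) AS xmin_age, left(query, 60) AS query
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC;
-- A large xmin_age on an idle-in-transaction or long-running session would explain
-- why VACUUM could not mark the table's pages all-visible.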
On Wed, Aug 20, 2025 at 7:48 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Tue, Aug 12, 2025 at 12:46 AM Sajith Prabhakar Shetty
> <ssajith@blackduck.com> wrote:
> > I can confirm with full certainty that both VACUUM and ANALYZE were executed on all three instances whose results I shared.
>
> The results that you shared showed "Heap Fetches: 598,916" -- one heap
> fetch per index search (for the problematic index scan).
>
> Perhaps VACUUM ran without being truly effective, due to the removable
> cutoff/horizon being held back by a long running query. Though that
> generally won't happen on a quiescent system -- so it really does look
> like VACUUM hasn't been run.
Try running VACUUM FREEZE. Do you have a way to frame up a dataset for testing?
merlin
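A sketch of that suggestion, again assuming the stream_file table; pg_visibility is a contrib module that reports how much of the table the visibility map already covers:

VACUUM (FREEZE, ANALYZE, VERBOSE) stream_file;

CREATE EXTENSION IF NOT EXISTS pg_visibility;
SELECT all_visible, all_frozen
FROM pg_visibility_map_summary('stream_file');
-- If all_visible is close to the table's total page count, a repeated run of the
-- query should show "Heap Fetches:" near zero for the index-only scan.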
On 8/8/25, 6:04 PM, "Merlin Moncure" <mmoncure@gmail.com> wrote:
> I think it is very likely going to address your issue, but
> confirmation helps, and will make it more likely for this patch to be
> pushed. After that, you will have to decide to downgrade to 15 or
> suffer the status quo until 18 releases. Running a custom patched 17
> is possible if a patch could be made to appear, but I do not recommend
> it given where you are at in the release calendar.
We're still working on providing confirmation. Our load testing environment depends on third-party PG providers, so there's a significant amount of ongoing work to set up a configuration where we can use a custom PG build. This is a high priority for us, so it will get done, though maybe not for a couple more weeks yet.
--
todd
On 8/25/25, 8:30 AM, "Todd Cook" <cookt@blackduck.com> wrote:
> On 8/8/25, 6:04 PM, "Merlin Moncure" <mmoncure@gmail.com> wrote:
>> I think it is very likely going to address your issue, but
>> confirmation helps, and will make it more likely for this patch to be
>> pushed. After that, you will have to decide to downgrade to 15 or
>> suffer the status quo until 18 releases. Running a custom patched 17
>> is possible if a patch could be made to appear, but I do not recommend
>> it given where you are at in the release calendar.
>
> We're still working on providing confirmation. Our load testing environment depends on
> third-party PG providers, so there's a significant amount of ongoing work to set up a
> configuration where we can use a custom PG build. This is a high priority for us, so it will
> get done, though maybe not for a couple more weeks yet.
We've finished the tests comparing PG18rc1 with and without the proposed patch, including several runs each to verify reproducibility. In short, the variance between the two is well within what we normally see, so statistically speaking, the results are indistinguishable.
However, the results are roughly on par with what we had been seeing with PG 15 and 16, so maybe some other change from 17 to 18 eliminated the regression we were seeing?
--
todd
On Thu, Sep 25, 2025 at 2:07 PM Todd Cook <cookt@blackduck.com> wrote:
> We've finished the tests comparing PG18rc1 with and without the proposed patch,
> including several runs each to verify reproducibility. In short, the variance between the
> two is well within what we normally see, so statistically speaking, the results are
> indistinguishable.
That's interesting. It clearly wasn't like that with your minimal test case, but the issue with the actual query is perhaps more complicated/has more than one issue in play.
> However, the results are roughly on par with what we had been
> seeing with PG 15 and 16, so maybe some other change from 17 to 18 eliminated the
> regression we were seeing?
There's no specific reason to expect this: there's simply no way that Postgres 18 (or Postgres 17) will get the same plan that you had on 15 and 16. That is, it is simply impossible for 17 or 18 to use a plan where the problematic index scan node shows its ANY condition as a filter qual ("Filter: ...") -- it *must* be a true index qual ("Index Cond: ...").
I always understood that your observed slowdown was a perverse consequence of this relatively minor plan difference, since of course that was precisely what your original minimal test case actually demonstrated ("perf top" showed we were bottlenecked on sorting the SAOP array during index scan preprocessing).
What does EXPLAIN ANALYZE actually show you on 18, compared to 16, with the same real world (non-minimal) test case? Are the joins and scan nodes all the same as before (i.e. is the Postgres 18 plan *identical* to the "bad" Postgres 17 plan)?
Do you see any "Heap Fetches:", particularly with the problematic index-only scan? Could it just be that you made sure to run VACUUM ahead of the test this time, allowing the index-only scan seen on Postgres 17 and 18 to avoid heap accesses?
Recall that the 15 and 16 plan had a plain index scan on another index, and that the 17 plan you showed a few weeks back had "Heap Fetches: 598,916" (meaning that the index-only scan was completely ineffective at avoiding heap accesses). If you're now able to get the most out of index-only access, it could be enough to flip things in favor of the new plan -- in spite of the fact that there is definitely still a regression tied to needlessly sorting the scan's SAOP array many times.
--
Peter Geoghegan
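One way to capture that kind of per-query detail from a long application-level load test, without running EXPLAIN by hand, is the auto_explain contrib module -- a sketch with illustrative thresholds only (log_analyze adds timing overhead):

LOAD 'auto_explain';                          -- or add it to session_preload_libraries
SET auto_explain.log_min_duration = '250ms';  -- log plans only for slower statements
SET auto_explain.log_analyze = on;            -- include actual row counts and timings
SET auto_explain.log_buffers = on;            -- include shared hit/read counts
SET auto_explain.log_nested_statements = on;  -- also cover statements run inside functions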
On 9/25/25, 2:46 PM, "Peter Geoghegan" <pg@bowt.ie> wrote:
> What does EXPLAIN ANALYZE actually show you on 18, compared to 16,
> with the same real world (non-minimal) test case? Are the joins and
> scan nodes all the same as before (i.e. is the Postgres 18 plan
> *identical* to the "bad" Postgres 17 plan)?
>
> Do you see any "Heap Fetches:", particularly with the problematic
> index-only scan? Could it just be that you made sure to run VACUUM
> ahead of the test this time, allowing the index-only scan seen on
> Postgres 17 and 18 to avoid heap accesses?
>
> Recall that the 15 and 16 plan had a plain index scan on another
> index, and that the 17 plan you showed a few weeks back had "Heap
> Fetches: 598,916" (meaning that the index-only scan was completely
> ineffective at avoiding heap accesses). If you're now able to get the
> most out of index-only access, it could be enough to flip things in
> favor of the new plan -- in spite of the fact that there is definitely
> still a regression tied to needlessly sorting the scan's SAOP array
> many times.
The tests I reported on are application-level load tests that last about 10 hours and generate ~4 million queries, so I don't have any of that per-query info. We generate a lot of queries with "IN (constant_list)" expressions, so that seemed like a logical explanation, but maybe something else is going on too? Nothing really stands out from crawling through pg_stat_statements, so maybe it's a small effect that is multiplied by repetition?
FWIW, except for large multi-join queries with "IN (constant_list)" expressions, nearly every individual query I run is as fast or faster in 17 than in 16. The ones that are slower are ones that were previously munged around to get a specific plan; those that I've unmunged have so far been much faster on 17.
It just occurred to me while typing this that I should go count joins to see if we're exceeding join_collapse_limit. Could something have changed in 17 that would affect how such queries are planned?
--
todd
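For reference, a rough sketch of those two checks -- the GUC values shown are the defaults, and the pg_stat_statements query is only one way of surfacing the heavily repeated IN-list statements:

SHOW join_collapse_limit;       -- default 8; queries joining more relations than this
SHOW from_collapse_limit;       -- are not fully reordered by the planner
SET join_collapse_limit = 16;   -- raise per session to test whether it changes plans
SET from_collapse_limit = 16;

SELECT calls,
       round(mean_exec_time::numeric, 3) AS mean_ms,
       round(total_exec_time::numeric)   AS total_ms,
       left(query, 80)                   AS query
FROM pg_stat_statements
WHERE query LIKE '%IN (%'
ORDER BY total_exec_time DESC
LIMIT 20;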
On Thu, Sep 25, 2025 at 3:28 PM Todd Cook <cookt@blackduck.com> wrote:
> The tests I reported on are application-level load tests that last about
> 10 hours and generate ~4 million queries, so I don't have any of that
> per-query info. We generate a lot of queries with "IN (constant_list)"
> expressions, so that seemed like a logical explanation, but maybe
> something else is going on too?
The Postgres 17 improvements to "IN (constant_list)" index scans (which indirectly caused this regression) tend to make things much better with workloads such as this. For example, one blog post reported an across-the-board 30% increase in application throughput after an upgrade to 17, which was tied back to that work:
https://www.crunchydata.com/blog/real-world-performance-gains-with-postgres-17-btree-bulk-scans
So, yes, it's plausible that an application that made heavy use of queries with large IN() lists could be much faster now. For better or worse, that's a reasonably common pattern with the "eager loading" used by frameworks such as Ruby on Rails.
> Nothing really stands out from
> crawling through pg_stat_statements, so maybe it's a small effect that
> is multiplied by repetition?
Should be noted that the regression itself is definitely "a small effect that is multiplied by repetition". This makes the problem rather perverse (though it's not exactly unusual for a complaint about a planner regression to be perverse, in one way or another).
> FWIW, except for large multi-join queries with "IN (constant_list)"
> expressions, nearly every individual query I run is as fast or faster in
> 17 than in 16. The ones that are slower are ones that were previously
> munged around to get a specific plan; those that I've unmunged have
> so far been much faster on 17.
That's great news.
> It just occurred to me while typing this that I should go count joins to
> see if we're exceeding join_collapse_limit. Could something have
> changed in 17 that would affect how such queries are planned?
It'll be hard (very hard) to establish any relationship between the overall improvements you're seeing and any changes in the planner. Anything is possible, but in general planner changes rarely make an across-the-board difference to most of an application's queries. Plus the workload itself is probably fairly varied and complicated.
I only suggested that the improvements to the B-Tree code were likely relevant because you mentioned that your application had many queries with "IN (constant_list)", often with larger arrays of constants (hundreds, say). And because I've direct knowledge of across-the-board speedups for applications with those kinds of queries (the blog post is the best example of this).
--
Peter Geoghegan
On 9/25/25, 3:47 PM, "Peter Geoghegan" <pg@bowt.ie> wrote:
> Should be noted that the regression itself is definitely "a small
> effect that is multiplied by repetition". This makes the problem
> rather perverse (though it's not exactly unusual for a complaint about
> a planner regression to be perverse, in one way or another).
Thanks for confirming. We can develop a strategy for working around this effect, so from my point of view, we can consider this matter resolved. I'm guessing that you feel the same way, but if not, I'm happy to help out in any way I can.
In any case, I do appreciate your efforts and the time you took to write out the very helpful explanations you provided.
--
todd
On Fri, Sep 26, 2025 at 8:30 AM Todd Cook <cookt@blackduck.com> wrote:
> Thanks for confirming. We can develop a strategy for working around
> this effect, so from my point of view, we can consider this matter
> resolved. I'm guessing that you feel the same way, but if not, I'm
> happy to help out in any way I can.
Yes, I do feel the same way. I'm going to withdraw my proposed patch, at least for now. If there's another complaint along the same lines at some point in the future, then I'll certainly need to reconsider my patch.
--
Peter Geoghegan