Discussion: Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From:
Sajith Prabhakar Shetty
Date:

Hi Andrei,

 

Thank you and sorry for the delay in response.

 

We have found the slowness in multiple other products in our portfolio and we are blocked to proceed further.

 

Regarding your point below, it is true that the index-only scan is 2.5 times slower on PG17, and the primary difference is in the index-only scan of ui_stream_file_id_component at line 14. It takes 6 microseconds per row in PG 15, 2 microseconds per row in PG 16, and 14 microseconds in PG 17.

 

I found something in the PG17 release notes (http://postgresql.org/docs/release/17.0/) that might be related:

 

                "Allow btree indexes to more efficiently find a set of values, such as those supplied by IN clauses using constants."

 

@Tom Lane, sorry I could not get you a reproducer yet, but please do comment on this insight in the meantime while I work on it.

 

Thanks.

 

 

 

Sajith P Shetty

Principal Engineer

Black Duck 

+91 9448389989 | ssajith@blackduck.com


 

 

From: Andrei Lepikhov <lepihov@gmail.com>
Date: Thursday, 15 May 2025 at 7:56 PM
To: Sajith Prabhakar Shetty <ssajith@blackduck.com>, pgsql-bugs@lists.postgresql.org <pgsql-bugs@lists.postgresql.org>
Cc: Peter Geoghegan <pg@bowt.ie>
Subject: Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

On 15/5/2025 07:33, Sajith Prabhakar Shetty wrote:
> Hi,
>
>
> Most of the queries got slower after upgrading our postgres from version
> 15 to 17 using pg_upgrade. I reconfirmed that "vacuum full, analyze"
> were all taken care.
>
> To debug, instead of upgrade, I installed two instances one with
> postgres 15 and another postgres 17 with the same application dump restored.
>
> Now surprisingly one of the query i took from application which used to
> execute in 2s in PG15, is now taking 1min+ in PG17. I also observed that
> some of the operations involving DML operations slowed down too in PG17.
>
> Explain plan of the two queries almost same, all the joins and paths
> used are exactly same.
>
> Could anybody please provide some insights here?
Curiously, the difference is in the Index Only Scan node:

->  Index Scan using stream_file_pkey on stream_file sf
     (cost=1.63..1.86 rows=1 width=8)
     (actual time=0.006..0.006 rows=1 loops=598916)
     Index Cond: (id = sdo.stream_file_id)
     Filter: (component_id = ANY

->  Index Only Scan using ui_stream_file_id_component on stream_file sf
     (cost=0.43..0.51 rows=1 width=8)
     (actual time=0.014..0.014 rows=1 loops=598916)
     Index Cond: ((id = sdo.stream_file_id) AND (component_id = ANY

Each time the index scan is 2.5 times slower on PG17. But:

PG 15:
Buffers: shared hit=2338397 read=57267
I/O Timings: shared read=3384.286

PG 17:
Buffers: shared hit=1909772 read=9933
I/O Timings: shared read=686.506

If I'm not mistaken, this seems like a useful clue.

--
regards, Andrei Lepikhov

Attachments

Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From:
Peter Geoghegan
Date:
On Thu, Jul 17, 2025 at 5:58 AM Sajith Prabhakar Shetty
<ssajith@blackduck.com> wrote:
> In regards to your point below, it is true that the index only scan is 2.5 times slower on PG17 and primary difference is in that index-only scan of ui_stream_file_id_component at line 14. It takes 6 microseconds per row in PG 15, 2 microseconds per row in 16, and 14 microseconds in 17

The important difference is the choice of index for the outermost
nestloop join's inner index scan. A different index is used on
Postgres 17:

On Postgres 15, you're using the likely-single-column stream_file_pkey
index, which uses filter quals for the ScalarArrayOp/= ANY condition.
Whereas on Postgres 17, you're using the ui_stream_file_id_component
index instead (a multicolumn index), which uses a true index qual for
the "id = sdo.stream_file_id" as well as for the ScalarArrayOp/= ANY
condition.
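
For illustration, the two plan shapes look roughly like this. The index and column names come from the plans above; the exact index definitions are assumptions:

```sql
-- Assumed definitions matching the index names in the plans:
CREATE UNIQUE INDEX stream_file_pkey ON stream_file (id);
CREATE INDEX ui_stream_file_id_component ON stream_file (id, component_id);

-- Postgres 15 shape: the SAOP is applied as a filter qual,
-- outside the core B-Tree code:
--   Index Cond: (id = sdo.stream_file_id)
--   Filter: (component_id = ANY ('{...}'::bigint[]))

-- Postgres 17 shape: both conditions are true index quals,
-- pushed down into the B-Tree scan:
--   Index Cond: ((id = sdo.stream_file_id)
--                AND (component_id = ANY ('{...}'::bigint[])))
```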


--
Peter Geoghegan



Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From:
Sajith Prabhakar Shetty
Date:

Hi Peter,

Thanks for the response, but I don’t understand what you meant by “you are using a different index”; by any chance did you mean the optimizer?
Because I used exactly the same data dump for PG 15, 16, and 17 in my tests, with no difference in data or schema structure.

 

 

 

Sajith P Shetty

Principal Engineer

Black Duck 

+91 9448389989 | ssajith@blackduck.com


 

 

From: Peter Geoghegan <pg@bowt.ie>
Date: Thursday, 17 July 2025 at 8:35 PM
To: Sajith Prabhakar Shetty <ssajith@blackduck.com>
Cc: Andrei Lepikhov <lepihov@gmail.com>, pgsql-bugs@lists.postgresql.org <pgsql-bugs@lists.postgresql.org>, Tom Lane <tgl@sss.pgh.pa.us>
Subject: Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

On Thu, Jul 17, 2025 at 5:58 AM Sajith Prabhakar Shetty
<ssajith@blackduck.com> wrote:
> In regards to your point below, it is true that the index only scan is 2.5 times slower on PG17 and  primary difference is in that index-only scan of ui_stream_file_id_component at line 14.  It takes 6 microseconds per row in PG 15, 2 microseconds per row in 16, and 14 microseconds in 17

The important difference is the choice of index for the outermost
nestloop join's inner index scan. A different index is used on
Postgres 17:

On Postgres 15, you're using the likely-single-column stream_file_pkey
index, which uses filter quals for the ScalarArrayOp/= ANY condition.
Whereas on Postgres 17, you're using the ui_stream_file_id_component
index instead (a multicolumn index), which uses a true index qual for
the "id = sdo.stream_file_id" as well as for the ScalarArrayOp/= ANY
condition.


--
Peter Geoghegan

Attachments

Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From:
Peter Geoghegan
Date:
On Thu, Jul 17, 2025 at 11:49 AM Sajith Prabhakar Shetty
<ssajith@blackduck.com> wrote:
> Thanks for the response, but I don’t understand when you meant “you are using different index”, by any chance did you mean the optimizer?
> Because I have used exactly the same data dump for all PG15,16 and 17 for my tests with no difference in data nor schema structure.

I simply mean that the plan is substantially different, in that there
is an index scan node on 17 that uses a completely different index to
the corresponding index scan node on 15. While the plan looks almost
the same, this one detail is huge.

In other words, I disagree with your summary of the plan, when you
said "Explain plan of the two queries almost same, all the joins and
paths used are exactly same". The paths are not the same.

--
Peter Geoghegan



Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From:
Sajith Prabhakar Shetty
Date:

Hello,

 

We were able to put together a self-contained reproducer; please find attached the dump, SQL script, and README files.

Sorry for the delay; our initial reproducer from the product was too large and too sensitive to share.

 

This clearly demonstrates the performance degradation of Postgres 17 versus Postgres 15.

 

NOTE: This degradation has blocked our PG upgrade on multiple products in our portfolio.

 

Thanks.

 

Sajith P Shetty

Principal Engineer

Black Duck 

+91 9448389989 | ssajith@blackduck.com


 

 

From: Peter Geoghegan <pg@bowt.ie>
Date: Thursday, 17 July 2025 at 9:25 PM
To: Sajith Prabhakar Shetty <ssajith@blackduck.com>
Cc: Andrei Lepikhov <lepihov@gmail.com>, pgsql-bugs@lists.postgresql.org <pgsql-bugs@lists.postgresql.org>, Tom Lane <tgl@sss.pgh.pa.us>
Subject: Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

On Thu, Jul 17, 2025 at 11:49 AM Sajith Prabhakar Shetty
<ssajith@blackduck.com> wrote:
> Thanks for the response, but I don’t understand when you meant “you are using different index”, by any chance did you mean the optimizer?
> Because I have used exactly the same data dump for all PG15,16 and 17 for my tests with no difference in data nor schema structure.

I simply mean that the plan is substantially different, in that there
is an index scan node on 17 that uses a completely different index to
the corresponding index scan node on 15. While the plan looks almost
the same, this one detail is huge.

In other words, I disagree with your summary of the plan, when you
said "Explain plan of the two queries almost same, all the joins and
paths used are exactly same". The paths are not the same.

--
Peter Geoghegan

Attachments

Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From:
Peter Geoghegan
Date:
On Mon, Jul 28, 2025 at 2:20 AM Sajith Prabhakar Shetty
<ssajith@blackduck.com> wrote:
> We are able to get you a self-contained reproducer, please find attached dump, sql script and read me files.

I find that your test case spends a great deal of time on nbtree
preprocessing, which happens once per execution of the inner index
scan on "zsf". According to "perf top", most cycles are spent on these:

  32.02%  postgres  [.] FunctionCall2Coll
  22.01%  postgres  [.] qsort_arg
  18.64%  postgres  [.] _bt_compare_array_elements
   8.20%  postgres  [.] btint8cmp
   3.97%  postgres  [.] _bt_preprocess_keys
  ...

The query takes ~1550ms on my local workstation. If I just comment out
the relevant qsort, it'll take only ~190 ms. That qsort might not be
the only problem here, but it is the immediate problem. Note that
commenting out the qsort should produce the same answer, at least for
this one query, since the constants that appear in the query are
already sorted (the EXPLAIN row counts match what they show with the
qsort in place).

In principle, we could limit the use of the qsort to the first inner
index scan, and safely skip each subsequent qsort -- at least in cases
where the array was a constant (which includes this case). Obviously,
we *do* need a qsort (what if the constants aren't exactly in sorted
order?), but we generally don't need to do it once per inner index
scan.

There is a separate question as to whether or not the planner should
pick this plan in the first place. I find that I can get a faster plan
(without commenting out anything) by tricking the planner into using
the single column "zsf_pkey", rather than the multi-column
"zsf_id_fpi_cid_key". Even then, I can only get a merge join with the
"zsf_pkey" index on the inner side -- not a nested loop join with the
"zsf_pkey" index on the inner side, as expected. The merge join plan
brings the execution time down to ~90ms, which is better, but
certainly still less than ideal.

--
Peter Geoghegan



Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From:
Sajith Prabhakar Shetty
Date:

Thanks Peter for detailed response. We really appreciate your time and effort in this regard.

Please advise on the next steps required from our end. I understand you have found the root cause; can you confirm whether a fix will be accepted for this case in upcoming patch releases? Any ETA would be really helpful.

 

Also note that this has affected many of our SQL queries, and we cannot afford to modify our code to accommodate the PG17 planner changes.

 

Looping in my colleague Todd, who helped me put together the reproducer steps.

 

 

 

Sajith P Shetty

Principal Engineer

Black Duck 

+91 9448389989 | ssajith@blackduck.com


 

 

From: Peter Geoghegan <pg@bowt.ie>
Date: Tuesday, 29 July 2025 at 2:12 AM
To: Sajith Prabhakar Shetty <ssajith@blackduck.com>
Cc: Andrei Lepikhov <lepihov@gmail.com>, pgsql-bugs@lists.postgresql.org <pgsql-bugs@lists.postgresql.org>, Tom Lane <tgl@sss.pgh.pa.us>
Subject: Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

On Mon, Jul 28, 2025 at 2:20 AM Sajith Prabhakar Shetty
<ssajith@blackduck.com> wrote:
> We are able to get you a self-contained reproducer, please find attached dump, sql script and read me files.

I find that your test case spends a great deal of time on nbtree
preprocessing, which happens once per execution of the inner index
scan on "zsf". According to "perf top", most cycles are spent on these:

  32.02%  postgres  [.] FunctionCall2Coll
  22.01%  postgres  [.] qsort_arg
  18.64%  postgres  [.] _bt_compare_array_elements
   8.20%  postgres  [.] btint8cmp
   3.97%  postgres  [.] _bt_preprocess_keys
  ...

The query takes ~1550ms on my local workstation. If I just comment out
the relevant qsort, it'll take only ~190 ms. That qsort might not be
the only problem here, but it is the immediate problem. Note that
commenting out the qsort should produce the same answer, at least for
this one query, since the constants that appear in the query are
already sorted (the EXPLAIN row counts match what they show with the
qsort in place).

In principle, we could limit the use of the qsort to the first inner
index scan, and safely skip each subsequent qsort -- at least in cases
where the array was a constant (which includes this case). Obviously,
we *do* need a qsort (what if the constants aren't exactly in sorted
order?), but we generally don't need to do it once per inner index
scan.

There is a separate question as to whether or not the planner should
pick this plan in the first place. I find that I can get a faster plan
(without commenting out anything) by tricking the planner into using
the single column "zsf_pkey", rather than the multi-column
"zsf_id_fpi_cid_key". Even then, I can only get a merge join with the
"zsf_pkey" index on the inner side -- not a nested loop join with the
"zsf_pkey" index on the inner side, as expected. The merge join plan
brings the execution time down to ~90ms, which is better, but
certainly still less than ideal.

--
Peter Geoghegan

Attachments

Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From:
Peter Geoghegan
Date:
On Mon, Jul 28, 2025 at 4:41 PM Peter Geoghegan <pg@bowt.ie> wrote:
> The query takes ~1550ms on my local workstation. If I just comment out
> the relevant qsort, it'll take only ~190 ms. That qsort might not be
> the only problem here, but it is the immediate problem. Note that
> commenting out the qsort should produce the same answer, at least for
> this one query, since the constants that appear in the query are
> already sorted (the EXPLAIN row counts match what they show with the
> qsort in place).

Actually, that isn't quite true -- the constants weren't in sorted order.

I find that if I presort the elements within the query text itself,
the runtime goes down to only ~410ms. That's still not great, but it
is a vast improvement.

--
Peter Geoghegan



Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From:
Merlin Moncure
Date:
On Mon, Jul 28, 2025 at 2:42 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, Jul 28, 2025 at 2:20 AM Sajith Prabhakar Shetty
> <ssajith@blackduck.com> wrote:
> > We are able to get you a self-contained reproducer, please find attached dump, sql script and read me files.
>
> I find that your test case spends a great deal of time on nbtree
> preprocessing, which happens once per execution of the inner index
> scan on "zsf". According to "perf top", most cycles on spent on these:
>
>  32.02%  postgres                    [.] FunctionCall2Coll
>   22.01%  postgres                    [.] qsort_arg
>   18.64%  postgres                    [.] _bt_compare_array_elements
>    8.20%  postgres                    [.] btint8cmp
>    3.97%  postgres                    [.] _bt_preprocess_keys
>   ...
>
> The query takes ~1550ms on my local workstation. If I just comment out
> the relevant qsort, it'll take only ~190 ms. That qsort might not be

side question: if 50% of time (per perf top) is spent in qsort and
subroutines, how come query time goes down ~85%? is this a limitation
of perf or some other issue?

side question #2: 32% time spent on FunctionCall2Coll seems like a lot
-- is this inclusive of the routine under the function pointer?

just curious on both of these

merlin



Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From:
Peter Geoghegan
Date:
On Tue, Jul 29, 2025 at 1:49 AM Sajith Prabhakar Shetty
<ssajith@blackduck.com> wrote:
> I understand you have found out the root cause

I wouldn't say that -- it isn't clear that this issue with qsorting
during preprocessing is the root cause.

As I said, it is (at a minimum) the immediate problem with your query.
But the underlying code path didn't change all that much in Postgres
17 -- it just started to be hit a lot more with this particular query.
This is more or less a consequence of the fact that the new-to-17
query plan has an inner index scan with an inconvenient combination of
2 things: it is very selective and fast (fast per individual
execution), even though it has a couple of SAOPs (a SAOP is an = ANY()
condition) that each have several hundred array elements. That's what
allows these startup costs (i.e. array preprocessing that does these 2
qsorts on each execution) to dominate so much.

My Postgres 17 commit 5bf748b8 made the planner stop generating
distinct index paths that make lower-order SAOPs into "Filter:"
conditions, which are executed outside of the core B-Tree code, using
a completely different code path. Those index paths are used by the
Postgres 15 plan. I am very hesitant to add anything like that back,
though, because they're very unlikely to be faster than an index path
that makes the executor push down the SAOP condition into the B-Tree
code (your counterexample notwithstanding).

Perhaps Tom can weigh in here. I removed code that generated these
alternative index paths from the planner because its original
justification (see bugfix commit a4523c5a, a follow-up to bugfix
commit 807a40c5) no longer applied. Perhaps this should be revisited
now, or perhaps the issue should be ameliorated on the nbtree side. Or
maybe we should just do nothing -- the issue can be worked around in
the application itself.

--
Peter Geoghegan



Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From:
Todd Cook
Date:
On 7/30/25, 3:17 PM, "Peter Geoghegan" <pg@bowt.ie <mailto:pg@bowt.ie>> wrote:
    Perhaps Tom can weigh-in here. I removed code that generated these
    alternative index paths from the planner because its original
    justification (see bugfix commit a4523c5a, a follow-up to bugfix
    commit 807a40c5) no longer applied. Perhaps this should be revisited
    now, or perhaps the issue should be ameliorated on the nbtree side. Or
    maybe we should just do nothing -- the issue can be worked around in
    the application itself.

I work at the same company as Sajith, but on a different product.  The reproducer he
provided is just a sample; it's not the only problem.  Load testing in my team shows
that PG 17 is about 4x slower than PG 15 across the board.  It's bordering on unusable
for production deployments.

Unfortunately, the load testing setup doesn't really help isolate individual, regressing
queries.  However, I'm more than willing to help support any further investigation if
needed or helpful.

-- todd



Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From:
Peter Geoghegan
Date:
On Wed, Jul 30, 2025 at 2:59 PM Merlin Moncure <mmoncure@gmail.com> wrote:
> side question: if 50% of time (per perf top) is spent in qsort and
> subroutines, how come query time goes down ~85%? is this a limitation
> of perf or some other issue?

I used perf's default event type for this, which is "cycles". It's
well known that the relationship between cycles and wallclock time is
very complicated. In my experience "cycles" tends to be useful for a
first order pass, to get a very general sense of what's going on,
before considering possible solutions. That's all I needed here -- I
wasn't trying to be precise.

Separately, there's bound to be complicated nonlinear effects in play.
That's one of the reasons why it is so hard to apply profiling
information effectively, at least when optimizing a given piece of
code that is already reasonably well understood.

> side question #2: 32% time spent on FunctionCall2Coll seems like a lot
> -- is this inclusive of the routine under the function pointer?
>
> just curious on both of these

I ran perf in a way that counted FunctionCall2Coll separately from
btint8cmp. It's not that surprising that FunctionCall2Coll dominated,
since btint8cmp is so simple.

--
Peter Geoghegan



Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From:
Peter Geoghegan
Date:
On Wed, Jul 30, 2025 at 3:48 PM Todd Cook <cookt@blackduck.com> wrote:
> I work at the same company as Sajith, but on a different product.  The reproducer he
> provided is just a sample; it's not the only problem.  Load testing in my team shows
> that PG 17 is about 4x slower than PG 15 across the board.  It's bordering on unusable
> for production deployments.

I suggest that you rewrite affected queries to make them join against
a VALUES() with the same constants as those currently used in the
larger IN() list. If you're not sure whether the set of constants from
the application will be reliably unique, you can use DISTINCT to make
sure.
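
A sketch of that rewrite, reusing the column names from the plans quoted earlier in the thread (the constants and the surrounding query are placeholders, not the actual application query):

```sql
-- Before (hypothetical): a large, unsorted IN list that the scan
-- must re-preprocess on every rescan
SELECT sf.id
FROM stream_file sf
WHERE sf.component_id IN (303, 101, 202 /* , ... hundreds more */);

-- After: join against a VALUES list instead, with DISTINCT in case
-- the application cannot guarantee the constants are unique
SELECT sf.id
FROM stream_file sf
JOIN (SELECT DISTINCT v.component_id
      FROM (VALUES (303), (101), (202)) AS v(component_id)) AS c
  ON c.component_id = sf.component_id;
```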

--
Peter Geoghegan



Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From:
David Rowley
Date:
On Thu, 31 Jul 2025 at 08:14, Peter Geoghegan <pg@bowt.ie> wrote:
> I suggest that you rewrite affected queries to make them join against
> a VALUES() with the same constants as those currently used in the
> larger IN() list. If you're not sure whether the set of constants from
> the application will be reliably unique, you can use DISTINCT to make
> sure.

Even just presorting the IN list constants would make it better. If I
manually adjust the recreator query to sort the items in both IN
lists, I get:

Execution Time: 263.365 ms

vs:

Execution Time: 804.377 ms

Of course, the qsort_arg() call still happens, but it'll hit qsort_arg's
presorted short-circuit case and do very little.
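
In other words (the table/column names follow the plans quoted earlier; the constants are placeholders, and the real arrays have hundreds of elements):

```sql
-- Unsorted IN list: the per-rescan qsort_arg() does real sorting work
SELECT * FROM zsf WHERE cid IN (9, 1, 7, 3);

-- Presorted IN list: the qsort still runs on each rescan, but hits
-- its presorted short-circuit and does almost nothing
SELECT * FROM zsf WHERE cid IN (1, 3, 7, 9);
```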

I've not looked in great detail, but I did wonder if it's worth
adjusting ExecIndexBuildScanKeys() to sort the array in a
ScalarArrayOpExpr when it's Const beforehand. That might be a bit of
wasted effort if there's just one scan, however.

David

Attachments

Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From:
David Rowley
Date:
On Thu, 31 Jul 2025 at 07:49, Todd Cook <cookt@blackduck.com> wrote:
> I work at the same company as Sajith, but on a different product.  The reproducer he
> provided is just a sample; it's not the only problem.  Load testing in my team shows
> that PG 17 is about 4x slower than PG 15 across the board.  It's bordering on unusable
> for production deployments.
>
> Unfortunately, the load testing setup doesn't really help isolate individual, regressing
> queries.  However, I'm more than willing to help support any further investigation if
> needed or helpful.

Unfortunately, we can't really work with that much information. It's
not like we have a list of things we know are slower in newer versions
vs older versions. Changes generally go through a large amount of
testing to help ensure these regressions don't happen, so if you
report one, unless someone else beat you to it, there's a decent
chance we didn't know about it. There's nothing we can really do to
help you based on this much information. There's just no chance we'd
have shipped PG17 if it was known to be x4 slower than some previous
version.

You may be able to narrow down what's slower using pg_stat_statements.
If you can, then use EXPLAIN and compare the plans. Did PG17 choose a
different plan? Does EXPLAIN ANALYZE reveal any inaccurate statistics?
Are both instances configured the same way?
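
For example, a first pass with pg_stat_statements might look like this, run on both versions and compared (standard extension; it must be listed in shared_preload_libraries):

```sql
-- One-time setup in the target database
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top statements by mean execution time
SELECT queryid, calls, mean_exec_time, total_exec_time,
       left(query, 80) AS query_start
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;
```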

Once you find a specific query that's causing the issue, then a report
similar to what Sajith has done is a good way to get help.

David



Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From:
Peter Geoghegan
Date:
On Wed, Jul 30, 2025 at 10:39 PM David Rowley <dgrowleyml@gmail.com> wrote:
> I've not looked in great detail, but I did wonder if it's worth
> adjusting ExecIndexBuildScanKeys() to sort the array in a
> ScalarArrayOpExpr when it's Const beforehand. That might be a bit of
> wasted effort if there's just one scan, however.

If we were to do that, it wouldn't necessarily waste any effort. We're
sorting the array either way; sorting it a bit earlier seems unlikely
to hurt performance. The main difficulty with such a scheme is finding
a natural way to structure everything. We'd have to look up the
relevant ORDER proc earlier, which might be awkward and non-modular
(it has to use the correct ORDER proc in cross-type scenarios, for
example).

We'd also have to mark the relevant input scan key with a new flag
indicating that it is already sorted (and already has any NULL array
elements removed). That way, when nbtree preprocessing saw the flag,
it'd know that there was no need to sort the array. The sort would be
skipped, regardless of whether preprocessing took place during the
first btrescan/amrescan, or during some subsequent rescan.

--
Peter Geoghegan



Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From:
Todd Cook
Date:
On 7/30/25, 10:49 PM, "David Rowley" <dgrowleyml@gmail.com <mailto:dgrowleyml@gmail.com>> wrote:
> On Thu, 31 Jul 2025 at 07:49, Todd Cook <cookt@blackduck.com <mailto:cookt@blackduck.com>> wrote:
>> I work at the same company as Sajith, but on a different product. The reproducer he
>> provided is just a sample; it's not the only problem. Load testing in my team shows
>> that PG 17 is about 4x slower than PG 15 across the board. It's bordering on unusable
>> for production deployments.
>>
>> Unfortunately, the load testing setup doesn't really help isolate individual, regressing
>> queries. However, I'm more than willing to help support any further investigation if
>> needed or helpful.

> Unfortunately, we can't really work with that much information. It's
> not like we have a list of things we know are slower in newer versions
> vs older versions. Changes generally go through a large amount of
> testing to help ensure these regressions don't happen, so if you
> report one, unless someone else beat you to it, there's a decent
> chance we didn't know about it. There's nothing we can really do to
> help you based on this much information. There's just no chance we'd
> have shipped PG17 if it was known to be x4 slower than some previous
> version.

Sorry, it was not my intent to cast aspersions on PG; I was simply trying
to indicate the scale of the problem we're facing.  I've been using PG for
21+ years now, and it always has been, and still is, the most reliable
component in our software stack.

> You may be able to narrow down what's slower using pg_stat_statements.
> If you can, then use EXPLAIN and compare the plans. Did PG17 choose a
> different plan? Does EXPLAIN ANALYZE reveal any inaccurate statistics?
> Are both instances configured the same way?

We make liberal use of "IN (unsorted_constant_list)" in our queries, so a
regression there could easily explain all the slowdown we're seeing.  However,
I will confirm that just to be certain.

Also FWIW, I've bisected the slowdown to this commit:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=5bf748b86.
The reproducer that Sajith posted takes ~1.1 seconds when run on a build at
that commit (with the same index-only scan as 17.5), but only ~85ms on the
prior commit (with a filtered index scan like with 15).

-- todd


Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From:
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> Perhaps Tom can weigh-in here. I removed code that generated these
> alternative index paths from the planner because its original
> justification (see bugfix commit a4523c5a, a follow-up to bugfix
> commit 807a40c5) no longer applied. Perhaps this should be revisited
> now, or perhaps the issue should be ameliorated on the nbtree side. Or
> maybe we should just do nothing -- the issue can be worked around in
> the application itself.

Well, maybe it was a mistake to no longer consider such plans, but
this example doesn't prove it.  Quoting the submitted readme file,
we selected this plan in v15:

      ->  Index Scan using zsf_pkey on zsf sf  (cost=1.49..1.51 rows=1 width=24) (actual time=0.001..0.001 rows=1 loops=47089)
            Index Cond: (id = sdo.sfi)
            Filter: (cid = ANY ('{...}'::bigint[]))

versus this in v17:

      ->  Index Only Scan using zsf_id_fpi_cid_key on zsf sf  (cost=0.29..0.31 rows=1 width=24) (actual time=0.023..0.023 rows=1 loops=47089)
            Index Cond: ((id = sdo.sfi) AND (cid = ANY ('{...}'::bigint[])))

IIUC you're saying the planner no longer even considers the first
case --- but if it did, it'd surely still pick the second one,
because the estimated cost is a lot less.  So undoing that choice
would not help the blackduck folks.

I do think we should do something about this, though.  My suggestion
is that we should always presort in the planner if the SAOP argument
is a Const array, and then skip the run-time sort if the executor
sees the argument is a Const.  Yes, there will be cases where the
plan-time sort is wasted effort, but not too darn many.

An alternative thought is that maybe the run-time sort is expensive
enough that the planner ought to account for it in its estimates.
However, that's a bit of a research project, and I don't think we'd
dare shove it into v17 at this point even if it turns out to fix
this particular case.  But a pre-sort seems like a pretty safe change.

            regards, tom lane



Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From:
Peter Geoghegan
Date:
On Thu, Jul 31, 2025 at 4:24 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Well, maybe it was a mistake to no longer consider such plans, but
> this example doesn't prove it.

To be clear, I strongly suspect that it wasn't a mistake. I was just
laying out options.

> Quoting the submitted readme file,
> we selected this plan in v15:
>
>       ->  Index Scan using zsf_pkey on zsf sf  (cost=1.49..1.51 rows=1 width=24) (actual time=0.001..0.001 rows=1 loops=47089)
>             Index Cond: (id = sdo.sfi)
>             Filter: (cid = ANY ('{...}'::bigint[]))
>
> versus this in v17:
>
>       ->  Index Only Scan using zsf_id_fpi_cid_key on zsf sf  (cost=0.29..0.31 rows=1 width=24) (actual time=0.023..0.023 rows=1 loops=47089)
>             Index Cond: ((id = sdo.sfi) AND (cid = ANY ('{...}'::bigint[])))
>
> IIUC you're saying the planner no longer even considers the first
> case

Yes, exactly.

> but if it did, it'd surely still pick the second one,
> because the estimated cost is a lot less.  So undoing that choice
> would not help the blackduck folks.

Right.

Theoretically, we could go back to generating these filter qual index
paths if we also revised the costing to account for every little
nuance. But I have little faith in our ability to come up with a cost
model that could ever be accurate enough to reliably pick between two
very similar alternatives like this.

There are huge disadvantages to using filter quals when index quals
can be used instead. Weighing that against the cost of the sort seems
like it would be very brittle.

> I do think we should do something about this, though.  My suggestion
> is that we should always presort in the planner if the SAOP argument
> is a Const array, and then skip the run-time sort if the executor
> sees the argument is a Const.

I agree.

Is there a convenient choke point for this in the planner?

I suspect that making this change will have a side-effect: it'll make
EXPLAIN show the array as sorted and deduplicated. That seems like a
minor positive to me, but it's something to consider.

> Yes, there will be cases where the
> plan-time sort is wasted effort, but not too darn many.

It's also only something that is noticeable with very large arrays,
which are themselves kind of rare.
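
The plan-time presort being discussed can be sketched as follows (an illustrative Python model, not PostgreSQL source; function names are hypothetical). The point is that sorting and deduplicating a Const array once when the plan is built lets every executor rescan do a cheap binary search instead of re-sorting:

```python
# Illustrative model: presort a Const IN-list once at "plan time" so
# each of the (possibly ~47089) executor rescans only binary-searches,
# instead of re-sorting the array on every rescan.
from bisect import bisect_left

def plan_time_presort(values):
    """Sort and deduplicate the SAOP array once, when the plan is built."""
    return sorted(set(values))

def rescan_matches(presorted, key):
    """Per-rescan work drops to a binary search of the presorted array."""
    i = bisect_left(presorted, key)
    return i < len(presorted) and presorted[i] == key

arr = plan_time_presort([30, 10, 20, 10, 30])
assert arr == [10, 20, 30]        # sorted and deduplicated, as EXPLAIN would then show
assert rescan_matches(arr, 20)
assert not rescan_matches(arr, 15)
```

This also illustrates the EXPLAIN side-effect mentioned above: the displayed array would be the sorted, deduplicated one.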

> An alternative thought is that maybe the run-time sort is expensive
> enough that the planner ought to account for it in its estimates.

I tend to doubt that this will ever make much sense.

--
Peter Geoghegan



Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> On Thu, Jul 31, 2025 at 4:24 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I do think we should do something about this, though.  My suggestion
>> is that we should always presort in the planner if the SAOP argument
>> is a Const array, and then skip the run-time sort if the executor
>> sees the argument is a Const.

> I agree.

Cool, will you do the legwork?

> Is there a convenient choke point for this in the planner?

I'd be inclined to do it as late as possible, in create_plan
(so that we don't expend the effort if we don't choose that
index path).  So in or near fix_indexqual_references is
probably a good spot.

> I suspect that making this change will have a side-effect: it'll make
> EXPLAIN show the array as sorted and deduplicated. That seems like a
> minor positive to me, but it's something to consider.

Indeed.  We can make use of that in test cases, perhaps.

>> An alternative thought is that maybe the run-time sort is expensive
>> enough that the planner ought to account for it in its estimates.

> I tend to doubt that this will ever make much sense.

As you say, getting the cost estimates accurate enough is daunting,
which is why I called it a research project.

            regards, tom lane



Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From
Peter Geoghegan
Date:
On Thu, Jul 31, 2025 at 5:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Cool, will you do the legwork?

I'll give it a go.

> > Is there a convenient choke point for this in the planner?
>
> I'd be inclined to do it as late as possible, in create_plan
> (so that we don't expend the effort if we don't choose that
> index path).  So in or near fix_indexqual_references is
> probably a good spot.

Understood.

> >> An alternative thought is that maybe the run-time sort is expensive
> >> enough that the planner ought to account for it in its estimates.
>
> > I tend to doubt that this will ever make much sense.
>
> As you say, getting the cost estimates accurate enough is daunting,
> which is why I called it a research project.

I think that it might make sense to add a little more startup cost to
plans with SAOP arrays. But that doesn't seem likely to help with this
particular problem.

--
Peter Geoghegan



Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From
Todd Cook
Date:
On 7/31/25, 4:24 PM, "Tom Lane" <tgl@sss.pgh.pa.us <mailto:tgl@sss.pgh.pa.us>> wrote:
> versus this in v17:
>
>   -> Index Only Scan using zsf_id_fpi_cid_key on zsf sf (cost=0.29..0.31 rows=1 width=24) (actual time=0.023..0.023 rows=1 loops=47089)
>       Index Cond: ((id = sdo.sfi) AND (cid = ANY ('{...}'::bigint[])))

I've been doing some experimenting with the indexes on the zsf table.

As a reminder, the definition of the zsf table is

   Column |  Type  | Collation | Nullable | Default
  --------+--------+-----------+----------+---------
   id     | bigint |           | not null |
   cid    | bigint |           | not null |
   fpi    | bigint |           | not null |
  Indexes:
      "zsf_pkey" PRIMARY KEY, btree (id)
      "zsf_cid_idx" btree (cid)
      "zsf_id_fpi_cid_key" UNIQUE CONSTRAINT, btree (id, fpi, cid)

As an experiment, I added another unique constraint that puts the
two columns referenced by the query first, giving

  Indexes:
      "zsf_pkey" PRIMARY KEY, btree (id)
      "zsf_cid_idx" btree (cid)
      "zsf_id_cid_fpi_key" UNIQUE CONSTRAINT, btree (id, cid, fpi)
      "zsf_id_fpi_cid_key" UNIQUE CONSTRAINT, btree (id, fpi, cid)

Unexpectedly (to me), the planner still chose the same index as before
(with the unreferenced fpi in the middle).

Dropping the original unique constraint led the planner to switch to
a sequential scan of zsf:

  Indexes:
      "zsf_pkey" PRIMARY KEY, btree (id)
      "zsf_cid_idx" btree (cid)
      "zsf_id_cid_fpi_key" UNIQUE CONSTRAINT, btree (id, cid, fpi)

  Seq Scan on zsf sf  (cost=1.20..751.52 rows=34211 width=24) (actual time=0.014..2.781 rows=34205 loops=1)
    Filter: (cid = ANY ('{...}'))

which I assume means that the planner estimated the seq scan to be cheaper
than using the unique index on (id, cid, fpi).

Adding a plain index on (id, cid) switches back to an index scan:

  Indexes:
      "zsf_pkey" PRIMARY KEY, btree (id)
      "zsf_cid_idx" btree (cid)
      "zsf_id_cid_fpi_key" UNIQUE CONSTRAINT, btree (id, cid, fpi)
      "zsf_id_cid_idx" btree (id, cid)

  Index Scan using zsf_id_cid_idx on zsf sf  (cost=0.29..1010.86 rows=34211 width=24) (actual time=0.033..4.239 rows=34205 loops=1)
    Index Cond: (cid = ANY ('{...}'))

Every plan that didn't use the original unique index ran in ~60 to ~80 ms.

It seems to me (perhaps naively) that the planner is being inconsistent
about how it is costing the three possible index traversals.  Why would
using zsf_id_cid_idx cost so much more than using the original index?

-- todd


Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From
Peter Geoghegan
Date:
On Thu, Jul 31, 2025 at 5:59 PM Todd Cook <cookt@blackduck.com> wrote:
> On 7/31/25, 4:24 PM, "Tom Lane" <tgl@sss.pgh.pa.us <mailto:tgl@sss.pgh.pa.us>> wrote:
> > versus this in v17:
> >
> >   -> Index Only Scan using zsf_id_fpi_cid_key on zsf sf (cost=0.29..0.31 rows=1 width=24) (actual time=0.023..0.023 rows=1 loops=47089)
> >       Index Cond: ((id = sdo.sfi) AND (cid = ANY ('{...}'::bigint[])))
>
> I've been doing some experimenting with the indexes on the zsf table.

> Dropping the original unique constraint led the planner to switch to
> a sequential scan of zsf:

I noticed that too. The relevant zsf SAOP array has a unique match for
most of the rows in zsf, so it isn't particularly surprising that we
get a sequential scan when the details are tweaked/with a different
set of indexes.
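
The intuition here can be made concrete with a rough sketch (illustrative Python only; the constants are invented and are nothing like PostgreSQL's actual cost model): when the array matches nearly every row of a small table, one sequential pass over the heap beats a separate index descent per matched row.

```python
# Rough cost intuition (NOT the real PostgreSQL cost model): compare one
# sequential pass against per-row index descents when almost every row
# of zsf matches the SAOP array. The per-unit costs are made up.
def seqscan_cost(nrows, per_row=0.01):
    return 1.0 + nrows * per_row            # one pass over the whole heap

def indexscan_cost(matched_rows, per_probe=0.03):
    return matched_rows * per_probe         # one descent per matched row

n = 34211          # rows in zsf, per the plans quoted above
matched = 34205    # nearly every row matches the array
assert seqscan_cost(n) < indexscan_cost(matched)   # seq scan wins
```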

> Adding a plain index on (id, cid) switches back to an index scan:
>
>   Indexes:
>       "zsf_pkey" PRIMARY KEY, btree (id)
>       "zsf_cid_idx" btree (cid)
>       "zsf_id_cid_fpi_key" UNIQUE CONSTRAINT, btree (id, cid, fpi)
>       "zsf_id_cid_idx" btree (id, cid)
>
>   Index Scan using zsf_id_cid_idx on zsf sf  (cost=0.29..1010.86 rows=34211 width=24) (actual time=0.033..4.239 rows=34205 loops=1)
>     Index Cond: (cid = ANY ('{...}'))
>
> Every plan that didn't use the original unique index ran in ~60 to ~80 ms.

Notice that this plan shows "rows=34205 loops=1" for zsf_id_cid_idx --
not "rows=1 loops=47089", as in your original test case. It looks like
this revised plan makes its scan of zsf_id_cid_idx appear
someplace that isn't the inner side of a nested loop join. In other
words, it looks like the basic structure/join order of the plan is
significantly different to that of your original Postgres 15 plan (and
that of your original Postgres 17 plan, which had the same basic join
order as the 15 one).

--
Peter Geoghegan



Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From
Peter Geoghegan
Date:
On Thu, Jul 31, 2025 at 5:40 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, Jul 31, 2025 at 5:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Cool, will you do the legwork?
>
> I'll give it a go.

Update on my progress:

I find that the test query takes ~95ms on Postgres 16.  We're getting
"Inner Unique: true" on Postgres 16, but if I replace zsf_pkey with a
non-unique index it brings the execution time up slightly, to ~100ms.
Ideally, we'll be able to get Postgres 17+ at about parity with that.

I find that my WIP patch (which does the required sort in the planner
in the obvious way) brings the runtime down, from ~1500ms to ~165ms.
Obviously, this is a massive improvement -- but it's a bit
disappointing that we can't do better still.

Getting closer to ~100ms seems intuitively achievable to me, since the
index scan we're getting on 17 isn't supposed to be a whole lot
different to the one we get on 16 (in spite of the fact that we're
using a different index) -- all of the other executor nodes from the
plan are pretty much the same on each version. Why the remaining
shortfall?

One problem remains here: we're still doing more work than one would
hope at the start of btrescan, during array preprocessing. We're able
to skip the sort, of course, but just building a simple Datum array
via a call to deconstruct_array() is enough of a remaining bottleneck
to matter. Ideally, we'd do *all* of the work once for each SAOP
array, in the planner (assuming a Const argument). In order to do
that, we'd have to make the planner/executor pass nbtree an array that
has the exact structure that it works with at runtime: a raw Datum
array.

I find that once I make the planner and executor pass a raw Datum
array, we're much closer to my soft performance target: the query
runtime goes down to ~135ms. This isn't perfect, but it's much closer
to the theoretical ideal that I have in mind. We're still doing extra
work in the 17 index scan, compared to the one in 16, but I can't feel
too bad about that; looking up a separate ORDER proc for the
lower-order column isn't free, and being prepared to use a SAOP array
necessitates a little more memory allocation for preprocessing (we
make _bt_preprocess_keys use a partially-preprocessed copy of the
original input keys as its input keys when there's an SAOP array). So
~135ms is roughly in line with what I expect.

The problem with this more ambitious approach is that it is also much
more invasive. It bleeds into things like the plan cache.
EXPLAIN/ruleutils.c would need its own built in way to show the qual
that can work with this alternative "raw datum array" representation,
which I haven't bothered adding. I doubt that that complexity will pay
for itself.

My inclination is to pursue the simpler approach, and just accept the
remaining performance shortfall. This is a rare enough case that I
think that that'll be acceptable. But input on how to make this
trade-off would be helpful.

--
Peter Geoghegan



Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> My inclination is to pursue the simpler approach, and just accept the
> remaining performance shortfall. This is a rare enough case that I
> think that that'll be acceptable. But input on how to make this
> trade-off would be helpful.

My thought here is that there is not a lot you can do in v17: it's
been out for a year and we dare not destabilize it.  Take the easy
near-guaranteed win and be happy.  Additional improvements can be
pursued at our leisure in the master branch.

            regards, tom lane



Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From
Tom Lane
Date:
I wrote:
> My thought here is that there is not a lot you can do in v17: it's
> been out for a year and we dare not destabilize it.  Take the easy
> near-guaranteed win and be happy.  Additional improvements can be
> pursued at our leisure in the master branch.

Also, the calendar is a factor here: we are only a week away from
release freeze for the August releases.  Even if you're feeling
optimistic about what we could risk in v17 with some development
effort, I'd counsel first working on a minimal patch that would
be safe to apply on short notice, so we can get the easy win
in this release cycle.

            regards, tom lane



Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From
Peter Geoghegan
Date:
On Fri, Aug 1, 2025 at 9:10 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> My thought here is that there is not a lot you can do in v17: it's
> been out for a year and we dare not destabilize it.  Take the easy
> near-guaranteed win and be happy.  Additional improvements can be
> pursued at our leisure in the master branch.

I agree. I always thought that it was unlikely that anybody would
seriously argue in favor of the more ambitious approach I laid out. I
nevertheless felt obligated to present it as an option.

I find it natural to approach fixing this regression (and any other
regression with a fairly crisp problem statement) by starting from the
premise that the query should be at least as fast as it was on a
previous release, and working backwards from there. Actually meeting
that standard might not be practically achievable, for whatever
reason. But even then there is still value in understanding and
clearly communicating why it is that we fell short of that standard.

--
Peter Geoghegan



Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From
Peter Geoghegan
Date:
On Fri, Aug 1, 2025 at 10:05 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Also, the calendar is a factor here: we are only a week away from
> release freeze for the August releases.  Even if you're feeling
> optimistic about what we could risk in v17 with some development
> effort, I'd counsel first working on a minimal patch that would
> be safe to apply on short notice, so we can get the easy win
> in this release cycle.

Attached is what I came up with. It makes the problematic test case
query execute a little under 10x faster, compared to unpatched
master -- see my numbers from Friday (the numbers for the less
ambitious approach/the approach taken by this patch). This is still a
WIP.

I'm unsure of how to assess my approach from a modularity perspective.
The patch simply assumes that any amsearcharray index AM is either
nbtree, or some other amsearcharray index AM that is okay with having
its array deduplicated based on nbtree rules/a B-Tree ORDER proc. So
there had better be a B-Tree opclass ORDER proc for us to work off of
(otherwise planning will throw an error). This is a little bit like
how we deal with RowCompareExpr within nodeIndexScan.c -- there are
similar hard-coded BTORDER_PROC lookups there. I don't *think* that I
need to use the new IndexAmTranslateCompareType facility (added by
commit c09e5a6a) for this, since most existing code doesn't bother
with any of that, either.

I haven't given much thought to the increase in planner cycles. I'll
stress-test that aspect of the patch soon.

Todd, Sajith: it would be helpful if you could test this patch.
Possibly by using the original problematic query, rather than the
minimized version that you posted to the list. The patch won't bring
performance up to parity with Postgres 15, but it should be a great
deal faster. Note that the patch that I've posted will only apply
against the current master branch (I'll prepare patches for earlier
branches once I have some buy-in). See
https://wiki.postgresql.org/wiki/Compile_and_Install_from_source_code
and https://www.postgresql.org/docs/current/installation.html for
instructions.


--
Peter Geoghegan

Attachments

Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> Attached is what I came up with. It makes the problematic test case
> query execute a little under 10x faster, compared to unpatched
> master -- see my numbers from Friday (the numbers for the less
> ambitious approach/the approach taken by this patch). This is still a
> WIP.

I took a very quick look at this, and there are a couple of things
I don't like:

* I don't love the fact that _bt_presort_const_array shares no code
with _bt_sort_array_elements.  Maybe there's nothing to be done
about it given that they're working on different representations.
But the comments could at least point out that the two functions
had better give equivalent results.

* I really don't like the fact that SK_PRESORTED gets set so far
away from the code that's actually doing the work to make that
valid (not to mention far from the code that consults that flag).
Even if it's bug-free today, that seems like an invitation to
maintenance problems.  Can we factor that differently?

> I'm unsure of how to assess my approach from a modularity perspective.
> The patch simply assumes that any amsearcharray index AM is either
> nbtree, or some other amsearcharray index AM that is okay with having
> its array deduplicated based on nbtree rules/a B-Tree ORDER proc.

Well, it's worse than that: it's assuming that support procedure 1
is in fact a btree ordering proc in any opclass of an amsearcharray
AM.  I don't think that's too safe.  We have our hands on enough
info about the index that we could skip trying to do the sort if
it's not a btree, and I think we should do so, at least for today.

> This is a little bit like
> how we deal with RowCompareExpr within nodeIndexScan.c -- there are
> similar hard-coded BTORDER_PROC lookups there.

One big difference is that we *know* that the comparisons in a
RowCompareExpr are btree operators, because the parser constructed it
that way.  We cannot make such assumptions about arbitrary
ScalarArrayOps, even if they are indexquals.

            regards, tom lane



Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From
Peter Geoghegan
Date:
On Mon, Aug 4, 2025 at 6:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I took a very quick look at this

Thanks for the review.

> * I don't love the fact that _bt_presort_const_array shares no code
> with _bt_sort_array_elements.  Maybe there's nothing to be done
> about it given that they're working on different representations.
> But the comments could at least point out that the two functions
> had better give equivalent results.

I'll try to improve matters here for the next revision. But it
probably won't look all that different.

> * I really don't like the fact that SK_PRESORTED gets set so far
> away from the code that's actually doing the work to make that
> valid (not to mention far from the code that consults that flag).
> Even if it's bug-free today, that seems like an invitation to
> maintenance problems.  Can we factor that differently?

It sounds like you're not so much complaining about SK_PRESORTED
itself as you are about the v1 patch's assumption that all scans of
amsearcharray-index-AM indexes with a Const arg must already be
presorted. In other words, it seems as if this complaint is closely
related to (if not exactly the same as) your later complaint about
modularity issues.

If there *was* a separate "presorted" SAOP property/struct field, then
you might not find it odd to see a SK_PRESORTED flag, used to
represent that presortedness property at the scan key level. It'd be
fairly similar to the way that we already don't pass a
ScalarArrayOpExpr down to nbtree directly (we just pass down an
ArrayType and set the SK_SEARCHARRAY scan key flag).

Note that SK_SEARCHARRAY is set in precisely one place on the master
branch, which is also the only place where the new SK_PRESORTED flag
gets set with the patch applied: within ExecIndexBuildScanKeys.

> > I'm unsure of how to assess my approach from a modularity perspective.
> > The patch simply assumes that any amsearcharray index AM is either
> > nbtree, or some other amsearcharray index AM that is okay with having
> > its array deduplicated based on nbtree rules/a B-Tree ORDER proc.
>
> Well, it's worse than that: it's assuming that support procedure 1
> is in fact a btree ordering proc in any opclass of an amsearcharray
> AM.  I don't think that's too safe.

Agreed that that's not safe (and wouldn't be even if it was fine to
make aggressive assumptions about amsearcharray index AMs being
nbtree-like).

> We have our hands on enough
> info about the index that we could skip trying to do the sort if
> it's not a btree, and I think we should do so, at least for today.

I think that this boils down to the following (correct me if I'm wrong):

There is a need to remember that we've presorted, rather than making
any kind of hardcoded assumption that that has happened. In practice
we'll always presort with a Const-arg'd ScalarArrayOp nbtree index
qual, but ExecIndexBuildScanKeys has no business knowing about any of
that. This can be achieved by limiting application of the optimization
to nbtree from within createplan.c -- probably by testing
"IndexOptInfo.relam == BTREE_AM_OID" (much like
match_rowcompare_to_indexcol does it already).
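
That gating summary can be sketched as follows (illustrative Python, not PostgreSQL source; the decision function is hypothetical). 403 is btree's actual pg_am OID (BTREE_AM_OID); the non-btree OID in the example is GiST's:

```python
# Illustrative model of the planner-side gate: presort the SAOP array
# only when the index AM is btree AND the array argument is a plan-time
# Const. Anything else is left for run-time handling.
BTREE_AM_OID = 403   # btree's OID in pg_am

def maybe_presort_saop(index_relam, arg_is_const, values):
    """Return (array, presorted) roughly as createplan.c might decide it."""
    if index_relam == BTREE_AM_OID and arg_is_const:
        return sorted(set(values)), True    # safe: btree ORDER proc semantics apply
    return list(values), False              # leave the array as-is for run time

assert maybe_presort_saop(403, True, [5, 1, 5, 3]) == ([1, 3, 5], True)
assert maybe_presort_saop(403, False, [5, 1]) == ([5, 1], False)   # not a Const
assert maybe_presort_saop(783, True, [5, 1]) == ([5, 1], False)    # GiST: skip
```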

Does that summary sound about right? And if it does, do you have any
thoughts on the best place for the new "presorted" struct field?
Should it go in the ScalarArrayOpExpr struct itself?

--
Peter Geoghegan



Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> On Mon, Aug 4, 2025 at 6:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> We have our hands on enough
>> info about the index that we could skip trying to do the sort if
>> it's not a btree, and I think we should do so, at least for today.

> I think that this boils down to the following (correct me if I'm wrong):

> There is a need to remember that we've presorted, rather than making
> any kind of hardcoded assumption that that has happened. In practice
> we'll always presort with a Const-arg'd ScalarArrayOp nbtree index
> qual, but ExecIndexBuildScanKeys has no business knowing about any of
> that. This can be achieved by limiting application of the optimization
> to nbtree from within createplan.c -- probably by testing
> "IndexOptInfo.relam == BTREE_AM_OID" (much like
> match_rowcompare_to_indexcol does it already).

I agree that ExecIndexBuildScanKeys should not be hard-wiring such
an assumption: it's too far away from both the producer and the
consumer of the knowledge.  Having said that, if we're to produce
a back-patchable fix, I'm not seeing a way to have an end-to-end
confirmation that the planner did the sort.  That would require some
sort of change to the plan tree.  I'd be for doing that in master
only, but I think it's unworkable given ABI restrictions in v17.

I think the most expedient answer for v17 is to have a private
agreement between createplan.c and access/nbtree that if the
index type is btree then createplan.c will presort Const array
arguments for ScalarArrayOp indexquals.  The nbtree code would
have to check for this by looking to see if the indexqual node
is a Const.

I do agree with the idea that a scankey flag could be a better
answer in HEAD.  But it has to be fed forward from some planner
output, or else it's just a kluge.  Not that what I said in the
previous para isn't a kluge ... but my kluge touches fewer bits
of code than your v1, so I think it's better.  Ugliness is in
the eye of the beholder of course, so maybe you'll disagree.

Anyway, my points are (1) it's definitely not safe to call the
BTORDER_PROC unless the index is a btree, and (2) we should not
expect that what we do today to put into the back branches will
be a good long-term answer.

> And if it does, do you have any
> thoughts on the best place for the new "presorted" struct field?
> Should it go in the ScalarArrayOpExpr struct itself?

Doubt it.  I suppose if we had an immediate application for
presorting ScalarArrayOpExpr arrays in other contexts, that
could start to look attractive.  But for now I'd think more
about adding additional info to IndexScan plan nodes.  Maybe
it'd be interesting to store an array of scankey flag bitmasks,
with an eye to precomputing some of those flags at plan time?

            regards, tom lane



Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From
Sajith Prabhakar Shetty
Date:

Hi Peter,

 

Thank you for sharing the patch in response to the bug I raised with the Postgres community. I appreciate your prompt support and the effort you’ve put into addressing the issue.

 

Following your email, I noticed that Tom has provided additional suggestions and raised some concerns regarding the proposed changes. Before proceeding with testing the patch against the original problematic query, I wanted to check with you — should I go ahead and apply the current patch, or would it be better to wait for any revised version based on Tom’s feedback?

 

Please let me know what you recommend. I’m happy to test whenever you feel the patch is ready.

 

Thanks again for your guidance and support.



Attachments

Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From
Peter Geoghegan
Date:
On Tue, Aug 5, 2025 at 1:31 AM Sajith Prabhakar Shetty
<ssajith@blackduck.com> wrote:
> Before proceeding with testing the patch against the original problematic query, I wanted to check with you — should
> I go ahead and apply the current patch, or would it be better to wait for any revised version based on Tom’s feedback?

I doubt that Tom's feedback will affect the performance profile of the
patch. v1 is very likely to be representative of the final patch in
every way that your testing will consider, so I see no reason to wait.

--
Peter Geoghegan



Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From
Todd Cook
Date:
On 8/4/25, 4:12 PM, "Peter Geoghegan" <pg@bowt.ie <mailto:pg@bowt.ie>> wrote:
> Todd, Sajith: it would be helpful if you could test this patch.
> Possibly by using the original problematic query, rather than the
> minimized version that you posted to the list. The patch won't bring
> performance up to parity with Postgres 15, but it should be a great
> deal faster. Note that the patch that I've posted will only apply
> against the current master branch (I'll prepare patches for earlier
> branches once I have some buy-in).

I'm working with our performance testing team to rerun our load tests
with the patch applied.  However, the current infrastructure doesn't
support deploying custom PG builds, so it's unlikely we'll be able to
provide results in time for the upcoming minor releases.

Also, is it significant effort to produce a patch for PG 17?  Running the
load tests against master would make it chancy to compare the results
with the data we already have.

-- todd




Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From
Merlin Moncure
Date:
On Thu, Aug 7, 2025 at 11:58 AM Todd Cook <cookt@blackduck.com> wrote:
>
> On 8/4/25, 4:12 PM, "Peter Geoghegan" <pg@bowt.ie <mailto:pg@bowt.ie>> wrote:
> > Todd, Sajith: it would be helpful if you could test this patch.
> > Possibly by using the original problematic query, rather than the
> > minimized version that you posted to the list. The patch won't bring
> > performance up to parity with Postgres 15, but it should be a great
> > deal faster. Note that the patch that I've posted will only apply
> > against the current master branch (I'll prepare patches for earlier
> > branches once I have some buy-in).
>
> I'm working with our performance testing team to rerun our load tests
> with the patch applied.  However, the current infrastructure doesn't
> support deploying custom PG builds, so it's unlikely we'll be able to
> provide results in time for the upcoming minor releases.
>
> Also, is it significant effort to produce a patch for PG 17?  Running the
> load tests against master would make it chancy to compare the results
> with the data we already have.

Looks like it, _bt_preprocess_array_keys was whacked around quite a
bit in master vs 17 as part of the skip scan work, and a couple of the
bits the patch relies on are not there.  I would also question the
point of that, since the patch vs master is much closer to what you
would ultimately be using as master is pretty close to the v18 release
(currently in beta, scheduled for sep/oct).

I think it is very likely going to address your issue, but
confirmation helps, and will make it more likely for this patch to be
pushed.  After that, you will have to decide whether to downgrade to 15 or
suffer the status quo until 18 releases.  Running a custom patched 17
is possible if a patch could be made to appear, but I do not recommend
it given where you are at in the release calendar.
merlin



Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From
Sajith Prabhakar Shetty
Date:
Hi Peter,
I successfully applied your patch and ran tests using the same database and query that originally triggered this bug.
Please find below the EXPLAIN ANALYZE data for your review:

  Database Server                 Execution Time   Notes
  ------------------------------  ---------------  ------------------------------------------------------------
  PostgreSQL 17.5 (Stable)        9–10 sec         Execution time remains consistent across runs.
  PostgreSQL 16.9 (Stable)        1–2 sec          First run takes ~5 sec; subsequent runs stabilize at 1–2 sec.
  PostgreSQL Master (with patch)  5–6 sec          Execution time is consistent across runs.

While the patch improves performance compared to PG17, it still doesn't match the efficiency observed in PG16.9. Is there any scope for further optimization to bring it closer to PG16's performance levels?
Please let me know if you'd like me to run additional tests or provide more details.

Sajith P Shetty
Principal Engineer
Black Duck
M +91 9448389989 | ssajith@blackduck.com


From: Merlin Moncure <mmoncure@gmail.com>
Date: Saturday, 9 August 2025 at 3:34 AM
To: Todd Cook <cookt@blackduck.com>
Cc: Peter Geoghegan <pg@bowt.ie>, Tom Lane <tgl@sss.pgh.pa.us>, Sajith Prabhakar Shetty <ssajith@blackduck.com>, Andrei Lepikhov <lepihov@gmail.com>, pgsql-bugs@lists.postgresql.org <pgsql-bugs@lists.postgresql.org>
Subject: Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

On Thu, Aug 7, 2025 at 11:58 AM Todd Cook <cookt@blackduck.com> wrote:
>
> On 8/4/25, 4:12 PM, "Peter Geoghegan" <pg@bowt.ie> wrote:
> > Todd, Sajith: it would be helpful if you could test this patch.
> > Possibly by using the original problematic query, rather than the
> > minimized version that you posted to the list. The patch won't bring
> > performance up to parity with Postgres 15, but it should be a great
> > deal faster. Note that the patch that I've posted will only apply
> > against the current master branch (I'll prepare patches for earlier
> > branches once I have some buy-in).
>
> I'm working with our performance testing team to rerun our load tests
> with the patch applied.  However, the current infrastructure doesn't
> support deploying custom PG builds, so it's unlikely we'll be able to
> provide results in time for the upcoming minor releases.
>
> Also, is it significant effort to produce a patch for PG 17?  Running the
> load tests against master would make it chancy to compare the results
> with the data we already have.

Looks like it; _bt_preprocess_array_keys was whacked around quite a
bit in master vs 17 as part of the skip scan work, and a couple of the
bits the patch relies on are not there.  I would also question the
point of that, since the patch against master is much closer to what you
would ultimately be using, as master is pretty close to the v18 release
(currently in beta, scheduled for Sep/Oct).

I think it is very likely going to address your issue, but
confirmation helps, and will make it more likely for this patch to be
pushed.   After that, you will have to decide to downgrade to 15 or
suffer the status quo until 18 releases. Running a custom patched 17
is possible if a patch could be made to appear, but I do not recommend
it given where you are at in the release calendar.
merlin
Attachments

Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From
Peter Geoghegan
Date:
On Mon, Aug 11, 2025 at 10:01 AM Sajith Prabhakar Shetty
<ssajith@blackduck.com> wrote:
> While the patch improves performance compared to PG17, it still doesn't match the efficiency observed in PG16.9.

Postgres 17 shows "Buffers: shared hit=5071180 read=42049" for the
top-level scan node/the plan as a whole, while patched master shows
"Buffers: shared hit=5608415 read=103886 written=2". There are only
minor differences in each plan. It looks like these results aren't
representative.

I see lots of heap fetches for index-only scans on patched master.
That factor alone could easily make a huge difference. I suggest
running "VACUUM ANALYZE" on both setups to get more consistent
results. I also suggest using pg_prewarm and/or repeated execution to
make sure that the number of buffer misses/reads is kept to an
absolute minimum. You should also make sure to use exactly the same
settings for each test case -- ideally while using the same hardware
for both.
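
The suggestions above can be sketched as a warm-up sequence to run on each server before timing (the table name here is illustrative, not the actual schema; pg_prewarm is a contrib extension):

```sql
-- Refresh statistics and the visibility map so index-only scans can skip the heap
VACUUM ANALYZE stream_file;

-- Pre-load the relation and its index into shared buffers to minimize reads
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('stream_file');
SELECT pg_prewarm('ui_stream_file_id_component');

-- Compare buffer counts with identical settings on both servers; run several
-- times and report a steady-state result
EXPLAIN (ANALYZE, BUFFERS) SELECT ...;  -- the original problematic query
```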

> Is there any scope for further optimization to bring it closer to PG16's performance levels?

Probably not. At least not if it must be backpatched to Postgres 17.
The patch needs to be reasonably non-invasive for that to happen.

--
Peter Geoghegan



Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From
Sajith Prabhakar Shetty
Date:
Hi Peter,

I can confirm with full certainty that both VACUUM and ANALYZE were executed on all three instances whose results I shared. Additionally, each instance was freshly created on the same hardware and configured with the same postgresql.conf settings.

From: Peter Geoghegan <pg@bowt.ie>
Date: Tuesday, 12 August 2025 at 12:15 AM
To: Sajith Prabhakar Shetty <ssajith@blackduck.com>
Cc: Merlin Moncure <mmoncure@gmail.com>, Tom Lane <tgl@sss.pgh.pa.us>, Andrei Lepikhov <lepihov@gmail.com>, pgsql-bugs@lists.postgresql.org <pgsql-bugs@lists.postgresql.org>, Todd Cook <cookt@blackduck.com>
Subject: Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

On Mon, Aug 11, 2025 at 10:01 AM Sajith Prabhakar Shetty
<ssajith@blackduck.com> wrote:
> While the patch improves performance compared to PG17, it still doesn't match the efficiency observed in PG16.9.

Postgres 17 shows "Buffers: shared hit=5071180 read=42049" for the
top-level scan node/the plan as a whole, while patched master shows
"Buffers: shared hit=5608415 read=103886 written=2". There are only
minor differences in each plan. It looks like these results aren't
representative.

I see lots of heap fetches for index-only scans on patched master.
That factor alone could easily make a huge difference. I suggest
running "VACUUM ANALYZE" on both setups to get more consistent
results. I also suggest using pg_prewarm and/or repeated execution to
make sure that the number of buffer misses/reads is kept to an
absolute minimum. You should also make sure to use exactly the same
settings for each test case -- ideally while using the same hardware
for both.

> Is there any scope for further optimization to bring it closer to PG16's performance levels?

Probably not. At least not if it must be backpatched to Postgres 17.
The patch needs to be reasonably non-invasive for that to happen.

--
Peter Geoghegan
Attachments

Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From
Sajith Prabhakar Shetty
Date:
Sorry, I missed mentioning that for each instance, five runs were executed, and the latest execution plan and timing metrics were recorded and shared.

From: Sajith Prabhakar Shetty <ssajith@blackduck.com>
Date: Tuesday, 12 August 2025 at 10:13 AM
To: Peter Geoghegan <pg@bowt.ie>
Cc: Merlin Moncure <mmoncure@gmail.com>, Tom Lane <tgl@sss.pgh.pa.us>, Andrei Lepikhov <lepihov@gmail.com>, pgsql-bugs@lists.postgresql.org <pgsql-bugs@lists.postgresql.org>, Todd Cook <cookt@blackduck.com>
Subject: Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

Hi Peter,

I can confirm with full certainty that both VACUUM and ANALYZE were executed on all three instances whose results I shared. Additionally, each instance was freshly created on the same hardware and configured with the same postgresql.conf settings.

From: Peter Geoghegan <pg@bowt.ie>
Date: Tuesday, 12 August 2025 at 12:15 AM
To: Sajith Prabhakar Shetty <ssajith@blackduck.com>
Cc: Merlin Moncure <mmoncure@gmail.com>, Tom Lane <tgl@sss.pgh.pa.us>, Andrei Lepikhov <lepihov@gmail.com>, pgsql-bugs@lists.postgresql.org <pgsql-bugs@lists.postgresql.org>, Todd Cook <cookt@blackduck.com>
Subject: Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

On Mon, Aug 11, 2025 at 10:01 AM Sajith Prabhakar Shetty
<ssajith@blackduck.com> wrote:
> While the patch improves performance compared to PG17, it still doesn't match the efficiency observed in PG16.9.

Postgres 17 shows "Buffers: shared hit=5071180 read=42049" for the
top-level scan node/the plan as a whole, while patched master shows
"Buffers: shared hit=5608415 read=103886 written=2". There are only
minor differences in each plan. It looks like these results aren't
representative.

I see lots of heap fetches for index-only scans on patched master.
That factor alone could easily make a huge difference. I suggest
running "VACUUM ANALYZE" on both setups to get more consistent
results. I also suggest using pg_prewarm and/or repeated execution to
make sure that the number of buffer misses/reads is kept to an
absolute minimum. You should also make sure to use exactly the same
settings for each test case -- ideally while using the same hardware
for both.

> Is there any scope for further optimization to bring it closer to PG16's performance levels?

Probably not. At least not if it must be backpatched to Postgres 17.
The patch needs to be reasonably non-invasive for that to happen.

--
Peter Geoghegan
Attachments

Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From
Peter Geoghegan
Date:
On Tue, Aug 12, 2025 at 12:46 AM Sajith Prabhakar Shetty
<ssajith@blackduck.com> wrote:
> I can confirm with full certainty that both VACUUM and ANALYZE were executed on all three instances whose results I shared.

The results that you shared showed "Heap Fetches: 598,916" -- one heap
fetch per index search (for the problematic index scan).

Perhaps VACUUM ran without being truly effective, due to the removable
cutoff/horizon being held back by a long running query. Though that
generally won't happen on a quiescent system -- so it really does look
like VACUUM hasn't been run.
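
A way to check whether VACUUM was actually effective is to inspect the visibility map directly (a sketch; the table name is illustrative, and pg_visibility is a contrib extension):

```sql
-- Was the table vacuumed recently, and are dead tuples accumulating?
SELECT relname, last_vacuum, last_autovacuum, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'stream_file';

-- Fraction of pages marked all-visible; a low number explains the heap fetches
CREATE EXTENSION IF NOT EXISTS pg_visibility;
SELECT count(*) FILTER (WHERE all_visible) AS all_visible_pages,
       count(*) AS total_pages
FROM pg_visibility_map('stream_file');
```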


--
Peter Geoghegan



Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From
Merlin Moncure
Date:
On Wed, Aug 20, 2025 at 7:48 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Tue, Aug 12, 2025 at 12:46 AM Sajith Prabhakar Shetty
> <ssajith@blackduck.com> wrote:
> > I can confirm with full certainty that both VACUUM and ANALYZE were executed on all three instances whose results I shared.
>
> The results that you shared showed "Heap Fetches: 598,916" -- one heap
> fetch per index search (for the problematic index scan).
>
> Perhaps VACUUM ran without being truly effective, due to the removable
> cutoff/horizon being held back by a long running query. Though that
> generally won't happen on a quiescent system -- so it really does look
> like VACUUM hasn't been run.

Try running VACUUM FREEZE.   Do you have a way to frame up a dataset
for testing?
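
For reference, a freeze pass over the affected table might look like this (table name illustrative); FREEZE aggressively freezes tuples and sets the all-visible/all-frozen bits that index-only scans rely on:

```sql
VACUUM (FREEZE, VERBOSE, ANALYZE) stream_file;
```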

merlin



Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From
Todd Cook
Date:
On 8/8/25, 6:04 PM, "Merlin Moncure" <mmoncure@gmail.com> wrote:
> I think it is very likely going to address your issue, but
> confirmation helps, and will make it more likely for this patch to be
> pushed. After that, you will have to decide to downgrade to 15 or
> suffer the status quo until 18 releases. Running a custom patched 17
> is possible if a patch could be made to appear, but I do not recommend
> it given where you are at in the release calendar.

We're still working on providing confirmation.  Our load testing environment depends on
third-party PG providers, so there's a significant amount of ongoing work to set up a
configuration where we can use a custom PG build.  This is a high priority for us, so it will
get done, though maybe not for a couple more weeks yet.

-- todd


Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From
Todd Cook
Date:
On 8/25/25, 8:30 AM, "Todd Cook" <cookt@blackduck.com> wrote:
> On 8/8/25, 6:04 PM, "Merlin Moncure" <mmoncure@gmail.com> wrote:
>> I think it is very likely going to address your issue, but
>> confirmation helps, and will make it more likely for this patch to be
>> pushed. After that, you will have to decide to downgrade to 15 or
>> suffer the status quo until 18 releases. Running a custom patched 17
>> is possible if a patch could be made to appear, but I do not recommend
>> it given where you are at in the release calendar.
>
> We're still working on providing confirmation. Our load testing environment depends on
> third-party PG providers, so there's a significant amount of ongoing work to set up a
> configuration where we can use a custom PG build. This is a high priority for us, so it will
> get done, though maybe not for a couple more weeks yet.

We've finished the tests comparing PG18rc1 with and without the proposed patch,
including several runs each to verify reproducibility.  In short, the variance between the
two is well within what we normally see, so statistically speaking, the results are
indistinguishable.  However, the results are roughly on par with what we had been
seeing with PG 15 and 16, so maybe some other change from 17 to 18 eliminated the
regression we were seeing?

-- todd





Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From
Peter Geoghegan
Date:
On Thu, Sep 25, 2025 at 2:07 PM Todd Cook <cookt@blackduck.com> wrote:
> We've finished the tests comparing PG18rc1 with and without the proposed patch,
> including several runs each to verify reproducibility.  In short, the variance between the
> two is well within what we normally see, so statistically speaking, the results are
> indistinguishable.

That's interesting. It clearly wasn't like that with your minimal test
case, but the issue with the actual query is perhaps more
complicated/has more than one issue in play.

> However, the results are roughly on par with what we had been
> seeing with PG 15 and 16, so maybe some other change from 17 to 18 eliminated the
> regression we were seeing?

There's no specific reason to expect this: there's simply no way that
Postgres 18 (or Postgres 17) will get the same plan that you had on 15
and 16. That is, it is simply impossible for 17 or 18 to use a plan
where the problematic index scan node shows its ANY condition as a
filter qual ("Filter: ...") -- it *must* be a true index qual ("Cond:
..."). I always understood that your observed slowdown was a perverse
consequence of this relatively minor plan difference. Since of course
that was precisely what your original minimal test case actually
demonstrated ("perf top" showed we were bottlenecked on sorting the
SAOP array during index scan preprocessing).

What does EXPLAIN ANALYZE actually show you on 18, compared to 16,
with the same real world (non-minimal) test case? Are the joins and
scan nodes all the same as before (i.e. is the Postgres 18 plan
*identical* to the "bad" Postgres 17 plan)?

Do you see any "Heap Fetches:", particularly with the problematic
index-only scan? Could it just be that you made sure to run VACUUM
ahead of the test this time, allowing the index-only scan seen on
Postgres 17 and 18 to avoid heap accesses?

Recall that the 15 and 16 plan had a plain index scan on another
index, and that the 17 plan you showed a few weeks back had  "Heap
Fetches: 598,916" (meaning that the index-only scan was completely
ineffective at avoiding heap accesses). If you're now able to get the
most out of index-only access, it could be enough to flip things in
favor of the new plan -- in spite of the fact that there is definitely
still a regression tied to needlessly sorting the scan's SAOP array
many times.
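
The plan-shape difference described above is visible directly in EXPLAIN output; a minimal sketch (hypothetical column and index names aside from ui_stream_file_id_component) of the two shapes:

```sql
-- 15/16 could satisfy the ANY condition as a filter on a plain index scan:
--   Index Scan using some_other_index on stream_file
--     Filter: (component_id = ANY ('{...}'::bigint[]))
-- 17/18 always push it down as a true index qual, here on an index-only scan:
--   Index Only Scan using ui_stream_file_id_component on stream_file
--     Index Cond: (component_id = ANY ('{...}'::bigint[]))
EXPLAIN (ANALYZE, BUFFERS)
SELECT file_id FROM stream_file
WHERE component_id IN (1, 2, 3 /* ... hundreds of constants */);
```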

--
Peter Geoghegan



Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From
Todd Cook
Date:
On 9/25/25, 2:46 PM, "Peter Geoghegan" <pg@bowt.ie> wrote:
> What does EXPLAIN ANALYZE actually show you on 18, compared to 16,
> with the same real world (non-minimal) test case? Are the joins and
> scan nodes all the same as before (i.e. is the Postgres 18 plan
> *identical* to the "bad" Postgres 17 plan)?
>
> Do you see any "Heap Fetches:", particularly with the problematic
> index-only scan? Could it just be that you made sure to run VACUUM
> ahead of the test this time, allowing the index-only scan seen on
> Postgres 17 and 18 to avoid heap accesses?
>
> Recall that the 15 and 16 plan had a plain index scan on another
> index, and that the 17 plan you showed a few weeks back had "Heap
> Fetches: 598,916" (meaning that the index-only scan was completely
> ineffective at avoiding heap accesses). If you're now able to get the
> most out of index-only access, it could be enough to flip things in
> favor of the new plan -- in spite of the fact that there is definitely
> still a regression tied to needlessly sorting the scan's SAOP array
> many times.

The tests I reported on are application-level load tests that last about
10 hours and generate ~4 million queries, so I don't have any of that
per-query info.  We generate a lot of queries with "IN (constant_list)"
expressions, so that seemed like a logical explanation, but maybe
something else is going on too?  Nothing really stands out from
crawling through pg_stat_statements, so maybe it's a small effect that
is multiplied by repetition?

FWIW, except for large multi-join queries with "IN (constant_list)"
expressions, nearly every individual query I run is as fast or faster in
17 than in 16.  The ones that are slower are ones that were previously
munged around to get a specific plan; those that I've unmunged have
so far been much faster on 17.

It just occurred to me while typing this that I should go count joins to
see if we're exceeding join_collapse_limit.  Could something have
changed in 17 that would affect how such queries are planned?
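
The collapse limits are easy to check and experiment with per session (values shown are the stock defaults):

```sql
SHOW join_collapse_limit;  -- default 8
SHOW from_collapse_limit;  -- default 8
SHOW geqo_threshold;       -- default 12; at this many FROM items, the genetic optimizer takes over

-- For one problematic multi-join query, test whether exhaustive join ordering helps:
SET join_collapse_limit = 16;
SET from_collapse_limit = 16;
EXPLAIN (ANALYZE, BUFFERS) SELECT ...;  -- the multi-join query
```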

-- todd


Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From
Peter Geoghegan
Date:
On Thu, Sep 25, 2025 at 3:28 PM Todd Cook <cookt@blackduck.com> wrote:
> The tests I reported on are application-level load tests that last about
> 10 hours and generate ~4 million queries, so I don't have any of that
> per-query info.  We generate a lot of queries with "IN (constant_list)"
> expressions, so that seemed like a logical explanation, but maybe
> something else is going on too?

The Postgres 17 improvements to "IN (constant_list)" index scans
(which indirectly caused this regression) tend to make things much
better with workloads such as this. For example, one blog post
reported an across-the-board 30% increase in application throughput
after an upgrade to 17, which was tied back to that work:

https://www.crunchydata.com/blog/real-world-performance-gains-with-postgres-17-btree-bulk-scans

So, yes, it's plausible that an application that made heavy use of
queries with large IN() lists could be much faster now. For better or
worse, that's a reasonably common pattern with the "eager loading"
used by frameworks such as Ruby on Rails.

> Nothing really stands out from
> crawling through pg_stat_statements, so maybe it's a small effect that
> is multiplied by repetition?

Should be noted that the regression itself is definitely "a small
effect that is multiplied by repetition". This makes the problem
rather perverse (though it's not exactly unusual for a complaint about
a planner regression to be perverse, in one way or another).

> FWIW, except for large multi-join queries with "IN (constant_list)"
> expressions, nearly every individual query I run is as fast or faster in
> 17 than in 16.  The ones that are slower are ones that were previously
> munged around to get a specific plan; those that I've unmunged have
> so far been much faster on 17.

That's great news.

> It just occurred to me while typing this that I should go count joins to
> see if we're exceeding join_collapse_limit.  Could something have
> changed in 17 that would affect how such queries are planned?

It'll be hard (very hard) to establish any relationship between the
overall improvements you're seeing and any changes in the planner.
Anything is possible, but in general planner changes rarely make an
across-the-board difference to most of an application's queries. Plus
the workload itself is probably fairly varied and complicated.

I only suggested that the improvements to the B-Tree code were likely
relevant because you mentioned that your application had many queries
with "IN (constant_list)", often with larger arrays of constants
(hundreds, say). And because I've direct knowledge of across-the-board
speedups for applications with those kinds of queries (the blog post
is the best example of this).

--
Peter Geoghegan



Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From
Todd Cook
Date:
On 9/25/25, 3:47 PM, "Peter Geoghegan" <pg@bowt.ie> wrote:
> Should be noted that the regression itself is definitely "a small
> effect that is multiplied by repetition". This makes the problem
> rather perverse (though it's not exactly unusual for a complaint about
> a planner regression to be perverse, in one way or another).

Thanks for confirming.  We can develop a strategy for working around
this effect, so from my point of view, we can consider this matter
resolved.  I'm guessing that you feel the same way, but if not, I'm
happy to help out in any way I can.

In any case, I do appreciate your efforts and the time you took to
write out the very helpful explanations you provided.

-- todd


Re: Postgres: Queries are too slow after upgrading to PG17 from PG15

From
Peter Geoghegan
Date:
On Fri, Sep 26, 2025 at 8:30 AM Todd Cook <cookt@blackduck.com> wrote:
> Thanks for confirming.  We can develop a strategy for working around
> this effect, so from my point of view, we can consider this matter
> resolved.  I'm guessing that you feel the same way, but if not, I'm
> happy to help out in any way I can.

Yes, I do feel the same way.

I'm going to withdraw my proposed patch, at least for now. If there's
another complaint along the same lines at some point in the future,
then I'll certainly need to reconsider my patch.

--
Peter Geoghegan