Обсуждение: Why isn't PG using an index-only scan?
Hello,
I have this very simple query:
INSERT INTO copyrightad (idoeu, idad, role, mechowned, perfowned, iscontrolled)
SELECT o.idoeu, c.idad, LEFT(sipa_capacity1,3), sipa_mech_owned, sipa_perf_owned, sipa_controlled='Y'
FROM imaestro.msipfl ip
JOIN oeu o ON o.imworkid=ip.sipa_song_code
JOIN ad c ON c.imcompid=ip.sipa_ip_code
WHERE ip.sipa_comp_or_publ='C';
And here are the number of elements in each table:
imaestro.msipfl: 1550019
oeu: 1587533
ad: 304986
The explain plan is saying this:
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Insert on copyrightad (cost=613790.59..4448045.97 rows=0 width=0)
-> Merge Join (cost=613790.59..4448045.97 rows=84972138 width=328)
Merge Cond: (((c.imcompid)::numeric) = ip.sipa_ip_code)
-> Sort (cost=35712.97..36475.44 rows=304986 width=8)
Sort Key: ((c.imcompid)::numeric)
-> Index Only Scan using ix_ad_imcompid on ad c (cost=0.42..7931.21 rows=304986 width=8)
-> Materialize (cost=578077.61..594521.17 rows=3288712 width=23)
-> Sort (cost=578077.61..586299.39 rows=3288712 width=23)
Sort Key: ip.sipa_ip_code
-> Hash Join (cost=56043.04..154644.48 rows=3288712 width=23)
Hash Cond: ((o.imworkid)::numeric = ip.sipa_song_code)
-> Seq Scan on oeu o (cost=0.00..41901.33 rows=1587533 width=8)
-> Hash (cost=48542.24..48542.24 rows=600064 width=25)
-> Seq Scan on msipfl ip (cost=0.00..48542.24 rows=600064 width=25)
Filter: ((sipa_comp_or_publ)::text = 'C'::text)
Table ad has this index:
"ix_ad_imcompid" btree (imcompid, idad)
Table oeu has this one:
"ix_oeu_imcompid" btree (imworkid, idoeu)
idad and idoeu are both primary keys of, respectively, ad and oeu.
The resultset should be 591615 rows because:
select sipa_comp_or_publ, count(*) from imaestro.msipfl group by 1;
sipa_comp_or_publ | count
-------------------+--------
C | 591615
P | 958404
Is it normal that PG is doing a seq scan on table oeu but an index-only scan on ad? I had to stop the query after 5 hours, how can I make this faster? Of course I ran VACUUM ANALYZE.
These are the memory settings I have, but I have plenty of unused RAM. Should I bump something up?
shared_buffers = 16GB # min 128kB
#huge_pages = try # on, off, or try
#huge_page_size = 0 # zero for system default
#temp_buffers = 8MB # min 800kB
#max_prepared_transactions = 0 # zero disables the feature
work_mem = 128MB # min 64kB
$ free -h
total used free shared buff/cache available
Mem: 125Gi 18Gi 3.1Gi 15Gi 121Gi 106Gi
Swap: 4.0Gi 2.0Gi 2.0Gi
Thanks for your help,
JC
On Thu, 18 Sept 2025 at 18:36, Jean-Christophe BOGGIO <postgresql@thefreecat.org> wrote: > Insert on copyrightad (cost=613790.59..4448045.97 rows=0 width=0) > -> Merge Join (cost=613790.59..4448045.97 rows=84972138 width=328) > Merge Cond: (((c.imcompid)::numeric) = ip.sipa_ip_code) > -> Sort (cost=35712.97..36475.44 rows=304986 width=8) > Sort Key: ((c.imcompid)::numeric) > -> Index Only Scan using ix_ad_imcompid on ad c (cost=0.42..7931.21 rows=304986 width=8) > -> Materialize (cost=578077.61..594521.17 rows=3288712 width=23) > -> Sort (cost=578077.61..586299.39 rows=3288712 width=23) > Sort Key: ip.sipa_ip_code > -> Hash Join (cost=56043.04..154644.48 rows=3288712 width=23) > Hash Cond: ((o.imworkid)::numeric = ip.sipa_song_code) > -> Seq Scan on oeu o (cost=0.00..41901.33 rows=1587533 width=8) > -> Hash (cost=48542.24..48542.24 rows=600064 width=25) > -> Seq Scan on msipfl ip (cost=0.00..48542.24 rows=600064 width=25) > Filter: ((sipa_comp_or_publ)::text = 'C'::text) > Is it normal that PG is doing a seq scan on table oeu but an index-only scan on ad? I had to stop the query after 5 hours,how can I make this faster? Of course I ran VACUUM ANALYZE. Yes. Since *all* records of "oeu" are required and they're not required in any particular order, then Seq Scan should be the fastest way to access those records. 5 hours seems very slow for the estimated number of records. Have you tried running the SELECT using EXPLAIN ANALYZE without the INSERT part? Even if the 84 million Merge Join row estimate is accurate, 5 hours seems excessively long. If it still takes a long time, you might try SET enable_mergejoin = 0; and run the EXPLAIN ANALYZE SELECT .. part. That'll at least give us more accurate row counts of what we're actually working with. David
Thanks David, Le 18/09/2025 à 09:20, David Rowley a écrit : > Yes. Since *all* records of "oeu" are required and they're not > required in any particular order, then Seq Scan should be the fastest > way to access those records. Ok but then why is it doing it on the AD table? Is it because of the number of rows? > 5 hours seems very slow for the estimated number of records. Have you > tried running the SELECT using EXPLAIN ANALYZE without the INSERT > part? Even if the 84 million Merge Join row estimate is accurate, 5 > hours seems excessively long. I tried this, it took 1 second. And then I discovered that I had an old trigger on the copyrightad table. The insert took 2 seconds. Very sorry for wasting your time and thanks again for putting me on the right track. Have a great day!
On 18/9/2025 09:20, David Rowley wrote: > On Thu, 18 Sept 2025 at 18:36, Jean-Christophe BOGGIO > If it still takes a long time, you might try SET enable_mergejoin = 0; > and run the EXPLAIN ANALYZE SELECT .. part. That'll at least give us > more accurate row counts of what we're actually working with.This appears to be a good example of a missing feature: thein-execution callback or hook. Imagine if we had a hook within the ExecProcNode. In that scenario, we could create a trivial extension that would stop the query after, let's say, 10 minutes of execution and display the current state. This would give us more reliable data on estimation and the state of the plan tree. What are your thoughts? -- regards, Andrei Lepikhov
On Thu, 18 Sept 2025 at 19:45, Jean-Christophe BOGGIO <postgresql@thefreecat.org> wrote: > Ok but then why is it doing it on the AD table? Is it because of the > number of rows? It's hard to tell as I don't have a clear view on which columns are from which tables. Perhaps "ad" can Index Only Scan because all of the columns required are in an index but with "oeu" there's a column being used that's not indexed? David
On Thu, 18 Sept 2025 at 19:55, Andrei Lepikhov <lepihov@gmail.com> wrote: > Imagine if we had a hook within the ExecProcNode. In that scenario, we > could create a trivial extension that would stop the query after, let's > say, 10 minutes of execution and display the current state. This would > give us more reliable data on estimation and the state of the plan tree. I recall something along those lines existing once in the extension world. Or maybe just from a previous employer. I don't recall many of the details. Maybe something like a function you pass an SQL string and if you cancelled the query it reported the EXPLAIN ANALYZE done so far. I assume it must have done something like LOG it as it couldn't have shown the EXPLAIN as query results on cancel. David
On 18/9/2025 13:35, David Rowley wrote: > On Thu, 18 Sept 2025 at 19:55, Andrei Lepikhov <lepihov@gmail.com> wrote: >> Imagine if we had a hook within the ExecProcNode. In that scenario, we >> could create a trivial extension that would stop the query after, let's >> say, 10 minutes of execution and display the current state. This would >> give us more reliable data on estimation and the state of the plan tree. > > I recall something along those lines existing once in the extension > world. Or maybe just from a previous employer. I don't recall many of > the details. Maybe something like a function you pass an SQL string > and if you cancelled the query it reported the EXPLAIN ANALYZE done so > far. I assume it must have done something like LOG it as it couldn't > have shown the EXPLAIN as query results on cancel. It looks like a makeshift solution. By implementing a callback, we could elevate 'interrupter' to a first-class feature, enabling us to monitor the state of the entire query tree (it is especially cool in EXPLAIN ANALYZE mode when we may be OK with a partial result). What's more interesting if Robert's work on nodes' extensibility successes, it enables modules to save planning decisions in the plan. Thereby, the in-executor callback provides a means to kick off the query planned with incorrect estimations (e.g., too many spilled tuples). Perhaps we should start working on introducing this type of callback/ hook? -- regards, Andrei Lepikhov
On Thu, 18 Sept 2025 at 23:55, Andrei Lepikhov <lepihov@gmail.com> wrote: > It looks like a makeshift solution. By implementing a callback, we could > elevate 'interrupter' to a first-class feature, enabling us to monitor > the state of the entire query tree (it is especially cool in EXPLAIN > ANALYZE mode when we may be OK with a partial result). > What's more interesting if Robert's work on nodes' extensibility > successes, it enables modules to save planning decisions in the plan. > Thereby, the in-executor callback provides a means to kick off the query > planned with incorrect estimations (e.g., too many spilled tuples). > > Perhaps we should start working on introducing this type of callback/ hook? I don't really have a fully formed opinion here, but I expect that if there is something that needs a new hook function, it might be easier to convince everyone that it's a good idea to accept that change if you first write a patch to add the hook, then go off and write the extension that uses it and then come back with that as evidence that the hook is useful enough to be committed to core. There's certainly places where you could add a hook that would just add an unacceptable overhead that we couldn't stomach. I expect ExecProcNode would be one of those places. I do agree that trying to come up with something for this is worthy of some time and effort. Helping people with performance issues when they can't even get EXPLAIN ANALYZE to finish is quite tricky. David
On 19/9/2025 03:05, David Rowley wrote: > On Thu, 18 Sept 2025 at 23:55, Andrei Lepikhov <lepihov@gmail.com> wrote: >> Perhaps we should start working on introducing this type of callback/ hook? > There's certainly places where you could add a hook that would just > add an unacceptable overhead that we couldn't stomach. I expect > ExecProcNode would be one of those places. That seems a bit odd. It is not a significant issue to instruct the compiler to trust the 'else' branch of the "if (PlanState::hook != NULL)" condition and reduce overhead to zero. However, if the hook is set, it indicates that the user (module or extension) genuinely needs it to be called for this specific node. > I do agree that trying to come up with something for this is worthy of > some time and effort. Helping people with performance issues when they > can't even get EXPLAIN ANALYZE to finish is quite tricky.That's all I need; thank you. Starting a PostgreSQL project without positive feedback from a committer usually results in wasted time, unless you are purely conducting research ;). -- regards, Andrei Lepikhov