Discussion: BUG #17964: Missed query planner optimization
The following bug has been logged on the website:

Bug reference:      17964
Logged by:          Mathias Kunter
Email address:      mathiaskunter@gmail.com
PostgreSQL version: 14.8
Operating system:   x86_64-pc-linux-gnu
Description:

In the example below, the query planner uses a sequential scan (query 1)
even though it could use an index scan (query 2).

CREATE TABLE table1 (id SERIAL NOT NULL, name VARCHAR, CONSTRAINT table1_pkey PRIMARY KEY (id));
CREATE TABLE table2 (id SERIAL NOT NULL, name VARCHAR, CONSTRAINT table2_pkey PRIMARY KEY (id));
CREATE TABLE table3 (id INTEGER NOT NULL);

INSERT INTO table1 (id, name) SELECT generate_series(1, 10000), md5(random()::text);
INSERT INTO table2 (id, name) SELECT generate_series(1, 10000), md5(random()::text);
INSERT INTO table3 (id) VALUES (1538),(8836),(5486),(3464),(2673);

ANALYZE;

EXPLAIN ANALYZE SELECT id, name FROM (SELECT id, name FROM table1 UNION SELECT id, name FROM table2) AS q
WHERE id IN (SELECT id FROM table3);
---------------------------------------------------------------
Hash Semi Join  (cost=769.11..1227.17 rows=500 width=36) (actual time=10.952..15.094 rows=10 loops=1)
  Hash Cond: (table1.id = table3.id)
  ->  HashAggregate  (cost=768.00..968.00 rows=20000 width=36) (actual time=10.757..13.597 rows=20000 loops=1)
        Group Key: table1.id, table1.name
        Batches: 1  Memory Usage: 2577kB
        ->  Append  (cost=0.00..668.00 rows=20000 width=36) (actual time=0.010..4.135 rows=20000 loops=1)
              ->  Seq Scan on table1  (cost=0.00..184.00 rows=10000 width=37) (actual time=0.009..1.288 rows=10000 loops=1)
              ->  Seq Scan on table2  (cost=0.00..184.00 rows=10000 width=37) (actual time=0.010..1.336 rows=10000 loops=1)
  ->  Hash  (cost=1.05..1.05 rows=5 width=4) (actual time=0.014..0.015 rows=5 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 9kB
        ->  Seq Scan on table3  (cost=0.00..1.05 rows=5 width=4) (actual time=0.009..0.010 rows=5 loops=1)
Planning Time: 0.444 ms
Execution Time: 15.257 ms

EXPLAIN ANALYZE SELECT id, name FROM (SELECT id, name FROM table1 UNION SELECT id, name FROM table2) AS q
WHERE id IN (1538,8836,5486,3464,2673);
---------------------------------------------------------------
HashAggregate  (cost=51.23..51.33 rows=10 width=36) (actual time=0.065..0.067 rows=10 loops=1)
  Group Key: table1.id, table1.name
  Batches: 1  Memory Usage: 24kB
  ->  Append  (cost=0.29..51.18 rows=10 width=36) (actual time=0.025..0.057 rows=10 loops=1)
        ->  Index Scan using table1_pkey on table1  (cost=0.29..25.51 rows=5 width=37) (actual time=0.024..0.038 rows=5 loops=1)
              Index Cond: (id = ANY ('{1538,8836,5486,3464,2673}'::integer[]))
        ->  Index Scan using table2_pkey on table2  (cost=0.29..25.51 rows=5 width=37) (actual time=0.006..0.017 rows=5 loops=1)
              Index Cond: (id = ANY ('{1538,8836,5486,3464,2673}'::integer[]))
Planning Time: 0.170 ms
Execution Time: 0.097 ms

The results are also provided as SQL fiddles here:
https://dbfiddle.uk/xo7fug1o
https://www.db-fiddle.com/f/iUsfpdP2eD8YtdN2Em7Zyu/0
On Wed, 7 Jun 2023 at 04:44, PG Bug reporting form <noreply@postgresql.org> wrote:
> In the example below, the query planner uses a sequential scan (query 1)
> even though it could use an index scan (query 2).
>
> EXPLAIN ANALYZE SELECT id, name FROM (SELECT id, name FROM table1 UNION
> SELECT id, name FROM table2) AS q
> WHERE id IN (SELECT id FROM table3);
>
> EXPLAIN ANALYZE SELECT id, name FROM (SELECT id, name FROM table1 UNION
> SELECT id, name FROM table2) AS q
> WHERE id IN (1538,8836,5486,3464,2673);

It's not a bug that the planner does not consider evaluating the join
before the UNION, it's just an optimisation opportunity we don't
currently explore.

If you want that, then write:

EXPLAIN ANALYZE SELECT id, name FROM table1 WHERE id IN (SELECT id
FROM table3) UNION SELECT id, name FROM table2 WHERE id IN (SELECT id
FROM table3);

David
> It's not a bug that the planner does not consider evaluating the join
> before the UNION

Yes, it's not a bug, but it's something which can be improved. If I
simply change the original query from this:

> SELECT ... WHERE id IN (SELECT ...);

into this:

> SELECT ... WHERE id = ANY(ARRAY(SELECT ...));

then Postgres uses an index scan, and the query is orders of magnitude
faster. Note that the planner actually correctly computes the estimated
costs for both variants, since I get:

> cost=769.11..1227.17 when using IN
> cost=86.45..86.65 when using ANY

See https://dbfiddle.uk/iOkiiTJJ

Also note that this issue doesn't only affect UNION queries. For
example, the following query will also execute orders of magnitude
faster if I simply replace IN with ANY:

> SELECT * FROM t WHERE x = 'a' OR y IN (SELECT ...);

Again, the estimated costs say that using ANY should be faster:

> cost=8.30..2443.31 when using IN
> cost=56.45..350.69 when using ANY

See https://dbfiddle.uk/b9piwQr4

Hence, why doesn't the planner simply test whether it's beneficial to
replace IN with ANY? It seems that all that has to be done is to compare
the query plans for both possible execution variants. I guess this
should be rather simple to implement, shouldn't it?

Thanks

Mathias

On 06.06.23 at 23:32, David Rowley wrote:
> On Wed, 7 Jun 2023 at 04:44, PG Bug reporting form
> <noreply@postgresql.org> wrote:
>> In the example below, the query planner uses a sequential scan (query 1)
>> even though it could use an index scan (query 2).
>>
>> EXPLAIN ANALYZE SELECT id, name FROM (SELECT id, name FROM table1 UNION
>> SELECT id, name FROM table2) AS q
>> WHERE id IN (SELECT id FROM table3);
>
>> EXPLAIN ANALYZE SELECT id, name FROM (SELECT id, name FROM table1 UNION
>> SELECT id, name FROM table2) AS q
>> WHERE id IN (1538,8836,5486,3464,2673);
>
> It's not a bug that the planner does not consider evaluating the join
> before the UNION, it's just an optimisation opportunity we don't
> currently explore.
>
> If you want that, then write:
>
> EXPLAIN ANALYZE SELECT id, name FROM table1 WHERE id IN (SELECT id
> FROM table3) UNION SELECT id, name FROM table2 WHERE id IN (SELECT id
> FROM table3);
>
> David
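[Editor's note: a sketch of the rewrite discussed above, applied to the tables from the original report. The two forms return the same rows here because table3.id is declared NOT NULL; with NULLs in the subselect the two forms can behave differently.]

```sql
-- Original form: the IN clause is converted to a semi-join, which is
-- never pushed down into the UNION subquery, so both union arms are
-- scanned sequentially and aggregated first.
SELECT id, name
FROM (SELECT id, name FROM table1
      UNION
      SELECT id, name FROM table2) AS q
WHERE id IN (SELECT id FROM table3);

-- Rewritten form: ARRAY(SELECT ...) is evaluated once up front, and
-- the resulting "id = ANY('{...}')" qual is an ordinary filter that
-- can be pushed down into each union arm, enabling index scans on
-- table1_pkey and table2_pkey.
SELECT id, name
FROM (SELECT id, name FROM table1
      UNION
      SELECT id, name FROM table2) AS q
WHERE id = ANY(ARRAY(SELECT id FROM table3));
```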
On Wed, 7 Jun 2023 at 23:48, Mathias Kunter <mathiaskunter@gmail.com> wrote:
> Yes, it's not a bug, but it's something which can be improved. If I
> simply change the original query from this:
>
> > SELECT ... WHERE id IN (SELECT ...);
>
> into this:
>
> > SELECT ... WHERE id = ANY(ARRAY(SELECT ...));
>
> then Postgres uses an index scan, and the query is orders of magnitude
> faster. Note that the planner actually correctly computes the estimated
> costs for both variants, since I get:

What's going on here is that there is code which will convert supported
IN clauses into semi-joins. This is done for the first of your queries,
but not for the 2nd. In the 2nd query, since the semi-join conversion is
not done, the qual later becomes eligible to be pushed down into the
union subquery, meaning the non-matching rows get filtered out before
the UNION is evaluated. We'll never attempt to push joins (semi or
otherwise) down into subqueries, so this is not done for the first
query.

The main problem here is that in some cases the first of your queries
will be faster and in other cases the 2nd will be faster. It'll depend
on how many rows are in each table. So, really, to find the fastest
plan, the planner would have to consider both options. That would
unfortunately mean we'd have to perform the join search once without the
semi-join pushed down and again with the semi-join pushed down. The join
search is going to be the slowest part of planning when there are a few
tables to join, so doing it twice could add quite a bit of overhead to
the planner.

You might also then consider how many times you'd need to perform the
join search if there were, say, 5 IN clauses. To exhaustively find the
best plan we'd need to try all possible combinations of converting each
IN clause to a semi-join vs leaving the qual alone. If the main query
already had, say, 5 tables to join, then that suddenly becomes a hugely
costly query to plan.
Given that, I'm not all that sure you're likely to see us making any
improvements here. I suggest just rewriting the query in a way that it
executes more quickly for you.

David
On 07.06.23 at 14:36, David Rowley wrote:
> On Wed, 7 Jun 2023 at 23:48, Mathias Kunter <mathiaskunter@gmail.com> wrote:
>> If I simply change the original query from this:
>>
>>> SELECT ... WHERE id IN (SELECT ...);
>>
>> into this:
>>
>>> SELECT ... WHERE id = ANY(ARRAY(SELECT ...));
>>
>> then Postgres uses an index scan, and the query is orders of magnitude
>> faster.
>
> What's going on here is that ...

Thank you very much for the explanation, fully understood.

> The main problem here is that in some cases the first of your queries
> will be faster and in other cases the 2nd will be faster. It'll depend
> on how many rows are in each table. So, really to find the fastest
> plan, the planner would have to consider both options.

Yes, considering both options is exactly what I'm suggesting.

> The join search is going to be the slowest part of planning
> when there are a few tables to join, so doing that twice could add
> quite a bit of overhead to the planner.

Yes, but I'd argue that we have to put that additional planning overhead
into perspective. The planning time overhead will typically be in the
order of milliseconds. Also, the additional planning time will not
depend on the number of rows within the tables.

In contrast to that, the execution time overhead can be in the order of
minutes, since it will depend on the number of rows which are involved.
After all, Postgres will unnecessarily execute a sequential scan, and
the size of the scanned table might be gigabytes.

> You might also then consider
> how many times you'd need to perform the join search if there were,
> say, 5 IN clauses. To exhaustively find the best plan we'd need to
> try all possible combinations of converting each IN clause to a
> semi-join vs leaving the qual alone.

We don't necessarily have to use an exhaustive search; a heuristic would
already be a major improvement. For example, use the estimated number of
rows returned by the subselect of each IN clause.
If this number is below a certain threshold (which is yet to be
defined), then rewrite the corresponding IN clause to ANY. As a final
verification step, check the rewritten query against the original query,
and only use the rewritten query if its estimated costs are lower. I
think something like this should be possible to do with reasonable
planner overhead.

> I'm not all that sure you're likely to see us
> making any improvements here.

I'm very sorry to hear this. Please note that I stumbled upon this issue
through a real-world use case. One of my queries was very slow. I looked
at the query plan of the affected query, and I saw this:

> Planning Time: 1.267 ms
> Execution Time: 69566.632 ms
> cost=3941783.88..3941783.92

Just by changing one IN clause to ANY, I now see this:

> Planning Time: 1.103 ms
> Execution Time: 0.232 ms
> cost=1514.95..1515.62

The query time went from over a minute to about a millisecond! I
couldn't believe it. Here, the overhead to find the better query plan
would obviously have paid off hugely.

> I suggest just rewriting the query in a
> way that it executes more quickly for you.

Yes, I'm happily doing that now.

The problem is that most other Postgres users don't know that they might
have to replace IN with ANY in order to massively speed up their
queries. How should they know? This certainly comes unexpected, and it
also isn't documented anywhere, as far as I can see.

Thanks

Mathias
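[Editor's note: the row estimate that the proposed threshold heuristic would consult is already visible today via plain EXPLAIN. A sketch against the tables from the original report; the cost figures in the comments are the ones quoted earlier in this thread.]

```sql
-- The subselect's estimated row count (the proposed threshold input);
-- the report's plan shows rows=5 for this scan.
EXPLAIN SELECT id FROM table3;

-- Top-node estimated cost of the IN variant
-- (cost=769.11..1227.17 in the original report):
EXPLAIN SELECT id, name
FROM (SELECT id, name FROM table1 UNION SELECT id, name FROM table2) AS q
WHERE id IN (SELECT id FROM table3);

-- Top-node estimated cost of the ANY(ARRAY()) variant
-- (cost=86.45..86.65 per the fiddle linked above):
EXPLAIN SELECT id, name
FROM (SELECT id, name FROM table1 UNION SELECT id, name FROM table2) AS q
WHERE id = ANY(ARRAY(SELECT id FROM table3));
```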
I'm sorry, but I have to bring this up again.

As it currently stands, the query planner isn't able to find an adequate
query plan for simple real-life queries like the following one, where we
simply want to find books by either author name or publisher name:

> SELECT * FROM book WHERE
> author_id IN (SELECT id FROM author WHERE name = 'some_name') OR
> publisher_id IN (SELECT id FROM publisher WHERE name = 'some_name');

Complete example here: https://dbfiddle.uk/q6_4NuDX

The issue could be fixed quite easily by implementing a heuristic, the
optimized query will execute a few THOUSAND times faster, most people
have no clue that they could use ANY(ARRAY()) as a workaround, and still
this optimization isn't worth implementing?

One of the more complex queries of our project now even executes 300,000
(that is: THREE HUNDRED THOUSAND) times faster with the optimization
applied (execution time 72436.509 ms vs. 0.201 ms).

See the original query plan here: https://pastebin.com/raw/JsY1PzG3
See the optimized query plan here: https://pastebin.com/raw/Xvq7zUY2

So, please consider implementing this optimization. It will be a HUGE
performance improvement for a lot of queries used out there. Thank you.

Mathias

On 07.06.23 at 18:53, Mathias Kunter wrote:
> On 07.06.23 at 14:36, David Rowley wrote:
>> On Wed, 7 Jun 2023 at 23:48, Mathias Kunter <mathiaskunter@gmail.com>
>> wrote:
>>> If I simply change the original query from this:
>>>
>>>> SELECT ... WHERE id IN (SELECT ...);
>>>
>>> into this:
>>>
>>>> SELECT ... WHERE id = ANY(ARRAY(SELECT ...));
>>>
>>> then Postgres uses an index scan, and the query is orders of magnitude
>>> faster.
>>
>> What's going on here is that ...
>
> Thank you very much for the explanation, fully understood.
>
>> The main problem here is that in some cases the first of your queries
>> will be faster and in other cases the 2nd will be faster. It'll depend
>> on how many rows are in each table. So, really to find the fastest
>> plan, the planner would have to consider both options.
>
> Yes, considering both options is exactly what I'm suggesting.
>
>> The join search is going to be the slowest part of planning
>> when there are a few tables to join, so doing that twice could add
>> quite a bit of overhead to the planner.
>
> Yes, but I'd argue that we have to put that additional planning overhead
> into perspective. The planning time overhead will typically be in the
> order of milliseconds. Also, the additional planning time will not
> depend on the number of rows within the tables.
>
> In contrast to that, the execution time overhead can be in the order of
> minutes, since it will depend on the number of rows which are involved.
> After all, Postgres will unnecessarily execute a sequential scan, and
> the size of the scanned table might be gigabytes.
>
>> You might also then consider
>> how many times you'd need to perform the join search if there were,
>> say, 5 IN clauses. To exhaustively find the best plan we'd need to
>> try all possible combinations of converting each IN clause to a
>> semi-join vs leaving the qual alone.
>
> We don't necessarily have to use exhaustive search, a heuristic would
> already be a major improvement.
>
> For example, use the estimated number of rows returned by the subselect
> of each IN clause. If this number is below a certain threshold (which is
> yet to be defined), then rewrite the corresponding IN clause to ANY. As
> a final verification step, check the rewritten query against the
> original query, and only use the rewritten query if the estimated costs
> are lower.
>
> I think something like this should be possible to do with reasonable
> planner overhead.
>
>> I'm not all that sure you're likely to see us
>> making any improvements here.
>
> I'm very sorry to hear this. Please note that I stumbled upon this issue
> through a real-world use case. One of my queries was very slow. I looked
> at the query plan of the affected query, and I saw this:
>
>> Planning Time: 1.267 ms
>> Execution Time: 69566.632 ms
>> cost=3941783.88..3941783.92
>
> Just by changing one IN clause to ANY, I now see this:
>
>> Planning Time: 1.103 ms
>> Execution Time: 0.232 ms
>> cost=1514.95..1515.62
>
> The query time went from over a minute to about a millisecond! I
> couldn't believe it. Here, the overhead to find the better query plan
> would obviously have paid off hugely.
>
>> I suggest just rewriting the query in a
>> way that it executes more quickly for you.
>
> Yes, I'm happily doing that now.
>
> The problem is that most other Postgres users don't know that they might
> have to replace IN with ANY in order to massively speed up their
> queries. How should they know? This certainly comes unexpected, and this
> also isn't documented anywhere, as far as I can see.
>
> Thanks
>
> Mathias
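[Editor's note: a sketch of the ANY(ARRAY()) workaround applied to the book query above. It assumes the author/publisher subselects return no NULLs, and the comment about index usage assumes indexes exist on book(author_id) and book(publisher_id).]

```sql
-- Each ARRAY(SELECT ...) is evaluated once up front, so the OR of the
-- two "= ANY" quals becomes an ordinary indexable filter on book
-- (e.g. a BitmapOr over the two indexes) instead of a qual that the
-- planner refuses to convert because of the OR.
SELECT * FROM book WHERE
  author_id    = ANY(ARRAY(SELECT id FROM author    WHERE name = 'some_name')) OR
  publisher_id = ANY(ARRAY(SELECT id FROM publisher WHERE name = 'some_name'));
```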
On Fri, 16 Jun 2023 at 03:53, Mathias Kunter <mathiaskunter@gmail.com> wrote:
> > SELECT * FROM book WHERE
> > author_id IN (SELECT id FROM author WHERE name = 'some_name') OR
> > publisher_id IN (SELECT id FROM publisher WHERE name = 'some_name');
>
> Complete example here: https://dbfiddle.uk/q6_4NuDX
>
> The issue could be fixed quite easily by implementing a heuristic, the
> optimized query will execute a few THOUSAND times faster, most people
> have no clue that they could use ANY(ARRAY()) as a workaround, and still
> this optimization isn't something worth to be implemented?

There are some cases where converting the IN to a semi-join will
significantly improve performance, and, as you've shown, there are some
cases where that optimisation causes performance regressions. We'd need
a way to determine which is cheaper based on estimated costs, and that
might, in my opinion, be fairly tricky to do given the order in which
things are currently done in the query planner.

In my previous emails about this, I was just trying to show that it's
quite a tricky problem to fix properly. We're certainly open to patches
which fix some of the problems you've mentioned. If you've looked into
it and think it's quite an easy project, then we'll welcome
contributions to improve this. Having the planner consider the costs of
converting the IN to a semi-join and converting or not based on cost
seems like a worthy goal.

pgsql-hackers is a good mailing list on which to bring up the topic and
gather some ideas on how you might go about fixing it.

David
> We'd need a way to determine which is cheaper based on estimated costs

Yes, that decision must of course be based on estimated costs, and the
cost estimation must also be reasonably accurate. In all examples which
I've seen so far, the cost estimations were adequate.

> that might be, in my opinion, fairly tricky to do based on the order of
> how things currently are done in the query planner.

I don't know the details of the current planner implementation, but the
planner obviously can already correctly estimate costs for both the IN
and ANY(ARRAY()) variants.

> I was just trying to show that it's quite a tricky problem to
> fix properly.

Yes, the tricky part obviously is deciding whether or not an IN clause
should be converted to a semi-join, because that decision depends on the
other IN clauses of the query as well. So, it's unfortunately not
possible to decide independently for each IN clause.

As you've already pointed out, doing an exhaustive search of all
possible combinations is obviously problematic, since we'd have to check
2^N possible query plans (with N being the number of IN clauses of the
query). So, this is certainly NOT the way to go, unless maybe N < 4 or
so.

> Having the planner consider the costs of converting the IN to a semi-
> join and converting or not based on cost seems like a worthy goal.

It obviously is. As shown, the performance improvement can be gigantic
for certain queries.

> If you think it's quite an easy project, then we'll welcome
> contributions to improve this.

I'm not familiar with the current implementation, but my understanding
is that this should be quite easy to implement for someone who is
currently working on planner-related issues. After all, a reasonable
heuristic solution can be as simple as this:

1) Create the query plan for the original query, using the current
planner implementation. (That's the status quo.)
2) Create a modified version of the original input query where all of
the IN clauses are replaced with ANY(ARRAY()). Create the query plan for
that modified query version, using the current planner implementation.

3) Compare the two previously created query plans and execute the one
with the lower estimated total cost. If you don't trust the planner's
cost estimations for some reason, then only execute the modified version
if its estimated cost is SIGNIFICANTLY lower.

A simple solution like this would already be good enough to avoid many
of the horrifying worst-case scenarios, where a query takes minutes when
it should actually only take milliseconds.

It would of course be possible (and desirable) to come up with a more
sophisticated heuristic than the one given above. For example, create a
third version of the original query where only those IN clauses are
replaced with ANY(ARRAY()) whose subselect's estimated row count is
below a certain threshold, and estimate costs for that modified query as
well.

> pgsql-hackers is a good mailing list on which to bring up the topic

Thank you for the hint. I'll continue the discussion over there.

Mathias
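[Editor's note: steps 1-3 above can be emulated from the client side today: plan both forms with EXPLAIN and execute whichever reports the lower total cost. A sketch against the tables from the original report; EXPLAIN (FORMAT JSON) exposes the top plan node's "Total Cost" field, which the application would compare.]

```sql
-- Step 1: plan the original query and note the top node's "Total Cost".
EXPLAIN (FORMAT JSON)
SELECT id, name
FROM (SELECT id, name FROM table1 UNION SELECT id, name FROM table2) AS q
WHERE id IN (SELECT id FROM table3);

-- Step 2: plan the rewritten query the same way.
EXPLAIN (FORMAT JSON)
SELECT id, name
FROM (SELECT id, name FROM table1 UNION SELECT id, name FROM table2) AS q
WHERE id = ANY(ARRAY(SELECT id FROM table3));

-- Step 3: the client compares the two "Total Cost" values and sends
-- the cheaper form for execution. Planning happens twice, which is
-- exactly the overhead discussed earlier in the thread.
```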