Обсуждение: BUG #17964: Missed query planner optimization

Поиск
Список
Период
Сортировка

BUG #17964: Missed query planner optimization

От
PG Bug reporting form
Дата:
The following bug has been logged on the website:

Bug reference:      17964
Logged by:          Mathias Kunter
Email address:      mathiaskunter@gmail.com
PostgreSQL version: 14.8
Operating system:   x86_64-pc-linux-gnu
Description:

In the example below, the query planner uses a sequential scan (query 1)
even though it could use an index scan (query 2).

CREATE TABLE table1 (id SERIAL NOT NULL, name VARCHAR, CONSTRAINT
table1_pkey PRIMARY KEY (id));
CREATE TABLE table2 (id SERIAL NOT NULL, name VARCHAR, CONSTRAINT
table2_pkey PRIMARY KEY (id));
CREATE TABLE table3 (id INTEGER NOT NULL);

INSERT INTO table1 (id, name) SELECT generate_series(1, 10000),
md5(random()::text);
INSERT INTO table2 (id, name) SELECT generate_series(1, 10000),
md5(random()::text);
INSERT INTO table3 (id) VALUES (1538),(8836),(5486),(3464),(2673);

ANALYZE;

EXPLAIN ANALYZE SELECT id, name FROM (SELECT id, name FROM table1 UNION
SELECT id, name FROM table2) AS q
WHERE id IN (SELECT id FROM table3);
---------------------------------------------------------------
Hash Semi Join  (cost=769.11..1227.17 rows=500 width=36) (actual
time=10.952..15.094 rows=10 loops=1)
  Hash Cond: (table1.id = table3.id)
  ->  HashAggregate  (cost=768.00..968.00 rows=20000 width=36) (actual
time=10.757..13.597 rows=20000 loops=1)
        Group Key: table1.id, table1.name
        Batches: 1  Memory Usage: 2577kB
        ->  Append  (cost=0.00..668.00 rows=20000 width=36) (actual
time=0.010..4.135 rows=20000 loops=1)
              ->  Seq Scan on table1  (cost=0.00..184.00 rows=10000
width=37) (actual time=0.009..1.288 rows=10000 loops=1)
              ->  Seq Scan on table2  (cost=0.00..184.00 rows=10000
width=37) (actual time=0.010..1.336 rows=10000 loops=1)
  ->  Hash  (cost=1.05..1.05 rows=5 width=4) (actual time=0.014..0.015
rows=5 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 9kB
        ->  Seq Scan on table3  (cost=0.00..1.05 rows=5 width=4) (actual
time=0.009..0.010 rows=5 loops=1)
Planning Time: 0.444 ms
Execution Time: 15.257 ms

EXPLAIN ANALYZE SELECT id, name FROM (SELECT id, name FROM table1 UNION
SELECT id, name FROM table2) AS q
WHERE id IN (1538,8836,5486,3464,2673);
---------------------------------------------------------------
HashAggregate  (cost=51.23..51.33 rows=10 width=36) (actual
time=0.065..0.067 rows=10 loops=1)
  Group Key: table1.id, table1.name
  Batches: 1  Memory Usage: 24kB
  ->  Append  (cost=0.29..51.18 rows=10 width=36) (actual time=0.025..0.057
rows=10 loops=1)
        ->  Index Scan using table1_pkey on table1  (cost=0.29..25.51 rows=5
width=37) (actual time=0.024..0.038 rows=5 loops=1)
              Index Cond: (id = ANY
('{1538,8836,5486,3464,2673}'::integer[]))
        ->  Index Scan using table2_pkey on table2  (cost=0.29..25.51 rows=5
width=37) (actual time=0.006..0.017 rows=5 loops=1)
              Index Cond: (id = ANY
('{1538,8836,5486,3464,2673}'::integer[]))
Planning Time: 0.170 ms
Execution Time: 0.097 ms


The results are also provided as SQL fiddles here:
https://dbfiddle.uk/xo7fug1o
https://www.db-fiddle.com/f/iUsfpdP2eD8YtdN2Em7Zyu/0


Re: BUG #17964: Missed query planner optimization

От
David Rowley
Дата:
On Wed, 7 Jun 2023 at 04:44, PG Bug reporting form
<noreply@postgresql.org> wrote:
> In the example below, the query planner uses a sequential scan (query 1)
> even though it could use an index scan (query 2).
>
> EXPLAIN ANALYZE SELECT id, name FROM (SELECT id, name FROM table1 UNION
> SELECT id, name FROM table2) AS q
> WHERE id IN (SELECT id FROM table3);

> EXPLAIN ANALYZE SELECT id, name FROM (SELECT id, name FROM table1 UNION
> SELECT id, name FROM table2) AS q
> WHERE id IN (1538,8836,5486,3464,2673);

It's not a bug that the planner does not consider evaluating the join
before the UNION, it's just an optimisation opportunity we don't
currently explore.

If you want that, then write:

EXPLAIN ANALYZE SELECT id, name FROM table1 WHERE id IN (SELECT id
FROM table3) UNION SELECT id, name FROM table2 WHERE id IN (SELECT id
FROM table3);

David



Re: BUG #17964: Missed query planner optimization

От
Mathias Kunter
Дата:
> It's not a bug that the planner does not consider evaluating the join
> before the UNION

Yes, it's not a bug, but it's something which can be improved. If I 
simply change the original query from this:

> SELECT ... WHERE id IN (SELECT ...);

into this:

> SELECT ... WHERE id = ANY(ARRAY(SELECT ...));

then Postgres uses an index scan, and the query is orders of magnitude 
faster. Note that the planner actually correctly computes the estimated 
costs for both variants, since I get:

> cost=769.11..1227.17 when using IN
> cost=86.45..86.65    when using ANY

See https://dbfiddle.uk/iOkiiTJJ

Also note that this issue doesn't only affect UNION queries. For 
example, the following query will also execute orders of magnitude 
faster if I simply replace IN with ANY:

> SELECT * FROM t WHERE x = 'a' OR y IN (SELECT ...);

Again, estimated costs say that using ANY should be faster:

> cost=8.30..2443.31 when using IN
> cost=56.45..350.69 when using ANY

See https://dbfiddle.uk/b9piwQr4

Hence, why doesn't the planner simply test whether it's beneficial to 
replace IN with ANY? It seems that all which has to be done is to 
compare the query plans for both possible execution variants. I guess 
this should be rather simple to implement, isn't it?

Thanks

Mathias


Am 06.06.23 um 23:32 schrieb David Rowley:
> On Wed, 7 Jun 2023 at 04:44, PG Bug reporting form
> <noreply@postgresql.org> wrote:
>> In the example below, the query planner uses a sequential scan (query 1)
>> even though it could use an index scan (query 2).
>>
>> EXPLAIN ANALYZE SELECT id, name FROM (SELECT id, name FROM table1 UNION
>> SELECT id, name FROM table2) AS q
>> WHERE id IN (SELECT id FROM table3);
> 
>> EXPLAIN ANALYZE SELECT id, name FROM (SELECT id, name FROM table1 UNION
>> SELECT id, name FROM table2) AS q
>> WHERE id IN (1538,8836,5486,3464,2673);
> 
> It's not a bug that the plannSer does not consider evaluating the join
> before the UNION, it's just an optimisation opportunity we don't
> currently explore.
> 
> If you want that, then write:
> 
> EXPLAIN ANALYZE SELECT id, name FROM table1 WHERE id IN (SELECT id
> FROM table3) UNION SELECT id, name FROM table2 WHERE id IN (SELECT id
> FROM table3);
> 
> David



Re: BUG #17964: Missed query planner optimization

От
David Rowley
Дата:
On Wed, 7 Jun 2023 at 23:48, Mathias Kunter <mathiaskunter@gmail.com> wrote:
> Yes, it's not a bug, but it's something which can be improved. If I
> simply change the original query from this:
>
> > SELECT ... WHERE id IN (SELECT ...);
>
> into this:
>
> > SELECT ... WHERE id = ANY(ARRAY(SELECT ...));
>
> then Postgres uses an index scan, and the query is orders of magnitude
> faster. Note that the planner actually correctly computes the estimated
> costs for both variants, since I get:

What's going on here is that there is code which will convert
supported IN clauses into semi-joins. The first of your queries has
this done, but the 2nd query does not.  The 2nd query, since the
semi-join conversion is not done, the qual later becomes eligible to
be pushed down into the union subquery meaning the non-matching rows
get filtered before the UNION is evaluated.  We'll never attempt to
push joins (semi or otherwise) down into subqueries, so this is not
done for the first query.

The main problem here is that in some cases the first of your queries
will be faster and in other cases the 2nd will be faster. It'll depend
on how many rows are in each table. So, really to find the fastest
plan, the planner would have to consider both options.  That would
unfortunately mean that we'd have to perform the join search once
without the semi-join pushed down and again with the semi-join pushed
down.  The join search is going to be the slowest part of planning
when there are a few tables to join, so doing that twice could add
quite a bit of overhead to the planner.   You might also then consider
how many times you'd need to perform the join search if there were,
say, 5 IN clauses.  To exhaustively find the best plan we'd need to
try all possible combinations of converting each IN clause to a
semi-join vs leaving the qual alone.  If the main query already had,
say 5 tables to join then that suddenly becomes a hugely costly query
to plan.  Given that, I'm not all that sure you're likely to see us
making any improvements here.  I suggest just rewriting the query in a
way that it executes more quickly for you.

David



Re: BUG #17964: Missed query planner optimization

От
Mathias Kunter
Дата:
Am 07.06.23 um 14:36 schrieb David Rowley:
> On Wed, 7 Jun 2023 at 23:48, Mathias Kunter <mathiaskunter@gmail.com> wrote:
>> If I simply change the original query from this:
>>
>>> SELECT ... WHERE id IN (SELECT ...);
>>
>> into this:
>>
>>> SELECT ... WHERE id = ANY(ARRAY(SELECT ...));
>>
>> then Postgres uses an index scan, and the query is orders of magnitude
>> faster.
> 
> What's going on here is that ...

Thank you very much for the explanation, fully understood.

> The main problem here is that in some cases the first of your queries
> will be faster and in other cases the 2nd will be faster. It'll depend
> on how many rows are in each table. So, really to find the fastest
> plan, the planner would have to consider both options.

Yes, considering both options is exactly what I'm suggesting.

> The join search is going to be the slowest part of planning
> when there are a few tables to join, so doing that twice could add
> quite a bit of overhead to the planner.

Yes, but I'd argue that we have to put that additional planning overhead 
into perspective. The planning time overhead will typically be in the 
order of milliseconds. Also, the additional planning time will not 
depend on the number of rows within the tables.

In contrast to that, the execution time overhead can be in the order of 
minutes, since it will depend on the number of rows which are involved. 
After all, Postgres will unnecessarily execute a sequential scan, and 
the size of the scanned table might be gigabytes.

> You might also then consider
> how many times you'd need to perform the join search if there were,
> say, 5 IN clauses.  To exhaustively find the best plan we'd need to
> try all possible combinations of converting each IN clause to a
> semi-join vs leaving the qual alone.

We don't necessarily have to use exhaustive search, a heuristic would 
already be a major improvement.

For example, use the estimated number of rows returned by the subselect 
of each IN clause. If this number is below a certain threshold (which is 
yet to be defined), then rewrite the corresponding IN clause to ANY. As 
a final verification step, check the rewritten query against the 
original query, and only use the rewritten query if the estimated costs 
are lower.

I think something like this should be possible to do with reasonable 
planner overhead.

> I'm not all that sure you're likely to see us
> making any improvements here.

I'm very sorry to hear this. Please note that I stumbled upon this issue 
through a real-world use case. One of my queries was very slow. I looked 
at the query plan of the affected query, and I saw this:

> Planning Time: 1.267 ms
> Execution Time: 69566.632 ms
> cost=3941783.88..3941783.92

Just by changing one IN clause to ANY, I now see this:

> Planning Time: 1.103 ms
> Execution Time: 0.232 ms 
> cost=1514.95..1515.62

The query time went from over a minute to about a millisecond! I 
couldn't believe it. Here, the overhead to find the better query plan 
would obviously have paid off hugely.

> I suggest just rewriting the query in a
> way that it executes more quickly for you.

Yes, I'm happily doing that now.

The problem is that most other Postgres users don't know that they might 
have to replace IN with ANY in order to massively speed up their 
queries. How should they know? This certainly comes unexpected, and this 
also isn't documented anywhere, as far as I can see.

Thanks

Mathias



Re: BUG #17964: Missed query planner optimization

От
Mathias Kunter
Дата:
I'm sorry, but I have to bring this up again.

As it currently stands, the query planner isn't able to find an adequate 
query plan for simple real-life queries like the following one, where we 
simply want to find books by either author name or publisher name:

> SELECT * FROM book WHERE
> author_id IN (SELECT id FROM author WHERE name = 'some_name') OR
> publisher_id IN (SELECT id FROM publisher WHERE name = 'some_name');

Complete example here: https://dbfiddle.uk/q6_4NuDX

The issue could be fixed quite easily by implementing a heuristic, the 
optimized query will execute a few THOUSAND times faster, most people 
have no clue that they could use ANY(ARRAY()) as a workaround, and still 
this optimization isn't something worth to be implemented?

One of the more complex queries of our project now even executes 300,000 
(that is: THREE HUNDRED THOUSAND) times faster with the optimization 
applied (execution time 72436.509 vs. 0.201 ms).

See original query plan here: https://pastebin.com/raw/JsY1PzG3
See optimized query plan here: https://pastebin.com/raw/Xvq7zUY2

So, please consider implementing this optimization. It will be a HUGE 
performance improvement for a lot of queries used out there. Thank you.

Mathias


Am 07.06.23 um 18:53 schrieb Mathias Kunter:
> Am 07.06.23 um 14:36 schrieb David Rowley:
>> On Wed, 7 Jun 2023 at 23:48, Mathias Kunter <mathiaskunter@gmail.com> 
>> wrote:
>>> If I simply change the original query from this:
>>>
>>>> SELECT ... WHERE id IN (SELECT ...);
>>>
>>> into this:
>>>
>>>> SELECT ... WHERE id = ANY(ARRAY(SELECT ...));
>>>
>>> then Postgres uses an index scan, and the query is orders of magnitude
>>> faster.
>>
>> What's going on here is that ...
> 
> Thank you very much for the explanation, fully understood.
> 
>> The main problem here is that in some cases the first of your queries
>> will be faster and in other cases the 2nd will be faster. It'll depend
>> on how many rows are in each table. So, really to find the fastest
>> plan, the planner would have to consider both options.
> 
> Yes, considering both options is exactly what I'm suggesting.
> 
>> The join search is going to be the slowest part of planning
>> when there are a few tables to join, so doing that twice could add
>> quite a bit of overhead to the planner.
> 
> Yes, but I'd argue that we have to put that additional planning overhead 
> into perspective. The planning time overhead will typically be in the 
> order of milliseconds. Also, the additional planning time will not 
> depend on the number of rows within the tables.
> 
> In contrast to that, the execution time overhead can be in the order of 
> minutes, since it will depend on the number of rows which are involved. 
> After all, Postgres will unnecessarily execute a sequential scan, and 
> the size of the scanned table might be gigabytes.
> 
>> You might also then consider
>> how many times you'd need to perform the join search if there were,
>> say, 5 IN clauses.  To exhaustively find the best plan we'd need to
>> try all possible combinations of converting each IN clause to a
>> semi-join vs leaving the qual alone.
> 
> We don't necessarily have to use exhaustive search, a heuristic would 
> already be a major improvement.
> 
> For example, use the estimated number of rows returned by the subselect 
> of each IN clause. If this number is below a certain threshold (which is 
> yet to be defined), then rewrite the corresponding IN clause to ANY. As 
> a final verification step, check the rewritten query against the 
> original query, and only use the rewritten query if the estimated costs 
> are lower.
> 
> I think something like this should be possible to do with reasonable 
> planner overhead.
> 
>> I'm not all that sure you're likely to see us
>> making any improvements here.
> 
> I'm very sorry to hear this. Please note that I stumbled upon this issue 
> through a real-world use case. One of my queries was very slow. I looked 
> at the query plan of the affected query, and I saw this:
> 
>> Planning Time: 1.267 ms
>> Execution Time: 69566.632 ms
>> cost=3941783.88..3941783.92
> 
> Just by changing one IN clause to ANY, I now see this:
> 
>> Planning Time: 1.103 ms
>> Execution Time: 0.232 ms cost=1514.95..1515.62
> 
> The query time went from over a minute to about a millisecond! I 
> couldn't believe it. Here, the overhead to find the better query plan 
> would obviously have paid off hugely.
> 
>> I suggest just rewriting the query in a
>> way that it executes more quickly for you.
> 
> Yes, I'm happily doing that now.
> 
> The problem is that most other Postgres users don't know that they might 
> have to replace IN with ANY in order to massively speed up their 
> queries. How should they know? This certainly comes unexpected, and this 
> also isn't documented anywhere, as far as I can see.
> 
> Thanks
> 
> Mathias



Re: BUG #17964: Missed query planner optimization

От
David Rowley
Дата:
On Fri, 16 Jun 2023 at 03:53, Mathias Kunter <mathiaskunter@gmail.com> wrote:
> > SELECT * FROM book WHERE
> > author_id IN (SELECT id FROM author WHERE name = 'some_name') OR
> > publisher_id IN (SELECT id FROM publisher WHERE name = 'some_name');
>
> Complete example here: https://dbfiddle.uk/q6_4NuDX
>
> The issue could be fixed quite easily by implementing a heuristic, the
> optimized query will execute a few THOUSAND times faster, most people
> have no clue that they could use ANY(ARRAY()) as a workaround, and still
> this optimization isn't something worth to be implemented?

There are some cases where converting the IN to a semi-join will
significantly improve performance, and as you've shown, there are some
cases where that optimisation causes performance regressions.  We'd
need a way to determine which is cheaper based on estimated costs and
that might be, in my opinion, fairly tricky to do based on the order
of how things currently are done in the query planner.

In my previous emails about this, I was just trying to show that it's
quite a tricky problem to fix properly.  We're certainly open to
patches which fix some of the problems you've mentioned. If you've
looked into it and think it's quite an easy project, then we'll
welcome contributions to improve this. Having the planner consider the
costs of converting the IN to a semi-join and converting or not based
on cost seems like a worthy goal.  The pgsql-hackers as a good
emailing list to bring up the topic and gather up some ideas on how
you might go about fixing it.

David



Re: BUG #17964: Missed query planner optimization

От
Mathias Kunter
Дата:
> We'd need a way to determine which is cheaper based on estimated costs

Yes, that decision must of course be based on estimated costs, and also 
the cost estimation must be reasonably accurate. In all examples which 
I've seen so far, the cost estimations were adequate.

> that might be, in my opinion, fairly tricky to do based on the order of
> how things currently are done in the query planner.

I don't know the details about the current planner implementation, but 
the planner obviously can already correctly estimate costs for both the 
IN and ANY(ARRAY()) variants.

> I was just trying to show that it's quite a tricky problem to
> fix properly.

Yes, the tricky part obviously is to decide whether or not an IN clause 
should be converted to a semi-join, because that decision depends on the 
other IN clauses of the query as well. So, it's unfortunately not 
possible to decide independently for each IN clause.

As you've already pointed out, doing an exhaustive search of all 
possible combinations is obviously problematic, since we'd have to check 
2^N possible query plans (with N being the number of IN clauses of the 
query). So, this is certainly NOT the way to go, unless maybe N<4 or so.

> Having the planner consider the costs of converting the IN to a semi-
> join and converting or not based on cost seems like a worthy goal.

It obviously is. As shown, the performance improvement can be gigantic 
for certain queries.

> If you think it's quite an easy project, then we'll welcome
> contributions to improve this.

I'm not familiar with the current implementation, but my understanding 
is that this should be quite easy to implement for someone who is 
currently working on planner-related issues.

After all, a reasonable heuristic solution can be as simple as this:


1) Create the query plan for the original query, using the current 
planner implementation. (That's the status quo.)

2) Create a modified version of the original input query where all of 
the IN clauses are replaced with ANY(ARRAY()). Create the query plan for 
that modified query version, using the current planner implementation.

3) Compare the two previously created query plans and execute the one 
with the lower estimated total cost. If you don't trust the planner's 
cost estimations for some reason, then only execute the modified version 
if its estimated cost is SIGNIFICANTLY lower.


A simple solution like this would already be good enough to avoid many 
of the horrifying worst case scenarios, where a query takes minutes when 
it should actually only take milliseconds.

It would of course be possible (and desirable) to come up with a more 
sophisticated heuristic than the one given above. For example, create a 
third version of the original query where only those IN clauses are 
replaced with ANY(ARRAY()) where the estimated number of rows returned 
by the subselect is below a certain threshold, and estimate costs for 
that modified query as well.

> The pgsql-hackers as a good emailing list to bring up the topic

Thank you for the hint. I'll continue the discussion over there.

Mathias