Обсуждение: BUG #17922: ANY vs IN execution plan difference for a single explicit input value

Поиск
Список
Период
Сортировка

BUG #17922: ANY vs IN execution plan difference for a single explicit input value

От
PG Bug reporting form
Дата:
The following bug has been logged on the website:

Bug reference:      17922
Logged by:          Kaarel Moppel
Email address:      kaarel.moppel@gmail.com
PostgreSQL version: 15.2
Operating system:   Ubuntu 22.04
Description:

I guess there can be differences how IN vs ANY input is handled for prepared
statements but for plain SQL input I would expect them to produce identical
plans. Seems to be not the case though for a below scenario which we
stumbled on in production where there's a runtime difference of around 5
orders of magnitude as a different leading index is chosen. A repeatable
test case:

CREATE UNLOGGED TABLE a (
    node_id int8 NOT NULL GENERATED BY DEFAULT AS IDENTITY,
    project_id int8 NOT NULL
);

INSERT INTO a (project_id)
SELECT
    CASE WHEN random() < 0.99 THEN
        1
    ELSE
        2
    END
FROM
    generate_series(1, 8e6);

CREATE UNIQUE INDEX a_uq ON a (project_id, node_id);

VACUUM ANALYZE a;

CREATE UNLOGGED TABLE b (
    node_id int8 NOT NULL UNIQUE
);

INSERT INTO b (node_id)
SELECT
    node_id
FROM (
    SELECT
        node_id
    FROM
        a
    WHERE
        node_id > 7e6
    LIMIT 15000) x;

VACUUM ANALYZE b;

EXPLAIN
SELECT
    a.*
FROM
    a
    JOIN b USING (node_id)
WHERE
    a.project_id = ANY (ARRAY[1]::int8[])
ORDER BY
    a.project_id,
    a.node_id
LIMIT 10;

EXPLAIN
SELECT
    a.*
FROM
    a
    JOIN b USING (node_id)
WHERE
    a.project_id IN (1) -- Also OK with a plain "=1" witout IN
ORDER BY
    a.project_id,
    a.node_id
LIMIT 10;

                                        QUERY PLAN
              
───────────────────────────────────────────────────────────────────────────────────────────
 Limit  (cost=0.72..1789.68 rows=10 width=16)
   ->  Nested Loop  (cost=0.72..2655175.45 rows=14842 width=16)
         ->  Index Only Scan using a_uq on a  (cost=0.43..260452.19
rows=7915872 width=16)
               Index Cond: (project_id = ANY ('{1}'::bigint[]))
         ->  Index Only Scan using b_node_id_key on b  (cost=0.29..0.30
rows=1 width=8)
               Index Cond: (node_id = a.node_id)
(6 rows)

                                          QUERY PLAN
                 
──────────────────────────────────────────────────────────────────────────────────────────────
 Limit  (cost=0.72..38.07 rows=10 width=16)
   ->  Nested Loop  (cost=0.72..55440.79 rows=14842 width=16)
         ->  Index Only Scan using b_node_id_key on b  (cost=0.29..397.29
rows=15000 width=8)
         ->  Index Only Scan using a_uq on a  (cost=0.43..3.67 rows=1
width=16)
               Index Cond: ((project_id = 1) AND (node_id = b.node_id))
(5 rows)


Re: BUG #17922: ANY vs IN execution plan difference for a single explicit input value

От
Tom Lane
Дата:
PG Bug reporting form <noreply@postgresql.org> writes:
> I guess there can be differences how IN vs ANY input is handled for prepared
> statements but for plain SQL input I would expect them to produce identical
> plans.

If you can find any such promise in the documentation, I will be glad
to remove it ;-)

> EXPLAIN
> SELECT
>     a.*
> FROM
>     a
>     JOIN b USING (node_id)
> WHERE
>     a.project_id = ANY (ARRAY[1]::int8[])
> ORDER BY
>     a.project_id,
>     a.node_id
> LIMIT 10;

> EXPLAIN
> SELECT
>     a.*
> FROM
>     a
>     JOIN b USING (node_id)
> WHERE
>     a.project_id IN (1) -- Also OK with a plain "=1" witout IN
> ORDER BY
>     a.project_id,
>     a.node_id
> LIMIT 10;

The thing about "project_id IN (1)", which is reduced to "project_id = 1"
by the parser, is that the planner can see that that renders the "ORDER
BY a.project_id" clause a no-op.  So it only has to order by node_id,
which is what makes this plan valid:

>  Limit  (cost=0.72..38.07 rows=10 width=16)
>    ->  Nested Loop  (cost=0.72..55440.79 rows=14842 width=16)
>          ->  Index Only Scan using b_node_id_key on b  (cost=0.29..397.29 rows=15000 width=8)
>          ->  Index Only Scan using a_uq on a  (cost=0.43..3.67 rows=1 width=16)
>                Index Cond: ((project_id = 1) AND (node_id = b.node_id))

If you don't recognize that then you're forced into sorting the indexscan
output, which is going to look pretty bad for a small-LIMIT situation.

We don't make any attempt to make a similar deduction from =ANY clauses,
mainly because it's usually not possible to be sure that the array has
only one member.  I'm not excited about the cost/benefit ratio of adding
code to check for that.

            regards, tom lane