Re: Properly pathify the union planner

Поиск
Список
Период
Сортировка
От David Rowley
Тема Re: Properly pathify the union planner
Дата
Msg-id CAApHDvqo1rV8O4pMU2-22iTASBXgnm4kbHF6A8_VMqiDR3hG8A@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Properly pathify the union planner  (Richard Guo <guofenglinux@gmail.com>)
Ответы Re: Properly pathify the union planner  (David Rowley <dgrowleyml@gmail.com>)
Список pgsql-hackers
On Mon, 11 Mar 2024 at 19:56, Richard Guo <guofenglinux@gmail.com> wrote:
> * There are cases where the setop_pathkeys of a subquery does not match
> the union_pathkeys generated in generate_union_paths() for sorting by
> the whole target list.  In such cases, even if we have explicitly sorted
> the paths of the subquery to meet the ordering required for its
> setop_pathkeys, we cannot find appropriate ordered paths for
> MergeAppend.  For instance,
>
> explain (costs off)
> (select a, b from t t1 where a = b) UNION (select a, b from t t2 where a = b);
>                         QUERY PLAN
> -----------------------------------------------------------
>  Unique
>    ->  Sort
>          Sort Key: t1.a, t1.b
>          ->  Append
>                ->  Index Only Scan using t_a_b_idx on t t1
>                      Filter: (a = b)
>                ->  Index Only Scan using t_a_b_idx on t t2
>                      Filter: (a = b)
> (8 rows)
>
> (Assume t_a_b_idx is a btree index on t(a, b))
>
> In this query, the setop_pathkeys of the subqueries includes only one
> PathKey because 'a' and 'b' are in the same EC inside the subqueries,
> while the union_pathkeys of the whole query includes two PathKeys, one
> for each target entry.  After we convert the setop_pathkeys to outer
> representation, we'd notice that it does not match union_pathkeys.
> Consequently, we are unable to recognize that the index scan paths are
> already appropriately sorted, leading us to miss the opportunity to
> utilize MergeAppend.
>
> Not sure if this case is common enough to be worth paying attention to.

I've spent almost all day looking into this, which is just enough work
to satisfy myself this *is* future work rather than v17 work.

The reason I feel this is future work rather than work for this patch
is that this is already a limitation of subqueries in general and it's
not unique to union child queries.

For example:

create table ab(a int, b int, primary key(a,b));
insert into ab select x,y from generate_series(1,100)x,generate_series(1,100)y;
vacuum analyze ab;
explain select * from (select a,b from ab where a = 1 order by a,b
limit 10) order by a,b;

The current plan for this is:

                   QUERY PLAN
-------------------------------------------------
 Sort
   Sort Key: ab.a, ab.b
   ->  Limit
         ->  Index Only Scan using ab_pkey on ab
               Index Cond: (a = 1)
(5 rows)

The additional sort isn't required but is added because the outer
query requires the pathkeys {a,b} and the inner query only has the
pathkey {b}. {a} is removed due to it being redundant because of the
const member.   The outer query does not know about the redundant
pathkeys so think the subquery is only sorted by {b} therefore adds
the sort on "a", "b".

The attached 0001 patch (renamed as .txt so it's ignored by the CFbot)
adjusts convert_subquery_pathkeys() to have it look a bit deeper and
try harder to match the path to the outer query's query_pathkeys.
After patching with that, the plan becomes:

                QUERY PLAN
-------------------------------------------
 Limit
   ->  Index Only Scan using ab_pkey on ab
         Index Cond: (a = 1)
(3 rows)

The patch is still incomplete as the matching is quite complex for the
case you mentioned with a=b.  It's a bit late here to start making
that work, but I think the key to make that work is to give
subquery_matches_pathkeys() an extra parameter or 2 for where to start
working on each list and recursively call itself where I've left the
TODO comment in the function and on the recursive call, try the next
query_pathkeys and the same sub_pathkey. If the recursive call
function returns false, continue on trying to match the normal way. If
it returns true, return true.

There'd be a bit more work elsewhere to do to make this work for the
general case.  For example:

explain (costs off) select * from (select a,b from ab where a = 1
offset 0) order by a,b;

still produces the following plan with the patched version:

                QUERY PLAN
-------------------------------------------
 Sort
   Sort Key: ab.a, ab.b
   ->  Index Only Scan using ab_pkey on ab
         Index Cond: (a = 1)
(4 rows)

the reason for this is that the subquery does not know the outer query
would like the index path to have the pathkeys  {a,b}.  I've not
looked at this issue in detail but I suspect we could do something
somewhere like set_subquery_pathlist() to check if the outer query's
query_pathkeys are all Vars from the subquery.  I think we'd need to
trawl through the eclasses to ensure that each query_pathkeys eclasses
contains a member mentioning only a Var from the subquery and if so,
get the subquery to set those pathkeys.

Anyway, this is all at least PG18 work so I'll raise a thread about it
around June.  The patch is included in case you want to mess around
with it. I'd be happy if you want to look into the subquery pathkey
issue portion of the work. I won't be giving this much more focus
until after the freeze.

> * In build_setop_child_paths() we also create properly sorted partial
> paths, which seems not necessary because we do not support parallel
> merge append, right?

Yeah. Thanks for noticing that. Removing all that saves quite a bit more code.

> * Another is minor and relates to cosmetic matters.  When we unique-ify
> the result of a UNION, we take the number of distinct groups as equal to
> the total input size.  For the Append path and Gather path, we use
> 'dNumGroups', which is 'rows' of the Append path.  For the MergeAppend
> we use 'rows' of the MergeAppend path.  I believe they are supposed to
> be the same, but I think it'd be better to keep them consistent: either
> use 'dNumGroups' for all the three kinds of paths, or use 'path->rows'
> for each path.

Yeah, that should use dNumGroups. Well spotted.

I've attached v3.

David

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Amit Kapila
Дата:
Сообщение: Re: speed up a logical replica setup
Следующее
От: Andrei Lepikhov
Дата:
Сообщение: Re: a wrong index choose when statistics is out of date