Re: Removing unneeded self joins

Поиск

Список

Период

Сортировка

От	Andrei Lepikhov
Тема	Re: Removing unneeded self joins
Дата	19 октября 2023 г. 03:16:05
Msg-id	1bd80083-a13e-4e9f-a7ce-73c3faab521c@postgrespro.ru обсуждение исходный текст
Ответ на	Re: Removing unneeded self joins (Alexander Korotkov <aekorotkov@gmail.com>)
Ответы	Re: Removing unneeded self joins
Список	pgsql-hackers

Дерево обсуждения

On 19/10/2023 01:50, Alexander Korotkov wrote:
> On Mon, Oct 16, 2023 at 11:28 AM Andrei Lepikhov
> <a.lepikhov@postgrespro.ru> wrote:
>> On 12/10/2023 18:32, Alexander Korotkov wrote:
>>> On Thu, Oct 5, 2023 at 12:17 PM Andrei Lepikhov
>>> <a.lepikhov@postgrespro.ru> wrote:
>>>> On 4/10/2023 14:34, Alexander Korotkov wrote:
>>>>>    > Relid replacement machinery is the most contradictory code here. We used
>>>>>    > a utilitarian approach and implemented a simplistic variant.
>>>>>
>>>>>    > > 2) It would be nice to skip the insertion of IS NOT NULL checks when
>>>>>    > > they are not necessary.  [1] points that infrastructure from [2] might
>>>>>    > > be useful.  The patchset from [2] seems committed mow.  However, I
>>>>>    > > can't see it is directly helpful in this matter.  Could we just skip
>>>>>    > > adding IS NOT NULL clause for the columns, that have
>>>>>    > > pg_attribute.attnotnull set?
>>>>>    > Thanks for the links, I will look into that case.
>>>> To be more precise, in the attachment, you can find a diff to the main
>>>> patch, which shows the volume of changes to achieve the desired behaviour.
>>>> Some explains in regression tests shifted. So, I've made additional tests:
>>>>
>>>> DROP TABLE test CASCADE;
>>>> CREATE TABLE test (a int, b int not null);
>>>> CREATE UNIQUE INDEX abc ON test(b);
>>>> explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a)
>>>> WHERE t1.b=t2.b;
>>>> CREATE UNIQUE INDEX abc1 ON test(a,b);
>>>> explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a)
>>>> WHERE t1.b=t2.b;
>>>> explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a)
>>>> WHERE t1.b=t2.b AND (t1.a=t2.a OR t2.a=t1.a);
>>>> DROP INDEX abc1;
>>>> explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a)
>>>> WHERE t1.b=t2.b AND (t1.b=t2.b OR t2.b=t1.b);
>>>>
>>>> We have almost the results we wanted to have. But in the last explain
>>>> you can see that nothing happened with the OR clause. We should use the
>>>> expression mutator instead of walker to handle such clauses. But It
>>>> doesn't process the RestrictInfo node ... I'm inclined to put a solution
>>>> of this issue off for a while.
>>>
>>> OK.  I think it doesn't worth to eliminate IS NULL quals with this
>>> complexity (at least at this stage of work).
>>>
>>> I made improvements over the code.  Mostly new comments, grammar
>>> corrections of existing comments and small refactoring.
>>>
>>> Also, I found that the  suggestion from David Rowley [1] to qsort
>>> array of relations to faster find duplicates is still unaddressed.
>>> I've implemented it.  That helps to evade quadratic complexity with
>>> large number of relations.
>>>
>>> Also I've incorporated improvements from Alena Rybakina except one for
>>> skipping SJ removal when no SJ quals is found.  It's not yet clear for
>>> me if this check fix some cases. But at least optimization got skipped
>>> in some useful cases (as you can see in regression tests).
>>
>> I would like to propose one more minor improvement (see in attachment).
>> The idea here is that after removing a self-join and changing clauses we
>> should re-probe the set of relids with the same Oid, because we can find
>> more removable self-joins (see the demo test in join.sql).
> 
> 
> Thank you, I've integrated this into the patch.  BTW, the patch
> introduces two new GUC variables: enable_self_join_removal,
> self_join_search_limit.  enable_self_join_removal variable turns
> on/off optimization at all.  self_join_search_limit variable limits
> its usage by the number of joins.  AFICS, self_join_search_limit is
> intended to protect us from quadratic complexity self-join removal
> has.  I tried to reproduce the extreme case.
> 
> SELECT count(*) FROM pgbench_accounts a0, pgbench_accounts a1, ...,
> pgbench_accounts a100 WHERE a0.aid = 1 AND a1.aid = a0.aid + 1 AND ...
> AND a100.aid = a99.aid + 1;
> 
> This query took 3778.432 ms with self-join removal disabled, and
> 3756.009 ms with self-join removal enabled.  So, no measurable
> overhead.  Similar to the higher number of joins.  Can you imagine
> some extreme case when self-join removal could introduce significant
> overhead in comparison with other optimizer parts?  If not, should we
> remove self_join_search_limit GUC?
Thanks,
It was Zhihong Yu who worried about that case [1]. And my purpose was to 
show a method to avoid such a problem if it would be needed.
I guess the main idea here is that we have a lot of self-joins, but only 
few of them (or no one) can be removed.
I can't imagine a practical situation when we can be stuck in the 
problems here. So, I vote to remove this GUC.


[1] 
https://www.postgresql.org/message-id/CALNJ-vTyL-LpvSOPZxpC63Et3LJLUAFZSfRqGEhT5Rj7_EEj7w%40mail.gmail.com

-- 
regards,
Andrey Lepikhov
Postgres Professional

В списке pgsql-hackers по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: Removing unneeded self joins