Обсуждение: BUG #18261: Inconsistent results of SELECT affected by joined subqueries

Поиск
Список
Период
Сортировка

BUG #18261: Inconsistent results of SELECT affected by joined subqueries

От
PG Bug reporting form
Дата:
The following bug has been logged on the website:

Bug reference:      18261
Logged by:          Zuming Jiang
Email address:      zuming.jiang@inf.ethz.ch
PostgreSQL version: 16.1
Operating system:   Ubuntu 20.04
Description:

My fuzzer finds a logic bug in Postgres, which makes Postgres return
inconsistent results.

--- Set up database ---
create table t0 (pkey int4, c0 numeric, primary key(c0));
create table t1 (c3 int4);
create table t5 (vkey int4, pkey int4);
create view t6 as
select
    1 as c_7
  from
    ((select
            0.0 as c_0,
            -4 as c_1
          from
            t0 as ref_0
          ) as subq_0
      right outer join ((select
              ref_3.pkey as c_6,
              ref_3.c0 as c_7
            from
              t0 as ref_3
            ) as subq_1
        right outer join t0 as ref_4
        on (subq_1.c_7 = ref_4.c0 ))
      on (subq_0.c_0 = subq_1.c_7 ))
  where 'xxx' ~~* (ltrim('xxx',
      (overlay('xxx', 'xxx', subq_0.c_1, subq_1.c_6))));
insert into t5 values (68, 78000);
---

The fuzzer generates Test case 1:

--- Test case 1 ---
select 1 as c_0
from
  ((select
              ref_2.vkey as c_3,
              ref_2.pkey as c_4,
              ref_2.pkey as c_5
            from
              t5 as ref_2
            ) as subq_0
        full outer join (select
              ref_4.c_7 as c_0
            from
              t6 as ref_4
            where null::bool) as subq_1
        on (subq_0.c_3 = subq_1.c_0))
where subq_0.c_4 < (case when ((exists (
        select
            ref_16.c3 as c_9
          from
            t1 as ref_16
          where null::bool))) then (subq_1.c_0 # null::int4) else
(subq_1.c_0 # null::int4) end);

Because the then branch and else branch of the CASE WHEN expression '(case
when ((exists (select ref_16.c3 as c_9 from t1 as ref_16 where null::bool)))
then (subq_1.c_0 # null::int4) else (subq_1.c_0 # null::int4) end)' are the
same (both are subq_1.c_0 # null::int4), I simplify this CASE WHEN
expression by replacing it with (subq_1.c_0 # null::int4), and get Test case
2:

--- Test case 2 ---
select 1 as c_0
from
  ((select
              ref_2.vkey as c_3,
              ref_2.pkey as c_4,
              ref_2.pkey as c_5
            from
              t5 as ref_2
            ) as subq_0
        full outer join (select
              ref_4.c_7 as c_0
            from
              t6 as ref_4
            where null::bool) as subq_1
        on (subq_0.c_3 = subq_1.c_0 ))
where subq_0.c_4 < (subq_1.c_0 # null::int4);

--- Expected behavior ---
Test case 1 and Test case 2 return the same results.

--- Actual behavior ---
Test case 1 returns 0 rows, while Test case 2 returns 1 row.

Output of Test case 1:
 c_0 
-----
(0 rows)

Output of Test case 2:
 c_0 
-----
   1
(1 row)

--- Postgres version ---
Github commit: 0eac3c798c2d223d6557a5440d7534317dbd4fa0
Version: PostgreSQL 17devel on x86_64-pc-linux-gnu, compiled by gcc
(Ubuntu
9.4.0-1ubuntu1~20.04.1) 9.4.0, 64-bit

--- Platform information ---
Platform: Ubuntu 20.04
Kernel: Linux 5.4.0-147-generic


Re: BUG #18261: Inconsistent results of SELECT affected by joined subqueries

От
Richard Guo
Дата:

On Wed, Dec 27, 2023 at 7:36 PM PG Bug reporting form <noreply@postgresql.org> wrote:
My fuzzer finds a logic bug in Postgres, which makes Postgres return
inconsistent results.

Thank you for the report.  This is surely a wrong result issue.  Great
catch!

I've looked into it a bit.  The problem lies in how the SJE code handles
the transfer of qual clauses from the removed relation to the remaining
one.  The code replaces the Vars of the removed relation with the Vars
of the remaining relation for each clause.  It then reintegrates these
clauses into the appropriate restriction or join clause lists, while
attempting to avoid duplicates.  So far so good.

However, the code compares RestrictInfo->clause to determine if two
clauses are duplicates.  This is just flat wrong.  Two RestrictInfos
with the same clause can have different required_relids,
incompatible_relids, is_pushed_down and so on.

Here is a simple example to illustrate this issue.

create table t (a int primary key, b int);
insert into t select 1,1;

-- wrong plan. the full join should be dummy
explain (costs off)
select 1 from t full join
    (select * from t t1 join
        t t2 join t t3 on t2.a = t3.a
        on true
    where false) s on true
where false;
              QUERY PLAN
--------------------------------------
 Merge Full Join
   ->  Seq Scan on t
   ->  Materialize
         ->  Result
               One-Time Filter: false
(5 rows)

-- wrong result
select 1 from t full join
    (select * from t t1 join
        t t2 join t t3 on t2.a = t3.a
        on true
    where false) s on true
where false;
 ?column?
----------
        1
(1 row)

In this query there are two RestrictInfos whose clause are both
const-false: one is supposed to be evaluated at join t1/t3, and the
other is supposed to be evaluated above the full join.  But the SJE
code mistakenly thinks that they are duplicates, so the one above the
full join is just abandoned.

Do we really need to avoid duplicates here?  We do not do that before.
For instance,

explain (costs off)
select * from t where b > 1 and b > 1;
           QUERY PLAN
---------------------------------
 Seq Scan on t
   Filter: ((b > 1) AND (b > 1))
(2 rows)

So I will not be surprised if I see redundant 'b > 1' after the join
removal in the following plan.

explain (costs off)
select * from t t1 join t t2 on t1.a = t2.a where t1.b > 1 and t2.b > 1;
               QUERY PLAN
-----------------------------------------
 Seq Scan on t t2
   Filter: ((a IS NOT NULL) AND (b > 1))
(2 rows)

If we determine that avoiding duplicates is necessary,  I think at least
we should compare the entire RestrictInfos not just their clauses.  One
challenge with this approach is that the 'rinfo_serial' usually differs,
making direct comparison problematic.  I'm wondering if we can make
'rinfo_serial' equal_ignore.  Not too sure about that.

Attached is a patch to show my thoughts.

Thanks
Richard
Вложения

Re: BUG #18261: Inconsistent results of SELECT affected by joined subqueries

От
Tom Lane
Дата:
Richard Guo <guofenglinux@gmail.com> writes:
> I've looked into it a bit.  The problem lies in how the SJE code handles
> the transfer of qual clauses from the removed relation to the remaining
> one.

I am definitely starting to think that the SJE patch was not ready
for prime time.  We keep finding not only minor but major problems
in it --- I'd call this one a "major" one.  Is it time to revert
and rethink it from scratch?

> If we determine that avoiding duplicates is necessary,  I think at least
> we should compare the entire RestrictInfos not just their clauses.  One
> challenge with this approach is that the 'rinfo_serial' usually differs,
> making direct comparison problematic.  I'm wondering if we can make
> 'rinfo_serial' equal_ignore.  Not too sure about that.

I'd say that that will break the cases rinfo_serial was introduced for.
Now, I certainly don't love rinfo_serial and would be happier if we
could do without it, but getting rid of it is another research project.

            regards, tom lane



Re: BUG #18261: Inconsistent results of SELECT affected by joined subqueries

От
Richard Guo
Дата:

On Fri, Dec 29, 2023 at 12:32 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Richard Guo <guofenglinux@gmail.com> writes:
> I've looked into it a bit.  The problem lies in how the SJE code handles
> the transfer of qual clauses from the removed relation to the remaining
> one.

I am definitely starting to think that the SJE patch was not ready
for prime time.  We keep finding not only minor but major problems
in it --- I'd call this one a "major" one.  Is it time to revert
and rethink it from scratch?

I feel the same way.  It seems to me that the SJE code still needs some
refactoring to get to a committable state.

Thanks
Richard

Re: BUG #18261: Inconsistent results of SELECT affected by joined subqueries

От
Alexander Korotkov
Дата:
On Thu, Dec 28, 2023 at 6:32 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Richard Guo <guofenglinux@gmail.com> writes:
> > I've looked into it a bit.  The problem lies in how the SJE code handles
> > the transfer of qual clauses from the removed relation to the remaining
> > one.
>
> I am definitely starting to think that the SJE patch was not ready
> for prime time.  We keep finding not only minor but major problems
> in it --- I'd call this one a "major" one.  Is it time to revert
> and rethink it from scratch?

First, I'd like to admit that the SJE patch contained a set of
oversights, some of them were very unnecessary ones (like putting
non-node into RelOptInfo.unique_for_rels).

The distinction between minor and major issues is often a matter of
perspective.  In the case of the SJE patch, I believe the issues yet
spotted are localized and do not necessitate a complete redesign of
the feature. This localized nature of the problems suggests that
targeted fixes could be sufficient to address the current concerns.

In terms of effort, the bitmapset issue was the toughest for me yet.
But I would have to say that this issue persisted in the code before.
As a result of the committed fixes, now we have the instrumentation to
detect problems like this in the future.

The integration of the SJE feature with the optimizer's data structure
and algorithms indeed presents a high risk of conflicts with other
patches. The experience with nullable vars is a testament to this
challenge. It's a delicate balance to maintain, and I recognize the
difficulties in predicting and managing such conflicts.

I am not advocating for the SJE feature unconditionally. I am fully
prepared to reconsider its implementation if a fundamental redesign
becomes necessary. However, at this stage, I believe that retaining
and refining the feature is more beneficial than a complete rollback.
Despite its imperfections, the SJE patch holds significant potential.

> > If we determine that avoiding duplicates is necessary,  I think at least
> > we should compare the entire RestrictInfos not just their clauses.  One
> > challenge with this approach is that the 'rinfo_serial' usually differs,
> > making direct comparison problematic.  I'm wondering if we can make
> > 'rinfo_serial' equal_ignore.  Not too sure about that.
>
> I'd say that that will break the cases rinfo_serial was introduced for.
> Now, I certainly don't love rinfo_serial and would be happier if we
> could do without it, but getting rid of it is another research project.

It's a pity that no regression tests detect that.  The attached patch
implements the special comparison function, which ignores the
'rinfo_serial' field.  This avoids marking 'rinfo_serial' as
pg_node_attr(equal_ignore).

Links.
1. https://www.postgresql.org/message-id/CAMbWs4_wJthNtYBL%2BSsebpgF-5L2r5zFFk6xYbS0A78GKOTFHw%40mail.gmail.com

------
Regards,
Alexander Korotkov

Вложения

Re: BUG #18261: Inconsistent results of SELECT affected by joined subqueries

От
Alexander Korotkov
Дата:
On Tue, Jan 2, 2024 at 3:43 AM Richard Guo <guofenglinux@gmail.com> wrote:
> On Fri, Dec 29, 2023 at 12:32 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> Richard Guo <guofenglinux@gmail.com> writes:
>> > I've looked into it a bit.  The problem lies in how the SJE code handles
>> > the transfer of qual clauses from the removed relation to the remaining
>> > one.
>>
>> I am definitely starting to think that the SJE patch was not ready
>> for prime time.  We keep finding not only minor but major problems
>> in it --- I'd call this one a "major" one.  Is it time to revert
>> and rethink it from scratch?
>
> I feel the same way.  It seems to me that the SJE code still needs some
> refactoring to get to a committable state.

Thank you for your feedback.  Do you have something particular to be
done, which requires reverting the feature from the source tree?

My concern is that we've already fixed quite a set of bugs after
committing to this feature.  If we push the feature back into the
mailing list thread for possibly a long time, then more conflicting
features could be committed (like nullable vars in the past).  Thus,
we may end up committing it again with a shape not better than it is
now.  On the other hand, if there is a plan to redesign the feature in
a way that greatly reduces the number of potential conflicts, that's
another story.

------
Regards,
Alexander Korotkov



Re: BUG #18261: Inconsistent results of SELECT affected by joined subqueries

От
Andrei Lepikhov
Дата:
On 2/1/2024 08:43, Richard Guo wrote:
> 
> On Fri, Dec 29, 2023 at 12:32 AM Tom Lane <tgl@sss.pgh.pa.us 
> <mailto:tgl@sss.pgh.pa.us>> wrote:
> 
>     Richard Guo <guofenglinux@gmail.com <mailto:guofenglinux@gmail.com>>
>     writes:
>      > I've looked into it a bit.  The problem lies in how the SJE code
>     handles
>      > the transfer of qual clauses from the removed relation to the
>     remaining
>      > one.
> 
>     I am definitely starting to think that the SJE patch was not ready
>     for prime time.  We keep finding not only minor but major problems
>     in it --- I'd call this one a "major" one.  Is it time to revert
>     and rethink it from scratch?
> 
> 
> I feel the same way.  It seems to me that the SJE code still needs some
> refactoring to get to a committable state.
Being an author of the feature, I couldn't be all objective, of course. 
However, let me write some words on this code.

About profitability. According to current database development trends, 
the feature looks helpful, where we see many auto-generated and 
syntactically redundant queries. So, IMO, some tools for simplifying the 
parse tree should be in the pocket of the DBMS. We do have numerous 
planner hooks that give way to well-known extensions like Citus or 
TimescaleDB, but tweaking the parse tree is gradually cheaper in many cases.
By reducing the number of RangeTblEntries, clauses, and equivalence 
classes, we can profit significantly from this feature, especially with 
partitioned tables.
Hence, this code is not only about self-join elimination. It also 
introduces machinery for redundant parse tree reduction not only for the 
case of dead tails, like reduce_outer_joins. It may be utilized by 
extensions in other ways, too.

About weak points. We tried different ways to implement that feature 
during the development, even in the optimization stage. The current 
place and logic are the most effective. The self-join detection stage 
looks stable and polished enough. Only one part of the code is debatable 
- replacing one baserel (removing), involved in many relations with 
other DB objects, with another one (keeping).

About development. Having it working in a PG fork and as a patch at the 
commitfest for a long time, we hadn't had so much food for thought as 
last three months when it was committed to the master. It raised new 
questions on weaknesses of setrefs, links from outer query blocks, 
integrity control of PlannerInfo, etc.
Bugs we have been finding look at most not about architectural issues 
but about leaks in planner understanding, which is primarily correct 
about me personally.
I think, having close scrutiny of the PlannerInfo fields and clause 
replacement strategy, we can achieve stable behaviour of this feature 
before the release and I already see progress in stability of the 
feature. As I see dynamics of discussions in the pgsql-bugs list, we 
already have a community team of people who detect SJE problems, develop 
the code and provide a review. Maybe it is also a positive outcome?

Having written all the words above, I think we must formulate objective 
reasons if we want to revert this feature. Another way it can live on 
the mailing list for another five years without any progress. I, 
personally, vote for leaving this feature in the master.

-- 
regards,
Andrei Lepikhov
Postgres Professional




Re: BUG #18261: Inconsistent results of SELECT affected by joined subqueries

От
Andrei Lepikhov
Дата:
On 2/1/2024 15:02, Alexander Korotkov wrote:
> On Thu, Dec 28, 2023 at 6:32 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'd say that that will break the cases rinfo_serial was introduced for.
>> Now, I certainly don't love rinfo_serial and would be happier if we
>> could do without it, but getting rid of it is another research project.
> 
> It's a pity that no regression tests detect that.  The attached patch
> implements the special comparison function, which ignores the
> 'rinfo_serial' field.  This avoids marking 'rinfo_serial' as
> pg_node_attr(equal_ignore).
I've reviewed this patch, and it seems okay. It was a real blunder in 
our understanding of clauses; thanks to all for pointing this out and 
fixing it.
The regression test shows where two fully equal join clauses, applied on 
different join levels, differ by the only required_relids. It covers the 
problem and can be helpful by itself. IMO, one line in this test is 
redundant (see attachment).
The comment on the patch looks okay. But assuming required_relids and 
clause fields of two RestrinctInfos are equal, I can't imagine a 
situation to have different is_pushed_down and incompatible_relids. Is 
it really possible?

-- 
regards,
Andrei Lepikhov
Postgres Professional

Вложения

Re: BUG #18261: Inconsistent results of SELECT affected by joined subqueries

От
Alexander Korotkov
Дата:
On Fri, Jan 5, 2024 at 5:32 AM Andrei Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
> On 2/1/2024 15:02, Alexander Korotkov wrote:
> > On Thu, Dec 28, 2023 at 6:32 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> I'd say that that will break the cases rinfo_serial was introduced for.
> >> Now, I certainly don't love rinfo_serial and would be happier if we
> >> could do without it, but getting rid of it is another research project.
> >
> > It's a pity that no regression tests detect that.  The attached patch
> > implements the special comparison function, which ignores the
> > 'rinfo_serial' field.  This avoids marking 'rinfo_serial' as
> > pg_node_attr(equal_ignore).
> I've reviewed this patch, and it seems okay. It was a real blunder in
> our understanding of clauses; thanks to all for pointing this out and
> fixing it.
> The regression test shows where two fully equal join clauses, applied on
> different join levels, differ by the only required_relids. It covers the
> problem and can be helpful by itself. IMO, one line in this test is
> redundant (see attachment).
> The comment on the patch looks okay. But assuming required_relids and
> clause fields of two RestrinctInfos are equal, I can't imagine a
> situation to have different is_pushed_down and incompatible_relids. Is
> it really possible?

Andrei, thank you for your review.  I've pushed it with your
suggestions and my further minor editings to comments and the commit
message.  Sorry, I forgot to mention you as the reviewer.

One thing that worries me is that the patch version by Richard Guo [1]
marking 'rinfo_serial' as pg_node_attr(equal_ignore) didn't cause our
test suite to fail.  Do you think you can write a test, which would
fail in that case?

Links
1. https://www.postgresql.org/message-id/CAMbWs48_EAPrc_C5qvK2WGvydHvht1JUnXwer9FVeU7t_zA%2BMQ%40mail.gmail.com

------
Regards,
Alexander Korotkov



Re: BUG #18261: Inconsistent results of SELECT affected by joined subqueries

От
Richard Guo
Дата:

On Fri, Jan 5, 2024 at 11:32 AM Andrei Lepikhov <a.lepikhov@postgrespro.ru> wrote:
The regression test shows where two fully equal join clauses, applied on
different join levels, differ by the only required_relids. It covers the
problem and can be helpful by itself. IMO, one line in this test is
redundant (see attachment).

Hmm, I don't think the insert statement in the test case is redundant.
It's needed to verify that the query in the test case gives the correct
result.  Without the insert statement, the wrong plan would give the
same result as the correct plan, i.e., an empty set in this case.

IMO, if we make some code changes and add a test case for that, we need
to ensure the test can give a different (and correct of course) result
than what came before.

Thanks
Richard

Re: BUG #18261: Inconsistent results of SELECT affected by joined subqueries

От
Alexander Korotkov
Дата:
On Mon, Jan 8, 2024, 13:51 Richard Guo <guofenglinux@gmail.com> wrote:

On Fri, Jan 5, 2024 at 11:32 AM Andrei Lepikhov <a.lepikhov@postgrespro.ru> wrote:
The regression test shows where two fully equal join clauses, applied on
different join levels, differ by the only required_relids. It covers the
problem and can be helpful by itself. IMO, one line in this test is
redundant (see attachment).

Hmm, I don't think the insert statement in the test case is redundant.
It's needed to verify that the query in the test case gives the correct
result.  Without the insert statement, the wrong plan would give the
same result as the correct plan, i.e., an empty set in this case.

IMO, if we make some code changes and add a test case for that, we need
to ensure the test can give a different (and correct of course) result
than what came before.

Thank you for pointing this.  Yes, veryfying query result not just plan is strengthening the test.  I'll recheck this and push the fix later today.

------
Regards,
Alexander Korotkov

Re: BUG #18261: Inconsistent results of SELECT affected by joined subqueries

От
Alexander Korotkov
Дата:
On Mon, Jan 8, 2024 at 2:49 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Mon, Jan 8, 2024, 13:51 Richard Guo <guofenglinux@gmail.com> wrote:
>> On Fri, Jan 5, 2024 at 11:32 AM Andrei Lepikhov <a.lepikhov@postgrespro.ru> wrote:
>>>
>>> The regression test shows where two fully equal join clauses, applied on
>>> different join levels, differ by the only required_relids. It covers the
>>> problem and can be helpful by itself. IMO, one line in this test is
>>> redundant (see attachment).
>>
>>
>> Hmm, I don't think the insert statement in the test case is redundant.
>> It's needed to verify that the query in the test case gives the correct
>> result.  Without the insert statement, the wrong plan would give the
>> same result as the correct plan, i.e., an empty set in this case.
>>
>> IMO, if we make some code changes and add a test case for that, we need
>> to ensure the test can give a different (and correct of course) result
>> than what came before.
>
> Thank you for pointing this.  Yes, veryfying query result not just plan is strengthening the test.  I'll recheck this
andpush the fix later today. 

Pushed.  Thanks again, Richard.

------
Regards,
Alexander Korotkov



Re: BUG #18261: Inconsistent results of SELECT affected by joined subqueries

От
Alexander Korotkov
Дата:
On Mon, Jan 8, 2024 at 3:10 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Mon, Jan 8, 2024 at 2:49 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > On Mon, Jan 8, 2024, 13:51 Richard Guo <guofenglinux@gmail.com> wrote:
> >> On Fri, Jan 5, 2024 at 11:32 AM Andrei Lepikhov <a.lepikhov@postgrespro.ru> wrote:
> >>>
> >>> The regression test shows where two fully equal join clauses, applied on
> >>> different join levels, differ by the only required_relids. It covers the
> >>> problem and can be helpful by itself. IMO, one line in this test is
> >>> redundant (see attachment).
> >>
> >>
> >> Hmm, I don't think the insert statement in the test case is redundant.
> >> It's needed to verify that the query in the test case gives the correct
> >> result.  Without the insert statement, the wrong plan would give the
> >> same result as the correct plan, i.e., an empty set in this case.
> >>
> >> IMO, if we make some code changes and add a test case for that, we need
> >> to ensure the test can give a different (and correct of course) result
> >> than what came before.
> >
> > Thank you for pointing this.  Yes, veryfying query result not just plan is strengthening the test.  I'll recheck
thisand push the fix later today. 
>
> Pushed.  Thanks again, Richard.

I also noticed my indent oversight.  Will fix that later today.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=koel&dt=2024-01-08%2014%3A19%3A03

------
Regards,
Alexander Korotkov



Re: BUG #18261: Inconsistent results of SELECT affected by joined subqueries

От
Andrei Lepikhov
Дата:
On 8/1/2024 18:50, Richard Guo wrote:
> 
> On Fri, Jan 5, 2024 at 11:32 AM Andrei Lepikhov 
> <a.lepikhov@postgrespro.ru <mailto:a.lepikhov@postgrespro.ru>> wrote:
> 
>     The regression test shows where two fully equal join clauses,
>     applied on
>     different join levels, differ by the only required_relids. It covers
>     the
>     problem and can be helpful by itself. IMO, one line in this test is
>     redundant (see attachment).
> 
> 
> Hmm, I don't think the insert statement in the test case is redundant.
> It's needed to verify that the query in the test case gives the correct
> result.  Without the insert statement, the wrong plan would give the
> same result as the correct plan, i.e., an empty set in this case.
> 
> IMO, if we make some code changes and add a test case for that, we need
> to ensure the test can give a different (and correct of course) result
> than what came before.
Maybe. I personally support explain by an execution result if the code 
is related to executor and couldn't be detected by the form of query 
plan. But why not? anyway, Alexander already committed that.

-- 
regards,
Andrei Lepikhov
Postgres Professional