Thread: Weirdly pessimistic estimates in optimizer


Weirdly pessimistic estimates in optimizer

From: David Kubečka
Date:
Hi all,

I have encountered a performance problem with a relatively simple query, which I think is caused by overly pessimistic estimates in the optimizer. I originally ran into this issue on a table a few GB in size, but it can be reproduced with a much smaller table as follows:

-- Setup main fact table
create table facts (fk int, f numeric);
insert into facts select (random()*10000)::int, random()*10000 from generate_series(1,1000000) g(i);
create index on facts (fk);
vacuum analyze facts;

-- Pick 100 values of 'fk' which cover roughly 1/100 of the rows in the 'facts' table, and select all corresponding values
create table random_fk_dupl as select fk from facts where fk in (select (random()*10000)::int from generate_series(1,100) g(i));
vacuum analyze random_fk_dupl;

-- Create a table with unique values from 'random_fk_dupl'
create table random_fk_uniq as select distinct fk from random_fk_dupl;
vacuum analyze random_fk_uniq;
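
-- Optional sanity check: counts vary between runs because of random(),
-- and the picking subquery can draw duplicates, hence "at most" 100
-- distinct values over roughly 10000 rows
select count(*) as dupl_rows, count(distinct fk) as uniq_vals
from random_fk_dupl;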

(The motivation for doing this is that I have a persistent table/cache with aggregated values, and I want to update this table whenever new data are loaded into 'facts' by processing only the loaded data. This is a very rough description, but more detail isn't needed for the sake of the argument.)

Now the main query:

david=# explain analyze select fk, sum(f) from facts where fk in (select fk from random_fk_dupl) group by 1;
                                                              QUERY PLAN                                                             
--------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=892.82..1017.34 rows=9962 width=15) (actual time=22.752..22.791 rows=100 loops=1)
   Group Key: facts.fk
   ->  Nested Loop  (cost=167.01..842.63 rows=10038 width=15) (actual time=2.167..16.739 rows=9807 loops=1)
         ->  HashAggregate  (cost=166.59..167.59 rows=100 width=4) (actual time=2.153..2.193 rows=100 loops=1)
               Group Key: random_fk_dupl.fk
               ->  Seq Scan on random_fk_dupl  (cost=0.00..142.07 rows=9807 width=4) (actual time=0.003..0.688 rows=9807 loops=1)
         ->  Index Scan using facts_fk_idx on facts  (cost=0.42..5.75 rows=100 width=15) (actual time=0.009..0.117 rows=98 loops=100)
               Index Cond: (fk = random_fk_dupl.fk)
 Planning time: 0.372 ms
 Execution time: 22.842 ms

This is the plan I would expect for this query given the estimates (quite good, except for the final aggregation), and it corresponds exactly to what's written in the optimizer's README:

{quote}
The naive way to join two relations using a clause like WHERE A.X = B.Y
is to generate a nestloop plan like this:

    NestLoop
        Filter: A.X = B.Y
        -> Seq Scan on A
        -> Seq Scan on B

We can make this better by using a merge or hash join, but it still
requires scanning all of both input relations.  If A is very small and B is
very large, but there is an index on B.Y, it can be enormously better to do
something like this:

    NestLoop
        -> Seq Scan on A
        -> Index Scan using B_Y_IDX on B
            Index Condition: B.Y = A.X

Here, we are expecting that for each row scanned from A, the nestloop
plan node will pass down the current value of A.X into the scan of B.
That allows the indexscan to treat A.X as a constant for any one
invocation, and thereby use it as an index key.  This is the only plan type
that can avoid fetching all of B, and for small numbers of rows coming from
A, that will dominate every other consideration.  (As A gets larger, this
gets less attractive, and eventually a merge or hash join will win instead.
So we have to cost out all the alternatives to decide what to do.)
{quote}

So far so good. Now let's use 'random_fk_uniq' instead of 'random_fk_dupl'. I would expect almost the same plan (with just a cheaper inner HashAggregate), because the optimizer knows that there are (at most) 100 values in 'random_fk_uniq', so a nested loop is clearly the best choice. Instead we get this:

david=# explain analyze select fk, sum(f) from facts where fk in (select fk from random_fk_uniq) group by 1;
                                                          QUERY PLAN                                                         
------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=18198.11..18322.64 rows=9962 width=15) (actual time=163.738..163.773 rows=100 loops=1)
   Group Key: facts.fk
   ->  Hash Semi Join  (cost=3.25..18147.92 rows=10038 width=15) (actual time=0.298..160.271 rows=9807 loops=1)
         Hash Cond: (facts.fk = random_fk_uniq.fk)
         ->  Seq Scan on facts  (cost=0.00..15408.00 rows=1000000 width=15) (actual time=0.010..66.524 rows=1000000 loops=1)
         ->  Hash  (cost=2.00..2.00 rows=100 width=4) (actual time=0.079..0.079 rows=100 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 4kB
               ->  Seq Scan on random_fk_uniq  (cost=0.00..2.00 rows=100 width=4) (actual time=0.003..0.028 rows=100 loops=1)
 Planning time: 1.021 ms
 Execution time: 163.911 ms

Even with our small data this is significantly slower, and the relative difference gets much bigger with larger data. Why does the optimizer choose such an inefficient plan? Because it apparently thinks it's the cheapest one. So let's force it to use a nested loop in order to see the cost of that algorithm.

david=# set enable_hashjoin to off;
SET
david=# explain analyze select fk, sum(f) from facts where fk in (select fk from random_fk_uniq) group by 1;
                                                              QUERY PLAN                                                             
--------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=20063.48..20188.01 rows=9962 width=15) (actual time=25.502..25.545 rows=100 loops=1)
   Group Key: facts.fk
   ->  Nested Loop  (cost=5.16..20013.29 rows=10038 width=15) (actual time=0.159..19.748 rows=9807 loops=1)
         ->  HashAggregate  (cost=2.25..3.25 rows=100 width=4) (actual time=0.032..0.068 rows=100 loops=1)
               Group Key: random_fk_uniq.fk
               ->  Seq Scan on random_fk_uniq  (cost=0.00..2.00 rows=100 width=4) (actual time=0.004..0.011 rows=100 loops=1)
         ->  Bitmap Heap Scan on facts  (cost=5.16..199.11 rows=100 width=15) (actual time=0.042..0.168 rows=98 loops=100)
               Recheck Cond: (fk = random_fk_uniq.fk)
               Heap Blocks: exact=9745
               ->  Bitmap Index Scan on facts_fk_idx  (cost=0.00..5.13 rows=100 width=0) (actual time=0.024..0.024 rows=98 loops=100)
                     Index Cond: (fk = random_fk_uniq.fk)
 Planning time: 0.702 ms
 Execution time: 25.660 ms

Ok, so the overall cost of the nested loop is 20013, whereas for the hash join it's only 18147. Let's finally force exactly the same plan as for the query using 'random_fk_dupl'.

david=# set enable_bitmapscan to off;
SET
david=# explain analyze select fk, sum(f) from facts where fk in (select fk from random_fk_uniq) group by 1;
                                                               QUERY PLAN                                                              
----------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=21578.55..21703.07 rows=9962 width=15) (actual time=17.846..17.893 rows=100 loops=1)
   Group Key: facts.fk
   ->  Nested Loop  (cost=0.42..21528.36 rows=10038 width=15) (actual time=0.014..13.069 rows=9807 loops=1)
         ->  HashAggregate  (cost=2.25..3.25 rows=100 width=4) (actual time=0.032..0.068 rows=100 loops=1)
               Group Key: random_fk_uniq.fk
               ->  Seq Scan on random_fk_uniq  (cost=0.00..2.00 rows=100 width=4) (actual time=0.004..0.011 rows=100 loops=1)
         ->  Index Scan using facts_fk_idx on facts  (cost=0.42..214.26 rows=100 width=15) (actual time=0.007..0.109 rows=98 loops=100)
               Index Cond: (fk = random_fk_uniq.fk)
 Planning time: 0.257 ms
 Execution time: 17.943 ms

The question is why the optimizer, or rather the cost estimator, produced such different estimates upon a very small change in input. Moreover, it seems that the main culprit of the bad estimates isn't actually directly related to the outer table, but rather to the inner table. Just compare the estimates for the two index scans:

With 'random_fk_dupl':
         ->  Index Scan using facts_fk_idx on facts  (cost=0.42..5.75 rows=100 width=15) (actual time=0.009..0.117 rows=98 loops=100)
With 'random_fk_uniq':
         ->  Index Scan using facts_fk_idx on facts  (cost=0.42..214.26 rows=100 width=15) (actual time=0.007..0.109 rows=98 loops=100)

I have read the optimizer README file and also looked briefly at the code, but this seems to be something not related to a particular algorithm implementation (e.g. nested loop). Perhaps it's the way cost estimates are propagated down (or sideways? that would be weird...) the query tree. But I am really not sure, since this is my first time looking at the optimizer code base. I should also add that I have reproduced this behaviour on all versions of Pg from 9.2 up to current devel.

Regards,

David

Re: Weirdly pessimistic estimates in optimizer

From: Tomas Vondra
Date:
Hi David ;-)

On 2.3.2015 20:19, David Kubečka wrote:
>
> The question is why optimizer, or rather the cost estimator,
> produced so much different estimates upon very small change in input.
> Moreover it seems that the main culprit of bad estimates isn't
> actually directly related to outer table, but rather to inner table.
> Just compare estimates for the two index scans:
> 
> With 'random_fk_dupl':
>          ->  Index Scan using facts_fk_idx on facts  (cost=0.42..5.75
> rows=100 width=15) (actual time=0.009..0.117 rows=98 loops=100)
> With 'random_fk_uniq':
>          ->  Index Scan using facts_fk_idx on facts  (cost=0.42..214.26
> rows=100 width=15) (actual time=0.007..0.109 rows=98 loops=100)
> 
> I have read the optimizer README file and also looked briefly at the
>  code, but this seems to be something not related to particular 
> implementation of algorithm (e.g. nested loop). Perhaps it's the way
> cost estimates are propagated down (or sideways? that would be
> weird...) the query tree. But I am really not sure, since this is my
> first time looking at the optimizer code base. I should also add that
> I have reproduced this behaviour for all versions of Pg from 9.2 up 
> to current devel.

Interesting. I've only spent a few minutes looking at this, but it
certainly is a bit strange that using a smaller table with unique values
results in a slower plan than using a table with duplicates (but the
same number of unique values).

Another observation is that the planner does see other plans, because
tweaking the cost variables like this:
   SET random_page_cost = 2;

produces this plan (which on my machine runs just a tad slower than the
first query in your post):
                            QUERY PLAN
---------------------------------------------------------------------
 HashAggregate  (cost=10564.81..10688.47 rows=9893 width=15)
   Group Key: facts.fk
   ->  Nested Loop  (cost=5.42..10515.34 rows=9893 width=15)
       ->  HashAggregate  (cost=2.24..3.23 rows=99 width=4)
             Group Key: random_fk_uniq.fk
             ->  Seq Scan on random_fk_uniq  (cost=0.00..1.99 ...)
       ->  Bitmap Heap Scan on facts  (cost=3.18..105.18 rows=100 ...)
           Recheck Cond: (fk = random_fk_uniq.fk)
           ->  Bitmap Index Scan on facts_fk_idx  (cost=0.00..3.15 ...)
                 Index Cond: (fk = random_fk_uniq.fk)

I can get similar results by setting cpu_operator_cost=0.005. And
further lowering random_page_cost to 1.1 actually gets me this:
                            QUERY PLAN
---------------------------------------------------------------------
 HashAggregate  (cost=6185.41..6309.08 rows=9893 width=15)
   Group Key: facts.fk
   ->  Nested Loop  (cost=2.66..6135.95 rows=9893 width=15)
      ->  HashAggregate  (cost=2.24..3.23 rows=99 width=4)
            Group Key: random_fk_uniq.fk
            ->  Seq Scan on random_fk_uniq  (cost=0.00..1.99 ...)
      ->  Index Scan using facts_fk_idx on facts  (cost=0.42..60.95 ...)
           Index Cond: (fk = random_fk_uniq.fk)

which is exactly the first plan from your post.

I still don't understand why using the smaller table produces a less
efficient plan, but the estimated costs are very close (20013 vs. 18147
is a very small difference in this context), and the estimator shoots for
minimal average error. So it may seem 'weirdly pessimistic' here, but it
might be equally optimistic for other queries.

Also, keep in mind that the cost model is just a simplification of
reality, so it can't be correct 100% of the time.

The fact that this small difference in costs results in a significant
duration difference suggests that the default optimizer cost values do
not reflect your environment. For example, you assume that this is very
fast:

    NestLoop
        -> Seq Scan on A
        -> Index Scan using B_Y_IDX on B
            Index Condition: B.Y = A.X

but index scans often produce a lot of random I/O, and that may be quite
expensive if it really hits the storage. That's what the default value
(random_page_cost=4) is set for. But if you're on a system with lots of
RAM, or an SSD, ... that simply is not the case, and the default may
penalize index scans (and prefer more sequential workloads).

The other thing you might do is to create an index on (fk,f), which on
the new releases produces this:
                           QUERY PLAN
-------------------------------------------------------------------------
 HashAggregate  (cost=783.77..907.43 rows=9893 width=15)
   Group Key: facts.fk
   ->  Nested Loop  (cost=2.66..734.30 rows=9893 width=15)
       ->  HashAggregate  (cost=2.24..3.23 rows=99 width=4)
           Group Key: random_fk_uniq.fk
           ->  Seq Scan on random_fk_uniq  (cost=0.00..1.99 ...)
       ->  Index Only Scan using facts_2_idx on facts  (cost=...)
               Index Cond: (fk = random_fk_uniq.fk)

That limits the amount of random I/O by scanning only the index, and it
is even faster than the first query.
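
For completeness, the index could be created like this (the name
facts_2_idx simply matches the plan above; the extra VACUUM keeps the
visibility map fresh, which index-only scans depend on):

    create index facts_2_idx on facts (fk, f);
    vacuum analyze facts;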

regards

-- 
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Weirdly pessimistic estimates in optimizer

From: Kevin Grittner
Date:
David Kubečka <kubecka.dav@gmail.com> wrote:

> I have read the optimizer README file and also looked briefly at
> the code, but this seems to be something not related to
> particular implementation of algorithm (e.g. nested loop).
> Perhaps it's the way cost estimates are propagated down

It could be as simple as not having tuned your cost factors to
accurately reflect the relative costs of different actions in your
environment.  If you are using the default configuration, you might
want to try a few of the adjustments that are most often needed (at
least in my experience):

cpu_tuple_cost = 0.03
random_page_cost = 2
effective_cache_size = <50% to 75% of machine RAM>
work_mem = <machine RAM * 0.25 / max_connections>

You can SET these on an individual connection, one at a time or in
combination, and EXPLAIN the query to see the effects on plan
choice.
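
For example (the absolute values below are only illustrative starting
points, assuming something like an 8GB machine with max_connections
around 32):

set cpu_tuple_cost = 0.03;
set random_page_cost = 2;
set effective_cache_size = '6GB';
set work_mem = '64MB';
explain select fk, sum(f) from facts
 where fk in (select fk from random_fk_uniq) group by 1;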

Other advice, not all of which matches my personal experience, can
be found here:

https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server

The best thing to do is experiment with different values with your
own queries and workloads to see what most accurately models your
costs (and thus produces the fastest plans).

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Weirdly pessimistic estimates in optimizer

From: Robert Haas
Date:
On Mon, Mar 2, 2015 at 2:19 PM, David Kubečka <kubecka.dav@gmail.com> wrote:
> The question is why optimizer, or rather the cost estimator, produced so
> much different estimates upon very small change in input. Moreover it seems
> that the main culprit of bad estimates isn't actually directly related to
> outer table, but rather to inner table. Just compare estimates for the two
> index scans:
>
> With 'random_fk_dupl':
>          ->  Index Scan using facts_fk_idx on facts  (cost=0.42..5.75
> rows=100 width=15) (actual time=0.009..0.117 rows=98 loops=100)
> With 'random_fk_uniq':
>          ->  Index Scan using facts_fk_idx on facts  (cost=0.42..214.26
> rows=100 width=15) (actual time=0.007..0.109 rows=98 loops=100)

Whoa.  That's pretty dramatic.  Am I correctly understanding that the
two tables contain *exactly* the same data?  Could the issue be that
one of the tables is thought to have a higher logical-to-physical
correlation than the other (see pg_stats.correlation)?
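
For example, something like:

select tablename, attname, n_distinct, correlation
  from pg_stats
 where tablename in ('random_fk_dupl', 'random_fk_uniq')
   and attname = 'fk';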

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Weirdly pessimistic estimates in optimizer

From: David Kubečka
Date:
Hi Tomas and others,

2015-03-02 21:29 GMT+01:00 Tomas Vondra <tomas.vondra@2ndquadrant.com>:
> Hi David ;-)
>
> On 2.3.2015 20:19, David Kubečka wrote:
>>
>> The question is why optimizer, or rather the cost estimator,
>> produced so much different estimates upon very small change in input.
>> Moreover it seems that the main culprit of bad estimates isn't
>> actually directly related to outer table, but rather to inner table.
>> Just compare estimates for the two index scans:
>>
>> With 'random_fk_dupl':
>>          ->  Index Scan using facts_fk_idx on facts  (cost=0.42..5.75
>> rows=100 width=15) (actual time=0.009..0.117 rows=98 loops=100)
>> With 'random_fk_uniq':
>>          ->  Index Scan using facts_fk_idx on facts  (cost=0.42..214.26
>> rows=100 width=15) (actual time=0.007..0.109 rows=98 loops=100)
>>
>> I have read the optimizer README file and also looked briefly at the
>> code, but this seems to be something not related to particular
>> implementation of algorithm (e.g. nested loop). Perhaps it's the way
>> cost estimates are propagated down (or sideways? that would be
>> weird...) the query tree. But I am really not sure, since this is my
>> first time looking at the optimizer code base. I should also add that
>> I have reproduced this behaviour for all versions of Pg from 9.2 up
>> to current devel.
>
> Interesting. I've only spent a few minutes looking at this, but it
> certainly is a bit strange that using a smaller table with unique values
> results in a slower plan than using a table with duplicates (but the
> same number of unique values).

Yep. And I think I have found the cause of the strangeness. I have traced the planner code, and this is how cost_index from costsize.c is called when computing the cost of the index scan on the outer ('facts') table when the inner table is the one with duplicate entries (random_fk_dupl):

cost_index (path=0x27f8fb8, root=0x27670e8, loop_count=9920)

The important thing here is the value of loop_count, which is the (perfect estimate of the) number of rows of random_fk_dupl (I have recreated the table, so it differs slightly from the previously posted version (9893)). Now the predominant part of the run cost is computed by this expression:

run_cost += max_IO_cost + csquared * (min_IO_cost - max_IO_cost);

Since csquared (indexCorrelation * indexCorrelation) is very small in this case, the biggest term here is max_IO_cost, which is computed as follows:

max_IO_cost = (pages_fetched * spc_random_page_cost) / loop_count;

There is a division by loop_count because of the predicted effect of caching, and it is exactly this division which makes the run_cost for a single index lookup so low compared with the query version using random_fk_uniq.  So the main problem is that the planner calls cost_index with loop_count equal to the number of rows of the inner table, *but it ignores semi-join semantics*, i.e. it doesn't account for the number of *unique* rows in the inner table, which will actually be the number of loops.
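
To get a feel for the magnitude, here is a crude back-of-the-envelope illustration (the numbers are made up but of the right order; the real computation also scales pages_fetched itself via index_pages_fetched(), so the effect isn't perfectly linear):

-- Per-loop I/O cost estimate for ~9800 fetched pages at
-- random_page_cost = 4, divided by the two loop_count values.
select round((9800 * 4.0) / 9920, 2) as per_loop_dupl,  -- ~3.95
       round((9800 * 4.0) / 100, 2)  as per_loop_uniq;  -- 392.00

That is the same order of difference as the 5.75 vs. 214.26 index scan estimates above.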

Since this is my first time looking into the optimizer (and in fact any Postgres) code, I am not yet able to locate the exact place where this should be repaired, but I hope that in a few days I will have a patch :-)

> Another observation is that the planner does see other plans, because
> tweaking the cost variables like this:
>
>     SET random_page_cost = 2;
>
> produces this plan (which on my machine runs just a tad slower than the
> first query in your post):
>
>                             QUERY PLAN
> ---------------------------------------------------------------------
>  HashAggregate  (cost=10564.81..10688.47 rows=9893 width=15)
>    Group Key: facts.fk
>    ->  Nested Loop  (cost=5.42..10515.34 rows=9893 width=15)
>        ->  HashAggregate  (cost=2.24..3.23 rows=99 width=4)
>              Group Key: random_fk_uniq.fk
>              ->  Seq Scan on random_fk_uniq  (cost=0.00..1.99 ...)
>        ->  Bitmap Heap Scan on facts  (cost=3.18..105.18 rows=100 ...)
>            Recheck Cond: (fk = random_fk_uniq.fk)
>            ->  Bitmap Index Scan on facts_fk_idx  (cost=0.00..3.15 ...)
>                      Index Cond: (fk = random_fk_uniq.fk)

Yeah, this now makes perfect sense to me. From the run_cost and max_IO_cost formulas you can see that (in this case!) random_page_cost is almost exactly the linear factor here. So eventually the results are completely opposite to what I wished for at first, i.e. instead of tweaking Postgres to produce lower costs for random_fk_uniq it will start producing higher costs for random_fk_dupl. But I now believe that this is correct.

Regards,

-David



Re: Weirdly pessimistic estimates in optimizer

From: Tom Lane
Date:
David Kubečka <kubecka.dav@gmail.com> writes:
> There is division by loop_count because of predicted effect of caching and
> it is exactly this division which makes the run_cost for single index
> lookup so low compared with the query version with random_fk_uniq.  So the
> main problem is that the planner calls cost_index with loop_count equal to
> number of rows of inner table, *but it ignores semi-join semantics*, i.e.
> it doesn't account for the number of *unique* rows in the inner table which
> will actually be the number of loops.

Good point.

> Since this is my first time looking into the optimizer (and in fact any
> postgres) code I am not yet able to locate the exact place where this
> should be repaired, but I hope that in few days I will have a patch :-)

Hm, this might not be the best problem to tackle for your first Postgres
patch :-(.  The information about the estimated number of unique rows
isn't readily available at the point where we're creating parameterized
paths.  When we do compute such an estimate, it's done like this:
   pathnode->path.rows = estimate_num_groups(root, uniq_exprs, rel->rows);

where uniq_exprs is a list of right-hand-side expressions we've determined
belong to the semijoin, and rel->rows represents the raw size of the
semijoin's RHS relation (which might itself be a join).  Neither of those
things are available to create_index_path.

I chewed on this for awhile and decided that there'd be no real harm in
taking identification of the unique expressions out of 
create_unique_path() and doing it earlier, in initsplan.c; we'd need a
couple more fields in SpecialJoinInfo but that doesn't seem like a
problem.  However, rel->rows is a *big* problem; we simply have not made
any join size estimates yet, and can't, because these things are done
bottom up.

However ... estimate_num_groups's dependency on its rowcount input is not
large (it's basically using it as a clamp).  So conceivably we could have
get_loop_count just multiply together the sizes of the base relations
included in the semijoin's RHS to get a preliminary estimate of that
number.  This would be the right thing anyway for a single relation in the
RHS, which is the most common case.  It would usually be an overestimate
for join RHS, but we could hope that the output of estimate_num_groups
wouldn't be affected too badly.
        regards, tom lane



Re: Weirdly pessimistic estimates in optimizer

From: Tom Lane
Date:
I wrote:
> I chewed on this for awhile and decided that there'd be no real harm in
> taking identification of the unique expressions out of
> create_unique_path() and doing it earlier, in initsplan.c; we'd need a
> couple more fields in SpecialJoinInfo but that doesn't seem like a
> problem.  However, rel->rows is a *big* problem; we simply have not made
> any join size estimates yet, and can't, because these things are done
> bottom up.

> However ... estimate_num_groups's dependency on its rowcount input is not
> large (it's basically using it as a clamp).  So conceivably we could have
> get_loop_count just multiply together the sizes of the base relations
> included in the semijoin's RHS to get a preliminary estimate of that
> number.  This would be the right thing anyway for a single relation in the
> RHS, which is the most common case.  It would usually be an overestimate
> for join RHS, but we could hope that the output of estimate_num_groups
> wouldn't be affected too badly.

Attached is a draft patch that does those two things.  I like the first
part (replacing SpecialJoinInfo's rather ad-hoc join_quals field with
something more explicitly attuned to semijoin uniqueness processing).
The second part is still pretty much of a kluge, but then get_loop_count
was a kluge already.  This arguably makes it better.

Now, on the test case you presented, this has the unfortunate effect that
it now reliably chooses the "wrong" plan for both cases :-(.  But I think
that's a reflection of poor cost parameters (ie, test case fits handily in
RAM but we've not set the cost parameters to reflect that).  We do get the
same rowcount and roughly-same cost estimates for both the random_fk_dupl
and random_fk_uniq queries, so from that standpoint it's doing the right
thing.  If I reduce random_page_cost to 2 or so, it makes the choices you
wanted.

            regards, tom lane

diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 9fe8008..9f65d66 100644
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*************** _copySpecialJoinInfo(const SpecialJoinIn
*** 1942,1948 ****
      COPY_SCALAR_FIELD(jointype);
      COPY_SCALAR_FIELD(lhs_strict);
      COPY_SCALAR_FIELD(delay_upper_joins);
!     COPY_NODE_FIELD(join_quals);

      return newnode;
  }
--- 1942,1951 ----
      COPY_SCALAR_FIELD(jointype);
      COPY_SCALAR_FIELD(lhs_strict);
      COPY_SCALAR_FIELD(delay_upper_joins);
!     COPY_SCALAR_FIELD(semi_can_btree);
!     COPY_SCALAR_FIELD(semi_can_hash);
!     COPY_NODE_FIELD(semi_operators);
!     COPY_NODE_FIELD(semi_rhs_exprs);

      return newnode;
  }
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index fe509b0..fd876fb 100644
*** a/src/backend/nodes/equalfuncs.c
--- b/src/backend/nodes/equalfuncs.c
*************** _equalSpecialJoinInfo(const SpecialJoinI
*** 798,804 ****
      COMPARE_SCALAR_FIELD(jointype);
      COMPARE_SCALAR_FIELD(lhs_strict);
      COMPARE_SCALAR_FIELD(delay_upper_joins);
!     COMPARE_NODE_FIELD(join_quals);

      return true;
  }
--- 798,807 ----
      COMPARE_SCALAR_FIELD(jointype);
      COMPARE_SCALAR_FIELD(lhs_strict);
      COMPARE_SCALAR_FIELD(delay_upper_joins);
!     COMPARE_SCALAR_FIELD(semi_can_btree);
!     COMPARE_SCALAR_FIELD(semi_can_hash);
!     COMPARE_NODE_FIELD(semi_operators);
!     COMPARE_NODE_FIELD(semi_rhs_exprs);

      return true;
  }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 775f482..75c57a2 100644
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
*************** _outSpecialJoinInfo(StringInfo str, cons
*** 1945,1951 ****
      WRITE_ENUM_FIELD(jointype, JoinType);
      WRITE_BOOL_FIELD(lhs_strict);
      WRITE_BOOL_FIELD(delay_upper_joins);
!     WRITE_NODE_FIELD(join_quals);
  }

  static void
--- 1945,1954 ----
      WRITE_ENUM_FIELD(jointype, JoinType);
      WRITE_BOOL_FIELD(lhs_strict);
      WRITE_BOOL_FIELD(delay_upper_joins);
!     WRITE_BOOL_FIELD(semi_can_btree);
!     WRITE_BOOL_FIELD(semi_can_hash);
!     WRITE_NODE_FIELD(semi_operators);
!     WRITE_NODE_FIELD(semi_rhs_exprs);
  }

  static void
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 5a9daf0..1a0d358 100644
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*************** compute_semi_anti_join_factors(PlannerIn
*** 3294,3300 ****
      /* we don't bother trying to make the remaining fields valid */
      norm_sjinfo.lhs_strict = false;
      norm_sjinfo.delay_upper_joins = false;
!     norm_sjinfo.join_quals = NIL;

      nselec = clauselist_selectivity(root,
                                      joinquals,
--- 3294,3303 ----
      /* we don't bother trying to make the remaining fields valid */
      norm_sjinfo.lhs_strict = false;
      norm_sjinfo.delay_upper_joins = false;
!     norm_sjinfo.semi_can_btree = false;
!     norm_sjinfo.semi_can_hash = false;
!     norm_sjinfo.semi_operators = NIL;
!     norm_sjinfo.semi_rhs_exprs = NIL;

      nselec = clauselist_selectivity(root,
                                      joinquals,
*************** approx_tuple_count(PlannerInfo *root, Jo
*** 3456,3462 ****
      /* we don't bother trying to make the remaining fields valid */
      sjinfo.lhs_strict = false;
      sjinfo.delay_upper_joins = false;
!     sjinfo.join_quals = NIL;

      /* Get the approximate selectivity */
      foreach(l, quals)
--- 3459,3468 ----
      /* we don't bother trying to make the remaining fields valid */
      sjinfo.lhs_strict = false;
      sjinfo.delay_upper_joins = false;
!     sjinfo.semi_can_btree = false;
!     sjinfo.semi_can_hash = false;
!     sjinfo.semi_operators = NIL;
!     sjinfo.semi_rhs_exprs = NIL;

      /* Get the approximate selectivity */
      foreach(l, quals)
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index b86a3cd..1b16012 100644
*** a/src/backend/optimizer/path/indxpath.c
--- b/src/backend/optimizer/path/indxpath.c
*************** static Relids get_bitmap_tree_required_o
*** 130,136 ****
  static void find_indexpath_quals(Path *bitmapqual, List **quals, List **preds);
  static int    find_list_position(Node *node, List **nodelist);
  static bool check_index_only(RelOptInfo *rel, IndexOptInfo *index);
! static double get_loop_count(PlannerInfo *root, Relids outer_relids);
  static void match_restriction_clauses_to_index(RelOptInfo *rel,
                                     IndexOptInfo *index,
                                     IndexClauseSet *clauseset);
--- 130,141 ----
  static void find_indexpath_quals(Path *bitmapqual, List **quals, List **preds);
  static int    find_list_position(Node *node, List **nodelist);
  static bool check_index_only(RelOptInfo *rel, IndexOptInfo *index);
! static double get_loop_count(PlannerInfo *root, Index cur_relid, Relids outer_relids);
! static double adjust_rowcount_for_semijoins(PlannerInfo *root,
!                               Index cur_relid,
!                               Index outer_relid,
!                               double rowcount);
! static double approximate_joinrel_size(PlannerInfo *root, Relids relids);
  static void match_restriction_clauses_to_index(RelOptInfo *rel,
                                     IndexOptInfo *index,
                                     IndexClauseSet *clauseset);
*************** create_index_paths(PlannerInfo *root, Re
*** 402,408 ****

              /* And push that path into the mix */
              required_outer = get_bitmap_tree_required_outer(bitmapqual);
!             loop_count = get_loop_count(root, required_outer);
              bpath = create_bitmap_heap_path(root, rel, bitmapqual,
                                              required_outer, loop_count);
              add_path(rel, (Path *) bpath);
--- 407,413 ----

              /* And push that path into the mix */
              required_outer = get_bitmap_tree_required_outer(bitmapqual);
!             loop_count = get_loop_count(root, rel->relid, required_outer);
              bpath = create_bitmap_heap_path(root, rel, bitmapqual,
                                              required_outer, loop_count);
              add_path(rel, (Path *) bpath);
*************** build_index_paths(PlannerInfo *root, Rel
*** 969,975 ****
          outer_relids = NULL;

      /* Compute loop_count for cost estimation purposes */
!     loop_count = get_loop_count(root, outer_relids);

      /*
       * 2. Compute pathkeys describing index's ordering, if any, then see how
--- 974,980 ----
          outer_relids = NULL;

      /* Compute loop_count for cost estimation purposes */
!     loop_count = get_loop_count(root, rel->relid, outer_relids);

      /*
       * 2. Compute pathkeys describing index's ordering, if any, then see how
*************** bitmap_scan_cost_est(PlannerInfo *root,
*** 1553,1559 ****
      cost_bitmap_heap_scan(&bpath.path, root, rel,
                            bpath.path.param_info,
                            ipath,
!                           get_loop_count(root, required_outer));

      return bpath.path.total_cost;
  }
--- 1558,1564 ----
      cost_bitmap_heap_scan(&bpath.path, root, rel,
                            bpath.path.param_info,
                            ipath,
!                           get_loop_count(root, rel->relid, required_outer));

      return bpath.path.total_cost;
  }
*************** bitmap_and_cost_est(PlannerInfo *root, R
*** 1594,1600 ****
      cost_bitmap_heap_scan(&bpath.path, root, rel,
                            bpath.path.param_info,
                            (Path *) &apath,
!                           get_loop_count(root, required_outer));

      return bpath.path.total_cost;
  }
--- 1599,1605 ----
      cost_bitmap_heap_scan(&bpath.path, root, rel,
                            bpath.path.param_info,
                            (Path *) &apath,
!                           get_loop_count(root, rel->relid, required_outer));

      return bpath.path.total_cost;
  }
*************** check_index_only(RelOptInfo *rel, IndexO
*** 1861,1892 ****
   * answer for single-other-relation cases, and it seems like a reasonable
   * zero-order approximation for multiway-join cases.
   *
   * Note: for this to work, allpaths.c must establish all baserel size
   * estimates before it begins to compute paths, or at least before it
   * calls create_index_paths().
   */
  static double
! get_loop_count(PlannerInfo *root, Relids outer_relids)
  {
      double        result = 1.0;

      /* For a non-parameterized path, just return 1.0 quickly */
      if (outer_relids != NULL)
      {
!         int            relid;

!         relid = -1;
!         while ((relid = bms_next_member(outer_relids, relid)) >= 0)
          {
              RelOptInfo *outer_rel;

              /* Paranoia: ignore bogus relid indexes */
!             if (relid >= root->simple_rel_array_size)
                  continue;
!             outer_rel = root->simple_rel_array[relid];
              if (outer_rel == NULL)
                  continue;
!             Assert(outer_rel->relid == relid);    /* sanity check on array */

              /* Other relation could be proven empty, if so ignore */
              if (IS_DUMMY_REL(outer_rel))
--- 1866,1904 ----
   * answer for single-other-relation cases, and it seems like a reasonable
   * zero-order approximation for multiway-join cases.
   *
+  * In addition, we check to see if the other side of each join clause is on
+  * the inside of some semijoin that the current relation is on the outside of.
+  * If so, the only way that a parameterized path could be used is if the
+  * semijoin RHS has been unique-ified, so we should use the number of unique
+  * RHS rows rather than using the relation's raw rowcount.
+  *
   * Note: for this to work, allpaths.c must establish all baserel size
   * estimates before it begins to compute paths, or at least before it
   * calls create_index_paths().
   */
  static double
! get_loop_count(PlannerInfo *root, Index cur_relid, Relids outer_relids)
  {
      double        result = 1.0;

      /* For a non-parameterized path, just return 1.0 quickly */
      if (outer_relids != NULL)
      {
!         int            outer_relid;

!         outer_relid = -1;
!         while ((outer_relid = bms_next_member(outer_relids, outer_relid)) >= 0)
          {
              RelOptInfo *outer_rel;
+             double        rowcount;

              /* Paranoia: ignore bogus relid indexes */
!             if (outer_relid >= root->simple_rel_array_size)
                  continue;
!             outer_rel = root->simple_rel_array[outer_relid];
              if (outer_rel == NULL)
                  continue;
!             Assert(outer_rel->relid == outer_relid);    /* sanity check on array */

              /* Other relation could be proven empty, if so ignore */
              if (IS_DUMMY_REL(outer_rel))
*************** get_loop_count(PlannerInfo *root, Relids
*** 1895,1908 ****
              /* Otherwise, rel's rows estimate should be valid by now */
              Assert(outer_rel->rows > 0);

              /* Remember smallest row count estimate among the outer rels */
!             if (result == 1.0 || result > outer_rel->rows)
!                 result = outer_rel->rows;
          }
      }
      return result;
  }


  /****************************************************************************
   *                ----  ROUTINES TO CHECK QUERY CLAUSES  ----
--- 1907,2006 ----
              /* Otherwise, rel's rows estimate should be valid by now */
              Assert(outer_rel->rows > 0);

+             /* Check to see if rel is on the inside of any semijoins */
+             rowcount = adjust_rowcount_for_semijoins(root,
+                                                      cur_relid,
+                                                      outer_relid,
+                                                      outer_rel->rows);
+
              /* Remember smallest row count estimate among the outer rels */
!             if (result == 1.0 || result > rowcount)
!                 result = rowcount;
          }
      }
      return result;
  }

+ /*
+  * Check to see if outer_relid is on the inside of any semijoin that cur_relid
+  * is on the outside of.  If so, replace rowcount with the estimated number of
+  * unique rows from the semijoin RHS.  The estimate is crude but it's the best
+  * we can do at this stage of the proceedings.
+  */
+ static double
+ adjust_rowcount_for_semijoins(PlannerInfo *root,
+                               Index cur_relid,
+                               Index outer_relid,
+                               double rowcount)
+ {
+     ListCell   *lc;
+
+     foreach(lc, root->join_info_list)
+     {
+         SpecialJoinInfo *sjinfo = (SpecialJoinInfo *) lfirst(lc);
+
+         if (sjinfo->jointype == JOIN_SEMI &&
+             bms_is_member(cur_relid, sjinfo->syn_lefthand) &&
+             bms_is_member(outer_relid, sjinfo->syn_righthand))
+         {
+             /* Estimate number of unique-ified rows */
+             double        nraw;
+             double        nunique;
+
+             nraw = approximate_joinrel_size(root, sjinfo->syn_righthand);
+             nunique = estimate_num_groups(root,
+                                           sjinfo->semi_rhs_exprs,
+                                           nraw);
+             if (rowcount > nunique)
+                 rowcount = nunique;
+         }
+     }
+     return rowcount;
+ }
+
+ /*
+  * Make an approximate estimate of the size of a joinrel.
+  *
+  * We don't have enough info at this point to get a good estimate, so we
+  * just multiply the base relation sizes together.  Fortunately, this is
+  * the right answer anyway for the most common case with a single relation
+  * on the RHS of a semijoin.  Also, estimate_num_groups() has only a weak
+  * dependency on its input_rows argument (it basically uses it as a clamp).
+  * So we might be able to get a fairly decent estimate even with a severe
+  * overestimate of the RHS's raw size.
+  */
+ static double
+ approximate_joinrel_size(PlannerInfo *root, Relids relids)
+ {
+     double        rowcount = 1.0;
+     int            relid;
+
+     relid = -1;
+     while ((relid = bms_next_member(relids, relid)) >= 0)
+     {
+         RelOptInfo *rel;
+
+         /* Paranoia: ignore bogus relid indexes */
+         if (relid >= root->simple_rel_array_size)
+             continue;
+         rel = root->simple_rel_array[relid];
+         if (rel == NULL)
+             continue;
+         Assert(rel->relid == relid);    /* sanity check on array */
+
+         /* Relation could be proven empty, if so ignore */
+         if (IS_DUMMY_REL(rel))
+             continue;
+
+         /* Otherwise, rel's rows estimate should be valid by now */
+         Assert(rel->rows > 0);
+
+         /* Accumulate product */
+         rowcount *= rel->rows;
+     }
+     return rowcount;
+ }
+

  /****************************************************************************
   *                ----  ROUTINES TO CHECK QUERY CLAUSES  ----
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index e7e9a1a..fe9fd57 100644
*** a/src/backend/optimizer/path/joinrels.c
--- b/src/backend/optimizer/path/joinrels.c
*************** make_join_rel(PlannerInfo *root, RelOptI
*** 624,630 ****
          /* we don't bother trying to make the remaining fields valid */
          sjinfo->lhs_strict = false;
          sjinfo->delay_upper_joins = false;
!         sjinfo->join_quals = NIL;
      }

      /*
--- 624,633 ----
          /* we don't bother trying to make the remaining fields valid */
          sjinfo->lhs_strict = false;
          sjinfo->delay_upper_joins = false;
!         sjinfo->semi_can_btree = false;
!         sjinfo->semi_can_hash = false;
!         sjinfo->semi_operators = NIL;
!         sjinfo->semi_rhs_exprs = NIL;
      }

      /*
diff --git a/src/backend/optimizer/plan/initsplan.c b/src/backend/optimizer/plan/initsplan.c
index 49d776d..a7655e4 100644
*** a/src/backend/optimizer/plan/initsplan.c
--- b/src/backend/optimizer/plan/initsplan.c
***************
*** 17,22 ****
--- 17,23 ----
  #include "catalog/pg_type.h"
  #include "nodes/nodeFuncs.h"
  #include "optimizer/clauses.h"
+ #include "optimizer/cost.h"
  #include "optimizer/joininfo.h"
  #include "optimizer/pathnode.h"
  #include "optimizer/paths.h"
*************** static SpecialJoinInfo *make_outerjoinin
*** 55,60 ****
--- 56,62 ----
                     Relids left_rels, Relids right_rels,
                     Relids inner_join_rels,
                     JoinType jointype, List *clause);
+ static void compute_semijoin_info(SpecialJoinInfo *sjinfo, List *clause);
  static void distribute_qual_to_rels(PlannerInfo *root, Node *clause,
                          bool is_deduced,
                          bool below_outer_join,
*************** make_outerjoininfo(PlannerInfo *root,
*** 1085,1091 ****
      sjinfo->jointype = jointype;
      /* this always starts out false */
      sjinfo->delay_upper_joins = false;
!     sjinfo->join_quals = clause;

      /* If it's a full join, no need to be very smart */
      if (jointype == JOIN_FULL)
--- 1087,1094 ----
      sjinfo->jointype = jointype;
      /* this always starts out false */
      sjinfo->delay_upper_joins = false;
!
!     compute_semijoin_info(sjinfo, clause);

      /* If it's a full join, no need to be very smart */
      if (jointype == JOIN_FULL)
*************** make_outerjoininfo(PlannerInfo *root,
*** 1237,1242 ****
--- 1240,1421 ----
      return sjinfo;
  }

+ /*
+  * compute_semijoin_info
+  *      Fill semijoin-related fields of a new SpecialJoinInfo
+  *
+  * Note: this relies on only the jointype and syn_righthand fields of the
+  * SpecialJoinInfo; the rest may not be set yet.
+  */
+ static void
+ compute_semijoin_info(SpecialJoinInfo *sjinfo, List *clause)
+ {
+     List       *semi_operators;
+     List       *semi_rhs_exprs;
+     bool        all_btree;
+     bool        all_hash;
+     ListCell   *lc;
+
+     /* Initialize semijoin-related fields in case we can't unique-ify */
+     sjinfo->semi_can_btree = false;
+     sjinfo->semi_can_hash = false;
+     sjinfo->semi_operators = NIL;
+     sjinfo->semi_rhs_exprs = NIL;
+
+     /* Nothing more to do if it's not a semijoin */
+     if (sjinfo->jointype != JOIN_SEMI)
+         return;
+
+     /*
+      * Look to see whether the semijoin's join quals consist of AND'ed
+      * equality operators, with (only) RHS variables on only one side of each
+      * one.  If so, we can figure out how to enforce uniqueness for the RHS.
+      *
+      * Note that the input clause list is the list of quals that are
+      * *syntactically* associated with the semijoin, which in practice means
+      * the synthesized comparison list for an IN or the WHERE of an EXISTS.
+      * Particularly in the latter case, it might contain clauses that aren't
+      * *semantically* associated with the join, but refer to just one side or
+      * the other.  We can ignore such clauses here, as they will just drop
+      * down to be processed within one side or the other.  (It is okay to
+      * consider only the syntactically-associated clauses here because for a
+      * semijoin, no higher-level quals could refer to the RHS, and so there
+      * can be no other quals that are semantically associated with this join.
+      * We do things this way because it is useful to have the set of potential
+      * unique-ification expressions before we can extract the list of quals
+      * that are actually semantically associated with the particular join.)
+      *
+      * Note that the semi_operators list consists of the joinqual operators
+      * themselves (but commuted if needed to put the RHS value on the right).
+      * These could be cross-type operators, in which case the operator
+      * actually needed for uniqueness is a related single-type operator. We
+      * assume here that that operator will be available from the btree or hash
+      * opclass when the time comes ... if not, create_unique_plan() will fail.
+      */
+     semi_operators = NIL;
+     semi_rhs_exprs = NIL;
+     all_btree = true;
+     all_hash = enable_hashagg;    /* don't consider hash if not enabled */
+     foreach(lc, clause)
+     {
+         OpExpr       *op = (OpExpr *) lfirst(lc);
+         Oid            opno;
+         Node       *left_expr;
+         Node       *right_expr;
+         Relids        left_varnos;
+         Relids        right_varnos;
+         Relids        all_varnos;
+         Oid            opinputtype;
+
+         /* Is it a binary opclause? */
+         if (!IsA(op, OpExpr) ||
+             list_length(op->args) != 2)
+         {
+             /* No, but does it reference both sides? */
+             all_varnos = pull_varnos((Node *) op);
+             if (!bms_overlap(all_varnos, sjinfo->syn_righthand) ||
+                 bms_is_subset(all_varnos, sjinfo->syn_righthand))
+             {
+                 /*
+                  * Clause refers to only one rel, so ignore it --- unless it
+                  * contains volatile functions, in which case we'd better
+                  * punt.
+                  */
+                 if (contain_volatile_functions((Node *) op))
+                     return;
+                 continue;
+             }
+             /* Non-operator clause referencing both sides, must punt */
+             return;
+         }
+
+         /* Extract data from binary opclause */
+         opno = op->opno;
+         left_expr = linitial(op->args);
+         right_expr = lsecond(op->args);
+         left_varnos = pull_varnos(left_expr);
+         right_varnos = pull_varnos(right_expr);
+         all_varnos = bms_union(left_varnos, right_varnos);
+         opinputtype = exprType(left_expr);
+
+         /* Does it reference both sides? */
+         if (!bms_overlap(all_varnos, sjinfo->syn_righthand) ||
+             bms_is_subset(all_varnos, sjinfo->syn_righthand))
+         {
+             /*
+              * Clause refers to only one rel, so ignore it --- unless it
+              * contains volatile functions, in which case we'd better punt.
+              */
+             if (contain_volatile_functions((Node *) op))
+                 return;
+             continue;
+         }
+
+         /* check rel membership of arguments */
+         if (!bms_is_empty(right_varnos) &&
+             bms_is_subset(right_varnos, sjinfo->syn_righthand) &&
+             !bms_overlap(left_varnos, sjinfo->syn_righthand))
+         {
+             /* typical case, right_expr is RHS variable */
+         }
+         else if (!bms_is_empty(left_varnos) &&
+                  bms_is_subset(left_varnos, sjinfo->syn_righthand) &&
+                  !bms_overlap(right_varnos, sjinfo->syn_righthand))
+         {
+             /* flipped case, left_expr is RHS variable */
+             opno = get_commutator(opno);
+             if (!OidIsValid(opno))
+                 return;
+             right_expr = left_expr;
+         }
+         else
+         {
+             /* mixed membership of args, punt */
+             return;
+         }
+
+         /* all operators must be btree equality or hash equality */
+         if (all_btree)
+         {
+             /* oprcanmerge is considered a hint... */
+             if (!op_mergejoinable(opno, opinputtype) ||
+                 get_mergejoin_opfamilies(opno) == NIL)
+                 all_btree = false;
+         }
+         if (all_hash)
+         {
+             /* ... but oprcanhash had better be correct */
+             if (!op_hashjoinable(opno, opinputtype))
+                 all_hash = false;
+         }
+         if (!(all_btree || all_hash))
+             return;
+
+         /* so far so good, keep building lists */
+         semi_operators = lappend_oid(semi_operators, opno);
+         semi_rhs_exprs = lappend(semi_rhs_exprs, copyObject(right_expr));
+     }
+
+     /* Punt if we didn't find at least one column to unique-ify */
+     if (semi_rhs_exprs == NIL)
+         return;
+
+     /*
+      * The expressions we'd need to unique-ify mustn't be volatile.
+      */
+     if (contain_volatile_functions((Node *) semi_rhs_exprs))
+         return;
+
+     /*
+      * If we get here, we can unique-ify the semijoin's RHS using at least one
+      * of sorting and hashing.  Save the information about how to do that.
+      */
+     sjinfo->semi_can_btree = all_btree;
+     sjinfo->semi_can_hash = all_hash;
+     sjinfo->semi_operators = semi_operators;
+     sjinfo->semi_rhs_exprs = semi_rhs_exprs;
+ }
+

  /*****************************************************************************
   *
diff --git a/src/backend/optimizer/util/orclauses.c b/src/backend/optimizer/util/orclauses.c
index d1c4e99..f0acc14 100644
*** a/src/backend/optimizer/util/orclauses.c
--- b/src/backend/optimizer/util/orclauses.c
*************** consider_new_or_clause(PlannerInfo *root
*** 335,341 ****
          /* we don't bother trying to make the remaining fields valid */
          sjinfo.lhs_strict = false;
          sjinfo.delay_upper_joins = false;
!         sjinfo.join_quals = NIL;

          /* Compute inner-join size */
          orig_selec = clause_selectivity(root, (Node *) join_or_rinfo,
--- 335,344 ----
          /* we don't bother trying to make the remaining fields valid */
          sjinfo.lhs_strict = false;
          sjinfo.delay_upper_joins = false;
!         sjinfo.semi_can_btree = false;
!         sjinfo.semi_can_hash = false;
!         sjinfo.semi_operators = NIL;
!         sjinfo.semi_rhs_exprs = NIL;

          /* Compute inner-join size */
          orig_selec = clause_selectivity(root, (Node *) join_or_rinfo,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 1395a21..faca30b 100644
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
*************** create_unique_path(PlannerInfo *root, Re
*** 1088,1099 ****
      Path        sort_path;        /* dummy for result of cost_sort */
      Path        agg_path;        /* dummy for result of cost_agg */
      MemoryContext oldcontext;
-     List       *in_operators;
-     List       *uniq_exprs;
-     bool        all_btree;
-     bool        all_hash;
      int            numCols;
-     ListCell   *lc;

      /* Caller made a mistake if subpath isn't cheapest_total ... */
      Assert(subpath == rel->cheapest_total_path);
--- 1088,1094 ----
*************** create_unique_path(PlannerInfo *root, Re
*** 1106,1113 ****
      if (rel->cheapest_unique_path)
          return (UniquePath *) rel->cheapest_unique_path;

!     /* If we previously failed, return NULL quickly */
!     if (sjinfo->join_quals == NIL)
          return NULL;

      /*
--- 1101,1108 ----
      if (rel->cheapest_unique_path)
          return (UniquePath *) rel->cheapest_unique_path;

!     /* If it's not possible to unique-ify, return NULL */
!     if (!(sjinfo->semi_can_btree || sjinfo->semi_can_hash))
          return NULL;

      /*
*************** create_unique_path(PlannerInfo *root, Re
*** 1116,1265 ****
       */
      oldcontext = MemoryContextSwitchTo(root->planner_cxt);

-     /*----------
-      * Look to see whether the semijoin's join quals consist of AND'ed
-      * equality operators, with (only) RHS variables on only one side of
-      * each one.  If so, we can figure out how to enforce uniqueness for
-      * the RHS.
-      *
-      * Note that the input join_quals list is the list of quals that are
-      * *syntactically* associated with the semijoin, which in practice means
-      * the synthesized comparison list for an IN or the WHERE of an EXISTS.
-      * Particularly in the latter case, it might contain clauses that aren't
-      * *semantically* associated with the join, but refer to just one side or
-      * the other.  We can ignore such clauses here, as they will just drop
-      * down to be processed within one side or the other.  (It is okay to
-      * consider only the syntactically-associated clauses here because for a
-      * semijoin, no higher-level quals could refer to the RHS, and so there
-      * can be no other quals that are semantically associated with this join.
-      * We do things this way because it is useful to be able to run this test
-      * before we have extracted the list of quals that are actually
-      * semantically associated with the particular join.)
-      *
-      * Note that the in_operators list consists of the joinqual operators
-      * themselves (but commuted if needed to put the RHS value on the right).
-      * These could be cross-type operators, in which case the operator
-      * actually needed for uniqueness is a related single-type operator.
-      * We assume here that that operator will be available from the btree
-      * or hash opclass when the time comes ... if not, create_unique_plan()
-      * will fail.
-      *----------
-      */
-     in_operators = NIL;
-     uniq_exprs = NIL;
-     all_btree = true;
-     all_hash = enable_hashagg;    /* don't consider hash if not enabled */
-     foreach(lc, sjinfo->join_quals)
-     {
-         OpExpr       *op = (OpExpr *) lfirst(lc);
-         Oid            opno;
-         Node       *left_expr;
-         Node       *right_expr;
-         Relids        left_varnos;
-         Relids        right_varnos;
-         Relids        all_varnos;
-         Oid            opinputtype;
-
-         /* Is it a binary opclause? */
-         if (!IsA(op, OpExpr) ||
-             list_length(op->args) != 2)
-         {
-             /* No, but does it reference both sides? */
-             all_varnos = pull_varnos((Node *) op);
-             if (!bms_overlap(all_varnos, sjinfo->syn_righthand) ||
-                 bms_is_subset(all_varnos, sjinfo->syn_righthand))
-             {
-                 /*
-                  * Clause refers to only one rel, so ignore it --- unless it
-                  * contains volatile functions, in which case we'd better
-                  * punt.
-                  */
-                 if (contain_volatile_functions((Node *) op))
-                     goto no_unique_path;
-                 continue;
-             }
-             /* Non-operator clause referencing both sides, must punt */
-             goto no_unique_path;
-         }
-
-         /* Extract data from binary opclause */
-         opno = op->opno;
-         left_expr = linitial(op->args);
-         right_expr = lsecond(op->args);
-         left_varnos = pull_varnos(left_expr);
-         right_varnos = pull_varnos(right_expr);
-         all_varnos = bms_union(left_varnos, right_varnos);
-         opinputtype = exprType(left_expr);
-
-         /* Does it reference both sides? */
-         if (!bms_overlap(all_varnos, sjinfo->syn_righthand) ||
-             bms_is_subset(all_varnos, sjinfo->syn_righthand))
-         {
-             /*
-              * Clause refers to only one rel, so ignore it --- unless it
-              * contains volatile functions, in which case we'd better punt.
-              */
-             if (contain_volatile_functions((Node *) op))
-                 goto no_unique_path;
-             continue;
-         }
-
-         /* check rel membership of arguments */
-         if (!bms_is_empty(right_varnos) &&
-             bms_is_subset(right_varnos, sjinfo->syn_righthand) &&
-             !bms_overlap(left_varnos, sjinfo->syn_righthand))
-         {
-             /* typical case, right_expr is RHS variable */
-         }
-         else if (!bms_is_empty(left_varnos) &&
-                  bms_is_subset(left_varnos, sjinfo->syn_righthand) &&
-                  !bms_overlap(right_varnos, sjinfo->syn_righthand))
-         {
-             /* flipped case, left_expr is RHS variable */
-             opno = get_commutator(opno);
-             if (!OidIsValid(opno))
-                 goto no_unique_path;
-             right_expr = left_expr;
-         }
-         else
-             goto no_unique_path;
-
-         /* all operators must be btree equality or hash equality */
-         if (all_btree)
-         {
-             /* oprcanmerge is considered a hint... */
-             if (!op_mergejoinable(opno, opinputtype) ||
-                 get_mergejoin_opfamilies(opno) == NIL)
-                 all_btree = false;
-         }
-         if (all_hash)
-         {
-             /* ... but oprcanhash had better be correct */
-             if (!op_hashjoinable(opno, opinputtype))
-                 all_hash = false;
-         }
-         if (!(all_btree || all_hash))
-             goto no_unique_path;
-
-         /* so far so good, keep building lists */
-         in_operators = lappend_oid(in_operators, opno);
-         uniq_exprs = lappend(uniq_exprs, copyObject(right_expr));
-     }
-
-     /* Punt if we didn't find at least one column to unique-ify */
-     if (uniq_exprs == NIL)
-         goto no_unique_path;
-
-     /*
-      * The expressions we'd need to unique-ify mustn't be volatile.
-      */
-     if (contain_volatile_functions((Node *) uniq_exprs))
-         goto no_unique_path;
-
-     /*
-      * If we get here, we can unique-ify using at least one of sorting and
-      * hashing.  Start building the result Path object.
-      */
      pathnode = makeNode(UniquePath);

      pathnode->path.pathtype = T_Unique;
--- 1111,1116 ----
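
The big deletion above is the qual-screening loop, relocated essentially verbatim into the new function at the top of the patch. Its decision table is easy to lose among the pointer manipulation, so here it is as a standalone model (all types invented; the real code works on OpExpr nodes, Relids sets, and the operator catalogs):

#include <stdbool.h>

typedef enum { QUAL_IGNORE, QUAL_USE, QUAL_PUNT } Verdict;

typedef struct Qual
{
    bool   binary_op;       /* two-argument operator clause? */
    bool   refs_lhs;        /* references the outer side? */
    bool   refs_rhs;        /* references the semijoin RHS? */
    bool   has_volatile;    /* contains volatile functions? */
    bool   rhs_isolated;    /* RHS vars confined to one argument
                             * (possibly after commuting the op)? */
    bool   btree_eq;        /* op has a btree (merge) equality family */
    bool   hash_eq;         /* op is hashjoinable */
} Qual;

Verdict
screen_qual(const Qual *q, bool *all_btree, bool *all_hash)
{
    if (!(q->refs_lhs && q->refs_rhs))
        /* one-sided clause: harmless unless volatile */
        return q->has_volatile ? QUAL_PUNT : QUAL_IGNORE;
    if (!q->binary_op || !q->rhs_isolated)
        return QUAL_PUNT;   /* no clean RHS expression to extract */
    *all_btree = *all_btree && q->btree_eq;
    *all_hash = *all_hash && q->hash_eq;
    return (*all_btree || *all_hash) ? QUAL_USE : QUAL_PUNT;
}

One subtlety to keep in mind: when the RHS value sits on the left (y = x rather than x = y), the operator is replaced by its commutator before being recorded, so semi_operators always ends up with the RHS expression on the right.
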
*************** create_unique_path(PlannerInfo *root, Re
*** 1273,1290 ****
      pathnode->path.pathkeys = NIL;

      pathnode->subpath = subpath;
!     pathnode->in_operators = in_operators;
!     pathnode->uniq_exprs = uniq_exprs;

      /*
       * If the input is a relation and it has a unique index that proves the
!      * uniq_exprs are unique, then we don't need to do anything.  Note that
!      * relation_has_unique_index_for automatically considers restriction
       * clauses for the rel, as well.
       */
!     if (rel->rtekind == RTE_RELATION && all_btree &&
          relation_has_unique_index_for(root, rel, NIL,
!                                       uniq_exprs, in_operators))
      {
          pathnode->umethod = UNIQUE_PATH_NOOP;
          pathnode->path.rows = rel->rows;
--- 1124,1142 ----
      pathnode->path.pathkeys = NIL;

      pathnode->subpath = subpath;
!     pathnode->in_operators = sjinfo->semi_operators;
!     pathnode->uniq_exprs = sjinfo->semi_rhs_exprs;

      /*
       * If the input is a relation and it has a unique index that proves the
!      * semi_rhs_exprs are unique, then we don't need to do anything.  Note
!      * that relation_has_unique_index_for automatically considers restriction
       * clauses for the rel, as well.
       */
!     if (rel->rtekind == RTE_RELATION && sjinfo->semi_can_btree &&
          relation_has_unique_index_for(root, rel, NIL,
!                                       sjinfo->semi_rhs_exprs,
!                                       sjinfo->semi_operators))
      {
          pathnode->umethod = UNIQUE_PATH_NOOP;
          pathnode->path.rows = rel->rows;
*************** create_unique_path(PlannerInfo *root, Re
*** 1304,1310 ****
       * don't need to do anything.  The test for uniqueness has to consider
       * exactly which columns we are extracting; for example "SELECT DISTINCT
       * x,y" doesn't guarantee that x alone is distinct. So we cannot check for
!      * this optimization unless uniq_exprs consists only of simple Vars
       * referencing subquery outputs.  (Possibly we could do something with
       * expressions in the subquery outputs, too, but for now keep it simple.)
       */
--- 1156,1162 ----
       * don't need to do anything.  The test for uniqueness has to consider
       * exactly which columns we are extracting; for example "SELECT DISTINCT
       * x,y" doesn't guarantee that x alone is distinct. So we cannot check for
!      * this optimization unless semi_rhs_exprs consists only of simple Vars
       * referencing subquery outputs.  (Possibly we could do something with
       * expressions in the subquery outputs, too, but for now keep it simple.)
       */
*************** create_unique_path(PlannerInfo *root, Re
*** 1316,1326 ****
          {
              List       *sub_tlist_colnos;

!             sub_tlist_colnos = translate_sub_tlist(uniq_exprs, rel->relid);

              if (sub_tlist_colnos &&
                  query_is_distinct_for(rte->subquery,
!                                       sub_tlist_colnos, in_operators))
              {
                  pathnode->umethod = UNIQUE_PATH_NOOP;
                  pathnode->path.rows = rel->rows;
--- 1168,1180 ----
          {
              List       *sub_tlist_colnos;

!             sub_tlist_colnos = translate_sub_tlist(sjinfo->semi_rhs_exprs,
!                                                    rel->relid);

              if (sub_tlist_colnos &&
                  query_is_distinct_for(rte->subquery,
!                                       sub_tlist_colnos,
!                                       sjinfo->semi_operators))
              {
                  pathnode->umethod = UNIQUE_PATH_NOOP;
                  pathnode->path.rows = rel->rows;
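
These two hunks are the pre-existing no-op shortcuts, merely rewritten to read from sjinfo: the Unique node is skipped when the input is provably distinct on exactly the unique-ification columns. Condensed into one predicate (helper results hypothetical):

#include <stdbool.h>

/* True when de-duplication would be a no-op.  Note the "exactly these
 * columns" requirement: SELECT DISTINCT x, y proves (x, y) distinct,
 * not x alone. */
bool
rhs_already_distinct(bool is_base_rel, bool unique_index_matches,
                     bool is_subquery, bool subquery_distinct_matches)
{
    if (is_base_rel && unique_index_matches)
        return true;        /* relation_has_unique_index_for() said yes */
    if (is_subquery && subquery_distinct_matches)
        return true;        /* query_is_distinct_for() said yes */
    return false;
}
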
*************** create_unique_path(PlannerInfo *root, Re
*** 1338,1347 ****
      }

      /* Estimate number of output rows */
!     pathnode->path.rows = estimate_num_groups(root, uniq_exprs, rel->rows);
!     numCols = list_length(uniq_exprs);

!     if (all_btree)
      {
          /*
           * Estimate cost for sort+unique implementation
--- 1192,1203 ----
      }

      /* Estimate number of output rows */
!     pathnode->path.rows = estimate_num_groups(root,
!                                               sjinfo->semi_rhs_exprs,
!                                               rel->rows);
!     numCols = list_length(sjinfo->semi_rhs_exprs);

!     if (sjinfo->semi_can_btree)
      {
          /*
           * Estimate cost for sort+unique implementation
*************** create_unique_path(PlannerInfo *root, Re
*** 1363,1369 ****
          sort_path.total_cost += cpu_operator_cost * rel->rows * numCols;
      }

!     if (all_hash)
      {
          /*
           * Estimate the overhead per hashtable entry at 64 bytes (same as in
--- 1219,1225 ----
          sort_path.total_cost += cpu_operator_cost * rel->rows * numCols;
      }

!     if (sjinfo->semi_can_hash)
      {
          /*
           * Estimate the overhead per hashtable entry at 64 bytes (same as in
*************** create_unique_path(PlannerInfo *root, Re
*** 1372,1378 ****
          int            hashentrysize = rel->width + 64;

          if (hashentrysize * pathnode->path.rows > work_mem * 1024L)
!             all_hash = false;    /* don't try to hash */
          else
              cost_agg(&agg_path, root,
                       AGG_HASHED, NULL,
--- 1228,1240 ----
          int            hashentrysize = rel->width + 64;

          if (hashentrysize * pathnode->path.rows > work_mem * 1024L)
!         {
!             /*
!              * We should not try to hash.  Hack the SpecialJoinInfo to
!              * remember this, in case we come through here again.
!              */
!             sjinfo->semi_can_hash = false;
!         }
          else
              cost_agg(&agg_path, root,
                       AGG_HASHED, NULL,
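
The memory check itself is unchanged; only the way a failed check is remembered differs. For reference, the arithmetic it performs (constants as in the code: 64 bytes of per-entry overhead, work_mem measured in kilobytes):

#include <stdbool.h>

bool
hash_fits_in_work_mem(int tuple_width, double est_rows, int work_mem_kb)
{
    double hashentrysize = tuple_width + 64;    /* bytes per entry */

    return hashentrysize * est_rows <= work_mem_kb * 1024.0;
}
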
*************** create_unique_path(PlannerInfo *root, Re
*** 1382,1400 ****
                       rel->rows);
      }

!     if (all_btree && all_hash)
      {
          if (agg_path.total_cost < sort_path.total_cost)
              pathnode->umethod = UNIQUE_PATH_HASH;
          else
              pathnode->umethod = UNIQUE_PATH_SORT;
      }
!     else if (all_btree)
          pathnode->umethod = UNIQUE_PATH_SORT;
!     else if (all_hash)
          pathnode->umethod = UNIQUE_PATH_HASH;
      else
!         goto no_unique_path;

      if (pathnode->umethod == UNIQUE_PATH_HASH)
      {
--- 1244,1266 ----
                       rel->rows);
      }

!     if (sjinfo->semi_can_btree && sjinfo->semi_can_hash)
      {
          if (agg_path.total_cost < sort_path.total_cost)
              pathnode->umethod = UNIQUE_PATH_HASH;
          else
              pathnode->umethod = UNIQUE_PATH_SORT;
      }
!     else if (sjinfo->semi_can_btree)
          pathnode->umethod = UNIQUE_PATH_SORT;
!     else if (sjinfo->semi_can_hash)
          pathnode->umethod = UNIQUE_PATH_HASH;
      else
!     {
!         /* we can get here only if we abandoned hashing above */
!         MemoryContextSwitchTo(oldcontext);
!         return NULL;
!     }

      if (pathnode->umethod == UNIQUE_PATH_HASH)
      {
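
Finally the method choice, now driven by the saved flags: a straight cost comparison when both strategies are legal, while the new else-branch (replacing the old goto) is reachable only when hashing was just abandoned above. As a pure function (names invented):

#include <stdbool.h>

typedef enum { UNIQ_SORT, UNIQ_HASH, UNIQ_NONE } UniqChoice;

UniqChoice
choose_umethod(bool can_btree, bool can_hash,
               double sort_cost, double hash_cost)
{
    if (can_btree && can_hash)
        return (hash_cost < sort_cost) ? UNIQ_HASH : UNIQ_SORT;
    if (can_btree)
        return UNIQ_SORT;
    if (can_hash)
        return UNIQ_HASH;
    return UNIQ_NONE;       /* caller returns NULL, as in the patch */
}
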
*************** create_unique_path(PlannerInfo *root, Re
*** 1412,1426 ****
      MemoryContextSwitchTo(oldcontext);

      return pathnode;
-
- no_unique_path:            /* failure exit */
-
-     /* Mark the SpecialJoinInfo as not unique-able */
-     sjinfo->join_quals = NIL;
-
-     MemoryContextSwitchTo(oldcontext);
-
-     return NULL;
  }

  /*
--- 1278,1283 ----
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 6845a40..e6dd4e8 100644
*** a/src/include/nodes/relation.h
--- b/src/include/nodes/relation.h
*************** typedef struct PlannerGlobal
*** 101,108 ****

      bool        transientPlan;    /* redo plan when TransactionXmin changes? */

!     bool        hasRowSecurity;    /* row security applied? */
!
  } PlannerGlobal;

  /* macro for fetching the Plan associated with a SubPlan node */
--- 101,107 ----

      bool        transientPlan;    /* redo plan when TransactionXmin changes? */

!     bool        hasRowSecurity; /* row security applied? */
  } PlannerGlobal;

  /* macro for fetching the Plan associated with a SubPlan node */
*************** typedef struct PlaceHolderVar
*** 1374,1384 ****
   * commute with this join, because that would leave noplace to check the
   * pushed-down clause.  (We don't track this for FULL JOINs, either.)
   *
!  * join_quals is an implicit-AND list of the quals syntactically associated
!  * with the join (they may or may not end up being applied at the join level).
!  * This is just a side list and does not drive actual application of quals.
!  * For JOIN_SEMI joins, this is cleared to NIL in create_unique_path() if
!  * the join is found not to be suitable for a uniqueify-the-RHS plan.
   *
   * jointype is never JOIN_RIGHT; a RIGHT JOIN is handled by switching
   * the inputs to make it a LEFT JOIN.  So the allowed values of jointype
--- 1373,1385 ----
   * commute with this join, because that would leave noplace to check the
   * pushed-down clause.  (We don't track this for FULL JOINs, either.)
   *
!  * For a semijoin, we also extract the join operators and their RHS arguments
!  * and set semi_operators, semi_rhs_exprs, semi_can_btree, and semi_can_hash.
!  * This is done in support of possibly unique-ifying the RHS, so we don't
!  * bother unless at least one of semi_can_btree and semi_can_hash can be set
!  * true.  (You might expect that this information would be computed during
!  * join planning; but it's helpful to have it available during planning of
!  * parameterized table scans, so we store it in the SpecialJoinInfo structs.)
   *
   * jointype is never JOIN_RIGHT; a RIGHT JOIN is handled by switching
   * the inputs to make it a LEFT JOIN.  So the allowed values of jointype
*************** typedef struct PlaceHolderVar
*** 1391,1397 ****
   * SpecialJoinInfos with jointype == JOIN_INNER for outer joins, since for
   * cost estimation purposes it is sometimes useful to know the join size under
   * plain innerjoin semantics.  Note that lhs_strict, delay_upper_joins, and
!  * join_quals are not set meaningfully within such structs.
   */

  typedef struct SpecialJoinInfo
--- 1392,1398 ----
   * SpecialJoinInfos with jointype == JOIN_INNER for outer joins, since for
   * cost estimation purposes it is sometimes useful to know the join size under
   * plain innerjoin semantics.  Note that lhs_strict, delay_upper_joins, and
!  * of course the semi_xxx fields are not set meaningfully within such structs.
   */

  typedef struct SpecialJoinInfo
*************** typedef struct SpecialJoinInfo
*** 1404,1410 ****
      JoinType    jointype;        /* always INNER, LEFT, FULL, SEMI, or ANTI */
      bool        lhs_strict;        /* joinclause is strict for some LHS rel */
      bool        delay_upper_joins;        /* can't commute with upper RHS */
!     List       *join_quals;        /* join quals, in implicit-AND list format */
  } SpecialJoinInfo;

  /*
--- 1405,1415 ----
      JoinType    jointype;        /* always INNER, LEFT, FULL, SEMI, or ANTI */
      bool        lhs_strict;        /* joinclause is strict for some LHS rel */
      bool        delay_upper_joins;        /* can't commute with upper RHS */
!     /* Remaining fields are set only for JOIN_SEMI jointype: */
!     bool        semi_can_btree; /* true if semi_operators are all btree */
!     bool        semi_can_hash;    /* true if semi_operators are all hash */
!     List       *semi_operators; /* OIDs of equality join operators */
!     List       *semi_rhs_exprs; /* righthand-side expressions of these ops */
  } SpecialJoinInfo;

  /*
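
To close the loop on the header comment: with the semi_xxx fields living in SpecialJoinInfo, code that runs during parameterized-scan planning can ask "is this semijoin's RHS unique-ifiable?" as a cheap flag test instead of re-deriving it from join quals. A hypothetical consumer, with invented mini-types:

#include <stdbool.h>

typedef enum { MJ_INNER, MJ_LEFT, MJ_FULL, MJ_SEMI, MJ_ANTI } MiniJoinType;

typedef struct MiniSemiJoin
{
    MiniJoinType jointype;
    bool         semi_can_btree;
    bool         semi_can_hash;
} MiniSemiJoin;

bool
rhs_unique_ifiable(const MiniSemiJoin *sj)
{
    return sj->jointype == MJ_SEMI &&
           (sj->semi_can_btree || sj->semi_can_hash);
}
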