Thread: limit in subquery causes poor selectivity estimation

limit in subquery causes poor selectivity estimation

From: Peter Eisentraut
This is an artificial test case shrunk down from a much larger real
query.

CREATE TABLE test1 (
    sha1 bytea PRIMARY KEY,
    something text
);

CREATE TABLE test2 (
    sha1 bytea PRIMARY KEY,
    blah text
);

Fill those with 1000 random rows each, e.g.,

for i in $(seq 1 1000); do
    sha1=$(echo $i | sha1sum | cut -f1 -d' ')
    psql -d test -c "INSERT INTO test1 VALUES (decode('$sha1','hex'), 'blah$i$i')"
done

for i in $(seq 500 1500); do
    sha1=$(echo $i | sha1sum | cut -f1 -d' ')
    psql -d test -c "INSERT INTO test2 VALUES (decode('$sha1','hex'), 'foo$i')"
done
 

(Doesn't really matter whether the key values are the same or
overlapping or not.  Just to make it interesting.)
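
If the pgcrypto extension happens to be available, a single-statement
fill is much quicker (a sketch, not part of the original recipe; the
E'\n' reproduces the trailing newline that echo feeds to sha1sum):

-- Assumes pgcrypto for digest()
CREATE EXTENSION IF NOT EXISTS pgcrypto;

INSERT INTO test1
    SELECT digest(i::text || E'\n', 'sha1'), 'blah' || i || i
    FROM generate_series(1, 1000) AS s(i);

INSERT INTO test2
    SELECT digest(i::text || E'\n', 'sha1'), 'foo' || i
    FROM generate_series(500, 1500) AS s(i);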

ANALYZE;

EXPLAIN SELECT * FROM test1  WHERE sha1 in (SELECT sha1 FROM test2);

                              QUERY PLAN
----------------------------------------------------------------------
 Hash Semi Join  (cost=30.52..61.27 rows=1000 width=27)
   Hash Cond: (test1.sha1 = test2.sha1)
   ->  Seq Scan on test1  (cost=0.00..17.00 rows=1000 width=27)
   ->  Hash  (cost=18.01..18.01 rows=1001 width=21)
         ->  Seq Scan on test2  (cost=0.00..18.01 rows=1001 width=21)
 

That's OK.  Apparently it can tell that joining two tables on their
primary keys cannot result in more rows than the smaller table.  (Or
can it?)

EXPLAIN SELECT * FROM test1  WHERE sha1 in (SELECT sha1 FROM test2 LIMIT 200);

                                    QUERY PLAN
----------------------------------------------------------------------------------
 Hash Join  (cost=10.60..33.35 rows=500 width=27)
   Hash Cond: (test1.sha1 = test2.sha1)
   ->  Seq Scan on test1  (cost=0.00..17.00 rows=1000 width=27)
   ->  Hash  (cost=8.10..8.10 rows=200 width=32)
         ->  HashAggregate  (cost=6.10..8.10 rows=200 width=32)
               ->  Limit  (cost=0.00..3.60 rows=200 width=21)
                     ->  Seq Scan on test2  (cost=0.00..18.01 rows=1001 width=21)

Here, however, it has apparently not passed this knowledge through the
LIMIT.

The problem is that in the real query, instead of the 500 up there it
estimates about 30 million (which might be a reasonable estimate for a
join between two unkeyed columns), when it should in fact be 200.
(And again, this is part of a larger query, which is then completely
messed up because of this misestimation.)

So what's up with that?  Just a case of, we haven't thought about
covering this case yet, or are there larger problems?



Re: limit in subquery causes poor selectivity estimation

From: Tom Lane
Peter Eisentraut <peter_e@gmx.net> writes:
> EXPLAIN SELECT * FROM test1  WHERE sha1 in (SELECT sha1 FROM test2);
>                               QUERY PLAN
> ----------------------------------------------------------------------
>  Hash Semi Join  (cost=30.52..61.27 rows=1000 width=27)
>    Hash Cond: (test1.sha1 = test2.sha1)
>    ->  Seq Scan on test1  (cost=0.00..17.00 rows=1000 width=27)
>    ->  Hash  (cost=18.01..18.01 rows=1001 width=21)
>          ->  Seq Scan on test2  (cost=0.00..18.01 rows=1001 width=21)

> That's OK.  Apparently it can tell that joining two tables on their
> primary keys cannot result in more rows than the smaller table.  (Or
> can it?)

More like it knows that a semijoin can't produce more rows than the
lefthand input has.  But I think it is actually applying stats for
both columns here.
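
(A quick way to see what the planner has to work with here is the
standard pg_stats view, where n_distinct = -1 means ANALYZE found the
column unique:

SELECT tablename, attname, n_distinct
FROM pg_stats
WHERE tablename IN ('test1', 'test2') AND attname = 'sha1';
)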

> EXPLAIN SELECT * FROM test1  WHERE sha1 in (SELECT sha1 FROM test2 LIMIT 200);

> Here, however, it has apparently not passed this knowledge through the
> LIMIT.

The LIMIT prevents the subquery from being flattened entirely, ie we
don't have just "test1 SEMI JOIN test2" but "test1 SEMI JOIN (SELECT *
FROM test2 LIMIT 200)".  If you look at examine_variable in selfuncs.c
you'll note that it punts for Vars coming from unflattened subqueries.
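
(LIMIT is not the only clause with this effect; for example, the classic
OFFSET 0 trick also keeps a subquery unflattened without changing its
output, and so hits the same punt:

EXPLAIN SELECT * FROM test1
 WHERE sha1 IN (SELECT sha1 FROM test2 OFFSET 0);
)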

> So what's up with that?  Just a case of, we haven't thought about
> covering this case yet, or are there larger problems?

The larger problem is that if a subquery didn't get flattened, it's
often because it's got LIMIT, or GROUP BY, or some similar clause that
makes it highly suspect whether the statistics available for the table
column are reasonable to use for the subquery outputs.  It wouldn't be
that hard to grab the stats for test2.sha1, but then how do you want
to adjust them to reflect the LIMIT?
        regards, tom lane


Re: limit in subquery causes poor selectivity estimation

From: Robert Haas
On Sat, Aug 27, 2011 at 1:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Peter Eisentraut <peter_e@gmx.net> writes:
>> EXPLAIN SELECT * FROM test1  WHERE sha1 in (SELECT sha1 FROM test2);
>>                               QUERY PLAN
>> ----------------------------------------------------------------------
>>  Hash Semi Join  (cost=30.52..61.27 rows=1000 width=27)
>>    Hash Cond: (test1.sha1 = test2.sha1)
>>    ->  Seq Scan on test1  (cost=0.00..17.00 rows=1000 width=27)
>>    ->  Hash  (cost=18.01..18.01 rows=1001 width=21)
>>          ->  Seq Scan on test2  (cost=0.00..18.01 rows=1001 width=21)
>
>> That's OK.  Apparently it can tell that joining two tables on their
>> primary keys cannot result in more rows than the smaller table.  (Or
>> can it?)
>
> More like it knows that a semijoin can't produce more rows than the
> lefthand input has.  But I think it is actually applying stats for
> both columns here.
>
>> EXPLAIN SELECT * FROM test1  WHERE sha1 in (SELECT sha1 FROM test2 LIMIT 200);
>
>> Here, however, it has apparently not passed this knowledge through the
>> LIMIT.
>
> The LIMIT prevents the subquery from being flattened entirely, ie we
> don't have just "test1 SEMI JOIN test2" but "test1 SEMI JOIN (SELECT *
> FROM test2 LIMIT 200)".  If you look at examine_variable in selfuncs.c
> you'll note that it punts for Vars coming from unflattened subqueries.
>
>> So what's up with that?  Just a case of, we haven't thought about
>> covering this case yet, or are there larger problems?
>
> The larger problem is that if a subquery didn't get flattened, it's
> often because it's got LIMIT, or GROUP BY, or some similar clause that
> makes it highly suspect whether the statistics available for the table
> column are reasonable to use for the subquery outputs.  It wouldn't be
> that hard to grab the stats for test2.sha1, but then how do you want
> to adjust them to reflect the LIMIT?

Well, you can't.  I think the question is, in the absence of perfect
information, is it better to use the stats you have, or just punt and
assume you know nothing?  Like Peter, I've certainly seen cases where
pulling up the stats would be a huge win, but it's hard to say whether
there are other cases where it would be worse than what we do now,
because nobody spends any time staring at the queries where the
existing system works great.  My gut feeling is that pulling up the
stats unchanged is likely to be better than punting, but my gut
feeling may not be worth much.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: limit in subquery causes poor selectivity estimation

From: Peter Eisentraut
On Sat, 2011-08-27 at 13:32 -0400, Tom Lane wrote:
> > EXPLAIN SELECT * FROM test1  WHERE sha1 in (SELECT sha1 FROM test2 LIMIT 200);
> 
> > Here, however, it has apparently not passed this knowledge through the
> > LIMIT.
> 
> The LIMIT prevents the subquery from being flattened entirely, ie we
> don't have just "test1 SEMI JOIN test2" but "test1 SEMI JOIN (SELECT *
> FROM test2 LIMIT 200)".  If you look at examine_variable in selfuncs.c
> you'll note that it punts for Vars coming from unflattened subqueries.
> 
> > So what's up with that?  Just a case of, we haven't thought about
> > covering this case yet, or are there larger problems?
> 
> The larger problem is that if a subquery didn't get flattened, it's
> often because it's got LIMIT, or GROUP BY, or some similar clause that
> makes it highly suspect whether the statistics available for the table
> column are reasonable to use for the subquery outputs.  It wouldn't be
> that hard to grab the stats for test2.sha1, but then how do you want
> to adjust them to reflect the LIMIT?

It turns out that this is a regression introduced in 8.4.8; the same
topic is also being discussed in

http://archives.postgresql.org/pgsql-performance/2011-08/msg00248.php

and

http://archives.postgresql.org/pgsql-general/2011-08/msg00995.php

This is the (previously posted) plan with 8.4.8:
                                   QUERY PLAN
----------------------------------------------------------------------------------
 Hash Join  (cost=10.60..34.35 rows=500 width=31)
   Hash Cond: (test1.sha1 = test2.sha1)
   ->  Seq Scan on test1  (cost=0.00..18.00 rows=1000 width=31)
   ->  Hash  (cost=8.10..8.10 rows=200 width=32)
         ->  HashAggregate  (cost=6.10..8.10 rows=200 width=32)
               ->  Limit  (cost=0.00..3.60 rows=200 width=21)
                     ->  Seq Scan on test2  (cost=0.00..18.01 rows=1001 width=21)

And this is the plan with 8.4.7:
                                   QUERY PLAN
----------------------------------------------------------------------------------
 Hash Join  (cost=10.80..34.55 rows=200 width=31)
   Hash Cond: (test1.sha1 = test2.sha1)
   ->  Seq Scan on test1  (cost=0.00..18.00 rows=1000 width=31)
   ->  Hash  (cost=8.30..8.30 rows=200 width=32)
         ->  HashAggregate  (cost=6.30..8.30 rows=200 width=32)
               ->  Limit  (cost=0.00..3.80 rows=200 width=21)
                     ->  Seq Scan on test2  (cost=0.00..19.01 rows=1001 width=21)

I liked the old one better. ;-)




Re: limit in subquery causes poor selectivity estimation

From: Robert Haas
On Wed, Aug 31, 2011 at 6:22 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On Sat, 2011-08-27 at 13:32 -0400, Tom Lane wrote:
>> > EXPLAIN SELECT * FROM test1  WHERE sha1 in (SELECT sha1 FROM test2 LIMIT 200);
>>
>> > Here, however, it has apparently not passed this knowledge through the
>> > LIMIT.
>>
>> The LIMIT prevents the subquery from being flattened entirely, ie we
>> don't have just "test1 SEMI JOIN test2" but "test1 SEMI JOIN (SELECT *
>> FROM test2 LIMIT 200)".  If you look at examine_variable in selfuncs.c
>> you'll note that it punts for Vars coming from unflattened subqueries.
>>
>> > So what's up with that?  Just a case of, we haven't thought about
>> > covering this case yet, or are there larger problems?
>>
>> The larger problem is that if a subquery didn't get flattened, it's
>> often because it's got LIMIT, or GROUP BY, or some similar clause that
>> makes it highly suspect whether the statistics available for the table
>> column are reasonable to use for the subquery outputs.  It wouldn't be
>> that hard to grab the stats for test2.sha1, but then how do you want
>> to adjust them to reflect the LIMIT?
>
> It turns out that this is a regression introduced in 8.4.8; the same
> topic is also being discussed in
>
> http://archives.postgresql.org/pgsql-performance/2011-08/msg00248.php
>
> and
>
> http://archives.postgresql.org/pgsql-general/2011-08/msg00995.php
>
> This is the (previously posted) plan with 8.4.8:
>
>                                    QUERY PLAN
> ----------------------------------------------------------------------------------
>  Hash Join  (cost=10.60..34.35 rows=500 width=31)
>   Hash Cond: (test1.sha1 = test2.sha1)
>   ->  Seq Scan on test1  (cost=0.00..18.00 rows=1000 width=31)
>   ->  Hash  (cost=8.10..8.10 rows=200 width=32)
>         ->  HashAggregate  (cost=6.10..8.10 rows=200 width=32)
>               ->  Limit  (cost=0.00..3.60 rows=200 width=21)
>                     ->  Seq Scan on test2  (cost=0.00..18.01 rows=1001 width=21)
>
> And this is the plan with 8.4.7:
>
>                                    QUERY PLAN
> ----------------------------------------------------------------------------------
>  Hash Join  (cost=10.80..34.55 rows=200 width=31)
>   Hash Cond: (test1.sha1 = test2.sha1)
>   ->  Seq Scan on test1  (cost=0.00..18.00 rows=1000 width=31)
>   ->  Hash  (cost=8.30..8.30 rows=200 width=32)
>         ->  HashAggregate  (cost=6.30..8.30 rows=200 width=32)
>               ->  Limit  (cost=0.00..3.80 rows=200 width=21)
>                     ->  Seq Scan on test2  (cost=0.00..19.01 rows=1001 width=21)
>
> I liked the old one better. ;-)

AFAICS, those plans are identical, except for a minor difference in
the cost of scanning test2.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: limit in subquery causes poor selectivity estimation

From: Tom Lane
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Aug 31, 2011 at 6:22 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
>> I liked the old one better. ;-)

> AFAICS, those plans are identical, except for a minor difference in
> the cost of scanning test2.

The point is that the estimate of the result size is worse in 8.4.8.

I am not, however, convinced that 8.4.7 was actually smarter ... it
may have been getting the right answer for the wrong reason.
        regards, tom lane


Re: limit in subquery causes poor selectivity estimation

From: Tom Lane
Peter Eisentraut <peter_e@gmx.net> writes:
> On Sat, 2011-08-27 at 13:32 -0400, Tom Lane wrote:
>> The larger problem is that if a subquery didn't get flattened, it's
>> often because it's got LIMIT, or GROUP BY, or some similar clause that
>> makes it highly suspect whether the statistics available for the table
>> column are reasonable to use for the subquery outputs.  It wouldn't be
>> that hard to grab the stats for test2.sha1, but then how do you want
>> to adjust them to reflect the LIMIT?

> It turns out that this is a regression introduced in 8.4.8;

Well, the fact that examine_variable punts on subqueries is certainly
not a "regression introduced in 8.4.8"; it's always been like that.

I think your observation that 8.4.8 is worse has to be blamed on
commit 0ae8b300388c2a3eaf90e6e6f13d6be1f4d4ac2d, which introduced a
fallback rule of assuming 0.5 selectivity for a semijoin if we didn't
have non-default ndistinct estimates on both sides.  Before that, 8.4.x
would go ahead and apply its heuristic rule, essentially Min(nd2/nd1, 1),
even when one or both ndistinct values were completely made-up.
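
(A back-of-the-envelope reading of the two plans above, assuming the
planner saw nd1 = 1000 for test1.sha1 and used the LIMIT's default
200-row estimate as nd2 for the subquery output:

    8.4.7 heuristic:  1000 * Min(nd2/nd1, 1) = 1000 * Min(200/1000, 1) = 200 rows
    8.4.8 fallback:   1000 * 0.5                                       = 500 rows

which matches the rows=200 and rows=500 estimates shown.)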

I'm not sure what we could do instead.  Perhaps you could argue that
we should just revert that commit on the grounds that it's doing more
harm than good, but I don't really believe that --- I think reverting
would just move the pain points around.  It's pure luck that 8.4.8
is worse rather than better on the particular example you cite.

On a longer-term basis, I'm looking into what we could do with
extracting stats from subqueries, but that doesn't seem like material
for a backpatch.  I have a draft patch that I've been playing with
(attached).  The main thing I feel unsure about is whether it's
reasonable to extract stats in this way from a subquery that has GROUP
BY or DISTINCT.  ISTM it's probably okay to ignore joining, sorting, or
limiting in the subquery: those might affect the stats of the subquery
output, but this is no worse than using the unmodified column statistics
for any other join-level selectivity estimate (where we already ignore
the effects of scan-level restriction clauses that will filter the
column values).  But GROUP BY or DISTINCT would entirely invalidate the
column frequency statistics, which makes me think that ignoring the
pg_statistic entry might be the thing to do.  Comments?
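
To make the worry concrete, a sketch (the skewed table is invented for
illustration, not part of the test case): after DISTINCT, every value
appears exactly once, so the MCV frequencies gathered by ANALYZE
describe the table but not the subquery output.

-- Hypothetical skewed column: 0 accounts for ~90% of the rows, so
-- ANALYZE records it as a very frequent MCV.
CREATE TABLE skewed AS
    SELECT CASE WHEN i % 10 = 0 THEN i ELSE 0 END AS v
    FROM generate_series(1, 10000) AS s(i);
ANALYZE skewed;

-- pg_statistic says 0 is very common ...
SELECT most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'skewed' AND attname = 'v';

-- ... but in this subquery's output every value occurs exactly once,
-- so those frequencies say nothing useful about it:
SELECT DISTINCT v FROM skewed;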

            regards, tom lane

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index ef29374fcccae23cb663c04470f12c22321a0e2c..a703f4a910c0e5f521f09cf9564a05c73cf803b8 100644
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
***************
*** 95,100 ****
--- 95,101 ----
  #include "access/sysattr.h"
  #include "catalog/index.h"
  #include "catalog/pg_collation.h"
+ #include "catalog/pg_inherits_fn.h"
  #include "catalog/pg_opfamily.h"
  #include "catalog/pg_statistic.h"
  #include "catalog/pg_type.h"
*************** static double convert_one_bytea_to_scala
*** 168,173 ****
--- 169,176 ----
                              int rangelo, int rangehi);
  static char *convert_string_datum(Datum value, Oid typid);
  static double convert_timevalue_to_scalar(Datum value, Oid typid);
+ static void examine_variable_recurse(Query *subquery, AttrNumber resno,
+                          VariableStatData *vardata);
  static bool get_variable_range(PlannerInfo *root, VariableStatData *vardata,
                     Oid sortop, Datum *min, Datum *max);
  static bool get_actual_variable_range(PlannerInfo *root,
*************** examine_variable(PlannerInfo *root, Node
*** 4176,4196 ****
          }
          else if (rte->rtekind == RTE_RELATION)
          {
              vardata->statsTuple = SearchSysCache3(STATRELATTINH,
                                                  ObjectIdGetDatum(rte->relid),
                                                  Int16GetDatum(var->varattno),
                                                    BoolGetDatum(rte->inh));
              vardata->freefunc = ReleaseSysCache;
          }
          else
          {
              /*
!              * XXX This means the Var comes from a JOIN or sub-SELECT. Later
!              * add code to dig down into the join etc and see if we can trace
!              * the variable to something with stats.  (But beware of
!              * sub-SELECTs with DISTINCT/GROUP BY/etc.    Perhaps there are no
!              * cases where this would really be useful, because we'd have
!              * flattened the subselect if it is??)
               */
          }

--- 4179,4205 ----
          }
          else if (rte->rtekind == RTE_RELATION)
          {
+             /* plain table, so look up the column in pg_statistic */
              vardata->statsTuple = SearchSysCache3(STATRELATTINH,
                                                  ObjectIdGetDatum(rte->relid),
                                                  Int16GetDatum(var->varattno),
                                                    BoolGetDatum(rte->inh));
              vardata->freefunc = ReleaseSysCache;
          }
+         else if (rte->rtekind == RTE_SUBQUERY)
+         {
+             /* recurse into subquery */
+             examine_variable_recurse(rte->subquery, var->varattno,
+                                      vardata);
+         }
          else
          {
              /*
!              * Otherwise, the Var comes from a FUNCTION, VALUES, or CTE RTE.
!              * (We won't see RTE_JOIN here because join alias Vars have
!              * already been flattened.)  There's not much we can do with
!              * function outputs, but maybe someday try to be smarter about
!              * VALUES and/or CTEs.
               */
          }

*************** examine_variable(PlannerInfo *root, Node
*** 4335,4340 ****
--- 4344,4435 ----
  }

  /*
+  * examine_variable_recurse
+  *        Handle a Var referring to a subquery result for examine_variable
+  *
+  * If the Var ultimately refers to a column of a table, we can fill in the
+  * statsTuple; none of the other fields of vardata need to be changed.
+  * (In particular, we don't try to set isunique, since even if the underlying
+  * column is unique, the subquery may have joined to other tables in a way
+  * that creates duplicates.)
+  *
+  * XXX For the moment, we ignore the possibility that the subquery may change
+  * the column's statistics so much that we'd be better off not consulting
+  * pg_statistic at all.  Possibly we should give up if the subquery uses
+  * GROUP BY or DISTINCT, for instance.
+  *
+  * The subquery we are looking at has not been through planner preprocessing,
+  * so some things have to be done differently here than in examine_variable.
+  * (Perhaps we should rearrange things so that the sub-root data structure
+  * is kept around after we plan a subquery?)
+  */
+ static void
+ examine_variable_recurse(Query *subquery, AttrNumber resno,
+                          VariableStatData *vardata)
+ {
+     TargetEntry *ste;
+     Var           *var;
+     RangeTblEntry *rte;
+
+     /* Get the subquery output expression referenced by the upper Var */
+     ste = get_tle_by_resno(subquery->targetList, resno);
+     if (ste == NULL || ste->resjunk)
+         elog(ERROR, "subquery does not have attribute %d", resno);
+     var = (Var *) ste->expr;
+
+ restart:
+     /* Can only handle a simple Var of subquery's query level */
+     if (var && IsA(var, Var) &&
+         var->varlevelsup == 0)
+     {
+         /* OK, fetch the RTE referenced by the Var */
+         rte = rt_fetch(var->varno, subquery->rtable);
+
+         /* XXX Can't call get_relation_stats_hook for lack of subroot */
+
+         if (rte->rtekind == RTE_RELATION)
+         {
+             /*
+              * We can't just rely on rte->inh here, because in parser output
+              * that only signifies the absence of ONLY.  We have to check
+              * relhassubclass as well.  For speed, we don't actually examine
+              * pg_inherits, so we might get fooled by a table that once had
+              * children but no longer does.  But that's okay, because ANALYZE
+              * will still produce inheritance-tree statistics in such cases,
+              * so we will still find an applicable pg_statistic entry.
+              */
+             bool    inh;
+
+             inh = rte->inh && has_subclass(rte->relid);
+             vardata->statsTuple = SearchSysCache3(STATRELATTINH,
+                                                   ObjectIdGetDatum(rte->relid),
+                                                   Int16GetDatum(var->varattno),
+                                                   BoolGetDatum(inh));
+             vardata->freefunc = ReleaseSysCache;
+         }
+         else if (rte->rtekind == RTE_SUBQUERY)
+         {
+             /* recurse some more ... */
+             examine_variable_recurse(rte->subquery, var->varattno,
+                                      vardata);
+         }
+         else if (rte->rtekind == RTE_JOIN)
+         {
+             /*
+              * Since join alias variables haven't been flattened in the
+              * subquery, we have to be prepared to dereference them.
+              */
+             if (var->varattno == InvalidAttrNumber)
+                 return;            /* punt on whole-row reference */
+             Assert(var->varattno > 0);
+             var = (Var *) list_nth(rte->joinaliasvars, var->varattno - 1);
+             goto restart;
+         }
+         /* As in examine_variable, do nothing for other RTE kinds */
+     }
+ }
+
+ /*
   * get_variable_numdistinct
   *      Estimate the number of distinct values of a variable.
   *

Re: limit in subquery causes poor selectivity estimation

From: Tom Lane
I wrote:
> On a longer-term basis, I'm looking into what we could do with
> extracting stats from subqueries, but that doesn't seem like material
> for a backpatch.  I have a draft patch that I've been playing with
> (attached).

I've committed a heavily rewritten version of that patch.  Git HEAD
seems to do reasonably well on the test case you gave at the start of
this thread.  I'm not sure yet how well the logic will hold up in
real-world queries as opposed to simplified test cases, but give it
a try.
        regards, tom lane


Re: limit in subquery causes poor selectivity estimation

From: Robert Haas
On Fri, Sep 2, 2011 at 12:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> column values).  But GROUP BY or DISTINCT would entirely invalidate the
> column frequency statistics, which makes me think that ignoring the
> pg_statistic entry might be the thing to do.  Comments?

There's a possible problem there in that you may have trouble getting
a good join selectivity estimate in cases like:

SELECT ... FROM foo LEFT JOIN (SELECT x, SUM(1) FROM bar GROUP BY 1) bar
  ON foo.x = bar.x

My guess is that in practice, the number of rows in foo that find a
join partner here is going to be much higher than what a stats-less
join selectivity estimation is likely to come up with.  You typically
don't write a query like this in the first place if you don't expect
to find matches, although I'm sure it's been done.  In some cases you
might even have a foreign key relationship to work with.
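
A self-contained version of that sketch (table contents invented; here
about a tenth of foo finds a partner, which a stats-less estimate could
easily miss):

CREATE TABLE foo (x int);
CREATE TABLE bar (x int);
INSERT INTO foo SELECT i FROM generate_series(1, 1000) AS s(i);
INSERT INTO bar SELECT i % 100 FROM generate_series(1, 1000) AS s(i);
ANALYZE;

EXPLAIN SELECT *
FROM foo LEFT JOIN (SELECT x, SUM(1) AS n FROM bar GROUP BY 1) bar
    ON foo.x = bar.x;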

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company