Обсуждение: Suppressing unused subquery output columns

Поиск
Список
Период
Сортировка

Suppressing unused subquery output columns

От
Tom Lane
Дата:
A question in pgsql-general made me reflect about how the planner isn't
smart about unreferenced output columns of subqueries that it's not able
to flatten into the parent query.  Here's an example:

regression=# create table t1 (f1 int);
CREATE TABLE
regression=# create table t2 (f2 int primary key, f3 int);
CREATE TABLE
regression=# explain select f1 from (select * from t1 left join t2 on f1=f2) ss;
                      QUERY PLAN
------------------------------------------------------
 Seq Scan on t1  (cost=0.00..34.00 rows=2400 width=4)
 Planning time: 0.691 ms
(2 rows)

So far so good; we were able to apply join removal to t2, because the
query doesn't require outputting f2 or f3.  But if the subquery can't be
flattened, it stops working:

regression=# explain select f1 from (select * from t1 left join t2 on f1=f2 limit 1) ss;
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Subquery Scan on ss  (cost=0.15..0.38 rows=1 width=4)
   ->  Limit  (cost=0.15..0.37 rows=1 width=12)
         ->  Nested Loop Left Join  (cost=0.15..516.00 rows=2400 width=12)
               ->  Seq Scan on t1  (cost=0.00..34.00 rows=2400 width=4)
               ->  Index Scan using t2_pkey on t2  (cost=0.15..0.19 rows=1 width=8)
                     Index Cond: (t1.f1 = f2)
(6 rows)

This is because while planning the separate subquery, the planner sees
"select *" and doesn't realize that f2 and f3 aren't really needed.

The attached draft patch fixes this by deleting unused output expressions
from unflattened subqueries, so that we get:

regression=# explain select f1 from (select * from t1 left join t2 on f1=f2 limit 1) ss;
                            QUERY PLAN
------------------------------------------------------------------
 Subquery Scan on ss  (cost=0.00..0.02 rows=1 width=4)
   ->  Limit  (cost=0.00..0.01 rows=1 width=4)
         ->  Seq Scan on t1  (cost=0.00..34.00 rows=2400 width=4)
 Planning time: 0.193 ms
(4 rows)

I'm not entirely convinced that it's worth the extra planning cycles,
though.  Given the small number of complaints to date, it might not
be worth doing this.  Thoughts?

            regards, tom lane

PS: to be clear, I'm not thinking of applying this till 9.5 opens,
in any case.

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 41eaa26..ec0bf3e 100644
*** a/src/backend/optimizer/path/allpaths.c
--- b/src/backend/optimizer/path/allpaths.c
***************
*** 17,25 ****
--- 17,28 ----

  #include <math.h>

+ #include "access/sysattr.h"
  #include "catalog/pg_class.h"
  #include "catalog/pg_operator.h"
+ #include "catalog/pg_type.h"
  #include "foreign/fdwapi.h"
+ #include "nodes/makefuncs.h"
  #include "nodes/nodeFuncs.h"
  #ifdef OPTIMIZER_DEBUG
  #include "nodes/print.h"
*************** static void subquery_push_qual(Query *su
*** 98,103 ****
--- 101,107 ----
                     RangeTblEntry *rte, Index rti, Node *qual);
  static void recurse_push_qual(Node *setOp, Query *topquery,
                    RangeTblEntry *rte, Index rti, Node *qual);
+ static void remove_unused_subquery_outputs(Query *subquery, RelOptInfo *rel);


  /*
*************** set_subquery_pathlist(PlannerInfo *root,
*** 1124,1130 ****
      /*
       * Must copy the Query so that planning doesn't mess up the RTE contents
       * (really really need to fix the planner to not scribble on its input,
!      * someday).
       */
      subquery = copyObject(subquery);

--- 1128,1134 ----
      /*
       * Must copy the Query so that planning doesn't mess up the RTE contents
       * (really really need to fix the planner to not scribble on its input,
!      * someday ... but see remove_unused_subquery_outputs to start with).
       */
      subquery = copyObject(subquery);

*************** set_subquery_pathlist(PlannerInfo *root,
*** 1199,1204 ****
--- 1203,1214 ----
      pfree(unsafeColumns);

      /*
+      * The upper query might not use all the subquery's output columns; if
+      * not, we can simplify.
+      */
+     remove_unused_subquery_outputs(subquery, rel);
+
+     /*
       * We can safely pass the outer tuple_fraction down to the subquery if the
       * outer level has no joining, aggregation, or sorting to do. Otherwise
       * we'd better tell the subquery to plan for full retrieval. (XXX This
*************** recurse_push_qual(Node *setOp, Query *to
*** 2033,2038 ****
--- 2043,2171 ----
  }

  /*****************************************************************************
+  *            SIMPLIFYING SUBQUERY TARGETLISTS
+  *****************************************************************************/
+
+ /*
+  * remove_unused_subquery_outputs
+  *        Remove subquery targetlist items we don't need
+  *
+  * It's possible, even likely, that the upper query does not read all the
+  * output columns of the subquery.  We can remove any such outputs that are
+  * not needed by the subquery itself (e.g., as sort/group columns) and do not
+  * affect semantics otherwise (e.g., volatile functions can't be removed).
+  * This is useful not only because we might be able to remove expensive-to-
+  * compute expressions, but because deletion of output columns might allow
+  * optimizations such as join removal to occur within the subquery.
+  *
+  * To avoid affecting column numbering in the targetlist, we don't physically
+  * remove unused tlist entries, but rather replace their expressions with NULL
+  * constants.  This is implemented by modifying subquery->targetList.
+  */
+ static void
+ remove_unused_subquery_outputs(Query *subquery, RelOptInfo *rel)
+ {
+     Bitmapset  *attrs_used = NULL;
+     ListCell   *lc;
+
+     /*
+      * Do nothing if subquery has UNION/INTERSECT/EXCEPT: in principle we
+      * could update all the child SELECTs' tlists, but it seems not worth the
+      * trouble presently.
+      */
+     if (subquery->setOperations)
+         return;
+
+     /*
+      * If subquery has regular DISTINCT (not DISTINCT ON), we're wasting our
+      * time: all its output columns must be used in the distinctClause.
+      */
+     if (subquery->distinctClause && !subquery->hasDistinctOn)
+         return;
+
+     /*
+      * Collect a bitmap of all the output column numbers used by the upper
+      * query.
+      *
+      * Add all the attributes needed for joins or final output.  Note: we must
+      * look at reltargetlist, not the attr_needed data, because attr_needed
+      * isn't computed for inheritance child rels, cf set_append_rel_size().
+      * (XXX might be worth changing that sometime.)
+      */
+     pull_varattnos((Node *) rel->reltargetlist, rel->relid, &attrs_used);
+
+     /* Add all the attributes used by un-pushed-down restriction clauses. */
+     foreach(lc, rel->baserestrictinfo)
+     {
+         RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+
+         pull_varattnos((Node *) rinfo->clause, rel->relid, &attrs_used);
+     }
+
+     /*
+      * If there's a whole-row reference to the subquery, we can't remove
+      * anything.
+      */
+     if (bms_is_member(0 - FirstLowInvalidHeapAttributeNumber, attrs_used))
+         return;
+
+     /*
+      * Run through the tlist and zap entries we don't need.  It's okay to
+      * modify the tlist items in-place because set_subquery_pathlist made a
+      * copy of the subquery.
+      */
+     foreach(lc, subquery->targetList)
+     {
+         TargetEntry *tle = (TargetEntry *) lfirst(lc);
+
+         /*
+          * If it has a sortgroupref number, it's used in some sort/group
+          * clause so we'd better not remove it.  Also, don't remove any
+          * resjunk columns, since their reason for being has nothing to do
+          * with anybody reading the subquery's output.  (It's likely that
+          * resjunk columns in a sub-SELECT would always have ressortgroupref
+          * set, but even if they don't, it seems imprudent to remove them.)
+          */
+         if (tle->ressortgroupref || tle->resjunk)
+             continue;
+
+         /*
+          * If it's used by the upper query, we can't remove it.
+          */
+         if (bms_is_member(tle->resno - FirstLowInvalidHeapAttributeNumber,
+                           attrs_used))
+             continue;
+
+         /*
+          * If it contains a set-returning function, we can't remove it since
+          * that could change the number of rows returned by the subquery.
+          */
+         if (expression_returns_set((Node *) tle->expr))
+             continue;
+
+         /*
+          * If it contains volatile functions, we daren't remove it for fear
+          * that the user is expecting their side-effects to happen.
+          */
+         if (contain_volatile_functions((Node *) tle->expr))
+             continue;
+
+         /*
+          * OK, we don't need it.  Replace the expression with a NULL constant.
+          * We can just make the constant be of INT4 type, since nothing's
+          * going to look at it anyway.
+          */
+         tle->expr = (Expr *) makeConst(INT4OID,
+                                        -1,
+                                        InvalidOid,
+                                        sizeof(int32),
+                                        (Datum) 0,
+                                        true,    /* isnull */
+                                        true /* byval */ );
+     }
+ }
+
+ /*****************************************************************************
   *            DEBUG SUPPORT
   *****************************************************************************/


Re: Suppressing unused subquery output columns

От
Rod Taylor
Дата:
On Thu, Jun 5, 2014 at 10:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I'm not entirely convinced that it's worth the extra planning cycles,
though.  Given the small number of complaints to date, it might not
be worth doing this.  Thoughts?


Would this avoid execution of expensive functions in views when their output is discarded?

-- On 9.3
CREATE TABLE data (col1 serial primary key);
INSERT INTO data DEFAULT VALUES;
INSERT INTO data DEFAULT VALUES;

CREATE OR REPLACE VIEW v AS select *, (pg_sleep(1))::text FROM data;

t=# explain analyze select col1 from v;
                                                  QUERY PLAN                                                 
--------------------------------------------------------------------------------------------------------------
 Subquery Scan on v  (cost=0.00..76.00 rows=2400 width=4) (actual time=1001.086..2002.217 rows=2 loops=1)
   ->  Seq Scan on data  (cost=0.00..52.00 rows=2400 width=4) (actual time=1001.083..2002.210 rows=2 loops=1)
 Total runtime: 2002.268 ms
(3 rows)


regards,

Rod

Re: Suppressing unused subquery output columns

От
Tom Lane
Дата:
Rod Taylor <rod.taylor@gmail.com> writes:
> On Thu, Jun 5, 2014 at 10:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'm not entirely convinced that it's worth the extra planning cycles,
>> though.  Given the small number of complaints to date, it might not
>> be worth doing this.  Thoughts?

> Would this avoid execution of expensive functions in views when their
> output is discarded?

Yes, as long as they're not marked volatile and don't return sets.
        regards, tom lane



Re: Suppressing unused subquery output columns

От
Jim Nasby
Дата:
On 6/5/14, 9:54 PM, Tom Lane wrote:
> Rod Taylor <rod.taylor@gmail.com> writes:
>> On Thu, Jun 5, 2014 at 10:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I'm not entirely convinced that it's worth the extra planning cycles,
>>> though.  Given the small number of complaints to date, it might not
>>> be worth doing this.  Thoughts?
>
>> Would this avoid execution of expensive functions in views when their
>> output is discarded?
>
> Yes, as long as they're not marked volatile and don't return sets.

That would certainly make it useful for us.
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net



Re: Suppressing unused subquery output columns

От
David Rowley
Дата:
On Fri, Jun 6, 2014 at 2:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I'm not entirely convinced that it's worth the extra planning cycles,
though.  Given the small number of complaints to date, it might not
be worth doing this.  Thoughts?


That's a difficult question for sure. Obviously it's going to depend on the query that we're planning. If it's a very simple query and we don't reduce the target list of the subquery any, then we'll get a small performance impact... But if we do manage to prune down the targetlist and later allow a left join on a 1 billion row table to be removed, then the cost of the extra few cycles performed by this patch would be totally unmeasurable and completely insignificant.

At work we use <insert name of popular commercial relational database product here>, and on a daily basis I come across poorly written queries by other developers.

The things I can think of off hand are:

1. All primary key fields of a table are in a DISTINCT clause
2. Someone had written a query like: SELECT some,1 as type FROM table UNION SELECT thing,2 as type from othertable

The database in question manage to remove the DISTINCT on 1 because it's just not needed as each group could only ever have 1 row. On 2 it managed to see that because column 2 of the UNION query was a constant and it was a different constant on each side of the UNION, then the query would not produce duplicates and it changed this to UNION ALL.

At home I checked how PostgreSQL handled both of these cases, and saw that it failed to see through the non-sense in the poorly written queries on both accounts. This is fine, as anyone who ever posted on the performance list saying, "why is this slow?" We'd tell them to write their query another way. But what about all the companies who consider porting their application over to PostgreSQL and get to the stage of importing all the data over onto a test server and trying a few of their (poorly written) queries. And they see a 10 times slowdown and immediately think PostgreSQL is just not for them. It's a shame that they'd turn us down so early, but if we went ahead and added code to detect both of these situations then we'd end up increasing planning time for everyone else who writes their queries properly.

I don't really have an answer for this, but I do think it needs more discussion. The only things I've thought of so far as a planner_strength GNU that can try to optimise these things when it gets to a certain level, but that just introduces surprise factor and a whole bunch of threads on performance saying... Why is this query on pg10.4 slower than pg10.2? and our reply goes, is planner_strength set to the same on both? It does not sound pretty, or another option to replan a query when the cost is over a certain threshold with the strength level turned to maximum, but that seems like it could go along the same lines as far as surprise factor is concerned.

It's pretty hard to get the best of both worlds here. I know my post does not have many answers, but I posted it anyway just in case someone else comes up with a good idea that perhaps we could all work towards.

Regards

David Rowley


Re: Suppressing unused subquery output columns

От
Robert Haas
Дата:
On Thu, Jun 5, 2014 at 10:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The attached draft patch fixes this by deleting unused output expressions
> from unflattened subqueries, so that we get:
>
> regression=# explain select f1 from (select * from t1 left join t2 on f1=f2 limit 1) ss;
>                             QUERY PLAN
> ------------------------------------------------------------------
>  Subquery Scan on ss  (cost=0.00..0.02 rows=1 width=4)
>    ->  Limit  (cost=0.00..0.01 rows=1 width=4)
>          ->  Seq Scan on t1  (cost=0.00..34.00 rows=2400 width=4)
>  Planning time: 0.193 ms
> (4 rows)
>
> I'm not entirely convinced that it's worth the extra planning cycles,
> though.  Given the small number of complaints to date, it might not
> be worth doing this.  Thoughts?

I've encountered this issue before and think it would be great to fix
it.  I'm not sure precisely how many cycles it's worth paying, but
definitely more than zero.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Suppressing unused subquery output columns

От
Tom Lane
Дата:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Jun 5, 2014 at 10:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The attached draft patch fixes this by deleting unused output expressions
>> from unflattened subqueries, so that we get:
>> ...
>> I'm not entirely convinced that it's worth the extra planning cycles,
>> though.  Given the small number of complaints to date, it might not
>> be worth doing this.  Thoughts?

> I've encountered this issue before and think it would be great to fix
> it.  I'm not sure precisely how many cycles it's worth paying, but
> definitely more than zero.

We have a couple votes for this patch and no one has spoken against it,
so I'll go ahead and push it into HEAD.
        regards, tom lane



refactoring allpaths.c (was Re: Suppressing unused subquery output columns)

От
Tom Lane
Дата:
I wrote:
> We have a couple votes for this patch and no one has spoken against it,
> so I'll go ahead and push it into HEAD.

BTW, I forgot to mention that while working on this patch I was thinking
it's past time to separate out the subquery support in allpaths.c into
its own file.  With this patch, allpaths.c is 2363 lines, about 690 of
which are set_subquery_pathlist and subsidiary functions.  While that
might not be quite tail-wagging-dog level, I think it's enough to justify
splitting that code out into a new file, say optimizer/path/subquerypath.c.

There are also about 630 lines devoted to appendrel path generation,
which might be thought enough to deserve a separate file of its own.
I'm a bit less excited about that though, mainly because the appendrel
logic has some not-quite-arms-length interactions with set_rel_size();
but there's certainly some case to be made for doing it.

Thoughts?
        regards, tom lane



Re: refactoring allpaths.c (was Re: Suppressing unused subquery output columns)

От
Etsuro Fujita
Дата:
(2014/06/13 1:37), Tom Lane wrote:
> I wrote:
>> We have a couple votes for this patch and no one has spoken against it,
>> so I'll go ahead and push it into HEAD.
> 
> BTW, I forgot to mention that while working on this patch I was thinking
> it's past time to separate out the subquery support in allpaths.c into
> its own file.  With this patch, allpaths.c is 2363 lines, about 690 of
> which are set_subquery_pathlist and subsidiary functions.  While that
> might not be quite tail-wagging-dog level, I think it's enough to justify
> splitting that code out into a new file, say optimizer/path/subquerypath.c.

+1

Sorry for the late reply.

Best regards,
Etsuro Fujita