Thread: [PATCH] Equivalence Class Filters


[PATCH] Equivalence Class Filters

From
David Rowley
Date:
Hi,

I'd like to gauge what community interest there might be in this patch. Let me explain:

As of today, Equivalence Classes are used in the query planner to give the planner more flexibility in choosing a join order, by collecting information about which expressions are equal to each other. When an Equivalence Class also contains a constant value, it allows predicate push-down. For example:

# explain select * from t1 inner join t2 on t1.id = t2.id where t1.id=1;
                                 QUERY PLAN
-----------------------------------------------------------------------------
 Nested Loop  (cost=0.56..12.61 rows=1 width=12)
   ->  Index Scan using t1_pkey on t1  (cost=0.29..8.30 rows=1 width=8)
         Index Cond: (id = 1)
   ->  Index Only Scan using t2_pkey on t2  (cost=0.28..4.29 rows=1 width=4)
         Index Cond: (id = 1)
(5 rows) 

We can see that a qual was added to filter t2.id=1.

As of today these Equivalence Classes only incorporate expressions which use the equality operator, but what if the above query had been written as:

select * from t1 inner join t2 on t1.id = t2.id where t1.id <= 10;

Should we not be able to assume that t2.id is also <= 10? Currently we don't make that deduction, but the attached patch extends Equivalence Classes with what I've called Equivalence Filters.

This allows the above query to produce a plan like:

                                  QUERY PLAN
------------------------------------------------------------------------------
 Merge Join  (cost=0.56..5.68 rows=1 width=12)
   Merge Cond: (t1.id = t2.id)
   ->  Index Scan using t1_pkey on t1  (cost=0.29..8.44 rows=9 width=8)
         Index Cond: (id <= 10)
   ->  Index Only Scan using t2_pkey on t2  (cost=0.28..4.45 rows=10 width=4)
         Index Cond: (id <= 10)

Now, it may not be that common to perform range filters on an id column, but it is fairly common for date columns to appear in join conditions and to have date range filters too. For example, in [1] Alexandre reported an 1100x to 2200x performance improvement after manually pushing the date filter into the subquery. This patch makes that happen automatically.

This all works by collecting, while the EquivalenceClasses are being built, the OpExprs that were rejected for an eqclass because they don't use an equality operator. OpExprs of the form "Expr op Const" and "Const op Expr" are collected, and once the EquivalenceClasses have been built, each "Expr" is searched for among the members of the built classes; if it is found, the qual becomes an EquivalenceFilter and gets tagged onto that EquivalenceClass.

Patch Status - I'm a bit uncertain as to how far we can go with this, and if we deem this a good idea, then we'd need to be careful not to cause any performance regression. For example, if the filter was "somestaticfunc(col) < 1234", then we might not want to push that down, as somestaticfunc() might be so costly that it would be better to perform the join with all the rows instead.  For this reason I've limited the current patch to only using operators which are members of a btree op class. Perhaps we could do more, but there's likely less of a win to be had, due to less chance of that qual being useful for an index scan.
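To make the mechanics a little more concrete, here's a rough sketch in C of the attachment step. This is illustrative only, not the actual patch code; in particular, ec_filters and make_equivalence_filter are hypothetical names:

/*
 * Sketch only: ec_filters and make_equivalence_filter are hypothetical.
 * Called for each "Expr op Const" qual that was rejected for an eqclass
 * because its operator is not an equality operator.
 */
static void
attach_equivalence_filter(PlannerInfo *root, Expr *expr,
                          Oid opno, Const *value)
{
    ListCell   *lc;

    /* Only consider operators belonging to a btree op class (<, <=, >, >=). */
    if (get_op_btree_interpretation(opno) == NIL)
        return;

    foreach(lc, root->eq_classes)
    {
        EquivalenceClass *ec = (EquivalenceClass *) lfirst(lc);
        ListCell   *lc2;

        foreach(lc2, ec->ec_members)
        {
            EquivalenceMember *em = (EquivalenceMember *) lfirst(lc2);

            /* Is the filter's Expr a member of this class? */
            if (equal(em->em_expr, expr))
            {
                /* Tag the filter onto the class. */
                ec->ec_filters = lappend(ec->ec_filters,
                                         make_equivalence_filter(expr, opno, value));
                return;
            }
        }
    }
}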

Writing this patch has been on my "interesting" list for a while now. I got some time last weekend to finally write it, but since that happened at the weekend rather than in work time, it's a low priority for me too. However, there's no sense in it gathering dust, so I'm posting it today.

Is this something that we might want?



--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Attachments

Re: [PATCH] Equivalence Class Filters

From
Tom Lane
Date:
David Rowley <david.rowley@2ndquadrant.com> writes:
> As of today these Equivalence Classes only incorporate expressions which
> use the equality operator, but what if the above query had been written as:

> select * from t1 inner join t2 on t1.id = t2.id where t1.id <= 10;

> Should we not be able to assume that t2.id is also <= 10?

This sort of thing has been suggested multiple times before, and I've
rejected it every time on the grounds that it would likely add far more
planner cycles than it'd be worth, eg, time would be spent on searches for
matching subexpressions whether or not anything was learned (and often
nothing would be learned).  While I've only read your description of the
patch not the patch itself, the search methodology you propose seems
pretty brute-force and unlikely to solve that issue.  It's particularly
important to avoid O(N^2) behaviors when there are N expressions ...

Another issue that would need consideration is how to avoid skewing
planner selectivity estimates with derived clauses that are fully
redundant with other clauses.  The existing EC machinery is mostly
able to dodge that problem by generating just a minimal set of equality
clauses from an EC, but I don't see how we'd make that work here.

I'm also wondering why you want to limit it to comparisons to constants;
that seems rather arbitrary.

Lastly, in most cases knowing that t2.id <= 10 is just not worth all
that much; it's certainly far less useful than an equality condition.
It's not difficult to imagine that this would often be a net drag on
runtime performance (never mind planner performance) by doing nothing
except creating additional filter conditions the executor has to check.
Certainly it would be valuable to know this if it let us exclude some
partition of a table, but that's only going to happen in a small minority
of queries.

I'm not necessarily opposed to doing anything in this area, but so far
I've not seen how to do it in a way that is attractive when planner
complexity, performance, and end results are all considered.
        regards, tom lane



Re: [PATCH] Equivalence Class Filters

From
David Rowley
Date:
On 6 December 2015 at 06:07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
David Rowley <david.rowley@2ndquadrant.com> writes:
> As of today these Equivalence Classes only incorporate expressions which
> use the equality operator, but what if the above query had been written as:

> select * from t1 inner join t2 on t1.id = t2.id where t1.id <= 10;

> Should we not be able to assume that t2.id is also <= 10?
 
This sort of thing has been suggested multiple times before, and I've
rejected it every time on the grounds that it would likely add far more
planner cycles than it'd be worth, eg, time would be spent on searches for
matching subexpressions whether or not anything was learned (and often
nothing would be learned).

I guess if it keeps coming up then people must be hitting this limitation. Perhaps continually rejecting it based on your original opinion is not the best way to move forward with improving the query planner.
 
  While I've only read your description of the
patch not the patch itself, the search methodology you propose seems
pretty brute-force and unlikely to solve that issue.  It's particularly
important to avoid O(N^2) behaviors when there are N expressions ...



Likely it would be possible to do something to improve on that.
Perhaps if the list of filter clauses grows beyond a certain number, a hash table could be built on the ec_members list, with em_expr as the key and the EquivalenceClass as the data. Although likely we don't currently have anything available for generating hash values for Expr nodes. On checking, it seems that 32 is the estimated tipping point for hashing in find_join_rel(); perhaps something similar would suit this.
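To illustrate the sticking point, here's one hypothetical way to get a hash key for an Expr; PostgreSQL has no ready-made hash function for expression nodes, so this sketch cheats by hashing the nodeToString() serialisation. Note the serialisation includes fields, such as token locations, that equal() ignores, so it's only a starting point:

/*
 * Hypothetical sketch: derive a hash key for an Expr by hashing its
 * string serialisation.  Beware: this distinguishes nodes that equal()
 * would treat as the same (e.g. ones differing only in parse location).
 */
static uint32
expr_hash_key(Expr *expr)
{
    char   *str = nodeToString(expr);
    uint32  result;

    result = DatumGetUInt32(hash_any((const unsigned char *) str,
                                     strlen(str)));
    pfree(str);
    return result;
}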
 
Another issue that would need consideration is how to avoid skewing
planner selectivity estimates with derived clauses that are fully
redundant with other clauses.  The existing EC machinery is mostly
able to dodge that problem by generating just a minimal set of equality
clauses from an EC, but I don't see how we'd make that work here.


Could you give me an example of where this is a problem? I've tried fooling the planner into giving me bad estimates, but I've not managed to.

# explain analyze select * from t1 where t1.id < 10 and t1.id < 100 and t1.id < 1000;
                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Index Scan using t1_pkey on t1  (cost=0.29..8.49 rows=9 width=8) (actual time=0.155..0.160 rows=9 loops=1)
   Index Cond: ((id < 10) AND (id < 100) AND (id < 1000))

I'd say the t1.id < 100 AND t1.id < 1000 are fully redundant here. The estimate given by the planner is bang-on.

Perhaps you're talking about another column making the pushed-down qual redundant? If so, is the current eclass code not prone to the exact same problem?

I'm also wondering why you want to limit it to comparisons to constants;
that seems rather arbitrary.


Do you have other useful cases in mind? 
The other one I considered was "expr op expr" clauses, where both those exprs were in eclasses. For example:

select * from t1 inner join t2 on t1.col1=t2.col1 and t1.col2=t2.col2 where t1.col1 < t1.col2;

I'd imagine that would have a narrower use case. I simply stopped with "Expr op Const" because I thought that would be more common and would incur less planner overhead to detect. Given that below you also raise concerns that "expr op const" is likely not useful in enough cases, I'm not holding out much hope for adding more complexity to the detection mechanism for less common wins.
 
Lastly, in most cases knowing that t2.id <= 10 is just not worth all
that much; it's certainly far less useful than an equality condition.
It's not difficult to imagine that this would often be a net drag on
runtime performance (never mind planner performance) by doing nothing
except creating additional filter conditions the executor has to check.
Certainly it would be valuable to know this if it let us exclude some
partition of a table, but that's only going to happen in a small minority
of queries.


I'd have thought that my link to a thread reporting an 1100x to 2200x performance improvement might have made you think twice about claiming the uselessness of this idea.

I'm also quite surprised at you mentioning partitions here. Personally I'd be thinking of indexes long before thinking of partitions. I don't think I need to remind you that btree indexes handle >= and <= just fine. It's not all that hard to imagine that if someone is using that column in a join condition and is serious about performance, they just might also have an index defined on that column. I'd imagine the only case we might have to worry about is when the selectivity of the pushed qual gets close to 1. Perhaps the RestrictInfos could be marked as "optional" somehow, and we could simply remove them when their selectivity estimates are too high.

I'm not necessarily opposed to doing anything in this area, but so far
I've not seen how to do it in a way that is attractive when planner
complexity, performance, and end results are all considered.

Glad to hear it! Otherwise it would be a real shame to miss out on such great wins during execution time.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [PATCH] Equivalence Class Filters

From
Tom Lane
Date:
David Rowley <david.rowley@2ndquadrant.com> writes:
> On 6 December 2015 at 06:07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Another issue that would need consideration is how to avoid skewing
>> planner selectivity estimates with derived clauses that are fully
>> redundant with other clauses.

> Could you give me an example of where this is a problem?

Using the regression database, try

explain analyze select * from tenk1 a, tenk1 b where a.thousand = b.thousand;

explain analyze select * from tenk1 a, tenk1 b where a.thousand = b.thousand and a.thousand < 100;

explain analyze select * from tenk1 a, tenk1 b where a.thousand = b.thousand and a.thousand < 100 and b.thousand < 100;

The first two give pretty accurate estimates for the join size, the third
not so much, because it thinks b.thousand < 100 is an independent
constraint.
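(To spell out the arithmetic: tenk1 has 10000 rows and "thousand" holds 1000 distinct values, so the equijoin alone is estimated at 10000 * 10000 / 1000 = 100,000 rows. Adding a.thousand < 100 multiplies that by a selectivity of roughly 0.1, giving about 10,000, which matches the actual result. Adding the redundant b.thousand < 100 multiplies by roughly 0.1 again, producing an estimate near 1,000 even though the true row count is still about 10,000: a 10x underestimate.)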

> I've tried fooling
> the planner into giving me bad estimates, but I've not managed to.
> # explain analyze select * from t1 where t1.id < 10 and t1.id < 100 and
> t1.id < 1000;

That particular case works because clausesel.c's AddRangeClause figures
out that the similar inequalities are redundant and drops all but one,
on its way to (not) identifying a range constraint for t1.id.  There's
nothing comparable for constraints on different variables, though,
especially not constraints on Vars of different relations, which will
never even appear in the same restriction list.

> if so, is the current eclass code not prone to the exact same problem?

No, and I already explained why not.

>> Lastly, in most cases knowing that t2.id <= 10 is just not worth all
>> that much; it's certainly far less useful than an equality condition.

> I'd have thought that my link to a thread of a reported 1100 to 2200 times
> performance improvement might have made you think twice about claiming the
> uselessness of this idea.

I said "in most cases".  You can find example cases to support almost any
weird planner optimization no matter how expensive and single-purpose;
but that is the wrong way to think about it.  What you have to think about
is average cases, and in particular, not putting a drag on planning time
in cases where no benefit ensues.  We're not committing any patches that
give one uncommon case an 1100X speedup by penalizing every other query 10%,
or even 1%; especially not when there may be other ways to fix it.

The EC machinery wasn't designed in a vacuum.  It is interested in
deriving equality conditions because those are useful for merge and hash
joins.  It's also tightly tied in with the planner's understanding of sort
ordering and distinct-ness.  None of those considerations provide any
reason to build a similar thing for inequalities.

It may be that computing derived inequalities is something that's worth
adding, but we're going to have to take a very hard look at the costs and
benefits; it's not a slam dunk that such a patch would get committed.
        regards, tom lane



Re: [PATCH] Equivalence Class Filters

From
Jim Nasby
Date:
On 12/6/15 10:38 AM, Tom Lane wrote:
> I said "in most cases".  You can find example cases to support almost any
> weird planner optimization no matter how expensive and single-purpose;
> but that is the wrong way to think about it.  What you have to think about
> is average cases, and in particular, not putting a drag on planning time
> in cases where no benefit ensues.  We're not committing any patches that
> give one uncommon case an 1100X speedup by penalizing every other query 10%,
> or even 1%; especially not when there may be other ways to fix it.

This is a problem that seriously hurts Postgres in data warehousing 
applications. We can't keep ignoring optimizations that provide even as 
little as 10% execution improvements for 10x worse planner performance, 
because in a warehouse it's next to impossible for planning time to matter.

Obviously it'd be great if there was a fast, easy way to figure out 
whether a query would be expensive enough to go the whole 9 yards on 
planning it but at this point I suspect a simple GUC would be a big 
improvement.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: [PATCH] Equivalence Class Filters

From
Tom Lane
Date:
Jim Nasby <Jim.Nasby@BlueTreble.com> writes:
> On 12/6/15 10:38 AM, Tom Lane wrote:
>> I said "in most cases".  You can find example cases to support almost any
>> weird planner optimization no matter how expensive and single-purpose;
>> but that is the wrong way to think about it.  What you have to think about
>> is average cases, and in particular, not putting a drag on planning time
>> in cases where no benefit ensues.  We're not committing any patches that
>> give one uncommon case an 1100X speedup by penalizing every other query 10%,
>> or even 1%; especially not when there may be other ways to fix it.

> This is a problem that seriously hurts Postgres in data warehousing 
> applications.

Please provide some specific examples.  I remain skeptical that this
would make a useful difference all that often in the real world ...
and handwaving like that does nothing to change my opinion.  What do
the queries look like, and why would deducing an extra inequality
condition help them?
        regards, tom lane



Re: [PATCH] Equivalence Class Filters

From
"David G. Johnston"
Date:
On Mon, Dec 7, 2015 at 8:35 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
On 12/6/15 10:38 AM, Tom Lane wrote:
I said "in most cases".  You can find example cases to support almost any
weird planner optimization no matter how expensive and single-purpose;
but that is the wrong way to think about it.  What you have to think about
is average cases, and in particular, not putting a drag on planning time
in cases where no benefit ensues.  We're not committing any patches that
give one uncommon case an 1100X speedup by penalizing every other query 10%,
or even 1%; especially not when there may be other ways to fix it.

This is a problem that seriously hurts Postgres in data warehousing applications. We can't keep ignoring optimizations that provide even as little as 10% execution improvements for 10x worse planner performance, because in a warehouse it's next to impossible for planning time to matter.

Obviously it'd be great if there was a fast, easy way to figure out whether a query would be expensive enough to go the whole 9 yards on planning it but at this point I suspect a simple GUC would be a big improvement.

Something like "enable_equivalencefilters", but one that defaults to false, unlike every existing "enable_*" GUC?

It would be a lot more user-friendly to have something along the lines of "planner_mode (text)" with labels like "star, transactional, bulk_load, etc..." because I suspect there are other things we'd want to add if we start identifying queries by their type/usage and optimize accordingly.  Having the knobs available is necessary, but putting a façade on them would make things more straightforward for the user in the common cases.

David J.

Re: [PATCH] Equivalence Class Filters

From
Simon Riggs
Date:
On 6 December 2015 at 16:38, Tom Lane <tgl@sss.pgh.pa.us> wrote:

>> Lastly, in most cases knowing that t2.id <= 10 is just not worth all
>> that much; it's certainly far less useful than an equality condition.

> I'd have thought that my link to a thread of a reported 1100 to 2200 times
> performance improvement might have made you think twice about claiming the
> uselessness of this idea.

Personally, I think this optimization is a long way down the list of important items because I don't see it as a typical use case. There are already patches submitted for more important items, so this isn't the right place to be arguing such things. Not every beneficial optimization has a wide use case. 

Since the discussion has become more general, I would add a few points.

I said "in most cases".  You can find example cases to support almost any
weird planner optimization no matter how expensive and single-purpose;
but that is the wrong way to think about it.  What you have to think about
is average cases, and in particular, not putting a drag on planning time
in cases where no benefit ensues.  We're not committing any patches that
give one uncommon case an 1100X speedup by penalizing every other query 10%,
or even 1%; especially not when there may be other ways to fix it.

I agree with this point also, I just don't see it as a blocker for expensive general case optimizations.

There are many optimizations we might adopt, yet planning time is a factor. It seems simple enough to ignore more complex optimizations if we have already achieved a threshold cost (say 10). Such a test would add nearly zero time for the common case. We can apply the optimizations in some kind of ordering depending upon the cost, so we are careful to balance the cost/benefit of trying certain optimizations.
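A minimal sketch of such a gate, with entirely hypothetical names and an arbitrary threshold, might look like this:

/*
 * Hypothetical sketch only: neither these function names nor the
 * threshold exist in PostgreSQL.  Skip expensive optimizations when
 * the query already looks cheap enough that they cannot pay off.
 */
#define EXPENSIVE_OPT_COST_THRESHOLD    10.0

static void
maybe_apply_expensive_optimizations(PlannerInfo *root, Cost estimated_cost)
{
    if (estimated_cost < EXPENSIVE_OPT_COST_THRESHOLD)
        return;                 /* near-zero added cost for the common case */

    /* Attempt optimizations in increasing order of planning cost. */
    consider_equivalence_filters(root);         /* hypothetical */
    /* ... further, more expensive, optimizations ... */
}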

Optimizer directives might be useful also, but we can do just as well with a heuristic.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [PATCH] Equivalence Class Filters

From
Jim Nasby
Date:
On 12/7/15 9:54 AM, Tom Lane wrote:
> Jim Nasby<Jim.Nasby@BlueTreble.com>  writes:
>> >On 12/6/15 10:38 AM, Tom Lane wrote:
>>> >>I said "in most cases".  You can find example cases to support almost any
>>> >>weird planner optimization no matter how expensive and single-purpose;
>>> >>but that is the wrong way to think about it.  What you have to think about
>>> >>is average cases, and in particular, not putting a drag on planning time
>>> >>in cases where no benefit ensues.  We're not committing any patches that
>>> >>give one uncommon case an 1100X speedup by penalizing every other query 10%,
>>> >>or even 1%; especially not when there may be other ways to fix it.
>> >This is a problem that seriously hurts Postgres in data warehousing
>> >applications.
> Please provide some specific examples.  I remain skeptical that this
> would make a useful difference all that often in the real world ...
> and handwaving like that does nothing to change my opinion.  What do
> the queries look like, and why would deducing an extra inequality
> condition help them?

I was speaking more broadly than this particular case. There are a lot of 
planner improvements that get shot down because of the planning overhead 
they would add. That's great for cases when milliseconds count, but 
spending an extra 60 seconds (a planning eternity) to shave an hour off 
a warehouse/reporting query is well worth it.

There needs to be some way to give the planner an idea of how much 
effort it should expend. GEQO and *_collapse_limit address this from 
the opposite direction (putting a cap on planner effort), but I think 
we need something that says the opposite: "I know this query will take 
a long time, so expend extra effort on planning it."
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: [PATCH] Equivalence Class Filters

From
Jim Nasby
Date:
On 12/7/15 10:44 AM, Simon Riggs wrote:
> There are many optimizations we might adopt, yet planning time is a
> factor. It seems simple enough to ignore more complex optimizations if
> we have already achieved a threshold cost (say 10). Such a test would
> add nearly zero time for the common case. We can apply the optimizations
> in some kind of ordering depending upon the cost, so we are careful to
> balance the cost/benefit of trying certain optimizations.

Unfortunately I've seen a lot of millisecond queries that have 6 figure 
estimates, due to data being in cache. So I'm not sure how practical 
that would be.

Maybe a better starting point would be a planner timeout.

I definitely agree we need some method to limit planning time when 
necessary (ie: OLTP). Without that we'll never be able to start testing 
more complex optimizations.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: [PATCH] Equivalence Class Filters

From
Simon Riggs
Date:
On 7 December 2015 at 16:55, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
 
Unfortunately I've seen a lot of millisecond queries that have 6 figure estimates, due to data being in cache. So I'm not sure how practical that would be.

We seem to be agreed that longer running queries may benefit from more thorough optimization.

I like the idea of specifying a goal for the optimization, so the user can explicitly ask to spend more time. But I would also rather that I didn't have to do that at all, by using a heuristic that allows us to act sensibly even when not explicitly instructed by the user. It Just Works.

We can decide what the heuristic limit is based upon the cost of the technique. I don't think it matters how quick your CPU is; what matters is how long planning takes as a % of the whole expected execution time.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [PATCH] Equivalence Class Filters

From
Gavin Flower
Date:
On 08/12/15 05:27, David G. Johnston wrote:
> On Mon, Dec 7, 2015 at 8:35 AM, Jim Nasby <Jim.Nasby@bluetreble.com 
> <mailto:Jim.Nasby@bluetreble.com>>wrote:
>
>     On 12/6/15 10:38 AM, Tom Lane wrote:
>
>         I said "in most cases".  You can find example cases to support
>         almost any
>         weird planner optimization no matter how expensive and
>         single-purpose;
>         but that is the wrong way to think about it.  What you have to
>         think about
>         is average cases, and in particular, not putting a drag on
>         planning time
>         in cases where no benefit ensues.  We're not committing any
>         patches that
>         give one uncommon case an 1100X speedup by penalizing every
>         other query 10%,
>         or even 1%; especially not when there may be other ways to fix it.
>
>
>     This is a problem that seriously hurts Postgres in data
>     warehousing applications. We can't keep ignoring optimizations
>     that provide even as little as 10% execution improvements for 10x
>     worse planner performance, because in a warehouse it's next to
>     impossible for planning time to matter.
>
>     Obviously it'd be great if there was a fast, easy way to figure
>     out whether a query would be expensive enough to go the whole 9
>     yards on planning it but at this point I suspect a simple GUC
>     would be a big improvement.
>
>
> Something like "enable_equivalencefilters" but that defaults to false 
> unlike every one existing "enable_*" GUC?
>
> ​It would be a lot more user-friendly to have something along the 
> lines of "planner_mode (text)" with labels like "star, transactional, 
> bulk_load, etc..." because I suspect there are other things we'd want 
> to add if we start identifying queries by their type/usage and 
> optimize accordingly.  Having the knobs available is necessary but 
> putting on a façade would make the user more straight-forward for the 
> common cases.
>
> David J.
>
How about:

planning_time_base 10  # Default effort, may be increased or decreased as required - must be at least 1
planning_time_XXXX  0   # By default, planner makes no (or minimal) effort to optimise for feature XXXX

So for some people, adjusting planning_time_base may be sufficient - but 
for more specialised cases, people can tell the planner to consider 
expending more effort.


Cheers,
Gavin




Re: [PATCH] Equivalence Class Filters

From
Evgeniy Shishkin
Date:
> On 07 Dec 2015, at 22:27, Gavin Flower <GavinFlower@archidevsys.co.nz> wrote:
>
> On 08/12/15 05:27, David G. Johnston wrote:
>> On Mon, Dec 7, 2015 at 8:35 AM, Jim Nasby <Jim.Nasby@bluetreble.com <mailto:Jim.Nasby@bluetreble.com>>wrote:
>>
>>    On 12/6/15 10:38 AM, Tom Lane wrote:
>>
>>        I said "in most cases".  You can find example cases to support
>>        almost any
>>        weird planner optimization no matter how expensive and
>>        single-purpose;
>>        but that is the wrong way to think about it.  What you have to
>>        think about
>>        is average cases, and in particular, not putting a drag on
>>        planning time
>>        in cases where no benefit ensues.  We're not committing any
>>        patches that
>>        give one uncommon case an 1100X speedup by penalizing every
>>        other query 10%,
>>        or even 1%; especially not when there may be other ways to fix it.
>>
>>
>>    This is a problem that seriously hurts Postgres in data
>>    warehousing applications. We can't keep ignoring optimizations
>>    that provide even as little as 10% execution improvements for 10x
>>    worse planner performance, because in a warehouse it's next to
>>    impossible for planning time to matter.
>>
>>    Obviously it'd be great if there was a fast, easy way to figure
>>    out whether a query would be expensive enough to go the whole 9
>>    yards on planning it but at this point I suspect a simple GUC
>>    would be a big improvement.
>>
>>
>> Something like "enable_equivalencefilters" but that defaults to false unlike every one existing "enable_*" GUC?
>>
>> It would be a lot more user-friendly to have something along the lines of "planner_mode (text)" with labels like
>> "star, transactional, bulk_load, etc..." because I suspect there are other things we'd want to add if we start
>> identifying queries by their type/usage and optimize accordingly. Having the knobs available is necessary but
>> putting a façade on them would make things more straightforward for the user in the common cases.
>>
>> David J.
>>
> How about:
>
> planning_time_base 10  # Default effort, may be increased or decreased as required - must be at least 1
> planning_time_XXXX  0  # By default, planner makes no (or minimal) effort to optimise for feature XXXX
>
> So for some people, adjusting planning_time_base may be sufficient - but for more specialised cases, people can tell
> the planner to consider expending more effort.
>

MySQL now has 19 optimizer_switch parameters
https://dev.mysql.com/doc/refman/5.7/en/switchable-optimizations.html

Please don't do that.


I'd rather like some sort of pg_stat_statements view which would track execution and planning time.
For a new query, we could then look up whether the query might benefit from more planning time.
But I don't know how costly this would be.

>
> Cheers,
> Gavin




Re: [PATCH] Equivalence Class Filters

From
Gavin Flower
Date:
On 08/12/15 08:34, Evgeniy Shishkin wrote:
>> On 07 Dec 2015, at 22:27, Gavin Flower <GavinFlower@archidevsys.co.nz> wrote:
>>
>> On 08/12/15 05:27, David G. Johnston wrote:
>>> On Mon, Dec 7, 2015 at 8:35 AM, Jim Nasby <Jim.Nasby@bluetreble.com <mailto:Jim.Nasby@bluetreble.com>>wrote:
>>>
>>>     On 12/6/15 10:38 AM, Tom Lane wrote:
>>>
>>>         I said "in most cases".  You can find example cases to support
>>>         almost any
>>>         weird planner optimization no matter how expensive and
>>>         single-purpose;
>>>         but that is the wrong way to think about it.  What you have to
>>>         think about
>>>         is average cases, and in particular, not putting a drag on
>>>         planning time
>>>         in cases where no benefit ensues.  We're not committing any
>>>         patches that
>>>         give one uncommon case an 1100X speedup by penalizing every
>>>         other query 10%,
>>>         or even 1%; especially not when there may be other ways to fix it.
>>>
>>>
>>>     This is a problem that seriously hurts Postgres in data
>>>     warehousing applications. We can't keep ignoring optimizations
>>>     that provide even as little as 10% execution improvements for 10x
>>>     worse planner performance, because in a warehouse it's next to
>>>     impossible for planning time to matter.
>>>
>>>     Obviously it'd be great if there was a fast, easy way to figure
>>>     out whether a query would be expensive enough to go the whole 9
>>>     yards on planning it but at this point I suspect a simple GUC
>>>     would be a big improvement.
>>>
>>>
>>> Something like "enable_equivalencefilters" but that defaults to false unlike every one existing "enable_*" GUC?
>>>
>>> It would be a lot more user-friendly to have something along the lines of "planner_mode (text)" with labels like
>>> "star, transactional, bulk_load, etc..." because I suspect there are other things we'd want to add if we start
>>> identifying queries by their type/usage and optimize accordingly. Having the knobs available is necessary but
>>> putting a façade on them would make things more straightforward for the user in the common cases.
>>>
>>> David J.
>>>
>> How about:
>>
>> planning_time_base 10  # Default effort, may be increased or decreased as required - must be at least 1
>> planning_time_XXXX  0  # By default, planner makes no (or minimal) effort to optimise for feature XXXX
>>
>> So for some people, adjusting planning_time_base may be sufficient - but for more specialised cases, people can tell
>> the planner to consider expending more effort.
>>
> Mysql have now 19 optimizer_switch parameters
> https://dev.mysql.com/doc/refman/5.7/en/switchable-optimizations.html
I notice that they are either on or off; I suspect that it is better to 
have some sort of measure of how much extra effort the planner should make.

>
> Please don't do that.
I think having some might be useful - though in most situations, having 
a general indicator for the planner might be sufficient.
From reading the thread, I have the impression that for some extreme 
workloads some extra twiddling would be useful, even though for most 
people it would simply be an unnecessary complication.

In over twenty years I've never needed such knobs, but I might get a 
project next year where they might be useful.  So I agree that for most 
situations, such extra stuff is not needed - but I'd like additional 
options available if I ever needed them.

>
> I'd rather like some sort of pg_stat_statements, which would track execution and planning time.
> On new query, we can lookup if query can benefit from more planning time.
> But i don't know how costly this can be.
>
>> Cheers,
>> Gavin




Re: [PATCH] Equivalence Class Filters

From
David Rowley
Date:
On 8 December 2015 at 04:35, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
On 12/6/15 10:38 AM, Tom Lane wrote:
I said "in most cases".  You can find example cases to support almost any
weird planner optimization no matter how expensive and single-purpose;
but that is the wrong way to think about it.  What you have to think about
is average cases, and in particular, not putting a drag on planning time
in cases where no benefit ensues.  We're not committing any patches that
give one uncommon case an 1100X speedup by penalizing every other query 10%,
or even 1%; especially not when there may be other ways to fix it.

This is a problem that seriously hurts Postgres in data warehousing applications. We can't keep ignoring optimizations that provide even as little as 10% execution improvements for 10x worse planner performance, because in a warehouse it's next to impossible for planning time to matter.

Obviously it'd be great if there was a fast, easy way to figure out whether a query would be expensive enough to go the whole 9 yards on planning it but at this point I suspect a simple GUC would be a big improvement.

I've certainly been here before [1], but the idea fell on deaf ears.

The biggest frustration for me is that the way Tom always seems to argue his point, it's as if planning time is roughly the same as or more expensive than execution time, and likely that's true in many cases, but I would imagine in more normal cases that execution time is longer, although I've never had the guts to stand up and argue this as I don't have any stats to back me up.

I was talking to Thomas Munro yesterday about this, and he mentioned that it would be quite nice to have some stats on how much time is spent in the planner, vs executor. He came up with the idea of adding a column to pg_stat_statements to record the planning time.

If we could get real statistics on real-world cases and we discovered that, for example, on average the total CPU time of planning was only 1% of execution time, then it would certainly make adding 2% overhead onto the planner a bit less of a worry, as this would just be 2% of 1% (0.02%). Such information, if fed back into the community, might be able to drive us in the right direction when it comes to deciding what needs to be done to resolve this constant issue with accepting planner improvement patches.

I believe that with parallel query on the horizon for 9.6, we're now aiming to support bigger OLAP-type databases than ever before. So if we ignore patches like this one, then it appears that we have some conflicting goals in the community: it seems we're willing to add the brawn, but not the brain. If that's the case then it's a shame, as I think we can have both. And I very much agree that we must find a way to maintain support and high performance for small OLTP databases too.

[1] http://www.postgresql.org/message-id/CAApHDvrJrz-0xinyiqTiWs0mFX17GWD2Y8VZ+i92nuZsha8ocw@mail.gmail.com

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [PATCH] Equivalence Class Filters

From
Jeremy Harris
Date:
On 07/12/15 16:44, Simon Riggs wrote:
> There are many optimizations we might adopt, yet planning time is a factor.
> It seems simple enough to ignore more complex optimizations if we have
> already achieved a threshold cost (say 10). Such a test would add nearly
> zero time for the common case. We can apply the optimizations in some kind
> of ordering depending upon the cost, so we are careful to balance the
> cost/benefit of trying certain optimizations.

Given parallelism, why not continue planning after initiating a
cancellable execution, giving a better plan to be used if the
execution runs for long enough?
-- 
Cheers, Jeremy





Re: [PATCH] Equivalence Class Filters

From
Jim Nasby
Date:
On 12/7/15 7:26 PM, David Rowley wrote:
> I was talking to Thomas Munro yesterday about this, and he mentioned
> that it would be quite nice to have some stats on how much time is spent
> in the planner, vs executor. He came up with the idea of adding a column
> to pg_stat_statements to record the planning time.

I think that would be useful. Maybe something in pg_stat_database too.

> If we could get real statistics on real world cases and we discovered
> that, for example on average that the total CPU time of planning was
> only 1% of execution time, then it would certainly make adding 2%
> overhead onto the planner a bit less of a worry, as this would just be
> 2% of 1% (0.02%). Such information, if fed back into the community might
> be able to drive us in the right direction when it comes to deciding
> what needs to be done to resolve this constant issue with accepting
> planner improvement patches.

Might be nice, but I think it's also pretty unnecessary.

I've dealt with dozens of queries that took minutes to hours to run. Yet 
I can't recall ever having an EXPLAIN on one of these take more than a 
few seconds. I tend to do more OLTP stuff so maybe others have 
experienced long-running EXPLAIN, in which case it'd be great to know that.

Actually, I have seen long EXPLAIN times, but only as part of trying to 
aggressively increase *_collapse_limit. IIRC I was able to increase one 
of those to 14 and one to 18 before plan time became unpredictably bad 
(it wasn't always bad though, just sometimes).

> I believe that with parallel query on the horizon for 9.6 that we're now
> aiming to support bigger OLAP-type databases than ever before. So if we
> ignore patches like this one then it appears that we have some
> conflicting goals in the community as it seems that we're willing to add
> the brawn, but we're not willing to add the brain. If this is the case
> then it's a shame, as I think we can have both. So I very much agree on
> the fact that we must find a way to maintain support and high
> performance of small OLTP databases too.

+1
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: [PATCH] Equivalence Class Filters

From
Jim Nasby
Date:
On 12/8/15 3:52 AM, Jeremy Harris wrote:
> On 07/12/15 16:44, Simon Riggs wrote:
>> There are many optimizations we might adopt, yet planning time is a factor.
>> It seems simple enough to ignore more complex optimizations if we have
>> already achieved a threshold cost (say 10). Such a test would add nearly
>> zero time for the common case. We can apply the optimizations in some kind
>> of ordering depending upon the cost, so we are careful to balance the
>> cost/benefit of trying certain optimizations.
>
> Given parallelism, why not continue planning after initiating a
> cancellable execution, giving a better plan to be used if the
> execution runs for long enough?

Because that would take significantly more work than what Simon is 
proposing.

That said, I think the ability to restart with a different plan is 
something we might need, for cases when we discover the plan estimates 
were way off. If that ever gets built it might be useful for what you 
propose as well.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: [PATCH] Equivalence Class Filters

From
Robert Haas
Date:
On Mon, Dec 7, 2015 at 8:26 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> The biggest frustration for me is that the way Tom always seems to argue his
> point it's as if planning time is roughly the same or more expensive than
> execution time, and likely that's true in many cases, but I would imagine in
> more normal cases that execution time is longer, although I've never had the
> guts to stand up and argue this as I don't have any stats to back me up.

I think this is missing the point.  It is possible to have a query
planner optimization that is so expensive that it loses even when it
improves the plan, but I don't think this optimization falls into that
category.  This particular change would not have been requested as
many times as it has been if people didn't keep rediscovering that, on
a certain class of queries, it works *really* well.  The problem that
I understand Tom to be concerned about is the queries where the
optimization consumes additional planning time without delivering a
better plan - and those where, without careful thought, it might even
deliver a worse plan.

Now, I'm not sure Tom is right about that, but he might be.  The class
of queries we're talking about here are the ones where somebody
constrains a column that is part of an equivalence class with multiple
members.  Perhaps the best example is a large join, let's say 10
tables, where the join column is the same for all tables; thus, we
have an equivalence class with 10 members.  Suppose further we have an
inequality condition applied to one member of that equivalence class.

Currently, we'll apply that inequality to the table that the user
actually mentioned and ignore all the others; in theory, it could be
right to enforce that inequality condition on any nonempty subset of
the 10 tables we've got.  It might be right to pick exactly one of
those tables and enforce the inequality there, or it might be right to
enforce it on some of them, or it might be right to enforce it on all
of them.  The reason why the answer isn't automatically "all of them"
is because, first of all, it's possible that enforcing the condition
at a particular table costs more than the rows it filters out save us
in execution time at higher levels of the plan tree.  For example,
consider A JOIN B ON A.X = B.X WHERE A.X > 1000000.  It might be that
the range of A.X is [0,1000001] but the range of B.X is
[1000000,2000000]; so enforcing the inequality against A is very
selective but enforcing it against B filters out basically nothing.
Furthermore, there are some cases involving parameterized paths where
enforcing the inequality multiple times is definitely bad: for
example, if we've got a nested loop where the outer side is a seq scan
that enforces the condition and the inner side is an index probe, it
is just a waste to retest it on the inner side.  We already know that
the outer row passes the inequality, so the inner row will necessarily
pass also.  This doesn't apply to merge or hash joins, and it also
doesn't apply to all nested loops: scans that aren't parameterized by
the equivalence-class column can still benefit from separate
enforcement of the inequality.
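(In rough numbers, assuming uniform distributions over those ranges: A.X > 1000000 keeps about 1 row in 1000001 of A, while the same condition keeps essentially every row of B, so the derived qual on B costs one comparison per row and filters out almost nothing.)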

Considering the factors mentioned in the previous paragraph, it seems
likely to me that a naive patch that propagates the inequalities to
every relation where they could hypothetically be enforced will be a
significant loss in some cases even just in execution cost, ignoring
completely the effects on planning time.  And, on the flip side,
figuring out what non-empty subset of relations forms the optimal set
upon which to enforce the inequality seems like a complicated problem.
A first cut might be to enforce the inequality against the relation
where it's believed to be most selective, and to also enforce it in
paths for other relations that aren't parameterized by the
equivalence-class column mentioned in the inequality provided that the
selectivity is thought to be above some threshold ... but I'm not sure
this is very principled, and it's certainly not obvious what arbitrary
threshold would be appropriate.  The threshold where multiple
enforcement of the qual starts to pay off also depends on the cost of
the qual: an expensive qual has to filter out more rows than a cheap
qual to be worthwhile.  I'm not sure how to estimate all this, but I
don't have trouble believing that if not done carefully it could
either (a) cost a lot of planning time or (b) generate lousy plans.

Now, all that having been said, I think this is a pretty important
optimization.  Lots of people have asked for it, and I think it would
be worth expending some brainpower to try to figure out a way to be
smarter than we are now, which is, in a nutshell, as dumb as possible.
You're completely right that, on an OLAP query that's going to run for
a minute, a few extra milliseconds of planning time could be totally
OK even if they don't yield any benefit on most queries. Maybe we can
get the cost of this feature down to the point where we can turn it on
for everybody and have that be OK.  But if not, having something we
can turn on for queries that are going to run for a long time, or that
we estimate are going to run for a long time, would, I think, be
useful.  As things stand, planning time can consume a very significant
percentage of total runtime on OLTP queries, and if you are processing
300k queries/second and 10% of that is planning time, and somebody
boosts planning time by 1%, that's an 0.1% hit overall and you just
lost several hundred queries per second.  That's not nothing,
especially if we do 3 or 4 such "optimizations" per release cycle.
But if the optimization can be made nearly free in cases where it
doesn't apply, that's a different story, and of course OLAP is a whole
different ball game.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [PATCH] Equivalence Class Filters

From
David Rowley
Date:
On 9 December 2015 at 15:14, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Dec 7, 2015 at 8:26 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> The biggest frustration for me is that the way Tom always seems to argue his
> point it's as if planning time is roughly the same or more expensive than
> execution time, and likely that's true in many cases, but I would imagine in
> more normal cases that execution time is longer, although I've never had the
> guts to stand up and argue this as I don't have any stats to back me up.

I think this is missing the point.  It is possible to have a query
planner optimization that is so expensive that it loses even when it
improves the plan, but I don't think this optimization falls into that
category.  This particular change would not have been requested as
many times as it has been if people didn't keep rediscovering that, on
a certain class of queries, it works *really* well.  The problem that
I understand Tom to be concerned about is the queries where the
optimization consumes additional planning time without delivering a
better plan - and those where, without careful thought, it might even
deliver a worse plan.


My point was more of a response to the general condition that we cannot add any planner overhead unless the extra CPU cycles spent are almost certainly going to improve the plan.
 
Now, I'm not sure Tom is right about that, but he might be.  The class
of queries we're talking about here are the ones where somebody
constrains a column that is part of an equivalence class with multiple
members.  Perhaps the best example is a large join, let's say 10
tables, where the join column is the same for all tables; thus, we
have an equivalence class with 10 members.  Suppose further we have an
inequality condition applied to one member of that equivalence class.

Currently, we'll apply that inequality to the table that the user
actually mentioned and ignore all the others; in theory, it could be
right to enforce that inequality condition on any nonempty subset of
the 10 tables we've got.  It might be right to pick exactly one of
those tables and enforce the inequality there, or it might be right to
enforce it on some of them, or it might be right to enforce it on all
of them.  The reason why the answer isn't automatically "all of them"
is because, first of all, it's possible that enforcing the condition
at a particular table costs more than the rows it filters out save us
in execution time at higher levels of the plan tree.  For example,
consider A JOIN B ON A.X = B.X WHERE A.X > 1000000.  It might be that
the range of A.X is [0,1000001] but the range of B.X is
[1000000,2000000]; so enforcing the inequality against A is very
selective but enforcing it against B filters out basically nothing.

hmm, well I think it's more than likely that my view on this has been skewed by the fact that we push equality quals down when the eclass contains a Const, without any regard to whether it could generate a worse plan or add needless CPU overhead for checking the quals. If you rewrite your above paragraph with equality instead of inequality then it still applies. A JOIN B ON A.X = B.X WHERE A.X = 1000000 will cause the clause B.X = 1000000 to be pushed into B, but what if B.X values are all set to 1000000? I've not seen anyone complain about us doing that.  This is the reason I limited the patch to only deal with members of a BTREE op-class: I assumed that if there's some BTREE index helping with the equality qual, then that same index may well help with a qual using the <, <=, > or >= operator, and I also thought that in most cases all these btree inequality operations will have about the same CPU cost as equality.

Would you be able to explain the difference between the two? Or is it just a question of the likelihood?

 
Furthermore, there are some cases involving parameterized paths where
enforcing the inequality multiple times is definitely bad: for
example, if we've got a nested loop where the outer side is a seq scan
that enforces the condition and the inner side is an index probe, it
is just a waste to retest it on the inner side.  We already know that
the outer row passes the inequality, so the inner row will necessarily
pass also.  This doesn't apply to merge or hash joins, and it also
doesn't apply to all nested loops: scans that aren't parameterized by
the equivalence-class column can still benefit from separate
enforcement of the inequality.


I guess that could be fixed by somehow marking these pushed quals as optional and having parameterised scans ignore optional quals.
I have to admit that I didn't give it much thought, as all of the cases that I tested turned what used to be a nested loop with a parameterised inner index scan into a merge join, which was faster in my test case. Of course, that does not mean I claim that this merge join will be faster in all cases, or that the plan will switch to using a merge join in all cases.

 
Considering the factors mentioned in the previous paragraph, it seems
likely to me that a naive patch that propagates the inequalities to
every relation where they could hypothetically be enforced will be a
significant loss in some cases even just in execution cost, ignoring
completely the effects on planning time.  And, on the flip side,
figuring out what non-empty subset of relations forms the optimal set
upon which to enforce the inequality seems like a complicated problem.
  A first cut might be to enforce the inequality against the relation
where it's believed to be most selective, and to also enforce it in
paths for other relations that aren't parameterized by the
equivalence-class column mentioned in the inequality provided that the
selectivity is thought to be above some threshold ... but I'm not sure
this is very principled, and it's certainly not obvious what arbitrary
threshold would be appropriate.  The threshold where multiple
enforcement of the qual starts to pay off also depends on the cost of
the qual: an expensive qual has to filter out more rows than a cheap
qual to be worthwhile.  I'm not sure how to estimate all this, but I
don't have trouble believing that if not done carefully it could
either (a) cost a lot of planning time or (b) generate lousy plans.

Again, this is one of the reasons that I limited it to btree operators only.
I don't quite know how to estimate this either as the extra rows being filtered may mean that a sort no longer spills to disk, or a hash join no longer multi-batches. I'd imagine filtering would be extra worthwhile in such cases, so I doubt setting the threshold to some constant value is the correct way, and most likely considering the path with and without the qual(s) would be far too costly.

 

Now, all that having been said, I think this is a pretty important
optimization.  Lots of people have asked for it, and I think it would
be worth expending some brainpower to try to figure out a way to be
smarter than we are now, which is, in a nutshell, as dumb as possible.
You're completely right that, on an OLAP query that's going to run for
a minute, a few extra milliseconds of planning time could be totally
OK even if they don't yield any benefit on most queries. Maybe we can
get the cost of this feature down to the point where we can turn it on
for everybody and have that be OK.  But if not, having something we
can turn on for queries that are going to run for a long time, or that
we estimate are going to run for a long time, would, I think, be
useful.  As things stand, planning time can consume a very significant
percentage of total runtime on OLTP queries, and if you are processing
300k queries/second and 10% of that is planning time, and somebody
boosts planning time by 1%, that's an 0.1% hit overall and you just
lost several hundred queries per second.  That's not nothing,
especially if we do 3 or 4 such "optimizations" per release cycle.
But if the optimization can be made nearly free in cases where it
doesn't apply, that's a different story, and of course OLAP is a whole
different ball game.


Well I agree with that. I've got no interest in slowing anything down. I've been busy working with big databases for quite a while now, so perhaps my point of view is coming from that direction still.

So anyway, it's quite clear that we don't want the patch in its current form, and I'm not volunteering any more time at this stage to improve it, so for this reason I won't add it to the January commit fest.

Thanks for the feedback.

David

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [PATCH] Equivalence Class Filters

From
Robert Haas
Date:
On Tue, Dec 8, 2015 at 11:12 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> My point was more of a response to the general condition that we cannot add
> any planner overhead unless the extra CPU cycles spent are almost certainly
> going to improve the plan.

I hope that's an overstatement of the requirement, because otherwise
it boils down to "don't improve the planner".  Most things we do to
try to improve plans are going to generate paths that will only
sometimes be better than the paths we're already generating.  "almost
certainly" is a high bar to clear.

> hmm, well I think it's more than likely that my view on this has been skewed
> by the fact that we push equality quals down when the eclass contains a
> Const without any regard that it could generate a worse plan or add needless
> CPU overhead for checking the quals. If you rewrite your above paragraph
> with equality instead of inequality then it still applies. A JOIN B ON A.X =
> B.X WHERE A.X = 1000000, will cause B.X = 1000000 to be pushed into B, but
> what if B.X values are all set to 1000000? I've not seen anyone complain
> about us doing that.  This is the reason I limited the patch to only deal
> with members of the BTREE op-class, I assumed that if there's some BTREE
> index helping with the equality qual then that same index may well just help
> with a qual using the <, <=, > or >= operator, and also I thought that in
> most cases all these btree inequality operations will have about the same
> CPU cost as equality.
>
> Would you be able to explain the difference between the two? Or is it just a
> question of the likelihood?

Likelihood, I guess.  I mean, typically if you are doing an equijoin
on two or more relations, the join column is unique, so there's only
going to be one row in each of A and B that is equal to a given
constant.  If A.X and/or B.X contain duplicate values, then the output
is going to have a number of rows equal to the product of the number
of duplicates in each, sort of like a Cartesian join.  That scenario
could happen, but it's a pretty strange kind of query which I would be
disinclined to spend a lot of time optimizing.  OTOH, something like
SELECT * FROM orders o, lines l WHERE l.order_id = o.order_id AND
o.order_id > 1000000 is a pretty normal query, at least IMHO.
Worrying about how that's going to perform with various data
distributions seems like a pretty reasonable thing to care about.
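
To spell out the transformation on that example (a sketch only; orders/lines is a hypothetical schema), the patch's effect is equivalent to this manual rewrite:

-- As written, only the scan of "orders" gets the inequality:
SELECT * FROM orders o, lines l
WHERE l.order_id = o.order_id AND o.order_id > 1000000;

-- With {o.order_id, l.order_id} in one equivalence class, the derived
-- filter can be applied to "lines" as well:
SELECT * FROM orders o, lines l
WHERE l.order_id = o.order_id
  AND o.order_id > 1000000
  AND l.order_id > 1000000;  -- derived qual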

>> Furthermore, there are some cases involving parameterized paths where
>> enforcing the inequality multiple times is definitely bad: for
>> example, if we've got a nested loop where the outer side is a seq scan
>> that enforces the condition and the inner side is an index probe, it
>> is just a waste to retest it on the inner side.  We already know that
>> the outer row passes the inequality, so the inner row will necessarily
>> pass also.  This doesn't apply to merge or hash joins, and it also
>> doesn't apply to all nested loops: scans that aren't parameterized by
>> the equivalence-class column can still benefit from separate
>> enforcement of the inequality.
>>
> I guess that could be fixed by somehow marking these pushed quals as
> optional and having parameterised scans ignore optional quals.

Yeah.
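
For clarity, a sketch of the redundancy in question (hypothetical t1/t2, both indexed on id): with t1 on the outer side of a nested loop and a parameterized index probe on t2, every probe value has already passed the outer filter, so the derived inner qual can never reject a row:

SELECT * FROM t1
JOIN t2 ON t2.id = t1.id   -- inner probe uses the outer row's t1.id
WHERE t1.id <= 10          -- enforced on the outer scan
  AND t2.id <= 10;         -- derived qual: redundant in this plan shape,
                           -- though still useful for merge/hash joins and
                           -- unparameterized scans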

> I have to admit that I didn't give it much thought as all of the cases that
> I tested turned what used to be a nested loop with a parameterised inner index
> scan into a merge join, which was faster in my test case. Of course, that
> does not mean that I claim that this merge join will be faster in all cases
> or that the plan will switch to using a merge join in all cases.

Interesting that it turned into a merge join and that that was faster.

> Again, this is one of the reasons I limited it to btree operators only.
> I don't quite know how to estimate this either, as the extra rows being
> filtered may mean that a sort no longer spills to disk, or that a hash join no
> longer multi-batches. I'd imagine filtering would be extra worthwhile in
> such cases, so I doubt setting the threshold to some constant value is the
> correct approach, and most likely considering the path both with and without
> the qual(s) would be far too costly.

It might be OK, particularly for OLTP queries, but it's certainly not
going to be the cheapest thing anybody ever did.

> Well I agree with that. I've got no interest in slowing anything down. I've
> been busy working with big databases for quite a while now, so perhaps my
> point of view is coming from that direction still.

Yeah.

> So anyway, it's quite clear that we don't want the patch in its current
> form, and I'm not volunteering any more time at this stage to improve it, so
> for this reason I won't add it to the January commit fest.

Too bad, but I understand.  I hope you come back around to working on
this further at some point.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [PATCH] Equivalence Class Filters

From
Simon Riggs
Date:
On 7 December 2015 at 16:44, Simon Riggs <simon@2ndquadrant.com> wrote:
On 6 December 2015 at 16:38, Tom Lane <tgl@sss.pgh.pa.us> wrote:

>> Lastly, in most cases knowing that t2.id <= 10 is just not worth all
>> that much; it's certainly far less useful than an equality condition.

> I'd have thought that my link to a thread of a reported 1100 to 2200 times
> performance improvement might have made you think twice about claiming the
> uselessness of this idea.

Personally, I think this optimization is a long way down the list of important items because I don't see it as a typical use case. There are already patches submitted for more important items, so this isn't the right place to be arguing such things. Not every beneficial optimization has a wide use case. 

There is an interesting real world case where we might get some use of these thoughts.

If we have Orders and OrderItems (FK->Orders)
and we also know (and can Assert) Orders.order_date <= OrderItems.ship_date
then a restriction on Orders.order_date > X => OrderItems.ship_date > X when the two tables are joined on OrderId
and likewise a restriction on OrderItems.ship_date < X => Orders.order_date < X when the two tables are joined on OrderId

Such an assertion could be checked during the FK check, so would not be expensive to maintain.

One for the future, at least, since we don't have any way of expressing or enforcing that just yet.
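
A sketch of the manual rewrite such an assertion would license (hypothetical column names; the invariant itself isn't expressible in SQL today):

-- Invariant (not enforceable yet): Orders.order_date <= OrderItems.ship_date
SELECT * FROM Orders o
JOIN OrderItems oi ON oi.order_id = o.order_id
WHERE o.order_date > DATE '2015-01-01'
  AND oi.ship_date > DATE '2015-01-01';  -- implied by the invariant;
                                         -- today it must be added by hand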

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [PATCH] Equivalence Class Filters

From
Tomas Vondra
Date:
On 12/16/2015 01:26 AM, Simon Riggs wrote:
>
> There is an interesting real world case where we might get some use of
> these thoughts.
>
> If we have Orders and OrderItems (FK->Orders)
> and we also know (and can Assert) Orders.order_date <= OrderItems.ship_date
> then a restriction on Orders.order_date > X => OrderItems.ship_date > X
> when the two tables are joined on OrderId
> and likewise a restriction on OrderItems.ship_date < X => Orders.order_date
> < X when the two tables are joined on OrderId
>
> Such an assertion could be checked during the FK check, so would not be
> expensive to maintain.
>
> One for the future, at least, since we don't have any way of expressing
> or enforcing that just yet.

There's a concept of "correlation maps", described in a paper [1] 
presented at VLDB 2009. It essentially talks about deriving conditions 
between attributes of the same table, but I guess it might be modified 
to track the correlations through foreign keys.

Interestingly enough, the data for the paper were collected on 
PostgreSQL, but the correlation maps were implemented in an application 
layer on top of the database.

[1] http://www.vldb.org/pvldb/2/vldb09-199.pdf

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Equivalence Class Filters

From
David Rowley
Date:
On 16 December 2015 at 13:26, Simon Riggs <simon@2ndquadrant.com> wrote:
There is an interesting real world case where we might get some use of these thoughts.

If we have Orders and OrderItems (FK->Orders)
and we also know (and can Assert) Orders.order_date <= OrderItems.ship_date
then a restriction on Orders.order_date > X => OrderItems.ship_date > X when the two tables are joined on OrderId
and likewise a restriction on OrderItems.ship_date < X => Orders.order_date < X when the two tables are joined on OrderId

Such an assertion could be checked during the FK check, so would not be expensive to maintain.

One for the future, at least, since we don't have any way of expressing or enforcing that just yet.


That does sound interesting, but it's important to remember that referenced tables are not updated in real time in the same way that indexes are. This was the reason the INNER JOIN removals had problems: we simply can't determine at planner time that the trigger queue for the foreign key will be empty during execution, so we can't be certain that the foreign key will be "true".

I'm just mentioning this as I wouldn't want someone to run off thinking this was a fantastic idea without being aware of the above, and waste time making the same mistakes as I did last year.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [PATCH] Equivalence Class Filters

From
David Fetter
Date:
On Sun, Dec 20, 2015 at 10:27:35PM +1300, David Rowley wrote:
> On 16 December 2015 at 13:26, Simon Riggs <simon@2ndquadrant.com> wrote:
> 
> > There is an interesting real world case where we might get some
> > use of these thoughts.
> >
> > If we have Orders and OrderItems (FK->Orders) and we also know
> > (and can Assert) Orders.order_date <= OrderItems.ship_date then a
> > restriction on Orders.order_date > X => OrderItems.ship_date > X
> > when the two tables are joined on OrderId and likewise a restriction
> > on OrderItems.ship_date < X => Orders.order_date < X when the two
> > tables are joined on OrderId
> >
> > Such an assertion could be checked during the FK check, so would
> > not be expensive to maintain.
> >
> > One for the future, at least, since we don't have any way of
> > expressing or enforcing that just yet.
> >
> That does sound interesting, but it's important to remember that
> referenced tables are not updated in real time in the same way that
> indexes are.

Is getting them so even remotely possible, given the system we have
now?

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate