Re: improving GROUP BY estimation

From: Tomas Vondra
Subject: Re: improving GROUP BY estimation
Date:
Msg-id: 1457097036.24980.28.camel@2ndquadrant.com
In reply to: Re: improving GROUP BY estimation  (Mark Dilger <hornschnorter@gmail.com>)
Responses: Re: improving GROUP BY estimation  (Dean Rasheed <dean.a.rasheed@gmail.com>)
List: pgsql-hackers
On Thu, 2016-03-03 at 11:42 -0800, Mark Dilger wrote:
> > On Mar 3, 2016, at 11:27 AM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
> > 
> > On Thu, Mar 3, 2016 at 10:16 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> > So yes, each estimator works great for exactly the opposite cases. But
> > notice that typically, the results of the new formula is much higher
> > than the old one, sometimes by two orders of magnitude (and it
> > shouldn't be difficult to construct examples of much higher
> > differences).
> > 
> > The table also includes the 'average' estimator you propose, but it's
> > rather obvious that the result is always much closer to the new value,
> > simply because
> > 
> >    (small number) + (huge number)
> >    ------------------------------
> >                   2
> > 
> > is always much closer to the huge number. We're usually quite happy
> > when the estimates are within the same order of magnitude, so whether
> > it's K or K/2 makes pretty much no difference.
> > 
> > I believe that Mark means geometrical average, i.e. sqrt((small number) * (huge number)).

Ah, OK. I hadn't realized you meant the geometric mean. With that, it
looks like this:

1) independent
                  10      50     100     500    1000    5000
    ------------------------------------------------------------
    actual       919    3829    6244    9944   10001   10001
    current       10      50     102     516    1018    4996
    new          973    4001    6382    9897    9951    9951
    average      491    2025    3205    5206    5484    7473
    geom. mean    99     447     807    2260    3183    7051

2) dependent
                  10      50     100     500    1000    5000
    ------------------------------------------------------------
    actual        10      50     100     500    1000    5000
    current       10      53     105     508    1016    5014
    new          880    4105    6472    9955   10018   10018
    average      445    2079    3288    5231    5517    7516
    geom. mean    94     466     824    2249    3190    7087

To better illustrate the differences, we may use the "ratio error"
defined as
   err(a,b) = MAX(a/b, b/a)

where 'a' is the actual value and 'b' is the estimate. That gives us:

1) independent
                  10       50      100      500     1000     5000
    --------------------------------------------------------------
    current    91.90    76.58    61.22    19.27     9.82     2.00
    new         1.06     1.04     1.02     1.00     1.01     1.01
    average     1.87     1.89     1.95     1.91     1.82     1.34
    geom. mean  9.32     8.56     7.74     4.40     3.14     1.42

2) dependent
                  10       50      100      500     1000     5000
    --------------------------------------------------------------
    current     1.00     1.06     1.05     1.02     1.02     1.00
    new        88.00    82.10    64.72    19.91    10.02     2.00
    average    44.50    41.58    32.88    10.46     5.52     1.50
    geom. mean  9.38     9.33     8.24     4.50     3.19     1.42

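For reference, here's a minimal standalone sketch (plain C, nothing to
do with the planner code) showing how the geometric mean and the ratio
errors above are computed; the hard-coded numbers are just the first
column of the "independent" table:

    #include <math.h>
    #include <stdio.h>

    /* ratio error as defined above: err(a,b) = MAX(a/b, b/a) */
    static double
    ratio_error(double actual, double estimate)
    {
        return (actual > estimate) ? (actual / estimate) : (estimate / actual);
    }

    int
    main(void)
    {
        /* first column of the "independent" table above */
        double  actual = 919;
        double  current = 10;
        double  new_est = 973;

        double  average = (current + new_est) / 2.0;
        double  geomean = sqrt(current * new_est);  /* sqrt(small * huge) */

        printf("current    %7.0f   err %6.2f\n", current, ratio_error(actual, current));
        printf("new        %7.0f   err %6.2f\n", new_est, ratio_error(actual, new_est));
        printf("average    %7.0f   err %6.2f\n", average, ratio_error(actual, average));
        printf("geom. mean %7.0f   err %6.2f\n", geomean, ratio_error(actual, geomean));

        return 0;
    }
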
So the geometric mean seems to work better than the plain average. But I
don't think we can simply conclude we should use the geometric mean of
the estimates, as that assumes the risks of over- and under-estimating
the cardinality are the same. They aren't.

> 
> Yes, that is what I proposed upthread.  I'm not wedded to that, though.
> In general, I am with Tomas on this one, believing that his estimate
> will be much better than the current estimate.  But I believe the *best*
> estimate will be somewhere between his and the current one, and I'm
> fishing for any decent way of coming up with a weighted average that
> is closer to his than to the current, but not simply equal to his proposal.
> 
> The reason I want the formula to be closer to Tomas's than to the
> current is that I think that on average, across all tables, across all
> databases, in practice it will be closer to the right estimate than the
> current formula.  That's just my intuition, and so I can't defend it.
> But if my intuition is right, the best formula we can adopt would be one
> that is moderated from his by a little bit, and in the direction of the
> estimate that the current code generates.

I think looking merely at what fraction of data sets (or queries) uses
dependent GROUP BY and WHERE clauses is not sufficient.

The general behavior of the two GROUP BY estimators is that the current
one tends to under-estimate, while the new one tends to over-estimate.
(It's not difficult to construct counter-examples by using more complex
dependencies between the columns etc. but let's ignore that.)
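
As a side note, here's a toy simulation (plain C, not the selfuncs.c
code; the row count, ndistinct and selectivities are made up)
illustrating why scaling ndistinct linearly by the restriction
selectivity - which is roughly what the current estimate amounts to,
judging by the numbers above - under-estimates badly when the WHERE
clause is independent of the GROUP BY column:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define NROWS      1000000      /* rows in the table */
    #define NDISTINCT  10000        /* distinct values in the GROUP BY column */

    int
    main(void)
    {
        double      sels[] = {0.001, 0.005, 0.01, 0.05, 0.1, 0.5};
        char       *seen = malloc(NDISTINCT);

        srand(12345);

        for (int s = 0; s < 6; s++)
        {
            double      sel = sels[s];
            long        actual = 0;

            memset(seen, 0, NDISTINCT);

            for (long i = 0; i < NROWS; i++)
            {
                int         group = rand() % NDISTINCT;

                /* WHERE clause independent of the grouped column */
                if ((double) rand() / RAND_MAX < sel)
                {
                    if (!seen[group])
                    {
                        seen[group] = 1;
                        actual++;
                    }
                }
            }

            /* linear scaling by selectivity vs. observed number of groups */
            printf("sel = %.3f   actual groups = %ld   ndistinct * sel = %.0f\n",
                   sel, actual, NDISTINCT * sel);
        }

        free(seen);
        return 0;
    }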

The risk associated with over-estimation is that we may switch from
HashAggregate to GroupAggregate, and generally select paths better
suited for a large number of rows. That's not great, because the plan
may be suboptimal and thus run much slower - but this usually only
happens on small tables, because on large tables we would probably end
up using those same paths anyway.

With under-estimation, the risks are much graver, as we may end up using
HashAggregate only to get killed by OOM. And even when we're lucky
enough not to hit that, in my experience this often leads to a cascade
of NestedLoop nodes processing orders of magnitude more tuples than
expected. Awful.

So I think the risk is asymmetric, and that's why I like the new
estimator more, because it tends to over-estimate, but in a very
reasonable way.

> 
> I could easily lose this debate merely for lack of a principled basis
> for saying how far toward the current estimate the new estimate should
> be adjusted.  The geometric average is one suggestion, but I don't have
> a principled argument for it.
> 
> Like I said above, I'm fishing for a decent formula here.

Thanks for the valuable feedback!

> 
> Mark Dilger

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



