Re: multivariate statistics v10

Поиск

Список

Период

Сортировка

От	Tomas Vondra
Тема	Re: multivariate statistics v10
Дата	2 марта 2016 г. 14:56:43
Msg-id	56D6FF1C.5060303@2ndquadrant.com обсуждение исходный текст
Ответ на	Re: multivariate statistics v9 (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Ответы	Re: multivariate statistics v10 Re: multivariate statistics v11
Список	pgsql-hackers

Дерево обсуждения

Hi,

Attached is v10 of the patch series. There are 9 parts at the moment:

   0001-teach-pull_-varno-varattno-_walker-about-RestrictInf.patch
   0002-shared-infrastructure-and-functional-dependencies.patch
   0003-clause-reduction-using-functional-dependencies.patch
   0004-multivariate-MCV-lists.patch
   0005-multivariate-histograms.patch
   0006-multi-statistics-estimation.patch
   0007-multivariate-ndistinct-coefficients.patch
   0008-change-how-we-apply-selectivity-to-number-of-groups-.patch
   0009-fixup-of-regression-tests-plans-changes-by-group-by-.patch

However, the first one is still just a temporary workaround that I plan
to address next, and the last 3 are all dealing with the ndistinct
coefficients (and shall be squashed into a single chunk).


README docs
-----------

Aside from fixing a few bugs, there are several major improvements, the
main one being that I've moved most of the comments explaining how it
all works into a set of regular README files, located in
src/backend/utils/mvstats:

1) README.stats - Overview of available types of statistics, what
    clauses can be estimated, how multiple statistics are combined etc.
    This is probably the right place to start.

2) docs for each type of statistics currently available

    README.dependencies - soft functional dependencies
    README.mcv          - MCV lists
    README.histogram    - histograms
    README.ndistinct    - ndistinct coefficients

The READMEs are added and modified through the patch series, so the best
thing to do is apply all the patches and start reading.

I have not improved the user-oriented SGML documentation in this patch,
that's one of the tasks I'd lie to work on next. But the READMEs should
give you a good idea how it's supposed to work, and there are some
examples of use in the regression tests.


Significantly simplified places
-------------------------------

The patch version also significantly simplifies several places that were
needlessly complex in the previous ones - firstly the function
evaluating clauses on multivariate histograms was rather needlessly
bloated, so I've simplified it a lot. Similarly for the code in
clauselist_select() that combines multiple statistics to estimate a list
of clauses - that's much simpler now too. And various other pieces.

That being said, I still think the code in clausesel.c can be
simplified. I feel there's a lot of cruft, mostly due to unknowingly
implementing something that could be solved by an existing function.

A prime example of that is inspecting the expression tree to check if we
know how to estimate the clauses using the multivariate statistics. That
sounds like a nice match for expression walker, but currently is done by
custom code. I plan to look at that next.

Also, I'm not quite sure I understand what the varRelid parameter of
clauselist_selectivity is for, so the code may be handling that wrong
(seems to be working though).


ndistinct coefficients
----------------------

The one new piece in this patch is the GROUP BY estimation, based on the
ndistinct coefficients. So for example you can do this:

     CREATE TABLE t AS SELECT mod(i,1000) AS a, mod(i,1000) AS b
                         FROM generate_series(1,1000000) s(i);
     ANALYZE t;
     EXPLAIN SELECT * FROM t GROUP BY a, b;

which currently does this:

                               QUERY PLAN
-----------------------------------------------------------------------
  Group  (cost=127757.34..135257.34 rows=99996 width=8)
    Group Key: a, b
    ->  Sort  (cost=127757.34..130257.34 rows=1000000 width=8)
          Sort Key: a, b
          ->  Seq Scan on t  (cost=0.00..14425.00 rows=1000000 width=8)
(5 rows)

but we know that there are only 1000 groups because the columns are
correlated. So let's create ndistinct statistics on the two columns:

     CREATE STATISTICS s1 ON t (a,b) WITH (ndistinct);
     ANALYZE t;

which results in estimates like this:

                            QUERY PLAN
-----------------------------------------------------------------
  HashAggregate  (cost=19425.00..19435.00 rows=1000 width=8)
    Group Key: a, b
    ->  Seq Scan on t  (cost=0.00..14425.00 rows=1000000 width=8)
(3 rows)

I'm not quite sure how to combine this type of statistics with MCV lists
and histograms, so for now it's used only for GROUP BY.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Вложения

В списке pgsql-hackers по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: multivariate statistics v10

Вложения