Re: multivariate statistics v10
От | Tomas Vondra |
---|---|
Тема | Re: multivariate statistics v10 |
Дата | |
Msg-id | 56D6FF1C.5060303@2ndquadrant.com обсуждение исходный текст |
Ответ на | Re: multivariate statistics v9 (Tomas Vondra <tomas.vondra@2ndquadrant.com>) |
Ответы |
Re: multivariate statistics v10
Re: multivariate statistics v11 |
Список | pgsql-hackers |
Hi, Attached is v10 of the patch series. There are 9 parts at the moment: 0001-teach-pull_-varno-varattno-_walker-about-RestrictInf.patch 0002-shared-infrastructure-and-functional-dependencies.patch 0003-clause-reduction-using-functional-dependencies.patch 0004-multivariate-MCV-lists.patch 0005-multivariate-histograms.patch 0006-multi-statistics-estimation.patch 0007-multivariate-ndistinct-coefficients.patch 0008-change-how-we-apply-selectivity-to-number-of-groups-.patch 0009-fixup-of-regression-tests-plans-changes-by-group-by-.patch However, the first one is still just a temporary workaround that I plan to address next, and the last 3 are all dealing with the ndistinct coefficients (and shall be squashed into a single chunk). README docs ----------- Aside from fixing a few bugs, there are several major improvements, the main one being that I've moved most of the comments explaining how it all works into a set of regular README files, located in src/backend/utils/mvstats: 1) README.stats - Overview of available types of statistics, what clauses can be estimated, how multiple statistics are combined etc. This is probably the right place to start. 2) docs for each type of statistics currently available README.dependencies - soft functional dependencies README.mcv - MCV lists README.histogram - histograms README.ndistinct - ndistinct coefficients The READMEs are added and modified through the patch series, so the best thing to do is apply all the patches and start reading. I have not improved the user-oriented SGML documentation in this patch, that's one of the tasks I'd lie to work on next. But the READMEs should give you a good idea how it's supposed to work, and there are some examples of use in the regression tests. Significantly simplified places ------------------------------- The patch version also significantly simplifies several places that were needlessly complex in the previous ones - firstly the function evaluating clauses on multivariate histograms was rather needlessly bloated, so I've simplified it a lot. Similarly for the code in clauselist_select() that combines multiple statistics to estimate a list of clauses - that's much simpler now too. And various other pieces. That being said, I still think the code in clausesel.c can be simplified. I feel there's a lot of cruft, mostly due to unknowingly implementing something that could be solved by an existing function. A prime example of that is inspecting the expression tree to check if we know how to estimate the clauses using the multivariate statistics. That sounds like a nice match for expression walker, but currently is done by custom code. I plan to look at that next. Also, I'm not quite sure I understand what the varRelid parameter of clauselist_selectivity is for, so the code may be handling that wrong (seems to be working though). ndistinct coefficients ---------------------- The one new piece in this patch is the GROUP BY estimation, based on the ndistinct coefficients. So for example you can do this: CREATE TABLE t AS SELECT mod(i,1000) AS a, mod(i,1000) AS b FROM generate_series(1,1000000) s(i); ANALYZE t; EXPLAIN SELECT * FROM t GROUP BY a, b; which currently does this: QUERY PLAN ----------------------------------------------------------------------- Group (cost=127757.34..135257.34 rows=99996 width=8) Group Key: a, b -> Sort (cost=127757.34..130257.34 rows=1000000 width=8) Sort Key: a, b -> Seq Scan on t (cost=0.00..14425.00 rows=1000000 width=8) (5 rows) but we know that there are only 1000 groups because the columns are correlated. So let's create ndistinct statistics on the two columns: CREATE STATISTICS s1 ON t (a,b) WITH (ndistinct); ANALYZE t; which results in estimates like this: QUERY PLAN ----------------------------------------------------------------- HashAggregate (cost=19425.00..19435.00 rows=1000 width=8) Group Key: a, b -> Seq Scan on t (cost=0.00..14425.00 rows=1000000 width=8) (3 rows) I'm not quite sure how to combine this type of statistics with MCV lists and histograms, so for now it's used only for GROUP BY. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Вложения
- 0001-teach-pull_-varno-varattno-_walker-about-RestrictInf.patch
- 0002-shared-infrastructure-and-functional-dependencies.patch
- 0003-clause-reduction-using-functional-dependencies.patch
- 0004-multivariate-MCV-lists.patch
- 0005-multivariate-histograms.patch
- 0006-multi-statistics-estimation.patch
- 0007-multivariate-ndistinct-coefficients.patch
- 0008-change-how-we-apply-selectivity-to-number-of-groups-.patch
- 0009-fixup-of-regression-tests-plans-changes-by-group-by-.patch
В списке pgsql-hackers по дате отправления: