Re: WIP: multivariate statistics / proof of concept
От | Tomas Vondra |
---|---|
Тема | Re: WIP: multivariate statistics / proof of concept |
Дата | |
Msg-id | 5519E9B2.5080105@2ndquadrant.com обсуждение исходный текст |
Ответ на | Re: WIP: multivariate statistics / proof of concept (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>) |
Ответы |
Re: WIP: multivariate statistics / proof of concept
|
Список | pgsql-hackers |
Hello, attached is a new version of the patch series. Aside from fixing various issues (crashes, memory leaks). The patches are rebased to current master, and I also attach a few SQL scripts I used for testing (nothing fancy, just stress-testing all the parts the patch touches). The main changes in the patches (requiring plenty of changes in the other parts) are about these: (1) combining multiple statistics on a table -------------------------------------------- In the previous version of the patch, it was only possible to use a single statistics on a table - when there was a statistics "covering" all the conditions it worked fine, but that's not always the case. The new patch is able to combine multiple statistics by decomposing the probability (=selectivity) into conditional probabilities. Imagine estimating selectivity of clauses WHERE (a=1) AND (b=1) AND (c=1) AND (d=1) with statistics on [a,b,c] and [b,c,d]. The selectivity may be split for example like this: P(a=1,b=1,c=1,d=1) = P(a=1,b=1,c=1) * P(d=1|a=1,b=1,c=1) where P(a=1,b=1,c=1) may be estimated using statistics [a,b,c], and the second may be simplified like this: P(d=1|a=1,b=1,c=1) = P(d=1|b=1,c=1) using the assumption "no multivariate stats => independent". Both these probabilities match the existing statistics. The idea is described a bit more in the part #5 of the patch. (2) choosing the best combination of statistics ----------------------------------------------- There may be more statistics on a table, and multiple possible ways to use them to estimate the clauses (different ordering, overlapping statistics, etc.). The patch formulates this as an optimization task with two goals. (a) cover as many clauses as possible (b) reuse as many conditions (i.e. dependencies) as possible and implements two algorithms to solve this: (a) exhaustive, walking through all possible states (using dynamic programming), and (b) greedy, choosing the best local solution in each step. The time requirements for the exhaustive solution grows pretty quickly with the number of clauses and statistics on a table (~ O(N!)). The greedy is much faster, as it's ~O(N) and in fact much more time is spent in actually processing the selected statistics (walking through the histograms etc.). I assume the exhaustive search may find a better solution in some cases (that the greedy algorithm misses), but so far I've been unable to come up with such example. To make this easier to test, I've added GUC to switch between these algorithms easily (set to 'greedy' by default) mvstat_search = {'greedy', 'exhaustive'} I assume this GUC will be removed eventually, after we figure out which algorithm is the right one. (3) estimation of more complex conditions (AND/OR clauses) ---------------------------------------------------------- I've added ability to estimate more complex clauses - combinations of AND/OR clauses and such. It's somewhat incomplete at the moment, but hopefully the ideas will be clear from the TODOs/FIXMEs along the way. Let me know if you have any questions about this version of the patch, or about the ideas it implements in general. I also welcome real-world examples of poorly estimated queries, so that I can test if these patches improve that particular case situation. regards -- Tomas Vondra http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Вложения
В списке pgsql-hackers по дате отправления: