Re: multivariate statistics (v19)
От | Petr Jelinek |
---|---|
Тема | Re: multivariate statistics (v19) |
Дата | |
Msg-id | 3eb2a4b6-911b-29d9-7280-bbf4837acb05@2ndquadrant.com обсуждение исходный текст |
Ответ на | Re: multivariate statistics (v19) (Tomas Vondra <tomas.vondra@2ndquadrant.com>) |
Ответы |
Re: multivariate statistics (v19)
|
Список | pgsql-hackers |
On 10/08/16 13:33, Tomas Vondra wrote: > On 08/10/2016 06:41 AM, Michael Paquier wrote: >> On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra >>> 2) combining multiple statistics >>> >>> I think the ability to combine multivariate statistics (covering >>> different >>> subsets of conditions) is important and useful, but I'm starting to >>> think >>> that the current implementation may not be the correct one (which is >>> why I >>> haven't written the SGML docs about this part of the patch series yet). >>> >>> Assume there's a table "t" with 3 columns (a, b, c), and that we're >>> estimating query: >>> >>> SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3 >>> >>> but that we only have two statistics (a,b) and (b,c). The current >>> patch does >>> about this: >>> >>> P(a=1,b=2,c=3) = P(a=1,b=2) * P(c=3|b=2) >>> >>> i.e. it estimates the first two conditions using (a,b), and then >>> estimates >>> (c=3) using (b,c) with "b=2" as a condition. Now, this is very >>> efficient, >>> but it only works as long as the query contains conditions >>> "connecting" the >>> two statistics. So if we remove the "b=2" condition from the query, this >>> stops working. >> >> This is trying to make the algorithm smarter than the user, which is >> something I'd think we could live without. In this case statistics on >> (a,c) or (a,b,c) are missing. And what if the user does not want to >> make use of stats for (a,c) because he only defined (a,b) and (b,c)? >> > > I don't think so. Obviously, if you have statistics covering all the > conditions - great, we can't really do better than that. > > But there's a crucial relation between the number of dimensions of the > statistics and accuracy of the statistics. Let's say you have statistics > on 8 columns, and you split each dimension twice to build a histogram - > that's 256 buckets right there, and we only get ~50% selectivity in each > dimension (the actual histogram building algorithm is more complex, but > you get the idea). > I think it makes sense to pursue this, but I also think we can easily live with not having it in the first version that gets committed and doing it as follow-up patch. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
В списке pgsql-hackers по дате отправления: