Re: proposal : cross-column stats
From | tv@fuzzy.cz |
---|---|
Subject | Re: proposal : cross-column stats |
Date | |
Msg-id | 0f997798b71d041e31262a49e361db88.squirrel@sq.gransy.com |
In reply to | Re: proposal : cross-column stats (Florian Pflug <fgp@phlo.org>) |
Responses | Re: proposal : cross-column stats |
List | pgsql-hackers |
> On Dec17, 2010, at 23:12 , Tomas Vondra wrote:
>> Well, not really - I haven't done any experiments with it. For two
>> columns selectivity equation is
>>
>>    (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B))
>>
>> where A and B are columns, dist(X) is number of distinct values in
>> column X and sel(X) is selectivity of column X.
>
> Huh? This is the selectivity estimate for "A = x AND B = y"? Surely,
> if A and B are independent, the formula must reduce to sel(A) * sel(B),
> and I cannot see how that'd work with the formula above.

Yes, it's a selectivity estimate for P(A=a and B=b). It's based on
conditional probability, as

   P(A=a and B=b) = P(A=a|B=b)*P(B=b) = P(B=b|A=a)*P(A=a)

and on a "uniform correlation" assumption, so that it's possible to replace
the conditional probabilities with constants. Those constants are then
estimated as dist(A)/dist(A,B) or dist(B)/dist(A,B).

So it does not reduce to sel(A)*sel(B) exactly, as dist(A)/dist(A,B) is
just an estimate of P(B|A). The paper states that this works best for
highly correlated data, while for weakly correlated data it (at least)
matches the usual estimates.

I don't say it's perfect, but it seems to produce reasonable estimates.

Tomas
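To make the behaviour of the formula concrete, here is a minimal Python sketch that evaluates it next to the usual independence estimate sel(A)*sel(B). The function names and the input numbers are made up for illustration; they are not taken from the paper or from any PostgreSQL code, and the sketch assumes that each column on its own is uniformly distributed, so sel(X) = 1/dist(X).

```python
import math


def cross_column_selectivity(dist_a, sel_a, dist_b, sel_b, dist_ab):
    """Estimate sel(A = a AND B = b) with the formula quoted above.

    Hypothetical helper, not an existing API. It averages the two
    conditional-probability estimates
        dist(A)/dist(A,B) * sel(A)   and   dist(B)/dist(A,B) * sel(B).
    """
    if dist_ab <= 0:
        raise ValueError("dist(A,B) must be positive")
    return (dist_a * sel_a + dist_b * sel_b) / (2.0 * dist_ab)


def independence_selectivity(sel_a, sel_b):
    """The usual estimate that treats A and B as independent."""
    return sel_a * sel_b


# Case 1 (assumed data): independent, uniform columns.
# 10 distinct values in A, 20 in B, so all 10 * 20 = 200 pairs occur.
indep = cross_column_selectivity(10, 1 / 10, 20, 1 / 20, 200)
indep_baseline = independence_selectivity(1 / 10, 1 / 20)

# Case 2 (assumed data): A fully determines B and vice versa.
# 100 distinct values in each column and only 100 distinct pairs.
corr = cross_column_selectivity(100, 1 / 100, 100, 1 / 100, 100)
corr_baseline = independence_selectivity(1 / 100, 1 / 100)

print("independent:", indep, "vs product", indep_baseline)
print("correlated: ", corr, "vs product", corr_baseline)
```

Under these assumptions the first case gives the same value as the product, 1/(dist(A)*dist(B)), because dist(A)*sel(A) = dist(B)*sel(B) = 1 and dist(A,B) = dist(A)*dist(B). In the second case the formula gives 1/100, while the product underestimates by a factor of 100. With non-uniform per-column selectivities the two estimates need not coincide even for independent columns, which is the "does not reduce exactly" caveat from the reply.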