Re: estimating # of distinct values
From | Tomas Vondra |
---|---|
Subject | Re: estimating # of distinct values |
Date | |
Msg-id | 4D38AE45.5070101@fuzzy.cz |
In response to | Re: estimating # of distinct values (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>) |
List | pgsql-hackers |
On 20.1.2011 09:10, Heikki Linnakangas wrote:

> It seems that the suggested multi-column selectivity estimator would be
> more sensitive to ndistinct of the individual columns. Is that correct?
> How is it biased? If we routinely under-estimate ndistinct of individual
> columns, for example, does the bias accumulate or cancel itself in the
> multi-column estimate?
>
> I'd like to see some testing of the suggested selectivity estimator with
> the ndistinct estimates we have. Who knows, maybe it works fine in
> practice.

The estimator for two columns and the query 'A=a AND B=b' is about

    0.5 * (dist(A)/dist(A,B) * Prob(A=a) + dist(B)/dist(A,B) * Prob(B=b))

so it's quite simple. It's not that sensitive to errors in the ndistinct estimates for individual columns; the problem is in the multi-column ndistinct estimates. It's very likely that with dependent columns (e.g. ZIP codes / cities) the distribution is so pathological that the sampling-based estimate will be way off.

I guess this analysis was far too short, but if you can provide more details of the expected tests etc., I'll be happy to provide that.

regards
Tomas
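To make the two-column formula in the message concrete, here is a minimal Python sketch. The function name, parameter names and all the numbers are made up for illustration; they are not from the original thread or from PostgreSQL itself. It assumes ZIP code determines city, so dist(A,B) equals the number of ZIP codes.

```python
def estimate_selectivity(dist_a, dist_b, dist_ab, prob_a, prob_b):
    """Estimate Prob(A=a AND B=b) with the two-column formula from the message.

    dist_a, dist_b -- estimated distinct counts of columns A and B
    dist_ab        -- estimated distinct count of (A, B) combinations
    prob_a, prob_b -- single-column selectivities Prob(A=a) and Prob(B=b)
    (All names are hypothetical, chosen for this sketch.)
    """
    if dist_ab <= 0:
        raise ValueError("dist_ab must be positive")
    return 0.5 * (dist_a / dist_ab * prob_a + dist_b / dist_ab * prob_b)


# Hypothetical data: A = ZIP code (1000 distinct), B = city (100 distinct),
# uniform distributions, and each ZIP belongs to exactly one city.
exact = estimate_selectivity(1000, 100, 1000, 0.001, 0.01)

# Same inputs, but dist(A) is over-estimated by a factor of 2.
dist_a_off = estimate_selectivity(2000, 100, 1000, 0.001, 0.01)

# Same inputs, but dist(A,B) is over-estimated by a factor of 10.
dist_ab_off = estimate_selectivity(1000, 100, 10000, 0.001, 0.01)

print(exact, dist_a_off, dist_ab_off)
```

With these made-up inputs the estimate is about 0.001 when all counts are right, about 0.0015 when dist(A) is doubled, and about 0.0001 when dist(A,B) is ten times too large. An error in one single-column count touches only one of the two terms, while an error in dist(A,B) scales the whole estimate, which is in line with the concern raised above.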