Re: [PERFORM] Bad n_distinct estimation; hacks suggested?
От | Josh Berkus |
---|---|
Тема | Re: [PERFORM] Bad n_distinct estimation; hacks suggested? |
Дата | |
Msg-id | 200504241130.50218.josh@agliodbs.com обсуждение исходный текст |
Ответ на | Re: [PERFORM] Bad n_distinct estimation; hacks suggested? (Andrew Dunstan <andrew@dunslane.net>) |
Список | pgsql-hackers |
Andrew, > The math in the paper does not seem to look at very low levels of q (= > sample to pop ratio). Yes, I think that's the failing. Mind you, I did more testing and found out that for D/N ratios of 0.1 to 0.3, the formula only works within 5x accuracy (which I would consider acceptable) with a sample size of 25% or more (which is infeasable in any large table). The formula does work for populations where D/N is much lower, say 0.01. So overall it seems to only work for 1/4 of cases; those where n/N is large and D/N is low. And, annoyingly, that's probably the population where accurate estimation is least crucial, as it consists mostly of small tables. I've just developed (not original, probably, but original to *me*) a formula that works on populations where n/N is very small and D/N is moderate (i.e. 0.1 to 0.4): N * (d/n)^(sqrt(N/n)) However, I've tested it only on (n/N < 0.005 and D/N > 0.1 and D/N < 0.4) populations, and only 3 of them to boot. I'd appreciate other people trying it on their own data populations, particularly very different ones, like D/N > 0.7 or D/N < 0.01. Further, as Andrew points out we presumably do page sampling rather than purely random sampling so I should probably read the paper he referenced. Working on it now .... -- Josh Berkus Aglio Database Solutions San Francisco
В списке pgsql-hackers по дате отправления: