Re: ANALYZE sampling is too good
From | Sergey E. Koposov |
---|---|
Subject | Re: ANALYZE sampling is too good |
Date | |
Msg-id | alpine.LRH.2.00.1312110506150.19468@lnfm1.sai.msu.ru |
In response to | Re: ANALYZE sampling is too good (Peter Geoghegan <pg@heroku.com>) |
Responses | Re: ANALYZE sampling is too good |
List | pgsql-hackers |
For what it's worth, I'll quote the first line of the abstract of Chaudhuri et al. on block sampling:

"Block-level sampling is far more efficient than true uniform-random sampling over a large database, but prone to significant errors if used to create database statistics."

After briefly glancing through the paper, my impression is that their approach works because, after building one version of the statistics, they cross-validate it, see how well it holds up, and then collect more data if the cross-validation error is large (for example, because the data is clustered). Without this step, as far as I can tell, a simple block-based sampler is bound to make catastrophic mistakes, depending on the distribution.

Also, one more point about targets (e.g. X%) for estimating things from samples, as discussed in the thread. There is a point in talking about sampling a fixed target (5%) of the data ONLY if you fix the actual distribution of the data in the table and decide which statistic you are trying to estimate, e.g. the average, the standard deviation, a 90th percentile, ndistinct, a histogram, and so forth. There won't be a general answer, because the required percentage is both distribution dependent and statistic dependent.

Cheers,
Sergey

PS I'm not a statistician, but I use statistics a lot.

*******************************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551
Web: http://www.ast.cam.ac.uk/~koposov/
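
To illustrate the clustering point made in the message, here is a minimal Python sketch. It is a toy model, not the algorithm from Chaudhuri et al. and not PostgreSQL's ANALYZE code: the table layout, the helper names (`make_clustered_table`, `row_sample`, `block_sample`, `disagreement`) and the 5% target are all assumptions chosen for illustration. It compares a row-level sample with a block-level sample on a perfectly clustered table, and uses a crude check loosely inspired by the cross-validation idea: draw two independent samples and see how much the second one disagrees with the first.

```python
import random

ROWS_PER_BLOCK = 100  # assumed block size for this toy table


def make_clustered_table(n_blocks):
    """Toy table: block b holds ROWS_PER_BLOCK copies of the value b."""
    return [[b] * ROWS_PER_BLOCK for b in range(n_blocks)]


def row_sample(table, n_rows, rng):
    """True uniform-random sample of individual rows."""
    flat = [v for block in table for v in block]
    return rng.sample(flat, n_rows)


def block_sample(table, n_rows, rng):
    """Pick whole blocks at random and keep every row in them."""
    blocks = rng.sample(table, n_rows // ROWS_PER_BLOCK)
    return [v for block in blocks for v in block]


def disagreement(sample_a, sample_b):
    """Fraction of distinct values in sample_b that sample_a never saw."""
    seen_a, seen_b = set(sample_a), set(sample_b)
    return len(seen_b - seen_a) / len(seen_b)


rng = random.Random(0)
table = make_clustered_table(1000)  # 100,000 rows, 1,000 distinct values
n = 5000                            # a 5% sample

for name, sampler in (("row", row_sample), ("block", block_sample)):
    a = sampler(table, n, rng)
    b = sampler(table, n, rng)
    print(name, "distinct seen:", len(set(a)),
          "disagreement:", round(disagreement(a, b), 2))
```

On this table the block sample reads 50 blocks and therefore sees exactly 50 of the 1,000 distinct values, while the row sample of the same size sees nearly all of them. The two block samples also disagree heavily with each other, which is the kind of signal that could prompt collecting more data. On a table whose values are scattered randomly across blocks the two samplers would behave much alike, so the same 5% target is adequate or badly inadequate depending on the distribution and on the statistic being estimated.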