Re: ANALYZE sampling is too good
From: Greg Stark
Subject: Re: ANALYZE sampling is too good
Date:
Msg-id: CAM-w4HMFAfAAUp5YDEFkRTNt7UbrxT-9ecAJUNgwv2DyC=weGg@mail.gmail.com
In reply to: Re: ANALYZE sampling is too good (Josh Berkus <josh@agliodbs.com>)
Responses: Re: ANALYZE sampling is too good
           Re: ANALYZE sampling is too good
List: pgsql-hackers
On Mon, Dec 9, 2013 at 6:03 PM, Josh Berkus <josh@agliodbs.com> wrote:
> It's also applicable for the other stats; histogram buckets constructed
> from a 5% sample are more likely to be accurate than those constructed
> from a 0.1% sample. Same with nullfrac. The degree of improved
> accuracy, would, of course, require some math to determine.

This "some math" is straightforward basic statistics. The 95% confidence
interval for a sample of 300 drawn from a population of 1 million is about
+/- 5.66%; for a sample of 1,000 it is about +/- 3.1%.

The histogram and nullfrac answer the same kind of question as a political
poll: "what fraction of the population falls within this subset?" This is
why pollsters don't need to sample 15 million Americans to get a decent
poll result. That's just not how the math works for these kinds of
questions.

n_distinct is an entirely different kettle of fish. It's a different kind
of problem, and the error rate there *is* going to depend on the
percentage of the total population that you sampled. Moreover, from the
papers I've read I'm convinced any sample of less than 50-80% is nearly
useless, so you can't get good results without reading the whole table.

--
greg
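
For reference, those figures follow from the standard worst-case margin of
error under a normal approximation; a minimal sketch, assuming simple random
sampling, worst-case proportion p = 0.5, and ignoring the finite-population
correction (negligible for a 1-million-row table):

    import math

    def margin_of_error(n, p=0.5, z=1.96):
        # 95% margin of error for an estimated proportion, assuming
        # simple random sampling and the worst case p = 0.5
        return z * math.sqrt(p * (1 - p) / n)

    print(margin_of_error(300))   # ~0.0566 -> +/- 5.66%
    print(margin_of_error(1000))  # ~0.031  -> +/- 3.1%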