Re: benchmarking the query planner
From: Greg Stark
Subject: Re: benchmarking the query planner
Date:
Msg-id: 4136ffa0812111823u645b6ec9wdca60b3da4b00499@mail.gmail.com
In reply to: Re: benchmarking the query planner (Simon Riggs <simon@2ndQuadrant.com>)
Responses: Re: benchmarking the query planner
List: pgsql-hackers
On Thu, Dec 11, 2008 at 11:44 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> On Thu, 2008-12-11 at 22:29 +0000, Gregory Stark wrote:
>
>> > And I would like it even more if the sample size increased according
>> > to table size, since that makes ndistinct values fairly random for
>> > large tables.
>>
>> Unfortunately _any_ ndistinct estimate based on a sample of the table
>> is going to be pretty random.
>
> We know that constructed data distributions can destroy the
> effectiveness of the ndistinct estimate and make sample size irrelevant.
> But typical real-world data distributions do improve their estimations
> with increased sample size, and so it is worthwhile.

Well, that just means "more is always better", which puts us back in the
same boat of always needing more.

The existing sampling mechanism is tied to solid statistics. It provides
the correct sample size to get a consistent confidence range for range
queries. This is the same mathematics that governs election polling and
other surveys. The sample size you need to get +/-5% 19 times out of 20
increases as the population increases, but not by very much (see the
worked numbers below).

However, ndistinct is a different kind of question. Instead of the
relatively small and slowly growing sample size, you need a percentage of
the whole table. It would mean letting this one measure control the whole
sample size decision, and the end result would be pretty unimpressive:
the papers I read showed pretty poor results even for sample sizes as
large as 50% of the table.

Simply raising the statistics target for larger tables would not be very
helpful either. All it would do is raise the table size at which you find
the sample size too small. You would have to change to using a percentage
of the whole table -- which would make ANALYZE a *lot* slower for large
tables.

Really, the only way we'll get good ndistinct calculations is if we
decide to scan the whole table. If we have to do that for some other
reason, then there are algorithms for gathering enough information in a
single pass to handle ndistinct properly (a sketch of one such estimator
follows below).

--
greg
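For reference, a worked version of the polling math above. This is a
standalone back-of-the-envelope sketch (the function name and numbers are
illustrative, not anything in PostgreSQL): the classic sample-size formula
with the finite-population correction shows how weakly the required sample
depends on the size of the table.

    # Sample size for a +/-5% margin of error at 95% confidence
    # ("19 times out of 20"), with the finite-population correction.
    def sample_size(population, margin=0.05, z=1.96, p=0.5):
        n0 = (z ** 2) * p * (1 - p) / margin ** 2  # infinite-population case, ~384
        return n0 / (1 + (n0 - 1) / population)    # finite-population correction

    for rows in (10 ** 4, 10 ** 6, 10 ** 8):
        print("%12d rows -> sample of ~%.0f" % (rows, sample_size(rows)))
    # 10,000 rows need ~370; 1,000,000 and 100,000,000 both need ~384.
    # The required sample barely grows with the population, which is why a
    # fixed statistics target holds up for histogram-style estimates.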
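And a minimal sketch of the kind of one-pass algorithm alluded to at the
end. This is one possibility (a "k minimum values", or KMV, estimator),
not what ANALYZE actually implements: hash every value seen during a full
scan onto [0, 1), keep only the k smallest hashes, and estimate ndistinct
as (k - 1) divided by the k-th smallest hash.

    import hashlib
    import heapq

    def estimate_ndistinct(values, k=1024):
        # One-pass KMV estimator; needs only O(k) memory during a full scan.
        heap = []      # max-heap via negation: tracks the k smallest hashes
        seen = set()   # hashes currently tracked, so duplicates are skipped
        for v in values:
            h = int.from_bytes(hashlib.sha1(repr(v).encode()).digest()[:8], "big")
            x = h / 2.0 ** 64                  # pseudo-uniform in [0, 1)
            if x in seen:
                continue
            if len(heap) < k:
                heapq.heappush(heap, -x)
                seen.add(x)
            elif x < -heap[0]:                 # beats the current k-th smallest
                seen.discard(-heapq.heappushpop(heap, -x))
                seen.add(x)
        if len(heap) < k:
            return float(len(heap))            # fewer than k distinct: exact count
        return (k - 1) / -heap[0]              # standard KMV estimate

The point is that the error of such an estimator depends on k (roughly
1/sqrt(k), so a few percent at k=1024) and not on the table size -- exactly
the property that block sampling cannot deliver.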