Re: pg_statistics and sample size WAS: Overhauling GUCS
| From | Josh Berkus |
|---|---|
| Subject | Re: pg_statistics and sample size WAS: Overhauling GUCS |
| Date | |
| Msg-id | 200806100834.51471.josh@agliodbs.com |
| In reply to | Re: Overhauling GUCS (Gregory Stark <stark@enterprisedb.com>) |
| List | pgsql-hackers |
Greg,

> The analogous case in our situation is not having 300 million distinct
> values, since we're not gathering info on specific values, only the
> buckets. We need, for example, 600 samples *for each bucket*. Each bucket
> is chosen to have the same number of samples in it. So that means that we
> always need the same number of samples for a given number of buckets.

I think that's plausible. The issue is that, in advance of the sampling, we
don't know how many buckets there *are*. So we first need a proportional
sample to determine the number of buckets, and then we need to retain a
histogram sample proportional to that number of buckets.

I'd like to see someone with a PhD in this weigh in, though.

> Really? Could you send references? The paper I read surveyed previous work
> and found that you needed to scan up to 50% of the table to get good
> results. 50-250% is considerably looser than what I recall it considering
> "good" results so these aren't entirely inconsistent but I thought previous
> results were much worse than that.

Actually, based on my several years of selling performance tuning, I
generally found that as long as estimates were correct within a factor of 3
(33% to 300%), the correct plan was chosen. There are papers on block-based
sampling which were already cited on -hackers; I'll hunt through the
archives later.

--
Josh Berkus
PostgreSQL @ Sun
San Francisco
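[Editor's note: a minimal back-of-the-envelope sketch of the two quantities discussed above, not PostgreSQL's actual ANALYZE code. The 600-samples-per-bucket figure comes from Greg's quoted message, and the factor-of-3 tolerance from Josh's reply; the function names (`required_sample_size`, `estimate_within_factor`) and the standalone-script framing are illustrative assumptions.]

```python
# Sketch only: illustrates the arithmetic in the thread, not ANALYZE itself.

SAMPLES_PER_BUCKET = 600  # figure quoted from Greg's message, assumed fixed here


def required_sample_size(num_buckets: int) -> int:
    """Rows to sample if every equi-depth histogram bucket needs a fixed
    number of sampled rows: the total scales linearly with the bucket count,
    which is the quantity the first proportional sample has to establish."""
    return SAMPLES_PER_BUCKET * num_buckets


def estimate_within_factor(estimate: float, actual: float, factor: float = 3.0) -> bool:
    """True if a row-count estimate is within a factor of `factor` of the
    actual value (33%..300% for factor=3), the tolerance Josh describes as
    usually sufficient for the planner to pick the correct plan."""
    if actual <= 0:
        return estimate <= 0
    ratio = estimate / actual
    return 1.0 / factor <= ratio <= factor


if __name__ == "__main__":
    for buckets in (10, 100, 1000):
        print(buckets, "buckets ->", required_sample_size(buckets), "sampled rows")
    print(estimate_within_factor(40_000, 100_000))   # True: 40% of actual
    print(estimate_within_factor(400_000, 100_000))  # False: 4x the actual
```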