Re: [HACKERS] Re: Top N queries and disbursion
От | Bruce Momjian |
---|---|
Тема | Re: [HACKERS] Re: Top N queries and disbursion |
Дата | |
Msg-id | 199910072353.TAA09358@candle.pha.pa.us обсуждение исходный текст |
Ответ на | Re: Top N queries and disbursion (Tom Lane <tgl@sss.pgh.pa.us>) |
Список | pgsql-hackers |
> No, it's certainly not the right thing. To my understanding, disbursion > is a measure of the frequency of the most common value of an attribute; > but that tells you very little about how many other values there are. > 1/disbursion is a lower bound on the number of values, but it wouldn't > be a good estimate unless you had reason to think that the values were > pretty evenly distributed. There could be a *lot* of very-infrequent > values. > > > with 100 distinct values of an attribute uniformly distribuited in a > > relation of 10000 tuples, disbursion was estimated as 0.002275, giving > > us 440 distinct values. > > This is an illustration of the fact that Postgres' disbursion-estimator > is pretty bad :-(. It usually underestimates the frequency of the most > common value, unless the most common value is really frequent > (probability > 0.2 or so). I've been trying to think of a more accurate > way of figuring the statistic that wouldn't be unreasonably slow. > Or, perhaps, we should forget all about disbursion and adopt some other > statistic(s). Yes, you have the crux of the issue. I wrote it because it was the best thing I could think of, but it is non-optimimal. Because all the optimal solutions seemed too slow to me, I couldn't think of a better one. Here is my narrative on it from vacuum.c: --------------------------------------------------------------------------- * We compute the column min, max, null and non-null counts.* Plus we attempt to find the count of the value that occursmost* frequently in each column* These figures are used to compute the selectivity of the column** We use a three-buckedcache to get the most frequent item* The 'guess' buckets count hits. A cache miss causes guess1* to get themost hit 'guess' item in the most recent cycle, and* the new item goes into guess2. Whenever the total count of hits* of a 'guess' entry is larger than 'best', 'guess' becomes 'best'.** This method works perfectly for columns with uniquevalues, and columns* with only two unique values, plus nulls.** It becomes less perfect as the number of unique valuesincreases and* their distribution in the table becomes more random. -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
В списке pgsql-hackers по дате отправления: