Re: [HACKERS] Index Puzzle for you
От | Tom Lane |
---|---|
Тема | Re: [HACKERS] Index Puzzle for you |
Дата | |
Msg-id | 3841.946480843@sss.pgh.pa.us обсуждение исходный текст |
Ответ на | Re: [HACKERS] Index Puzzle for you (Kristofer Munn <kmunn@munn.com>) |
Список | pgsql-hackers |
Kristofer Munn <kmunn@munn.com> writes: > Good catch! There were 296 possible issues the table. One had 86,544 > articles associated with it. The next highest was 5,949. Then the > numbers drop to 630, 506, 412, 184 and then the rest are all under 62. > Out of curiosity, how does vacuum decide on the large estimate? The estimate is made using a "disbursion" statistic calculated by VACUUM ANALYZE. I don't know the whole statistical theory here, but if you think of disbursion as the fraction of values in the column that are equal to the most common value, you won't be too far off. I gather from your numbers that your table has about 1 million rows, so the disbursion of ixissue would be somewhere around 86544/1000000. The planner uses the disbursion in a way that amounts to assuming that any "WHERE column = constant" search is in fact searching for the most common value, so we get an estimate of returned rows that is in the vicinity of the number of rows with the most common value. (It's not exact, first because VACUUM can't estimate that number perfectly accurately, and second because the disbursion actually has some second- order terms in it too.) When the most common value is much more common than anything else, this essentially means that queries are always optimized for retrieving the most common value, even when they're retrieving some other value. In your particular case, the optimizer is estimating that the runtime of an index scan that needs to retrieve almost 10% of the rows in the table will be worse than the runtime of a plain sequential scan. I'm not sure if that's right or not (the cost models could use more work), but the first-order mistake is that the estimate of retrieved rows is way off --- unless you are actually retrieving that one hugely popular issue. In current sources (7.0-to-be), VACUUM records the most common value along with the disbursion, and the planner checks to see if the "constant" in the WHERE clause is that value or not. If not, it doesn't use the disbursion straight-up, but a smaller estimate. This helps a good deal on drastically skewed column distributions such as you are describing. It's still easily fooled :-(, but it's hard to see how to do much better without expending a lot more space to store a lot more statistics. regards, tom lane
В списке pgsql-hackers по дате отправления: