Re: ANALYZE sampling is too good
From: Robert Haas
Subject: Re: ANALYZE sampling is too good
Date:
Msg-id: CA+TgmoZdyPmwHKgsW3PuZKg87ZqiuF2MChwZYXtV7qMh8CrMMg@mail.gmail.com
In reply to: Re: ANALYZE sampling is too good (Jeff Janes <jeff.janes@gmail.com>)
Responses: Re: ANALYZE sampling is too good
List: pgsql-hackers
On Mon, Dec 9, 2013 at 4:18 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> My reading of the code is that if it is not in the MCV, then it is
> assumed to have the average selectivity (about 1/n_distinct, but
> deflating top and bottom for the MCV list). There is also a check that
> it is less than the least common of the MCV, but I don't know why that
> situation would ever prevail--that should always be higher or equal to
> the average selectivity.

I've never seen an n_distinct value of more than 5 digits, regardless of reality. Typically I've seen 20-50k, even if the real number is much higher. But the n_distinct value is only for non-MCVs, so if we estimate the selectivity of column = 'rarevalue' to be (1-nullfrac-mcvfrac)/n_distinct, then making mcvfrac bigger reduces the estimate, and making the MCV list longer naturally makes mcvfrac bigger. I'm not sure how important the less-frequent-than-the-least-common-MCV part is, but I'm very sure that raising the statistics target helps to solve the problem of overestimating the prevalence of uncommon values in a very big table.

> I think that parts of the planner are N^2 in the size of the histogram
> (or was that the size of the MCV list?). So we would probably need a
> way to use a larger sample size to get more accurate n_distinct and MCV
> frequencies, but not save the entire histogram that goes with that
> sample size.

I think the saving-the-histogram part is important. As you say, the MCVs are important for a variety of planning purposes, such as hash joins. More than that, in my experience, people with large tables are typically very willing to spend more planning time to get a better plan, because mistakes are expensive and the queries are likely to run for a while anyway.
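[The estimate being discussed can be sketched roughly as follows. This is a simplified illustration of the (1-nullfrac-mcvfrac)/n_distinct formula with made-up statistics values, not the actual logic in PostgreSQL's selfuncs.c:]

```python
def non_mcv_selectivity(nullfrac, mcv_fracs, n_distinct):
    """Rough sketch: estimated selectivity of col = 'rarevalue' for a
    value NOT in the MCV list.  The non-null, non-MCV fraction of rows
    is assumed to be spread evenly over the remaining distinct values."""
    mcvfrac = sum(mcv_fracs)                      # row fraction covered by the MCV list
    other_distinct = n_distinct - len(mcv_fracs)  # distinct values outside the MCV list
    sel = (1.0 - nullfrac - mcvfrac) / other_distinct
    # The clamp mentioned above: a non-MCV value should never be
    # estimated as more common than the least common MCV.
    return min(sel, min(mcv_fracs))

# Hypothetical stats: 10% NULLs, three MCVs covering 35% of rows, 1000 distinct values.
sel = non_mcv_selectivity(0.10, [0.20, 0.10, 0.05], 1000)
print(sel)  # (1 - 0.10 - 0.35) / 997, well below the least common MCV's 0.05
```

[Note how lengthening the MCV list (a bigger statistics target) both grows mcvfrac and removes more values from the "other" pool, which is why it shrinks the estimate for genuinely rare values.]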
People with small tables care about planning time, because it makes no sense to spend an extra 1ms planning a query unless you improve the plan by enough to save at least 1ms when executing it, and when the tables are small and access is expected to be fast anyway that's often not the case.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company