Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"?
From:        Mark Kirkwood
Subject:     Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"?
Msg-id:      f6ced95d-54d0-9b01-1588-f768a2b2a097@catalyst.net.nz
In reply to: [HACKERS] Make ANALYZE more selective about what is a "most common value"? (Tom Lane <tgl@sss.pgh.pa.us>)
Responses:   Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"?
List:        pgsql-hackers
On 05/06/17 09:30, Tom Lane wrote:
> I've been thinking about the behavior discussed in
> https://www.postgresql.org/message-id/flat/20170522132017.29944.48391%40wrigleys.postgresql.org
> and it seems to me that there are a couple of things we ought to do about it.
>
> First, I think we need a larger hard floor on the number of occurrences
> of a value that're required to make ANALYZE decide it is a "most common
> value".  The existing coding is willing to believe that anything that
> appears at least twice in the sample is a potential MCV, but that design
> originated when we were envisioning stats samples of just a few thousand
> rows --- specifically, default_statistics_target was originally just 10,
> leading to a 3000-row sample size.  So accepting two-appearance values as
> MCVs would lead to a minimum MCV frequency estimate of 1/1500.  Now it
> could be a tenth or a hundredth of that.
>
> As a round number, I'm thinking that a good floor would be a frequency
> estimate of 1/1000.  With today's typical sample size of 30000 rows,
> a value would have to appear at least 30 times in the sample to be
> believed to be an MCV.  That seems like it gives us a reasonable margin
> of error against the kind of sampling noise seen in the above-cited thread.
>
> Second, the code also has a rule that potential MCVs need to have an
> estimated frequency at least 25% larger than what it thinks the "average"
> value's frequency is.  A rule of that general form seems like a good idea,
> but I now think the 25% threshold is far too small to do anything useful.
> In particular, in any case like this where there are more distinct values
> than there are sample rows, the "average frequency" estimate will
> correspond to less than one occurrence in the sample, so that this rule is
> totally useless to filter anything that we would otherwise consider as an
> MCV.  I wonder if we shouldn't make it be "at least double the estimated
> average frequency".
>

Or possibly calculate the sample standard deviation and make use of that to
help decide on a more flexible cutoff than twice the avg frequency?

Are there any research papers that might help us here? (I'm drowning in a
sea of barely relevant search results for most phrases I've tried so far.)
I recall there were some that Tom referenced when this stuff was originally
written.

On the other hand, I do have access to some mathematicians specializing in
statistics - so I can get their thoughts on this issue if you feel it would
be worthwhile.

Cheers

Mark
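To make the comparison concrete, here is a rough standalone sketch of the
cutoff rules being discussed - the hard 1/1000 frequency floor plus the
"double the average frequency" test, and a possible standard-deviation-based
cutoff. This is not the actual analyze.c code; the function names
(value_is_mcv, stddev_cutoff) and parameters are just placeholders for
whatever the real caller would have on hand.

/*
 * Illustrative sketch only, not the analyze.c implementation.
 */
#include <math.h>
#include <stdbool.h>
#include <stddef.h>

/* Would a value seen "count" times qualify as an MCV under the proposed rules? */
static bool
value_is_mcv(int count, int samplerows, double ndistinct)
{
	double		avg_count = (double) samplerows / ndistinct;

	/* Hard floor: frequency estimate of at least 1/1000 (30 in a 30000-row sample). */
	if (count < samplerows / 1000)
		return false;

	/* Must be at least double the estimated average frequency. */
	if ((double) count < 2.0 * avg_count)
		return false;

	return true;
}

/*
 * Alternative cutoff: mean plus two sample standard deviations of the
 * observed per-value counts, instead of a fixed multiple of the average.
 */
static double
stddev_cutoff(const int *counts, size_t n)
{
	double		sum = 0.0;
	double		sumsq = 0.0;
	double		mean;
	double		var;
	size_t		i;

	for (i = 0; i < n; i++)
	{
		sum += counts[i];
		sumsq += (double) counts[i] * counts[i];
	}
	mean = sum / n;

	/* sample variance, guarding against n <= 1 and rounding error */
	var = (n > 1) ? (sumsq - n * mean * mean) / (double) (n - 1) : 0.0;
	if (var < 0.0)
		var = 0.0;

	return mean + 2.0 * sqrt(var);
}

With a skewed count distribution the mean-plus-two-sigma cutoff adapts to the
data rather than being a fixed ratio, which is the sort of flexibility I had
in mind; whether two sigma is the right multiplier is exactly the kind of
question the statisticians could weigh in on.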