stricter MCV tests for uniform distributions (was Re: MCV lists forhighly skewed distributions)
От | John Naylor |
---|---|
Тема | stricter MCV tests for uniform distributions (was Re: MCV lists forhighly skewed distributions) |
Дата | |
Msg-id | CAJVSVGVzZgYV3dHOnnRgLdW9B0mocyTgwWvDfMK7rO8NynwO4g@mail.gmail.com обсуждение исходный текст |
Ответы |
Re: stricter MCV tests for uniform distributions (was Re: MCV listsfor highly skewed distributions)
|
Список | pgsql-hackers |
(Starting a new thread so as not to distract review) On 1/21/18, Dean Rasheed <dean.a.rasheed@gmail.com> wrote: > On 21 January 2018 at 07:26, John Naylor <jcnaylor@gmail.com> wrote: >> I spent a few hours hacking on this, and it turns out calculating the >> right number of MCVs taking into account both uniform and highly >> non-uniform distributions is too delicate a problem for me to solve >> right now. The logic suggested by Dean Rasheed in [1] always produces >> no MCVs for a perfectly uniform distribution (which is good), but very >> often also for other distributions, which is not good. My efforts to >> tweak that didn't work, so I didn't get as far as adapting it for the >> problem Jeff is trying to solve. > > Hmm, Tom suggested that the test based on the average frequency over > all values might be too strict because the estimated number of > distinct values is often too low, so that might explain what you're > seeing. In my test tables, I've noticed that our Ndistinct estimator is most inaccurate for geometric distributions, so that's certainly possible, but confusingly, it occasionally gave an empty MCV list along with a histogram with a boundary duplicated 5 times, which I thought I was guarding against. I'm thinking my implementation of your logic is flawed somehow. In case you're curious I've attached my rough (complier warnings and all) test patch. > It occurs to me that maybe a better test to exclude a value from the > MCV list would be to demand that its relative standard error not be > too high. Such a test, in addition to the existing tests, might be > sufficient to solve the opposite problem of too many values in the MCV > list, because the real problem there is including a value after having > seen relatively few occurrences of it in the sample, and thus having a > wildly inaccurate estimate for it. Setting a bound on the relative > standard error would mean that we could have a reasonable degree of > confidence in estimates produced from the sample. If you don't mind, what would the math look like for that? -John Naylor
Вложения
В списке pgsql-hackers по дате отправления: