Re: ANALYZE sampling is too good
From | Greg Stark |
---|---|
Subject | Re: ANALYZE sampling is too good |
Date | |
Msg-id | CAM-w4HPpVDC-EDMui60a4_i4WMhqM=ZTTdshZJyse0O-e5Karw@mail.gmail.com |
In reply to | Re: ANALYZE sampling is too good (Greg Stark <stark@mit.edu>) |
Responses | Re: ANALYZE sampling is too good |
List | pgsql-hackers |
On Wed, Dec 11, 2013 at 11:01 AM, Greg Stark <stark@mit.edu> wrote:
> I'm not actually sure there is any systemic bias here. The larger
> number of rows per block generate less precise results but from my
> thought experiments they seem to still be accurate?

So I've done some empirical tests for a table generated by:

create table sizeskew as (select i,j,repeat('i',i) from generate_series(1,1000) as i, generate_series(1,1000) as j);

I find that using the whole block doesn't cause any problem with the
avg_width field for the "repeat" column. That does reinforce my belief
that we might not need any particularly black magic here.

It does however cause a systemic error in the histogram bounds. It
seems the median is overestimated by more and more the larger the
number of rows used per block:

 1: 524
 4: 549
 8: 571
12: 596
16: 602
20: 618 (total sample slightly smaller than normal)
30: 703 (substantially smaller sample)

So there is something clearly wonky in the histogram stats that's
affected by the distribution of the sample. The only thing I can think
of is maybe the most common elements are being selected preferentially
from the early part of the sample, which is removing a substantial part
of the lower end of the range. But even removing 100 from the beginning
shouldn't be enough to push the median above 550.

-- 
greg
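[For reference, a rough sketch of the baseline setup and the stats query,
assuming stock ANALYZE with the default statistics target; the varying
rows-per-block figures above presumably come from modified sampling code,
not anything you can toggle from SQL:

-- Rebuild the test table: column i is uniform over 1..1000, so the true
-- median of i is about 500; the "repeat" column's width grows with i.
create table sizeskew as
  (select i, j, repeat('i', i)
     from generate_series(1,1000) as i,
          generate_series(1,1000) as j);

analyze sizeskew;

-- Inspect what ANALYZE recorded: avg_width for the "repeat" column and
-- the histogram bounds for i (the middle bound approximates the median).
select attname, avg_width, histogram_bounds
  from pg_stats
 where tablename = 'sizeskew'
   and attname in ('i', 'repeat');
]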