Re: ANALYZE sampling is too good
From: Heikki Linnakangas
Subject: Re: ANALYZE sampling is too good
Date:
Msg-id: 52A86266.9020700@vmware.com
In reply to: Re: ANALYZE sampling is too good (Greg Stark <stark@mit.edu>)
List: pgsql-hackers
On 12/11/2013 02:08 PM, Greg Stark wrote:
> On Wed, Dec 11, 2013 at 11:01 AM, Greg Stark <stark@mit.edu> wrote:
>> I'm not actually sure there is any systemic bias here. The larger
>> number of rows per block generate less precise results but from my
>> thought experiments they seem to still be accurate?
>
> So I've done some empirical tests for a table generated by:
> create table sizeskew as (select i,j,repeat('i',i) from
> generate_series(1,1000) as i, generate_series(1,1000) as j);
>
> I find that using the whole block doesn't cause any problem with the
> avg_width field for the "repeat" column. That does reinforce my belief
> that we might not need any particularly black magic here.

How large a sample did you use? Remember that the point of doing
block-level sampling instead of the current approach would be to allow
using a significantly smaller sample (in # of blocks) while still
achieving the same sampling error. If the sample is "large enough", it
will mask any systemic bias caused by block-sampling, but the point is
to reduce the number of sampled blocks.

The practical question here is this: what happens to the quality of the
statistics if you read only 1/2 the number of blocks you normally
would, but include all the rows from the blocks you do read in the
sample? How about 1/10? Or to put it another way: could we achieve more
accurate statistics by including all rows from the sampled blocks,
while reading the same number of blocks?

In particular, I wonder if it would help with estimating ndistinct. It
generally helps to have a larger sample for ndistinct estimation, so it
might be beneficial.

- Heikki
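A rough way to probe that question empirically with stock PostgreSQL,
without any block-sampling patch: the per-column statistics target
scales how many rows ANALYZE samples (300 x target), so lowering it
approximates the "smaller sample" side of the trade-off, though not the
"read whole blocks" side. The sketch below is illustrative only,
reusing Greg's sizeskew table with arbitrary targets:

    -- Greg's test table; the third column gets the default name "repeat".
    create table sizeskew as (select i, j, repeat('i', i) from
        generate_series(1, 1000) as i, generate_series(1, 1000) as j);

    -- Baseline: default_statistics_target = 100, i.e. ~30,000 sampled rows.
    analyze sizeskew;
    select attname, avg_width, n_distinct
      from pg_stats where tablename = 'sizeskew';

    -- Cut the sample to roughly a tenth and compare the estimates.
    alter table sizeskew alter column i set statistics 10;
    alter table sizeskew alter column "repeat" set statistics 10;
    analyze sizeskew;
    select attname, avg_width, n_distinct
      from pg_stats where tablename = 'sizeskew';

    -- Ground truth for the ndistinct comparison.
    select count(distinct i) as distinct_i, count(*) as total_rows
      from sizeskew;

Comparing n_distinct for column i against the true 1000 distinct values
at each target gives a feel for how quickly the estimate degrades as
the sample shrinks.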