Re: [GENERAL] Large DB
| From | Tom Lane |
|---|---|
| Subject | Re: [GENERAL] Large DB |
| Date | |
| Msg-id | 9088.1080947172@sss.pgh.pa.us |
| In reply to | Re: [GENERAL] Large DB (Manfred Koizar <mkoi-pg@aon.at>) |
| Responses | Re: [GENERAL] Large DB |
| List | pgsql-hackers |
Manfred Koizar <mkoi-pg@aon.at> writes:
> On Fri, 02 Apr 2004 14:48:13 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> A bigger problem is that this makes the sampling quite nonuniform,
>> because rows that are on relatively low-density pages would be more
>> likely to become part of the final sample than rows that are on pages
>> with lots of tuples.

> This sounds like you are assuming that I want to take exactly one tuple
> out of each block of the block sample. This is not the case.

No, I understood that you wanted to resample, but [ ... thinks for awhile ... ] hmm, now I can't construct a failure case either. I must have done the math wrong before.

There's still a risk of not being able to collect N rows out of N blocks, if you are unfortunate enough to select a lot of wholly-empty pages. But that seems like a low-probability scenario; besides, such a table would be so desperately in need of VACUUM FULL that the possible low quality of the stats hardly matters.

You should not need to use the Vitter algorithm for the block-level selection, since you can know the number of blocks in the table in advance. You can just use the traditional method of choosing each block or not with probability (k/K), where k = number of sample blocks still needed, K = number of blocks from here to the end. You'd run the Vitter algorithm separately to decide whether to keep or discard each live row you find in the blocks you read.

I do like this, since it eliminates the current method's bias towards estimating the number of live rows from the density found near the start of the table only. At the end you'd know the total number of live rows on all the pages you read, and it's reasonable to extrapolate that total to the full table size.

Question: if the table size is less than N blocks, are you going to read every block or try to reduce the number of blocks sampled? If you don't adjust the sample size, then I think this would perform worse for intermediate-size tables than the current method does ... perhaps not so much at sample size = 3000, but at larger sizes it would hurt. A lot of people are setting the stats target to 100, which means a sample size of 30000 --- how do the page-access counts look in that case?

        regards, tom lane
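Below is a minimal, self-contained C sketch of the two-stage scheme Tom describes; it is not from the thread or the PostgreSQL source. Stage 1 picks sample blocks with the classic k/K selection method (the block count is known up front), and Stage 2 reservoir-samples the live rows found on those blocks; plain Algorithm R stands in for Vitter's algorithm here. `rows_on_block()` is a made-up stand-in for reading a page, and the numbers are arbitrary.

```c
/*
 * Sketch of two-stage ANALYZE-style sampling (illustrative only).
 * Assumptions: nblocks is known in advance, targrows is the desired
 * row-sample size, and rows_on_block() fakes reading a heap page.
 */
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical page reader: returns the number of live rows on a block. */
static int
rows_on_block(long blockno)
{
    (void) blockno;
    return rand() % 80;         /* pretend pages hold 0..79 live tuples */
}

static double
random_fraction(void)
{
    return (double) rand() / ((double) RAND_MAX + 1.0);
}

int
main(void)
{
    long    nblocks = 100000;   /* total pages in the table, known up front */
    int     targrows = 3000;    /* e.g. 300 * default stats target of 10 */
    long   *sample = malloc(sizeof(long) * targrows);  /* block# as row placeholder */
    long    blocks_read = 0;
    long    liverows_seen = 0;  /* live rows found on pages actually read */
    int     sampled = 0;        /* rows currently held in the reservoir */
    int     blocks_wanted = targrows < nblocks ? targrows : (int) nblocks;

    srand(42);

    /*
     * Stage 1: selection sampling over block numbers.  Each block is taken
     * with probability k/K, where k = sample blocks still needed and
     * K = blocks from here to the end.  No Vitter machinery is needed
     * because the total block count is known.
     */
    for (long b = 0; b < nblocks && blocks_wanted > 0; b++)
    {
        long    K = nblocks - b;

        if (random_fraction() * K < blocks_wanted)
        {
            int     nrows = rows_on_block(b);

            blocks_read++;
            blocks_wanted--;

            /*
             * Stage 2: reservoir-sample the live rows on the chosen blocks.
             * This is Algorithm R; Vitter's Algorithm Z is the faster
             * variant that skips ahead instead of drawing a random number
             * for every row.
             */
            for (int r = 0; r < nrows; r++)
            {
                liverows_seen++;
                if (sampled < targrows)
                    sample[sampled++] = b;      /* reservoir not full yet */
                else
                {
                    long    pos = (long) (random_fraction() * liverows_seen);

                    if (pos < targrows)
                        sample[pos] = b;        /* replace a random slot */
                }
            }
        }
    }

    /*
     * Extrapolate total live rows from the density seen on the pages that
     * were read, rather than from the start of the table only.
     */
    double reltuples = blocks_read > 0 ?
        (double) liverows_seen * nblocks / blocks_read : 0.0;

    printf("blocks read: %ld, live rows seen: %ld, rows sampled: %d\n",
           blocks_read, liverows_seen, sampled);
    printf("estimated live rows in table: %.0f\n", reltuples);
    free(sample);
    return 0;
}
```

The skip-ahead in Vitter's Algorithm Z matters in practice because a page can hold many live tuples, and drawing one random number per tuple (as Algorithm R does above) becomes the dominant cost once the reservoir is full.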