Re: query slows down with more accurate stats
From: Tom Lane
Subject: Re: query slows down with more accurate stats
Date:
Msg-id: 25762.1082390410@sss.pgh.pa.us
In reply to: Re: query slows down with more accurate stats (Manfred Koizar <mkoi-pg@aon.at>)
Responses: Number of pages in a random sample (was: query slows down with more accurate stats)
List: pgsql-performance
Manfred Koizar <mkoi-pg@aon.at> writes:
> Random sampling is more like "every possible sample is equally likely to
> be collected", and two-stage sampling doesn't satisfy this condition.

Okay, I finally see the point here: in the limit as the number of pages B goes to infinity, you'd expect the probability that each tuple in your sample came from a different page to go to 1.  But this doesn't happen in the two-stage sampling method: the probability doesn't increase beyond the value it would have for B = n.  On average each sample page would supply one tuple, but the odds that this holds *exactly* would be pretty low.

However, the existing sampling method has glaring flaws of its own, in particular the fact that a tuple whose slot is preceded by N empty slots is N times more likely to be picked than one that has no empty-slot predecessors.  The fact that the two-stage method artificially constrains the sample to come from only n pages seems like a minor problem by comparison; I'd happily accept it to get rid of the empty-slot bias.

A possible compromise is to limit the number of pages sampled to something a bit larger than n, perhaps 2n or 3n.  I don't have a feeling for the shape of the different-pages probability function; would this make a significant difference, or would it just waste cycles?

			regards, tom lane
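Both effects discussed above can be checked numerically. The sketch below is a hypothetical model, not PostgreSQL's actual sampler: `p_all_distinct(n, B)` computes the probability that n tuples, each landing on an independently and uniformly chosen page, all fall on distinct pages (this tends to 1 as B grows, for fixed n); `pick_with_slot_bias` models the empty-slot bias as "pick a random slot, then advance to the next live tuple", under which a tuple with N empty predecessor slots gets N+1 chances to be chosen.

```python
import random

def p_all_distinct(n, B):
    """Probability that n independently, uniformly chosen pages (out of B)
    are all distinct: product over i of (B - i) / B."""
    p = 1.0
    for i in range(n):
        p *= (B - i) / B
    return p

def pick_with_slot_bias(slots):
    """Toy model of the empty-slot bias: pick a random slot; if it is
    empty (None), advance (with wraparound) to the next live tuple."""
    i = random.randrange(len(slots))
    while slots[i] is None:
        i = (i + 1) % len(slots)
    return slots[i]

# Distinct-pages probability rises toward 1 as B grows, for fixed n = 100.
for B in (1000, 10**4, 10**6):
    print(B, p_all_distinct(100, B))

# Bias demo: 'B' is preceded by 3 empty slots, so it gets 4 chances
# (its own slot plus the 3 empties) versus 1 chance for 'A'.
random.seed(0)
slots = ['A', None, None, None, 'B']
counts = {'A': 0, 'B': 0}
for _ in range(100_000):
    counts[pick_with_slot_bias(slots)] += 1
print(counts)  # 'B' should appear roughly 4x as often as 'A'
```

Running this kind of simulation for page counts like 2n or 3n would give a concrete feel for the shape of the different-pages probability function being asked about.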