Re: ANALYZE sampling is too good
From | Mark Kirkwood
---|---
Subject | Re: ANALYZE sampling is too good
Date |
Msg-id | 52A4F084.3070200@catalyst.net.nz
In reply to | Re: ANALYZE sampling is too good (Josh Berkus <josh@agliodbs.com>)
List | pgsql-hackers
On 09/12/13 08:03, Josh Berkus wrote:
> So there's a set of math designed to calculate for the skew introduced
> by reading *all* of the rows in each block. That's what I meant by
> "block-based sampling"; you read, say, 400 pages, you compile statistics
> on *all* of the rows on those pages, you apply some algorithms to adjust
> for groupings of rows based on how grouped they are. And you have a
> pretty good estimate of how grouped they are, because you just looked at
> complete sets of rows on a bunch of nonadjacent pages.
>
> Obviously, you need to look at more rows than you would with a
> pure-random sample. Like I said, the 80%+ accurate point in the papers
> seemed to be at a 5% sample. However, since those rows come from the
> same pages, the cost of looking at more rows is quite small, compared to
> the cost of looking at 64 times as many disk pages.
>
> My ACM subscription has lapsed, though; someone with a current ACM
> subscription could search for this; there are several published papers,
> with math and pseudocode.

This makes more sense! Unfortunately you had confused everyone (well, me at least) by decreeing that we needed block-based sampling - when we started using that in 2004. There is more than one way to sample blocks, it seems :-)

What you are actually suggesting is a way to do block-based sampling that requires reading fewer blocks, and that is a great idea! Of course, as Greg has just suggested, the details are gonna be the clincher. Can you supply authors or titles of the papers?

Also, with reference to Greenplum: their use of Postgres is a fairly special case, and an approach that works well for them is not always suitable for more general-purpose use.

Regards

Mark
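For anyone skimming the thread, here is a minimal sketch of the two sampling styles being contrasted. This is a toy illustration in Python, not PostgreSQL code; the table representation and function names are invented for the example, and the real implementations (Vitter-style row sampling, and the clustering corrections Josh refers to) are considerably more involved.

```python
import random

def sample_whole_blocks(table, n_blocks, rng=random):
    """Sketch of the proposal: read a small random set of blocks and
    keep *every* row on those blocks.  'table' is assumed to be a list
    of blocks, each block a list of rows.  The resulting statistics
    still need an adjustment for rows on the same block being
    correlated (the "groupings" Josh mentions)."""
    chosen = rng.sample(range(len(table)), min(n_blocks, len(table)))
    rows = []
    for blkno in sorted(chosen):      # sorted: roughly sequential I/O
        rows.extend(table[blkno])     # take all rows on the block
    return rows

def sample_rows_within_blocks(table, n_rows, rng=random):
    """Sketch of the 2004-era style: pick a block, then pick a single
    row within it, so the number of block reads grows with the number
    of sampled rows rather than staying fixed."""
    rows = []
    for _ in range(n_rows):
        blk = table[rng.randrange(len(table))]
        if blk:
            rows.append(blk[rng.randrange(len(blk))])
    return rows

# Toy usage: 1000 blocks of 50 rows, values clustered by block number.
# sample_whole_blocks(table, 8) touches 8 blocks but yields 400 rows;
# sample_rows_within_blocks(table, 400) yields 400 rows but may touch
# up to 400 distinct blocks.
table = [[blkno] * 50 for blkno in range(1000)]
print(len(sample_whole_blocks(table, 8)))
print(len(sample_rows_within_blocks(table, 400)))
```

The point of the sketch is only the I/O trade-off: whole-block sampling gets many more rows per page read, at the cost of a sample that is no longer independent row-by-row.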