> However ... where this thread started was not about trying to reduce
> the remaining statistical imperfections in our existing sampling method.
> It was about whether we could reduce the number of pages read for an
> acceptable cost in increased statistical imperfection.
I think it is pretty clear that n_distinct at least, and probably the MCV list, would be a catastrophe under some common data distribution patterns if we sampled all rows in each block without changing our current computation method (think of columns whose values are physically clustered, so that the rows within a block are strongly correlated). If we come up with a computation that works for that type of sampling, it would probably be an improvement under our current sampling as well. And if we do find such a thing, I wouldn't want it to get rejected just because the larger block-sampling change did not make it in.
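To make the failure mode concrete, here is a hypothetical simulation (the table shape, sample sizes, and function names are all made up for illustration) that feeds both a uniform row sample and an all-rows-per-block sample of a perfectly clustered column into the Haas/Stokes Duj1 estimator, which is what analyze.c currently uses:

    import random
    from collections import Counter

    def duj1(sample, total_rows):
        # Haas/Stokes Duj1, per analyze.c: D = n*d / (n - f1 + f1*n/N)
        n = len(sample)
        counts = Counter(sample)
        d = len(counts)
        f1 = sum(1 for c in counts.values() if c == 1)
        return n * d / (n - f1 + f1 * n / total_rows)

    # 10,000 blocks of 100 rows; each value fills exactly one block,
    # so the column is perfectly clustered: 10,000 true distincts.
    ROWS_PER_BLOCK, NBLOCKS = 100, 10_000
    blocks = [[v] * ROWS_PER_BLOCK for v in range(NBLOCKS)]
    total_rows = NBLOCKS * ROWS_PER_BLOCK

    # Uniform row sample vs. every row from the same number of blocks.
    row_sample = [random.randrange(NBLOCKS) for _ in range(300 * ROWS_PER_BLOCK)]
    block_sample = [v for blk in random.sample(blocks, 300) for v in blk]

    print("true      :", NBLOCKS)                                # 10000
    print("row rows  :", round(duj1(row_sample, total_rows)))    # ~10000
    print("block rows:", round(duj1(block_sample, total_rows)))  # 300

The block sample sees only 300 values, all with f1 = 0, so the estimator returns about 300: off by two orders of magnitude, which is the kind of catastrophe I mean.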
Well, why not take a supersample containing all visible tuples from N selected blocks, and do bootstrapping over it, with subsamples of M independent rows each?
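Roughly like this, that is (a sketch only; the with-replacement resampling and the averaging step are assumptions on my part, not a worked-out proposal):

    import random

    def bootstrap_estimate(supersample, m, trials, estimator):
        # Draw `trials` resamples of M rows each from the supersample of
        # all visible tuples in the N chosen blocks (with replacement,
        # as in a classical bootstrap), and average the statistic.
        stats = [estimator(random.choices(supersample, k=m))
                 for _ in range(trials)]
        return sum(stats) / len(stats)

    # e.g., with the duj1() sketch upthread:
    #   bootstrap_estimate(block_sample, m=5_000, trials=100,
    #                      estimator=lambda s: duj1(s, total_rows))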
Bootstrapping methods generally do not work well when ties are significant events, i.e. when two values being identical means something different from their being very close but not identical. Distinct-value estimation is the extreme case: the statistic depends on nothing except exact ties.
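A quick illustration of why (the numbers are invented): resampling with replacement manufactures duplicates even when the underlying data has none, so tie-sensitive quantities like d and f1 are distorted before any estimator runs:

    import random

    n = 100_000
    population = list(range(n))                 # all values distinct: no real ties
    resample = random.choices(population, k=n)  # one bootstrap resample
    print(len(set(resample)) / n)               # ~0.632 (1 - 1/e): ~37% fake ties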