Re: Multi-Dimensional Histograms
От | Nathan Boley |
---|---|
Тема | Re: Multi-Dimensional Histograms |
Дата | |
Msg-id | 6fa3b6e20906301239w1bf3b1cer88545747d041c3c@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: Multi-Dimensional Histograms (Robert Haas <robertmhaas@gmail.com>) |
Список | pgsql-hackers |
> I'm finding myself unable to follow all the terminology on this thead. > What's dimension reduction? What's PCA? ( Sorry for the jargon - thanks Josh ) > It feels like what you might need is statistics for colB (MCVs and/or a > histogram) for certain particular values of colA. Certainly - this is exactly what a multi-dimensional histogram would store. > Unfortunately, in > the general case the set of values of colA for which you need these > statistics might be inconveniently large. > Which is why, in the general case, we need some sort of dimension reduction. If we could say that, for instance, the distribution of colB is roughly the same for all values of colA less than 100, and it is roughly the same for all values of colA >= 100, then we would have effectively reduced the dimension from ndistinct(colB)*ndistinct(colA) to 2*ndistinct(colB). The multidimensional histogram in the above paper is somewhat akin to what you suggested and I just described - it attempts to select contiguous regions that are "the same". PCA is a much different approach, but it is still, broadly, a dimension reduction technique. > At the end of the day, we cant do any dimension reduction unless the > ordering encodes some sort of useful information, and the data type > being in R^n is certainly no guarantee. Consider, for instance, the > cross correlation of zip-codes and area codes - you would really want > to order those by some geographic relation. I think that is why > cross-column stats is so hard in the general case. To clarify my response to Tom, directly above, if as in Robert's example, all of the values in colA < 100 were different than all of the values > 100 with respect to colB, then it is easy to represent the structure in something manageable. Ie, we store two histograms for colB: 1 for colA > 100, one for colA < 100. However, if there are the same two classes in colB ( lets say C1 and C2 ) but rather than C1 being associated with values in colA < 100, it is associated with 100 completely random values in colA ( ie 1, 22, 47, 89, 91, 92, 107, ... ) there is not a whole lot we can do besides storing a list of all of those values. We could maybe use the ratio of classes to improve the average plan choice, but you would still get a lot of bad plans. Could anyone provide a concrete example ( or better yet a data set ) where this occurs? -Nathan
В списке pgsql-hackers по дате отправления: