Re: Cross-column statistics revisited
From: Ron Mayer
Subject: Re: Cross-column statistics revisited
Date:
Msg-id: 48F7CE57.10004@cheapcomplexdevices.com
In reply to: Re: Cross-column statistics revisited (Josh Berkus <josh@agliodbs.com>)
Responses: Re: Cross-column statistics revisited
List: pgsql-hackers
Josh Berkus wrote:
>> Yes, or to phrase that another way: What kinds of queries are being
>> poorly optimized now and why?
>
> Well, we have two different correlation problems. One is the problem of
> dependent correlation, such as the 1.0 correlation of ZIP and CITY fields
> as a common problem. This could in fact be fixed, I believe, via a linear
> math calculation based on the sampled level of correlation, assuming we
> have enough samples. And it's really only an issue if the correlation is
> > 0.5.

I'd note that this can be an issue even without 2 columns involved.

I've seen a number of tables where the data is loaded in batches, so similar values from a batch tend to be packed into relatively few pages. Think of a database for a retailer that nightly aggregates data from each of many stores. Each incoming batch inserts the store's data into tightly packed disk pages where almost all rows on the page are for that store. But those pages are interspersed with pages from other stores.

I think I like the ideas Greg Stark had a couple of years ago:
http://archives.postgresql.org/pgsql-hackers/2006-09/msg01040.php

"...sort the sampled values by value and count up the average number of distinct blocks per value.... Or perhaps we need a second histogram where the quantities are of distinct pages rather than total records.... We might also need a separate 'average number of n-block spans per value' since those seem to me to lead more directly to values like 'blocks that need to be read'."
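To make the first of those ideas concrete, here is a minimal sketch (not PostgreSQL code, and the function name is mine) of "average number of distinct blocks per value", assuming we already have a sample of (value, block_number) pairs, where the block number would come from each sampled tuple's ctid. A low average relative to the number of rows per value suggests the value is physically clustered, as in the retailer example above:

```python
from collections import defaultdict

def avg_distinct_blocks_per_value(sample):
    """sample: iterable of (value, block_number) pairs from sampled tuples.
    Returns the average count of distinct blocks each sampled value
    appears on; low values indicate physical clustering of that value."""
    blocks_by_value = defaultdict(set)
    for value, block in sample:
        blocks_by_value[value].add(block)
    if not blocks_by_value:
        return 0.0
    return sum(len(b) for b in blocks_by_value.values()) / len(blocks_by_value)

# Hypothetical retailer table loaded in nightly batches: each night, each
# store's rows fill one block, so a store's rows sit on few blocks even
# though blocks from different stores are interleaved across the table.
sample = []
for night in range(30):
    for store in range(10):
        block = night * 10 + store                         # one block per store per night
        sample.extend((store, block) for _ in range(50))   # 50 rows per block
print(avg_distinct_blocks_per_value(sample))  # 30.0 blocks per store, vs ~300 if scattered
```

This is only an illustration of the statistic itself; how such a number would be stored alongside the existing per-column stats and consumed by the planner is a separate question.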