Re: Cross-field statistics
От | Decibel! |
---|---|
Тема | Re: Cross-field statistics |
Дата | |
Msg-id | F03A6C3F-A325-4BB7-AEA0-784DFE2F5F54@decibel.org обсуждение исходный текст |
Ответ на | Re: Cross-field statistics (Gregory Stark <stark@enterprisedb.com>) |
Список | pgsql-hackers |
On Apr 17, 2008, at 12:22 PM, Gregory Stark wrote: > "Decibel!" <decibel@decibel.org> writes: > >> For each field that isn't already in a set of field groupings >> * Sort sample rows on that field >> * Calculate correlation for all other fields >> * If there are other fields that have a correlation to this sort >> order over >> some threshold, save them along with the field we originally >> sorted on as a >> new 'field grouping' >> * Else, there are no other fields that group with this field; >> it's a "loner" > > I think this is going somewhere. But "correlation" isn't quite > right. It has > the same problem our use of correlation for clusteredness has. > Consider the > case of Zip code and City. They're nearly very non-independent > variables but > there's basically no correlation. If we have a limited number of values in one of the fields, it would be possible to build a histogram of values for other fields based on the values in the first field. But I can't see how we could possibly represent something like city, zip in a compact form. You would have to keep a range of zips that cover a city. Hmm... but we only care about cities that have a substantial number of zip codes. This is what the equivalent of the most-common-values list would be for cross-platform stats: for the most_common_vals in column a, you store a range or histogram of the corresponding values in b, assuming that there is a good correspondence. >> For each field grouping, at a minimum we'd need to store a >> histogram for that >> grouping. > > This is a problem. What does a histogram on a grouping mean? It's > not clear > how to come up with a histogram which can help answer questions like > A between ? and ? and B between ? and ? > > You can do a histogram on <a,b> or <b,a> but neither are going to be > especially useful. Heikki and I came up with a weird hybrid thing > which might > be useful for avoiding overestimating selectivity like > WHERE city='BOS' AND areacode = '617' > > But it didn't help at all with the converse, ie: > WHERE city='BOS' AND areacode = '212' > > It's hard to see how we could possibly catch cases like that though. If the two fields share the same correlation, then the histogram is just what we use right now. We could actually do this today, but only for fields with a high physical correlation. What I was describing allowed extending this to fields that have a high correlation to each other, even if they didn't have a high physical correlation. I know that this doesn't help us for things like city/area code or city/zip, but other than my idea above I'm rather at a loss on how to represent that in a compact fashion. -- Decibel!, aka Jim C. Nasby, Database Architect decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828
В списке pgsql-hackers по дате отправления: