Re: two dimensional statistics in Postgres
От | Gavin Flower |
---|---|
Тема | Re: two dimensional statistics in Postgres |
Дата | |
Msg-id | 545B5291.6080807@archidevsys.co.nz обсуждение исходный текст |
Ответ на | two dimensional statistics in Postgres (Katharina Büchse<katharina.buechse@uni-jena.de>) |
Ответы |
Re: two dimensional statistics in Postgres
|
Список | pgsql-hackers |
On 06/11/14 23:15, Katharina Büchse wrote: > Hi, > > I'm a phd-student at the university of Jena, Thüringen, Germany, in > the field of data bases, more accurate query optimization. > I want to implement a system in PostgreSQL that detects column > correlations and creates statistical data about correlated columns for > the optimizer. Therefore I need to store two dimensional statistics > (especially two dimensional histograms) in PostgreSQL. > I had a look at the description of "WIP: multivariate statistics / > proof of concept", which looks really promising, I guess these > statistics are based on scans of the data? For my system I need both > -- statistical data based on table scans (actually, samples are > enough) and those based on query feedback. Query feedback (tuple > counts and, speaking a little inaccurately, the where-part of the > query itself) needs to be extracted and there needs to be a decision > for the optimizer, when to take multivariate statistics and when to > use the one dimensional ones. Oracle in this case just disables one > dimensional histograms if there is already a multidimensional > histogram, but this is not always useful, especially in the case of a > feedback based histogram (which might not cover the whole data space). > I want to use both kinds of histograms because correlations might > occur only in parts of the data. In this case a histogram based on a > sample of the whole table might not get the point and wouldn't help > for the part of the data the user seems to be interested in. > There are special data structures for storing multidimensional > histograms based on feedback and I already tried to implement one of > these in C. In the case of two dimensions they are of course not "for > free" (one dimensional would be much cheaper), but based on the > principle of maximum entropy they deliver really good results. I > decided for only two dimensions because in this case we have the best > proportion of cost and benefit when searching for correlation (here > I'm relying on tests that were made in DB2 within a project called > CORDS which detects correlations even between different tables). > > I'd be grateful for any advices and discussions. > Regards, > > Katharina > > Could you store a 2 dimensional histogram in a one dimensional array: A[z] = value, where z = col * rowSize + row (zero starting index)? Cheers, Gavin
В списке pgsql-hackers по дате отправления: