Обсуждение: Implementing STDDEV and VARIANCE
I'd like to implement stddev and variance aggregates in Postgres. This is a long standing TODO item. There already has been some discussion on implementing this, see http://www.postgresql.org/mhonarc/pgsql-hackers/1998-06/msg00175.html There are two definitions for standard deviation and variance: _ population variance = sigma^2 = SUM(X - X)^2 / N population stddev = sqrt(population variance) _ sample variance = s^2 = SUM(X - X)^2 / (N-1) sample stddev = sqrt(sample variance) These statistics can be calculated in one pass when three variables are kept when scanning through the rows, n, sum(x), sum(x^2). Currently, only two variables are kept. E.g. avg() is calculated as follows: sx = 0 n = 0 for every row {sx = sx + value in row // transition function 1n = n+1 // transition function 2 } avg = sum(x) / n stddev / variance might be calculated as follows: sx = 0 n = 0 sx2 = 0 for every row {sx = sx + value in row // transition function 1n = n+1 // transition function 2sx2 = sx2 + value in row^2 // transition function 3 } var = (1/n) * (sx2 - (1/n) * sx^2) // Population or var = (1/(n-1)) * (sx2 - (1/n) * sx^2) // Sample and stddev = sqrt(var) I've looked through the code and the following things need to be implemented: 1. Add three columns to pg_aggregate for the additional third transition function. Tom Lane wrote at Fri, 05 Jun 1998 11:24:04 -0400:>All that you need to implement this is room to keep two running>sums insteadof one. I haven't looked at pgsql's aggregate functions,>but I'd hope that the working state can be a struct notjust a>single number. I saw no other way than adding another transition function and logic, as this might break user-defined aggregates (are there any around?). 2. Add logic to nodeAgg.c to execute the third transition function and finalize function with three rather than two parameters 3. Add functions f(a,b) that returns a + b^2 for selected types 4. Add four finalize functions to calculate the variance / stddev 5. Update the code for create aggregate, to include the definition of the third transition function 6. Update the documentation My questions are: 1. Is this the correct way to continue? What am I missing? Any errors in my reasoning? 2. I am proposing the names stddev(x) and variance(x) for population and samplestddev(x) and samplevariance(x) for sample statistics. Any comments? 3. I'm planning to implement this for types float4, float8 and numeric. Any other types also? int[2,4,8] don't seem logical, as these would introduce serious rounding errors. Let me know what you think, Jeroen
Jeroen van Vianen <jeroen@design.nl> writes: > I'd like to implement stddev and variance aggregates in Postgres. This is a > long standing TODO item. > 1. Add three columns to pg_aggregate for the additional third transition > function. > Tom Lane wrote at Fri, 05 Jun 1998 11:24:04 -0400: >> All that you need to implement this is room to keep two running >> sums instead of one. I haven't looked at pgsql's aggregate functions, >> but I'd hope that the working state can be a struct not just a >> single number. > I saw no other way than adding another transition function and logic, as > this might break user-defined aggregates (are there any around?). Yes, there are some, and no you do not need a third transition function. What you do need is a datatype holding two values that you can use as the transition datatype, plus appropriate functions for the transition functions. The reason there are two transition functions at all is that it allows some of the standard aggregate functions to be built using arithmetic functions that exist anyway --- for example, float8 AVG is built from float8 addition, float8 increment, and float8 divide, with only float8 increment being a function you wouldn't have anyway. However, the whole thing is really a kluge; nodeAvg.c has all sorts of weird little hacks that are necessary to make AVG have the right behavior in boundary conditions such as no-tuples-in. (A blind application of float8 divide would lead to a divide-by-zero exception in that case.) These hacks limit the ability of user-defined aggregates to control their behavior at the boundary conditions. Nor can an aggregate control its response to NULL data values; that's hardwired into nodeAvg.c as well. A cleaner solution would have just one transition function and one transition data value, plus an optional finalization function that takes only the one data value. For AVG the transition data type would be a two-field struct and the transition function would update both fields. This would halve the function-call overhead per tuple. We'd have to provide specialized transition and finalization functions for AVG and probably a couple of the other standard aggregates, but that would allow us to rely on those functions to do the right things at the boundary conditions; nodeAvg.c could stop foreclosing the choices. I have been thinking about proposing such a change along with the function manager rewrite that is now planned for 7.1. That would be a good time because user-defined aggregates would need to be revisited anyway. Also, if the transition functions are to determine the behavior for no-tuples and for NULL data values, they had better be able to pass and return NULLs cleanly; which depends on the function manager rewrite. In short, I'd suggest thinking about implementing STDDEV with a single transition function and transition data value. You'll need specialized functions for it anyway, so I don't see that you're saving any work by proposing a third transition function. What you will need instead is a pg_type entry for the transition data type, but since that data type needn't have any operators, there's not much work needed. > 3. I'm planning to implement this for types float4, float8 and numeric. Any > other types also? int[2,4,8] don't seem logical, as these would introduce > serious rounding errors. I'd suggest just two basic implementations, with float8 and numeric internal calculations respectively. Data of other numeric types can be type-coerced to one of these (that might even happen automatically), but the output would always be either float8 or numeric. (I don't think float4 has enough precision to generate reliable stddev numbers except in very narrow conditions...) regards, tom lane
I wrote: > whole thing is really a kluge; nodeAvg.c has all sorts of weird little For "nodeAvg.c" read "nodeAgg.c"; sorry for the momentary brain fade... regards, tom lane
Tom Lane wrote: > Jeroen van Vianen <jeroen@design.nl> writes: > > I'd like to implement stddev and variance aggregates in Postgres. This is a > > long standing TODO item. > > > 1. Add three columns to pg_aggregate for the additional third transition > > function. > > > Tom Lane wrote at Fri, 05 Jun 1998 11:24:04 -0400: > >> All that you need to implement this is room to keep two running > >> sums instead of one. I haven't looked at pgsql's aggregate functions, > >> but I'd hope that the working state can be a struct not just a > >> single number. > > > I saw no other way than adding another transition function and logic, as > > this might break user-defined aggregates (are there any around?). > > Yes, there are some, and no you do not need a third transition > function. What you do need is a datatype holding two values that > you can use as the transition datatype, plus appropriate functions > for the transition functions. So it might be better to have this type hold three values (n, sum(x) and sum(x^2)) and only use one transition function to update all three values at once and have the finalization function do the necessary calculations. > The reason there are two transition functions at all is that it allows > some of the standard aggregate functions to be built using arithmetic > functions that exist anyway --- for example, float8 AVG is built from > float8 addition, float8 increment, and float8 divide, with only float8 > increment being a function you wouldn't have anyway. However, the > whole thing is really a kluge; nodeAvg.c has all sorts of weird little > hacks that are necessary to make AVG have the right behavior in boundary > conditions such as no-tuples-in. (A blind application of float8 divide > would lead to a divide-by-zero exception in that case.) These hacks > limit the ability of user-defined aggregates to control their behavior > at the boundary conditions. Nor can an aggregate control its response > to NULL data values; that's hardwired into nodeAvg.c as well. Yes, I saw these little hacks. And there are boundary conditions with stddev and variance with no rows and one row (for sample stddev and sample variance). > A cleaner solution would have just one transition function and one > transition data value, plus an optional finalization function that takes > only the one data value. For AVG the transition data type would be a > two-field struct and the transition function would update both fields. > This would halve the function-call overhead per tuple. We'd have to > provide specialized transition and finalization functions for AVG and > probably a couple of the other standard aggregates, but that would allow > us to rely on those functions to do the right things at the boundary > conditions; nodeAvg.c could stop foreclosing the choices. So you suggest changing all transition functions for 7.1 to keep all the state they need? > I have been thinking about proposing such a change along with the > function manager rewrite that is now planned for 7.1. That would be > a good time because user-defined aggregates would need to be revisited > anyway. Also, if the transition functions are to determine the behavior > for no-tuples and for NULL data values, they had better be able to pass > and return NULLs cleanly; which depends on the function manager rewrite. Are you suggesting also to change the lay-out of pg_attribute in 7.1 to something like this and do updates for all built-in types and aggregates? aggname aggowner aggtype aggtranstype [ n, sx, sx2 ] agginitfunction function that does ( n = 0, sx = 0.0, sx2 = 0.0 ) aggtransfunction function that does ( n = n + 1, sx = sx + x, sx2 = sx2 + x * x ) aggfinalizefunction function that returns (sx2 - (1/n) * sx * sx ) / n Might it be better for me to wait for 7.1 before implementing stddev and variance? > In short, I'd suggest thinking about implementing STDDEV with a > single transition function and transition data value. You'll need > specialized functions for it anyway, so I don't see that you're saving > any work by proposing a third transition function. What you will need > instead is a pg_type entry for the transition data type, but since that > data type needn't have any operators, there's not much work needed. OK, clear. > > 3. I'm planning to implement this for types float4, float8 and numeric. Any > > other types also? int[2,4,8] don't seem logical, as these would introduce > > serious rounding errors. > > I'd suggest just two basic implementations, with float8 and numeric > internal calculations respectively. Data of other numeric types can > be type-coerced to one of these (that might even happen automatically), > but the output would always be either float8 or numeric. (I don't think > float4 has enough precision to generate reliable stddev numbers except > in very narrow conditions...) OK, only float8 and numeric. Jeroen
Jeroen van Vianen <jeroen@design.nl> writes: > aggname > aggowner > aggtype > aggtranstype [ n, sx, sx2 ] > agginitfunction function that does ( n = 0, sx = 0.0, sx2 = 0.0 ) > aggtransfunction function that does ( n = n + 1, sx = sx + x, > sx2 = sx2 + x * x ) > aggfinalizefunction function that returns (sx2 - (1/n) * sx * sx ) / > n Right, that's pretty much what I'm visualizing. One minor detail: there is not an "agginitfunction", there is an "agginitvalue". So your special datatype to hold n/sx/sx2 must have at least a typinput function that can convert the text string held in pg_aggregate into the desired internal form of the initial state. (At least, that's how it's done now. Do you want to argue to change it? As long as we're opening up the AGG design for reconsideration, we could revisit that choice too.) > Might it be better for me to wait for 7.1 before implementing stddev and > variance? Well, you will need to be pretty fast on the draw if you want to get it into 7.0 release, since we will be freezing features for beta in a week. But I see no reason that you couldn't implement STDDEV within the existing framework; just ignore transfn2 and do it as above. You might have some problems with getting the desired response for zero or one tuples, but there isn't any way to fix that within the current framework :-(. We have to do the function manager rewrite before you can have control over when to return a NULL. As long as you are willing to live with that, you can have useful functionality now. regards, tom lane
Hannu Krosing wrote: > I'm sure many people would appreciate it even without "preserving oid-s" > as OID's are declared deprecated for (AFAIK) >2 years, OIDs should not be deprecated (never heard that one before). They will become more important if and when postgres moves more toward being an odbms. For reasons see any book on object theory and identity.
> Hannu Krosing wrote: > > > I'm sure many people would appreciate it even without "preserving oid-s" > > as OID's are declared deprecated for (AFAIK) >2 years, > > OIDs should not be deprecated (never heard that one before). They will > become more important if and when postgres moves more toward being an > odbms. For reasons see any book on object theory and identity. Agreed. -- Bruce Momjian | http://www.op.net/~candle pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
On Mon, 24 Jan 2000, Chris Bitmead wrote: > Hannu Krosing wrote: > > > I'm sure many people would appreciate it even without "preserving oid-s" > > as OID's are declared deprecated for (AFAIK) >2 years, > > OIDs should not be deprecated (never heard that one before). They will > become more important if and when postgres moves more toward being an > odbms. For reasons see any book on object theory and identity. You are the second person that mentioned the ODBMS<->OID tie in ... when you say ODBMS, is that the same as ORDBMS, or we talking a different issue? Want to elaborate? Like, since we already have the 'core', I believe, what does it take to continue that trend? Marc G. Fournier ICQ#7615664 IRC Nick: Scrappy Systems Administrator @ hub.org primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org
At 11:27 PM 1/23/00 -0500, Tom Lane wrote: >Right, that's pretty much what I'm visualizing. One minor detail: there >is not an "agginitfunction", there is an "agginitvalue". So your >special datatype to hold n/sx/sx2 must have at least a typinput function >that can convert the text string held in pg_aggregate into the desired >internal form of the initial state. (At least, that's how it's done >now. Do you want to argue to change it? As long as we're opening up >the AGG design for reconsideration, we could revisit that choice too.) At the moment I have a hard time visualizing an aggregate function where a constant initializer wouldn't serve, but ... what would be the cost of the generalization? It would only be called once per query or subquery containing the aggregate, right? If generalizing it can be done over a latte or mocha, perhaps it's worth doing. If it takes as long as drinking a pint of Guiness, perhaps not :) (I mean, let's get into meaningful metrics here!) - Don Baccus, Portland OR <dhogaza@pacifier.com> Nature photos, on-line guides, Pacific Northwest Rare Bird Alert Serviceand other goodies at http://donb.photo.net.
> You are the second person that mentioned the ODBMS<->OID tie in ... when > you say ODBMS, is that the same as ORDBMS, or we talking a different > issue? > > Want to elaborate? Like, since we already have the 'core', I believe, > what does it take to continue that trend? Hi, yes I am very interested in this issue, and ORDBMS and ODBMS are rather different. A long time ago, I wrote up a web page about this issue here... http://www.tech.com.au/postgres/ which should tell you all my thoughts on this issue. Basicly I can contribute a LOT of work to the postgres project, but as far as hacking the back end to achieve it - the learning curve is steep compared to my available time. But hacking a client side stuff, I could handle. Enjoy, Chris Bitmead.
Can I make a suggestion for toast? I don't know how much this is thought about yet, but it seems like it would be a good idea if large object support be re-implemented in terms of TOAST (when it is available). Because current large objects suck so much. Can I suggest that the TOAST system should be implemented to allow for random access? So for example, to retrieve a whole data member, of course it looks normal... SELECT xxx from yyy; But to retrieve a particular chunk.... SELECT PORTION(xxx, 65535, 131071) from yyy; to retrieve the 2nd 64k chunk. This could be useful for many sorts of apps.
At 11:27 PM 1/23/00 -0500, Tom Lane wrote: >Jeroen van Vianen <jeroen@design.nl> writes: > > aggname > > aggowner > > aggtype > > aggtranstype [ n, sx, sx2 ] > > agginitfunction function that does ( n = 0, sx = 0.0, sx2 = 0.0 ) > > aggtransfunction function that does ( n = n + 1, sx = sx + x, > > sx2 = sx2 + x * x ) > > aggfinalizefunction function that returns (sx2 - (1/n) * sx * sx ) / > > n > >Right, that's pretty much what I'm visualizing. One minor detail: there >is not an "agginitfunction", there is an "agginitvalue". So your >special datatype to hold n/sx/sx2 must have at least a typinput function >that can convert the text string held in pg_aggregate into the desired >internal form of the initial state. (At least, that's how it's done >now. Do you want to argue to change it? As long as we're opening up >the AGG design for reconsideration, we could revisit that choice too.) I would suggest supplying an initfunction that initializes the datatype that holds n/sx/sx2 so you're able to set individual members to NULL if so desired. I also won't need to implement typinput for all required aggregate types, one small headache less ;-) count --> int4 min/max --> basetype sum --> basetype avg --> basetype, n stddev, variance --> n, basetype, basetype > > Might it be better for me to wait for 7.1 before implementing stddev and > > variance? > >Well, you will need to be pretty fast on the draw if you want to get it >into 7.0 release, since we will be freezing features for beta in a week. True. >But I see no reason that you couldn't implement STDDEV within the >existing framework; just ignore transfn2 and do it as above. You might >have some problems with getting the desired response for zero or one >tuples, but there isn't any way to fix that within the current >framework :-(. We have to do the function manager rewrite before you >can have control over when to return a NULL. As long as you are willing >to live with that, you can have useful functionality now. The problem with zero or one rows is pretty important IMO if you want to implement stddev and variance for both population and sample. You won't be able to explain the difference in outcome if you don't do it right. Let me wait for you to overhaul the fmgr code and do all the aggregate stuff right in one sweep. Only thing is: how do we deal with current user-defined aggregates? At 09:02 PM 1/23/00 -0800, Don Baccus wrote: >At 11:27 PM 1/23/00 -0500, Tom Lane wrote: > >Right, that's pretty much what I'm visualizing. One minor detail: there > >is not an "agginitfunction", there is an "agginitvalue". So your > >special datatype to hold n/sx/sx2 must have at least a typinput function > >that can convert the text string held in pg_aggregate into the desired > >internal form of the initial state. (At least, that's how it's done > >now. Do you want to argue to change it? As long as we're opening up > >the AGG design for reconsideration, we could revisit that choice too.) > >At the moment I have a hard time visualizing an aggregate function where >a constant initializer wouldn't serve, but ... what would be the cost of >the generalization? It would only be called once per query or subquery >containing the aggregate, right? Initializer functions for count need to return 0, for min, max, avg, sum, stddev and variance they need to set individual members to NULL (at least that's how I see it now). A function returning this (with the new fmgr code) would be very easy to implement (I hope ;-) ). I'll hold my breath until the dust settles and we're starting 7.1. Jeroen
Jeroen van Vianen <jeroen@design.nl> writes: >> Right, that's pretty much what I'm visualizing. One minor detail: there >> is not an "agginitfunction", there is an "agginitvalue". So your >> special datatype to hold n/sx/sx2 must have at least a typinput function >> that can convert the text string held in pg_aggregate into the desired >> internal form of the initial state. (At least, that's how it's done >> now. Do you want to argue to change it? As long as we're opening up >> the AGG design for reconsideration, we could revisit that choice too.) > I would suggest supplying an initfunction that initializes the datatype > that holds n/sx/sx2 so you're able to set individual members to NULL if so > desired. I also won't need to implement typinput for all required aggregate > types, one small headache less ;-) I've been thinking about this and have come to the conclusion that using an initial value string and a typinput function is still the right design, compared to using a parameterless initializer function. See, that way you have a shot at reusing the same transition type (and typinput code) for several different aggregates with slightly different initial condition requirements, whereas with the initializer-function approach, you have no choice but to write a separate initializer function for each aggregate initial condition you need. I have been thinking that we could save some effort (at a trivial cost in memory) by defining a datatype or two that is specifically intended to be an aggregate transition state datatype. For example, a struct containing three or four float8 fields and as many bool fields would serve nicely for AVG, STDDEV, and probably some other aggregates; some of them wouldn't use all the fields, but so what? The only support code this datatype would need would be a typinput function to convert a string from pg_aggregate into initial struct contents. By doing it that way, the same datatype can support several aggregates with different initial condition requirements, without having to write a separate initializer function for each one. I think the point about initializing struct contents to NULL is a red herring. For basic C types like float8, there isn't such a thing as NULL really; you have to have a separate flag field or count field to tell you that you've seen no input yet. That can be initialized equally well by a typinput function or by an initializer function. But the typinput function gets to have a parameter taken from pg_aggregate; an initializer function would not. > Let me wait for you to overhaul the fmgr code and do all the aggregate > stuff right in one sweep. Only thing is: how do we deal with current > user-defined aggregates? Well, a user aggregate that only used transfn1, or only transfn2, would convert directly. This scheme would break user aggregates that used both, which is why I'm running it up the flagpole early --- to see if anyone complains. regards, tom lane
unsubscribe
Has this gone anywhere? > I'd like to implement stddev and variance aggregates in Postgres. This is a > long standing TODO item. > > There already has been some discussion on implementing this, see > http://www.postgresql.org/mhonarc/pgsql-hackers/1998-06/msg00175.html > > There are two definitions for standard deviation and variance: > _ > population variance = sigma^2 = SUM(X - X)^2 / N > > population stddev = sqrt(population variance) > _ > sample variance = s^2 = SUM(X - X)^2 / (N-1) > > sample stddev = sqrt(sample variance) > > These statistics can be calculated in one pass when three variables are > kept when scanning through the rows, n, sum(x), sum(x^2). Currently, only > two variables are kept. > > E.g. avg() is calculated as follows: > > sx = 0 > n = 0 > for every row { > sx = sx + value in row // transition function 1 > n = n+1 // transition function 2 > } > avg = sum(x) / n > > stddev / variance might be calculated as follows: > > sx = 0 > n = 0 > sx2 = 0 > for every row { > sx = sx + value in row // transition function 1 > n = n+1 // transition function 2 > sx2 = sx2 + value in row^2 // transition function 3 > } > var = (1/n) * (sx2 - (1/n) * sx^2) // Population > > or > > var = (1/(n-1)) * (sx2 - (1/n) * sx^2) // Sample > > and > > stddev = sqrt(var) > > I've looked through the code and the following things need to be implemented: > > 1. Add three columns to pg_aggregate for the additional third transition > function. > > Tom Lane wrote at Fri, 05 Jun 1998 11:24:04 -0400: > >All that you need to implement this is room to keep two running > >sums instead of one. I haven't looked at pgsql's aggregate functions, > >but I'd hope that the working state can be a struct not just a > >single number. > > I saw no other way than adding another transition function and logic, as > this might break user-defined aggregates (are there any around?). > > 2. Add logic to nodeAgg.c to execute the third transition function and > finalize function with three rather than two parameters > 3. Add functions f(a,b) that returns a + b^2 for selected types > 4. Add four finalize functions to calculate the variance / stddev > 5. Update the code for create aggregate, to include the definition of the > third transition function > 6. Update the documentation > > My questions are: > 1. Is this the correct way to continue? What am I missing? Any errors in my > reasoning? > 2. I am proposing the names stddev(x) and variance(x) for population and > samplestddev(x) and > samplevariance(x) for sample statistics. Any comments? > 3. I'm planning to implement this for types float4, float8 and numeric. Any > other types also? int[2,4,8] don't seem logical, as these would introduce > serious rounding errors. > > Let me know what you think, > > > Jeroen > > ************ > -- Bruce Momjian | http://www.op.net/~candle pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Bruce Momjian wrote: I created all that sometimes back. Dunno why never added it to contrib. Will post it another day. Jan > Has this gone anywhere? > > > I'd like to implement stddev and variance aggregates in Postgres. This is a > > long standing TODO item. > > > > There already has been some discussion on implementing this, see > > http://www.postgresql.org/mhonarc/pgsql-hackers/1998-06/msg00175.html > > > > There are two definitions for standard deviation and variance: > > _ > > population variance = sigma^2 = SUM(X - X)^2 / N > > > > population stddev = sqrt(population variance) > > _ > > sample variance = s^2 = SUM(X - X)^2 / (N-1) > > > > sample stddev = sqrt(sample variance) > > > > These statistics can be calculated in one pass when three variables are > > kept when scanning through the rows, n, sum(x), sum(x^2). Currently, only > > two variables are kept. > > > > E.g. avg() is calculated as follows: > > > > sx = 0 > > n = 0 > > for every row { > > sx = sx + value in row // transition function 1 > > n = n+1 // transition function 2 > > } > > avg = sum(x) / n > > > > stddev / variance might be calculated as follows: > > > > sx = 0 > > n = 0 > > sx2 = 0 > > for every row { > > sx = sx + value in row // transition function 1 > > n = n+1 // transition function 2 > > sx2 = sx2 + value in row^2 // transition function 3 > > } > > var = (1/n) * (sx2 - (1/n) * sx^2) // Population > > > > or > > > > var = (1/(n-1)) * (sx2 - (1/n) * sx^2) // Sample > > > > and > > > > stddev = sqrt(var) > > > > I've looked through the code and the following things need to be implemented: > > > > 1. Add three columns to pg_aggregate for the additional third transition > > function. > > > > Tom Lane wrote at Fri, 05 Jun 1998 11:24:04 -0400: > > >All that you need to implement this is room to keep two running > > >sums instead of one. I haven't looked at pgsql's aggregate functions, > > >but I'd hope that the working state can be a struct not just a > > >single number. > > > > I saw no other way than adding another transition function and logic, as > > this might break user-defined aggregates (are there any around?). > > > > 2. Add logic to nodeAgg.c to execute the third transition function and > > finalize function with three rather than two parameters > > 3. Add functions f(a,b) that returns a + b^2 for selected types > > 4. Add four finalize functions to calculate the variance / stddev > > 5. Update the code for create aggregate, to include the definition of the > > third transition function > > 6. Update the documentation > > > > My questions are: > > 1. Is this the correct way to continue? What am I missing? Any errors in my > > reasoning? > > 2. I am proposing the names stddev(x) and variance(x) for population and > > samplestddev(x) and > > samplevariance(x) for sample statistics. Any comments? > > 3. I'm planning to implement this for types float4, float8 and numeric. Any > > other types also? int[2,4,8] don't seem logical, as these would introduce > > serious rounding errors. > > > > Let me know what you think, > > > > > > Jeroen > > > > ************ > > > > > -- > Bruce Momjian | http://www.op.net/~candle > pgman@candle.pha.pa.us | (610) 853-3000 > + If your life is a hard drive, | 830 Blythe Avenue > + Christ can be your backup. | Drexel Hill, Pennsylvania 19026 > -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
Jeroen, not sure if you were involved in this, but standard deviation is in the current development tree, and will be released in 7.1 in a few months. > I'd like to implement stddev and variance aggregates in Postgres. This is a > long standing TODO item. > > There already has been some discussion on implementing this, see > http://www.postgresql.org/mhonarc/pgsql-hackers/1998-06/msg00175.html > > There are two definitions for standard deviation and variance: > _ > population variance = sigma^2 = SUM(X - X)^2 / N > > population stddev = sqrt(population variance) > _ > sample variance = s^2 = SUM(X - X)^2 / (N-1) > > sample stddev = sqrt(sample variance) > > These statistics can be calculated in one pass when three variables are > kept when scanning through the rows, n, sum(x), sum(x^2). Currently, only > two variables are kept. > > E.g. avg() is calculated as follows: > > sx = 0 > n = 0 > for every row { > sx = sx + value in row // transition function 1 > n = n+1 // transition function 2 > } > avg = sum(x) / n > > stddev / variance might be calculated as follows: > > sx = 0 > n = 0 > sx2 = 0 > for every row { > sx = sx + value in row // transition function 1 > n = n+1 // transition function 2 > sx2 = sx2 + value in row^2 // transition function 3 > } > var = (1/n) * (sx2 - (1/n) * sx^2) // Population > > or > > var = (1/(n-1)) * (sx2 - (1/n) * sx^2) // Sample > > and > > stddev = sqrt(var) > > I've looked through the code and the following things need to be implemented: > > 1. Add three columns to pg_aggregate for the additional third transition > function. > > Tom Lane wrote at Fri, 05 Jun 1998 11:24:04 -0400: > >All that you need to implement this is room to keep two running > >sums instead of one. I haven't looked at pgsql's aggregate functions, > >but I'd hope that the working state can be a struct not just a > >single number. > > I saw no other way than adding another transition function and logic, as > this might break user-defined aggregates (are there any around?). > > 2. Add logic to nodeAgg.c to execute the third transition function and > finalize function with three rather than two parameters > 3. Add functions f(a,b) that returns a + b^2 for selected types > 4. Add four finalize functions to calculate the variance / stddev > 5. Update the code for create aggregate, to include the definition of the > third transition function > 6. Update the documentation > > My questions are: > 1. Is this the correct way to continue? What am I missing? Any errors in my > reasoning? > 2. I am proposing the names stddev(x) and variance(x) for population and > samplestddev(x) and > samplevariance(x) for sample statistics. Any comments? > 3. I'm planning to implement this for types float4, float8 and numeric. Any > other types also? int[2,4,8] don't seem logical, as these would introduce > serious rounding errors. > > Let me know what you think, > > > Jeroen > > ************ > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
At 22:27 9/29/00 -0400, you wrote: >Jeroen, not sure if you were involved in this, but standard deviation is >in the current development tree, and will be released in 7.1 in a few >months. Yes, I noticed. And I'm happy to see it has been implemented without my help, as I'm desperately running out of time to do any coding on Postgres. Thanks, Jeroen