Обсуждение: Implementing STDDEV and VARIANCE

Поиск

Список

Период

Сортировка

Implementing STDDEV and VARIANCE

От

Jeroen van Vianen

Дата:

23 января 2000 г., 11:56:18

I'd like to implement stddev and variance aggregates in Postgres. This is a 
long standing TODO item.

There already has been some discussion on implementing this, see 
http://www.postgresql.org/mhonarc/pgsql-hackers/1998-06/msg00175.html

There are two definitions for standard deviation and variance:                                        _
population variance = sigma^2 = SUM(X - X)^2 / N

population stddev = sqrt(population variance)                                _
sample variance = s^2 = SUM(X - X)^2 / (N-1)

sample stddev = sqrt(sample variance)

These statistics can be calculated in one pass when three variables are 
kept when scanning through the rows, n, sum(x), sum(x^2). Currently, only 
two variables are kept.

E.g. avg() is calculated as follows:

sx = 0
n = 0
for every row {sx = sx + value in row                   // transition function 1n = n+1
// transition function 2
 
}
avg = sum(x) / n

stddev / variance might be calculated as follows:

sx = 0
n = 0
sx2 = 0
for every row {sx = sx + value in row                   // transition function 1n = n+1
// transition function 2sx2 = sx2 + value in row^2               // transition function 3
 
}
var = (1/n) * (sx2 - (1/n) * sx^2)              // Population

or

var = (1/(n-1)) * (sx2 - (1/n) * sx^2)          // Sample

and

stddev = sqrt(var)

I've looked through the code and the following things need to be implemented:

1. Add three columns to pg_aggregate for the additional third transition 
function.

Tom Lane wrote at Fri, 05 Jun 1998 11:24:04 -0400:>All that you need to implement this is room to keep two running>sums
insteadof one.  I haven't looked at pgsql's aggregate functions,>but I'd hope that the working state can be a struct
notjust a>single number.
 

I saw no other way than adding another transition function and logic, as 
this might break user-defined aggregates (are there any around?).

2. Add logic to nodeAgg.c to execute the third transition function and 
finalize function with three rather than two parameters
3. Add functions f(a,b) that returns a + b^2 for selected types
4. Add four finalize functions to calculate the variance / stddev
5. Update the code for create aggregate, to include the definition of the 
third transition function
6. Update the documentation

My questions are:
1. Is this the correct way to continue? What am I missing? Any errors in my 
reasoning?
2. I am proposing the names stddev(x) and variance(x) for population and 
samplestddev(x) and
samplevariance(x) for sample statistics. Any comments?
3. I'm planning to implement this for types float4, float8 and numeric. Any 
other types also? int[2,4,8] don't seem logical, as these would introduce 
serious rounding errors.

Let me know what you think,


Jeroen

Re: [HACKERS] Implementing STDDEV and VARIANCE

От

Tom Lane

Дата:

23 января 2000 г., 13:28:19

Jeroen van Vianen <jeroen@design.nl> writes:
> I'd like to implement stddev and variance aggregates in Postgres. This is a 
> long standing TODO item.

> 1. Add three columns to pg_aggregate for the additional third transition 
> function.

> Tom Lane wrote at Fri, 05 Jun 1998 11:24:04 -0400:
>> All that you need to implement this is room to keep two running
>> sums instead of one.  I haven't looked at pgsql's aggregate functions,
>> but I'd hope that the working state can be a struct not just a
>> single number.

> I saw no other way than adding another transition function and logic, as 
> this might break user-defined aggregates (are there any around?).

Yes, there are some, and no you do not need a third transition
function.  What you do need is a datatype holding two values that
you can use as the transition datatype, plus appropriate functions
for the transition functions.

The reason there are two transition functions at all is that it allows
some of the standard aggregate functions to be built using arithmetic
functions that exist anyway --- for example, float8 AVG is built from
float8 addition, float8 increment, and float8 divide, with only float8
increment being a function you wouldn't have anyway.  However, the
whole thing is really a kluge; nodeAvg.c has all sorts of weird little
hacks that are necessary to make AVG have the right behavior in boundary
conditions such as no-tuples-in.  (A blind application of float8 divide
would lead to a divide-by-zero exception in that case.)  These hacks
limit the ability of user-defined aggregates to control their behavior
at the boundary conditions.  Nor can an aggregate control its response
to NULL data values; that's hardwired into nodeAvg.c as well.

A cleaner solution would have just one transition function and one
transition data value, plus an optional finalization function that takes
only the one data value.  For AVG the transition data type would be a
two-field struct and the transition function would update both fields.
This would halve the function-call overhead per tuple.  We'd have to
provide specialized transition and finalization functions for AVG and
probably a couple of the other standard aggregates, but that would allow
us to rely on those functions to do the right things at the boundary
conditions; nodeAvg.c could stop foreclosing the choices.

I have been thinking about proposing such a change along with the
function manager rewrite that is now planned for 7.1.  That would be
a good time because user-defined aggregates would need to be revisited
anyway.  Also, if the transition functions are to determine the behavior
for no-tuples and for NULL data values, they had better be able to pass
and return NULLs cleanly; which depends on the function manager rewrite.

In short, I'd suggest thinking about implementing STDDEV with a
single transition function and transition data value.  You'll need
specialized functions for it anyway, so I don't see that you're saving
any work by proposing a third transition function.  What you will need
instead is a pg_type entry for the transition data type, but since that
data type needn't have any operators, there's not much work needed.

> 3. I'm planning to implement this for types float4, float8 and numeric. Any 
> other types also? int[2,4,8] don't seem logical, as these would introduce 
> serious rounding errors.

I'd suggest just two basic implementations, with float8 and numeric
internal calculations respectively.  Data of other numeric types can
be type-coerced to one of these (that might even happen automatically),
but the output would always be either float8 or numeric.  (I don't think
float4 has enough precision to generate reliable stddev numbers except
in very narrow conditions...)
        regards, tom lane

Re: [HACKERS] Implementing STDDEV and VARIANCE

От

Tom Lane

Дата:

23 января 2000 г., 13:57:19

I wrote:
> whole thing is really a kluge; nodeAvg.c has all sorts of weird little

For "nodeAvg.c" read "nodeAgg.c"; sorry for the momentary brain fade...
        regards, tom lane

Re: [HACKERS] Implementing STDDEV and VARIANCE

От

Jeroen van Vianen

Дата:

23 января 2000 г., 14:42:21

Tom Lane wrote:
> Jeroen van Vianen <jeroen@design.nl> writes:
> > I'd like to implement stddev and variance aggregates in Postgres. This is a
> > long standing TODO item.
> 
> > 1. Add three columns to pg_aggregate for the additional third transition
> > function.
> 
> > Tom Lane wrote at Fri, 05 Jun 1998 11:24:04 -0400:
> >> All that you need to implement this is room to keep two running
> >> sums instead of one.  I haven't looked at pgsql's aggregate functions,
> >> but I'd hope that the working state can be a struct not just a
> >> single number.
> 
> > I saw no other way than adding another transition function and logic, as
> > this might break user-defined aggregates (are there any around?).
> 
> Yes, there are some, and no you do not need a third transition
> function.  What you do need is a datatype holding two values that
> you can use as the transition datatype, plus appropriate functions
> for the transition functions.

So it might be better to have this type hold three values (n, sum(x) and
sum(x^2)) and only use one transition function to update all three
values at once and have the finalization function do the necessary
calculations.

> The reason there are two transition functions at all is that it allows
> some of the standard aggregate functions to be built using arithmetic
> functions that exist anyway --- for example, float8 AVG is built from
> float8 addition, float8 increment, and float8 divide, with only float8
> increment being a function you wouldn't have anyway.  However, the
> whole thing is really a kluge; nodeAvg.c has all sorts of weird little
> hacks that are necessary to make AVG have the right behavior in boundary
> conditions such as no-tuples-in.  (A blind application of float8 divide
> would lead to a divide-by-zero exception in that case.)  These hacks
> limit the ability of user-defined aggregates to control their behavior
> at the boundary conditions.  Nor can an aggregate control its response
> to NULL data values; that's hardwired into nodeAvg.c as well.

Yes, I saw these little hacks. And there are boundary conditions with
stddev and variance with no rows and one row (for sample stddev and
sample variance).

> A cleaner solution would have just one transition function and one
> transition data value, plus an optional finalization function that takes
> only the one data value.  For AVG the transition data type would be a
> two-field struct and the transition function would update both fields.
> This would halve the function-call overhead per tuple.  We'd have to
> provide specialized transition and finalization functions for AVG and
> probably a couple of the other standard aggregates, but that would allow
> us to rely on those functions to do the right things at the boundary
> conditions; nodeAvg.c could stop foreclosing the choices.

So you suggest changing all transition functions for 7.1 to keep all the
state they need?

> I have been thinking about proposing such a change along with the
> function manager rewrite that is now planned for 7.1.  That would be
> a good time because user-defined aggregates would need to be revisited
> anyway.  Also, if the transition functions are to determine the behavior
> for no-tuples and for NULL data values, they had better be able to pass
> and return NULLs cleanly; which depends on the function manager rewrite.

Are you suggesting also to change the lay-out of pg_attribute in 7.1 to
something like this and do updates for all built-in types and
aggregates?

aggname                
aggowner
aggtype
aggtranstype           [ n, sx, sx2 ]
agginitfunction        function that does ( n = 0, sx = 0.0, sx2 = 0.0 )
aggtransfunction       function that does ( n = n + 1, sx = sx + x,                        sx2 = sx2 + x * x )
aggfinalizefunction    function that returns (sx2 - (1/n) * sx * sx ) /
n

Might it be better for me to wait for 7.1 before implementing stddev and
variance?

> In short, I'd suggest thinking about implementing STDDEV with a
> single transition function and transition data value.  You'll need
> specialized functions for it anyway, so I don't see that you're saving
> any work by proposing a third transition function.  What you will need
> instead is a pg_type entry for the transition data type, but since that
> data type needn't have any operators, there's not much work needed.

OK, clear.

> > 3. I'm planning to implement this for types float4, float8 and numeric. Any
> > other types also? int[2,4,8] don't seem logical, as these would introduce
> > serious rounding errors.
> 
> I'd suggest just two basic implementations, with float8 and numeric
> internal calculations respectively.  Data of other numeric types can
> be type-coerced to one of these (that might even happen automatically),
> but the output would always be either float8 or numeric.  (I don't think
> float4 has enough precision to generate reliable stddev numbers except
> in very narrow conditions...)

OK, only float8 and numeric.


Jeroen

Re: [HACKERS] Implementing STDDEV and VARIANCE

От

Tom Lane

Дата:

23 января 2000 г., 23:28:27

Jeroen van Vianen <jeroen@design.nl> writes:
> aggname                
> aggowner
> aggtype
> aggtranstype           [ n, sx, sx2 ]
> agginitfunction        function that does ( n = 0, sx = 0.0, sx2 = 0.0 )
> aggtransfunction       function that does ( n = n + 1, sx = sx + x,
>                          sx2 = sx2 + x * x )
> aggfinalizefunction    function that returns (sx2 - (1/n) * sx * sx ) /
> n

Right, that's pretty much what I'm visualizing.  One minor detail: there
is not an "agginitfunction", there is an "agginitvalue".  So your
special datatype to hold n/sx/sx2 must have at least a typinput function
that can convert the text string held in pg_aggregate into the desired
internal form of the initial state.  (At least, that's how it's done
now.  Do you want to argue to change it?  As long as we're opening up
the AGG design for reconsideration, we could revisit that choice too.)

> Might it be better for me to wait for 7.1 before implementing stddev and
> variance?

Well, you will need to be pretty fast on the draw if you want to get it
into 7.0 release, since we will be freezing features for beta in a week.

But I see no reason that you couldn't implement STDDEV within the
existing framework; just ignore transfn2 and do it as above.  You might
have some problems with getting the desired response for zero or one
tuples, but there isn't any way to fix that within the current
framework :-(.  We have to do the function manager rewrite before you
can have control over when to return a NULL.  As long as you are willing
to live with that, you can have useful functionality now.
        regards, tom lane

Re: [HACKERS] Happy column dropping

От

Chris Bitmead

Дата:

23 января 2000 г., 23:42:26

Hannu Krosing wrote:

> I'm sure many people would appreciate it even without "preserving oid-s"
> as OID's are declared deprecated for (AFAIK) >2 years, 

OIDs should not be deprecated (never heard that one before). They will
become more important if and when postgres moves more toward being an
odbms. For reasons see any book on object theory and identity.

Re: [HACKERS] Happy column dropping

От

Bruce Momjian

Дата:

23 января 2000 г., 23:56:26

> Hannu Krosing wrote:
> 
> > I'm sure many people would appreciate it even without "preserving oid-s"
> > as OID's are declared deprecated for (AFAIK) >2 years, 
> 
> OIDs should not be deprecated (never heard that one before). They will
> become more important if and when postgres moves more toward being an
> odbms. For reasons see any book on object theory and identity.

Agreed.

--  Bruce Momjian                        |  http://www.op.net/~candle pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026

ORDBMS<->OID (Re: [HACKERS] Happy column dropping)

От

The Hermit Hacker

Дата:

23 января 2000 г., 23:56:27

On Mon, 24 Jan 2000, Chris Bitmead wrote:

> Hannu Krosing wrote:
> 
> > I'm sure many people would appreciate it even without "preserving oid-s"
> > as OID's are declared deprecated for (AFAIK) >2 years, 
> 
> OIDs should not be deprecated (never heard that one before). They will
> become more important if and when postgres moves more toward being an
> odbms. For reasons see any book on object theory and identity.

You are the second person that mentioned the ODBMS<->OID tie in ... when
you say ODBMS, is that the same as ORDBMS, or we talking a different
issue?

Want to elaborate?  Like, since we already have the 'core', I believe,
what does it take to continue that trend?

Marc G. Fournier                   ICQ#7615664               IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org

Re: [HACKERS] Implementing STDDEV and VARIANCE

От

Don Baccus

Дата:

24 января 2000 г., 00:06:27

At 11:27 PM 1/23/00 -0500, Tom Lane wrote:

>Right, that's pretty much what I'm visualizing.  One minor detail: there
>is not an "agginitfunction", there is an "agginitvalue".  So your
>special datatype to hold n/sx/sx2 must have at least a typinput function
>that can convert the text string held in pg_aggregate into the desired
>internal form of the initial state.  (At least, that's how it's done
>now.  Do you want to argue to change it?  As long as we're opening up
>the AGG design for reconsideration, we could revisit that choice too.)

At the moment I have a hard time visualizing an aggregate function where
a constant initializer wouldn't serve, but ... what would be the cost of
the generalization?  It would only be called once per query or subquery
containing the aggregate, right?  

If generalizing it can be done over a latte or mocha, perhaps it's worth
doing.  If it takes as long as drinking a pint of Guiness, perhaps not :)

(I mean, let's get into meaningful metrics here!)

- Don Baccus, Portland OR <dhogaza@pacifier.com> Nature photos, on-line guides, Pacific Northwest Rare Bird Alert
Serviceand other goodies at http://donb.photo.net.

Re: ORDBMS<->OID (Re: [HACKERS] Happy column dropping)

От

Chris Bitmead

Дата:

24 января 2000 г., 01:00:27

> You are the second person that mentioned the ODBMS<->OID tie in ... when
> you say ODBMS, is that the same as ORDBMS, or we talking a different
> issue?
> 
> Want to elaborate?  Like, since we already have the 'core', I believe,
> what does it take to continue that trend?

Hi, yes I am very interested in this issue, and ORDBMS and ODBMS are
rather different. 

A long time ago, I wrote up a web page about this issue here...
http://www.tech.com.au/postgres/

which should tell you all my thoughts on this issue. Basicly I can
contribute a LOT of work to the postgres project, but as far as hacking
the back end to achieve it - the learning curve is steep compared to my
available time. But hacking a client side stuff, I could handle.

Enjoy,
Chris Bitmead.

TOAST suggestion...

От

Chris Bitmead

Дата:

24 января 2000 г., 02:29:28

Can I make a suggestion for toast? I don't know how much this is thought
about yet, but it seems like it would be a good idea if large object
support be re-implemented in terms of TOAST (when it is available).
Because current
large objects suck so much.

Can I suggest that the TOAST system should be implemented to allow for
random access?
So for example, to retrieve a whole data member, of course it looks
normal...

SELECT xxx from yyy;

But to retrieve a particular chunk....

SELECT PORTION(xxx, 65535, 131071) from yyy; 

to retrieve the 2nd 64k chunk.

This could be useful for many sorts of apps.

Re: [HACKERS] Implementing STDDEV and VARIANCE

От

Jeroen van Vianen

Дата:

24 января 2000 г., 05:28:30

At 11:27 PM 1/23/00 -0500, Tom Lane wrote:
>Jeroen van Vianen <jeroen@design.nl> writes:
> > aggname
> > aggowner
> > aggtype
> > aggtranstype           [ n, sx, sx2 ]
> > agginitfunction        function that does ( n = 0, sx = 0.0, sx2 = 0.0 )
> > aggtransfunction       function that does ( n = n + 1, sx = sx + x,
> >                          sx2 = sx2 + x * x )
> > aggfinalizefunction    function that returns (sx2 - (1/n) * sx * sx ) /
> > n
>
>Right, that's pretty much what I'm visualizing.  One minor detail: there
>is not an "agginitfunction", there is an "agginitvalue".  So your
>special datatype to hold n/sx/sx2 must have at least a typinput function
>that can convert the text string held in pg_aggregate into the desired
>internal form of the initial state.  (At least, that's how it's done
>now.  Do you want to argue to change it?  As long as we're opening up
>the AGG design for reconsideration, we could revisit that choice too.)

I would suggest supplying an initfunction that initializes the datatype 
that holds n/sx/sx2 so you're able to set individual members to NULL if so 
desired. I also won't need to implement typinput for all required aggregate 
types, one small headache less ;-)

count --> int4
min/max --> basetype
sum --> basetype
avg --> basetype, n
stddev, variance --> n, basetype, basetype

> > Might it be better for me to wait for 7.1 before implementing stddev and
> > variance?
>
>Well, you will need to be pretty fast on the draw if you want to get it
>into 7.0 release, since we will be freezing features for beta in a week.

True.

>But I see no reason that you couldn't implement STDDEV within the
>existing framework; just ignore transfn2 and do it as above.  You might
>have some problems with getting the desired response for zero or one
>tuples, but there isn't any way to fix that within the current
>framework :-(.  We have to do the function manager rewrite before you
>can have control over when to return a NULL.  As long as you are willing
>to live with that, you can have useful functionality now.

The problem with zero or one rows is pretty important IMO if you want to 
implement stddev and variance for both population and sample. You won't be 
able to explain the difference in outcome if you don't do it right.

Let me wait for you to overhaul the fmgr code and do all the aggregate 
stuff right in one sweep. Only thing is: how do we deal with current 
user-defined aggregates?

At 09:02 PM 1/23/00 -0800, Don Baccus wrote:
>At 11:27 PM 1/23/00 -0500, Tom Lane wrote:
> >Right, that's pretty much what I'm visualizing.  One minor detail: there
> >is not an "agginitfunction", there is an "agginitvalue".  So your
> >special datatype to hold n/sx/sx2 must have at least a typinput function
> >that can convert the text string held in pg_aggregate into the desired
> >internal form of the initial state.  (At least, that's how it's done
> >now.  Do you want to argue to change it?  As long as we're opening up
> >the AGG design for reconsideration, we could revisit that choice too.)
>
>At the moment I have a hard time visualizing an aggregate function where
>a constant initializer wouldn't serve, but ... what would be the cost of
>the generalization?  It would only be called once per query or subquery
>containing the aggregate, right?

Initializer functions for count need to return 0, for min, max, avg, sum, 
stddev and variance they need to set individual members to NULL (at least 
that's how I see it now). A function returning this (with the new fmgr 
code) would be very easy to implement (I hope ;-) ).

I'll hold my breath until the dust settles and we're starting 7.1.


Jeroen

Re: [HACKERS] Implementing STDDEV and VARIANCE

От

Tom Lane

Дата:

24 января 2000 г., 10:31:37

Jeroen van Vianen <jeroen@design.nl> writes:
>> Right, that's pretty much what I'm visualizing.  One minor detail: there
>> is not an "agginitfunction", there is an "agginitvalue".  So your
>> special datatype to hold n/sx/sx2 must have at least a typinput function
>> that can convert the text string held in pg_aggregate into the desired
>> internal form of the initial state.  (At least, that's how it's done
>> now.  Do you want to argue to change it?  As long as we're opening up
>> the AGG design for reconsideration, we could revisit that choice too.)

> I would suggest supplying an initfunction that initializes the datatype 
> that holds n/sx/sx2 so you're able to set individual members to NULL if so 
> desired. I also won't need to implement typinput for all required aggregate 
> types, one small headache less ;-)

I've been thinking about this and have come to the conclusion that using
an initial value string and a typinput function is still the right design,
compared to using a parameterless initializer function.  See, that way
you have a shot at reusing the same transition type (and typinput code)
for several different aggregates with slightly different initial
condition requirements, whereas with the initializer-function approach,
you have no choice but to write a separate initializer function for each
aggregate initial condition you need.

I have been thinking that we could save some effort (at a trivial cost
in memory) by defining a datatype or two that is specifically intended
to be an aggregate transition state datatype.  For example, a struct
containing three or four float8 fields and as many bool fields would
serve nicely for AVG, STDDEV, and probably some other aggregates; some
of them wouldn't use all the fields, but so what?  The only support code
this datatype would need would be a typinput function to convert a
string from pg_aggregate into initial struct contents.  By doing it that
way, the same datatype can support several aggregates with different
initial condition requirements, without having to write a separate
initializer function for each one.

I think the point about initializing struct contents to NULL is a red
herring.  For basic C types like float8, there isn't such a thing as
NULL really; you have to have a separate flag field or count field to
tell you that you've seen no input yet.  That can be initialized equally
well by a typinput function or by an initializer function.  But the
typinput function gets to have a parameter taken from pg_aggregate;
an initializer function would not.

> Let me wait for you to overhaul the fmgr code and do all the aggregate 
> stuff right in one sweep. Only thing is: how do we deal with current 
> user-defined aggregates?

Well, a user aggregate that only used transfn1, or only transfn2, would
convert directly.  This scheme would break user aggregates that used
both, which is why I'm running it up the flagpole early --- to see if
anyone complains.
        regards, tom lane

unsubscribe

От

"Nguyen, Thuan X"

Дата:

24 января 2000 г., 10:45:34

unsubscribe

Re: Implementing STDDEV and VARIANCE

От

Bruce Momjian

Дата:

07 июня 2000 г., 22:57:27

Has this gone anywhere?

> I'd like to implement stddev and variance aggregates in Postgres. This is a 
> long standing TODO item.
> 
> There already has been some discussion on implementing this, see 
> http://www.postgresql.org/mhonarc/pgsql-hackers/1998-06/msg00175.html
> 
> There are two definitions for standard deviation and variance:
>                                          _
> population variance = sigma^2 = SUM(X - X)^2 / N
> 
> population stddev = sqrt(population variance)
>                                  _
> sample variance = s^2 = SUM(X - X)^2 / (N-1)
> 
> sample stddev = sqrt(sample variance)
> 
> These statistics can be calculated in one pass when three variables are 
> kept when scanning through the rows, n, sum(x), sum(x^2). Currently, only 
> two variables are kept.
> 
> E.g. avg() is calculated as follows:
> 
> sx = 0
> n = 0
> for every row {
>     sx = sx + value in row                   // transition function 1
>     n = n+1                                  // transition function 2
> }
> avg = sum(x) / n
> 
> stddev / variance might be calculated as follows:
> 
> sx = 0
> n = 0
> sx2 = 0
> for every row {
>     sx = sx + value in row                   // transition function 1
>     n = n+1                                  // transition function 2
>     sx2 = sx2 + value in row^2               // transition function 3
> }
> var = (1/n) * (sx2 - (1/n) * sx^2)              // Population
> 
> or
> 
> var = (1/(n-1)) * (sx2 - (1/n) * sx^2)          // Sample
> 
> and
> 
> stddev = sqrt(var)
> 
> I've looked through the code and the following things need to be implemented:
> 
> 1. Add three columns to pg_aggregate for the additional third transition 
> function.
> 
> Tom Lane wrote at Fri, 05 Jun 1998 11:24:04 -0400:
>  >All that you need to implement this is room to keep two running
>  >sums instead of one.  I haven't looked at pgsql's aggregate functions,
>  >but I'd hope that the working state can be a struct not just a
>  >single number.
> 
> I saw no other way than adding another transition function and logic, as 
> this might break user-defined aggregates (are there any around?).
> 
> 2. Add logic to nodeAgg.c to execute the third transition function and 
> finalize function with three rather than two parameters
> 3. Add functions f(a,b) that returns a + b^2 for selected types
> 4. Add four finalize functions to calculate the variance / stddev
> 5. Update the code for create aggregate, to include the definition of the 
> third transition function
> 6. Update the documentation
> 
> My questions are:
> 1. Is this the correct way to continue? What am I missing? Any errors in my 
> reasoning?
> 2. I am proposing the names stddev(x) and variance(x) for population and 
> samplestddev(x) and
> samplevariance(x) for sample statistics. Any comments?
> 3. I'm planning to implement this for types float4, float8 and numeric. Any 
> other types also? int[2,4,8] don't seem logical, as these would introduce 
> serious rounding errors.
> 
> Let me know what you think,
> 
> 
> Jeroen
> 
> ************
> 


--  Bruce Momjian                        |  http://www.op.net/~candle pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026

Re: Implementing STDDEV and VARIANCE

От

JanWieck@t-online.de (Jan Wieck)

Дата:

09 июня 2000 г., 06:46:48

Bruce Momjian wrote:
   I  created  all that sometimes back. Dunno why never added it   to contrib.  Will post it another day.

Jan

> Has this gone anywhere?
>
> > I'd like to implement stddev and variance aggregates in Postgres. This is a
> > long standing TODO item.
> >
> > There already has been some discussion on implementing this, see
> > http://www.postgresql.org/mhonarc/pgsql-hackers/1998-06/msg00175.html
> >
> > There are two definitions for standard deviation and variance:
> >                                          _
> > population variance = sigma^2 = SUM(X - X)^2 / N
> >
> > population stddev = sqrt(population variance)
> >                                  _
> > sample variance = s^2 = SUM(X - X)^2 / (N-1)
> >
> > sample stddev = sqrt(sample variance)
> >
> > These statistics can be calculated in one pass when three variables are
> > kept when scanning through the rows, n, sum(x), sum(x^2). Currently, only
> > two variables are kept.
> >
> > E.g. avg() is calculated as follows:
> >
> > sx = 0
> > n = 0
> > for every row {
> >  sx = sx + value in row                   // transition function 1
> >  n = n+1                                  // transition function 2
> > }
> > avg = sum(x) / n
> >
> > stddev / variance might be calculated as follows:
> >
> > sx = 0
> > n = 0
> > sx2 = 0
> > for every row {
> >  sx = sx + value in row                   // transition function 1
> >  n = n+1                                  // transition function 2
> >  sx2 = sx2 + value in row^2               // transition function 3
> > }
> > var = (1/n) * (sx2 - (1/n) * sx^2)              // Population
> >
> > or
> >
> > var = (1/(n-1)) * (sx2 - (1/n) * sx^2)          // Sample
> >
> > and
> >
> > stddev = sqrt(var)
> >
> > I've looked through the code and the following things need to be implemented:
> >
> > 1. Add three columns to pg_aggregate for the additional third transition
> > function.
> >
> > Tom Lane wrote at Fri, 05 Jun 1998 11:24:04 -0400:
> >  >All that you need to implement this is room to keep two running
> >  >sums instead of one.  I haven't looked at pgsql's aggregate functions,
> >  >but I'd hope that the working state can be a struct not just a
> >  >single number.
> >
> > I saw no other way than adding another transition function and logic, as
> > this might break user-defined aggregates (are there any around?).
> >
> > 2. Add logic to nodeAgg.c to execute the third transition function and
> > finalize function with three rather than two parameters
> > 3. Add functions f(a,b) that returns a + b^2 for selected types
> > 4. Add four finalize functions to calculate the variance / stddev
> > 5. Update the code for create aggregate, to include the definition of the
> > third transition function
> > 6. Update the documentation
> >
> > My questions are:
> > 1. Is this the correct way to continue? What am I missing? Any errors in my
> > reasoning?
> > 2. I am proposing the names stddev(x) and variance(x) for population and
> > samplestddev(x) and
> > samplevariance(x) for sample statistics. Any comments?
> > 3. I'm planning to implement this for types float4, float8 and numeric. Any
> > other types also? int[2,4,8] don't seem logical, as these would introduce
> > serious rounding errors.
> >
> > Let me know what you think,
> >
> >
> > Jeroen
> >
> > ************
> >
>
>
> --
>   Bruce Momjian                        |  http://www.op.net/~candle
>   pgman@candle.pha.pa.us               |  (610) 853-3000
>   +  If your life is a hard drive,     |  830 Blythe Avenue
>   +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
>


--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #

Re: Implementing STDDEV and VARIANCE

От

Bruce Momjian

Дата:

29 сентября 2000 г., 22:27:22

Jeroen, not sure if you were involved in this, but standard deviation is
in the current development tree, and will be released in 7.1 in a few
months.

> I'd like to implement stddev and variance aggregates in Postgres. This is a 
> long standing TODO item.
> 
> There already has been some discussion on implementing this, see 
> http://www.postgresql.org/mhonarc/pgsql-hackers/1998-06/msg00175.html
> 
> There are two definitions for standard deviation and variance:
>                                          _
> population variance = sigma^2 = SUM(X - X)^2 / N
> 
> population stddev = sqrt(population variance)
>                                  _
> sample variance = s^2 = SUM(X - X)^2 / (N-1)
> 
> sample stddev = sqrt(sample variance)
> 
> These statistics can be calculated in one pass when three variables are 
> kept when scanning through the rows, n, sum(x), sum(x^2). Currently, only 
> two variables are kept.
> 
> E.g. avg() is calculated as follows:
> 
> sx = 0
> n = 0
> for every row {
>     sx = sx + value in row                   // transition function 1
>     n = n+1                                  // transition function 2
> }
> avg = sum(x) / n
> 
> stddev / variance might be calculated as follows:
> 
> sx = 0
> n = 0
> sx2 = 0
> for every row {
>     sx = sx + value in row                   // transition function 1
>     n = n+1                                  // transition function 2
>     sx2 = sx2 + value in row^2               // transition function 3
> }
> var = (1/n) * (sx2 - (1/n) * sx^2)              // Population
> 
> or
> 
> var = (1/(n-1)) * (sx2 - (1/n) * sx^2)          // Sample
> 
> and
> 
> stddev = sqrt(var)
> 
> I've looked through the code and the following things need to be implemented:
> 
> 1. Add three columns to pg_aggregate for the additional third transition 
> function.
> 
> Tom Lane wrote at Fri, 05 Jun 1998 11:24:04 -0400:
>  >All that you need to implement this is room to keep two running
>  >sums instead of one.  I haven't looked at pgsql's aggregate functions,
>  >but I'd hope that the working state can be a struct not just a
>  >single number.
> 
> I saw no other way than adding another transition function and logic, as 
> this might break user-defined aggregates (are there any around?).
> 
> 2. Add logic to nodeAgg.c to execute the third transition function and 
> finalize function with three rather than two parameters
> 3. Add functions f(a,b) that returns a + b^2 for selected types
> 4. Add four finalize functions to calculate the variance / stddev
> 5. Update the code for create aggregate, to include the definition of the 
> third transition function
> 6. Update the documentation
> 
> My questions are:
> 1. Is this the correct way to continue? What am I missing? Any errors in my 
> reasoning?
> 2. I am proposing the names stddev(x) and variance(x) for population and 
> samplestddev(x) and
> samplevariance(x) for sample statistics. Any comments?
> 3. I'm planning to implement this for types float4, float8 and numeric. Any 
> other types also? int[2,4,8] don't seem logical, as these would introduce 
> serious rounding errors.
> 
> Let me know what you think,
> 
> 
> Jeroen
> 
> ************
> 


--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026

Re: Implementing STDDEV and VARIANCE

От

Jeroen van Vianen

Дата:

02 октября 2000 г., 04:18:06

At 22:27 9/29/00 -0400, you wrote:
>Jeroen, not sure if you were involved in this, but standard deviation is
>in the current development tree, and will be released in 7.1 in a few
>months.

Yes, I noticed. And I'm happy to see it has been implemented without my 
help, as I'm desperately running out of time to do any coding on Postgres.

Thanks,

Jeroen

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Implementing STDDEV and VARIANCE