Re: [HACKERS] Standard Deviation function.
От | Tom Lane |
---|---|
Тема | Re: [HACKERS] Standard Deviation function. |
Дата | |
Msg-id | 28715.897060244@sss.pgh.pa.us обсуждение исходный текст |
Ответ на | Re: [HACKERS] Standard Deviation function. (Andreas Zeugswetter <andreas.zeugswetter@telecom.at>) |
Список | pgsql-hackers |
Andreas Zeugswetter <andreas.zeugswetter@telecom.at> writes: > Wow, this is it. But as I said, the above line is wrong (By the way: > this is a very common mistake). > It should read: > $self->{standard_deviation} = sqrt( $self->{pseudo_variance} / $self->{count} ) > Note: The - 1 is missing The formula with N-1 in the divisor is correct for the "sample standard deviation". That is what you use when your N data points represent a sample from a larger population, and you want to estimate the standard deviation of the whole population. If your N data points in fact are the *whole* population of interest, then you calculate the "population standard deviation" which has just N in the divisor. So both versions of the formula are correct depending on the situation, and you really ought to provide both. (To justify the difference intuitively: if you have exactly one data point, and it is the *whole* population, then the mean equals the data value and the standard deviation is zero. That is what you get with N in the divisor. But if your one data point is a sample from a larger population, you cannot estimate the population's standard deviation; you need more data. The N-1 equation gives 0/0 in this case, correctly signifying that the value is indeterminate.) I think the Perl code given earlier in the thread pretty much sucks from a numerical accuracy point of view. The running mean calculation suffers from accumulation of errors, and that propagates into the pseudo-variance in a big way. It's particularly bad if the data is tightly clustered about the mean; the code ends up doing lots of subtractions of nearly equal values. The accepted way to do sample standard deviation in one pass is this: STDDEV = SQRT( (N*SIGMA(Xi^2) - SIGMA(Xi)^2) / (N*(N-1)) ) where N is the number of data points and SIGMA(Xi) means the sum of the data values Xi. You keep running sums of Xi and Xi^2 as you pass over the data, then you apply the above equation once at the end. (For population standard deviation, you use N^2 as the denominator. For variance, you just leave off the SQRT().) All that you need to implement this is room to keep two running sums instead of one. I haven't looked at pgsql's aggregate functions, but I'd hope that the working state can be a struct not just a single number. regards, tom lane
В списке pgsql-hackers по дате отправления: