Discussion: Perl Standard Deviation function is wrong!
Hi,

First of all I would like to thank you for your work on the Statistics module.
Unfortunately, a lot of books differ in their formulas for variance and stdev.
In Europe the corrected definition below, where stdev is not simply the sqrt
of variance, seems to be more popular. For large populations (>400) the two
calculations give almost the same result, but for small populations (like 5)
the calculation below gives a different one.

[Hackers] please forget my last mail on this subject. It was wrong.

Tanx
Andreas Zeugswetter

David Gould wrote:
>The Perl Module "Statistics/Descriptive" has on the fly variance calculation.
>
>sub add_data {
>  my $self = shift;  ##Myself
>  my $oldmean;
>  my ($min,$mindex,$max,$maxdex);
>
>  ##Take care of appending to an existing data set
>  $min = (defined ($self->{min}) ? $self->{min} : $_[0]);
>  $max = (defined ($self->{max}) ? $self->{max} : $_[0]);
>  $maxdex = $self->{maxdex} || 0;
>  $mindex = $self->{mindex} || 0;
>
>  ##Calculate new mean, pseudo-variance, min and max;
>  foreach (@_) {
>    $oldmean = $self->{mean};
>    $self->{sum} += $_;
>    $self->{count}++;
>    if ($_ >= $max) {
>      $max = $_;
>      $maxdex = $self->{count}-1;
>    }
>    if ($_ <= $min) {
>      $min = $_;
>      $mindex = $self->{count}-1;
>    }
>    $self->{mean} += ($_ - $oldmean) / $self->{count};
>    $self->{pseudo_variance} += ($_ - $oldmean) * ($_ - $self->{mean});
>  }
>
>  $self->{min} = $min;
>  $self->{mindex} = $mindex;
>  $self->{max} = $max;
>  $self->{maxdex} = $maxdex;
>  $self->{sample_range} = $self->{max} - $self->{min};
>  if ($self->{count} > 1) {
>    $self->{variance} = $self->{pseudo_variance} / ($self->{count} -1);
>    $self->{standard_deviation} = sqrt( $self->{variance});

Most books state:
    $self->{variance} = $self->{pseudo_variance} / $self->{count};
    $self->{standard_deviation} = sqrt( $self->{pseudo_variance} / ( $self->{count} - 1 ));

>  }
>  return 1;
>}
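
The quoted add_data keeps a running sum of squared deviations in pseudo_variance. A minimal sketch of that update, with both divisors applied to the same total, might look like the following. The helper name running_stats and the sample data are made up for illustration and are not part of Statistics::Descriptive.

```perl
use strict;
use warnings;

# Single-pass accumulation in the style of the quoted add_data.
# running_stats() is a hypothetical helper, not a Statistics::Descriptive method.
sub running_stats {
    my @data = @_;
    my ($count, $mean, $pseudo_variance) = (0, 0, 0);
    foreach my $x (@data) {
        my $oldmean = $mean;
        $count++;
        # Update the running mean, then add this value's contribution
        # to the sum of squared deviations.
        $mean            += ($x - $oldmean) / $count;
        $pseudo_variance += ($x - $oldmean) * ($x - $mean);
    }
    return ($count, $mean, $pseudo_variance);
}

# Small data set (5 values), where the choice of divisor matters most.
my ($n, $mean, $ss) = running_stats(1, 2, 3, 4, 5);

my $var_n   = $ss / $n;          # divide by count
my $var_nm1 = $ss / ($n - 1);    # divide by count - 1

printf "n=%d mean=%.4f sum_sq=%.4f\n", $n, $mean, $ss;
printf "divide by n:   variance=%.4f stdev=%.4f\n", $var_n,   sqrt($var_n);
printf "divide by n-1: variance=%.4f stdev=%.4f\n", $var_nm1, sqrt($var_nm1);
```

For the values 1 to 5 the accumulated sum of squared deviations is 10, so dividing by count gives a variance of 2 and dividing by count - 1 gives 2.5. In the usual textbook definitions the standard deviation is the square root of the variance computed with the same divisor, whichever divisor is chosen.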
   >Variance is just square of std. dev, no?

   No ! Stdev is divided by count, Variance by (count - 1)

I think the difference really has to do with what you are calculating.
If you want the std. dev./var. of the data THEMSELVES, divide by the
count. If you want an estimate about the properties of the POPULATION
from which the data were sampled, divide by count-1. People have
needs for both in different circumstances.

Perhaps there needs to be two versions, or a function argument, to
distinguish the two uses, both of which are legitimate.

Cheers,
Brook
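
The "function argument" idea could be sketched roughly as below. This is only an illustration: the subroutine names, the array-reference interface, and the flag are assumptions and do not reflect the actual Statistics::Descriptive API. The default keeps the count - 1 divisor used in the quoted add_data.

```perl
use strict;
use warnings;

# Hypothetical interface: a true second argument asks for the statistics
# of the data themselves (divide by count); the default divides by count - 1.
sub variance {
    my ($data, $of_data) = @_;
    my $n = @$data;
    my $divisor = $of_data ? $n : $n - 1;
    return if $divisor < 1;    # not enough values for this divisor

    my $sum = 0;
    $sum += $_ for @$data;
    my $mean = $sum / $n;

    # Second pass: sum of squared deviations from the mean.
    my $ss = 0;
    $ss += ($_ - $mean) ** 2 for @$data;

    return $ss / $divisor;
}

# Standard deviation as the square root of the variance with the same divisor.
sub standard_deviation {
    my $var = variance(@_);
    return defined $var ? sqrt($var) : undef;
}

my $data = [2, 4, 4, 4, 5, 5, 7, 9];
printf "data themselves: var=%.4f sd=%.4f\n",
    scalar variance($data, 1), standard_deviation($data, 1);
printf "estimate:        var=%.4f sd=%.4f\n",
    scalar variance($data), standard_deviation($data);
```

With these eight values the mean is 5 and the sum of squared deviations is 32, so the flag gives a variance of 4 and a standard deviation of 2, while the default gives 32/7 for the variance. Keeping the existing behavior as the default means scripts that call the functions with no extra argument would see no change.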
Brook Milligan wrote:
>
>    >Variance is just square of std. dev, no?
>
>    No ! Stdev is divided by count, Variance by (count - 1)
>
> I think the difference really has to do with what you are calculating.
> If you want the std. dev./var. of the data THEMSELVES, divide by the
> count. If you want an estimate about the properties of the POPULATION
> from which the data were sampled, divide by count-1. People have
> needs for both in different circumstances.
>
> Perhaps there needs to be two versions, or a function argument, to
> distinguish the two uses, both of which are legitimate.

Gentlemen,

First let me apologize if this conversation has been taking place in the
Perl newsgroups. You've caught me at a time when I'm sans news reader.
(I could use Netscape, but .... <shudder> and I'd be ignored by most of
the gurus in the group).

Back to the topic at hand. The module states its references for the
statistical formulae as well as its methods of calculation, so you should
always know what you're getting.

I haven't done intensive statistics for a long time. I inherited the
module from Jason Kastner to add more methods to it and to see if I could
make some changes to the interface. Since then, I've released several
fixes for bugs caused by those changes. If the public demands more
statistics, then I'll make it so.

I'm a little leery of making changes without having some hard references.
If any of you would like to send me some (I'll be tracking them down,
too!) I'd appreciate it. Once I have that warm fuzzy that I'm not just
inventing mathematics, then I'll change the methods for standard
deviation and variance to accept a single argument that causes them to
give the DATA statistics instead of the population statistics.

I can't see overhauling the default behavior and forcing people to
rewrite scripts already in place. It made them angry enough when I
changed the OO interface...

I look forward to hearing from you, or having results to share with you,
soon!

Colin Kuskie

P.S. I recently changed jobs. My new email address is: ckuskie@cadence.com
A new release will give me the excuse to change the module's documentation
to reflect that.