Re: Weighted Stats

From: David Fetter
Subject: Re: Weighted Stats
Message-ID: 20160319063437.GD1950@fetter.org
In reply to: Re: Weighted Stats  (Jeff Janes <jeff.janes@gmail.com>)
Responses: Re: Weighted Stats  (Jeff Janes <jeff.janes@gmail.com>)
           Re: Weighted Stats  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List: pgsql-hackers

On Fri, Mar 18, 2016 at 06:12:12PM -0700, Jeff Janes wrote:
> On Tue, Mar 15, 2016 at 8:36 AM, David Fetter <david@fetter.org> wrote:
> >
> > Please find attached a patch that uses the float8 version to cover the
> > numeric types.
> 
> Is there a well-defined meaning for having a negative weight?  If no,
> should it be disallowed?

Opinions on this appear to vary.  The Wikipedia article below defines
weights as non-negative, while the GSL manual it refers to only uses
non-zero weights.

https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Mathematical_definition
https://www.gnu.org/software/gsl/manual/html_node/Weighted-Samples.html

I'm not sure which, if either, is authoritative, but I could
certainly write a variant for each assumption.

The assumption the two sources share is that a row with zero weight
takes no part in the calculation.  The previously submitted code
implements that assumption.

> I don't know what I was expecting,  but not this:
> 
> select weighted_avg(x,10000000-2*x) from generate_series(1,10000000) f(x);
>    weighted_avg
> ------------------
>  16666671666717.1

My guess is that negative weights, should it turn out that we ought
to allow them, can produce bizarre outcomes like this one.
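
To make the quoted result easier to reason about, here is a minimal
sketch in Python (not the patch's C code) of a weighted mean computed
as sum(x*w)/sum(w) in exact arithmetic, with zero-weight rows skipped.
The function names and the small n are illustrative assumptions.  With
weights n - 2*x over 1..n the weights sum to -n, so positive and
negative weights nearly cancel and the quotient lands far outside the
range of x.  The closed form (n+1)(n+2)/6 gives 16666671666667 for
n = 10000000, which is close to the float8 output quoted above.

```python
from fractions import Fraction

def weighted_avg(pairs):
    """Exact weighted mean sum(x*w)/sum(w); zero-weight rows are skipped."""
    num = 0
    den = 0
    for x, w in pairs:
        if w == 0:
            continue  # a zero weight takes no part in the calculation
        num += x * w
        den += w
    return Fraction(num, den)

def quoted_query(n):
    # Mirrors weighted_avg(x, n - 2*x) over generate_series(1, n).
    return weighted_avg((x, n - 2 * x) for x in range(1, n + 1))

n = 1000  # small stand-in for 10000000, chosen only to keep the run short
weight_sum = sum(n - 2 * x for x in range(1, n + 1))
result = quoted_query(n)

print(weight_sum)                       # total weight, equal to -n
print(result, (n + 1) * (n + 2) // 6)   # exact result and the closed form
```

Whether such an input should be rejected or simply computed as-is is
the open question; the sketch only shows that the arithmetic behaves
as written.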

> Also, I think it might not give the correct answer even without
> negative weights:
> 
> create table foo as select floor(random()*10000)::int val from
> generate_series(1,10000000);
> 
> create table foo2 as select val, count(*) from foo group by val;
> 
> Shouldn't these then give the same result:
> 
> select stddev_samp(val) from foo;
>     stddev_samp
> -------------------
>  2887.054977297105
> 
> select weighted_stddev_samp(val,count) from foo2;
>  weighted_stddev_samp
> ----------------------
>      2887.19919651336
> 
> The 5th digit seems too early to be seeing round-off error.

Please pardon me if I've misunderstood, but you appear to be assuming
that
   SELECT val, count(*) FROM foo GROUP BY val

will produce precisely identical count(*)s for every row.  It almost
certainly won't, and that would produce the difference you see above.

What have I misunderstood?
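
For reference, the comparison in the quoted test can be sketched
outside the database.  The Python below is an illustration under
stated assumptions, not the patch's code: the toy data and function
names are made up, and this message does not say which divisor the
patch uses.  It computes a weighted sample standard deviation two
ways: treating each weight as a repeat count (divisor sum(w) - 1), and
using the estimator documented in the GSL manual linked above (divisor
sum(w) - sum(w^2)/sum(w)).

```python
import math
import statistics

def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def weighted_sum_of_squares(values, weights):
    mean = weighted_mean(values, weights)
    return sum(w * (v - mean) ** 2 for v, w in zip(values, weights))

def stddev_frequency(values, weights):
    """Weights are repeat counts; divisor is sum(w) - 1."""
    ss = weighted_sum_of_squares(values, weights)
    return math.sqrt(ss / (sum(weights) - 1))

def stddev_gsl_style(values, weights):
    """GSL-manual estimator; divisor is sum(w) - sum(w^2)/sum(w)."""
    ss = weighted_sum_of_squares(values, weights)
    total_w = sum(weights)
    sum_w_squared = sum(w * w for w in weights)
    return math.sqrt(ss / (total_w - sum_w_squared / total_w))

# Hypothetical toy data: (val, count) pairs standing in for foo2.
vals = [1, 2, 3, 4]
counts = [3, 1, 2, 4]
# The same data with each val repeated count times, standing in for foo.
expanded = [v for v, c in zip(vals, counts) for _ in range(c)]

plain = statistics.stdev(expanded)
freq = stddev_frequency(vals, counts)
gsl_style = stddev_gsl_style(vals, counts)

print(plain, freq, gsl_style)
```

On this toy data the repeat-count form matches the plain sample
standard deviation of the expanded rows, while the GSL-style form
gives a different value, so the two conventions are not
interchangeable when the weights are counts.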

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


