Re: BUG #15307: Low numerical precision of (Co-) Variance
От | Dean Rasheed |
---|---|
Тема | Re: BUG #15307: Low numerical precision of (Co-) Variance |
Дата | |
Msg-id | CAEZATCXhKrKbrGX8iPPzXZwPq8xEqkJgv9_Ga8ynJo3A=E19DQ@mail.gmail.com обсуждение исходный текст |
Ответ на | BUG #15307: Low numerical precision of (Co-) Variance (PG Bug reporting form <noreply@postgresql.org>) |
Ответы |
Re: BUG #15307: Low numerical precision of (Co-) Variance
|
Список | pgsql-bugs |
On 1 August 2018 at 14:35, PG Bug reporting form <noreply@postgresql.org> wrote: > Numerical precision of variance computations in PostgreSQL is too low. > > SELECT VAR_SAMP(x::float8), COVAR_SAMP(x, x), VAR_SAMP(x) > FROM (SELECT 1000000.01 as x UNION SELECT 999999.99 as x) AS x > > The first two give the low-precision answer 0.000244140625 instead of > 0.0002. Interestingly enough, VAR_SAMP(x) is okay ... For a number of those statistical aggregates, PostgreSQL provides 2 implementations -- one implemented using double precision floating point arithmetic, which is much faster, but necessarily less accurate and possibly platform-dependent; and one implemented using the arbitrary precision numeric datatype, which will return much more accurate results. For any input datatypes other than floating point, you will automatically get the latter, which is what you're seeing with var_samp(x), when you're not explicitly casting the input to float8. However, all-the 2-argument aggregates such as corr() and covar_pop/samp() currently only have floating point implementations, and suffer from the problem you report, which I agree, is not great. If we can easily improve the accuracy of those aggregates, then I think it is worthwhile. Using a two pass approach isn't really an option, given the way that aggregates work in PostgreSQL, however, implementing Welford's algorithm looks to be quite straightforward. I had a quick play and I found that it fixed the accuracy problem with no noticeable performance penalty -- there are a few extra cycles in the accumulator functions, but compared to the overall per-tuple overhead, that appears to be negligible. I'll post something shortly. Regards, Dean
В списке pgsql-bugs по дате отправления: