Обсуждение: Aggregate Function corr does not always return the correct value
Hi!
correlation coefficient with the CORR function yielded such weird
results. After a little analysis, it was discovered that they were
calculating the correlation coefficient for two sets, one of which is
more or less random and the other of which is simply a set of constant
values (0.09 if that matters). As a result, they were attaining
unexpected results. However, as far as I am aware, they should have
received NULL because it is impossible to calculate the standard
deviation for such a set.
It turns out that for some values, the function does not return NULL.
Here is a reproducing of the problem:
==============================
postgres=# WITH dataset AS (SELECT x, 0.125 AS y FROM generate_series(0, 5) AS x) SELECT corr(x, y) FROM dataset;
corr
------
(1 row)
postgres=# WITH dataset AS (SELECT x, 0.1 AS y FROM generate_series(0, 5) AS x) SELECT corr(x, y) FROM dataset;
corr
--------------------
0.6546536707079771
(1 row)
corr
------
(1 row)
postgres=# WITH dataset AS (SELECT x, 0.1 AS y FROM generate_series(0, 5) AS x) SELECT corr(x, y) FROM dataset;
corr
--------------------
0.6546536707079771
(1 row)
==============================
I'm not sure how to fix that yet, but the issue is that the Sxx and Syy
variables in the float8_corr function are very close to 0, but not zero,
so we can't return NULL.
Best regards,
Maxim Orlov.
Maxim Orlov <orlovmg@gmail.com> writes: > One of the clients complained as to why the query for calculating the > correlation coefficient with the CORR function yielded such weird > results. After a little analysis, it was discovered that they were > calculating the correlation coefficient for two sets, one of which is > more or less random and the other of which is simply a set of constant > values (0.09 if that matters). As a result, they were attaining > unexpected results. However, as far as I am aware, they should have > received NULL because it is impossible to calculate the standard > deviation for such a set. [ shrug... ] Calculations with float8 are inherently inexact, so it's unsurprising that we sometimes fail to detect that the input is exactly a horizontal or vertical line. I don't think there is anything to be done here that wouldn't end in making things worse. regards, tom lane
Hi,
Try casting it to
numeric
or use an expression that avoids floating-point rounding off : WITH dataset AS (
SELECT x, CAST(0.125 AS numeric) AS y
FROM generate_series(0, 5) AS x
)
SELECT corr(x, y) FROM dataset;
Thanks & Regards
Dinesh Nair
From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: Tuesday, August 26, 2025 11:04 PM
To: Maxim Orlov <orlovmg@gmail.com>
Cc: Postgres hackers <pgsql-hackers@lists.postgresql.org>
Subject: Re: Aggregate Function corr does not always return the correct value
Sent: Tuesday, August 26, 2025 11:04 PM
To: Maxim Orlov <orlovmg@gmail.com>
Cc: Postgres hackers <pgsql-hackers@lists.postgresql.org>
Subject: Re: Aggregate Function corr does not always return the correct value
Caution: This email was sent from an external source. Please verify the sender’s identity before clicking links or opening attachments.
Maxim Orlov <orlovmg@gmail.com> writes:
> One of the clients complained as to why the query for calculating the
> correlation coefficient with the CORR function yielded such weird
> results. After a little analysis, it was discovered that they were
> calculating the correlation coefficient for two sets, one of which is
> more or less random and the other of which is simply a set of constant
> values (0.09 if that matters). As a result, they were attaining
> unexpected results. However, as far as I am aware, they should have
> received NULL because it is impossible to calculate the standard
> deviation for such a set.
[ shrug... ] Calculations with float8 are inherently inexact, so
it's unsurprising that we sometimes fail to detect that the input
is exactly a horizontal or vertical line. I don't think there is
anything to be done here that wouldn't end in making things worse.
regards, tom lane
Maxim Orlov <orlovmg@gmail.com> writes:
> One of the clients complained as to why the query for calculating the
> correlation coefficient with the CORR function yielded such weird
> results. After a little analysis, it was discovered that they were
> calculating the correlation coefficient for two sets, one of which is
> more or less random and the other of which is simply a set of constant
> values (0.09 if that matters). As a result, they were attaining
> unexpected results. However, as far as I am aware, they should have
> received NULL because it is impossible to calculate the standard
> deviation for such a set.
[ shrug... ] Calculations with float8 are inherently inexact, so
it's unsurprising that we sometimes fail to detect that the input
is exactly a horizontal or vertical line. I don't think there is
anything to be done here that wouldn't end in making things worse.
regards, tom lane
Em ter., 26 de ago. de 2025 às 14:34, Tom Lane <tgl@sss.pgh.pa.us> escreveu:
Maxim Orlov <orlovmg@gmail.com> writes:
> One of the clients complained as to why the query for calculating the
> correlation coefficient with the CORR function yielded such weird
> results. After a little analysis, it was discovered that they were
> calculating the correlation coefficient for two sets, one of which is
> more or less random and the other of which is simply a set of constant
> values (0.09 if that matters). As a result, they were attaining
> unexpected results. However, as far as I am aware, they should have
> received NULL because it is impossible to calculate the standard
> deviation for such a set.
[ shrug... ] Calculations with float8 are inherently inexact, so
it's unsurprising that we sometimes fail to detect that the input
is exactly a horizontal or vertical line. I don't think there is
anything to be done here that wouldn't end in making things worse.
With the below checking
if (Sxx == 0.0 && Syy == 0.0)
PG_RETURN_NULL();
PG_RETURN_NULL();
This test returns NaN
WITH dataset AS (SELECT x, 0.125 AS y FROM generate_series(0, 5) AS x) SELECT corr(x, y) FROM dataset;
But I can't say if this answer (NaN) makes things worse.
best regards,
Ranier Vilela