Обсуждение: BUG #19340: Wrong result from CORR() function
The following bug has been logged on the website:
Bug reference: 19340
Logged by: Oleg Ivanov
Email address: o15611@gmail.com
PostgreSQL version: 18.1
Operating system: all
Description:
postgres=# SELECT corr( 0.09 , 0.09000001 ) FROM generate_series(1,24) ;
corr
------
(1 row)
postgres=# SELECT corr( 0.09 , 0.09000001 ) FROM generate_series(1,25) ;
corr
---------------------
-0.1877294297321991
(1 row)
postgres=# SELECT corr( 0.09 , 0.09000001 ) FROM generate_series(1,31) ;
corr
----------------------
-0.03366532289960463
(1 row)
postgres=# SELECT corr( 0.09 , 0.09000001 ) FROM generate_series(1,32) ;
corr
----------------------
0.037939234274452456
(1 row)
Another example:
postgres=# SELECT corr( 0.9 , 0.91 ) FROM generate_series(1,36) ;
corr
------
(1 row)
postgres=# SELECT corr( 0.9 , 0.91 ) FROM generate_series(1,37) ;
corr
---------------------
0.37167954745944803
(1 row)
postgres=# SELECT corr( 0.9 , 0.91 ) FROM generate_series(1,113) ;
corr
----------------------
0.022884787550167176
(1 row)
postgres=# SELECT corr( 0.9 , 0.91 ) FROM generate_series(1,114) ;
corr
-----------------------
-0.004981175111303341
(1 row)
In the Oracle Database:
SQL> select corr( 0.09 , 0.09000001 ) FROM (select rownum as id from dual
connect by level <=330);
CORR(0.09,0.09000001)
---------------------
If argument is the constant, function CORR() must give a 0 or NaN.
Consequences of this bug: statistic functions are used to make business
descision. Wrong and completely different results can lead to make mistakes.
On Tue, 2025-12-02 at 07:57 +0000, PG Bug reporting form wrote: > The following bug has been logged on the website: > > Bug reference: 19340 > Logged by: Oleg Ivanov > Email address: o15611@gmail.com > PostgreSQL version: 18.1 > Operating system: all > Description: > > postgres=# SELECT corr( 0.09 , 0.09000001 ) FROM generate_series(1,24) ; > corr > ------ > > (1 row) > > If argument is the constant, function CORR() must give a 0 or NaN. > Consequences of this bug: statistic functions are used to make business > descision. Wrong and completely different results can lead to make mistakes. The documentation is pretty clear about that: In all cases, null is returned if the computation is meaningless, for example when N is zero. Yours, Laurenz Albe
Yes, must be NULL in all the queries I have provided!
But PostgreSQL curr() returns numbers, wich is incorrect.
On Tue, Dec 2, 2025 at 2:30 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
On Tue, 2025-12-02 at 07:57 +0000, PG Bug reporting form wrote:
> The following bug has been logged on the website:
>
> Bug reference: 19340
> Logged by: Oleg Ivanov
> Email address: o15611@gmail.com
> PostgreSQL version: 18.1
> Operating system: all
> Description:
>
> postgres=# SELECT corr( 0.09 , 0.09000001 ) FROM generate_series(1,24) ;
> corr
> ------
>
> (1 row)
>
> If argument is the constant, function CORR() must give a 0 or NaN.
> Consequences of this bug: statistic functions are used to make business
> descision. Wrong and completely different results can lead to make mistakes.
The documentation is pretty clear about that:
In all cases, null is returned if the computation is meaningless, for example when N is zero.
Yours,
Laurenz Albe
Oleg Ivanov <o15611@gmail.com> writes: > Yes, must be NULL in all the queries I have provided! > But PostgreSQL curr() returns numbers, wich is incorrect. Yeah, looks like roundoff error to me. In your example SELECT corr( 0.09 , 0.09000001 ) FROM generate_series(1,25) ; at the end of float8_corr we have 3754 PG_RETURN_FLOAT8(Sxy / sqrt(Sxx * Syy)); (gdb) i locals transarray = <optimized out> transvalues = 0x1b96da8 N = 25 Sxx = 3.2869204384208827e-34 Syy = 9.3266240309214617e-33 Sxy = -3.2869204384208827e-34 where ideally those three values would be zero (and we would have fallen out with a NULL result at the preceding line). It's fundamentally impossible to guarantee exact results with floating-point arithmetic, so if you are expecting that you need to readjust your expectations. But having said that, it does seem a bit sad that we can't detect constant-input cases exactly. I wonder whether it'd be worth carrying additional state to check that explicitly (instead of assuming that "if (Sxx == 0 || Syy == 0)" will catch it). You might find the previous discussion interesting: https://www.postgresql.org/message-id/flat/153313051300.1397.9594490737341194671%40wrigleys.postgresql.org regards, tom lane
On Tue, 2025-12-02 at 17:05 +0300, Oleg Ivanov wrote: > Yes, must be NULL in all the queries I have provided! > But PostgreSQL curr() returns numbers, wich is incorrect. I see. I guess these are rounding errors, which are to be expected with floating point numbers. But perhaps we could do better. Yours, Laurenz Albe
On Tue, 2 Dec 2025 at 17:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> It's fundamentally impossible to guarantee exact results with
> floating-point arithmetic, so if you are expecting that you need
> to readjust your expectations. But having said that, it does
> seem a bit sad that we can't detect constant-input cases exactly.
Yes, indeed. I tested the following query:
SELECT n,
(SELECT variance(1.3::float8) FROM generate_series(1, n)),
(SELECT corr(1.3, 1.3) FROM generate_series(1, n))
FROM generate_series(1, 10) g(n);
In v11 (with the old algorithm) this produces
n | variance | corr
----+----------------------+------
1 | |
2 | 0 |
3 | 0 |
4 | 0 |
5 | 3.5527136788005e-16 | 1
6 | 2.368475785867e-16 | 1
7 | 3.38353683695286e-16 | 1
8 | 0 |
9 | 0 |
10 | 0 |
(10 rows)
whereas in HEAD (with the Youngs-Cramer algorithm) it produces
n | variance | corr
----+------------------------+------
1 | |
2 | 0 |
3 | 0 |
4 | 0 |
5 | 0 |
6 | 5.259072701473412e-33 | 1
7 | 4.382560584561177e-33 | 1
8 | 3.756480501052437e-33 | 1
9 | 3.2869204384208825e-33 | 1
10 | 6.817316464872942e-33 | 1
(10 rows)
so the errors in the variance are smaller, but any non-zero error
makes the correlation completely wrong.
> I wonder whether it'd be worth carrying additional state to
> check that explicitly (instead of assuming that "if (Sxx == 0 ||
> Syy == 0)" will catch it).
I wondered the same thing. It's not nice to have to do that, but
clearly the existing test for constant inputs is no good. The question
is, do we really want to spend extra cycles on every query just to
catch this odd corner case?
Regards,
Dean
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> On Tue, 2 Dec 2025 at 17:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I wonder whether it'd be worth carrying additional state to
>> check that explicitly (instead of assuming that "if (Sxx == 0 ||
>> Syy == 0)" will catch it).
> I wondered the same thing. It's not nice to have to do that, but
> clearly the existing test for constant inputs is no good. The question
> is, do we really want to spend extra cycles on every query just to
> catch this odd corner case?
I experimented with the attached patch, which is very incomplete;
I just carried it far enough to be able to run performance checks on
the modified code, and so all the binary statistics aggregates except
corr() are broken. I observe about 2% slowdown on this test case:
SELECT corr( 0.09 , 0.09000001 ) FROM generate_series(1,100000000);
I think that any real-world usage is going to expend more effort
obtaining the input data than this test does, so 2% should be a
conservative upper bound on the cost. Seems to me that getting
NULL-or-not right is probably worth a percent or so.
If anyone feels differently, another idea could be to use a
separate state transition function for corr() that skips the
accumulation steps that corr() doesn't use. But I agree with
the pre-existing decision to use just one transition function
for all the binary aggregates.
If this seems like a reasonable approach, I'll see about finishing
out the patch.
regards, tom lane
diff --git a/src/backend/utils/adt/float.c b/src/backend/utils/adt/float.c
index 7b97d2be6ca..5d8954a34d0 100644
--- a/src/backend/utils/adt/float.c
+++ b/src/backend/utils/adt/float.c
@@ -3319,9 +3319,15 @@ float8_stddev_samp(PG_FUNCTION_ARGS)
* As with the preceding aggregates, we use the Youngs-Cramer algorithm to
* reduce rounding errors in the aggregate final functions.
*
- * The transition datatype for all these aggregates is a 6-element array of
+ * The transition datatype for all these aggregates is a 9-element array of
* float8, holding the values N, Sx=sum(X), Sxx=sum((X-Sx/N)^2), Sy=sum(Y),
- * Syy=sum((Y-Sy/N)^2), Sxy=sum((X-Sx/N)*(Y-Sy/N)) in that order.
+ * Syy=sum((Y-Sy/N)^2), Sxy=sum((X-Sx/N)*(Y-Sy/N)), firstX, firstY, DIFF,
+ * in that order.
+ *
+ * DIFF, like N, is treated as an integer. Bit 0 is set if we saw distinct
+ * X inputs, and bit 1 is set if we saw distinct Y inputs. This allows us
+ * to detect constant inputs exactly, which is important for deciding whether
+ * some outputs should be NULL.
*
* Note that Y is the first argument to all these aggregates!
*
@@ -3345,17 +3351,23 @@ float8_regr_accum(PG_FUNCTION_ARGS)
Sy,
Syy,
Sxy,
+ firstX,
+ firstY,
tmpX,
tmpY,
scale;
+ int diff;
- transvalues = check_float8_array(transarray, "float8_regr_accum", 6);
+ transvalues = check_float8_array(transarray, "float8_regr_accum", 9);
N = transvalues[0];
Sx = transvalues[1];
Sxx = transvalues[2];
Sy = transvalues[3];
Syy = transvalues[4];
Sxy = transvalues[5];
+ firstX = transvalues[6];
+ firstY = transvalues[7];
+ diff = transvalues[8];
/*
* Use the Youngs-Cramer algorithm to incorporate the new values into the
@@ -3373,6 +3385,19 @@ float8_regr_accum(PG_FUNCTION_ARGS)
Syy += tmpY * tmpY * scale;
Sxy += tmpX * tmpY * scale;
+ /*
+ * Check to see if we have seen distinct inputs. In normal use, diff
+ * will reach 3 very soon and then we can stop checking.
+ */
+ if (diff != 3)
+ {
+ /* Need SQL-style comparison of NaNs here */
+ if (float8_ne(newvalX, firstX))
+ diff |= 1;
+ if (float8_ne(newvalY, firstY))
+ diff |= 2;
+ }
+
/*
* Overflow check. We only report an overflow error when finite
* inputs lead to infinite results. Note also that Sxx, Syy and Sxy
@@ -3410,6 +3435,8 @@ float8_regr_accum(PG_FUNCTION_ARGS)
Sxx = Sxy = get_float8_nan();
if (isnan(newvalY) || isinf(newvalY))
Syy = Sxy = get_float8_nan();
+ firstX = newvalX;
+ firstY = newvalY;
}
/*
@@ -3425,12 +3452,15 @@ float8_regr_accum(PG_FUNCTION_ARGS)
transvalues[3] = Sy;
transvalues[4] = Syy;
transvalues[5] = Sxy;
+ transvalues[6] = firstX;
+ transvalues[7] = firstY;
+ transvalues[8] = diff;
PG_RETURN_ARRAYTYPE_P(transarray);
}
else
{
- Datum transdatums[6];
+ Datum transdatums[9];
ArrayType *result;
transdatums[0] = Float8GetDatumFast(N);
@@ -3439,8 +3469,11 @@ float8_regr_accum(PG_FUNCTION_ARGS)
transdatums[3] = Float8GetDatumFast(Sy);
transdatums[4] = Float8GetDatumFast(Syy);
transdatums[5] = Float8GetDatumFast(Sxy);
+ transdatums[6] = Float8GetDatumFast(firstX);
+ transdatums[7] = Float8GetDatumFast(firstY);
+ transdatums[8] = Float8GetDatum(diff);
- result = construct_array_builtin(transdatums, 6, FLOAT8OID);
+ result = construct_array_builtin(transdatums, 9, FLOAT8OID);
PG_RETURN_ARRAYTYPE_P(result);
}
@@ -3730,27 +3763,25 @@ float8_corr(PG_FUNCTION_ARGS)
{
ArrayType *transarray = PG_GETARG_ARRAYTYPE_P(0);
float8 *transvalues;
- float8 N,
- Sxx,
+ float8 Sxx,
Syy,
Sxy;
+ int diff;
- transvalues = check_float8_array(transarray, "float8_corr", 6);
- N = transvalues[0];
+ transvalues = check_float8_array(transarray, "float8_corr", 9);
Sxx = transvalues[2];
Syy = transvalues[4];
Sxy = transvalues[5];
+ diff = transvalues[8];
- /* if N is 0 we should return NULL */
- if (N < 1.0)
+ /*
+ * Per spec, we must return NULL if N is zero, all X inputs are equal, or
+ * all Y inputs are equal. Checking the diff mask covers all three cases.
+ */
+ if (diff != 3)
PG_RETURN_NULL();
/* Note that Sxx and Syy are guaranteed to be non-negative */
-
- /* per spec, return NULL for horizontal and vertical lines */
- if (Sxx == 0 || Syy == 0)
- PG_RETURN_NULL();
-
PG_RETURN_FLOAT8(Sxy / sqrt(Sxx * Syy));
}
diff --git a/src/include/catalog/pg_aggregate.dat b/src/include/catalog/pg_aggregate.dat
index 870769e8f14..68dc1329ea0 100644
--- a/src/include/catalog/pg_aggregate.dat
+++ b/src/include/catalog/pg_aggregate.dat
@@ -505,7 +505,7 @@
aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0}' },
{ aggfnoid => 'corr', aggtransfn => 'float8_regr_accum',
aggfinalfn => 'float8_corr', aggcombinefn => 'float8_regr_combine',
- aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0}' },
+ aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0,0,0,0}' },
# boolean-and and boolean-or
{ aggfnoid => 'bool_and', aggtransfn => 'booland_statefunc',
On Tue, 2 Dec 2025 at 20:26, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I experimented with the attached patch, which is very incomplete; > I just carried it far enough to be able to run performance checks on > the modified code, and so all the binary statistics aggregates except > corr() are broken. I observe about 2% slowdown on this test case: I played around with having just 2 extra array elements, constX and constY equal to the common value if all the values are the same, and NaN otherwise. For me, that was slightly faster, which I put down to floating point comparison being faster than converting back and forth between floating point and integer. Either way, it seems like a difference that no one is likely to notice. Doing it that way does lead to one difference though: all-NaN inputs leads to a NaN result, whereas your patch produces NULL for that case. Regards, Dean
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> I played around with having just 2 extra array elements, constX and
> constY equal to the common value if all the values are the same, and
> NaN otherwise.
Hmm.
> Doing it that way does lead to one difference though: all-NaN inputs
> leads to a NaN result, whereas your patch produces NULL for that case.
Yeah, I did it as I did precisely because I wanted all-NaN-input to be
seen as a constant. But you could make an argument that NaN is not
really a fixed value but has more kinship to the "we don't know what
the value is" interpretation of SQL NULL. In that case your proposal
is semantically reasonable on the grounds that maybe the NaNs aren't
really all equal, and I agree it ought to be a little faster than
mine.
regards, tom lane
On Tue, 2 Dec 2025 at 23:24, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Dean Rasheed <dean.a.rasheed@gmail.com> writes: > > I played around with having just 2 extra array elements, constX and > > constY equal to the common value if all the values are the same, and > > NaN otherwise. > > Hmm. > > > Doing it that way does lead to one difference though: all-NaN inputs > > leads to a NaN result, whereas your patch produces NULL for that case. > > Yeah, I did it as I did precisely because I wanted all-NaN-input to be > seen as a constant. But you could make an argument that NaN is not > really a fixed value but has more kinship to the "we don't know what > the value is" interpretation of SQL NULL. I think that's more consistent with the general policy that most (all?) math functions have, where if any input is NaN, the result is NaN. Regards, Dean
I tried the attached realization of your idea, and it does seem
very marginally faster than what I did; but it's like a 1.8%
slowdown instead of 2%. Might be different on a different
machine of course. But I think we should choose on the basis
of which semantics we like better, rather than such a tiny
performance difference.
I'm coming around to the conclusion that your way is better,
though. It seems good that "any NaN in the input results in
NaN output", which your way does and mine doesn't.
regards, tom lane
diff --git a/src/backend/utils/adt/float.c b/src/backend/utils/adt/float.c
index 7b97d2be6ca..c2173f378bf 100644
--- a/src/backend/utils/adt/float.c
+++ b/src/backend/utils/adt/float.c
@@ -3319,9 +3319,16 @@ float8_stddev_samp(PG_FUNCTION_ARGS)
* As with the preceding aggregates, we use the Youngs-Cramer algorithm to
* reduce rounding errors in the aggregate final functions.
*
- * The transition datatype for all these aggregates is a 6-element array of
+ * The transition datatype for all these aggregates is an 8-element array of
* float8, holding the values N, Sx=sum(X), Sxx=sum((X-Sx/N)^2), Sy=sum(Y),
- * Syy=sum((Y-Sy/N)^2), Sxy=sum((X-Sx/N)*(Y-Sy/N)) in that order.
+ * Syy=sum((Y-Sy/N)^2), Sxy=sum((X-Sx/N)*(Y-Sy/N)), commonX, and commonY
+ * in that order.
+ *
+ * commonX is defined as the common X value if all the X values were the
+ * same, else NaN; likewise for commonY. This is useful for deciding
+ * whether corr() should return NULL. Note that this representation cannot
+ * distinguish all-the-values-were-NaN from the-values-weren't-all-the-same,
+ * but that's okay because we want to return NaN for all-NaN input.
*
* Note that Y is the first argument to all these aggregates!
*
@@ -3345,17 +3352,21 @@ float8_regr_accum(PG_FUNCTION_ARGS)
Sy,
Syy,
Sxy,
+ commonX,
+ commonY,
tmpX,
tmpY,
scale;
- transvalues = check_float8_array(transarray, "float8_regr_accum", 6);
+ transvalues = check_float8_array(transarray, "float8_regr_accum", 8);
N = transvalues[0];
Sx = transvalues[1];
Sxx = transvalues[2];
Sy = transvalues[3];
Syy = transvalues[4];
Sxy = transvalues[5];
+ commonX = transvalues[6];
+ commonY = transvalues[7];
/*
* Use the Youngs-Cramer algorithm to incorporate the new values into the
@@ -3373,6 +3384,16 @@ float8_regr_accum(PG_FUNCTION_ARGS)
Syy += tmpY * tmpY * scale;
Sxy += tmpX * tmpY * scale;
+ /*
+ * Check to see if we have seen distinct inputs. We can use a test
+ * that's a bit cheaper than float8_ne() because if commonX is already
+ * NaN, it does not matter whether the != test returns true or not.
+ */
+ if (newvalX != commonX || isnan(newvalX))
+ commonX = get_float8_nan();
+ if (newvalY != commonY || isnan(newvalY))
+ commonY = get_float8_nan();
+
/*
* Overflow check. We only report an overflow error when finite
* inputs lead to infinite results. Note also that Sxx, Syy and Sxy
@@ -3410,6 +3431,8 @@ float8_regr_accum(PG_FUNCTION_ARGS)
Sxx = Sxy = get_float8_nan();
if (isnan(newvalY) || isinf(newvalY))
Syy = Sxy = get_float8_nan();
+ commonX = newvalX;
+ commonY = newvalY;
}
/*
@@ -3425,12 +3448,14 @@ float8_regr_accum(PG_FUNCTION_ARGS)
transvalues[3] = Sy;
transvalues[4] = Syy;
transvalues[5] = Sxy;
+ transvalues[6] = commonX;
+ transvalues[7] = commonY;
PG_RETURN_ARRAYTYPE_P(transarray);
}
else
{
- Datum transdatums[6];
+ Datum transdatums[8];
ArrayType *result;
transdatums[0] = Float8GetDatumFast(N);
@@ -3439,8 +3464,10 @@ float8_regr_accum(PG_FUNCTION_ARGS)
transdatums[3] = Float8GetDatumFast(Sy);
transdatums[4] = Float8GetDatumFast(Syy);
transdatums[5] = Float8GetDatumFast(Sxy);
+ transdatums[6] = Float8GetDatumFast(commonX);
+ transdatums[7] = Float8GetDatumFast(commonY);
- result = construct_array_builtin(transdatums, 6, FLOAT8OID);
+ result = construct_array_builtin(transdatums, 8, FLOAT8OID);
PG_RETURN_ARRAYTYPE_P(result);
}
@@ -3733,24 +3760,27 @@ float8_corr(PG_FUNCTION_ARGS)
float8 N,
Sxx,
Syy,
- Sxy;
+ Sxy,
+ commonX,
+ commonY;
- transvalues = check_float8_array(transarray, "float8_corr", 6);
+ transvalues = check_float8_array(transarray, "float8_corr", 8);
N = transvalues[0];
Sxx = transvalues[2];
Syy = transvalues[4];
Sxy = transvalues[5];
+ commonX = transvalues[6];
+ commonY = transvalues[7];
/* if N is 0 we should return NULL */
if (N < 1.0)
PG_RETURN_NULL();
- /* Note that Sxx and Syy are guaranteed to be non-negative */
-
/* per spec, return NULL for horizontal and vertical lines */
- if (Sxx == 0 || Syy == 0)
+ if (!isnan(commonX) || !isnan(commonY))
PG_RETURN_NULL();
+ /* at this point, Sxx and Syy cannot be zero or negative */
PG_RETURN_FLOAT8(Sxy / sqrt(Sxx * Syy));
}
diff --git a/src/include/catalog/pg_aggregate.dat b/src/include/catalog/pg_aggregate.dat
index 870769e8f14..d1342f7bb94 100644
--- a/src/include/catalog/pg_aggregate.dat
+++ b/src/include/catalog/pg_aggregate.dat
@@ -505,7 +505,7 @@
aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0}' },
{ aggfnoid => 'corr', aggtransfn => 'float8_regr_accum',
aggfinalfn => 'float8_corr', aggcombinefn => 'float8_regr_combine',
- aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0}' },
+ aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0,0,0}' },
# boolean-and and boolean-or
{ aggfnoid => 'bool_and', aggtransfn => 'booland_statefunc',
I wrote:
> I'm coming around to the conclusion that your way is better,
> though. It seems good that "any NaN in the input results in
> NaN output", which your way does and mine doesn't.
Poking further at this, I found that my v2 patch fails that principle
in one case:
regression=# SELECT corr( 0.1 , 'nan' ) FROM generate_series(1,1000) g;
corr
------
(1 row)
We see that Y is constant and therefore return NULL, despite the
other NaN input.
I think we can fix that along these lines:
@@ -3776,8 +3776,12 @@ float8_corr(PG_FUNCTION_ARGS)
if (N < 1.0)
PG_RETURN_NULL();
- /* per spec, return NULL for horizontal and vertical lines */
- if (!isnan(commonX) || !isnan(commonY))
+ /*
+ * per spec, return NULL for horizontal and vertical lines; but not if the
+ * result would otherwise be NaN
+ */
+ if ((!isnan(commonX) || !isnan(commonY)) &&
+ (!isnan(Sxx) && !isnan(Syy)))
PG_RETURN_NULL();
/* at this point, Sxx and Syy cannot be zero or negative */
(don't think it should be necessary to also check Sxy)
BTW, HEAD is inconsistent: it will return NaN for this example, but
only because it's confused by roundoff error into thinking that Y
isn't constant. With few enough inputs, it produces NULL too:
regression=# SELECT corr( 0.1 , 'nan' ) FROM generate_series(1,3) g;
corr
------
(1 row)
regards, tom lane
On Wed, 3 Dec 2025 at 01:27, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Poking further at this, I found that my v2 patch fails that principle
> in one case:
>
> regression=# SELECT corr( 0.1 , 'nan' ) FROM generate_series(1,1000) g;
> corr
> ------
>
> (1 row)
>
> We see that Y is constant and therefore return NULL, despite the
> other NaN input.
>
> I think we can fix that along these lines:
>
> @@ -3776,8 +3776,12 @@ float8_corr(PG_FUNCTION_ARGS)
> if (N < 1.0)
> PG_RETURN_NULL();
>
> - /* per spec, return NULL for horizontal and vertical lines */
> - if (!isnan(commonX) || !isnan(commonY))
> + /*
> + * per spec, return NULL for horizontal and vertical lines; but not if the
> + * result would otherwise be NaN
> + */
> + if ((!isnan(commonX) || !isnan(commonY)) &&
> + (!isnan(Sxx) && !isnan(Syy)))
> PG_RETURN_NULL();
>
> /* at this point, Sxx and Syy cannot be zero or negative */
I think that would be more readable as 2 separate "if" statements,
something like:
/* if any input is NaN, return NaN */
if (isnan(Sxx) || isnan(Syy))
PG_RETURN_FLOAT8(get_float8_nan());
/* per spec, return NULL for horizontal and vertical lines */
if (!isnan(commonX) || !isnan(commonY))
PG_RETURN_NULL();
so that we're more explicit about the NaN-handling semantics.
Also, given that any NaN input results in NaN output, regardless of
whether or not the inputs are all the same, the values for commonX and
commonY don't matter if any input is NaN. I think that allows the
accumulator function's tests to be simplified (no need to test if new
values are NaN).
Another case to consider is this:
SELECT corr(1.3 + x*1e-16, 1.3 + x*1e-16) FROM generate_series(1, 3) x;
Here Sxx and Syy end up being zero, so the current code returns NULL,
but with the patch it returns NaN (because Sxy is also zero). It's not
quite obvious what to do in this case (the correct answer is 1, but
there' no way of reliably computing that with double precision
arithmetic). My first thought is that we should probably try to limit
NaN results to NaN inputs, and so it would be better to continue to
return NULL for cases like this where either Sxx or Syy are zero, even
though the inputs weren't quite constant.
Then there's this:
SELECT corr(1e-100 + x*1e-105, 1e-100 + x*1e-105) FROM generate_series(1, 3) x;
Here Sxx, Syy, and Sxy are all non-zero (roughly 2e-210), but the
product Sxx * Syy underflows to zero. So in both HEAD and with the
patch, this returns Infinity. That's not good, given that the
correlation coefficient is supposed to lie in the range [-1,1].
The correct answer in this case is also 1, which could be achieved by
taking the square roots of Sxx and Syy separately, before multiplying,
but it might also be sensible to clamp the result to the range [-1,1].
Regard,
Dean
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> On Wed, 3 Dec 2025 at 01:27, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Poking further at this, I found that my v2 patch fails that principle
>> in one case:
>>
>> regression=# SELECT corr( 0.1 , 'nan' ) FROM generate_series(1,1000) g;
>> corr
>> ------
>>
>> (1 row)
>>
>> We see that Y is constant and therefore return NULL, despite the
>> other NaN input.
> I think that would be more readable as 2 separate "if" statements,
Actually, in the light of morning I think this behavior is probably
fine. This example treads on two different edge-cases, one where
we're supposed to return NULL and one where we're supposed to return
NaN, and it's not that open-and-shut which rule should win. A
counterexample here is float8_covar_samp: that returns NULL if there
are less than two input values, and will still do so if there is one
input that is NaN. Should we change that? I don't think so.
The fact that HEAD returns NULL for this corr() case as long as
there is not enough input to create a roundoff problem also suggests
to me that that's the behavior to stick with.
So I'm now thinking that the NULL-output rule should win in cases
where they are both applicable. However, I don't know if we want
to take that so far as to say that constant-NaN input should
produce a NULL; the argument that NaN isn't really a constant
still has some force in my mind. What do you think?
> Another case to consider is this:
> SELECT corr(1.3 + x*1e-16, 1.3 + x*1e-16) FROM generate_series(1, 3) x;
> Here Sxx and Syy end up being zero, so the current code returns NULL,
> but with the patch it returns NaN (because Sxy is also zero). It's not
> quite obvious what to do in this case (the correct answer is 1, but
> there' no way of reliably computing that with double precision
> arithmetic). My first thought is that we should probably try to limit
> NaN results to NaN inputs, and so it would be better to continue to
> return NULL for cases like this where either Sxx or Syy are zero, even
> though the inputs weren't quite constant.
Good example. I had wondered whether to retain the final test for zero
Sxx/Syy, and this shows that we do need that (along with a comment
explaining that roundoff error could produce zeroes even though we
know the inputs weren't constant).
> Then there's this:
> SELECT corr(1e-100 + x*1e-105, 1e-100 + x*1e-105) FROM generate_series(1, 3) x;
> Here Sxx, Syy, and Sxy are all non-zero (roughly 2e-210), but the
> product Sxx * Syy underflows to zero. So in both HEAD and with the
> patch, this returns Infinity. That's not good, given that the
> correlation coefficient is supposed to lie in the range [-1,1].
> The correct answer in this case is also 1, which could be achieved by
> taking the square roots of Sxx and Syy separately, before multiplying,
> but it might also be sensible to clamp the result to the range [-1,1].
I like the idea of taking the square roots separately. I believe
sqrt() is a hardware operation on pretty much any machine people still
care about, and we're doing this part only once per aggregation.
So the cost seems fairly minimal.
regards, tom lane
Attached is a fleshed-out patch proposal that fixes the related
aggregates and adds test cases.
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> Also, given that any NaN input results in NaN output, regardless of
> whether or not the inputs are all the same, the values for commonX and
> commonY don't matter if any input is NaN. I think that allows the
> accumulator function's tests to be simplified (no need to test if new
> values are NaN).
I'm not convinced about that. I have it as
if (newvalX != commonX || isnan(newvalX))
commonX = get_float8_nan();
and I think you're saying we could just write
if (newvalX != commonX)
commonX = get_float8_nan();
My concern is that if newvalX is NaN but commonX isn't, then
I believe that the non-NaN-aware "!=" test is supposed to return
false, which'd cause us to not update commonX to NaN as required.
Maybe we could make it work by writing
if (!(newvalX == commonX))
commonX = get_float8_nan();
but I don't have a lot of faith in C compilers getting that right.
> Then there's this:
> SELECT corr(1e-100 + x*1e-105, 1e-100 + x*1e-105) FROM generate_series(1, 3) x;
> The correct answer in this case is also 1, which could be achieved by
> taking the square roots of Sxx and Syy separately, before multiplying,
> but it might also be sensible to clamp the result to the range [-1,1].
Poking at this, I soon found a test case where even with the separate
sqrt() calls we'd produce a result slightly outside [-1, 1] (running
this test over more values of x is sufficient). So now I think we
should do both the separate sqrt and the clamp.
regards, tom lane
diff --git a/src/backend/utils/adt/float.c b/src/backend/utils/adt/float.c
index 7b97d2be6ca..fddff3f878a 100644
--- a/src/backend/utils/adt/float.c
+++ b/src/backend/utils/adt/float.c
@@ -3319,9 +3319,16 @@ float8_stddev_samp(PG_FUNCTION_ARGS)
* As with the preceding aggregates, we use the Youngs-Cramer algorithm to
* reduce rounding errors in the aggregate final functions.
*
- * The transition datatype for all these aggregates is a 6-element array of
+ * The transition datatype for all these aggregates is an 8-element array of
* float8, holding the values N, Sx=sum(X), Sxx=sum((X-Sx/N)^2), Sy=sum(Y),
- * Syy=sum((Y-Sy/N)^2), Sxy=sum((X-Sx/N)*(Y-Sy/N)) in that order.
+ * Syy=sum((Y-Sy/N)^2), Sxy=sum((X-Sx/N)*(Y-Sy/N)), commonX, and commonY
+ * in that order.
+ *
+ * commonX is defined as the common X value if all the X values were the same,
+ * else NaN; likewise for commonY. This is useful for deciding whether corr()
+ * and related functions should return NULL. This representation cannot
+ * distinguish all-the-values-were-NaN from the-values-weren't-all-the-same,
+ * but that's okay because we want to return NaN for all-NaN input.
*
* Note that Y is the first argument to all these aggregates!
*
@@ -3345,17 +3352,21 @@ float8_regr_accum(PG_FUNCTION_ARGS)
Sy,
Syy,
Sxy,
+ commonX,
+ commonY,
tmpX,
tmpY,
scale;
- transvalues = check_float8_array(transarray, "float8_regr_accum", 6);
+ transvalues = check_float8_array(transarray, "float8_regr_accum", 8);
N = transvalues[0];
Sx = transvalues[1];
Sxx = transvalues[2];
Sy = transvalues[3];
Syy = transvalues[4];
Sxy = transvalues[5];
+ commonX = transvalues[6];
+ commonY = transvalues[7];
/*
* Use the Youngs-Cramer algorithm to incorporate the new values into the
@@ -3397,6 +3408,16 @@ float8_regr_accum(PG_FUNCTION_ARGS)
if (isinf(Sxy))
Sxy = get_float8_nan();
}
+
+ /*
+ * Check to see if we have seen distinct inputs. We can use a test
+ * that's a bit cheaper than float8_ne() because if commonX is already
+ * NaN, it does not matter whether the != test returns true or not.
+ */
+ if (newvalX != commonX || isnan(newvalX))
+ commonX = get_float8_nan();
+ if (newvalY != commonY || isnan(newvalY))
+ commonY = get_float8_nan();
}
else
{
@@ -3410,6 +3431,9 @@ float8_regr_accum(PG_FUNCTION_ARGS)
Sxx = Sxy = get_float8_nan();
if (isnan(newvalY) || isinf(newvalY))
Syy = Sxy = get_float8_nan();
+
+ commonX = newvalX;
+ commonY = newvalY;
}
/*
@@ -3425,12 +3449,14 @@ float8_regr_accum(PG_FUNCTION_ARGS)
transvalues[3] = Sy;
transvalues[4] = Syy;
transvalues[5] = Sxy;
+ transvalues[6] = commonX;
+ transvalues[7] = commonY;
PG_RETURN_ARRAYTYPE_P(transarray);
}
else
{
- Datum transdatums[6];
+ Datum transdatums[8];
ArrayType *result;
transdatums[0] = Float8GetDatumFast(N);
@@ -3439,8 +3465,10 @@ float8_regr_accum(PG_FUNCTION_ARGS)
transdatums[3] = Float8GetDatumFast(Sy);
transdatums[4] = Float8GetDatumFast(Syy);
transdatums[5] = Float8GetDatumFast(Sxy);
+ transdatums[6] = Float8GetDatumFast(commonX);
+ transdatums[7] = Float8GetDatumFast(commonY);
- result = construct_array_builtin(transdatums, 6, FLOAT8OID);
+ result = construct_array_builtin(transdatums, 8, FLOAT8OID);
PG_RETURN_ARRAYTYPE_P(result);
}
@@ -3449,7 +3477,7 @@ float8_regr_accum(PG_FUNCTION_ARGS)
/*
* float8_regr_combine
*
- * An aggregate combine function used to combine two 6 fields
+ * An aggregate combine function used to combine two 8-fields
* aggregate transition data into a single transition data.
* This function is used only in two stage aggregation and
* shouldn't be called outside aggregate context.
@@ -3467,12 +3495,16 @@ float8_regr_combine(PG_FUNCTION_ARGS)
Sy1,
Syy1,
Sxy1,
+ Cx1,
+ Cy1,
N2,
Sx2,
Sxx2,
Sy2,
Syy2,
Sxy2,
+ Cx2,
+ Cy2,
tmp1,
tmp2,
N,
@@ -3480,10 +3512,12 @@ float8_regr_combine(PG_FUNCTION_ARGS)
Sxx,
Sy,
Syy,
- Sxy;
+ Sxy,
+ Cx,
+ Cy;
- transvalues1 = check_float8_array(transarray1, "float8_regr_combine", 6);
- transvalues2 = check_float8_array(transarray2, "float8_regr_combine", 6);
+ transvalues1 = check_float8_array(transarray1, "float8_regr_combine", 8);
+ transvalues2 = check_float8_array(transarray2, "float8_regr_combine", 8);
N1 = transvalues1[0];
Sx1 = transvalues1[1];
@@ -3491,6 +3525,8 @@ float8_regr_combine(PG_FUNCTION_ARGS)
Sy1 = transvalues1[3];
Syy1 = transvalues1[4];
Sxy1 = transvalues1[5];
+ Cx1 = transvalues1[6];
+ Cy1 = transvalues1[7];
N2 = transvalues2[0];
Sx2 = transvalues2[1];
@@ -3498,6 +3534,8 @@ float8_regr_combine(PG_FUNCTION_ARGS)
Sy2 = transvalues2[3];
Syy2 = transvalues2[4];
Sxy2 = transvalues2[5];
+ Cx2 = transvalues2[6];
+ Cy2 = transvalues2[7];
/*--------------------
* The transition values combine using a generalization of the
@@ -3523,6 +3561,8 @@ float8_regr_combine(PG_FUNCTION_ARGS)
Sy = Sy2;
Syy = Syy2;
Sxy = Sxy2;
+ Cx = Cx2;
+ Cy = Cy2;
}
else if (N2 == 0.0)
{
@@ -3532,6 +3572,8 @@ float8_regr_combine(PG_FUNCTION_ARGS)
Sy = Sy1;
Syy = Syy1;
Sxy = Sxy1;
+ Cx = Cx1;
+ Cy = Cy1;
}
else
{
@@ -3549,6 +3591,14 @@ float8_regr_combine(PG_FUNCTION_ARGS)
Sxy = Sxy1 + Sxy2 + N1 * N2 * tmp1 * tmp2 / N;
if (unlikely(isinf(Sxy)) && !isinf(Sxy1) && !isinf(Sxy2))
float_overflow_error();
+ if (float8_eq(Cx1, Cx2))
+ Cx = Cx1;
+ else
+ Cx = get_float8_nan();
+ if (float8_eq(Cy1, Cy2))
+ Cy = Cy1;
+ else
+ Cy = get_float8_nan();
}
/*
@@ -3564,12 +3614,14 @@ float8_regr_combine(PG_FUNCTION_ARGS)
transvalues1[3] = Sy;
transvalues1[4] = Syy;
transvalues1[5] = Sxy;
+ transvalues1[6] = Cx;
+ transvalues1[7] = Cy;
PG_RETURN_ARRAYTYPE_P(transarray1);
}
else
{
- Datum transdatums[6];
+ Datum transdatums[8];
ArrayType *result;
transdatums[0] = Float8GetDatumFast(N);
@@ -3578,8 +3630,10 @@ float8_regr_combine(PG_FUNCTION_ARGS)
transdatums[3] = Float8GetDatumFast(Sy);
transdatums[4] = Float8GetDatumFast(Syy);
transdatums[5] = Float8GetDatumFast(Sxy);
+ transdatums[6] = Float8GetDatumFast(Cx);
+ transdatums[7] = Float8GetDatumFast(Cy);
- result = construct_array_builtin(transdatums, 6, FLOAT8OID);
+ result = construct_array_builtin(transdatums, 8, FLOAT8OID);
PG_RETURN_ARRAYTYPE_P(result);
}
@@ -3594,7 +3648,7 @@ float8_regr_sxx(PG_FUNCTION_ARGS)
float8 N,
Sxx;
- transvalues = check_float8_array(transarray, "float8_regr_sxx", 6);
+ transvalues = check_float8_array(transarray, "float8_regr_sxx", 8);
N = transvalues[0];
Sxx = transvalues[2];
@@ -3615,7 +3669,7 @@ float8_regr_syy(PG_FUNCTION_ARGS)
float8 N,
Syy;
- transvalues = check_float8_array(transarray, "float8_regr_syy", 6);
+ transvalues = check_float8_array(transarray, "float8_regr_syy", 8);
N = transvalues[0];
Syy = transvalues[4];
@@ -3636,7 +3690,7 @@ float8_regr_sxy(PG_FUNCTION_ARGS)
float8 N,
Sxy;
- transvalues = check_float8_array(transarray, "float8_regr_sxy", 6);
+ transvalues = check_float8_array(transarray, "float8_regr_sxy", 8);
N = transvalues[0];
Sxy = transvalues[5];
@@ -3657,7 +3711,7 @@ float8_regr_avgx(PG_FUNCTION_ARGS)
float8 N,
Sx;
- transvalues = check_float8_array(transarray, "float8_regr_avgx", 6);
+ transvalues = check_float8_array(transarray, "float8_regr_avgx", 8);
N = transvalues[0];
Sx = transvalues[1];
@@ -3676,7 +3730,7 @@ float8_regr_avgy(PG_FUNCTION_ARGS)
float8 N,
Sy;
- transvalues = check_float8_array(transarray, "float8_regr_avgy", 6);
+ transvalues = check_float8_array(transarray, "float8_regr_avgy", 8);
N = transvalues[0];
Sy = transvalues[3];
@@ -3695,7 +3749,7 @@ float8_covar_pop(PG_FUNCTION_ARGS)
float8 N,
Sxy;
- transvalues = check_float8_array(transarray, "float8_covar_pop", 6);
+ transvalues = check_float8_array(transarray, "float8_covar_pop", 8);
N = transvalues[0];
Sxy = transvalues[5];
@@ -3714,7 +3768,7 @@ float8_covar_samp(PG_FUNCTION_ARGS)
float8 N,
Sxy;
- transvalues = check_float8_array(transarray, "float8_covar_samp", 6);
+ transvalues = check_float8_array(transarray, "float8_covar_samp", 8);
N = transvalues[0];
Sxy = transvalues[5];
@@ -3733,13 +3787,18 @@ float8_corr(PG_FUNCTION_ARGS)
float8 N,
Sxx,
Syy,
- Sxy;
+ Sxy,
+ commonX,
+ commonY,
+ result;
- transvalues = check_float8_array(transarray, "float8_corr", 6);
+ transvalues = check_float8_array(transarray, "float8_corr", 8);
N = transvalues[0];
Sxx = transvalues[2];
Syy = transvalues[4];
Sxy = transvalues[5];
+ commonX = transvalues[6];
+ commonY = transvalues[7];
/* if N is 0 we should return NULL */
if (N < 1.0)
@@ -3747,11 +3806,37 @@ float8_corr(PG_FUNCTION_ARGS)
/* Note that Sxx and Syy are guaranteed to be non-negative */
- /* per spec, return NULL for horizontal and vertical lines */
+ /*
+ * Per spec, return NULL for horizontal and vertical lines. We can detect
+ * constant inputs exactly by checking commonX and commonY, even though
+ * Sxx and/or Syy might be nonzero due to roundoff error.
+ */
+ if (!isnan(commonX) || !isnan(commonY))
+ PG_RETURN_NULL();
+
+ /*
+ * Although we now know that the inputs weren't all equal, Sxx and/or Syy
+ * could be zero due to roundoff error, if the inputs were close together.
+ * It seems best to return NULL in that case; blindly applying the result
+ * formula would yield NaN, which seems wrong if the inputs did not
+ * include any NaN.
+ */
if (Sxx == 0 || Syy == 0)
PG_RETURN_NULL();
- PG_RETURN_FLOAT8(Sxy / sqrt(Sxx * Syy));
+ /*
+ * We compute sqrt(Sxx) * sqrt(Syy) not sqrt(Sxx * Syy) because the raw
+ * product could underflow or overflow. Despite all these precautions,
+ * this formula can yield results outside [-1, 1] due to roundoff error.
+ * Clamp it to the expected range.
+ */
+ result = Sxy / (sqrt(Sxx) * sqrt(Syy));
+ if (result < -1)
+ result = -1;
+ else if (result > 1)
+ result = 1;
+
+ PG_RETURN_FLOAT8(result);
}
Datum
@@ -3762,13 +3847,17 @@ float8_regr_r2(PG_FUNCTION_ARGS)
float8 N,
Sxx,
Syy,
- Sxy;
+ Sxy,
+ commonX,
+ commonY;
- transvalues = check_float8_array(transarray, "float8_regr_r2", 6);
+ transvalues = check_float8_array(transarray, "float8_regr_r2", 8);
N = transvalues[0];
Sxx = transvalues[2];
Syy = transvalues[4];
Sxy = transvalues[5];
+ commonX = transvalues[6];
+ commonY = transvalues[7];
/* if N is 0 we should return NULL */
if (N < 1.0)
@@ -3776,12 +3865,12 @@ float8_regr_r2(PG_FUNCTION_ARGS)
/* Note that Sxx and Syy are guaranteed to be non-negative */
- /* per spec, return NULL for a vertical line */
- if (Sxx == 0)
+ /* per spec, return NULL for a vertical line (see float8_corr comments) */
+ if (!isnan(commonX) || Sxx == 0)
PG_RETURN_NULL();
- /* per spec, return 1.0 for a horizontal line */
- if (Syy == 0)
+ /* per spec, return 1.0 for a horizontal line (see float8_corr comments) */
+ if (!isnan(commonY) || Syy == 0)
PG_RETURN_FLOAT8(1.0);
PG_RETURN_FLOAT8((Sxy * Sxy) / (Sxx * Syy));
@@ -3794,12 +3883,14 @@ float8_regr_slope(PG_FUNCTION_ARGS)
float8 *transvalues;
float8 N,
Sxx,
- Sxy;
+ Sxy,
+ commonX;
- transvalues = check_float8_array(transarray, "float8_regr_slope", 6);
+ transvalues = check_float8_array(transarray, "float8_regr_slope", 8);
N = transvalues[0];
Sxx = transvalues[2];
Sxy = transvalues[5];
+ commonX = transvalues[6];
/* if N is 0 we should return NULL */
if (N < 1.0)
@@ -3807,8 +3898,8 @@ float8_regr_slope(PG_FUNCTION_ARGS)
/* Note that Sxx is guaranteed to be non-negative */
- /* per spec, return NULL for a vertical line */
- if (Sxx == 0)
+ /* per spec, return NULL for a vertical line (see float8_corr comments) */
+ if (!isnan(commonX) || Sxx == 0)
PG_RETURN_NULL();
PG_RETURN_FLOAT8(Sxy / Sxx);
@@ -3823,14 +3914,16 @@ float8_regr_intercept(PG_FUNCTION_ARGS)
Sx,
Sxx,
Sy,
- Sxy;
+ Sxy,
+ commonX;
- transvalues = check_float8_array(transarray, "float8_regr_intercept", 6);
+ transvalues = check_float8_array(transarray, "float8_regr_intercept", 8);
N = transvalues[0];
Sx = transvalues[1];
Sxx = transvalues[2];
Sy = transvalues[3];
Sxy = transvalues[5];
+ commonX = transvalues[6];
/* if N is 0 we should return NULL */
if (N < 1.0)
@@ -3838,8 +3931,8 @@ float8_regr_intercept(PG_FUNCTION_ARGS)
/* Note that Sxx is guaranteed to be non-negative */
- /* per spec, return NULL for a vertical line */
- if (Sxx == 0)
+ /* per spec, return NULL for a vertical line (see float8_corr comments) */
+ if (!isnan(commonX) || Sxx == 0)
PG_RETURN_NULL();
PG_RETURN_FLOAT8((Sy - Sx * Sxy / Sxx) / N);
diff --git a/src/include/catalog/pg_aggregate.dat b/src/include/catalog/pg_aggregate.dat
index 870769e8f14..f22ccfbf49f 100644
--- a/src/include/catalog/pg_aggregate.dat
+++ b/src/include/catalog/pg_aggregate.dat
@@ -475,37 +475,37 @@
aggcombinefn => 'int8pl', aggtranstype => 'int8', agginitval => '0' },
{ aggfnoid => 'regr_sxx', aggtransfn => 'float8_regr_accum',
aggfinalfn => 'float8_regr_sxx', aggcombinefn => 'float8_regr_combine',
- aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0}' },
+ aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0,0,0}' },
{ aggfnoid => 'regr_syy', aggtransfn => 'float8_regr_accum',
aggfinalfn => 'float8_regr_syy', aggcombinefn => 'float8_regr_combine',
- aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0}' },
+ aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0,0,0}' },
{ aggfnoid => 'regr_sxy', aggtransfn => 'float8_regr_accum',
aggfinalfn => 'float8_regr_sxy', aggcombinefn => 'float8_regr_combine',
- aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0}' },
+ aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0,0,0}' },
{ aggfnoid => 'regr_avgx', aggtransfn => 'float8_regr_accum',
aggfinalfn => 'float8_regr_avgx', aggcombinefn => 'float8_regr_combine',
- aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0}' },
+ aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0,0,0}' },
{ aggfnoid => 'regr_avgy', aggtransfn => 'float8_regr_accum',
aggfinalfn => 'float8_regr_avgy', aggcombinefn => 'float8_regr_combine',
- aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0}' },
+ aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0,0,0}' },
{ aggfnoid => 'regr_r2', aggtransfn => 'float8_regr_accum',
aggfinalfn => 'float8_regr_r2', aggcombinefn => 'float8_regr_combine',
- aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0}' },
+ aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0,0,0}' },
{ aggfnoid => 'regr_slope', aggtransfn => 'float8_regr_accum',
aggfinalfn => 'float8_regr_slope', aggcombinefn => 'float8_regr_combine',
- aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0}' },
+ aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0,0,0}' },
{ aggfnoid => 'regr_intercept', aggtransfn => 'float8_regr_accum',
aggfinalfn => 'float8_regr_intercept', aggcombinefn => 'float8_regr_combine',
- aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0}' },
+ aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0,0,0}' },
{ aggfnoid => 'covar_pop', aggtransfn => 'float8_regr_accum',
aggfinalfn => 'float8_covar_pop', aggcombinefn => 'float8_regr_combine',
- aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0}' },
+ aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0,0,0}' },
{ aggfnoid => 'covar_samp', aggtransfn => 'float8_regr_accum',
aggfinalfn => 'float8_covar_samp', aggcombinefn => 'float8_regr_combine',
- aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0}' },
+ aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0,0,0}' },
{ aggfnoid => 'corr', aggtransfn => 'float8_regr_accum',
aggfinalfn => 'float8_corr', aggcombinefn => 'float8_regr_combine',
- aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0}' },
+ aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0,0,0}' },
# boolean-and and boolean-or
{ aggfnoid => 'bool_and', aggtransfn => 'booland_statefunc',
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index be0e1573183..7ba2e613d1e 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -515,6 +515,57 @@ SELECT covar_pop(1::float8,'nan'::float8), covar_samp(3::float8,'nan'::float8);
NaN |
(1 row)
+-- check some cases that formerly had poor roundoff-error behavior
+SET extra_float_digits = 1;
+SELECT corr(0.09, g), regr_r2(0.09, g)
+ FROM generate_series(1, 30) g;
+ corr | regr_r2
+------+---------
+ | 1
+(1 row)
+
+SELECT corr(g, 0.09), regr_r2(g, 0.09), regr_slope(g, 0.09), regr_intercept(g, 0.09)
+ FROM generate_series(1, 30) g;
+ corr | regr_r2 | regr_slope | regr_intercept
+------+---------+------------+----------------
+ | | |
+(1 row)
+
+SELECT corr(1.3 + g * 1e-16, 1.3 + g * 1e-16)
+ FROM generate_series(1, 3) g;
+ corr
+------
+
+(1 row)
+
+SELECT corr(1e-100 + g * 1e-105, 1e-100 + g * 1e-105)
+ FROM generate_series(1, 3) g;
+ corr
+------
+ 1
+(1 row)
+
+SELECT corr(1e-100 + g * 1e-105, 1e-100 + g * 1e-105)
+ FROM generate_series(1, 30) g;
+ corr
+------
+ 1
+(1 row)
+
+SET extra_float_digits = 0;
+-- these examples pose definitional questions for NaN inputs
+SELECT corr(g, 'NaN') FROM generate_series(1, 30) g;
+ corr
+------
+ NaN
+(1 row)
+
+SELECT corr(0.1, 'NaN') FROM generate_series(1, 30) g;
+ corr
+------
+
+(1 row)
+
-- test accum and combine functions directly
CREATE TABLE regr_test (x float8, y float8);
INSERT INTO regr_test VALUES (10,150),(20,250),(30,350),(80,540),(100,200);
@@ -538,10 +589,10 @@ SELECT float8_accum('{4,140,2900}'::float8[], 100);
{5,240,6280}
(1 row)
-SELECT float8_regr_accum('{4,140,2900,1290,83075,15050}'::float8[], 200, 100);
- float8_regr_accum
-------------------------------
- {5,240,6280,1490,95080,8680}
+SELECT float8_regr_accum('{4,140,2900,1290,83075,15050,100,0}'::float8[], 200, 100);
+ float8_regr_accum
+--------------------------------------
+ {5,240,6280,1490,95080,8680,100,NaN}
(1 row)
SELECT count(*), sum(x), regr_sxx(y,x), sum(y),regr_syy(y,x), regr_sxy(y,x)
@@ -576,25 +627,25 @@ SELECT float8_combine('{3,60,200}'::float8[], '{2,180,200}'::float8[]);
{5,240,6280}
(1 row)
-SELECT float8_regr_combine('{3,60,200,750,20000,2000}'::float8[],
- '{0,0,0,0,0,0}'::float8[]);
- float8_regr_combine
----------------------------
- {3,60,200,750,20000,2000}
+SELECT float8_regr_combine('{3,60,200,750,20000,2000,1,NaN}'::float8[],
+ '{0,0,0,0,0,0,0,0}'::float8[]);
+ float8_regr_combine
+---------------------------------
+ {3,60,200,750,20000,2000,1,NaN}
(1 row)
-SELECT float8_regr_combine('{0,0,0,0,0,0}'::float8[],
- '{2,180,200,740,57800,-3400}'::float8[]);
- float8_regr_combine
------------------------------
- {2,180,200,740,57800,-3400}
+SELECT float8_regr_combine('{0,0,0,0,0,0,0,0}'::float8[],
+ '{2,180,200,740,57800,-3400,NaN,1}'::float8[]);
+ float8_regr_combine
+-----------------------------------
+ {2,180,200,740,57800,-3400,NaN,1}
(1 row)
-SELECT float8_regr_combine('{3,60,200,750,20000,2000}'::float8[],
- '{2,180,200,740,57800,-3400}'::float8[]);
- float8_regr_combine
-------------------------------
- {5,240,6280,1490,95080,8680}
+SELECT float8_regr_combine('{3,60,200,750,20000,2000,7,8}'::float8[],
+ '{2,180,200,740,57800,-3400,7,9}'::float8[]);
+ float8_regr_combine
+------------------------------------
+ {5,240,6280,1490,95080,8680,7,NaN}
(1 row)
DROP TABLE regr_test;
diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql
index 77ca6ffa3a9..5eb412bd43b 100644
--- a/src/test/regress/sql/aggregates.sql
+++ b/src/test/regress/sql/aggregates.sql
@@ -140,6 +140,24 @@ SELECT covar_pop(1::float8,2::float8), covar_samp(3::float8,4::float8);
SELECT covar_pop(1::float8,'inf'::float8), covar_samp(3::float8,'inf'::float8);
SELECT covar_pop(1::float8,'nan'::float8), covar_samp(3::float8,'nan'::float8);
+-- check some cases that formerly had poor roundoff-error behavior
+SET extra_float_digits = 1;
+SELECT corr(0.09, g), regr_r2(0.09, g)
+ FROM generate_series(1, 30) g;
+SELECT corr(g, 0.09), regr_r2(g, 0.09), regr_slope(g, 0.09), regr_intercept(g, 0.09)
+ FROM generate_series(1, 30) g;
+SELECT corr(1.3 + g * 1e-16, 1.3 + g * 1e-16)
+ FROM generate_series(1, 3) g;
+SELECT corr(1e-100 + g * 1e-105, 1e-100 + g * 1e-105)
+ FROM generate_series(1, 3) g;
+SELECT corr(1e-100 + g * 1e-105, 1e-100 + g * 1e-105)
+ FROM generate_series(1, 30) g;
+SET extra_float_digits = 0;
+
+-- these examples pose definitional questions for NaN inputs
+SELECT corr(g, 'NaN') FROM generate_series(1, 30) g;
+SELECT corr(0.1, 'NaN') FROM generate_series(1, 30) g;
+
-- test accum and combine functions directly
CREATE TABLE regr_test (x float8, y float8);
INSERT INTO regr_test VALUES (10,150),(20,250),(30,350),(80,540),(100,200);
@@ -148,7 +166,7 @@ FROM regr_test WHERE x IN (10,20,30,80);
SELECT count(*), sum(x), regr_sxx(y,x), sum(y),regr_syy(y,x), regr_sxy(y,x)
FROM regr_test;
SELECT float8_accum('{4,140,2900}'::float8[], 100);
-SELECT float8_regr_accum('{4,140,2900,1290,83075,15050}'::float8[], 200, 100);
+SELECT float8_regr_accum('{4,140,2900,1290,83075,15050,100,0}'::float8[], 200, 100);
SELECT count(*), sum(x), regr_sxx(y,x), sum(y),regr_syy(y,x), regr_sxy(y,x)
FROM regr_test WHERE x IN (10,20,30);
SELECT count(*), sum(x), regr_sxx(y,x), sum(y),regr_syy(y,x), regr_sxy(y,x)
@@ -156,12 +174,12 @@ FROM regr_test WHERE x IN (80,100);
SELECT float8_combine('{3,60,200}'::float8[], '{0,0,0}'::float8[]);
SELECT float8_combine('{0,0,0}'::float8[], '{2,180,200}'::float8[]);
SELECT float8_combine('{3,60,200}'::float8[], '{2,180,200}'::float8[]);
-SELECT float8_regr_combine('{3,60,200,750,20000,2000}'::float8[],
- '{0,0,0,0,0,0}'::float8[]);
-SELECT float8_regr_combine('{0,0,0,0,0,0}'::float8[],
- '{2,180,200,740,57800,-3400}'::float8[]);
-SELECT float8_regr_combine('{3,60,200,750,20000,2000}'::float8[],
- '{2,180,200,740,57800,-3400}'::float8[]);
+SELECT float8_regr_combine('{3,60,200,750,20000,2000,1,NaN}'::float8[],
+ '{0,0,0,0,0,0,0,0}'::float8[]);
+SELECT float8_regr_combine('{0,0,0,0,0,0,0,0}'::float8[],
+ '{2,180,200,740,57800,-3400,NaN,1}'::float8[]);
+SELECT float8_regr_combine('{3,60,200,750,20000,2000,7,8}'::float8[],
+ '{2,180,200,740,57800,-3400,7,9}'::float8[]);
DROP TABLE regr_test;
-- test count, distinct
I wrote:
> Poking at this, I soon found a test case where even with the separate
> sqrt() calls we'd produce a result slightly outside [-1, 1] (running
> this test over more values of x is sufficient). So now I think we
> should do both the separate sqrt and the clamp.
Per CI results, on some platforms the roundoff error is different from
what I observe, producing a value just less than 1 rather than just
more. That doesn't invalidate needing the clamp, but it does mean
that we can't use that test case just like that. I'm inclined to
remove the change of extra_float_digits, but keep the test case.
regards, tom lane
On Wed, 3 Dec 2025 at 22:52, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Attached is a fleshed-out patch proposal that fixes the related
> aggregates and adds test cases.
>
Looking at float8_regr_accum(), I think it would be preferable to
arrange for it to leave Sxx, Syy, and Sxy zero until distinct X and Y
values are seen. I.e., something like this:
if (newvalX != commonX || isnan(newvalX))
commonX = get_float8_nan();
if (newvalY != commonY || isnan(newvalY))
commonY = get_float8_nan();
if (isnan(commonX) || isnan(commonY))
{
tmpX = newvalX * N - Sx;
tmpY = newvalY * N - Sy;
scale = 1.0 / (N * transvalues[0]);
if (isnan(commonX))
Sxx += tmpX * tmpX * scale;
if (isnan(commonY))
Syy += tmpY * tmpY * scale;
if (isnan(commonX) && isnan(commonY))
Sxy += tmpX * tmpY * scale;
... Overflow check ...
}
This would mean that float8_corr(), float8_regr_r2(),
float8_regr_slope(), and float8_regr_intercept() would not need to
look at commonX or commonY, and could simply rely on Sxx == 0 or Syy
== 0 to detect horizontal and vertical lines.
Aside from making the code simpler, this would guarantee that the
aggregate functions regr_sxx() and regr_syy() would return exactly
zero for all-constant X and Y inputs respectively, and that
regr_sxy(), covar_pop(), and covar_samp() would return exactly zero if
either the X or the Y inputs were all constant.
Something else that occurred to me was that float8_regr_avgx() and
float8_regr_avgy() might as well make use of commonX and commonY,
since we're calculating them, so they would return exact averages if
all the X or Y values were the same, rather than results with possible
rounding errors.
I also wonder if it would be worth doing something similar for the
single-variable aggregates so that var_pop(), var_samp(),
stddev_pop(), and stddev_samp() would all return exactly zero, and
avg() would return the exact common value, if all the inputs were
constant.
Regards,
Dean
On Wed, 3 Dec 2025 at 22:52, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Poking at this, I soon found a test case where even with the separate > sqrt() calls we'd produce a result slightly outside [-1, 1] (running > this test over more values of x is sufficient). So now I think we > should do both the separate sqrt and the clamp. > I'm starting to have doubts about having 2 sqrt() calls. The problem is that it seems to produce a noticeable reduction in accuracy in quite a few cases. This is especially noticeable with fully-correlated data. For example: SELECT n, (SELECT corr(x, x) FROM generate_series(1, n) x) FROM generate_series(1, 10) g(n); n | corr ----+-------------------- 1 | 2 | 0.9999999999999998 3 | 0.9999999999999998 4 | 0.9999999999999998 5 | 0.9999999999999998 6 | 1 7 | 0.9999999999999999 8 | 1 9 | 0.9999999999999999 10 | 1 (10 rows) Now I'm not sure that the current code can be expected to get cases like this exactly right 100% of the time, but it's pretty close. For example, if I do this: WITH t1 AS ( SELECT n, random() * 1000 AS r FROM generate_series(1, 1000000) n ), t2 AS ( SELECT corr(r, r) FROM t1 GROUP BY n % 10000 ) SELECT count(*), count(*) FILTER (WHERE corr != 1) FROM t2; on HEAD it produced corr = 1 every time I ran it, whereas the patch gives rounding errors roughly 25% of the time, which seems likely to be noticed. Perhaps we should only use 2 sqrt()'s if the product Sxx * Syy overflows. Regards, Dean
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> Looking at float8_regr_accum(), I think it would be preferable to
> arrange for it to leave Sxx, Syy, and Sxy zero until distinct X and Y
> values are seen. I.e., something like this:
That seems like a good idea. I was initially worried that the extra
isnan() checks would slow down aggregation noticeably in the normal
case where we soon discover that the inputs aren't all equal. They
don't seem to though. For me, the attached v2 clocks in at only 0.9%
slower than HEAD, IOW actually faster than v1; which I suspect just
proves that what we're trying to measure here is comparable to the
noise threshold.
I took your other suggestions too, except for:
> I also wonder if it would be worth doing something similar for the
> single-variable aggregates so that var_pop(), var_samp(),
> stddev_pop(), and stddev_samp() would all return exactly zero, and
> avg() would return the exact common value, if all the inputs were
> constant.
I'm less excited about this, because for all those aggregates,
you have the option of using the numeric variant if you're
dissatisified with the accuracy of the float8 variant. Also,
given that the per-input work is substantially less, the overhead
of tracking the common input value would probably be noticeably
greater. If somebody else wants to investigate that, I won't
stand in the way, but I don't want to do it.
regards, tom lane
diff --git a/src/backend/utils/adt/float.c b/src/backend/utils/adt/float.c
index 7b97d2be6ca..05f920e3aec 100644
--- a/src/backend/utils/adt/float.c
+++ b/src/backend/utils/adt/float.c
@@ -3319,9 +3319,16 @@ float8_stddev_samp(PG_FUNCTION_ARGS)
* As with the preceding aggregates, we use the Youngs-Cramer algorithm to
* reduce rounding errors in the aggregate final functions.
*
- * The transition datatype for all these aggregates is a 6-element array of
+ * The transition datatype for all these aggregates is an 8-element array of
* float8, holding the values N, Sx=sum(X), Sxx=sum((X-Sx/N)^2), Sy=sum(Y),
- * Syy=sum((Y-Sy/N)^2), Sxy=sum((X-Sx/N)*(Y-Sy/N)) in that order.
+ * Syy=sum((Y-Sy/N)^2), Sxy=sum((X-Sx/N)*(Y-Sy/N)), commonX, and commonY
+ * in that order.
+ *
+ * commonX is defined as the common X value if all the X values were the same,
+ * else NaN; likewise for commonY. This is useful for deciding whether corr()
+ * and related functions should return NULL. This representation cannot
+ * distinguish all-the-values-were-NaN from the-values-weren't-all-the-same,
+ * but that's okay because we want to return NaN for all-NaN input.
*
* Note that Y is the first argument to all these aggregates!
*
@@ -3345,17 +3352,21 @@ float8_regr_accum(PG_FUNCTION_ARGS)
Sy,
Syy,
Sxy,
+ commonX,
+ commonY,
tmpX,
tmpY,
scale;
- transvalues = check_float8_array(transarray, "float8_regr_accum", 6);
+ transvalues = check_float8_array(transarray, "float8_regr_accum", 8);
N = transvalues[0];
Sx = transvalues[1];
Sxx = transvalues[2];
Sy = transvalues[3];
Syy = transvalues[4];
Sxy = transvalues[5];
+ commonX = transvalues[6];
+ commonY = transvalues[7];
/*
* Use the Youngs-Cramer algorithm to incorporate the new values into the
@@ -3366,36 +3377,58 @@ float8_regr_accum(PG_FUNCTION_ARGS)
Sy += newvalY;
if (transvalues[0] > 0.0)
{
- tmpX = newvalX * N - Sx;
- tmpY = newvalY * N - Sy;
- scale = 1.0 / (N * transvalues[0]);
- Sxx += tmpX * tmpX * scale;
- Syy += tmpY * tmpY * scale;
- Sxy += tmpX * tmpY * scale;
+ /*
+ * Check to see if we have seen distinct inputs. We can use a test
+ * that's a bit cheaper than float8_ne() because if commonX is already
+ * NaN, it does not matter whether the != test returns true or not.
+ */
+ if (newvalX != commonX || isnan(newvalX))
+ commonX = get_float8_nan();
+ if (newvalY != commonY || isnan(newvalY))
+ commonY = get_float8_nan();
/*
- * Overflow check. We only report an overflow error when finite
- * inputs lead to infinite results. Note also that Sxx, Syy and Sxy
- * should be NaN if any of the relevant inputs are infinite, so we
- * intentionally prevent them from becoming infinite.
+ * If we have not seen distinct inputs, then Sxx, Syy, and/or Sxy
+ * should remain exactly zero (since Sx's exact value would be N *
+ * commonX, etc). Carrying out these calculations would just add the
+ * possibility of injecting roundoff error.
*/
- if (isinf(Sx) || isinf(Sxx) || isinf(Sy) || isinf(Syy) || isinf(Sxy))
+ if (isnan(commonX) || isnan(commonY))
{
- if (((isinf(Sx) || isinf(Sxx)) &&
- !isinf(transvalues[1]) && !isinf(newvalX)) ||
- ((isinf(Sy) || isinf(Syy)) &&
- !isinf(transvalues[3]) && !isinf(newvalY)) ||
- (isinf(Sxy) &&
- !isinf(transvalues[1]) && !isinf(newvalX) &&
- !isinf(transvalues[3]) && !isinf(newvalY)))
- float_overflow_error();
+ tmpX = newvalX * N - Sx;
+ tmpY = newvalY * N - Sy;
+ scale = 1.0 / (N * transvalues[0]);
+ if (isnan(commonX))
+ Sxx += tmpX * tmpX * scale;
+ if (isnan(commonY))
+ Syy += tmpY * tmpY * scale;
+ if (isnan(commonX) && isnan(commonY))
+ Sxy += tmpX * tmpY * scale;
+
+ /*
+ * Overflow check. We only report an overflow error when finite
+ * inputs lead to infinite results. Note also that Sxx, Syy and
+ * Sxy should be NaN if any of the relevant inputs are infinite,
+ * so we intentionally prevent them from becoming infinite.
+ */
+ if (isinf(Sx) || isinf(Sxx) || isinf(Sy) || isinf(Syy) || isinf(Sxy))
+ {
+ if (((isinf(Sx) || isinf(Sxx)) &&
+ !isinf(transvalues[1]) && !isinf(newvalX)) ||
+ ((isinf(Sy) || isinf(Syy)) &&
+ !isinf(transvalues[3]) && !isinf(newvalY)) ||
+ (isinf(Sxy) &&
+ !isinf(transvalues[1]) && !isinf(newvalX) &&
+ !isinf(transvalues[3]) && !isinf(newvalY)))
+ float_overflow_error();
- if (isinf(Sxx))
- Sxx = get_float8_nan();
- if (isinf(Syy))
- Syy = get_float8_nan();
- if (isinf(Sxy))
- Sxy = get_float8_nan();
+ if (isinf(Sxx))
+ Sxx = get_float8_nan();
+ if (isinf(Syy))
+ Syy = get_float8_nan();
+ if (isinf(Sxy))
+ Sxy = get_float8_nan();
+ }
}
}
else
@@ -3410,6 +3443,9 @@ float8_regr_accum(PG_FUNCTION_ARGS)
Sxx = Sxy = get_float8_nan();
if (isnan(newvalY) || isinf(newvalY))
Syy = Sxy = get_float8_nan();
+
+ commonX = newvalX;
+ commonY = newvalY;
}
/*
@@ -3425,12 +3461,14 @@ float8_regr_accum(PG_FUNCTION_ARGS)
transvalues[3] = Sy;
transvalues[4] = Syy;
transvalues[5] = Sxy;
+ transvalues[6] = commonX;
+ transvalues[7] = commonY;
PG_RETURN_ARRAYTYPE_P(transarray);
}
else
{
- Datum transdatums[6];
+ Datum transdatums[8];
ArrayType *result;
transdatums[0] = Float8GetDatumFast(N);
@@ -3439,8 +3477,10 @@ float8_regr_accum(PG_FUNCTION_ARGS)
transdatums[3] = Float8GetDatumFast(Sy);
transdatums[4] = Float8GetDatumFast(Syy);
transdatums[5] = Float8GetDatumFast(Sxy);
+ transdatums[6] = Float8GetDatumFast(commonX);
+ transdatums[7] = Float8GetDatumFast(commonY);
- result = construct_array_builtin(transdatums, 6, FLOAT8OID);
+ result = construct_array_builtin(transdatums, 8, FLOAT8OID);
PG_RETURN_ARRAYTYPE_P(result);
}
@@ -3449,7 +3489,7 @@ float8_regr_accum(PG_FUNCTION_ARGS)
/*
* float8_regr_combine
*
- * An aggregate combine function used to combine two 6 fields
+ * An aggregate combine function used to combine two 8-fields
* aggregate transition data into a single transition data.
* This function is used only in two stage aggregation and
* shouldn't be called outside aggregate context.
@@ -3467,12 +3507,16 @@ float8_regr_combine(PG_FUNCTION_ARGS)
Sy1,
Syy1,
Sxy1,
+ Cx1,
+ Cy1,
N2,
Sx2,
Sxx2,
Sy2,
Syy2,
Sxy2,
+ Cx2,
+ Cy2,
tmp1,
tmp2,
N,
@@ -3480,10 +3524,12 @@ float8_regr_combine(PG_FUNCTION_ARGS)
Sxx,
Sy,
Syy,
- Sxy;
+ Sxy,
+ Cx,
+ Cy;
- transvalues1 = check_float8_array(transarray1, "float8_regr_combine", 6);
- transvalues2 = check_float8_array(transarray2, "float8_regr_combine", 6);
+ transvalues1 = check_float8_array(transarray1, "float8_regr_combine", 8);
+ transvalues2 = check_float8_array(transarray2, "float8_regr_combine", 8);
N1 = transvalues1[0];
Sx1 = transvalues1[1];
@@ -3491,6 +3537,8 @@ float8_regr_combine(PG_FUNCTION_ARGS)
Sy1 = transvalues1[3];
Syy1 = transvalues1[4];
Sxy1 = transvalues1[5];
+ Cx1 = transvalues1[6];
+ Cy1 = transvalues1[7];
N2 = transvalues2[0];
Sx2 = transvalues2[1];
@@ -3498,6 +3546,8 @@ float8_regr_combine(PG_FUNCTION_ARGS)
Sy2 = transvalues2[3];
Syy2 = transvalues2[4];
Sxy2 = transvalues2[5];
+ Cx2 = transvalues2[6];
+ Cy2 = transvalues2[7];
/*--------------------
* The transition values combine using a generalization of the
@@ -3523,6 +3573,8 @@ float8_regr_combine(PG_FUNCTION_ARGS)
Sy = Sy2;
Syy = Syy2;
Sxy = Sxy2;
+ Cx = Cx2;
+ Cy = Cy2;
}
else if (N2 == 0.0)
{
@@ -3532,6 +3584,8 @@ float8_regr_combine(PG_FUNCTION_ARGS)
Sy = Sy1;
Syy = Syy1;
Sxy = Sxy1;
+ Cx = Cx1;
+ Cy = Cy1;
}
else
{
@@ -3549,6 +3603,14 @@ float8_regr_combine(PG_FUNCTION_ARGS)
Sxy = Sxy1 + Sxy2 + N1 * N2 * tmp1 * tmp2 / N;
if (unlikely(isinf(Sxy)) && !isinf(Sxy1) && !isinf(Sxy2))
float_overflow_error();
+ if (float8_eq(Cx1, Cx2))
+ Cx = Cx1;
+ else
+ Cx = get_float8_nan();
+ if (float8_eq(Cy1, Cy2))
+ Cy = Cy1;
+ else
+ Cy = get_float8_nan();
}
/*
@@ -3564,12 +3626,14 @@ float8_regr_combine(PG_FUNCTION_ARGS)
transvalues1[3] = Sy;
transvalues1[4] = Syy;
transvalues1[5] = Sxy;
+ transvalues1[6] = Cx;
+ transvalues1[7] = Cy;
PG_RETURN_ARRAYTYPE_P(transarray1);
}
else
{
- Datum transdatums[6];
+ Datum transdatums[8];
ArrayType *result;
transdatums[0] = Float8GetDatumFast(N);
@@ -3578,8 +3642,10 @@ float8_regr_combine(PG_FUNCTION_ARGS)
transdatums[3] = Float8GetDatumFast(Sy);
transdatums[4] = Float8GetDatumFast(Syy);
transdatums[5] = Float8GetDatumFast(Sxy);
+ transdatums[6] = Float8GetDatumFast(Cx);
+ transdatums[7] = Float8GetDatumFast(Cy);
- result = construct_array_builtin(transdatums, 6, FLOAT8OID);
+ result = construct_array_builtin(transdatums, 8, FLOAT8OID);
PG_RETURN_ARRAYTYPE_P(result);
}
@@ -3594,7 +3660,7 @@ float8_regr_sxx(PG_FUNCTION_ARGS)
float8 N,
Sxx;
- transvalues = check_float8_array(transarray, "float8_regr_sxx", 6);
+ transvalues = check_float8_array(transarray, "float8_regr_sxx", 8);
N = transvalues[0];
Sxx = transvalues[2];
@@ -3615,7 +3681,7 @@ float8_regr_syy(PG_FUNCTION_ARGS)
float8 N,
Syy;
- transvalues = check_float8_array(transarray, "float8_regr_syy", 6);
+ transvalues = check_float8_array(transarray, "float8_regr_syy", 8);
N = transvalues[0];
Syy = transvalues[4];
@@ -3636,7 +3702,7 @@ float8_regr_sxy(PG_FUNCTION_ARGS)
float8 N,
Sxy;
- transvalues = check_float8_array(transarray, "float8_regr_sxy", 6);
+ transvalues = check_float8_array(transarray, "float8_regr_sxy", 8);
N = transvalues[0];
Sxy = transvalues[5];
@@ -3655,16 +3721,22 @@ float8_regr_avgx(PG_FUNCTION_ARGS)
ArrayType *transarray = PG_GETARG_ARRAYTYPE_P(0);
float8 *transvalues;
float8 N,
- Sx;
+ Sx,
+ commonX;
- transvalues = check_float8_array(transarray, "float8_regr_avgx", 6);
+ transvalues = check_float8_array(transarray, "float8_regr_avgx", 8);
N = transvalues[0];
Sx = transvalues[1];
+ commonX = transvalues[6];
/* if N is 0 we should return NULL */
if (N < 1.0)
PG_RETURN_NULL();
+ /* if all inputs were the same just return that, avoiding roundoff error */
+ if (!isnan(commonX))
+ PG_RETURN_FLOAT8(commonX);
+
PG_RETURN_FLOAT8(Sx / N);
}
@@ -3674,16 +3746,22 @@ float8_regr_avgy(PG_FUNCTION_ARGS)
ArrayType *transarray = PG_GETARG_ARRAYTYPE_P(0);
float8 *transvalues;
float8 N,
- Sy;
+ Sy,
+ commonY;
- transvalues = check_float8_array(transarray, "float8_regr_avgy", 6);
+ transvalues = check_float8_array(transarray, "float8_regr_avgy", 8);
N = transvalues[0];
Sy = transvalues[3];
+ commonY = transvalues[7];
/* if N is 0 we should return NULL */
if (N < 1.0)
PG_RETURN_NULL();
+ /* if all inputs were the same just return that, avoiding roundoff error */
+ if (!isnan(commonY))
+ PG_RETURN_FLOAT8(commonY);
+
PG_RETURN_FLOAT8(Sy / N);
}
@@ -3695,7 +3773,7 @@ float8_covar_pop(PG_FUNCTION_ARGS)
float8 N,
Sxy;
- transvalues = check_float8_array(transarray, "float8_covar_pop", 6);
+ transvalues = check_float8_array(transarray, "float8_covar_pop", 8);
N = transvalues[0];
Sxy = transvalues[5];
@@ -3714,7 +3792,7 @@ float8_covar_samp(PG_FUNCTION_ARGS)
float8 N,
Sxy;
- transvalues = check_float8_array(transarray, "float8_covar_samp", 6);
+ transvalues = check_float8_array(transarray, "float8_covar_samp", 8);
N = transvalues[0];
Sxy = transvalues[5];
@@ -3733,9 +3811,12 @@ float8_corr(PG_FUNCTION_ARGS)
float8 N,
Sxx,
Syy,
- Sxy;
+ Sxy,
+ product,
+ sqrtproduct,
+ result;
- transvalues = check_float8_array(transarray, "float8_corr", 6);
+ transvalues = check_float8_array(transarray, "float8_corr", 8);
N = transvalues[0];
Sxx = transvalues[2];
Syy = transvalues[4];
@@ -3751,7 +3832,29 @@ float8_corr(PG_FUNCTION_ARGS)
if (Sxx == 0 || Syy == 0)
PG_RETURN_NULL();
- PG_RETURN_FLOAT8(Sxy / sqrt(Sxx * Syy));
+ /*
+ * The product Sxx * Syy might underflow or overflow. If so, we can
+ * recover by computing sqrt(Sxx) * sqrt(Syy) instead of sqrt(Sxx * Syy).
+ * However, the double sqrt() calculation is a bit slower and less
+ * accurate, so don't do it if we don't have to.
+ */
+ product = Sxx * Syy;
+ if (product == 0 || isinf(product))
+ sqrtproduct = sqrt(Sxx) * sqrt(Syy);
+ else
+ sqrtproduct = sqrt(product);
+ result = Sxy / sqrtproduct;
+
+ /*
+ * Despite all these precautions, this formula can yield results outside
+ * [-1, 1] due to roundoff error. Clamp it to the expected range.
+ */
+ if (result < -1)
+ result = -1;
+ else if (result > 1)
+ result = 1;
+
+ PG_RETURN_FLOAT8(result);
}
Datum
@@ -3764,7 +3867,7 @@ float8_regr_r2(PG_FUNCTION_ARGS)
Syy,
Sxy;
- transvalues = check_float8_array(transarray, "float8_regr_r2", 6);
+ transvalues = check_float8_array(transarray, "float8_regr_r2", 8);
N = transvalues[0];
Sxx = transvalues[2];
Syy = transvalues[4];
@@ -3796,7 +3899,7 @@ float8_regr_slope(PG_FUNCTION_ARGS)
Sxx,
Sxy;
- transvalues = check_float8_array(transarray, "float8_regr_slope", 6);
+ transvalues = check_float8_array(transarray, "float8_regr_slope", 8);
N = transvalues[0];
Sxx = transvalues[2];
Sxy = transvalues[5];
@@ -3825,7 +3928,7 @@ float8_regr_intercept(PG_FUNCTION_ARGS)
Sy,
Sxy;
- transvalues = check_float8_array(transarray, "float8_regr_intercept", 6);
+ transvalues = check_float8_array(transarray, "float8_regr_intercept", 8);
N = transvalues[0];
Sx = transvalues[1];
Sxx = transvalues[2];
diff --git a/src/include/catalog/pg_aggregate.dat b/src/include/catalog/pg_aggregate.dat
index 870769e8f14..f22ccfbf49f 100644
--- a/src/include/catalog/pg_aggregate.dat
+++ b/src/include/catalog/pg_aggregate.dat
@@ -475,37 +475,37 @@
aggcombinefn => 'int8pl', aggtranstype => 'int8', agginitval => '0' },
{ aggfnoid => 'regr_sxx', aggtransfn => 'float8_regr_accum',
aggfinalfn => 'float8_regr_sxx', aggcombinefn => 'float8_regr_combine',
- aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0}' },
+ aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0,0,0}' },
{ aggfnoid => 'regr_syy', aggtransfn => 'float8_regr_accum',
aggfinalfn => 'float8_regr_syy', aggcombinefn => 'float8_regr_combine',
- aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0}' },
+ aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0,0,0}' },
{ aggfnoid => 'regr_sxy', aggtransfn => 'float8_regr_accum',
aggfinalfn => 'float8_regr_sxy', aggcombinefn => 'float8_regr_combine',
- aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0}' },
+ aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0,0,0}' },
{ aggfnoid => 'regr_avgx', aggtransfn => 'float8_regr_accum',
aggfinalfn => 'float8_regr_avgx', aggcombinefn => 'float8_regr_combine',
- aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0}' },
+ aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0,0,0}' },
{ aggfnoid => 'regr_avgy', aggtransfn => 'float8_regr_accum',
aggfinalfn => 'float8_regr_avgy', aggcombinefn => 'float8_regr_combine',
- aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0}' },
+ aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0,0,0}' },
{ aggfnoid => 'regr_r2', aggtransfn => 'float8_regr_accum',
aggfinalfn => 'float8_regr_r2', aggcombinefn => 'float8_regr_combine',
- aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0}' },
+ aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0,0,0}' },
{ aggfnoid => 'regr_slope', aggtransfn => 'float8_regr_accum',
aggfinalfn => 'float8_regr_slope', aggcombinefn => 'float8_regr_combine',
- aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0}' },
+ aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0,0,0}' },
{ aggfnoid => 'regr_intercept', aggtransfn => 'float8_regr_accum',
aggfinalfn => 'float8_regr_intercept', aggcombinefn => 'float8_regr_combine',
- aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0}' },
+ aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0,0,0}' },
{ aggfnoid => 'covar_pop', aggtransfn => 'float8_regr_accum',
aggfinalfn => 'float8_covar_pop', aggcombinefn => 'float8_regr_combine',
- aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0}' },
+ aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0,0,0}' },
{ aggfnoid => 'covar_samp', aggtransfn => 'float8_regr_accum',
aggfinalfn => 'float8_covar_samp', aggcombinefn => 'float8_regr_combine',
- aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0}' },
+ aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0,0,0}' },
{ aggfnoid => 'corr', aggtransfn => 'float8_regr_accum',
aggfinalfn => 'float8_corr', aggcombinefn => 'float8_regr_combine',
- aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0}' },
+ aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0,0,0}' },
# boolean-and and boolean-or
{ aggfnoid => 'bool_and', aggtransfn => 'booland_statefunc',
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index be0e1573183..464a5ce39da 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -515,6 +515,55 @@ SELECT covar_pop(1::float8,'nan'::float8), covar_samp(3::float8,'nan'::float8);
NaN |
(1 row)
+-- check some cases that formerly had poor roundoff-error behavior
+SELECT corr(0.09, g), regr_r2(0.09, g)
+ FROM generate_series(1, 30) g;
+ corr | regr_r2
+------+---------
+ | 1
+(1 row)
+
+SELECT corr(g, 0.09), regr_r2(g, 0.09), regr_slope(g, 0.09), regr_intercept(g, 0.09)
+ FROM generate_series(1, 30) g;
+ corr | regr_r2 | regr_slope | regr_intercept
+------+---------+------------+----------------
+ | | |
+(1 row)
+
+SELECT corr(1.3 + g * 1e-16, 1.3 + g * 1e-16)
+ FROM generate_series(1, 3) g;
+ corr
+------
+
+(1 row)
+
+SELECT corr(1e-100 + g * 1e-105, 1e-100 + g * 1e-105)
+ FROM generate_series(1, 3) g;
+ corr
+------
+ 1
+(1 row)
+
+SELECT corr(1e-100 + g * 1e-105, 1e-100 + g * 1e-105)
+ FROM generate_series(1, 30) g;
+ corr
+------
+ 1
+(1 row)
+
+-- these examples pose definitional questions for NaN inputs
+SELECT corr(g, 'NaN') FROM generate_series(1, 30) g;
+ corr
+------
+ NaN
+(1 row)
+
+SELECT corr(0.1, 'NaN') FROM generate_series(1, 30) g;
+ corr
+------
+
+(1 row)
+
-- test accum and combine functions directly
CREATE TABLE regr_test (x float8, y float8);
INSERT INTO regr_test VALUES (10,150),(20,250),(30,350),(80,540),(100,200);
@@ -538,10 +587,10 @@ SELECT float8_accum('{4,140,2900}'::float8[], 100);
{5,240,6280}
(1 row)
-SELECT float8_regr_accum('{4,140,2900,1290,83075,15050}'::float8[], 200, 100);
- float8_regr_accum
-------------------------------
- {5,240,6280,1490,95080,8680}
+SELECT float8_regr_accum('{4,140,2900,1290,83075,15050,100,0}'::float8[], 200, 100);
+ float8_regr_accum
+---------------------------------------
+ {5,240,2900,1490,95080,15050,100,NaN}
(1 row)
SELECT count(*), sum(x), regr_sxx(y,x), sum(y),regr_syy(y,x), regr_sxy(y,x)
@@ -576,25 +625,25 @@ SELECT float8_combine('{3,60,200}'::float8[], '{2,180,200}'::float8[]);
{5,240,6280}
(1 row)
-SELECT float8_regr_combine('{3,60,200,750,20000,2000}'::float8[],
- '{0,0,0,0,0,0}'::float8[]);
- float8_regr_combine
----------------------------
- {3,60,200,750,20000,2000}
+SELECT float8_regr_combine('{3,60,200,750,20000,2000,1,NaN}'::float8[],
+ '{0,0,0,0,0,0,0,0}'::float8[]);
+ float8_regr_combine
+---------------------------------
+ {3,60,200,750,20000,2000,1,NaN}
(1 row)
-SELECT float8_regr_combine('{0,0,0,0,0,0}'::float8[],
- '{2,180,200,740,57800,-3400}'::float8[]);
- float8_regr_combine
------------------------------
- {2,180,200,740,57800,-3400}
+SELECT float8_regr_combine('{0,0,0,0,0,0,0,0}'::float8[],
+ '{2,180,200,740,57800,-3400,NaN,1}'::float8[]);
+ float8_regr_combine
+-----------------------------------
+ {2,180,200,740,57800,-3400,NaN,1}
(1 row)
-SELECT float8_regr_combine('{3,60,200,750,20000,2000}'::float8[],
- '{2,180,200,740,57800,-3400}'::float8[]);
- float8_regr_combine
-------------------------------
- {5,240,6280,1490,95080,8680}
+SELECT float8_regr_combine('{3,60,200,750,20000,2000,7,8}'::float8[],
+ '{2,180,200,740,57800,-3400,7,9}'::float8[]);
+ float8_regr_combine
+------------------------------------
+ {5,240,6280,1490,95080,8680,7,NaN}
(1 row)
DROP TABLE regr_test;
diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql
index 77ca6ffa3a9..0bc7fea8186 100644
--- a/src/test/regress/sql/aggregates.sql
+++ b/src/test/regress/sql/aggregates.sql
@@ -140,6 +140,22 @@ SELECT covar_pop(1::float8,2::float8), covar_samp(3::float8,4::float8);
SELECT covar_pop(1::float8,'inf'::float8), covar_samp(3::float8,'inf'::float8);
SELECT covar_pop(1::float8,'nan'::float8), covar_samp(3::float8,'nan'::float8);
+-- check some cases that formerly had poor roundoff-error behavior
+SELECT corr(0.09, g), regr_r2(0.09, g)
+ FROM generate_series(1, 30) g;
+SELECT corr(g, 0.09), regr_r2(g, 0.09), regr_slope(g, 0.09), regr_intercept(g, 0.09)
+ FROM generate_series(1, 30) g;
+SELECT corr(1.3 + g * 1e-16, 1.3 + g * 1e-16)
+ FROM generate_series(1, 3) g;
+SELECT corr(1e-100 + g * 1e-105, 1e-100 + g * 1e-105)
+ FROM generate_series(1, 3) g;
+SELECT corr(1e-100 + g * 1e-105, 1e-100 + g * 1e-105)
+ FROM generate_series(1, 30) g;
+
+-- these examples pose definitional questions for NaN inputs
+SELECT corr(g, 'NaN') FROM generate_series(1, 30) g;
+SELECT corr(0.1, 'NaN') FROM generate_series(1, 30) g;
+
-- test accum and combine functions directly
CREATE TABLE regr_test (x float8, y float8);
INSERT INTO regr_test VALUES (10,150),(20,250),(30,350),(80,540),(100,200);
@@ -148,7 +164,7 @@ FROM regr_test WHERE x IN (10,20,30,80);
SELECT count(*), sum(x), regr_sxx(y,x), sum(y),regr_syy(y,x), regr_sxy(y,x)
FROM regr_test;
SELECT float8_accum('{4,140,2900}'::float8[], 100);
-SELECT float8_regr_accum('{4,140,2900,1290,83075,15050}'::float8[], 200, 100);
+SELECT float8_regr_accum('{4,140,2900,1290,83075,15050,100,0}'::float8[], 200, 100);
SELECT count(*), sum(x), regr_sxx(y,x), sum(y),regr_syy(y,x), regr_sxy(y,x)
FROM regr_test WHERE x IN (10,20,30);
SELECT count(*), sum(x), regr_sxx(y,x), sum(y),regr_syy(y,x), regr_sxy(y,x)
@@ -156,12 +172,12 @@ FROM regr_test WHERE x IN (80,100);
SELECT float8_combine('{3,60,200}'::float8[], '{0,0,0}'::float8[]);
SELECT float8_combine('{0,0,0}'::float8[], '{2,180,200}'::float8[]);
SELECT float8_combine('{3,60,200}'::float8[], '{2,180,200}'::float8[]);
-SELECT float8_regr_combine('{3,60,200,750,20000,2000}'::float8[],
- '{0,0,0,0,0,0}'::float8[]);
-SELECT float8_regr_combine('{0,0,0,0,0,0}'::float8[],
- '{2,180,200,740,57800,-3400}'::float8[]);
-SELECT float8_regr_combine('{3,60,200,750,20000,2000}'::float8[],
- '{2,180,200,740,57800,-3400}'::float8[]);
+SELECT float8_regr_combine('{3,60,200,750,20000,2000,1,NaN}'::float8[],
+ '{0,0,0,0,0,0,0,0}'::float8[]);
+SELECT float8_regr_combine('{0,0,0,0,0,0,0,0}'::float8[],
+ '{2,180,200,740,57800,-3400,NaN,1}'::float8[]);
+SELECT float8_regr_combine('{3,60,200,750,20000,2000,7,8}'::float8[],
+ '{2,180,200,740,57800,-3400,7,9}'::float8[]);
DROP TABLE regr_test;
-- test count, distinct
I wrote:
> Dean Rasheed <dean.a.rasheed@gmail.com> writes:
>> Looking at float8_regr_accum(), I think it would be preferable to
>> arrange for it to leave Sxx, Syy, and Sxy zero until distinct X and Y
>> values are seen. I.e., something like this:
> That seems like a good idea. I was initially worried that the extra
> isnan() checks would slow down aggregation noticeably in the normal
> case where we soon discover that the inputs aren't all equal.
BTW, re-reading the patch, I now think we should drop the initial
if (isnan(commonX) || isnan(commonY))
test, instead bulling ahead with computing tmpX/tmpY/scale, and
only skip the updates of Sxx/Syy/Sxy when we have constant inputs.
Using that initial test is optimizing for constant inputs at the
expense of non-constant inputs, which seems like the wrong way
to bet.
regards, tom lane
On Sat, 6 Dec 2025 at 06:09, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> > Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> >> Looking at float8_regr_accum(), I think it would be preferable to
> >> arrange for it to leave Sxx, Syy, and Sxy zero until distinct X and Y
> >> values are seen. I.e., something like this:
>
> > That seems like a good idea. I was initially worried that the extra
> > isnan() checks would slow down aggregation noticeably in the normal
> > case where we soon discover that the inputs aren't all equal.
>
> BTW, re-reading the patch, I now think we should drop the initial
>
> if (isnan(commonX) || isnan(commonY))
>
> test, instead bulling ahead with computing tmpX/tmpY/scale, and
> only skip the updates of Sxx/Syy/Sxy when we have constant inputs.
> Using that initial test is optimizing for constant inputs at the
> expense of non-constant inputs, which seems like the wrong way
> to bet.
[shrug] I doubt that it will make any noticeable performance
difference, but it will reduce the size of the diff and the final
code.
I've been thinking some more about the behaviour with NaNs, and I was
initially surprised by the difference between these 2 results:
SELECT corr(0.1, 'NaN') FROM generate_series(1, 30) g;
corr
------
(1 row)
SELECT corr('NaN', 'NaN') FROM generate_series(1, 30) g;
corr
------
NaN
(1 row)
but actually, on reflection, I think it makes sense.
The first result is required by the SQL standard, which says first
that if N * Sxx = Sx * Sx, then the result is NULL, and similarly for
y. So that rule should take precedence over any rule for NaNs.
The second result can be justified by the IEEE rules for NaNs,
according to which NaN does not equal any other number, including
itself, and so an all-NaN column is not all equal (and it doesn't
satisfy "N * Sxx = Sx * Sx" either). So the SQL standard all-the-same
rule doesn't apply to the second query, and the standard computation
yields NaN.
If we're happy with those decisions, then I think this comment should
be updated:
+ * commonX is defined as the common X value if all the X values were the same,
+ * else NaN; likewise for commonY. This is useful for deciding whether corr()
+ * and related functions should return NULL. This representation cannot
+ * distinguish all-the-values-were-NaN from the-values-weren't-all-the-same,
+ * but that's okay because we want to return NaN for all-NaN input.
because that second sentence is only true for all-NaN input in both X
and Y. Perhaps say that all-NaNs in a column is intentionally not
treated as all-the-same in that column, in accordance with IEEE.
Also, we should probably include the second query above in the regression tests.
Regards,
Dean
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> The first result is required by the SQL standard, which says first
> that if N * Sxx = Sx * Sx, then the result is NULL, and similarly for
> y. So that rule should take precedence over any rule for NaNs.
> The second result can be justified by the IEEE rules for NaNs,
> according to which NaN does not equal any other number, including
> itself, and so an all-NaN column is not all equal (and it doesn't
> satisfy "N * Sxx = Sx * Sx" either). So the SQL standard all-the-same
> rule doesn't apply to the second query, and the standard computation
> yields NaN.
Right, that's pretty much the thinking I've ended up on.
> If we're happy with those decisions, then I think this comment should
> be updated:
Here's a v3 with another try at that comment, and the other points
addressed. Also now with a draft commit message. I credited you
as co-author since so much of this is your ideas.
regards, tom lane
From 20bada97f9f36d5cb0d6f211f57c084b8cdb2d79 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sat, 6 Dec 2025 13:13:14 -0500
Subject: [PATCH v3] Handle constant inputs to corr() and related aggregates
more precisely.
The SQL standard says that corr() and friends should return NULL in
the mathematically-undefined case where all the inputs in one of
the columns have the same value. We were checking that by seeing
if the sums Sxx and Syy were zero, but that approach is very
vulnerable to roundoff error: if a sum is close to zero but not
exactly that, we'd come out with a pretty silly non-NULL result.
Instead, directly track whether the inputs are all equal by
remembering the common value in each column. Once we detect
that a new input is different from before, represent that by
storing NaN for the common value. (An objection to this scheme
is that if the inputs are all NaN, we will consider that they
were not all equal. But under IEEE float arithmetic rules,
one NaN is never equal to another, so this behavior is arguably
correct. Moreover it matches what we did before in such cases.)
Then, leave the sums at their exact value of zero for as long
as we haven't detected different input values.
This solution requires the aggregate transition state to contain
8 float values not 6, which is not problematic, and it seems to add
less than 1% to the aggregates' runtime, which seems acceptable.
While we're here, improve corr()'s final function to cope with
overflow/underflow in the final calculation, and to clamp its
result to [-1, 1] in case of roundoff error.
Although this is arguably a bug fix, it requires a catversion bump
due to the change in aggregates' initial states, so it can't be
back-patched.
Patch by me, but many of the ideas are due to Dean Rasheed,
who also did a deal of testing.
Bug: #19340
Reported-by: Oleg Ivanov <o15611@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Co-authored-by: Dean Rasheed <dean.a.rasheed@gmail.com>
Discussion: https://postgr.es/m/19340-6fb9f6637f562092@postgresql.org
---
src/backend/utils/adt/float.c | 165 +++++++++++++++++++----
src/include/catalog/pg_aggregate.dat | 22 +--
src/test/regress/expected/aggregates.out | 94 ++++++++++---
src/test/regress/sql/aggregates.sql | 32 ++++-
4 files changed, 247 insertions(+), 66 deletions(-)
diff --git a/src/backend/utils/adt/float.c b/src/backend/utils/adt/float.c
index 7b97d2be6ca..3a60dc73a94 100644
--- a/src/backend/utils/adt/float.c
+++ b/src/backend/utils/adt/float.c
@@ -3319,9 +3319,21 @@ float8_stddev_samp(PG_FUNCTION_ARGS)
* As with the preceding aggregates, we use the Youngs-Cramer algorithm to
* reduce rounding errors in the aggregate final functions.
*
- * The transition datatype for all these aggregates is a 6-element array of
+ * The transition datatype for all these aggregates is an 8-element array of
* float8, holding the values N, Sx=sum(X), Sxx=sum((X-Sx/N)^2), Sy=sum(Y),
- * Syy=sum((Y-Sy/N)^2), Sxy=sum((X-Sx/N)*(Y-Sy/N)) in that order.
+ * Syy=sum((Y-Sy/N)^2), Sxy=sum((X-Sx/N)*(Y-Sy/N)), commonX, and commonY
+ * in that order.
+ *
+ * commonX is defined as the common X value if all the X values were the same,
+ * else NaN; likewise for commonY. This is useful for deciding whether corr()
+ * and related functions should return NULL. This representation cannot
+ * distinguish the-values-were-all-NaN from the-values-were-not-all-the-same,
+ * but that's okay because for this purpose we use the IEEE float arithmetic
+ * principle that two NaNs are never equal. The SQL standard doesn't mention
+ * NaNs, but it says that NULL is to be returned when N*sum(X*X) equals
+ * sum(X)*sum(X) (etc), and that shouldn't be considered true for NaNs.
+ * Testing that as written in the spec would be highly subject to roundoff
+ * error, so instead we directly track whether all the inputs are equal.
*
* Note that Y is the first argument to all these aggregates!
*
@@ -3345,17 +3357,21 @@ float8_regr_accum(PG_FUNCTION_ARGS)
Sy,
Syy,
Sxy,
+ commonX,
+ commonY,
tmpX,
tmpY,
scale;
- transvalues = check_float8_array(transarray, "float8_regr_accum", 6);
+ transvalues = check_float8_array(transarray, "float8_regr_accum", 8);
N = transvalues[0];
Sx = transvalues[1];
Sxx = transvalues[2];
Sy = transvalues[3];
Syy = transvalues[4];
Sxy = transvalues[5];
+ commonX = transvalues[6];
+ commonY = transvalues[7];
/*
* Use the Youngs-Cramer algorithm to incorporate the new values into the
@@ -3366,12 +3382,33 @@ float8_regr_accum(PG_FUNCTION_ARGS)
Sy += newvalY;
if (transvalues[0] > 0.0)
{
+ /*
+ * Check to see if we have seen distinct inputs. We can use a test
+ * that's a bit cheaper than float8_ne() because if commonX is already
+ * NaN, it does not matter whether the != test returns true or not.
+ */
+ if (newvalX != commonX || isnan(newvalX))
+ commonX = get_float8_nan();
+ if (newvalY != commonY || isnan(newvalY))
+ commonY = get_float8_nan();
+
tmpX = newvalX * N - Sx;
tmpY = newvalY * N - Sy;
scale = 1.0 / (N * transvalues[0]);
- Sxx += tmpX * tmpX * scale;
- Syy += tmpY * tmpY * scale;
- Sxy += tmpX * tmpY * scale;
+
+ /*
+ * If we have not seen distinct inputs, then Sxx, Syy, and/or Sxy
+ * should remain zero (since Sx's exact value would be N * commonX,
+ * etc). Updating them would just create the possibility of injecting
+ * roundoff error, and we need exact zero results so that the final
+ * functions will return NULL in the right cases.
+ */
+ if (isnan(commonX))
+ Sxx += tmpX * tmpX * scale;
+ if (isnan(commonY))
+ Syy += tmpY * tmpY * scale;
+ if (isnan(commonX) && isnan(commonY))
+ Sxy += tmpX * tmpY * scale;
/*
* Overflow check. We only report an overflow error when finite
@@ -3410,6 +3447,9 @@ float8_regr_accum(PG_FUNCTION_ARGS)
Sxx = Sxy = get_float8_nan();
if (isnan(newvalY) || isinf(newvalY))
Syy = Sxy = get_float8_nan();
+
+ commonX = newvalX;
+ commonY = newvalY;
}
/*
@@ -3425,12 +3465,14 @@ float8_regr_accum(PG_FUNCTION_ARGS)
transvalues[3] = Sy;
transvalues[4] = Syy;
transvalues[5] = Sxy;
+ transvalues[6] = commonX;
+ transvalues[7] = commonY;
PG_RETURN_ARRAYTYPE_P(transarray);
}
else
{
- Datum transdatums[6];
+ Datum transdatums[8];
ArrayType *result;
transdatums[0] = Float8GetDatumFast(N);
@@ -3439,8 +3481,10 @@ float8_regr_accum(PG_FUNCTION_ARGS)
transdatums[3] = Float8GetDatumFast(Sy);
transdatums[4] = Float8GetDatumFast(Syy);
transdatums[5] = Float8GetDatumFast(Sxy);
+ transdatums[6] = Float8GetDatumFast(commonX);
+ transdatums[7] = Float8GetDatumFast(commonY);
- result = construct_array_builtin(transdatums, 6, FLOAT8OID);
+ result = construct_array_builtin(transdatums, 8, FLOAT8OID);
PG_RETURN_ARRAYTYPE_P(result);
}
@@ -3449,7 +3493,7 @@ float8_regr_accum(PG_FUNCTION_ARGS)
/*
* float8_regr_combine
*
- * An aggregate combine function used to combine two 6 fields
+ * An aggregate combine function used to combine two 8-fields
* aggregate transition data into a single transition data.
* This function is used only in two stage aggregation and
* shouldn't be called outside aggregate context.
@@ -3467,12 +3511,16 @@ float8_regr_combine(PG_FUNCTION_ARGS)
Sy1,
Syy1,
Sxy1,
+ Cx1,
+ Cy1,
N2,
Sx2,
Sxx2,
Sy2,
Syy2,
Sxy2,
+ Cx2,
+ Cy2,
tmp1,
tmp2,
N,
@@ -3480,10 +3528,12 @@ float8_regr_combine(PG_FUNCTION_ARGS)
Sxx,
Sy,
Syy,
- Sxy;
+ Sxy,
+ Cx,
+ Cy;
- transvalues1 = check_float8_array(transarray1, "float8_regr_combine", 6);
- transvalues2 = check_float8_array(transarray2, "float8_regr_combine", 6);
+ transvalues1 = check_float8_array(transarray1, "float8_regr_combine", 8);
+ transvalues2 = check_float8_array(transarray2, "float8_regr_combine", 8);
N1 = transvalues1[0];
Sx1 = transvalues1[1];
@@ -3491,6 +3541,8 @@ float8_regr_combine(PG_FUNCTION_ARGS)
Sy1 = transvalues1[3];
Syy1 = transvalues1[4];
Sxy1 = transvalues1[5];
+ Cx1 = transvalues1[6];
+ Cy1 = transvalues1[7];
N2 = transvalues2[0];
Sx2 = transvalues2[1];
@@ -3498,6 +3550,8 @@ float8_regr_combine(PG_FUNCTION_ARGS)
Sy2 = transvalues2[3];
Syy2 = transvalues2[4];
Sxy2 = transvalues2[5];
+ Cx2 = transvalues2[6];
+ Cy2 = transvalues2[7];
/*--------------------
* The transition values combine using a generalization of the
@@ -3523,6 +3577,8 @@ float8_regr_combine(PG_FUNCTION_ARGS)
Sy = Sy2;
Syy = Syy2;
Sxy = Sxy2;
+ Cx = Cx2;
+ Cy = Cy2;
}
else if (N2 == 0.0)
{
@@ -3532,6 +3588,8 @@ float8_regr_combine(PG_FUNCTION_ARGS)
Sy = Sy1;
Syy = Syy1;
Sxy = Sxy1;
+ Cx = Cx1;
+ Cy = Cy1;
}
else
{
@@ -3549,6 +3607,14 @@ float8_regr_combine(PG_FUNCTION_ARGS)
Sxy = Sxy1 + Sxy2 + N1 * N2 * tmp1 * tmp2 / N;
if (unlikely(isinf(Sxy)) && !isinf(Sxy1) && !isinf(Sxy2))
float_overflow_error();
+ if (float8_eq(Cx1, Cx2))
+ Cx = Cx1;
+ else
+ Cx = get_float8_nan();
+ if (float8_eq(Cy1, Cy2))
+ Cy = Cy1;
+ else
+ Cy = get_float8_nan();
}
/*
@@ -3564,12 +3630,14 @@ float8_regr_combine(PG_FUNCTION_ARGS)
transvalues1[3] = Sy;
transvalues1[4] = Syy;
transvalues1[5] = Sxy;
+ transvalues1[6] = Cx;
+ transvalues1[7] = Cy;
PG_RETURN_ARRAYTYPE_P(transarray1);
}
else
{
- Datum transdatums[6];
+ Datum transdatums[8];
ArrayType *result;
transdatums[0] = Float8GetDatumFast(N);
@@ -3578,8 +3646,10 @@ float8_regr_combine(PG_FUNCTION_ARGS)
transdatums[3] = Float8GetDatumFast(Sy);
transdatums[4] = Float8GetDatumFast(Syy);
transdatums[5] = Float8GetDatumFast(Sxy);
+ transdatums[6] = Float8GetDatumFast(Cx);
+ transdatums[7] = Float8GetDatumFast(Cy);
- result = construct_array_builtin(transdatums, 6, FLOAT8OID);
+ result = construct_array_builtin(transdatums, 8, FLOAT8OID);
PG_RETURN_ARRAYTYPE_P(result);
}
@@ -3594,7 +3664,7 @@ float8_regr_sxx(PG_FUNCTION_ARGS)
float8 N,
Sxx;
- transvalues = check_float8_array(transarray, "float8_regr_sxx", 6);
+ transvalues = check_float8_array(transarray, "float8_regr_sxx", 8);
N = transvalues[0];
Sxx = transvalues[2];
@@ -3615,7 +3685,7 @@ float8_regr_syy(PG_FUNCTION_ARGS)
float8 N,
Syy;
- transvalues = check_float8_array(transarray, "float8_regr_syy", 6);
+ transvalues = check_float8_array(transarray, "float8_regr_syy", 8);
N = transvalues[0];
Syy = transvalues[4];
@@ -3636,7 +3706,7 @@ float8_regr_sxy(PG_FUNCTION_ARGS)
float8 N,
Sxy;
- transvalues = check_float8_array(transarray, "float8_regr_sxy", 6);
+ transvalues = check_float8_array(transarray, "float8_regr_sxy", 8);
N = transvalues[0];
Sxy = transvalues[5];
@@ -3655,16 +3725,22 @@ float8_regr_avgx(PG_FUNCTION_ARGS)
ArrayType *transarray = PG_GETARG_ARRAYTYPE_P(0);
float8 *transvalues;
float8 N,
- Sx;
+ Sx,
+ commonX;
- transvalues = check_float8_array(transarray, "float8_regr_avgx", 6);
+ transvalues = check_float8_array(transarray, "float8_regr_avgx", 8);
N = transvalues[0];
Sx = transvalues[1];
+ commonX = transvalues[6];
/* if N is 0 we should return NULL */
if (N < 1.0)
PG_RETURN_NULL();
+ /* if all inputs were the same just return that, avoiding roundoff error */
+ if (!isnan(commonX))
+ PG_RETURN_FLOAT8(commonX);
+
PG_RETURN_FLOAT8(Sx / N);
}
@@ -3674,16 +3750,22 @@ float8_regr_avgy(PG_FUNCTION_ARGS)
ArrayType *transarray = PG_GETARG_ARRAYTYPE_P(0);
float8 *transvalues;
float8 N,
- Sy;
+ Sy,
+ commonY;
- transvalues = check_float8_array(transarray, "float8_regr_avgy", 6);
+ transvalues = check_float8_array(transarray, "float8_regr_avgy", 8);
N = transvalues[0];
Sy = transvalues[3];
+ commonY = transvalues[7];
/* if N is 0 we should return NULL */
if (N < 1.0)
PG_RETURN_NULL();
+ /* if all inputs were the same just return that, avoiding roundoff error */
+ if (!isnan(commonY))
+ PG_RETURN_FLOAT8(commonY);
+
PG_RETURN_FLOAT8(Sy / N);
}
@@ -3695,7 +3777,7 @@ float8_covar_pop(PG_FUNCTION_ARGS)
float8 N,
Sxy;
- transvalues = check_float8_array(transarray, "float8_covar_pop", 6);
+ transvalues = check_float8_array(transarray, "float8_covar_pop", 8);
N = transvalues[0];
Sxy = transvalues[5];
@@ -3714,7 +3796,7 @@ float8_covar_samp(PG_FUNCTION_ARGS)
float8 N,
Sxy;
- transvalues = check_float8_array(transarray, "float8_covar_samp", 6);
+ transvalues = check_float8_array(transarray, "float8_covar_samp", 8);
N = transvalues[0];
Sxy = transvalues[5];
@@ -3733,9 +3815,12 @@ float8_corr(PG_FUNCTION_ARGS)
float8 N,
Sxx,
Syy,
- Sxy;
+ Sxy,
+ product,
+ sqrtproduct,
+ result;
- transvalues = check_float8_array(transarray, "float8_corr", 6);
+ transvalues = check_float8_array(transarray, "float8_corr", 8);
N = transvalues[0];
Sxx = transvalues[2];
Syy = transvalues[4];
@@ -3751,7 +3836,29 @@ float8_corr(PG_FUNCTION_ARGS)
if (Sxx == 0 || Syy == 0)
PG_RETURN_NULL();
- PG_RETURN_FLOAT8(Sxy / sqrt(Sxx * Syy));
+ /*
+ * The product Sxx * Syy might underflow or overflow. If so, we can
+ * recover by computing sqrt(Sxx) * sqrt(Syy) instead of sqrt(Sxx * Syy).
+ * However, the double sqrt() calculation is a bit slower and less
+ * accurate, so don't do it if we don't have to.
+ */
+ product = Sxx * Syy;
+ if (product == 0 || isinf(product))
+ sqrtproduct = sqrt(Sxx) * sqrt(Syy);
+ else
+ sqrtproduct = sqrt(product);
+ result = Sxy / sqrtproduct;
+
+ /*
+ * Despite all these precautions, this formula can yield results outside
+ * [-1, 1] due to roundoff error. Clamp it to the expected range.
+ */
+ if (result < -1)
+ result = -1;
+ else if (result > 1)
+ result = 1;
+
+ PG_RETURN_FLOAT8(result);
}
Datum
@@ -3764,7 +3871,7 @@ float8_regr_r2(PG_FUNCTION_ARGS)
Syy,
Sxy;
- transvalues = check_float8_array(transarray, "float8_regr_r2", 6);
+ transvalues = check_float8_array(transarray, "float8_regr_r2", 8);
N = transvalues[0];
Sxx = transvalues[2];
Syy = transvalues[4];
@@ -3796,7 +3903,7 @@ float8_regr_slope(PG_FUNCTION_ARGS)
Sxx,
Sxy;
- transvalues = check_float8_array(transarray, "float8_regr_slope", 6);
+ transvalues = check_float8_array(transarray, "float8_regr_slope", 8);
N = transvalues[0];
Sxx = transvalues[2];
Sxy = transvalues[5];
@@ -3825,7 +3932,7 @@ float8_regr_intercept(PG_FUNCTION_ARGS)
Sy,
Sxy;
- transvalues = check_float8_array(transarray, "float8_regr_intercept", 6);
+ transvalues = check_float8_array(transarray, "float8_regr_intercept", 8);
N = transvalues[0];
Sx = transvalues[1];
Sxx = transvalues[2];
diff --git a/src/include/catalog/pg_aggregate.dat b/src/include/catalog/pg_aggregate.dat
index 870769e8f14..f22ccfbf49f 100644
--- a/src/include/catalog/pg_aggregate.dat
+++ b/src/include/catalog/pg_aggregate.dat
@@ -475,37 +475,37 @@
aggcombinefn => 'int8pl', aggtranstype => 'int8', agginitval => '0' },
{ aggfnoid => 'regr_sxx', aggtransfn => 'float8_regr_accum',
aggfinalfn => 'float8_regr_sxx', aggcombinefn => 'float8_regr_combine',
- aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0}' },
+ aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0,0,0}' },
{ aggfnoid => 'regr_syy', aggtransfn => 'float8_regr_accum',
aggfinalfn => 'float8_regr_syy', aggcombinefn => 'float8_regr_combine',
- aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0}' },
+ aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0,0,0}' },
{ aggfnoid => 'regr_sxy', aggtransfn => 'float8_regr_accum',
aggfinalfn => 'float8_regr_sxy', aggcombinefn => 'float8_regr_combine',
- aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0}' },
+ aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0,0,0}' },
{ aggfnoid => 'regr_avgx', aggtransfn => 'float8_regr_accum',
aggfinalfn => 'float8_regr_avgx', aggcombinefn => 'float8_regr_combine',
- aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0}' },
+ aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0,0,0}' },
{ aggfnoid => 'regr_avgy', aggtransfn => 'float8_regr_accum',
aggfinalfn => 'float8_regr_avgy', aggcombinefn => 'float8_regr_combine',
- aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0}' },
+ aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0,0,0}' },
{ aggfnoid => 'regr_r2', aggtransfn => 'float8_regr_accum',
aggfinalfn => 'float8_regr_r2', aggcombinefn => 'float8_regr_combine',
- aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0}' },
+ aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0,0,0}' },
{ aggfnoid => 'regr_slope', aggtransfn => 'float8_regr_accum',
aggfinalfn => 'float8_regr_slope', aggcombinefn => 'float8_regr_combine',
- aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0}' },
+ aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0,0,0}' },
{ aggfnoid => 'regr_intercept', aggtransfn => 'float8_regr_accum',
aggfinalfn => 'float8_regr_intercept', aggcombinefn => 'float8_regr_combine',
- aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0}' },
+ aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0,0,0}' },
{ aggfnoid => 'covar_pop', aggtransfn => 'float8_regr_accum',
aggfinalfn => 'float8_covar_pop', aggcombinefn => 'float8_regr_combine',
- aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0}' },
+ aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0,0,0}' },
{ aggfnoid => 'covar_samp', aggtransfn => 'float8_regr_accum',
aggfinalfn => 'float8_covar_samp', aggcombinefn => 'float8_regr_combine',
- aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0}' },
+ aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0,0,0}' },
{ aggfnoid => 'corr', aggtransfn => 'float8_regr_accum',
aggfinalfn => 'float8_corr', aggcombinefn => 'float8_regr_combine',
- aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0}' },
+ aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0,0,0}' },
# boolean-and and boolean-or
{ aggfnoid => 'bool_and', aggtransfn => 'booland_statefunc',
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index be0e1573183..cae8e7bca31 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -515,6 +515,62 @@ SELECT covar_pop(1::float8,'nan'::float8), covar_samp(3::float8,'nan'::float8);
NaN |
(1 row)
+-- check some cases that formerly had poor roundoff-error behavior
+SELECT corr(0.09, g), regr_r2(0.09, g)
+ FROM generate_series(1, 30) g;
+ corr | regr_r2
+------+---------
+ | 1
+(1 row)
+
+SELECT corr(g, 0.09), regr_r2(g, 0.09), regr_slope(g, 0.09), regr_intercept(g, 0.09)
+ FROM generate_series(1, 30) g;
+ corr | regr_r2 | regr_slope | regr_intercept
+------+---------+------------+----------------
+ | | |
+(1 row)
+
+SELECT corr(1.3 + g * 1e-16, 1.3 + g * 1e-16)
+ FROM generate_series(1, 3) g;
+ corr
+------
+
+(1 row)
+
+SELECT corr(1e-100 + g * 1e-105, 1e-100 + g * 1e-105)
+ FROM generate_series(1, 3) g;
+ corr
+------
+ 1
+(1 row)
+
+SELECT corr(1e-100 + g * 1e-105, 1e-100 + g * 1e-105)
+ FROM generate_series(1, 30) g;
+ corr
+------
+ 1
+(1 row)
+
+-- these examples pose definitional questions for NaN inputs,
+-- which we resolve by saying that an all-NaN input column is not all equal
+SELECT corr(g, 'NaN') FROM generate_series(1, 30) g;
+ corr
+------
+ NaN
+(1 row)
+
+SELECT corr(0.1, 'NaN') FROM generate_series(1, 30) g;
+ corr
+------
+
+(1 row)
+
+SELECT corr('NaN', 'NaN') FROM generate_series(1, 30) g;
+ corr
+------
+ NaN
+(1 row)
+
-- test accum and combine functions directly
CREATE TABLE regr_test (x float8, y float8);
INSERT INTO regr_test VALUES (10,150),(20,250),(30,350),(80,540),(100,200);
@@ -538,10 +594,10 @@ SELECT float8_accum('{4,140,2900}'::float8[], 100);
{5,240,6280}
(1 row)
-SELECT float8_regr_accum('{4,140,2900,1290,83075,15050}'::float8[], 200, 100);
- float8_regr_accum
-------------------------------
- {5,240,6280,1490,95080,8680}
+SELECT float8_regr_accum('{4,140,2900,1290,83075,15050,100,0}'::float8[], 200, 100);
+ float8_regr_accum
+---------------------------------------
+ {5,240,2900,1490,95080,15050,100,NaN}
(1 row)
SELECT count(*), sum(x), regr_sxx(y,x), sum(y),regr_syy(y,x), regr_sxy(y,x)
@@ -576,25 +632,25 @@ SELECT float8_combine('{3,60,200}'::float8[], '{2,180,200}'::float8[]);
{5,240,6280}
(1 row)
-SELECT float8_regr_combine('{3,60,200,750,20000,2000}'::float8[],
- '{0,0,0,0,0,0}'::float8[]);
- float8_regr_combine
----------------------------
- {3,60,200,750,20000,2000}
+SELECT float8_regr_combine('{3,60,200,750,20000,2000,1,NaN}'::float8[],
+ '{0,0,0,0,0,0,0,0}'::float8[]);
+ float8_regr_combine
+---------------------------------
+ {3,60,200,750,20000,2000,1,NaN}
(1 row)
-SELECT float8_regr_combine('{0,0,0,0,0,0}'::float8[],
- '{2,180,200,740,57800,-3400}'::float8[]);
- float8_regr_combine
------------------------------
- {2,180,200,740,57800,-3400}
+SELECT float8_regr_combine('{0,0,0,0,0,0,0,0}'::float8[],
+ '{2,180,200,740,57800,-3400,NaN,1}'::float8[]);
+ float8_regr_combine
+-----------------------------------
+ {2,180,200,740,57800,-3400,NaN,1}
(1 row)
-SELECT float8_regr_combine('{3,60,200,750,20000,2000}'::float8[],
- '{2,180,200,740,57800,-3400}'::float8[]);
- float8_regr_combine
-------------------------------
- {5,240,6280,1490,95080,8680}
+SELECT float8_regr_combine('{3,60,200,750,20000,2000,7,8}'::float8[],
+ '{2,180,200,740,57800,-3400,7,9}'::float8[]);
+ float8_regr_combine
+------------------------------------
+ {5,240,6280,1490,95080,8680,7,NaN}
(1 row)
DROP TABLE regr_test;
diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql
index 77ca6ffa3a9..850f5a5787f 100644
--- a/src/test/regress/sql/aggregates.sql
+++ b/src/test/regress/sql/aggregates.sql
@@ -140,6 +140,24 @@ SELECT covar_pop(1::float8,2::float8), covar_samp(3::float8,4::float8);
SELECT covar_pop(1::float8,'inf'::float8), covar_samp(3::float8,'inf'::float8);
SELECT covar_pop(1::float8,'nan'::float8), covar_samp(3::float8,'nan'::float8);
+-- check some cases that formerly had poor roundoff-error behavior
+SELECT corr(0.09, g), regr_r2(0.09, g)
+ FROM generate_series(1, 30) g;
+SELECT corr(g, 0.09), regr_r2(g, 0.09), regr_slope(g, 0.09), regr_intercept(g, 0.09)
+ FROM generate_series(1, 30) g;
+SELECT corr(1.3 + g * 1e-16, 1.3 + g * 1e-16)
+ FROM generate_series(1, 3) g;
+SELECT corr(1e-100 + g * 1e-105, 1e-100 + g * 1e-105)
+ FROM generate_series(1, 3) g;
+SELECT corr(1e-100 + g * 1e-105, 1e-100 + g * 1e-105)
+ FROM generate_series(1, 30) g;
+
+-- these examples pose definitional questions for NaN inputs,
+-- which we resolve by saying that an all-NaN input column is not all equal
+SELECT corr(g, 'NaN') FROM generate_series(1, 30) g;
+SELECT corr(0.1, 'NaN') FROM generate_series(1, 30) g;
+SELECT corr('NaN', 'NaN') FROM generate_series(1, 30) g;
+
-- test accum and combine functions directly
CREATE TABLE regr_test (x float8, y float8);
INSERT INTO regr_test VALUES (10,150),(20,250),(30,350),(80,540),(100,200);
@@ -148,7 +166,7 @@ FROM regr_test WHERE x IN (10,20,30,80);
SELECT count(*), sum(x), regr_sxx(y,x), sum(y),regr_syy(y,x), regr_sxy(y,x)
FROM regr_test;
SELECT float8_accum('{4,140,2900}'::float8[], 100);
-SELECT float8_regr_accum('{4,140,2900,1290,83075,15050}'::float8[], 200, 100);
+SELECT float8_regr_accum('{4,140,2900,1290,83075,15050,100,0}'::float8[], 200, 100);
SELECT count(*), sum(x), regr_sxx(y,x), sum(y),regr_syy(y,x), regr_sxy(y,x)
FROM regr_test WHERE x IN (10,20,30);
SELECT count(*), sum(x), regr_sxx(y,x), sum(y),regr_syy(y,x), regr_sxy(y,x)
@@ -156,12 +174,12 @@ FROM regr_test WHERE x IN (80,100);
SELECT float8_combine('{3,60,200}'::float8[], '{0,0,0}'::float8[]);
SELECT float8_combine('{0,0,0}'::float8[], '{2,180,200}'::float8[]);
SELECT float8_combine('{3,60,200}'::float8[], '{2,180,200}'::float8[]);
-SELECT float8_regr_combine('{3,60,200,750,20000,2000}'::float8[],
- '{0,0,0,0,0,0}'::float8[]);
-SELECT float8_regr_combine('{0,0,0,0,0,0}'::float8[],
- '{2,180,200,740,57800,-3400}'::float8[]);
-SELECT float8_regr_combine('{3,60,200,750,20000,2000}'::float8[],
- '{2,180,200,740,57800,-3400}'::float8[]);
+SELECT float8_regr_combine('{3,60,200,750,20000,2000,1,NaN}'::float8[],
+ '{0,0,0,0,0,0,0,0}'::float8[]);
+SELECT float8_regr_combine('{0,0,0,0,0,0,0,0}'::float8[],
+ '{2,180,200,740,57800,-3400,NaN,1}'::float8[]);
+SELECT float8_regr_combine('{3,60,200,750,20000,2000,7,8}'::float8[],
+ '{2,180,200,740,57800,-3400,7,9}'::float8[]);
DROP TABLE regr_test;
-- test count, distinct
--
2.43.7
On Sat, 6 Dec 2025 at 18:18, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Here's a v3 with another try at that comment, and the other points > addressed. Also now with a draft commit message. I credited you > as co-author since so much of this is your ideas. Thanks. LGTM. Regards, Dean
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> On Sat, 6 Dec 2025 at 18:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Here's a v3 with another try at that comment, and the other points
>> addressed. Also now with a draft commit message. I credited you
>> as co-author since so much of this is your ideas.
> Thanks. LGTM.
Pushed, then. Thanks for putting time into it.
regards, tom lane