Обсуждение: bug on aggregate function AVG()

Поиск
Список
Период
Сортировка

bug on aggregate function AVG()

От
"Jose' Soares"
Дата:
prova=> select * from a;
int_2|     int_4|              int_8
-----+----------+-------------------
32767|2147483647|9223372036620802086
32767|2147483647|9223372036620802086
(2 rows)

prova=> select avg(int_2), avg(int_4), avg(int_8) from a;
avg|avg|       avg
---+---+-----------1| -1|-233973722
(1 row)


Jose'


Re: bug on aggregate function AVG()

От
"Jose' Soares"
Дата:
Jose' Soares wrote:
> 
> prova=> select * from a;
> int_2|     int_4|              int_8
> -----+----------+-------------------
> 32767|2147483647|9223372036620802086
> 32767|2147483647|9223372036620802086
> (2 rows)
> 
> prova=> select avg(int_2), avg(int_4), avg(int_8) from a;
> avg|avg|       avg
> ---+---+----------
>  -1| -1|-233973722
> (1 row)
> 
> Jose'

The same on SUM():

prova=> select sum(int_2), sum(int_4), sum(int_8) from a;
sum|sum|       sum
---+---+-----------2| -2|-467947444
(1 row)

but it works as follow:

prova=> select avg(int_2*1.0), avg(int_4*1.0), avg(int_8*1.0) from a; avg|       avg|avg
-----+----------+-------------------
32767|2147483647|9.2233720366208e+18
(1 row)

prova=> select sum(int_2*1.0), sum(int_4*1.0), sum(int_8*1.0) from a; sum|       sum|sum
-----+----------+--------------------
65534|4294967294|1.84467440732416e+19
(1 row)

Jose'


Re: [HACKERS] Re: bug on aggregate function AVG()

От
"Thomas G. Lockhart"
Дата:
> > prova=> select avg(int_2), avg(int_4), avg(int_8) from a;
> The same on SUM():

Sure. For some reason, on most platforms integers are allowed to
overflow in Postgres. Of course, both SUM() and AVG() take a running
sum, and once they overflow you are hosed...
                   - Tom


Re[2]: [HACKERS] Re: bug on aggregate function AVG()

От
Sferacarta Software
Дата:
Hi Tom,

mercoledì, 4 novembre 98, you wrote:

>> > prova=> select avg(int_2), avg(int_4), avg(int_8) from a;
>> The same on SUM():

TGL> Sure. For some reason, on most platforms integers are allowed to
TGL> overflow in Postgres. Of course, both SUM() and AVG() take a running
TGL> sum, and once they overflow you are hosed...                                        ^^^^^
I don't know what does the word "hosed" mean Tom, I hope you don't want to tell me
there's no solution for this problem.
I see that AVG() and SUM() uses an accumulator not enough big to hold
the result of calculation, but the point is: should we consider this
thing a "terrible" bug or an acceptable feature ?
What about to convert every accumulator to float8 ?

select intero4 from a;  intero4
----------
2147483647
2147483647
2147483647
(3 rows)

select sum(inter04),sum(intero4*1.0) from a;      sum|       sum
----------+----------
2147483645|6442450941
(1 row)

select avg(intero4),avg(intero4*1.0) from a;     avg|       avg
---------+----------
715827881|2147483647
(1 row)

Anyway I think we need to work a little bit on aggregates:

MIN() and MAX() doesn't accept a string as parameter.
SUM() and AVG() gives a wrong result because it goes on overflow.
and none of them allows the clause DISTINCT.

What do you think about ? ;)
      Jose'




Re: [HACKERS] Re: bug on aggregate function AVG()

От
"Thomas G. Lockhart"
Дата:
> I don't know what does the word "hosed" mean Tom, I hope you don't 
> want to tell me there's no solution for this problem.

As you guessed, "hosed" isn't good ;)

> I see that AVG() and SUM() uses an accumulator not enough big to hold
> the result of calculation, but the point is: should we consider this
> thing a "terrible" bug or an acceptable feature ?
> What about to convert every accumulator to float8 ?

imho we can't do that because we lose the exact qualities of integers.
If you accumulate in float8, and if you take a sum over a very large
table, you might start ignoring values. That is, if you have accumulated
15 or 16 digits worth of number, and then try adding 1 as the next
number, the result will be the same as the input. With integers that is
never the case, but we have to deal with overflows better.

I would think we should start signalling overflows rather than silently
overflowing, but I'm not sure what that entails.

> Anyway I think we need to work a little bit on aggregates:
> MIN() and MAX() doesn't accept a string as parameter.

Yes, at the moment only numeric quantities are supported.

> SUM() and AVG() gives a wrong result because it goes on overflow.
> and none of them allows the clause DISTINCT.

Yes, SELECT SUM(DISTINCT i) FROM t; is not yet supported. That's a
project for v6.5.

btw, I'm also planning on working on your "NULL problem" you mentioned
earlier...
                   - Tom


Re: [HACKERS] Re: bug on aggregate function AVG()

От
Tom Lane
Дата:
"Thomas G. Lockhart" <lockhart@alumni.caltech.edu> writes:
>> I see that AVG() and SUM() uses an accumulator not enough big to hold
>> the result of calculation, but the point is: should we consider this
>> thing a "terrible" bug or an acceptable feature ?
>> What about to convert every accumulator to float8 ?

> imho we can't do that because we lose the exact qualities of integers.
> If you accumulate in float8, and if you take a sum over a very large
> table, you might start ignoring values.

I think that SUM() on an int column ought to produce an exact result.
AVG() is a different story --- I think you could make a good case that
it ought to produce a float result even when the input is integers,
since the exact right answer would typically not be integral anyway.
(A programmer who wants the average rounded to integer should have to
write something like ROUND(AVG(x)), I think.)

One way you could postpone the overflow problem for SUM() is to
accumulate the running sum in a "long", or even better "long long" where
available, even if the input datatype is a smaller flavor of int.
You might still find that the end result overflows, but if the incoming
values are not all the same sign then this might avoid an unnecessary
intermediate overflow.
        regards, tom lane


Re[2]: [HACKERS] Re: bug on aggregate function AVG()

От
Sferacarta Software
Дата:
TL> "Thomas G. Lockhart" <lockhart@alumni.caltech.edu> writes:
>>> I see that AVG() and SUM() uses an accumulator not enough big to hold
>>> the result of calculation, but the point is: should we consider this
>>> thing a "terrible" bug or an acceptable feature ?
>>> What about to convert every accumulator to float8 ?

>> imho we can't do that because we lose the exact qualities of integers.
>> If you accumulate in float8, and if you take a sum over a very large
>> table, you might start ignoring values.

TL> I think that SUM() on an int column ought to produce an exact result.
TL> AVG() is a different story --- I think you could make a good case that
TL> it ought to produce a float result even when the input is integers,
TL> since the exact right answer would typically not be integral anyway.
TL> (A programmer who wants the average rounded to integer should have to
TL> write something like ROUND(AVG(x)), I think.)

TL> One way you could postpone the overflow problem for SUM() is to
TL> accumulate the running sum in a "long", or even better "long long" where
TL> available, even if the input datatype is a smaller flavor of int.
TL> You might still find that the end result overflows, but if the incoming
TL> values are not all the same sign then this might avoid an unnecessary
TL> intermediate overflow.
TL>                         regards, tom lane

I see there are four new built in functions:
int42pl,int42div, int84pl and int84div.
I created four new aggregate functions SUM(int2), SUM(int4), AVG(int2) and
AVG(int4) using these new functions, in this way the accumulator for int2
is int4 and the accumulator for int4 is int8, this reduce the probabilities
of overflow cases.
Remains just the overflow problem for int8.

Is there any reason for not use these functions on SUM() and AVG() on
official release ?
------------------------------------------------
drop aggregate sum2 int2;
DROP
create aggregate sum2(       sfunc1    = int42pl,       basetype  = int2,       stype1    = int4,       initcond1 =
'0'
);
CREATE
drop aggregate avg2 int2;
DROP
create aggregate avg2(       sfunc1    = int42pl,       basetype  = int2,       stype1    = int4,       initcond1 =
'0',      sfunc2    = int2inc,       stype2    = int2,       finalfunc = int42div,       initcond2 = '0'
 
);
CREATE
drop aggregate sum4 int4;
DROP
create aggregate sum4(       sfunc1    = int84pl,       basetype  = int4,       stype1    = int8,       initcond1 =
'0'
);
CREATE

drop aggregate avg4 int4;
DROP
create aggregate avg4(       sfunc1    = int84pl,       basetype  = int4,       stype1    = int8,       initcond1 =
'0',      sfunc2    = int4inc,       stype2    = int4,       finalfunc = int84div,       initcond2 = '0'
 
);
CREATE
drop table b;
DROP
create table b(       i2 int2,       i4 int4,       i8 int8
);
CREATE
insert into b values (32767,2147483647,9223372036620802086);
NOTICE:  Integer input '9223372036620802086' is out of range; promoted to float
^^^^^^^^^^^^^^^^^^^^                      what's happen here? seems a test for an int4                       not an
int8.
INSERT 246255 1
insert into b values (32767,2147483647,2147483647);
INSERT 246256 1
select * from b;  i2|        i4|                 i8
-----+----------+-------------------
32767|2147483647|9223372036620802048
32767|2147483647|         2147483647
(2 rows)

select sum2(i2) as new_sum2,      sum(i2) as old_sum2,      sum4(i4) as new_sum4,      sum(i4) as old_sum4,
sum(i8)as only_sum8
 
from b;
new_sum2|old_sum2|  new_sum4|old_sum4|           only_sum8
--------+--------+----------+--------+--------------------  65534|      -2|4294967294|      -2|-9223372034941265921
(1 row)

select avg2(i2) as new_avg2,      avg(i2) as old_avg2,      avg4(i4) as new_avg4,      avg(i4) as old_avg4,
avg(i8)as only_sum8
 
from b;
new_avg2|old_avg2|  new_avg4|old_avg4|           only_sum8
--------+--------+----------+--------+--------------------  32767|      -1|2147483647|      -1|-4611686017470632960
(1 row)

Is there any reason for not use these functions on SUM() and AVG() on
official release ?
- Jose' -




Re: [HACKERS] Re: bug on aggregate function AVG()

От
"Thomas G. Lockhart"
Дата:
> Is there any reason for not use these functions on SUM() and AVG() on
> official release ?

It sounds like a good idea. The only hesitation I have at the moment is
that not all platforms have int8 support, and I'm not certain which
these are. Also, accumulating int4 into int8 is probably pretty slow
since on 32-bit machines the "long long" is usually done in a s/w
library, not in machine code.

float8 might be a better choice for accumulating AVG(), but I'm worried
about incorrect results with large tables (> 1M entries) which have
pathological distributions of numbers (e.g. 1M entries with MAXINT and
1M entries with zero). int4 gives ~9.2 decimal places, float8 gives ~15
decimal places, so there is only about ~6 decimal places of headroom.

Of course, why am I worried? That is much better than what we have
currently. And someone reported that at least one commercial system
(Sybase?) returns float8 for avg() (and sum()?) as I recall.

So, your suggestion is that for AVG() at least we return something other
than the input type; how about returning float8 for any input type?
Don't know if SUM() could/should behave similarly...
                      - Tom


Re: [HACKERS] Re: bug on aggregate function AVG()

От
"Thomas G. Lockhart"
Дата:
> MIN() and MAX() doesn't accept a string as parameter.

I've added capabilities to do min() and max() on strings. To help with
this I've put the new implicit type coersion techniques into the
aggregate function handling. So, for example, the same routine which
handles min(text) will also handle min(char()) and min(varchar()).

These changes are already in the main cvs tree, but are not in the
v6.4.x branch since they require a small number of catalog changes to
implement.

Will post the patches on the patches list in case anyone needs them
before v6.5 comes out...
                      - Tom