Re: Median/Quantile Aggregate
От | Tom Lane |
---|---|
Тема | Re: Median/Quantile Aggregate |
Дата | |
Msg-id | 29177.1116347496@sss.pgh.pa.us обсуждение исходный текст |
Ответ на | Median/Quantile Aggregate (David Orme <d.orme@imperial.ac.uk>) |
Список | pgsql-novice |
David Orme <d.orme@imperial.ac.uk> writes: > I know people have asked about this before but I can't find a working > solution on the web - I've found code for specific instances of > calculating medians but I need a general aggregate function for > calculating medians, or more generally any given quantile. > The kind of thing I need to do is to be able to extract the median > value from a table of 4 million rows, aggregating across more than > 50,000 grouping values - the sort of thing that is really easy to do > for averaging: > SELECT grid_id, avg(rs) FROM behr_grid GROUP BY grid_id; You could build a custom aggregate for this. array_append() will do fine as the transition function, so all you really need to write is a final function that sorts the given array and then picks out the middle (or appropriate-quantile) element. In fact, you could cheat a bit and let the system do the sorting for you: SELECT grid_id, myagg(rs) FROM (SELECT grid_id, rs FROM behr_grid ORDER BY grid_id, rs) ss GROUP BY grid_id; If the aggregate is only used in a context like this, it will always see presorted input and so it can just pull out the middle element. (Note: I think this trick only works in PG 7.4 and later.) So, lightly tested: regression=# create function get_middle(anyarray) returns anyelement as regression-# 'declare n integer; regression'# begin regression'# n := (array_lower($1, 1) + array_upper($1, 1)) / 2; regression'# return $1[n]; regression'# end' language plpgsql; CREATE FUNCTION regression=# create aggregate sortedmedian( regression(# sfunc = array_append, regression(# finalfunc = get_middle, regression(# basetype = anyelement, regression(# stype = anyarray, regression(# initcond = '{}' regression(# ); CREATE AGGREGATE regression=# select hundred, min(thousand), max(thousand), sortedmedian(thousand) from regression-# (select hundred, thousand from tenk1 order by 1,2) ss regression-# group by hundred; hundred | min | max | sortedmedian ---------+-----+-----+-------------- 0 | 0 | 900 | 400 1 | 1 | 901 | 401 2 | 2 | 902 | 402 3 | 3 | 903 | 403 4 | 4 | 904 | 404 5 | 5 | 905 | 405 6 | 6 | 906 | 406 7 | 7 | 907 | 407 8 | 8 | 908 | 408 9 | 9 | 909 | 409 10 | 10 | 910 | 410 ... regards, tom lane
В списке pgsql-novice по дате отправления: