Re: Strategy for doing number-crunching
From: Sean Davis
Subject: Re: Strategy for doing number-crunching
Date:
Msg-id: CANeAVB=9dessVbx=p9TCtggZuj=LaGvDZqhKT_zV4zHS5tO7XQ@mail.gmail.com
In reply to: Re: Strategy for doing number-crunching ("ktm@rice.edu" <ktm@rice.edu>)
Replies: Re: Strategy for doing number-crunching
List: pgsql-novice
On Wed, Jan 4, 2012 at 11:58 AM, ktm@rice.edu <ktm@rice.edu> wrote:
> On Wed, Jan 04, 2012 at 11:55:38AM -0500, Sean Davis wrote:
>> On Wed, Jan 4, 2012 at 11:36 AM, Matthew Foster <matthew.foster@noaa.gov> wrote:
>> > We have a database with approximately 130M rows, and we need to produce
>> > statistics (e.g. mean, standard deviation, etc.) on the data. Right now,
>> > we're generating these stats via a single SELECT, and it is extremely
>> > slow...like it can take hours to return results. The queries are variable,
>> > so there is no way I can think of to "pre-calculate" the stats.
>> >
>> > Here is an example query...
>> >
>> > SELECT count(mae) AS count, to_char(avg(mae), 'FM990.999') AS mae,
>> > to_char(avg(sqrt(rms)), 'FM990.999') AS rms, to_char(avg(bias), 'FM990.999') AS bias,
>> > to_char(max(mae), 'FM990.999') AS mae_max, to_char(min(mae), 'FM990.999') AS mae_min,
>> > to_char(max(bias), 'FM990.999') AS bias_max, to_char(min(bias), 'FM990.999') AS bias_min,
>> > to_char(max(sqrt(rms)), 'FM990.999') AS rms_max, to_char(min(sqrt(rms)), 'FM990.999') AS rms_min,
>> > to_char(stddev(sqrt(rms)), 'FM990.999') AS rms_stddev,
>> > to_char(stddev(mae), 'FM990.999') AS mae_stddev,
>> > to_char(stddev(bias), 'FM990.999') AS bias_stddev
>> > FROM verify_data.verification_data
>> > WHERE model_name='foo'
>> > AND...several other conditions...
>> >
>> > As one might imagine, this query pounds the CPU relentlessly for hours.
>> >
>> > I'm thinking there must be a better approach for doing large amounts of
>> > calculations on the data. Any pointers would be greatly appreciated!
>>
>> I'm assuming that someone will point out a quicker way with postgres,
>> so I'll mention another solution.
>>
>> It may be quicker to pull the data out of the database and calculate
>> in-memory using something like R (take a look at the ff package if
>> memory is an issue). In R, a numeric vector of length 130M uses about
>> 1Gb (a little less). Calculating both a mean and stddev on this
>> vector takes a total of 750ms on my Mac laptop.
>>
>> Sean
>>
>
> Maybe pl/R would be an option as well.

Good point. I do not know what the performance is for these types of
aggregates using Pl/R. Perhaps Joe Conway or others could comment.

Sean
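A minimal sketch of the pull-into-R approach described above, using the
RPostgreSQL driver; the connection settings, column list, and model_name
filter are illustrative placeholders, not taken from the thread:

    # Fetch the raw columns once, then aggregate in memory in R.
    # dbname/host and the WHERE clause are placeholders for this sketch.
    library(RPostgreSQL)

    con <- dbConnect(PostgreSQL(), dbname = "verify", host = "localhost")
    dat <- dbGetQuery(con, "
        SELECT mae, sqrt(rms) AS rms, bias
        FROM verify_data.verification_data
        WHERE model_name = 'foo'")
    dbDisconnect(con)

    # One vectorized pass per statistic over each column.
    stats <- sapply(dat, function(x)
        c(count = length(x), mean = mean(x), sd = sd(x),
          min = min(x), max = max(x)))
    round(stats, 3)

The one-time transfer of the rows dominates the wall-clock time here; once
the vectors are in memory, each aggregate is roughly the 750ms-scale pass
Sean measured, and the ff package can stand in for plain vectors if RAM is
tight.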
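And a sketch of the PL/R route ktm mentions, which keeps the computation
inside the server. It assumes the plr language is installed; the function
name and the array_agg call are illustrative. PL/R exposes unnamed
arguments to R as arg1, arg2, and so on:

    -- Hypothetical helper; requires the plr language to be installed.
    -- The float8[] argument arrives in R as the numeric vector arg1.
    CREATE OR REPLACE FUNCTION r_mean_sd(float8[])
    RETURNS float8[] AS $$
      c(mean(arg1), sd(arg1))
    $$ LANGUAGE plr;

    -- Usage: collapse the filtered column into one array, then hand it to R.
    SELECT r_mean_sd(array_agg(mae))
    FROM verify_data.verification_data
    WHERE model_name = 'foo';

Note that array_agg over 130M rows materializes the whole array in server
memory, so this avoids the client transfer cost but not the memory cost.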