Re: Strategy for doing number-crunching
От | ktm@rice.edu |
---|---|
Тема | Re: Strategy for doing number-crunching |
Дата | |
Msg-id | 20120104165847.GD28556@staff-mud-56-27.rice.edu обсуждение исходный текст |
Ответ на | Re: Strategy for doing number-crunching (Sean Davis <sdavis2@mail.nih.gov>) |
Ответы |
Re: Strategy for doing number-crunching
|
Список | pgsql-novice |
On Wed, Jan 04, 2012 at 11:55:38AM -0500, Sean Davis wrote: > On Wed, Jan 4, 2012 at 11:36 AM, Matthew Foster <matthew.foster@noaa.gov> wrote: > > We have a database with approximately 130M rows, and we need to produce > > statistics (e.g. mean, standard deviation, etc.) on the data. Right now, > > we're generating these stats via a single SELECT, and it is extremely > > slow...like it can take hours to return results. The queries are variable, > > so there is no way I can think of to "pre-calculate" the stats. > > > > Here is an example query... > > > > SELECT count(mae) AS count, to_char(avg(mae), 'FM990.999') AS mae, > > to_char(avg(sqrt(rms)), 'FM990.999') AS rms, to_char(avg(bias), 'FM990.999') > > AS bias, > > to_char(max(mae), 'FM990.999') AS mae_max, to_char(min(mae), 'FM990.999') AS > > mae_min, > > to_char(max(bias), 'FM990.999') AS bias_max, to_char(min(bias), 'FM990.999') > > AS bias_min, > > to_char(max(sqrt(rms)), 'FM990.999') AS rms_max, to_char(min(sqrt(rms)), > > 'FM990.999') AS rms_min, > > to_char(stddev(sqrt(rms)), 'FM990.999') AS rms_stddev, > > to_char(stddev(mae), 'FM990.999') AS mae_stddev, > > to_char(stddev(bias), 'FM990.999') AS bias_stddev > > FROM verify_data.verification_data > > WHERE model_name='foo' > > AND...several other conditions... > > > > As one might imagine, this query pounds the CPU relentlessly for hours. > > > > I'm thinking there must be a better approach for doing large amounts of > > calculations on the data. Any pointers would be greatly appreciated! > > I'm assuming that someone will point out a quicker way with postgres, > so I'll mention another solution. > > It may be quicker to pull the data out of the database and calculate > in-memory using something like R (take a look at the ff package if > memory is an issue). In R, a numeric vector of length 130M uses about > 1Gb (a little less). Calculating both a mean and stddev on this > vector takes a total of 750ms on my Mac laptop. > > Sean > Maybe pl/R would be an option as well. Ken
В списке pgsql-novice по дате отправления: