Re: Strategy for doing number-crunching
От | Sean Davis |
---|---|
Тема | Re: Strategy for doing number-crunching |
Дата | |
Msg-id | CANeAVBmy0dQARTi6F8Fqu_DMtWii-s7rCQrOyEY5Qy9NFur1pA@mail.gmail.com обсуждение исходный текст |
Ответ на | Strategy for doing number-crunching (Matthew Foster <matthew.foster@noaa.gov>) |
Ответы |
Re: Strategy for doing number-crunching
|
Список | pgsql-novice |
On Wed, Jan 4, 2012 at 11:36 AM, Matthew Foster <matthew.foster@noaa.gov> wrote: > We have a database with approximately 130M rows, and we need to produce > statistics (e.g. mean, standard deviation, etc.) on the data. Right now, > we're generating these stats via a single SELECT, and it is extremely > slow...like it can take hours to return results. The queries are variable, > so there is no way I can think of to "pre-calculate" the stats. > > Here is an example query... > > SELECT count(mae) AS count, to_char(avg(mae), 'FM990.999') AS mae, > to_char(avg(sqrt(rms)), 'FM990.999') AS rms, to_char(avg(bias), 'FM990.999') > AS bias, > to_char(max(mae), 'FM990.999') AS mae_max, to_char(min(mae), 'FM990.999') AS > mae_min, > to_char(max(bias), 'FM990.999') AS bias_max, to_char(min(bias), 'FM990.999') > AS bias_min, > to_char(max(sqrt(rms)), 'FM990.999') AS rms_max, to_char(min(sqrt(rms)), > 'FM990.999') AS rms_min, > to_char(stddev(sqrt(rms)), 'FM990.999') AS rms_stddev, > to_char(stddev(mae), 'FM990.999') AS mae_stddev, > to_char(stddev(bias), 'FM990.999') AS bias_stddev > FROM verify_data.verification_data > WHERE model_name='foo' > AND...several other conditions... > > As one might imagine, this query pounds the CPU relentlessly for hours. > > I'm thinking there must be a better approach for doing large amounts of > calculations on the data. Any pointers would be greatly appreciated! I'm assuming that someone will point out a quicker way with postgres, so I'll mention another solution. It may be quicker to pull the data out of the database and calculate in-memory using something like R (take a look at the ff package if memory is an issue). In R, a numeric vector of length 130M uses about 1Gb (a little less). Calculating both a mean and stddev on this vector takes a total of 750ms on my Mac laptop. Sean
В списке pgsql-novice по дате отправления: