Discussion: PG performance in high volume environment (many INSERTs and lots of aggregation reporting)
PG performance in high volume environment (many INSERTs and lots of aggregation reporting)
From:
Phoenix Kiula
Date:
[Posted a similar note to PG General but I suppose it's more appropriate on this list. Apologies for cross-posting.]

Hi. Further to my bafflement with the "count(*)" queries as described in this thread:

http://archives.postgresql.org/pgsql-general/2009-01/msg00804.php

It seems that whenever this question has come up, Postgresql comes up very short in terms of "count(*)" functions. The performance is always slow, because of the planner's need to guess and such. I don't fully understand how the statistics work (and the explanation on the PG website is way too geeky), but the columns I work with already have a stat level of 100. Not helping at all.

We are now considering web-based logging functionality for users of our website. This means the table could be heavily INSERTed into. We get about 10 million hits a day, and I'm guessing that we will have to keep this data around for a while.

My question: with that kind of volume and the underlying aggregation functions (by product id, dates, possibly IP addresses or at least countries of origin...) will PG ever be a good choice? Or should I be looking at some other kind of tools? I wonder if OLAP tools would be overkill for something that needs to look like a barebones version of Google Analytics limited to our site...

Appreciate any thoughts. If possible I would prefer to tone down any requests for MySQL and such! Thanks!
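For concreteness, the kind of schema and reporting query being described might look roughly like the following; the table and column names here are hypothetical, not taken from the original post:

    -- Hypothetical log table for the workload described above.
    CREATE TABLE access_log (
        id          bigserial   PRIMARY KEY,
        product_id  integer     NOT NULL,
        ip_address  inet        NOT NULL,
        country     char(2),
        hit_time    timestamptz NOT NULL DEFAULT now()
    );

    -- A typical reporting aggregate: hits per product per day.
    -- Over hundreds of millions of rows this must visit every
    -- matching row, which is the performance concern in question.
    SELECT product_id,
           date_trunc('day', hit_time) AS day,
           count(*) AS hits
      FROM access_log
     GROUP BY product_id, date_trunc('day', hit_time)
     ORDER BY day, product_id;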
> My question: with that kind of volume and the underlying aggregation
> functions (by product id, dates, possibly IP addresses or at least
> countries of origin..) will PG ever be a good choice? Or should I be
> looking at some other kind of tools? I wonder if OLAP tools would be
> overkill for something that needs to look like a barebones version of
> google analytics limited to our site..

Some other databases might have an optimization that makes this much faster than it would ordinarily be:

select count(*) from table;

But I don't think anyone has an optimization that makes this fast:

select column, count(*) from table group by 1;

How do you expect the database to get this information other than by reading the whole table and counting up the number of occurrences of each value? I guess an OLAP cube might precompute all the answers for you, but I don't think MySQL is going to do that.

One option is to write a script that runs in the background and updates all your statistics every 10 minutes or so, dumping the results into separate (and smaller) tables that you can query quickly. Another option (which is probably what I would do for really high-volume logging of web traffic) is to write your log records to a flat file, postprocess them with perl or something, and load the summary statistics into your database later. PostgreSQL is really fast, but nothing is as fast as writing to a flat file.

...Robert
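A minimal sketch of the first option, assuming the hypothetical access_log table from above; daily_product_hits and the refresh logic are illustrative names, not anything from the original posts:

    -- Small pre-aggregated table that reports can query instantly.
    CREATE TABLE daily_product_hits (
        product_id integer NOT NULL,
        day        date    NOT NULL,
        hits       bigint  NOT NULL,
        PRIMARY KEY (product_id, day)
    );

    -- Refresh, run from cron every 10 minutes or so:
    -- rebuild only today's rows, leaving history untouched.
    BEGIN;
    DELETE FROM daily_product_hits WHERE day = current_date;
    INSERT INTO daily_product_hits (product_id, day, hits)
    SELECT product_id, current_date, count(*)
      FROM access_log
     WHERE hit_time >= current_date
     GROUP BY product_id;
    COMMIT;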
Re: PG performance in high volume environment (many INSERTs and lots of aggregation reporting)
From:
Richard Huxton
Date:
Phoenix Kiula wrote:
> [Posted a similar note to PG General but I suppose it's more appropriate
> on this list. Apologies for cross-posting.]
>
> Hi. Further to my bafflement with the "count(*)" queries as described
> in this thread:
>
> http://archives.postgresql.org/pgsql-general/2009-01/msg00804.php
>
> It seems that whenever this question has come up, Postgresql comes up
> very short in terms of "count(*)" functions.

Sorry - I'm confused. That thread doesn't seem to contain a slow count(*) query. You seem to be saying you're having problems with the query taking 10-15 seconds, but the example takes less than half a second. How have you identified count() as being the problem here?

> The performance is always slow, because of the planner's need to guess
> and such. I don't fully understand how the statistics work (and the
> explanation on the PG website is way too geeky) but the columns I work
> with already have a stat level of 100. Not helping at all.

But your own email says it's slow only sometimes: "My queries are fast in general *except* the first time". I'm not sure how the planner comes into this.

> We are now considering web-based logging functionality for users of
> our website. This means the table could be heavily INSERTed into. We
> get about 10 million hits a day, and I'm guessing that we will have to
> keep this data around for a while.
>
> My question: with that kind of volume and the underlying aggregation
> functions (by product id, dates, possibly IP addresses or at least
> countries of origin..) will PG ever be a good choice?

A good choice compared to what?

> Or should I be
> looking at some other kind of tools? I wonder if OLAP tools would be
> overkill for something that needs to look like a barebones version of
> google analytics limited to our site..

Typically you'd summarise the data by hour/day via triggers or a scheduled script if you weren't going towards a pre-packaged OLAP toolkit. Otherwise you're going to have to scan the hundreds of millions of rows you've accumulated.

> Appreciate any thoughts. If possible I would prefer to tone down any
> requests for MySQL and such!

I'm not sure MySQL is going to help you here - if you were running lots of small, simple queries it might make sense. If you want to aggregate data by varying criteria I don't think there is any sensible optimisation (other than pre-calculating summaries).

--
Richard Huxton
Archonet Ltd
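As a sketch of the trigger route, assuming the hypothetical access_log and daily_product_hits tables used earlier in this thread; note the upsert here is naive and can race under concurrent inserts, which is one reason a scheduled batch job is often preferred at this volume:

    -- Keep the per-day counters current as each row arrives.
    CREATE OR REPLACE FUNCTION bump_daily_hits() RETURNS trigger AS $$
    BEGIN
        UPDATE daily_product_hits
           SET hits = hits + 1
         WHERE product_id = NEW.product_id
           AND day = NEW.hit_time::date;
        IF NOT FOUND THEN
            -- Two concurrent first-hits can collide here (PK violation).
            INSERT INTO daily_product_hits (product_id, day, hits)
            VALUES (NEW.product_id, NEW.hit_time::date, 1);
        END IF;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER access_log_summarise
        AFTER INSERT ON access_log
        FOR EACH ROW EXECUTE PROCEDURE bump_daily_hits();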
Re: PG performance in high volume environment (many INSERTs and lots of aggregation reporting)
From:
Merlin Moncure
Date:
On 1/28/09, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:
> [Posted a similar note to PG General but I suppose it's more appropriate
> on this list. Apologies for cross-posting.]
>
> Hi. Further to my bafflement with the "count(*)" queries as described
> in this thread:
>
> http://archives.postgresql.org/pgsql-general/2009-01/msg00804.php
>
> It seems that whenever this question has come up, Postgresql comes up
> very short in terms of "count(*)" functions.
>
> The performance is always slow, because of the planner's need to guess
> and such. I don't fully understand how the statistics work (and the
> explanation on the PG website is way too geeky) but the columns I work
> with already have a stat level of 100. Not helping at all.

Your issue is not statistics or the planner. Postgres just can't apply the special-case optimization that some other databases do, because of its locking model. All planners 'guess'; the main goal of statistics is to make the guess better educated.

> We are now considering web-based logging functionality for users of
> our website. This means the table could be heavily INSERTed into. We
> get about 10 million hits a day, and I'm guessing that we will have to
> keep this data around for a while.

10m hits/day = 115 hits/sec. This is no problem for even a workstation box, assuming your disks can handle the syncs. However, with extremely insert-heavy loads it helps a lot to look at partitioning/rotation to ease the pain of big deletes.

merlin
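A rough sketch of the partitioning/rotation idea, using the inheritance-style partitioning available at the time; the table names assume the hypothetical access_log from earlier, and a real setup would generate one child table and one trigger branch per period:

    -- One child table per month; the parent stays empty.
    CREATE TABLE access_log_2009_01 (
        CHECK (hit_time >= '2009-01-01' AND hit_time < '2009-02-01')
    ) INHERITS (access_log);

    -- Route inserts to the current child. Simplified: production
    -- setups add a branch (or regenerate the function) each month.
    CREATE OR REPLACE FUNCTION access_log_route() RETURNS trigger AS $$
    BEGIN
        INSERT INTO access_log_2009_01 VALUES (NEW.*);
        RETURN NULL;  -- keep the row out of the parent table
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER access_log_partition
        BEFORE INSERT ON access_log
        FOR EACH ROW EXECUTE PROCEDURE access_log_route();

    -- "Rotation": retiring a month is a cheap DROP instead of a
    -- huge DELETE, and it leaves no dead tuples behind.
    -- DROP TABLE access_log_2009_01;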
phoenix.kiula@gmail.com (Phoenix Kiula) writes:
> [Posted a similar note to PG General but I suppose it's more appropriate
> on this list. Apologies for cross-posting.]
>
> Hi. Further to my bafflement with the "count(*)" queries as described
> in this thread:
>
> http://archives.postgresql.org/pgsql-general/2009-01/msg00804.php
>
> It seems that whenever this question has come up, Postgresql comes up
> very short in terms of "count(*)" functions.
>
> The performance is always slow, because of the planner's need to guess
> and such. I don't fully understand how the statistics work (and the
> explanation on the PG website is way too geeky) but the columns I work
> with already have a stat level of 100. Not helping at all.

That's definitely *NOT* due to the "planner's need to guess"; it's due to some *specific* work that PostgreSQL needs to do that some other databases can avoid due to different storage strategies. The matter is quite succinctly described here:

http://wiki.postgresql.org/wiki/Why_PostgreSQL_Instead_of_MySQL:_Comparing_Reliability_and_Speed_in_2007#Counting_rows_in_a_table

I'll just take one excerpt:

---------------------------
It is worth observing that it is only this precise form of aggregate that must be so pessimistic; if augmented with a "WHERE" clause like

SELECT COUNT(*) FROM table WHERE status = 'something'

PostgreSQL, MySQL, and most other database implementations will take advantage of available indexes against the restricted field(s) to limit how many records must be counted, which can greatly accelerate such queries.
---------------------------

It is common, on systems where aggregation reporting needs to be fast, to pre-compute the aggregates, and that is in no way specific to PostgreSQL. If you need *really* fast aggregates, then it will be worthwhile to put together triggers or procedures or something of the sort to help pre-compute the aggregates.

--
(reverse (concatenate 'string "ofni.sesabatadxunil" "@" "enworbbc"))
http://linuxfinances.info/info/wp.html
"When you have eliminated the impossible, whatever remains, however improbable, must be the truth."
-- Sir Arthur Conan Doyle (1859-1930), English author. Sherlock Holmes, in The Sign of Four, ch. 6 (1889).
[...but see the Holmesian Fallacy, due to Bob Frankston...
<http://www.frankston.com/public/Essays/Holmesian%20Fallacy.asp>]
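The classic trigger-maintained exact row count alluded to above might be sketched as follows, again assuming the hypothetical access_log table from earlier in the thread; note the caveat in the comments:

    -- Single-row counter so an exact "how many rows?" becomes a
    -- trivial lookup. Caveat: every insert/delete updates this one
    -- row, which serializes concurrent writers; at very high insert
    -- rates a periodic batch update is gentler.
    CREATE TABLE access_log_rowcount (n bigint NOT NULL);
    INSERT INTO access_log_rowcount SELECT count(*) FROM access_log;

    CREATE OR REPLACE FUNCTION track_rowcount() RETURNS trigger AS $$
    BEGIN
        IF TG_OP = 'INSERT' THEN
            UPDATE access_log_rowcount SET n = n + 1;
        ELSE
            UPDATE access_log_rowcount SET n = n - 1;
        END IF;
        RETURN NULL;  -- AFTER trigger: return value is ignored
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER access_log_count
        AFTER INSERT OR DELETE ON access_log
        FOR EACH ROW EXECUTE PROCEDURE track_rowcount();

    -- The reporting query is now simply:
    -- SELECT n FROM access_log_rowcount;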
On Thu, Jan 29, 2009 at 1:56 PM, Chris Browne <cbbrowne@acm.org> wrote:
>
> It is common for systems where it is necessary for aggregation
> reporting to be fast to do pre-computation of the aggregates, and that
> is in no way specific to PostgreSQL.
>
> If you need *really* fast aggregates, then it will be worthwhile to
> put together triggers or procedures or something of the sort to help
> pre-compute the aggregates.

Just to add to this: at my last employer in Chicago, we had a database from a very large database company (whose CEO makes more than all the people on this mailing list combined) that shall not be named, for reasons like I don't want to be sued. This database held a large statistical dataset that we replicated over to pgsql on a by-the-minute basis so we could run big ugly queries any time we felt like it without blowing out the production database.

At night, or by hand, I would run queries such as

select count(*) from reallyreallyreallybigstatstable

on it and compare the timings to PostgreSQL's. PostgreSQL would take about 4 or 5 minutes to run this on a local server with a 4-disc software RAID-10 set in a single-core P4 Dell workstation, while the really big server in production took about 15 to 20 seconds. Our local test server, which ran the same really big database that cannot be named and had a 16-disk RAID-6 array, gigs of memory, and 4 CPU cores, took about 45 seconds to a minute to run the same select count(*) query. All of the machines showed high CPU and moderate I/O usage while running said query.

So, there's probably some room for improvement in pgsql's way of doing things, but it's not like the other database software was providing instantaneous answers. Basically, the second a database server becomes fast at running lots of update / select queries in a mixed environment, things like fast select count(*) get slower.

To the OP: try running 100 transactional clients against mysql (updates/inserts/deletes/selects) while running a select count(*) and see how it behaves. Single-thread use cases are kind of uninteresting compared to lots of users. But if single-thread use cases are your bread and butter, then pgsql is possibly a poor choice of db.
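For anyone wanting to try that experiment on the PostgreSQL side, a rough sketch using pgbench with a custom script; the access_log table is the hypothetical one from earlier in the thread, and the flags and statement mix are illustrative, not a tuned benchmark:

    -- mixed.sql: a crude transactional mix for pgbench.
    -- Invoke with something like:
    --   pgbench -n -c 100 -t 1000 -f mixed.sql mydb
    -- and, in a separate session, time: SELECT count(*) FROM access_log;
    BEGIN;
    INSERT INTO access_log (product_id, ip_address, country)
        VALUES (1 + (random() * 999)::int, '10.0.0.1', 'US');
    SELECT count(*) FROM access_log WHERE product_id = 42;
    UPDATE access_log SET country = 'CA'
        WHERE id = (SELECT max(id) FROM access_log);
    COMMIT;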