Re: PG vs ElasticSearch for Logs
From: Thomas Güttler
Subject: Re: PG vs ElasticSearch for Logs
Date:
Msg-id: 6ff8364d-4854-ab98-087a-45c618d142be@thomas-guettler.de
In response to: Re: PG vs ElasticSearch for Logs (Chris Mair <chris@1006.org>)
List: pgsql-general
Thank you, Chris, for looking at my issue in such detail.

Yes, the parallel feature rocks.

Regards,
Thomas Güttler

On 19.08.2016 at 22:40, Chris Mair wrote:
> On 19/08/16 10:57, Thomas Güttler wrote:
>
>>>> What do you think?
>>>
>>> I store most of my logs in flat text files, syslog style, and use grep for ad-hoc querying.
>>>
>>> 200K rows/day, that's 1.4 million/week, 6 million/month; pretty soon you're talking big tables.
>>>
>>> In fact, that's several rows/second on a 24/7 basis.
>>
>> There is no need to store them more than 6 weeks in my current use case.
>
> Hi,
>
> to me this kind of data looks like something Postgres can handle with ease.
>
> We're talking about 8.4M rows here.
>
> Coincidentally, I was trying out the new parallel query feature in the
> 9.6 beta just now and decided to use your numbers as a test case :)
>
> I can create 8.4M records having a timestamp and a random ~ 250 character string
> in ~ 31 seconds:
>
> pg96=# select now() + (sec / 200000.0 * 86400.0 || ' seconds')::interval as ts,
> pg96-#        repeat(random()::text, 15) as msg
> pg96-# into t1
> pg96-# from generate_series(1, 6 * 7 * 200000) as sec;
> SELECT 8400000
> Time: 30858.274 ms
>
> Table size is 2.4 GB.
>
> This gives about 6 weeks. A query to scan the whole thing on the narrow column
> takes ~ 400 msec, like this:
>
> pg96=# select min(ts), max(ts) from t1;
>               min              |              max
> -------------------------------+-------------------------------
>  2016-08-19 20:17:24.921333+00 | 2016-09-30 20:17:24.489333+00
> (1 row)
>
> Time: 409.468 ms
>
> Even running an unanchored regular expression (!) on the wider column is doable:
>
> pg96=# select count(*) from t1 where msg ~ '12345';
>  count
> -------
>    955
> (1 row)
>
> Time: 3146.838 ms
>
> If you have some filter, not everything needs to be regexped, and this gets pretty fast:
>
> pg96=# select count(*) from t1 where ts between '2016-08-25' and '2016-08-26' and msg ~ '12345';
>  count
> -------
>     24
> (1 row)
>
> Time: 391.577 ms
>
> All this is without indices. Your data is more structured than my test, so undoubtedly you will
> get some gain from indices...
>
> Here is something more analytical - basically the same as the count(*) above:
>
> pg96=# select ts::date, count(*) from t1 where msg ~ '12345' group by ts::date order by ts::date;
>      ts     | count
> ------------+-------
>  2016-08-19 |    26
>  2016-08-20 |    28
> [...]
>  2016-09-28 |    21
>  2016-09-29 |    33
> (42 rows)
>
> Time: 3157.010 ms
>
> Note, however, that I'm using the 9.6 beta with the parallel query feature: the sequential scans
> with the regexp are run in parallel on 6 workers... this gives me a speed-up of a factor of 4-5
> (the machine has 8 logical CPUs) and the whole table fits in cache. For a use case like this, the
> parallel query feature in 9.6 is so good it's almost like cheating ;)
>
> Bye,
> Chris.

--
Thomas Guettler
http://www.thomas-guettler.de/
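A few follow-up sketches on points raised in the thread. Chris's unanchored regex scan runs
without any index; since PostgreSQL 9.3, a trigram GIN index from the pg_trgm extension can also
serve regex matches. A minimal sketch against his test table t1 (the index name is illustrative):

    -- pg_trgm ships in contrib; its GIN opclass supports LIKE and regex matching
    CREATE EXTENSION IF NOT EXISTS pg_trgm;
    CREATE INDEX t1_msg_trgm_idx ON t1 USING gin (msg gin_trgm_ops);
    -- the planner can now answer unanchored patterns from the index:
    SELECT count(*) FROM t1 WHERE msg ~ '12345';

On real log messages (unlike the random test strings), trigram selectivity is usually good enough
to make this a large win over a sequential scan.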
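For the timestamp filter, a BRIN index (available since 9.5) is a natural fit for append-only log
tables: it is tiny and lets range predicates skip most of the table. A sketch, again assuming the
test table above:

    -- BRIN stores min/max summaries per block range; effective when ts
    -- correlates with physical insertion order, as it does for logs
    CREATE INDEX t1_ts_brin_idx ON t1 USING brin (ts);
    SELECT count(*)
    FROM t1
    WHERE ts BETWEEN '2016-08-25' AND '2016-08-26'
      AND msg ~ '12345';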
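The worker count Chris observed is governed by the parallel-query settings; in the released 9.6
the relevant GUC is max_parallel_workers_per_gather (early betas called it max_parallel_degree).
A sketch for inspecting the plan:

    SET max_parallel_workers_per_gather = 6;  -- allow up to 6 workers per Gather node
    EXPLAIN (ANALYZE)
    SELECT count(*) FROM t1 WHERE msg ~ '12345';
    -- the plan should show a Gather node above a Parallel Seq Scan on t1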
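Finally, since only 6 weeks of history are needed, retention can be a simple scheduled job; a
minimal sketch (assuming a log table t1 with timestamp column ts):

    -- drop everything older than 6 weeks; run e.g. nightly from cron
    DELETE FROM t1 WHERE ts < now() - interval '6 weeks';

The freed space is reused after vacuum. For bulk expiry at this volume, weekly partitions that
are simply dropped are cheaper than large DELETEs.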