Re: Including unique users in huge data warehouse in Postgresql...
From | Richard Huxton
---|---
Subject | Re: Including unique users in huge data warehouse in Postgresql...
Date | |
Msg-id | 456DDF9F.30207@archonet.com
In reply to | Including unique users in huge data warehouse in Postgresql... (Mark Jensen <musicnyman1974@yahoo.com>)
List | pgsql-general
Mark Jensen wrote:
> So i've been given the task of designing a data warehouse in
> either Postgresql or Mysql for our clickstream data for our sites. I
> started with Mysql but the joins in Mysql are just way too slow
> compared to Postgresql when playing with star schemas.

Mark - it's not my usual area, but no-one else has picked up your
posting, so I'll poke my nose in. The other thing you might want to do
is post this on the performance list - that's probably the best place.
Might be worth talking to those at www.bizgres.org too (although I
think they all hang out on the performance list).

> I can't say which sites i'm working on, but we get close to 3-5
> million unique users per day, so over time, that's a lot of unique
> users to keep around and de-dup your fact tables by. Need to be able
> to query normal analytics like:
<snip>
> i've made a lot of optimizations in postgresql.conf by playing with
> work_mem and shared_buffers and such and i think the database is
> using as much as it can disk/memory/cpu wise.

Big work_mem, I'm guessing. The limiting factor is presumably disk I/O.

<snip>
> here's a sample query that takes a while to run... just a simple
> report that shows gender by area of the site.
>
> select A.gender as gender, B.area as area, sum(C.imps) as imps,
>        sum(C.clicks) as clicks, count(distinct(C.uu_id)) as users
> from uus as A, areas as B, daily_area_fact as C
> where A.uu_id = C.uu_id
> and B.area_id = C.area_id
> group by gender,area;
>
> so with just one day of data - 3,168,049 rows in the user dimension
> table (uus) and 17,213,420 in the daily_area_fact table that joins
> all the dimension tables - this takes about 15 minutes. if i had
> 30-90 days in this fact table, who knows how long this would take...
> i know doing a distinct on uu_id is very expensive, so that's the
> main problem here i guess and would want to know if anyone else is
> doing it this way or better.

In the end, I'd suspect the seq-scan over the fact table will be your
biggest problem. Can you pre-aggregate your fact table into daily
summaries? (See the sketch below.)

See you over on the performance list, where there are more experienced
people than myself to help you.

--
  Richard Huxton
  Archonet Ltd
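A minimal sketch of the pre-aggregation idea suggested above, not taken from the thread: it assumes the fact table carries a day column and uses a hypothetical daily_area_summary table, reusing the column names from Mark's query (uu_id, area_id, imps, clicks). Per-day distinct counts cannot simply be summed across days, so the summary keeps uu_id rather than a pre-counted users column; the win comes from collapsing many raw impression rows into one row per (day, uu_id, area_id).

-- Hypothetical rollup of the raw fact table to one row per
-- (day, uu_id, area_id). Assumes daily_area_fact has a "day" column.
create table daily_area_summary as
select day,
       uu_id,
       area_id,
       sum(imps)   as imps,
       sum(clicks) as clicks
from daily_area_fact
group by day, uu_id, area_id;

-- The gender-by-area report then scans the (smaller) summary instead
-- of the raw fact table, and count(distinct uu_id) still gives true
-- unique users across however many days the summary covers.
select A.gender as gender,
       B.area as area,
       sum(C.imps) as imps,
       sum(C.clicks) as clicks,
       count(distinct C.uu_id) as users
from uus as A
join daily_area_summary as C on A.uu_id = C.uu_id
join areas as B on B.area_id = C.area_id
group by gender, area;

How much this helps depends on how many fact rows each (day, uu_id, area_id) combination collapses into; if the fact table is already close to one row per user per area per day, the distinct count itself remains the expensive part.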