Re: Slow query with a lot of data
From: Moritz Onken
Subject: Re: Slow query with a lot of data
Date:
Msg-id: 3663375B-7AC0-43F6-B6EC-D3EE572F3D34@houseofdesign.de
In reply to: Re: Slow query with a lot of data (Matthew Wakeling <matthew@flymine.org>)
Responses: Re: Slow query with a lot of data
List: pgsql-performance
On 18.08.2008, at 18:05, Matthew Wakeling wrote:

> On Mon, 18 Aug 2008, Moritz Onken wrote:
>> Running the query for more than one user is indeed not much slower.
>> That's what I need. I'm clustering the results table on domain
>> right now. But why is this better than clustering it on "user"?
>
> The reason is the way that the merge join algorithm works. What it
> does is take two tables and sort them both by the join fields. Then
> it can stream through both tables, producing results as it goes.
> It's the best join algorithm, but it does require both tables to be
> sorted by the same thing, which is domain in this case. The
> aggregation on user happens after the join has been done, and the
> hash aggregate can accept the users in random order.
>
> If you look at your last EXPLAIN, you can see that it has to sort
> the result table on domain, although it can read domain_categories
> in domain order due to the clustered index.

explain select a."user", b.category, sum(1.0/b.cat_count)::float
from result a, domain_categories b
where a."domain" = b."domain"
group by a."user", b.category;

GroupAggregate  (cost=21400443313.69..22050401897.13 rows=35049240 width=12)
  ->  Sort  (cost=21400443313.69..21562757713.35 rows=64925759864 width=12)
        Sort Key: a."user", b.category
        ->  Merge Join  (cost=4000210.40..863834009.08 rows=64925759864 width=12)
              Merge Cond: (b.domain = a.domain)
              ->  Index Scan using domain_categories_domain on domain_categories b  (cost=0.00..391453.79 rows=12105014 width=12)
              ->  Materialize  (cost=3999931.73..4253766.93 rows=20306816 width=8)
                    ->  Sort  (cost=3999931.73..4050698.77 rows=20306816 width=8)
                          Sort Key: a.domain
                          ->  Seq Scan on result a  (cost=0.00..424609.16 rows=20306816 width=8)

Both result and domain_categories are clustered on domain and
analyzed. Running this query for 280 users (with and "user" IN ([280
ids]) appended to the WHERE clause) took 50 minutes; 78000 rows were
returned and stored in a table. Is that reasonable?

Why is it still sorting on domain? I thought the clustering was
supposed to prevent the planner from doing this?

moritz
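One way to check whether the CLUSTER actually left result in domain
order is the correlation statistic in pg_stats: the planner only
learns about the clustered order through that statistic, and it goes
stale without a fresh ANALYZE. A minimal sketch, using the table and
column names from this thread:

-- How well the physical row order of "domain" matches its sorted
-- order; a correlation near 1.0 means the CLUSTER is still effective.
SELECT tablename, attname, correlation
FROM pg_stats
WHERE tablename IN ('result', 'domain_categories')
  AND attname = 'domain';

-- Refresh the statistics so the planner sees the current correlation.
ANALYZE result;
ANALYZE domain_categories;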
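If the correlation looks fine, it may also help to see what plan is
chosen when the sequential scan is discouraged; if an index scan on
result's domain index appears and the inner sort disappears, the
seq-scan-plus-sort was simply costed as cheaper. A sketch under that
assumption:

-- Temporarily discourage sequential scans for this session only.
SET enable_seqscan = off;

EXPLAIN
SELECT a."user", b.category, sum(1.0/b.cat_count)::float
FROM result a, domain_categories b
WHERE a."domain" = b."domain"
GROUP BY a."user", b.category;

-- Restore the default planner behaviour.
RESET enable_seqscan;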