Re: question about executing JOINs
От | Josh Burdick |
---|---|
Тема | Re: question about executing JOINs |
Дата | |
Msg-id | 3DAC28F1.9080306@gradient.cis.upenn.edu обсуждение исходный текст |
Ответ на | question about executing JOINs (Josh Burdick <jburdick@gradient.cis.upenn.edu>) |
Список | pgsql-general |
Tom Lane wrote: >Josh Burdick <jburdick@gradient.cis.upenn.edu> writes: > > >>Nested Loop (cost=0.00..198183643.06 rows=85230539 width=51) >> -> Seq Scan on ucsc_ref_seq_ali_hg12 (cost=0.00..817.65 rows=15165 >>width=29) >> -> Index Scan using ucsc_snp_tsc_hg12_chrom_start on >>ucsc_snp_tsc_hg12 (cost=0.00..12962.39 rows=4713 width=22) >> >> > >What's the actual numbers of rows involved? (EXPLAIN ANALYZE output >would be far more useful than plain EXPLAIN.) > > The two tables have 15485 and 1095497 rows, respectively. The join returns 11135953 records, which is humongous, but is still only 0.066% of the number of pairs of rows. (Which is why I stopped EXPLAIN ANALYZE after 45 minutes.) >Do you have a feeling for the number of rows that would be produced by >just the JOIN/ON condition (no constraint on snp_start)? How does that >compare to EXPLAIN's estimate of that number of rows? > > The JOIN/ON condition is on a column with 24 possible values. I'm not quite sure how close the planner would be on that case, but it would be a lot closer. > > >> The planner is assuming that it's a cross join: that we need all >>pairs of records. It's not taking into account the WHERE clause which >>restricts to a tiny fraction of the records. >> >> > >No it isn't, and yes it is, but it evidently is making a bad estimate of >the fraction of rows eliminated by those clauses. I'd like to find out >just what its estimate of that fraction is and what the correct value >would be. > > The planner's original estimates of how many rows are needed from each table (15165 and 1145280) are quite close; pretty much all rows are needed from each table. It estimates that the MERGE JOIN will return 85230539 rows, and 85230539 / (15165 * 1145280) = 0.49%. (As opposed to 0.066%.) For what its worth, if you just multiply the number of rows together for the original query, you get 17368171200, while for the version using the index scan, you get 71472645, which is a lot less, and within a factor of 10 of the actual number of rows returned. So perhaps that number could be factored into the cost estimate better somehow. > > >>Perhaps the planner should assume that a nested loop over an index >>scan only looks at 1% of its records? >> >> > >Arbitrary assumptions designed to fix one example tend to break other >examples ... > > Fair enough. I think the current hack is good enough -- if you really need to override the planner, you can. And the eventual fix should be to make the planner smarter. > regards, tom lane > > > > Hope this helps, Josh -- Josh Burdick jburdick@gradient.cis.upenn.edu http://www.cis.upenn.edu/~jburdick
В списке pgsql-general по дате отправления: