Re: BUG #2930: Hash join abyssmal with many null fields.
From | Maciej Babinski
---|---
Subject | Re: BUG #2930: Hash join abyssmal with many null fields.
Date |
Msg-id | 45BA3EC1.1050307@killer-robot.net
In reply to | Re: BUG #2930: Hash join abyssmal with many null fields. (Tom Lane <tgl@sss.pgh.pa.us>)
List | pgsql-bugs
Tom Lane wrote:
> Maciej Babinski <maciej@apathy.killer-robot.net> writes:
>> Tom Lane wrote:
>>> I see no bug here. AFAICT your "much faster" query gets that way by
>>> having eliminated all the candidate join rows on the B side.
>
>> The additional clause eliminates no rows beyond what the existing
>> clause would. Any row eliminated by "b.join_id IS NOT NULL" could not
>> possibly have satisfied "a.join_id = b.join_id".
>
> Hmm. You assume that the = operator is strict, which is probably true,
> but the hash join code isn't assuming that.

Yes, both queries return identical results.

> It might be worth checking for the case, though. What's happening,
> since we go ahead and put the null rows into the hash table, is that
> they all end up in the same hash chain because they all get hash code 0.
> And then that very long chain gets searched for each null outer row.
> If we knew the join operator is strict we could discard null rows
> immediately on both sides.
>
>> Please note that if the join columns are not null, but still produce
>> no matches for the join, the results are fast without the need for an
>> extra clause in the join:
>
> Yeah, because the rows get reasonably well distributed into different
> hash buckets. The optimizer will avoid a hash if it sees the data is
> not well-distributed, but IIRC it's not considering nulls when it
> makes that decision.

You're right. Filling one join column with '1', and the other with '2'
yields a plan without a hash join.
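The chain-degeneracy Tom describes can be sketched with a toy model (this is not PostgreSQL's actual hash join code; `bucket_sizes` is a hypothetical helper, and the only assumption carried over from the thread is that every NULL key receives hash code 0):

```python
def bucket_sizes(keys, n_buckets=1024):
    """Distribute join keys into hash buckets; None (SQL NULL) hashes to 0."""
    buckets = [0] * n_buckets
    for k in keys:
        code = 0 if k is None else hash(k) % n_buckets
        buckets[code] += 1
    return buckets

# 10,000 inner rows whose join column is entirely NULL: every row lands in
# bucket 0, so each NULL outer row must walk a 10,000-entry chain.
null_keys = [None] * 10_000
assert max(bucket_sizes(null_keys)) == 10_000

# The same row count with distinct non-NULL keys spreads across buckets,
# which is why the not-null variant of the test case stays fast.
distinct_keys = list(range(10_000))
assert max(bucket_sizes(distinct_keys)) < 100
```

With all-NULL keys the probe phase degrades to roughly O(n^2) comparisons, matching the slowdown reported in the bug.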
Something that may be worth noting is that while my initial problem case
(200,000 rows in more complicated tables), as well as the sample test case
I put together (10,000 rows in simple tables), both yield this result,
changing my test case to use 20,000 rows produces a plan that avoids hash
joins as well:

maciej_test=# EXPLAIN ANALYZE SELECT * FROM a JOIN b ON a.join_id = b.join_id;
                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=3435.54..3635.55 rows=1 width=16) (actual time=100.822..100.822 rows=0 loops=1)
   Merge Cond: ("outer".join_id = "inner".join_id)
   ->  Sort  (cost=1717.77..1767.77 rows=20000 width=8) (actual time=65.401..86.333 rows=20000 loops=1)
         Sort Key: a.join_id
         ->  Seq Scan on a  (cost=0.00..289.00 rows=20000 width=8) (actual time=0.005..21.119 rows=20000 loops=1)
   ->  Sort  (cost=1717.77..1767.77 rows=20000 width=8) (never executed)
         Sort Key: b.join_id
         ->  Seq Scan on b  (cost=0.00..289.00 rows=20000 width=8) (never executed)
 Total runtime: 101.891 ms
(9 rows)

Time: 103.051 ms

Thanks for your time,

Maciej Babinski
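For readers wanting to reproduce the behavior, a minimal script along the lines of the thread's test case might look like this. The table definitions are a guess: the message only reveals the names `a`, `b`, and `join_id` from the plan, so the exact columns and row counts here are assumptions.

```sql
-- Hypothetical reconstruction: two tables whose join column is all NULL,
-- per the original bug report's "many null fields".
CREATE TABLE a (id serial, join_id integer);
CREATE TABLE b (id serial, join_id integer);

INSERT INTO a (join_id) SELECT NULL FROM generate_series(1, 10000);
INSERT INTO b (join_id) SELECT NULL FROM generate_series(1, 10000);
ANALYZE a;
ANALYZE b;

-- Slow: every NULL row ends up in the same hash chain.
EXPLAIN ANALYZE SELECT * FROM a JOIN b ON a.join_id = b.join_id;

-- Fast: the extra strict clause discards the NULL rows up front,
-- without changing the (empty) result.
EXPLAIN ANALYZE SELECT * FROM a JOIN b
  ON a.join_id = b.join_id AND b.join_id IS NOT NULL;
```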