Re: MergeJoin beats HashJoin in the case of multiple hash clauses
От | Lepikhov Andrei |
---|---|
Тема | Re: MergeJoin beats HashJoin in the case of multiple hash clauses |
Дата | |
Msg-id | 84114cba-97e9-49f8-952f-04cf321125c1@app.fastmail.com обсуждение исходный текст |
Ответ на | Re: MergeJoin beats HashJoin in the case of multiple hash clauses (Andy Fan <zhihui.fan1213@gmail.com>) |
Ответы |
Re: MergeJoin beats HashJoin in the case of multiple hash clauses
|
Список | pgsql-hackers |
On Mon, Sep 11, 2023, at 11:51 AM, Andy Fan wrote: > Hi, > > On Thu, Jun 15, 2023 at 4:30 PM Andrey Lepikhov > <a.lepikhov@postgrespro.ru> wrote: >> Hi, all. >> >> Some of my clients use JOIN's with three - four clauses. Quite >> frequently, I see complaints on unreasonable switch of JOIN algorithm to >> Merge Join instead of Hash Join. Quick research have shown one weak >> place - estimation of an average bucket size in final_cost_hashjoin (see >> q2.sql in attachment) with very conservative strategy. >> Unlike estimation of groups, here we use smallest ndistinct value across >> all buckets instead of multiplying them (or trying to make multivariate >> analysis). >> It works fine for the case of one clause. But if we have many clauses, >> and if each has high value of ndistinct, we will overestimate average >> size of a bucket and, as a result, prefer to use Merge Join. As the >> example in attachment shows, it leads to worse plan than possible, >> sometimes drastically worse. >> I assume, this is done with fear of functional dependencies between hash >> clause components. But as for me, here we should go the same way, as >> estimation of groups. > > I can reproduce the visitation you want to improve and verify the patch > can do it expectedly. I think this is a right thing to do. > >> The attached patch shows a sketch of the solution. > > I understand that this is a sketch of the solution, but the below > changes still > make me confused. > > + if (innerbucketsize > virtualbuckets) > + innerbucketsize = 1.0 / virtualbuckets; > > innerbucketsize is a fraction of rows in all the rows, so it is between > 0.0 and 1.0. > and virtualbuckets is the number of buckets in total (when considered > the mutli > batchs), how is it possible for 'innerbucketsize > virtualbuckets' ? > Am > I missing something? You are right here. I've made a mistake here. Changed diff is in attachment. -- Regards, Andrei Lepikhov
Вложения
В списке pgsql-hackers по дате отправления: