Re: Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics
From: Quan Zongliang
Subject: Re: Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics
Date:
Msg-id: 83d056c0-c3eb-6ef7-bf3a-7e461fcde74d@yeah.net
In reply to: Re: Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Replies: Re: Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics
         Re: Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics
List: pgsql-hackers
On 2023/6/16 23:39, Tomas Vondra wrote:
>
> On 6/16/23 11:25, Quan Zongliang wrote:
>>
>> We have a small table with only 23 rows and 21 distinct values.
>>
>> The resulting MCV and histogram are as follows:
>> stanumbers1 | {0.08695652,0.08695652}
>> stavalues1  | {v1,v2}
>> stavalues2  | {v3,v4,v5,v6,v7,v8,v9,v10,v11,v12,v13,v14,v15,v16,v17,v18,v19,v20,v21}
>>
>> An incorrect number of rows was estimated when the HashJoin was done
>> with another large table (about 2 million rows):
>>
>> Hash Join  (cost=1.52..92414.61 rows=2035023 width=0) (actual
>> time=1.943..1528.983 rows=3902 loops=1)
>>
>
> That's interesting. I wonder how come the estimate gets this bad simply
> by skipping values entries with a single row in the sample, which means
> we know the per-value selectivity pretty well.
>
> I guess the explanation has to be something strange happening when
> estimating the join condition selectivity, where we combine MCVs from
> both sides of the join (which has to be happening here, otherwise it
> would not matter what gets to the MCV).
>
> It'd be interesting to know what's in the other MCV, and what are the
> other statistics for the attributes (ndistinct etc.).
>
> Or even better, a reproducer SQL script that builds two tables and then
> joins them.
>

The other table is severely skewed: most of its rows have no match in
the small table. This special case is what makes the cost estimate
inaccurate. A rough reproducer along those lines is sketched at the end
of this mail.

>> The reason is that the MCV of the small table excludes values that
>> occur only once. Put them into the MCV in the statistics and we get
>> the correct result.
>>
>> Using the conservative samplerows <= attstattarget doesn't completely
>> solve this problem, but it does solve this case.
>>
>> After the modification we get statistics without a histogram:
>> stanumbers1 | {0.08695652,0.08695652,0.04347826,0.04347826, ... }
>> stavalues1  | {v1,v2, ... }
>>
>> And we get the right estimate:
>> Hash Join  (cost=1.52..72100.69 rows=3631 width=0) (actual
>> time=1.447..1268.385 rows=3902 loops=1)
>>
>
> I'm not against building a "complete" MCV, but I guess the case where
> (samplerows <= num_mcv) is pretty rare. Why shouldn't we make the MCV
> complete whenever we decide (ndistinct <= num_mcv)?
>
> That would need to happen later, because we don't have the ndistinct
> estimate yet at this point - we'd have to do the loop a bit later (or
> likely twice).
>
> FWIW the patch breaks the calculation of nmultiple (and thus likely the
> ndistinct estimate).
>

It's not just small tables. A column whose values are nearly unique hits
the same problem, because we exclude values that occur only once. The
samplerows <= num_mcv check only handles one scenario. Perhaps we should
discard the (dups_cnt > 1) restriction altogether?

>
> regards
>
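
For reference, here is a minimal sketch of such a reproducer. The table
names, column names, and data distribution below are made up; they only
try to mimic the shape described above (a 23-row table with 21 distinct
values joined to a ~2 million row table that mostly does not match), not
the original data.

-- hypothetical small table: 23 rows, 21 distinct values,
-- two values (1 and 2) appear twice
CREATE TABLE small_t (v int);
INSERT INTO small_t
  SELECT i FROM generate_series(1, 21) AS i
  UNION ALL
  SELECT i FROM generate_series(1, 2) AS i;

-- hypothetical large, skewed table: ~2 million rows that never match
-- the small table, plus a few thousand rows that do
CREATE TABLE big_t (v int);
INSERT INTO big_t
  SELECT 1000000 + (i % 50) FROM generate_series(1, 2000000) AS i;
INSERT INTO big_t
  SELECT (i % 21) + 1 FROM generate_series(1, 3900) AS i;

ANALYZE small_t;
ANALYZE big_t;

EXPLAIN ANALYZE
SELECT * FROM small_t s JOIN big_t b ON s.v = b.v;

On a setup like this the singleton values of small_t land in the
histogram rather than the MCV, so the plan may show a similar
overestimate; the exact numbers depend on what ends up in big_t's MCV,
so they will differ from the report above.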