Re: Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics
From: Tomas Vondra
Msg-id: a06caba6-6821-9d6c-d799-8db1dd0228d2@enterprisedb.com
In reply to: Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics (Quan Zongliang <quanzongliang@yeah.net>)
Responses: Re: Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics
List: pgsql-hackers
On 6/16/23 11:25, Quan Zongliang wrote:
>
> We have a small table with only 23 rows and 21 distinct values.
>
> The resulting MCV and histogram are as follows:
>
> stanumbers1 | {0.08695652,0.08695652}
> stavalues1  | {v1,v2}
> stavalues2  | {v3,v4,v5,v6,v7,v8,v9,v10,v11,v12,v13,v14,v15,v16,v17,v18,v19,v20,v21}
>
> An incorrect number of rows was estimated when a HashJoin was done
> with another large table (about 2 million rows):
>
> Hash Join (cost=1.52..92414.61 rows=2035023 width=0) (actual
> time=1.943..1528.983 rows=3902 loops=1)
>

That's interesting. I wonder how come the estimate gets this bad simply
by skipping values entries with a single row in the sample, which means
we know the per-value selectivity pretty well.

I guess the explanation has to be something strange happening when
estimating the join condition selectivity, where we combine MCVs from
both sides of the join (which has to be happening here, otherwise it
would not matter what gets into the MCV).

It'd be interesting to know what's in the other MCV, and what the other
statistics for the attributes are (ndistinct etc.). Or even better, a
reproducer SQL script that builds the two tables and then joins them.

> The reason is that the MCV of the small table excludes values that
> occur only once. Putting them into the MCV in the statistics gives the
> correct result.
>
> Using the conservative condition (samplerows <= attstattarget) doesn't
> completely solve this problem, but it does handle this case.
>
> After the modification we get statistics without a histogram:
>
> stanumbers1 | {0.08695652,0.08695652,0.04347826,0.04347826, ... }
> stavalues1  | {v1,v2, ... }
>
> And we get the right estimate:
>
> Hash Join (cost=1.52..72100.69 rows=3631 width=0) (actual
> time=1.447..1268.385 rows=3902 loops=1)
>

I'm not against building a "complete" MCV, but I guess the case where
(samplerows <= num_mcv) is pretty rare. Why shouldn't we make the MCV
complete whenever we decide (ndistinct <= num_mcv)? That would need to
happen later, because we don't have the ndistinct estimate yet at this
point - we'd have to do the loop a bit later (or likely twice).

FWIW the patch breaks the calculation of nmultiple (and thus likely the
ndistinct estimate).

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
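A minimal sketch of the kind of reproducer script requested above might
look like the following. The table names, value ranges, and skew are
hypothetical stand-ins (the report does not include the actual data),
so the exact misestimate is not guaranteed to appear, but it builds the
same shape: a 23-row table with 21 distinct values, two of them
duplicated, joined to a ~2M-row table where only a small fraction of
rows match.

    -- small table: 23 rows, 21 distinct values; only the two
    -- duplicated values make it into the MCV list, the 19 singleton
    -- values land in the histogram (as in the statistics quoted above)
    CREATE TABLE small_tbl (v int);
    INSERT INTO small_tbl
      SELECT i FROM generate_series(1, 21) AS i   -- 21 distinct values
      UNION ALL
      SELECT i FROM generate_series(1, 2) AS i;   -- two values twice

    -- large table: ~2M rows, mostly values that don't match small_tbl
    CREATE TABLE big_tbl (v int);
    INSERT INTO big_tbl
      SELECT (random() * 1000)::int
      FROM generate_series(1, 2000000);

    ANALYZE small_tbl;
    ANALYZE big_tbl;

    -- inspect the per-column statistics discussed above
    SELECT n_distinct, most_common_vals, most_common_freqs,
           histogram_bounds
    FROM pg_stats
    WHERE tablename = 'small_tbl' AND attname = 'v';

    EXPLAIN ANALYZE
    SELECT 1 FROM big_tbl b JOIN small_tbl s ON s.v = b.v;

With the default statistics target, the small table's MCV list should
again contain only the two duplicated values, which is the precondition
for the join selectivity estimate discussed in this thread.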