Re: Eager aggregation, take 3
From: Richard Guo
Subject: Re: Eager aggregation, take 3
Date:
Msg-id: CAMbWs4-QU_nkFqFZLdzWRsEsVE8aLWx4qBBVq7g4rXw+cvYDMg@mail.gmail.com
In reply to: Re: Eager aggregation, take 3 (Richard Guo <guofenglinux@gmail.com>)
Responses: Re: Eager aggregation, take 3
List: pgsql-hackers
On Thu, Oct 2, 2025 at 10:39 AM Richard Guo <guofenglinux@gmail.com> wrote:
> It seems eager aggregation doesn't cope well with parallel plans for
> this query.  Looking into it.

It turns out that this is not related to parallel plans but rather to
poor size estimates.

Query 31 involves joining 6 base relations, all of which are CTE
references (i.e., RTE_CTE relations) to two different CTEs.  Each CTE
involves aggregations and GROUP BY clauses.

Unfortunately, our size estimates for CTE relations are quite poor,
especially when the CTE uses GROUP BY.  In these cases, we don't have
any ANALYZE statistics available (cf. examine_simple_variable).  As a
result, when computing the selectivity of the CTE relation's qual
clauses, we have to fall back on default values.  For example, for
quals like "CTE.var = const", which are used a lot in query 31, the
selectivity is computed as "1.0 / DEFAULT_NUM_DISTINCT" (i.e., 1/200),
on the assumption that there are DEFAULT_NUM_DISTINCT distinct values
in the relation and that these values are equally common
(cf. var_eq_const).

The consequence is that the size estimates differ significantly from
the actual values.  For example, from the EXPLAIN (ANALYZE) output
provided by Matheus:

  ->  CTE Scan on ws ws3  (cost=0.00..1797.35 rows=2 width=110) (actual time=0.001..74.725 rows=1261.00 loops=1)
        Filter: ((d_year = 1999) AND (d_qoy = 3))

Interestingly, with eager aggregation applied, the row count estimates
for the two CTE plans actually become closer to the actual values.

  -- without eager aggregation
  CTE ws
    ->  HashAggregate  (cost=96009.03..114825.35 rows=718952 width=54) (actual time=977.215..1014.889 rows=23320.00 loops=1)

  -- with eager aggregation
  CTE ws
    ->  Finalize GroupAggregate  (cost=52144.19..62314.79 rows=71894 width=54) (actual time=275.121..340.107 rows=23312.00 loops=1)

However, due to the highly underestimated selectivity for the qual
clauses, the row count estimates for the CTE Scan nodes become worse.
This is because:

  -- without eager aggregation
  718952 * (1.0/200) * (1.0/200) ~= 18

  -- with eager aggregation
  71894 * (1.0/200) * (1.0/200) ~= 2

... while the actual row count is 1261.00, as shown above.

That is to say, on master, the CTE plan rows are overestimated while
the selectivity estimates are severely underestimated.  With eager
aggregation, the CTE plan rows become closer to the actual values, but
the selectivity estimates remain equally underestimated.  As a result,
the row count estimates for the CTE Scan nodes worsen with eager
aggregation.  This causes the join order in the final plan to change
when eager aggregation is applied, leading to longer execution times in
this case.

Another point to note is that, due to the severely underestimated
selectivity estimates (0.000025, sometimes 0.000000125), the size
estimates for the CTE relations are very small, causing the planner to
favor nestloops.  I tried manually disabling nestloop, and here is what
I got for query 31.

-- on master, set enable_nestloop to on;
Planning Time: 4.613 ms
Execution Time: 7142.090 ms

-- on master, set enable_nestloop to off;
Planning Time: 4.315 ms
Execution Time: 2262.330 ms

-- on patched, set enable_nestloop to off;
Planning Time: 4.321 ms
Execution Time: 1214.376 ms

That is, on master, simply disabling nestloop makes query 31 run more
than 3 times faster.  Enabling eager aggregation on top of that
improves performance further, making it run 1.86 times faster relative
to the nested-loop-disabled baseline.
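For anyone who wants to see the fallback outside TPC-DS, here is a
minimal sketch; the table "sales" and its columns/contents are made up
for illustration and are not taken from query 31.  Because the CTE uses
GROUP BY, its output columns carry no ANALYZE statistics, so the
equality qual on d_year should be estimated at 1.0 / DEFAULT_NUM_DISTINCT
(0.005) of the CTE's rows, regardless of the actual data distribution:

  CREATE TABLE sales (customer_id int, d_year int, amount numeric);
  INSERT INTO sales
    SELECT i % 10000, 1998 + (i % 3), random() * 100
    FROM generate_series(1, 300000) AS i;
  ANALYZE sales;

  -- The CTE produces ~30000 groups; the filter actually matches ~10000
  -- of them, but the CTE Scan should be estimated at roughly
  -- 30000 * (1.0/200) ~= 150 rows.
  EXPLAIN ANALYZE
  WITH ws AS MATERIALIZED (
      SELECT customer_id, d_year, sum(amount) AS total
      FROM sales
      GROUP BY customer_id, d_year
  )
  SELECT * FROM ws WHERE d_year = 1999;

The MATERIALIZED keyword is only there to keep the CTE from being
inlined, so the plan shows a CTE Scan with a Filter, as in Matheus'
output.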
I manually disabled nested loops for other TPC-DS queries on master and
discovered some additional interesting findings.

For query 4, on master:

-- set enable_nestloop to on
Planning Time: 3.054 ms
Execution Time: 3231356.258 ms

-- set enable_nestloop to off
Planning Time: 4.291 ms
Execution Time: 12751.170 ms

That is, on master, simply disabling nestloop makes query 4 run more
than 253 times faster.

For query 11, on master:

-- set enable_nestloop to on
Planning Time: 1.435 ms
Execution Time: 1824860.937 ms

-- set enable_nestloop to off
Planning Time: 2.479 ms
Execution Time: 7984.360 ms

Disabling nestloop makes query 11 run more than 228 times faster.

I believe you can find more such queries in TPC-DS if you keep looking.
Given this, I don't think it makes much sense to debug a performance
regression on TPC-DS with nestloop enabled.

Matheus, I wonder if you could help run TPC-DS again with this patch,
this time with nested loops disabled for all queries.

- Richard
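In case it helps with the re-run: the nestloop-off numbers above came
from a plain session-level SET, so something like the following should
do.  The ALTER SYSTEM variant (commented out) is only a suggestion if
you want the setting to apply to every session during the benchmark
run:

  -- Session-local: disable nested-loop joins; hash and merge joins
  -- remain available to the planner.
  SET enable_nestloop = off;

  -- Confirm the setting took effect.
  SHOW enable_nestloop;

  -- Run the TPC-DS queries under EXPLAIN (ANALYZE) as before, then
  -- restore the default (or simply end the session).
  RESET enable_nestloop;

  -- Alternatively, cluster-wide for the duration of the run:
  -- ALTER SYSTEM SET enable_nestloop = off;
  -- SELECT pg_reload_conf();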