Обсуждение: The issue of incorrect width estimation in UNION queries
Hi
When I perform a union operation on the varchar fields of two tables,
the planner incorrectly selects the sort method. But the efficiency of hashagg will be higher.
This seems to be because the planner has miscalculated the length of the columns.
The test cases are as follows:
(version is PostgreSQL 17.4)
CREATE TABLE public.t1 (
id integer primary key,
name varchar(10)
);
INSERT INTO public.t1 (id, name)
SELECT generate_series(1, 100000) AS id,'hgbnmjujty' AS name;
CREATE TABLE public.t2 (
id integer primary key,
name varchar(10)
);
INSERT INTO public.t2 (id, name)
SELECT generate_series(1, 100000) AS id,'mnyknvkuma' AS name;
set work_mem to '256kB';
explain analyze select name from t1 union select name from t2;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
------------
Unique (cost=38103.14..39103.14 rows=200000 width=38) (actual time=84.674..127.446 rows=2 loops=1)
-> Sort (cost=38103.14..38603.14 rows=200000 width=38) (actual time=84.670..111.885 rows=200000 loops=1)
Sort Key: t1.name
Sort Method: external merge Disk: 2976kB
-> Append (cost=0.00..4082.00 rows=200000 width=38) (actual time=0.016..40.349 rows=200000 loops=1)
-> Seq Scan on t1 (cost=0.00..1541.00 rows=100000 width=11) (actual time=0.015..11.187 rows=1000
00 loops=1)
-> Seq Scan on t2 (cost=0.00..1541.00 rows=100000 width=11) (actual time=0.011..11.098 rows=1000
00 loops=1)
Planning Time: 0.098 ms
Execution Time: 127.757 ms
set enable_sort to off;
explain analyze select name from t1 union select name from t2;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
------
HashAggregate (cost=37582.00..45832.00 rows=200000 width=38) (actual time=69.529..69.550 rows=2 loops=1)
Group Key: t1.name
Planned Partitions: 16 Batches: 1 Memory Usage: 121kB
-> Append (cost=0.00..4082.00 rows=200000 width=38) (actual time=0.016..39.050 rows=200000 loops=1)
-> Seq Scan on t1 (cost=0.00..1541.00 rows=100000 width=11) (actual time=0.015..10.056 rows=100000 loo
ps=1)
-> Seq Scan on t2 (cost=0.00..1541.00 rows=100000 width=11) (actual time=0.018..10.654 rows=100000 loo
ps=1)
Planning Time: 0.105 ms
Execution Time: 69.606 ms
In the execution plan, the width of the name field is estimated to be 38,
but the name fields in both tables are strings with a length of 10.
It seems that this is the reason why the optimizer incorrectly estimated the data size.
Regards
On Wed, 24 Sept 2025 at 18:13, sunw.fnst <936739278@qq.com> wrote: > -> Append (cost=0.00..4082.00 rows=200000 width=38) (actual time=0.016..39.050 rows=200000 loops=1) > -> Seq Scan on t1 (cost=0.00..1541.00 rows=100000 width=11) (actual time=0.015..10.056 rows=100000 loo > ps=1) > -> Seq Scan on t2 (cost=0.00..1541.00 rows=100000 width=11) (actual time=0.018..10.654 rows=100000 loo > ps=1) This happens because in generate_union_paths() -> create_pathtarget() -> set_pathtarget_cost_width() in get_expr_width(), the varno for the union's RelOptInfo is 0, which results in using the get_typavgwidth() result. set_append_rel_size() does a better job of this for the UNION ALL code path. Maybe we can copy the relevant parts of that. Also, I think if we're doing this, then it'd be a master-only fix. Changing this could result in plan changes in the back-branches, which we normally try to avoid. The attached patch is against master. I need to spend a bit longer on this as generate_nonunion_paths() might need the same treatment. I've just run out of time for tonight. David
Вложения
ok
Thank you for the reply,currently, we can also temporarily work around this issue by adding a hint,
look forward to the fix in the master branch.
Regards
Wei Sun
------------------ 原始邮件 ------------------
发件人: "David Rowley" <dgrowleyml@gmail.com>;
发送时间: 2025年9月24日(星期三) 晚上6:47
收件人: "Wei Sun"<936739278@qq.com>;
抄送: "pgsql-bugs"<pgsql-bugs@lists.postgresql.org>;
主题: Re: The issue of incorrect width estimation in UNION queries
> -> Append (cost=0.00..4082.00 rows=200000 width=38) (actual time=0.016..39.050 rows=200000 loops=1)
> -> Seq Scan on t1 (cost=0.00..1541.00 rows=100000 width=11) (actual time=0.015..10.056 rows=100000 loo
> ps=1)
> -> Seq Scan on t2 (cost=0.00..1541.00 rows=100000 width=11) (actual time=0.018..10.654 rows=100000 loo
> ps=1)
This happens because in generate_union_paths() -> create_pathtarget()
-> set_pathtarget_cost_width() in get_expr_width(), the varno for the
union's RelOptInfo is 0, which results in using the get_typavgwidth()
result.
set_append_rel_size() does a better job of this for the UNION ALL code
path. Maybe we can copy the relevant parts of that.
Also, I think if we're doing this, then it'd be a master-only fix.
Changing this could result in plan changes in the back-branches, which
we normally try to avoid. The attached patch is against master. I
need to spend a bit longer on this as generate_nonunion_paths() might
need the same treatment. I've just run out of time for tonight.
David
On Wed, 24 Sept 2025 at 22:47, David Rowley <dgrowleyml@gmail.com> wrote: > The attached patch is against master. I > need to spend a bit longer on this as generate_nonunion_paths() might > need the same treatment. I've just run out of time for tonight. Here's an updated patch which includes adjusting the width estimate for generate_nonunion_paths() too. I gave a bit of thought to how that should work for EXCEPT and INTERSECT and concluded that for EXCEPT, we should take the width estimate for the left-hand child, as no rows from the right-child will be used. For INTERSECT, I couldn't think of anything better than taking the weighted average of both children, the same as UNION. Of course, only the rows which exist in both will make the final result, but I don't see any way to do anything smart with a single width estimate for each child. David
Вложения
On Fri, 26 Sept 2025 at 09:42, David Rowley <dgrowleyml@gmail.com> wrote: > Here's an updated patch which includes adjusting the width estimate > for generate_nonunion_paths() too. > > I gave a bit of thought to how that should work for EXCEPT and > INTERSECT and concluded that for EXCEPT, we should take the width > estimate for the left-hand child, as no rows from the right-child will > be used. For INTERSECT, I couldn't think of anything better than > taking the weighted average of both children, the same as UNION. Of > course, only the rows which exist in both will make the final result, > but I don't see any way to do anything smart with a single width > estimate for each child. I looked at this again today. All seems fairly trivial and not too dissimilar to what happens in set_append_rel_size(). I did consider if it should be done by having some special varno for setops and then have get_expr_width() to something better, but I don't see how that would work exactly. In any case, that would be way more complex than what I did. With some more adjustments to the comments, I pushed the v1 patch. David