Discussion: MergeAppend could consider sorting cheapest child path
Hi. Currently, when the planner finds suitable pathkeys in generate_orderedappend_paths(), it uses them, even if an explicit sort of the cheapest child path could be more efficient. We encountered this issue on a partitioned table with two indexes, where one is suitable for sorting and the other is good for selecting data. A MergeAppend was generated with subpaths doing an index scan on the suitably ordered index and filtering out a lot of data. The suggested fix allows MergeAppend to consider sorting the cheapest childrel total path as an alternative. -- Best regards, Alexander Pyhalov, Postgres Professional
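For readability, the core of the proposed change to generate_orderedappend_paths() (the full diff is quoted in the follow-up below) is a single extra branch: when a pathkey-matching child path exists, also cost an explicit sort of the cheapest total child path and fall back to it if that looks cheaper.

else
{
    /*
     * Even if we found the necessary pathkeys, using an unsorted path
     * and sorting it explicitly can be more efficient.
     */
    Path        sort_path;

    cost_sort(&sort_path, root, pathkeys,
              childrel->cheapest_total_path->total_cost,
              childrel->cheapest_total_path->rows,
              childrel->cheapest_total_path->pathtarget->width,
              0.0, work_mem,
              -1.0 /* need all tuples to sort them */ );

    if (compare_path_costs(&sort_path, cheapest_total, TOTAL_COST) < 0)
        cheapest_total = childrel->cheapest_total_path;
}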
Attachments
Is this still being considered? --------------------------------------------------------------------------- On Tue, Jun 18, 2024 at 07:45:09PM +0300, Alexander Pyhalov wrote: > Hi. > > Now when planner finds suitable pathkeys in generate_orderedappend_paths(), > it uses them, even if explicit sort of the cheapest child path could be more > efficient. > > We encountered this issue on partitioned table with two indexes, where one > is suitable for sorting, and another is good for selecting data. MergeAppend > was generated > with subpaths doing index scan on suitably ordered index and filtering a lot > of data. > The suggested fix allows MergeAppend to consider sorting on cheapest > childrel total path as an alternative. > > -- > Best regards, > Alexander Pyhalov, > Postgres Professional > From d5eb3d326d83e2ca47c17552fcc6fd27b6b98d4e Mon Sep 17 00:00:00 2001 > From: Alexander Pyhalov <a.pyhalov@postgrespro.ru> > Date: Tue, 18 Jun 2024 15:56:18 +0300 > Subject: [PATCH] MergeAppend could consider using sorted best path. > > This helps when index with suitable pathkeys is not > good for filtering data. > --- > .../postgres_fdw/expected/postgres_fdw.out | 6 +- > src/backend/optimizer/path/allpaths.c | 21 +++++ > src/test/regress/expected/inherit.out | 45 +++++++++- > src/test/regress/expected/partition_join.out | 87 +++++++++++-------- > src/test/regress/expected/union.out | 6 +- > src/test/regress/sql/inherit.sql | 17 ++++ > 6 files changed, 141 insertions(+), 41 deletions(-) > > diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out > index ea566d50341..687591e4a97 100644 > --- a/contrib/postgres_fdw/expected/postgres_fdw.out > +++ b/contrib/postgres_fdw/expected/postgres_fdw.out > @@ -10074,13 +10074,15 @@ SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a > -> Nested Loop > Join Filter: (t1.a = t2.b) > -> Append > - -> Foreign Scan on ftprt1_p1 t1_1 > + -> Sort > + Sort Key: t1_1.a > + -> Foreign Scan on ftprt1_p1 t1_1 > -> Foreign Scan on ftprt1_p2 t1_2 > -> Materialize > -> Append > -> Foreign Scan on ftprt2_p1 t2_1 > -> Foreign Scan on ftprt2_p2 t2_2 > -(10 rows) > +(12 rows) > > SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1; > a | b > diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c > index 4895cee9944..827bc469269 100644 > --- a/src/backend/optimizer/path/allpaths.c > +++ b/src/backend/optimizer/path/allpaths.c > @@ -1845,6 +1845,27 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel, > /* Assert we do have an unparameterized path for this child */ > Assert(cheapest_total->param_info == NULL); > } > + else > + { > + /* > + * Even if we found necessary pathkeys, using unsorted path > + * can be more efficient. 
> + */ > + Path sort_path; > + > + cost_sort(&sort_path, > + root, > + pathkeys, > + childrel->cheapest_total_path->total_cost, > + childrel->cheapest_total_path->rows, > + childrel->cheapest_total_path->pathtarget->width, > + 0.0, > + work_mem, > + -1.0 /* need all tuples to sort them */ ); > + > + if (compare_path_costs(&sort_path, cheapest_total, TOTAL_COST) < 0) > + cheapest_total = childrel->cheapest_total_path; > + } > > /* > * When building a fractional path, determine a cheapest > diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out > index ad732134148..16e78c8d2ff 100644 > --- a/src/test/regress/expected/inherit.out > +++ b/src/test/regress/expected/inherit.out > @@ -1555,6 +1555,7 @@ insert into matest2 (name) values ('Test 4'); > insert into matest3 (name) values ('Test 5'); > insert into matest3 (name) values ('Test 6'); > set enable_indexscan = off; -- force use of seqscan/sort, so no merge > +set enable_sort = off; -- avoid sorting below MergeAppend > explain (verbose, costs off) select * from matest0 order by 1-id; > QUERY PLAN > ------------------------------------------------------------ > @@ -1608,6 +1609,7 @@ select min(1-id) from matest0; > (1 row) > > reset enable_indexscan; > +reset enable_sort; > set enable_seqscan = off; -- plan with fewest seqscans should be merge > set enable_parallel_append = off; -- Don't let parallel-append interfere > explain (verbose, costs off) select * from matest0 order by 1-id; > @@ -1702,7 +1704,9 @@ order by t1.b limit 10; > Merge Cond: (t1.b = t2.b) > -> Merge Append > Sort Key: t1.b > - -> Index Scan using matest0i on matest0 t1_1 > + -> Sort > + Sort Key: t1_1.b > + -> Seq Scan on matest0 t1_1 > -> Index Scan using matest1i on matest1 t1_2 > -> Materialize > -> Merge Append > @@ -1711,7 +1715,7 @@ order by t1.b limit 10; > Filter: (c = d) > -> Index Scan using matest1i on matest1 t2_2 > Filter: (c = d) > -(14 rows) > +(16 rows) > > reset enable_nestloop; > drop table matest0 cascade; > @@ -2663,6 +2667,43 @@ explain (costs off) select * from mcrparted where a = 10 order by a, abs(b), c; > > reset enable_bitmapscan; > drop table mcrparted; > +-- Check that sort path can be used by MergeAppend even when there are suitable pathkeys > +create table hash_parted (i int, j int, k int) partition by hash(i); > +create table hash_parted_1 partition of hash_parted for values with (modulus 4, remainder 0); > +create table hash_parted_2 partition of hash_parted for values with (modulus 4, remainder 1); > +create table hash_parted_3 partition of hash_parted for values with (modulus 4, remainder 2); > +create table hash_parted_4 partition of hash_parted for values with (modulus 4, remainder 3); > +--create table hash_parted_5 partition of hash_parted for values with (modulus 6, remainder 4); > +--create table hash_parted_6 partition of hash_parted for values with (modulus 6, remainder 5); > +create index on hash_parted(j); > +create index on hash_parted(k); > +insert into hash_parted select i, i, i from generate_series(1,10000) i; > +analyze hash_parted; > +explain (costs off) select * from hash_parted where k<100 order by j limit 100; > + QUERY PLAN > +------------------------------------------------------------------------- > + Limit > + -> Merge Append > + Sort Key: hash_parted.j > + -> Sort > + Sort Key: hash_parted_1.j > + -> Index Scan using hash_parted_1_k_idx on hash_parted_1 > + Index Cond: (k < 100) > + -> Sort > + Sort Key: hash_parted_2.j > + -> Index Scan using hash_parted_2_k_idx on hash_parted_2 
> + Index Cond: (k < 100) > + -> Sort > + Sort Key: hash_parted_3.j > + -> Index Scan using hash_parted_3_k_idx on hash_parted_3 > + Index Cond: (k < 100) > + -> Sort > + Sort Key: hash_parted_4.j > + -> Index Scan using hash_parted_4_k_idx on hash_parted_4 > + Index Cond: (k < 100) > +(19 rows) > + > +drop table hash_parted; > -- Ensure LIST partitions allow an Append to be used instead of a MergeAppend > create table bool_lp (b bool) partition by list(b); > create table bool_lp_true partition of bool_lp for values in(true); > diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out > index 6d07f86b9bc..80d480d33d5 100644 > --- a/src/test/regress/expected/partition_join.out > +++ b/src/test/regress/expected/partition_join.out > @@ -1309,28 +1309,32 @@ SELECT t1.a, t1.c, t2.b, t2.c, t3.a + t3.b, t3.c FROM (prt1 t1 LEFT JOIN prt2 t2 > -- This should generate a partitionwise join, but currently fails to > EXPLAIN (COSTS OFF) > SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT * FROM prt2 WHERE b > 250) t2 ON t1.a =t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b; > - QUERY PLAN > ------------------------------------------------------------ > - Incremental Sort > + QUERY PLAN > +----------------------------------------------------------------- > + Sort > Sort Key: prt1.a, prt2.b > - Presorted Key: prt1.a > - -> Merge Left Join > - Merge Cond: (prt1.a = prt2.b) > - -> Sort > - Sort Key: prt1.a > - -> Append > - -> Seq Scan on prt1_p1 prt1_1 > - Filter: ((a < 450) AND (b = 0)) > - -> Seq Scan on prt1_p2 prt1_2 > - Filter: ((a < 450) AND (b = 0)) > - -> Sort > - Sort Key: prt2.b > - -> Append > + -> Merge Right Join > + Merge Cond: (prt2.b = prt1.a) > + -> Append > + -> Sort > + Sort Key: prt2_1.b > -> Seq Scan on prt2_p2 prt2_1 > Filter: (b > 250) > + -> Sort > + Sort Key: prt2_2.b > -> Seq Scan on prt2_p3 prt2_2 > Filter: (b > 250) > -(19 rows) > + -> Materialize > + -> Append > + -> Sort > + Sort Key: prt1_1.a > + -> Seq Scan on prt1_p1 prt1_1 > + Filter: ((a < 450) AND (b = 0)) > + -> Sort > + Sort Key: prt1_2.a > + -> Seq Scan on prt1_p2 prt1_2 > + Filter: ((a < 450) AND (b = 0)) > +(23 rows) > > SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT * FROM prt2 WHERE b > 250) t2 ON t1.a =t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b; > a | b > @@ -1350,25 +1354,33 @@ SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT * > -- partitionwise join does not apply > EXPLAIN (COSTS OFF) > SELECT t1.a, t2.b FROM prt1 t1, prt2 t2 WHERE t1::text = t2::text AND t1.a = t2.b ORDER BY t1.a; > - QUERY PLAN > ------------------------------------------------------------------------------------------ > + QUERY PLAN > +------------------------------------------------------------------ > Merge Join > - Merge Cond: ((t1.a = t2.b) AND (((((t1.*)::prt1))::text) = ((((t2.*)::prt2))::text))) > - -> Sort > - Sort Key: t1.a, ((((t1.*)::prt1))::text) > - -> Result > - -> Append > - -> Seq Scan on prt1_p1 t1_1 > - -> Seq Scan on prt1_p2 t1_2 > - -> Seq Scan on prt1_p3 t1_3 > - -> Sort > - Sort Key: t2.b, ((((t2.*)::prt2))::text) > - -> Result > - -> Append > + Merge Cond: (t1.a = t2.b) > + Join Filter: ((((t2.*)::prt2))::text = (((t1.*)::prt1))::text) > + -> Append > + -> Sort > + Sort Key: t1_1.a > + -> Seq Scan on prt1_p1 t1_1 > + -> Sort > + Sort Key: t1_2.a > + -> Seq Scan on prt1_p2 t1_2 > + -> Sort > + Sort Key: t1_3.a > + -> Seq Scan on prt1_p3 t1_3 > + -> Materialize > + -> Append > + -> Sort > + 
Sort Key: t2_1.b > -> Seq Scan on prt2_p1 t2_1 > + -> Sort > + Sort Key: t2_2.b > -> Seq Scan on prt2_p2 t2_2 > + -> Sort > + Sort Key: t2_3.b > -> Seq Scan on prt2_p3 t2_3 > -(16 rows) > +(24 rows) > > SELECT t1.a, t2.b FROM prt1 t1, prt2 t2 WHERE t1::text = t2::text AND t1.a = t2.b ORDER BY t1.a; > a | b > @@ -4906,21 +4918,26 @@ EXPLAIN (COSTS OFF) > SELECT t1.*, t2.* FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a AND t1.b = t2.b) WHERE t1.b >= 125 AND t1.b < 225 ORDERBY t1.a, t1.b; > QUERY PLAN > -------------------------------------------------------------------- > - Sort > + Merge Append > Sort Key: t1.a, t1.b > - -> Append > + -> Sort > + Sort Key: t1_1.a, t1_1.b > -> Hash Join > Hash Cond: ((t1_1.a = t2_1.a) AND (t1_1.b = t2_1.b)) > -> Seq Scan on alpha_neg_p1 t1_1 > Filter: ((b >= 125) AND (b < 225)) > -> Hash > -> Seq Scan on beta_neg_p1 t2_1 > + -> Sort > + Sort Key: t1_2.a, t1_2.b > -> Hash Join > Hash Cond: ((t2_2.a = t1_2.a) AND (t2_2.b = t1_2.b)) > -> Seq Scan on beta_neg_p2 t2_2 > -> Hash > -> Seq Scan on alpha_neg_p2 t1_2 > Filter: ((b >= 125) AND (b < 225)) > + -> Sort > + Sort Key: t1_4.a, t1_4.b > -> Hash Join > Hash Cond: ((t2_4.a = t1_4.a) AND (t2_4.b = t1_4.b)) > -> Append > @@ -4935,7 +4952,7 @@ SELECT t1.*, t2.* FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a AND t1.b = t2 > Filter: ((b >= 125) AND (b < 225)) > -> Seq Scan on alpha_pos_p3 t1_6 > Filter: ((b >= 125) AND (b < 225)) > -(29 rows) > +(34 rows) > > SELECT t1.*, t2.* FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a AND t1.b = t2.b) WHERE t1.b >= 125 AND t1.b < 225 ORDERBY t1.a, t1.b; > a | b | c | a | b | c > diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out > index 0fd0e1c38b3..4c1c173d8e6 100644 > --- a/src/test/regress/expected/union.out > +++ b/src/test/regress/expected/union.out > @@ -1195,12 +1195,14 @@ select event_id > ---------------------------------------------------------- > Merge Append > Sort Key: events.event_id > - -> Index Scan using events_pkey on events > + -> Sort > + Sort Key: events.event_id > + -> Seq Scan on events > -> Sort > Sort Key: events_1.event_id > -> Seq Scan on events_child events_1 > -> Index Scan using other_events_pkey on other_events > -(7 rows) > +(9 rows) > > drop table events_child, events, other_events; > reset enable_indexonlyscan; > diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql > index e3bcfdb181e..5331e49283f 100644 > --- a/src/test/regress/sql/inherit.sql > +++ b/src/test/regress/sql/inherit.sql > @@ -586,11 +586,13 @@ insert into matest3 (name) values ('Test 5'); > insert into matest3 (name) values ('Test 6'); > > set enable_indexscan = off; -- force use of seqscan/sort, so no merge > +set enable_sort = off; -- avoid sorting below MergeAppend > explain (verbose, costs off) select * from matest0 order by 1-id; > select * from matest0 order by 1-id; > explain (verbose, costs off) select min(1-id) from matest0; > select min(1-id) from matest0; > reset enable_indexscan; > +reset enable_sort; > > set enable_seqscan = off; -- plan with fewest seqscans should be merge > set enable_parallel_append = off; -- Don't let parallel-append interfere > @@ -976,6 +978,21 @@ reset enable_bitmapscan; > > drop table mcrparted; > > +-- Check that sort path can be used by MergeAppend even when there are suitable pathkeys > +create table hash_parted (i int, j int, k int) partition by hash(i); > +create table hash_parted_1 partition of hash_parted for values with (modulus 4, remainder 0); > +create table 
hash_parted_2 partition of hash_parted for values with (modulus 4, remainder 1); > +create table hash_parted_3 partition of hash_parted for values with (modulus 4, remainder 2); > +create table hash_parted_4 partition of hash_parted for values with (modulus 4, remainder 3); > +--create table hash_parted_5 partition of hash_parted for values with (modulus 6, remainder 4); > +--create table hash_parted_6 partition of hash_parted for values with (modulus 6, remainder 5); > +create index on hash_parted(j); > +create index on hash_parted(k); > +insert into hash_parted select i, i, i from generate_series(1,10000) i; > +analyze hash_parted; > +explain (costs off) select * from hash_parted where k<100 order by j limit 100; > +drop table hash_parted; > + > -- Ensure LIST partitions allow an Append to be used instead of a MergeAppend > create table bool_lp (b bool) partition by list(b); > create table bool_lp_true partition of bool_lp for values in(true); > -- > 2.34.1 > -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com When a patient asks the doctor, "Am I going to die?", he means "Am I going to die soon?"
Bruce Momjian <bruce@momjian.us> writes: > Is this still being considered? I'd +1 this feature. I guess this would be even more useful in the parallel case, where the Sort can be pushed down to parallel workers, and in the distributed-database case, where the Sort can be pushed down to multiple nodes; as a result, the leader just does the merge work. At the high level of implementation, sorting the *cheapest* child path doesn't look like it adds too much overhead to the planning effort. -- Best Regards Andy Fan
Hi!
I've checked this thread and the examples in it, and do not see stable improvements in the base tests. Sometimes the base tests are considerably slower with the patch, for example:
(patched)
explain analyze
select t1.* from matest0 t1, matest0 t2
where t1.b = t2.b and t2.c = t2.d
order by t1.b limit 10;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.46..19.90 rows=10 width=16) (actual time=0.007..0.008 rows=0 loops=1)
-> Merge Join (cost=0.46..181.24 rows=93 width=16) (actual time=0.007..0.007 rows=0 loops=1)
Merge Cond: (t1.b = t2.b)
-> Merge Append (cost=0.17..90.44 rows=1851 width=16) (actual time=0.006..0.007 rows=0 loops=1)
Sort Key: t1.b
-> Sort (cost=0.01..0.02 rows=1 width=16) (actual time=0.004..0.004 rows=0 loops=1)
Sort Key: t1_1.b
Sort Method: quicksort Memory: 25kB
-> Seq Scan on matest0 t1_1 (cost=0.00..0.00 rows=1 width=16) (actual time=0.002..0.002 rows=0 loops=1)
-> Index Scan using matest1i on matest1 t1_2 (cost=0.15..71.90 rows=1850 width=16) (actual time=0.002..0.002 rows=0 loops=1)
-> Materialize (cost=0.29..84.81 rows=10 width=4) (never executed)
-> Merge Append (cost=0.29..84.78 rows=10 width=4) (never executed)
Sort Key: t2.b
-> Index Scan using matest0i on matest0 t2_1 (cost=0.12..8.14 rows=1 width=4) (never executed)
Filter: (c = d)
-> Index Scan using matest1i on matest1 t2_2 (cost=0.15..76.53 rows=9 width=4) (never executed)
Filter: (c = d)
Planning Time: 0.252 ms
Execution Time: 0.048 ms
(19 rows)
(vanilla)
explain analyze
select t1.* from matest0 t1, matest0 t2
where t1.b = t2.b and t2.c = t2.d
order by t1.b limit 10;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.57..20.88 rows=10 width=16) (actual time=0.004..0.004 rows=0 loops=1)
-> Merge Join (cost=0.57..189.37 rows=93 width=16) (actual time=0.003..0.004 rows=0 loops=1)
Merge Cond: (t1.b = t2.b)
-> Merge Append (cost=0.29..98.56 rows=1851 width=16) (actual time=0.002..0.003 rows=0 loops=1)
Sort Key: t1.b
-> Index Scan using matest0i on matest0 t1_1 (cost=0.12..8.14 rows=1 width=16) (actual time=0.002..0.002 rows=0 loops=1)
-> Index Scan using matest1i on matest1 t1_2 (cost=0.15..71.90 rows=1850 width=16) (actual time=0.001..0.001 rows=0 loops=1)
-> Materialize (cost=0.29..84.81 rows=10 width=4) (never executed)
-> Merge Append (cost=0.29..84.78 rows=10 width=4) (never executed)
Sort Key: t2.b
-> Index Scan using matest0i on matest0 t2_1 (cost=0.12..8.14 rows=1 width=4) (never executed)
Filter: (c = d)
-> Index Scan using matest1i on matest1 t2_2 (cost=0.15..76.53 rows=9 width=4) (never executed)
Filter: (c = d)
Planning Time: 0.278 ms
Execution Time: 0.025 ms
(16 rows)
--
Andy Fan wrote 2024-10-17 03:33: > Bruce Momjian <bruce@momjian.us> writes: > >> Is this still being considered? > > I'd +1 on this feature. I guess this would be more useful on parallel > case, where the Sort can be pushed down to parallel worker, and in the > distributed database case, where the Sort can be pushed down to > multiple > nodes, at the result, the leader just do the merge works. > > At the high level implementaion, sorting *cheapest* child path looks > doesn't add too much overhead on the planning effort. Hi. I've updated the patch. One more interesting case which we found - when a fractional path is selected, it can still be more expensive than the sorted cheapest total path (as we look only at indexes with the necessary pathkeys, not at indexes which allow fetching the data efficiently). So far I couldn't construct an artificial example, but we've seen inadequate index selection due to this issue - instead of using the index suited for the filters in WHERE, the index suitable for sorting was selected as the one having the cheapest fractional cost. -- Best regards, Alexander Pyhalov, Postgres Professional
Attachments
On 3/28/25 09:19, Alexander Pyhalov wrote: > Andy Fan wrote 2024-10-17 03:33: > I've updated patch. One more interesting case which we found - when > fractional path is selected, it still can be more expensive than sorted > cheapest total path (as we look only on indexes whith necessary > pathkeys, not on indexes which allow efficiently fetch data). > So far couldn't find artificial example, but we've seen inadequate index > selection due to this issue - instead of using index suited for filters > in where, index, suitable for sorting was selected as one having the > cheapest fractional cost. I think it is necessary to generalise the approach a little. Each MergeAppend subpath candidate that fits the pathkeys should be compared to the overall-optimal path + explicit Sort node. Let's label this two-node composition as base_path. Three cases exist: startup-optimal, total-optimal and fractional-optimal. In the startup case, base_path should use cheapest_startup_path, in the total-optimal case - cheapest_total_path, and for the fractional case, we may employ the get_cheapest_fractional_path routine to detect the proper base_path. It may provide a more effective plan in the full, fractional and partial scan cases: 1. The Sort node may be pushed down to subpaths under a parallel or async Append. 2. When a minor set of subpaths doesn't have a proper index, and it is profitable to sort them instead of switching to a plain Append. In general, analysing the regression tests changed by this code, I see that the cost model prefers a series of small sortings instead of a single massive one. Maybe it will show some profit for execution time. I am not afraid of any palpable planning overhead here because we just do cheap cost estimation and comparison operations that don't need additional memory allocations. The caller is responsible for building a proper Sort node if this method is chosen as optimal. In the attachment, see the patch written according to the idea. There I introduced two new routines: get_cheapest_path_for_pathkeys_ext get_cheapest_fractional_path_for_pathkeys_ext I have designed the code that way to reduce duplicated code in the generate_orderedappend_paths routine. But the main point is that I envision these new routines may be reused elsewhere. This feature looks close to the one we discussed before [1]. So, let me CC the people from the previous conversation to the discussion. [1] https://www.postgresql.org/message-id/flat/CAN-LCVPxnWB39CUBTgOQ9O7Dd8DrA_tpT1EY3LNVnUuvAX1NjA%40mail.gmail.com -- regards, Andrei Lepikhov
Attachments
Andrei Lepikhov wrote 2025-04-24 16:01: > On 3/28/25 09:19, Alexander Pyhalov wrote: >> Andy Fan wrote 2024-10-17 03:33: >> I've updated patch. One more interesting case which we found - when >> fractional path is selected, it still can be more expensive than >> sorted cheapest total path (as we look only on indexes whith necessary >> pathkeys, not on indexes which allow efficiently fetch data). >> So far couldn't find artificial example, but we've seen inadequate >> index selection due to this issue - instead of using index suited for >> filters in where, index, suitable for sorting was selected as one >> having the cheapest fractional cost. > I think it is necessary to generalise the approach a little. > > Each MergeAppend subpath candidate that fits pathkeys should be > compared to the overall-optimal path + explicit Sort node. Let's label > this two-node composition as base_path. There are three cases exist: > startup-optimal, total-optimal and fractional-optimal. > In the startup case, base_path should use cheapest_startup_path, the > total-optimal case - cheapest_total_path and for a fractional case, we > may employ the get_cheapest_fractional_path routine to detect proper > base_path. > > It may provide a more effective plan either in full, fractional and > partial scan cases: > 1. The Sort node may be pushed down to subpaths under a parallel or > async Append. > 2. When a minor set of subpaths doesn't have a proper index, and it is > profitable to sort them instead of switching to plain Append. > > In general, analysing the regression tests changed by this code, I see > that the cost model prefers a series of small sortings instead of a > single massive one. May be it will show some profit for execution time. > > I am not afraid of any palpable planning overhead here because we just > do cheap cost estimation and comparison operations that don't need > additional memory allocations. The caller is responsible for building a > proper Sort node if this method is chosen as optimal. > > In the attachment, see the patch written according to the idea. There > are I introduced two new routines: > get_cheapest_path_for_pathkeys_ext > get_cheapest_fractional_path_for_pathkeys_ext Hi. I'm a bit confused that get_cheapest_fractional_path_for_pathkeys_ext() looks only at sorting the cheapest fractional path, and get_cheapest_path_for_pathkeys_ext() in the STARTUP_COST case looks only at sorting cheapest_startup_path. Usually, the sorted cheapest_total_path will be cheaper than a sorted fractional/startup path at least by startup cost (as after sorting it includes the total_cost of the input path). But we ignore this case when selecting the cheapest_startup and cheapest_fractional subpaths. As a result, the selected cheapest_startup and cheapest_fractional can fail to be the cheapest for startup or for selecting a fraction of rows. Consider a partition with the following access paths: 1) cheapest_startup without required pathkeys: startup_cost: 0.42 total_cost: 4004 2) some index_path with required pathkeys: startup_cost: 6.6 total_cost: 2000 3) cheapest_total_path: startup_cost: 0.42 total_cost: 3.48 Here, when selecting the cheapest startup subpath, we'll compare the costs of the index path (2) and the sorted cheapest_startup (1), but will ignore the sorted cheapest_total_path (3), despite the fact that it really can be the cheapest startup path providing the required sort order. -- Best regards, Alexander Pyhalov, Postgres Professional
On 4/25/25 11:16, Alexander Pyhalov wrote: > Andrei Lepikhov wrote 2025-04-24 16:01: >> On 3/28/25 09:19, Alexander Pyhalov wrote: >> In the attachment, see the patch written according to the idea. There >> are I introduced two new routines: >> get_cheapest_path_for_pathkeys_ext >> get_cheapest_fractional_path_for_pathkeys_ext > > Hi. I'm a bit confused that Thanks for the participation! > get_cheapest_fractional_path_for_pathkeys_ext() looks only on sorting > cheapest fractional path, and get_cheapest_path_for_pathkeys_ext() in > STARTUP_COST case looks only on sorting cheapest_startup_path. First of all, at the moment I don't understand why we calculate the cheapest_startup path at all. In my opinion, after commit 6b94e7a [1, 2], the min-fractional path totally covers the case. I began this discussion in [3] - maybe we need to scrutinise that issue beforehand. Looking into the min-fractional-path + Sort, we propose a path for the case when extracting a minor part of the tuples and sorting them later is cheaper than doing the massive job of a non-selective index scan. You may also imagine the case of a JOIN as a subpath: a non-sorted, highly selective parameterised NestLoop may be way more optimal than a MergeJoin which fits the pathkeys. > Usually, sorted cheapest_total_path will be cheaper than sorted > fractional/startup path at least by startup cost (as after sorting it > includes total_cost of input path). But we ignore this case when > selecting cheapest_startup and cheapest_fractional subpaths. As result > selected cheapest_startup and cheapest_fractional can be not cheapest > for startup or selecting a fraction of rows. I don't know what you mean by that. The cheapest_total_path is considered when we choose the optimal cheapest_total path. The same works for the fractional path - get_cheapest_fractional_path gives us the most optimal fractional path and probes cheapest_total_path too. As above, not sure about the min-startup case for now. I can imagine MergeAppend over a sophisticated subquery: the non-sorted alternative includes highly parameterised JOINs and the alternative (with pathkeys) includes a HashJoin, drastically increasing the startup cost. It is only a theory, of course. So, let's discover how min-startup works. At the end, when the sorted path is already estimated, we compare it each time with the previously selected pathkeys-cheapest path. So, if the sorted path is worse, we end up with the original path and don't lose anything. [1] https://www.postgresql.org/message-id/e8f9ec90-546d-e948-acce-0525f3e92773%40enterprisedb.com [2] https://www.postgresql.org/message-id/1581042da8044e71ada2d6e3a51bf7bb%40index.de [3] https://www.postgresql.org/message-id/f0206ef2-6b5a-4d07-8770-cfa7cd30f685@gmail.com -- regards, Andrei Lepikhov
On 4/25/25 17:13, Andrei Lepikhov wrote: > On 4/25/25 11:16, Alexander Pyhalov wrote: >> Usually, sorted cheapest_total_path will be cheaper than sorted >> fractional/startup path at least by startup cost (as after sorting it >> includes total_cost of input path). But we ignore this case when >> selecting cheapest_startup and cheapest_fractional subpaths. As result >> selected cheapest_startup and cheapest_fractional can be not cheapest >> for startup or selecting a fraction of rows. > I don't know what you mean by that. The cheapest_total_path is > considered when we chose optimal cheapest_total path. The same works for > the fractional path - get_cheapest_fractional_path gives us the most > optimal fractional path and probes cheapest_total_path too. > As above, not sure about min-startup case for now. I can imagine > MergeAppend over sophisticated subquery: non-sorted includes highly > parameterised JOINs and the alternative (with pathkeys) includes > HashJoin, drastically increasing startup cost. It is only a theory, of > course. So, lets discover how min-startup works. After a second thought I have caught your idea. I agree that for a fractional path it makes no sense to choose any path other than the cheapest total one. There is a modified patch in the attachment. Also, to be more objective, I propose to use examples in argumentation - something like the attached test2.sql script. -- regards, Andrei Lepikhov
Attachments
Andrei Lepikhov wrote 2025-04-29 16:52: > On 4/25/25 17:13, Andrei Lepikhov wrote: >> On 4/25/25 11:16, Alexander Pyhalov wrote: >>> Usually, sorted cheapest_total_path will be cheaper than sorted >>> fractional/startup path at least by startup cost (as after sorting it >>> includes total_cost of input path). But we ignore this case when >>> selecting cheapest_startup and cheapest_fractional subpaths. As >>> result selected cheapest_startup and cheapest_fractional can be not >>> cheapest for startup or selecting a fraction of rows. >> I don't know what you mean by that. The cheapest_total_path is >> considered when we chose optimal cheapest_total path. The same works >> for the fractional path - get_cheapest_fractional_path gives us the >> most optimal fractional path and probes cheapest_total_path too. >> As above, not sure about min-startup case for now. I can imagine >> MergeAppend over sophisticated subquery: non-sorted includes highly >> parameterised JOINs and the alternative (with pathkeys) includes >> HashJoin, drastically increasing startup cost. It is only a theory, of >> course. So, lets discover how min-startup works. > After a second thought I have caught your idea. I agree that for a > fractional path it have no sense to choose any other path except a > cheapest total one. > There are the modified patch in the attachment. > > Also, to be more objective, I propose to use examples in argumentation > - something like in attached test2.sql script. Hi. I've looked through the new patch and found minor inconsistencies in get_cheapest_path_for_pathkeys_ext() and get_cheapest_fractional_path_for_pathkeys_ext(). In get_cheapest_fractional_path_for_pathkeys_ext() we check that base_path is not NULL: path = get_cheapest_fractional_path_for_pathkeys(rel->pathlist, pathkeys, required_outer, fraction); base_path = rel->cheapest_total_path; /* Stop here if the path doesn't satisfy necessary conditions */ if (!base_path || !bms_is_subset(PATH_REQ_OUTER(base_path), required_outer)) return path; But it seems base_path can't be NULL (as add_paths_to_append_rel() is called after set_rel_pathlist() for childrels). However, path can. Can we make these two functions, get_cheapest_path_for_pathkeys_ext() and get_cheapest_fractional_path_for_pathkeys_ext(), more similar? Also, we check base_path for required_outer and require_parallel_safe, but if the cheapest path for pathkeys is NULL, these checks are not performed. Luckily, they seem to be no-ops anyway due to cheapest_total->param_info == NULL and the function arguments being NULL (required_outer) and false (require_parallel_safe). Should we do something about this? I don't know - perhaps remove these misleading arguments? Now, if we return cheapest_total_path from get_cheapest_fractional_path_for_pathkeys_ext() when cheapest paths for the pathkeys don't exist, do the following lines /* * If we found no path with matching pathkeys, use the * cheapest total path instead. * * XXX We might consider partially sorted paths too (with an * incremental sort on top). But we'd have to build all the * incremental paths, do the costing etc. */ if (!cheapest_fractional) cheapest_fractional = cheapest_total; become a no-op? And we do return a non-NULL path from get_cheapest_fractional_path_for_pathkeys_ext(), as it seems we return either cheapest_total_path or the cheapest fractional path from it. -- Best regards, Alexander Pyhalov, Postgres Professional
On 4/29/25 19:25, Alexander Pyhalov wrote: > Andrei Lepikhov wrote 2025-04-29 16:52: > But it seems, base_path can't be NULL Correct. Fixed. > > Also we check base_path for required_outer and require_parallel_safe, > but if cheapest path for pathkeys is NULL, these checks are not > performed. Thanks. Fixed. > Luckily, they seen to be no-op anyway due to cheapest_total- > >param_info == NULL and function arguments being NULL (required_outer) > and false (require_parallel_safe). Should we do something about this? > Don't know, perhaps, remove these misleading arguments? The main reason why I introduced these _ext routines was that this schema may be used somewhere else. And in that case parameterised paths may exist. Who knows, maybe parameterised paths will be introduced for merge append too? > become no-op? And we do return non-null path from > get_cheapest_fractional_path_for_pathkeys_ext(), as it seems we return > either cheapest_total_path or cheapest fractional path from > get_cheapest_fractional_path_for_pathkeys_ext(). Ok. Let's check the next version of the patch in the attachment. -- regards, Andrei Lepikhov
Attachments
On 5/5/25 13:38, Andrei Lepikhov wrote: > Let's check next version of the patch in the attachment. Oops, I forgot some tails - see this new version. -- regards, Andrei Lepikhov
Attachments
Andrei Lepikhov wrote 2025-05-05 14:38: > On 4/29/25 19:25, Alexander Pyhalov wrote: >> Andrei Lepikhov wrote 2025-04-29 16:52: >> But it seems, base_path can't be NULL > Correct. Fixed. > >> >> Also we check base_path for required_outer and require_parallel_safe, >> but if cheapest path for pathkeys is NULL, these checks are not >> performed. > Thanks. Fixed. > >> Luckily, they seen to be no-op anyway due to cheapest_total- > >> >param_info == NULL and function arguments being NULL (required_outer) >> and false (require_parallel_safe). Should we do something about this? >> Don't know, perhaps, remove these misleading arguments? > The main reason why I introduced these _ext routines was that this > schema may be used somewhere else. And in that case parameterised paths > may exist. Who knows, may be parameterised paths will be introduced for > merge append too? > >> become no-op? And we do return non-null path from >> get_cheapest_fractional_path_for_pathkeys_ext(), as it seems we return >> either cheapest_total_path or cheapest fractional path from >> get_cheapest_fractional_path_for_pathkeys_ext(). > Ok. > > Let's check next version of the patch in the attachment. Hi. Both functions are a bit different: Path *base_path = rel->cheapest_total_path; Path *path; path = get_cheapest_path_for_pathkeys(rel->pathlist, pathkeys, required_outer, cost_criterion, require_parallel_safe); /* Stop here if the path doesn't satisfy necessary conditions */ if ((require_parallel_safe && !base_path->parallel_safe) || !bms_is_subset(PATH_REQ_OUTER(base_path), required_outer)) return path; Here the comment speaks about "the path", but the check is performed on base_path - could it be misleading? In get_cheapest_fractional_path_for_pathkeys_ext(): path = get_cheapest_fractional_path_for_pathkeys(rel->pathlist, pathkeys, required_outer, fraction); base_path = rel->cheapest_total_path; /* Stop here if the path doesn't satisfy necessary conditions */ if (!bms_is_subset(PATH_REQ_OUTER(base_path), required_outer)) return path; Here "the path" also refers to base_path, but here at least it's the last path we've just looked at. Should we make the base_path initialization consistent and fix the comment a bit, so that there's no possible ambiguity and it's evident that it refers to base_path? Also, the logic differs a bit if path is NULL. In get_cheapest_path_for_pathkeys_ext() we explicitly check for path being NULL; in get_cheapest_fractional_path_for_pathkeys_ext() we do so only after calculating the sort cost. I've tried to fix the comments a bit and unified the function definitions. -- Best regards, Alexander Pyhalov, Postgres Professional
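Putting the two excerpts together, a consolidated sketch of the helper under discussion could read roughly as follows. This is only an illustration (using the cost_sort() call style from the patch quoted at the top of the thread); the attached patch is authoritative, and the fractional variant would be analogous, using get_cheapest_fractional_path_for_pathkeys() and compare_fractional_path_costs() instead.

static Path *
get_cheapest_path_for_pathkeys_ext(PlannerInfo *root, RelOptInfo *rel,
                                   List *pathkeys, Relids required_outer,
                                   CostSelector cost_criterion,
                                   bool require_parallel_safe)
{
    Path       *base_path = rel->cheapest_total_path;
    Path       *path;
    Path        sort_path;      /* dummy entry, used only for costing */

    /* Cheapest existing path that already satisfies the pathkeys, if any */
    path = get_cheapest_path_for_pathkeys(rel->pathlist, pathkeys,
                                          required_outer, cost_criterion,
                                          require_parallel_safe);

    /* The base path must satisfy the caller's requirements as well */
    if ((require_parallel_safe && !base_path->parallel_safe) ||
        !bms_is_subset(PATH_REQ_OUTER(base_path), required_outer))
        return path;

    /* Estimate an explicit Sort on top of the cheapest total path */
    cost_sort(&sort_path, root, pathkeys,
              base_path->total_cost, base_path->rows,
              base_path->pathtarget->width,
              0.0, work_mem, -1.0);

    /* Prefer the base path if sorting it explicitly looks cheaper */
    if (path == NULL ||
        compare_path_costs(&sort_path, path, cost_criterion) < 0)
        path = base_path;       /* the caller must put a Sort node on top */

    return path;
}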
Attachments
On 5/5/2025 15:56, Alexander Pyhalov wrote: > Andrei Lepikhov wrote 2025-05-05 14:38: > Also logic a bit differs if path is NULL. In > get_cheapest_path_for_pathkeys_ext() we explicitly check for path being > NULL, in get_cheapest_fractional_path_for_pathkeys_ext() only after > calculating sort cost. > > I've tried to fix comments a bit and unified functions definitions. Generally it seems ok; I'm not a native speaker, so I won't judge the comments. But: if (base_path && path != base_path) What is the case in your mind where the base_path pointer may still be NULL at that point? -- regards, Andrei Lepikhov
Andrei Lepikhov wrote 2025-05-07 08:02: > On 5/5/2025 15:56, Alexander Pyhalov wrote: >> Andrei Lepikhov wrote 2025-05-05 14:38: >> Also logic a bit differs if path is NULL. In >> get_cheapest_path_for_pathkeys_ext() we explicitly check for path >> being NULL, in get_cheapest_fractional_path_for_pathkeys_ext() only >> after calculating sort cost. >> >> I've tried to fix comments a bit and unified functions definitions. > Generally seems ok, I'm not a native speaker to judge the comments. > But: > if (base_path && path != base_path) > > What is the case in your mind where the base_path pointer still may be > null at that point? Hi. It seems that if some childrel doesn't have a valid pathlist, subpaths_valid would be false in add_paths_to_append_rel() and generate_orderedappend_paths() will not be called. So when we iterate over live_childrels, all of them will have a cheapest_total path. This is why we can assert that base_path is not NULL. -- Best regards, Alexander Pyhalov, Postgres Professional
Attachments
On 7/5/2025 08:57, Alexander Pyhalov wrote: > Andrei Lepikhov wrote 2025-05-07 08:02: >> On 5/5/2025 15:56, Alexander Pyhalov wrote: >>> Andrei Lepikhov wrote 2025-05-05 14:38: >>> Also logic a bit differs if path is NULL. In >>> get_cheapest_path_for_pathkeys_ext() we explicitly check for path >>> being NULL, in get_cheapest_fractional_path_for_pathkeys_ext() only >>> after calculating sort cost. >>> >>> I've tried to fix comments a bit and unified functions definitions. >> Generally seems ok, I'm not a native speaker to judge the comments. But: >> if (base_path && path != base_path) >> >> What is the case in your mind where the base_path pointer still may be >> null at that point? > > Hi. > > It seems if some childrel doesn't have valid pathlist, subpaths_valid > would be false in add_paths_to_append_rel() > and generate_orderedappend_paths() will not be called. So when we > iterate over live_childrels, > all of them will have cheapest_total path. This is why we can assert > that base_path is not NULL. I'm not sure I understand you correctly. Under which conditions will rel->cheapest_total_path be set to NULL for a childrel? Could you provide an example? -- regards, Andrei Lepikhov
Andrei Lepikhov wrote 2025-05-07 12:03: > On 7/5/2025 08:57, Alexander Pyhalov wrote: >> Andrei Lepikhov wrote 2025-05-07 08:02: >>> On 5/5/2025 15:56, Alexander Pyhalov wrote: >>>> Andrei Lepikhov wrote 2025-05-05 14:38: >>>> Also logic a bit differs if path is NULL. In >>>> get_cheapest_path_for_pathkeys_ext() we explicitly check for path >>>> being NULL, in get_cheapest_fractional_path_for_pathkeys_ext() only >>>> after calculating sort cost. >>>> >>>> I've tried to fix comments a bit and unified functions definitions. >>> Generally seems ok, I'm not a native speaker to judge the comments. >>> But: >>> if (base_path && path != base_path) >>> >>> What is the case in your mind where the base_path pointer still may >>> be null at that point? >> >> Hi. >> >> It seems if some childrel doesn't have valid pathlist, subpaths_valid >> would be false in add_paths_to_append_rel() >> and generate_orderedappend_paths() will not be called. So when we >> iterate over live_childrels, >> all of them will have cheapest_total path. This is why we can assert >> that base_path is not NULL. > I'm not sure I understand you correctly. Under which conditions will > rel->cheapest_total_path be set to NULL for a childrel? Could you > provide an example? Sorry, perhaps I was not clear enough. I've stated the opposite - it seems we can be sure that it's not NULL. -- Best regards, Alexander Pyhalov, Postgres Professional
Hi, Alexander! On Wed, May 7, 2025 at 12:06 PM Alexander Pyhalov <a.pyhalov@postgrespro.ru> wrote: > Andrei Lepikhov wrote 2025-05-07 12:03: > > On 7/5/2025 08:57, Alexander Pyhalov wrote: > >> Andrei Lepikhov wrote 2025-05-07 08:02: > >>> On 5/5/2025 15:56, Alexander Pyhalov wrote: > >>>> Andrei Lepikhov wrote 2025-05-05 14:38: > >>>> Also logic a bit differs if path is NULL. In > >>>> get_cheapest_path_for_pathkeys_ext() we explicitly check for path > >>>> being NULL, in get_cheapest_fractional_path_for_pathkeys_ext() only > >>>> after calculating sort cost. > >>>> > >>>> I've tried to fix comments a bit and unified functions definitions. > >>> Generally seems ok, I'm not a native speaker to judge the comments. > >>> But: > >>> if (base_path && path != base_path) > >>> > >>> What is the case in your mind where the base_path pointer still may > >>> be null at that point? > >> > >> Hi. > >> > >> It seems if some childrel doesn't have valid pathlist, subpaths_valid > >> would be false in add_paths_to_append_rel() > >> and generate_orderedappend_paths() will not be called. So when we > >> iterate over live_childrels, > >> all of them will have cheapest_total path. This is why we can assert > >> that base_path is not NULL. > > I'm not sure I understand you correctly. Under which conditions will > > rel->cheapest_total_path be set to NULL for a childrel? Could you > > provide an example? > > Sorry, perhaps I was not clear enough. I've stated the opposite - it > seems we can be sure that it's not NULL. Thank you for your work on this subject! I have the following question. I see the patch changes some existing plans from Sort(Append(...)) to MergeAppend(Sort(), ..., Sort(...)) or even Materialize(MergeAppend(Sort(), ..., Sort(...))). This should be some problem in cost_sort(). Otherwise, that would mean that the Sort node doesn't know how to do its job: explicitly splitting the dataset into pieces and then merging the sorted results appears to be cheaper, but the Sort node contains a merge-sort algorithm inside and is supposed to be more efficient. Could you, please, revise the patch to avoid these unwanted changes? ------ Regards, Alexander Korotkov Supabase
On Tue, Jun 3, 2025 at 4:23 PM Andrei Lepikhov <lepihov@gmail.com> wrote: > On 2/6/2025 20:21, Alexander Korotkov wrote: > > I have the following question. I see patch changes some existing > > plans from Sort(Append(...)) to MergeAppend(Sort(), ..., Sort(...)) or > > even Materialize(MergeAppend(Sort(), ..., Sort(...))). This should be > > some problem in cost_sort(). Otherwise, that would mean that Sort > > node doesn't know how to do its job: explicit splitting dataset into > > pieces then merging sorting result appears to be cheaper, but Sort > > node contains merge-sort algorithm inside and it's supposed to be more > > efficient. Could you, please, revise the patch to avoid these > > unwanted changes? > I think, this issue is related to corner-cases of the > compare_path_costs_fuzzily. > > Let's glance into one of the problematic queries: > EXPLAIN (COSTS ON) > SELECT c collate "C", count(c) FROM pagg_tab3 GROUP BY c collate "C" > ORDER BY 1; > > if you play with the plan, you can find that total_cost of the > Sort->Append path is cheaper: > > Sort (cost=2.40..2.41 rows=4 width=40) > -> Append (cost=1.15..2.36 rows=4 width=40) > Merge Append (cost=2.37..2.42 rows=4 width=40) > > But the difference is less than fuzz_factor. In this case, Postgres > probes startup_cost, which is obviously less for the MergeAppend strategy. > This is a good decision, and I think it should stay as is. > What can we do here? We might change the test to increase the cost gap. > However, while designing this patch, I skimmed through each broken query > and didn't find a reason to specifically shift to the Sort->Append > strategy, as it tested things that were not dependent on Append or Sort. > > To establish a stable foundation for discussion, I conducted simple > tests - see, for example, a couple of queries in the attachment. As I > see it, Sort->Append works faster: in my test bench, it takes 1250ms on > average versus 1430ms, and it also has lower costs - the same for data > with and without massive numbers of duplicates. Playing with sizes of > inputs, I see the same behaviour. I ran your tests. For the Sort(Append()) case I've got actual time=811.047..842.473. For the MergeAppend case I've got actual time=723.678..967.004. That looks interesting. At some point we probably should teach our Sort node to start returning tuples before finishing the last merge stage. However, I think the costs are not adequate to the timing. Our cost model predicts that the startup cost of MergeAppend is less than the startup cost of Sort(Append()). And that's correct. However, in fact the total time of MergeAppend is bigger than the total time of Sort(Append()). The differences in these two cases are comparable. I think we need to adjust our cost_sort() to reflect that. ------ Regards, Alexander Korotkov Supabase
On 3/6/2025 15:38, Alexander Korotkov wrote: > On Tue, Jun 3, 2025 at 4:23 PM Andrei Lepikhov <lepihov@gmail.com> wrote: >> To establish a stable foundation for discussion, I conducted simple >> tests - see, for example, a couple of queries in the attachment. As I >> see it, Sort->Append works faster: in my test bench, it takes 1250ms on >> average versus 1430ms, and it also has lower costs - the same for data >> with and without massive numbers of duplicates. Playing with sizes of >> inputs, I see the same behaviour. > > I run your tests. For Sort(Append()) case I've got actual > time=811.047..842.473. For MergeAppend case I've got actual time > actual time=723.678..967.004. That looks interesting. At some point > we probably should teach our Sort node to start returning tuple before > finishing the last merge stage. > > However, I think costs are not adequate to the timing. Our cost model > predicts that startup cost of MergeAppend is less than startup cost of > Sort(Append()). And that's correct. However, in fast total time of > MergeAppend is bigger than total time of Sort(Append()). The > differences in these two cases are comparable. I think we need to > just our cost_sort() to reflect that. Could you explain your idea? As I see (and have shown in the previous message), the total cost of Sort->Append is lower than that of MergeAppend->Sort. Additionally, as I mentioned earlier, the primary reason for choosing MergeAppend in the regression test was a slight total cost difference that triggered the startup cost comparison. Could you show the query and its explain that is the subject of concern for you? -- regards, Andrei Lepikhov
On Tue, Jun 3, 2025 at 4:53 PM Andrei Lepikhov <lepihov@gmail.com> wrote: > On 3/6/2025 15:38, Alexander Korotkov wrote: > > On Tue, Jun 3, 2025 at 4:23 PM Andrei Lepikhov <lepihov@gmail.com> wrote: > >> To establish a stable foundation for discussion, I conducted simple > >> tests - see, for example, a couple of queries in the attachment. As I > >> see it, Sort->Append works faster: in my test bench, it takes 1250ms on > >> average versus 1430ms, and it also has lower costs - the same for data > >> with and without massive numbers of duplicates. Playing with sizes of > >> inputs, I see the same behaviour. > > > > I run your tests. For Sort(Append()) case I've got actual > > time=811.047..842.473. For MergeAppend case I've got actual time > > actual time=723.678..967.004. That looks interesting. At some point > > we probably should teach our Sort node to start returning tuple before > > finishing the last merge stage. > > > > However, I think costs are not adequate to the timing. Our cost model > > predicts that startup cost of MergeAppend is less than startup cost of > > Sort(Append()). And that's correct. However, in fast total time of > > MergeAppend is bigger than total time of Sort(Append()). The > > differences in these two cases are comparable. I think we need to > > just our cost_sort() to reflect that. > May you explain your idea? As I see (and have shown in the previous > message), the total cost of the Sort->Append is fewer than > MergeAppend->Sort. > Additionally, as I mentioned earlier, the primary reason for choosing > MergeAppend in the regression test was a slight total cost difference > that triggered the startup cost comparison. > May you show the query and its explain, that is a subject of concern for > you? My point is that the difference in total cost is very small. For small datasets it could even be within the fuzzy limit. However, in practice the difference in total time is as big as the difference in startup time. So, it would be good to make the total cost difference bigger. ------ Regards, Alexander Korotkov Supabase
On 3/6/2025 16:05, Alexander Korotkov wrote: > On Tue, Jun 3, 2025 at 4:53 PM Andrei Lepikhov <lepihov@gmail.com> wrote: >> Additionally, as I mentioned earlier, the primary reason for choosing >> MergeAppend in the regression test was a slight total cost difference >> that triggered the startup cost comparison. >> May you show the query and its explain, that is a subject of concern for >> you? > > My point is that difference in total cost is very small. For small > datasets it could be even within the fuzzy limit. However, in > practice difference in total time is as big as difference in startup > time. So, it would be good to make the total cost difference bigger. For me, it seems like a continuation of the 7d8ac98 discussion. We may charge a small fee for MergeAppend to adjust the balance, of course. However, I think this small change requires a series of benchmarks to determine how it affects the overall cost balance. Without examples, it is hard to say how important this issue is and whether it is worth commencing such work. -- regards, Andrei Lepikhov
On Tue, Jun 3, 2025 at 5:35 PM Andrei Lepikhov <lepihov@gmail.com> wrote: > On 3/6/2025 16:05, Alexander Korotkov wrote: > > On Tue, Jun 3, 2025 at 4:53 PM Andrei Lepikhov <lepihov@gmail.com> wrote: > >> Additionally, as I mentioned earlier, the primary reason for choosing > >> MergeAppend in the regression test was a slight total cost difference > >> that triggered the startup cost comparison. > >> May you show the query and its explain, that is a subject of concern for > >> you? > > > > My point is that difference in total cost is very small. For small > > datasets it could be even within the fuzzy limit. However, in > > practice difference in total time is as big as difference in startup > > time. So, it would be good to make the total cost difference bigger. > For me, it seems like a continuation of the 7d8ac98 discussion. We may > charge a small fee for MergeAppend to adjust the balance, of course. > However, I think this small change requires a series of benchmarks to > determine how it affects the overall cost balance. Without examples it > is hard to say how important this issue is and its worthiness to > commence such work. Yes, I think it's fair to charge the MergeAppend node. We currently cost it similarly to the Sort merge stage, but it's clearly more expensive. It operates at the executor level, dealing with slots etc., while the Sort node has a set of lower-level optimizations. ------ Regards, Alexander Korotkov Supabase
On 4/6/2025 00:41, Alexander Korotkov wrote: > On Tue, Jun 3, 2025 at 5:35 PM Andrei Lepikhov <lepihov@gmail.com> wrote: >> On 3/6/2025 16:05, Alexander Korotkov wrote: >>> On Tue, Jun 3, 2025 at 4:53 PM Andrei Lepikhov <lepihov@gmail.com> wrote: >>>> Additionally, as I mentioned earlier, the primary reason for choosing >>>> MergeAppend in the regression test was a slight total cost difference >>>> that triggered the startup cost comparison. >>>> May you show the query and its explain, that is a subject of concern for >>>> you? >>> >>> My point is that difference in total cost is very small. For small >>> datasets it could be even within the fuzzy limit. However, in >>> practice difference in total time is as big as difference in startup >>> time. So, it would be good to make the total cost difference bigger. >> For me, it seems like a continuation of the 7d8ac98 discussion. We may >> charge a small fee for MergeAppend to adjust the balance, of course. >> However, I think this small change requires a series of benchmarks to >> determine how it affects the overall cost balance. Without examples it >> is hard to say how important this issue is and its worthiness to >> commence such work. > > Yes, I think it's fair to charge the MergeAppend node. We currently > cost it similarly to Sort merge stage, but it's clearly more > expensive. It dealing on the executor level dealing with Slot's etc, > while Sort node have a set of lower level optimizations. As I see it, it makes sense to charge MergeAppend for the heap operation or, more logically, to reduce the charge on Sort because of its internal optimisations. Playing with both approaches, I found that they break many more tests than the current patch does. Hence, additional analysis of the results is needed to understand how correct these changes are. -- regards, Andrei Lepikhov
On 4/6/2025 00:41, Alexander Korotkov wrote: > On Tue, Jun 3, 2025 at 5:35 PM Andrei Lepikhov <lepihov@gmail.com> wrote: >> For me, it seems like a continuation of the 7d8ac98 discussion. We may >> charge a small fee for MergeAppend to adjust the balance, of course. >> However, I think this small change requires a series of benchmarks to >> determine how it affects the overall cost balance. Without examples it >> is hard to say how important this issue is and its worthiness to >> commence such work. > > Yes, I think it's fair to charge the MergeAppend node. We currently > cost it similarly to Sort merge stage, but it's clearly more > expensive. It dealing on the executor level dealing with Slot's etc, > while Sort node have a set of lower level optimizations. After conducting additional research, I concluded that you are correct, and the current cost model doesn't allow the optimiser to detect the best option. A simple test with a full scan and sort of a partitioned table (see attachment) shows that the query plan prefers small per-partition sorts merged by the MergeAppend node. I got the following results for different numbers of tuples to be sorted (in the range from 10 tuples to 1E8): EXPLAIN SELECT * FROM test ORDER BY y; 1E1: Sort (cost=9.53..9.57 rows=17 width=8) 1E2: Sort (cost=20.82..21.07 rows=100) 1E3: Merge Append (cost=56.19..83.69 rows=1000) 1E4: Merge Append (cost=612.74..887.74 rows=10000) 1E5: Merge Append (cost=7754.19..10504.19 rows=100000) 1E6: Merge Append (cost=94092.25..121592.25 rows=1000000) 1E7: Merge Append (cost=1106931.22..1381931.22 rows=10000000) 1E8: Merge Append (cost=14097413.40..16847413.40 rows=100000000) That happens because both total costs lie within the fuzzy factor gap, and the optimiser chooses the path based on the startup cost, which is obviously better for the MergeAppend case. At the same time, execution with a single Sort node dominates the MergeAppend case: 1E3: MergeAppend: 1.927 ms, Sort: 0.720 ms 1E4: MergeAppend: 10.090 ms, Sort: 7.583 ms 1E5: MergeAppend: 118.885 ms, Sort: 88.492 ms 1E6: MergeAppend: 1372.717 ms, Sort: 1106.184 ms 1E7: MergeAppend: 15103.893 ms, Sort: 13415.806 ms 1E8: MergeAppend: 176553.133 ms, Sort: 149458.787 ms Looking inside the code, I found the only difference we can employ to rationalise the cost model change: the restructuring of the heap. The siftDown routine employs two tuple comparisons to find the proper position for a tuple. So, we have grounds for changing the constant in the cost model of the merge operation: @@ -2448,7 +2448,7 @@ cost_merge_append(Path *path, PlannerInfo *root, logN = LOG2(N); /* Assumed cost per tuple comparison */ - comparison_cost = 2.0 * cpu_operator_cost; + comparison_cost = 4.0 * cpu_operator_cost; /* Heap creation cost */ startup_cost += comparison_cost * N * logN; The same change also needs to be made in the cost_gather_merge function, of course. At this moment, I'm not sure that it should be multiplied by 2 - it is a subject for further discussion. However, it fixes the current issue and adds a little additional cost to the merge operation, which, as I see it, is quite limited. Please see the new version of the patch attached. -- regards, Andrei Lepikhov
Attachments
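For reference, here is a minimal sketch of the kind of benchmark described above (the actual script is in the attachment and may differ; the table name, column types, and partition count are assumptions):

CREATE TABLE test (x int, y float8) PARTITION BY HASH (x);
CREATE TABLE test_p0 PARTITION OF test FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE test_p1 PARTITION OF test FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE test_p2 PARTITION OF test FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE test_p3 PARTITION OF test FOR VALUES WITH (MODULUS 4, REMAINDER 3);
INSERT INTO test SELECT i, random() FROM generate_series(1, 1000000) i;
VACUUM ANALYZE test;
-- No index on y, so every child needs an explicit sort; the planner chooses between
-- MergeAppend over per-partition Sorts and a single Sort over Append on cost alone.
EXPLAIN (ANALYZE) SELECT * FROM test ORDER BY y;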
On Tue, Jul 22, 2025 at 2:13 PM Andrei Lepikhov <lepihov@gmail.com> wrote:
>
> On 4/6/2025 00:41, Alexander Korotkov wrote:
> > On Tue, Jun 3, 2025 at 5:35 PM Andrei Lepikhov <lepihov@gmail.com> wrote:
> >> For me, it seems like a continuation of the 7d8ac98 discussion. We may
> >> charge a small fee for MergeAppend to adjust the balance, of course.
> >> However, I think this small change requires a series of benchmarks to
> >> determine how it affects the overall cost balance. Without examples it
> >> is hard to say how important this issue is and its worthiness to
> >> commence such work.
> >
> > Yes, I think it's fair to charge the MergeAppend node. We currently
> > cost it similarly to Sort merge stage, but it's clearly more
> > expensive. It dealing on the executor level dealing with Slot's etc,
> > while Sort node have a set of lower level optimizations.
> After conducting additional research, I concluded that you are correct,
> and the current cost model doesn't allow the optimiser to detect the
> best option. A simple test with a full scan and sort of a partitioned
> table (see attachment) shows that the query plan prefers small sortings
> merged by the MergeAppend node. I have got the following results for
> different numbers of tuples to be sorted (in the range from 10 tuples to
> 1E8):
>
> EXPLAIN SELECT * FROM test ORDER BY y;
>
> 1E1: Sort (cost=9.53..9.57 rows=17 width=8)
> 1E2: Sort (cost=20.82..21.07 rows=100)
> 1E3: Merge Append (cost=56.19..83.69 rows=1000)
> 1E4: Merge Append (cost=612.74..887.74 rows=10000)
> 1E5: Merge Append (cost=7754.19..10504.19 rows=100000)
> 1E6: Merge Append (cost=94092.25..121592.25 rows=1000000)
> 1E7: Merge Append (cost=1106931.22..1381931.22 rows=10000000)
> 1E8: Merge Append (cost=14097413.40..16847413.40 rows=100000000)
>
> That happens because both total costs lie within the fuzzy factor gap,
> and the optimiser chooses the path based on the startup cost, which is
> obviously better for the MergeAppend case.
>
> At the same time, execution, involving a single Sort node, dominates the
> MergeAppend case:
>
> 1E3: MergeAppend: 1.927 ms, Sort: 0.720 ms
> 1E4: MergeAppend: 10.090 ms, Sort: 7.583 ms
> 1E5: MergeAppend: 118.885 ms, Sort: 88.492 ms
> 1E6: MergeAppend: 1372.717 ms, Sort: 1106.184 ms
> 1E7: MergeAppend: 15103.893 ms, Sort: 13415.806 ms
> 1E8: MergeAppend: 176553.133 ms, Sort: 149458.787 ms
>
> Looking inside the code, I found the only difference we can employ to
> rationalise the cost model change: re-structuring of a heap. The
> siftDown routine employs two tuple comparisons to find the proper
> position for a tuple. So, we have objections to changing the constant in
> the cost model of the merge operation:
>
> @@ -2448,7 +2448,7 @@ cost_merge_append(Path *path, PlannerInfo *root,
> logN = LOG2(N);
>
> /* Assumed cost per tuple comparison */
> - comparison_cost = 2.0 * cpu_operator_cost;
> + comparison_cost = 4.0 * cpu_operator_cost;
>
> /* Heap creation cost */
> startup_cost += comparison_cost * N * logN;
>
> The exact change also needs to be made in the cost_gather_merge
> function, of course.
> At this moment, I'm not sure that it should be multiplied by 2 - it is a
> subject for further discussion. However, it fixes the current issue and
> adds a little additional cost to the merge operation, which, as I see
> it, is quite limited.
> Please see the new version of the patch attached.
I've checked the cost adjustment you've made. If you change the cost of top-N sort, you must also change the preceding if condition that decides when it gets selected. We currently do the switch at tuples = 2 * limit_tuples. If we apply the proposed change, we should switch at tuples = (2 * limit_tuples)^2. But also, in order for this cost to reflect reality, we must change tuplesort_puttuple_common() in the same way. These inconsistencies lead to failures in the contrib/postgres_fdw checks. But these changes don't appear to be a win. See the examples below. Top-N sort appears to be a win for LIMITs up to 500000.
# EXPLAIN ANALYZE SELECT * FROM (SELECT random() r FROM generate_series(1, 1000000) i) x ORDER BY x.r LIMIT 10000;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=112157.85..112182.85 rows=10000 width=8) (actual time=638.998..639.841 rows=10000.00 loops=1)
-> Sort (cost=112157.85..114657.85 rows=1000000 width=8) (actual time=638.996..639.340 rows=10000.00 loops=1)
Sort Key: (random())
Sort Method: quicksort Memory: 24577kB
-> Function Scan on generate_series i (cost=0.00..12500.00 rows=1000000 width=8) (actual time=126.582..205.610 rows=1000000.00 loops=1)
Planning Time: 0.118 ms
Execution Time: 653.283 ms
(7 rows)
# EXPLAIN ANALYZE SELECT * FROM (SELECT random() r FROM generate_series(1, 1000000) i) x ORDER BY x.r LIMIT 500000;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=112157.85..113407.85 rows=500000 width=8) (actual time=646.522..688.573 rows=500000.00 loops=1)
-> Sort (cost=112157.85..114657.85 rows=1000000 width=8) (actual time=646.520..663.562 rows=500000.00 loops=1)
Sort Key: (random())
Sort Method: quicksort Memory: 24577kB
-> Function Scan on generate_series i (cost=0.00..12500.00 rows=1000000 width=8) (actual time=129.028..208.936 rows=1000000.00 loops=1)
Planning Time: 0.188 ms
Execution Time: 713.738 ms
(7 rows)
# EXPLAIN ANALYZE SELECT * FROM (SELECT random() r FROM generate_series(1, 1000000) i) x ORDER BY x.r LIMIT 10000;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=78938.56..78963.56 rows=10000 width=8) (actual time=412.633..413.459 rows=10000.00 loops=1)
Buffers: shared hit=3, temp read=1709 written=1709
-> Sort (cost=78938.56..81438.56 rows=1000000 width=8) (actual time=412.631..412.969 rows=10000.00 loops=1)
Sort Key: (random())
Sort Method: top-N heapsort Memory: 769kB
Buffers: shared hit=3, temp read=1709 written=1709
-> Function Scan on generate_series i (cost=0.00..12500.00 rows=1000000 width=8) (actual time=185.892..333.233 rows=1000000.00 loops=1)
Buffers: temp read=1709 written=1709
Planning:
Buffers: shared hit=7 read=8
Planning Time: 2.058 ms
Execution Time: 416.040 ms
(12 rows)
# EXPLAIN ANALYZE SELECT * FROM (SELECT random() r FROM generate_series(1, 1000000) i) x ORDER BY x.r LIMIT 500000;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=112157.85..113407.85 rows=500000 width=8) (actual time=631.185..673.768 rows=500000.00 loops=1)
-> Sort (cost=112157.85..114657.85 rows=1000000 width=8) (actual time=631.183..649.072 rows=500000.00 loops=1)
Sort Key: (random())
Sort Method: quicksort Memory: 24577kB
-> Function Scan on generate_series i (cost=0.00..12500.00 rows=1000000 width=8) (actual time=121.274..200.453 rows=1000000.00 loops=1)
Planning Time: 0.243 ms
Execution Time: 698.841 ms
(7 rows)
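For reference, the switch point mentioned above follows from equating the per-tuple comparison costs, assuming the doubled comparison cost is also applied to the top-N heapsort path (here $c$ is the cost per comparison, $N$ the number of input tuples, and $k$ the limit_tuples bound):

$$ 2c\,\log_2(2k) = c\,\log_2 N \;\Longrightarrow\; \log_2 N = \log_2\bigl((2k)^2\bigr) \;\Longrightarrow\; N = (2k)^2 $$

So the crossover moves from roughly N = 2k under the current costing to N = (2k)^2 with the adjusted constant.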
I have another idea. cost_tuplesort() puts 2.0 under the logarithm to prefer tuplesort over heapsort. I think we can adjust cost_gather_merge() and cost_merge_append() to do the same. The 0001 patch implements that. I think the plan changes from 0001 might be reasonable since most cases deal with small rowsets. One thing concerns me: 0002 still affects one of the postgres_fdw checks. Could you please take a look?
------
Regards,
Alexander Korotkov
Supabase
Attachments
On 27/7/2025 00:51, Alexander Korotkov wrote: > On Tue, Jul 22, 2025 at 2:13 PM Andrei Lepikhov <lepihov@gmail.com > I've another idea. cost_tuplesort() puts 2.0 under logarithm to prefer > tuplesort over heapsort. I think we can adjust cost_gather_merge() and > cost_merge_append() to do the same. 0001 patch implements that. I > think the plan changes of 0001 might be reasonable since most cases deal > with small rowsets. One thing concerns me: 0002 still affects one of > the postgres_fdw checks. Could you, please, take a look? Thanks for the idea! I analysed your approach a little bit. Initially, I ran the test script I had created previously [1] and discovered that on a large scale (1e6 - 1e7 tuples), the plan still defaults to MergeAppend, which is at odds with the execution times (7190 ms for Sort+Append and 8450 ms for MergeAppend+Sort). Attempting to find out the reason, I combined all the costs into a single formula for each strategy: MergeAppend+Sort: total_cost = CO*ntuples*(1 + 2*log(ntuples)) + Ccput*0.5*ntuples + 2*CO*N*log(N) + A Sort+Append: total_cost = CO*ntuples*(1 + 2*log(ntuples)) + Ccput*0.5*ntuples + A Terms: - A - sum of total costs of underlying subtrees - CO - cpu_operator_cost - Ccput - cpu_tuple_cost - N - number of subpaths (streams) Given the significant gap in total execution time between these strategies, I believe it would be reasonable to introduce a coefficient on the equations' 'ntuples' component that keeps the gap between a big quicksort and MergeAppend's heapsort outside the fuzzy factor. Looking at papers on the value of the constant in quicksort [2] and heapsort [3], I realised that there is a difference. The constant's value varies in a wide range: 1.3-1.5 for quicksort and 2-3 for heapsort. Considering that we should change the current cost model as little as possible, not to break the balance, we may just increase the constant value for the heap sort to keep the minimum gap between strategies outside the fuzzy factor. In this case, the merge append constant should be around 3.8 - 4.0. With this minor change, we see a shift in the regression tests. Most of these changes were introduced by the new append strategy. Although I haven't analysed these changes in depth yet, I believe they are all related to the small data sets and should fade out on a larger scale. See this minor correction in the attachment. postgres_fdw tests are stable now. [1] https://github.com/danolivo/conf/blob/main/Scripts/sort-vs-mergeappend-3.sql [2] https://en.wikipedia.org/wiki/Quicksort [3] https://arxiv.org/abs/1504.01459 -- regards, Andrei Lepikhov
Attachments
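To make the fuzzy-factor argument above concrete, the two totals can be evaluated directly. The sketch below assumes the default cpu_operator_cost = 0.0025 and cpu_tuple_cost = 0.01, takes N = 4 streams, and drops the shared subtree term A since it is identical in both formulas:

SELECT ntuples,
       round(0.0025 * ntuples * (1 + 2 * log(2, ntuples)) + 0.01 * 0.5 * ntuples, 2) AS shared_terms,
       round(2 * 0.0025 * 4 * log(2, 4), 2) AS mergeappend_extra
FROM unnest(ARRAY[1000, 10000, 100000, 1000000]) AS ntuples;

The MergeAppend-only term 2*CO*N*log(N) is a constant (0.04 here), so its share of the total falls far below the planner's 1% fuzz factor as ntuples grows, leaving the startup cost to decide between the two plans.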
On Thu, Jul 31, 2025 at 5:20 PM Andrei Lepikhov <lepihov@gmail.com> wrote: > On 27/7/2025 00:51, Alexander Korotkov wrote: > > On Tue, Jul 22, 2025 at 2:13 PM Andrei Lepikhov <lepihov@gmail.com > > I've another idea. cost_tuplesort() puts 2.0 under logarithm to prefer > > tuplesort over heapsort. I think we can adjust cost_gather_merge() and > > cost_merge_append() to do the same. 0001 patch implements that. I > > think the plan changes of 0001 might be reasonable since most cases deal > > with small rowsets. One thing concerns me: 0002 still affects one of > > the postgres_fdw checks. Could you, please, take a look? > Thanks for the idea! > I analysed your approach a little bit. > Initially, I ran the test script I had created previously [1] and > discovered that on a large scale (1e6 - 1e7 tuples), the plan still > defaults to MergeAppend, which deviates from the execution time (7190 ms > for Sort+Append and 8450 ms for MergeAppend+Sort). > > Attempting to find out the reason, I combined all the costs into a > single formula for each strategy: > > MergeAppend+Sort: > total_cost =CO*ntuples*(1+2*log(ntuples)) + Ccput * 0.5 * ntuples+ > 2*CO*N*log(N) + A > Sort+Append: > total_cost = CO*ntuples*(1+2*log(ntuples))+ Ccput * 0.5 * ntuples + A > > Terms: > - A - sum of total costs of underlying subtrees > - CO - cpu_operator_cost > - Ccput - cpu_tuple_cost > - N - number of subpaths (streams) > > Given the significant gap in total execution time between these > strategies, I believe it would be reasonable to introduce a coefficient > to the equation's 'ntuples' variable component that will keep the gap > between big quicksort and MergeAppend's heapsort out of the fuzzy factor > gap. > > Discovering papers on the value of constant in quicksort [2] and > heapsort [3], I realised that there is a difference. The constant's > value varies in a wide range: 1.3-1.5 for quicksort and 2-3 for > heapsort. Considering that we should change the current cost model as > little as possible, not to break the balance, we may just increase the > constant value for the heap sort to maintain a bare minimum gap between > strategies out of the fuzzy factor. In this case, the merge append > constant should be around 3.8 - 4.0. > > With this minor change, we see a shift in the regression tests. Most of > these changes were introduced by the new append strategy. Although I > haven't analysed these changes in depth yet, I believe they are all > related to the small data sets and should fade out on a larger scale. > > See this minor correction in the attachment. postgres_fdw tests are > stable now. I have another idea. What if we allow MergeAppend paths only when at least one subpath is preordered. This trick also allow us to exclude MergeAppend(Sort) dominating Sort(Append). I see the regression tests changes now have much less volume and looks more reasonable. What do you think? Also, do you think get_cheapest_fractional_path_for_pathkeys_ext() and get_cheapest_path_for_pathkeys_ext() should consider incremental sort? The revised patch teaches them to do so. ------ Regards, Alexander Korotkov Supabase
Attachments
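As a small illustration of the situation this restriction targets, here is a hedged sketch (the table, data, and index are invented for this example and are not taken from the patch or the regression tests):

CREATE TABLE pt (a int, b int) PARTITION BY RANGE (a);
CREATE TABLE pt_p1 PARTITION OF pt FOR VALUES FROM (0) TO (1000);
CREATE TABLE pt_p2 PARTITION OF pt FOR VALUES FROM (1000) TO (2000);
CREATE INDEX ON pt_p1 (b);  -- only this child has a path already ordered by b
INSERT INTO pt SELECT i % 2000, (i * 37) % 1000 FROM generate_series(1, 100000) i;
VACUUM ANALYZE pt;
-- Under the proposed rule a MergeAppend keyed on b is still considered here, because
-- pt_p1 offers a preordered subpath; if no child had one, only the Sort-above-Append
-- shape would remain in competition.
EXPLAIN (COSTS OFF) SELECT * FROM pt ORDER BY b LIMIT 10;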
On Tue, Sep 2, 2025 at 5:26 AM Alexander Korotkov <aekorotkov@gmail.com> wrote: > I have another idea. What if we allow MergeAppend paths only when at > least one subpath is preordered. This trick also allow us to exclude > MergeAppend(Sort) dominating Sort(Append). I see the regression tests > changes now have much less volume and looks more reasonable. What do > you think? I skimmed through the test case changes, and I'm not sure all of them are actual improvements. For example: -> Append - -> Foreign Scan on ftprt1_p1 t1_1 + -> Sort + Sort Key: t1_1.a + -> Foreign Scan on ftprt1_p1 t1_1 -> Foreign Scan on ftprt1_p2 t1_2 It seems that this patch moves the sort operation for ftprt1_p1 from the remote server to local. I'm not sure if this is an improvement, or why it applies only to ftprt1_p1 and not to ftprt1_p2 (they have very similar statistics). Besides, I noticed that some plans have changed from an "Index Scan with Index Cond" to a "Seq Scan with Filter + Sort". I'm also not sure whether this change results in better performance. - Richard
On 2/9/2025 03:27, Richard Guo wrote: > On Tue, Sep 2, 2025 at 5:26 AM Alexander Korotkov <aekorotkov@gmail.com> wrote: >> I have another idea. What if we allow MergeAppend paths only when at >> least one subpath is preordered. This trick also allow us to exclude >> MergeAppend(Sort) dominating Sort(Append). I see the regression tests >> changes now have much less volume and looks more reasonable. What do >> you think? > > I skimmed through the test case changes, and I'm not sure all of them > are actual improvements. For example: > > -> Append > - -> Foreign Scan on ftprt1_p1 t1_1 > + -> Sort > + Sort Key: t1_1.a > + -> Foreign Scan on ftprt1_p1 t1_1 > -> Foreign Scan on ftprt1_p2 t1_2 > > It seems that this patch moves the sort operation for ftprt1_p1 from > the remote server to local. I'm not sure if this is an improvement, > or why it applies only to ftprt1_p1 and not to ftprt1_p2 (they have > very similar statistics). I had a look into this case. The next stuff happens. Initially, within generate_orderedappend_paths, the planner creates an Append according to the 'match_partition_order' strategy, which dominates the others. Next, pathlists of 'Foreign Scan on ftprt1_p1' and 'Foreign Scan on ftprt1_p2' are different: the first one contains two paths: 1. startup_cost: 100.000, total_cost: 103.090, pathkeys: false 2. startup_cost: 102.880, total_cost: 103.110, pathkeys: true And the second subpath has only one option to scan: startup_cost: 100.000, total_cost: 103.660, pathkeys: true Before, the optimiser always chose the path with pathkeys. However, this patch attempts to do its best by comparing ForeignScan+Sort and ForeignScan. Comparing the total path with the explicit Sort and pre-sorted one, we have: - ForeignScan+Sort: startup_cost: 103.100, total_cost: 103.105 - Presorted: startup_cost: 102.880, total_cost: 103.110 And here is the issue: a difference in the third sign after decimal point. Let's check remote estimations with and without Sort: With: LockRows (cost=2.88..2.90 rows=1 width=25) -> Sort (cost=2.88..2.89 rows=1 width=25) Sort Key: t1.a -> Seq Scan on public.fprt1_p1 t1 (cost=0.00..2.88 ... Without: LockRows (cost=0.00..2.88 rows=1 width=25) -> Seq Scan on public.fprt1_p1 t1 (cost=0.00..2.88 ... As you can see, according to these estimations, LockRows costs nothing without sorting and 0.1 with Sort. So, fluctuation was added by EXPLAIN's rounding. What to do? At first, we can do nothing and just correct the output. But I don't like unstable tests. We can adjust the query slightly to increase the estimations or improve the estimation using extended statistics. I prefer the more elegant variant with extended statistics. See the attachment for a sketch on how to stabilise the output. With this patch applied before this feature, the test output stays the same. > > Besides, I noticed that some plans have changed from an "Index Scan > with Index Cond" to a "Seq Scan with Filter + Sort". I'm also not > sure whether this change results in better performance. As you know, according to the cost model, SeqScan looks better on scans of tiny tables and full scans. I didn't delve as deeply into these cases yet as I did in the previous one, but it's clear that we're still seeing the issue with tiny tables. -- regards, Andrei Lepikhov
Attachments
On 1/9/2025 22:26, Alexander Korotkov wrote: > On Thu, Jul 31, 2025 at 5:20 PM Andrei Lepikhov <lepihov@gmail.com> wrote: >> See this minor correction in the attachment. postgres_fdw tests are >> stable now. > > I have another idea. What if we allow MergeAppend paths only when at > least one subpath is preordered. This trick also allow us to exclude > MergeAppend(Sort) dominating Sort(Append). I see the regression tests > changes now have much less volume and looks more reasonable. What do > you think? I believe a slight mistake has been made with the total_has_ordered / startup_has_ordered parameters, which has caused unnecessary test changes in inherit.out (see the updated patch in the attachment). Although not the best test in general (it depends on the autovacuum), it highlighted the case where a startup-optimal strategy is necessary, even when a fractional-optimal path is available, which may lead to a continuation of the discussion [1]. > > Also, do you think get_cheapest_fractional_path_for_pathkeys_ext() and > get_cheapest_path_for_pathkeys_ext() should consider incremental sort? > The revised patch teaches them to do so. Following 55a780e9476 [2], it should be considered, of course. [1] https://www.postgresql.org/message-id/CAPpHfduicoMCJ0b0mMvMQ5KVqLimJ7pKdxajciSF%2BP7JF31v%2BA%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAMbWs49wSNPPD%3DFOQqzjPNZ_N9EGDv%3D7-ou0dFgd0HSwP3fTAg%40mail.gmail.com -- regards, Andrei Lepikhov
Attachments
On Fri, Sep 5, 2025 at 11:45 AM Andrei Lepikhov <lepihov@gmail.com> wrote: > > On 1/9/2025 22:26, Alexander Korotkov wrote: > > On Thu, Jul 31, 2025 at 5:20 PM Andrei Lepikhov <lepihov@gmail.com> wrote: > >> See this minor correction in the attachment. postgres_fdw tests are > >> stable now. > > > > I have another idea. What if we allow MergeAppend paths only when at > > least one subpath is preordered. This trick also allow us to exclude > > MergeAppend(Sort) dominating Sort(Append). I see the regression tests > > changes now have much less volume and looks more reasonable. What do > > you think? > I believe a slight mistake has been made with the total_has_ordered / > startup_has_ordered parameters, which has caused unnecessary test > changes in inherit.out (See updated patch in the attachment). Although > not the best test in general (it depends on the autovacuum), it > highlighted the case where a startup-optimal strategy is necessary, even > when a fractional-optimal path is available, which may lead to continue > of the discussion [1].> > > Also, do you think get_cheapest_fractional_path_for_pathkeys_ext() and > > get_cheapest_path_for_pathkeys_ext() should consider incremental sort? > > The revised patch teaches them to do so. > Following 55a780e9476 [2] it should be considered, of course. Great, thank you for catching this. The diff of costs is attached. I see the costs now are better or within the fuzz factor. ------ Regards, Alexander Korotkov Supabase
Attachments
On Sun, Sep 7, 2025 at 8:26 PM Alexander Korotkov <aekorotkov@gmail.com> wrote: > Great, thank you for catching this. The diff of costs is attached. I > see the costs now are better or within the fuzz factor. Will have a review by the end of this commitfest. - Richard
On Mon, Sep 8, 2025 at 11:39 AM Richard Guo <guofenglinux@gmail.com> wrote: > On Sun, Sep 7, 2025 at 8:26 PM Alexander Korotkov <aekorotkov@gmail.com> wrote: > > Great, thank you for catching this. The diff of costs is attached. I > > see the costs now are better or within the fuzz factor. > > Will have a review by the end of this commitfest. Great, thank you, Richard! ------ Regards, Alexander Korotkov Supabase
On Mon, Sep 8, 2025 at 11:39 AM Richard Guo <guofenglinux@gmail.com> wrote: > > On Sun, Sep 7, 2025 at 8:26 PM Alexander Korotkov <aekorotkov@gmail.com> wrote: > > Great, thank you for catching this. The diff of costs is attached. I > > see the costs now are better or within the fuzz factor. > > Will have a review by the end of this commitfest. Did you manage to take a look at this patch? ------ Regards, Alexander Korotkov Supabase
On 07.09.2025 14:26, Alexander Korotkov wrote: > On Fri, Sep 5, 2025 at 11:45 AM Andrei Lepikhov <lepihov@gmail.com> wrote: >> On 1/9/2025 22:26, Alexander Korotkov wrote: >>> On Thu, Jul 31, 2025 at 5:20 PM Andrei Lepikhov <lepihov@gmail.com> wrote: >>>> See this minor correction in the attachment. postgres_fdw tests are >>>> stable now. >>> I have another idea. What if we allow MergeAppend paths only when at >>> least one subpath is preordered. This trick also allow us to exclude >>> MergeAppend(Sort) dominating Sort(Append). I see the regression tests >>> changes now have much less volume and looks more reasonable. What do >>> you think? >> I believe a slight mistake has been made with the total_has_ordered / >> startup_has_ordered parameters, which has caused unnecessary test >> changes in inherit.out (See updated patch in the attachment). Although >> not the best test in general (it depends on the autovacuum), it >> highlighted the case where a startup-optimal strategy is necessary, even >> when a fractional-optimal path is available, which may lead to continue >> of the discussion [1].> >>> Also, do you think get_cheapest_fractional_path_for_pathkeys_ext() and >>> get_cheapest_path_for_pathkeys_ext() should consider incremental sort? >>> The revised patch teaches them to do so. >> Following 55a780e9476 [2] it should be considered, of course. > Great, thank you for catching this. The diff of costs is attached. I > see the costs now are better or within the fuzz factor. > Hi! I looked at regression test changes but one of them confused me. Example where the plan shape changed: EXPLAIN (COSTS OFF) SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT * FROM prt2 WHERE b > 250) t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b; -- before QUERY PLAN ----------------------------------------------------------- Incremental Sort Sort Key: prt1.a, prt2.b Presorted Key: prt1.a -> Merge Left Join Merge Cond: (prt1.a = prt2.b) -> Sort Sort Key: prt1.a -> Append -> Seq Scan on prt1_p1 prt1_1 Filter: ((a < 450) AND (b = 0)) -> Seq Scan on prt1_p2 prt1_2 Filter: ((a < 450) AND (b = 0)) -> Sort Sort Key: prt2.b -> Append -> Seq Scan on prt2_p2 prt2_1 Filter: (b > 250) -> Seq Scan on prt2_p3 prt2_2 Filter: (b > 250) (19 rows) -- now QUERY PLAN ----------------------------------------------------------------- Sort Sort Key: prt1.a, prt2.b -> Merge Right Join Merge Cond: (prt2.b = prt1.a) -> Append -> Sort Sort Key: prt2_1.b -> Seq Scan on prt2_p2 prt2_1 Filter: (b > 250) -> Sort Sort Key: prt2_2.b -> Seq Scan on prt2_p3 prt2_2 Filter: (b > 250) -> Materialize -> Append -> Sort Sort Key: prt1_1.a -> Seq Scan on prt1_p1 prt1_1 Filter: ((a < 450) AND (b = 0)) -> Sort Sort Key: prt1_2.a -> Seq Scan on prt1_p2 prt1_2 Filter: ((a < 450) AND (b = 0)) (23 rows) Previously we had incremental sort on (t1.a, t2.b) with prt1.a already presorted; now we sort both t1.a and t2.b after a merge right join. It looks inefficiently or I missed something? Other tests looked fine for me.
Hi, Alena! On Fri, Oct 3, 2025 at 1:42 AM Alena Rybakina <a.rybakina@postgrespro.ru> wrote: > On 07.09.2025 14:26, Alexander Korotkov wrote: > > On Fri, Sep 5, 2025 at 11:45 AM Andrei Lepikhov <lepihov@gmail.com> wrote: > >> On 1/9/2025 22:26, Alexander Korotkov wrote: > >>> On Thu, Jul 31, 2025 at 5:20 PM Andrei Lepikhov <lepihov@gmail.com> wrote: > >>>> See this minor correction in the attachment. postgres_fdw tests are > >>>> stable now. > >>> I have another idea. What if we allow MergeAppend paths only when at > >>> least one subpath is preordered. This trick also allow us to exclude > >>> MergeAppend(Sort) dominating Sort(Append). I see the regression tests > >>> changes now have much less volume and looks more reasonable. What do > >>> you think? > >> I believe a slight mistake has been made with the total_has_ordered / > >> startup_has_ordered parameters, which has caused unnecessary test > >> changes in inherit.out (See updated patch in the attachment). Although > >> not the best test in general (it depends on the autovacuum), it > >> highlighted the case where a startup-optimal strategy is necessary, even > >> when a fractional-optimal path is available, which may lead to continue > >> of the discussion [1].> > >>> Also, do you think get_cheapest_fractional_path_for_pathkeys_ext() and > >>> get_cheapest_path_for_pathkeys_ext() should consider incremental sort? > >>> The revised patch teaches them to do so. > >> Following 55a780e9476 [2] it should be considered, of course. > > Great, thank you for catching this. The diff of costs is attached. I > > see the costs now are better or within the fuzz factor. > > > Hi! I looked at regression test changes but one of them confused me. > > Example where the plan shape changed: > > EXPLAIN (COSTS OFF) > SELECT t1.a, t2.b > FROM (SELECT * FROM prt1 WHERE a < 450) t1 > LEFT JOIN (SELECT * FROM prt2 WHERE b > 250) t2 > ON t1.a = t2.b > WHERE t1.b = 0 > ORDER BY t1.a, t2.b; > > -- before > QUERY PLAN > ----------------------------------------------------------- > Incremental Sort > Sort Key: prt1.a, prt2.b > Presorted Key: prt1.a > -> Merge Left Join > Merge Cond: (prt1.a = prt2.b) > -> Sort > Sort Key: prt1.a > -> Append > -> Seq Scan on prt1_p1 prt1_1 > Filter: ((a < 450) AND (b = 0)) > -> Seq Scan on prt1_p2 prt1_2 > Filter: ((a < 450) AND (b = 0)) > -> Sort > Sort Key: prt2.b > -> Append > -> Seq Scan on prt2_p2 prt2_1 > Filter: (b > 250) > -> Seq Scan on prt2_p3 prt2_2 > Filter: (b > 250) > (19 rows) > > -- now > QUERY PLAN > ----------------------------------------------------------------- > Sort > Sort Key: prt1.a, prt2.b > -> Merge Right Join > Merge Cond: (prt2.b = prt1.a) > -> Append > -> Sort > Sort Key: prt2_1.b > -> Seq Scan on prt2_p2 prt2_1 > Filter: (b > 250) > -> Sort > Sort Key: prt2_2.b > -> Seq Scan on prt2_p3 prt2_2 > Filter: (b > 250) > -> Materialize > -> Append > -> Sort > Sort Key: prt1_1.a > -> Seq Scan on prt1_p1 prt1_1 > Filter: ((a < 450) AND (b = 0)) > -> Sort > Sort Key: prt1_2.a > -> Seq Scan on prt1_p2 prt1_2 > Filter: ((a < 450) AND (b = 0)) > (23 rows) > > Previously we had incremental sort on (t1.a, t2.b) with prt1.a already > presorted; now we sort both t1.a and t2.b after a merge right join. It > looks inefficiently or I missed something? > > Other tests looked fine for me. Thank you for taking a look at this. According to our cost model incremental sort has additional overhead but gives huge wins on large row sets. The row sets here are very small. You can get from [1] that the total cost became smaller. 
Links. 1. https://www.postgresql.org/message-id/CAPpHfdsn_mPy1v6Gf8rmdkBDsDLU%2B%3DJ4M4sBzgaFv21cWruZFA%40mail.gmail.com ------ Regards, Alexander Korotkov Supabase
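To see the trade-off described above, one can compare the two sort strategies on differently sized inputs; the table, index, and data below are made up purely for illustration:

CREATE TABLE inc_demo (a int, b int);
CREATE INDEX ON inc_demo (a);
INSERT INTO inc_demo SELECT i % 100, i FROM generate_series(1, 1000000) i;
VACUUM ANALYZE inc_demo;
-- On a large row set the presorted prefix usually pays off, so an Incremental Sort
-- over the index tends to be picked; on a handful of rows a plain Sort often wins,
-- since the incremental machinery carries extra per-group overhead.
EXPLAIN SELECT * FROM inc_demo ORDER BY a, b;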
On Thu, Oct 2, 2025 at 11:49 PM Alexander Korotkov <aekorotkov@gmail.com> wrote: > On Mon, Sep 8, 2025 at 11:39 AM Richard Guo <guofenglinux@gmail.com> wrote: > > Will have a review by the end of this commitfest. > Did you manage to take a look at this patch? Sorry, I haven't had a chance to review it yet, but it's on my to-do list. I'll get to it as soon as I can. - Richard