Обсуждение: [MASSMAIL]apply_scanjoin_target_to_paths and partitionwise join

Поиск
Список
Период
Сортировка

[MASSMAIL]apply_scanjoin_target_to_paths and partitionwise join

От
Ashutosh Bapat
Дата:
Hi All,
Per below code and comment in apply_scanjoin_target_to_paths(), the function zaps all the paths of a partitioned relation.
/*
* If the rel is partitioned, we want to drop its existing paths and
* generate new ones.  This function would still be correct if we kept the
* existing paths: we'd modify them to generate the correct target above
* the partitioning Append, and then they'd compete on cost with paths
* generating the target below the Append
... snip ...
*/
if (rel_is_partitioned)
rel->pathlist = NIL;

Later the function adjusts the targets of paths in child relations and constructs Append paths from them. That works for simple partitioned relations but not for join between partitioned relations. When enable_partitionwise_join is true, the joinrel representing a join between partitioned relations may have join paths joining append paths and Append paths containing child join paths. Once we zap the pathlist, the only paths that can be computed again are the Append paths. If the optimal path, before applying the new target, was a join of append paths it will be lost forever. This will result in choosing a suboptimal Append path.

We have one such query in our regression set.

SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM (plt1_adv t1 LEFT JOIN plt2_adv t2 ON (t1.c = t2.c)) FULL JOIN plt3_adv t3 ON (t1.c = t3.c) WHERE coalesce(t1.a, 0    ) % 5 != 3 AND coalesce(t1.a, 0) % 5 != 4 ORDER BY t1.c, t1.a, t2.a, t3.a;

For this query, the cheapest Append of Joins path has cost 24.97..25.57 and the cheapest Join of Appends path has cost 21.29..21.81. The latter should be chosen even though enable_partitionwise_join is ON. But this function chooses the first.

The solution is to zap the pathlists only for simple partitioned relations like the attached patch.

With this patch above query does not choose non-partitionwise join path and partition_join test fails. That's expected. But we need to replace that query with some query which uses partitionwise join while maintaining the conditions of the test as explained in the comment above that query. I have tried a few variations but without success. Suggestions welcome.

The problem is reproducible on PG 15. The patch is based on 15_STABLE branch. But the problem exists in recent branches as well.

--
Best Wishes,
Ashutosh Bapat
Вложения

Re: apply_scanjoin_target_to_paths and partitionwise join

От
Ashutosh Bapat
Дата:


On Thu, Apr 11, 2024 at 12:07 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
Hi All,
Per below code and comment in apply_scanjoin_target_to_paths(), the function zaps all the paths of a partitioned relation.
/*
* If the rel is partitioned, we want to drop its existing paths and
* generate new ones.  This function would still be correct if we kept the
* existing paths: we'd modify them to generate the correct target above
* the partitioning Append, and then they'd compete on cost with paths
* generating the target below the Append
... snip ...
*/
if (rel_is_partitioned)
rel->pathlist = NIL;

Later the function adjusts the targets of paths in child relations and constructs Append paths from them. That works for simple partitioned relations but not for join between partitioned relations. When enable_partitionwise_join is true, the joinrel representing a join between partitioned relations may have join paths joining append paths and Append paths containing child join paths. Once we zap the pathlist, the only paths that can be computed again are the Append paths. If the optimal path, before applying the new target, was a join of append paths it will be lost forever. This will result in choosing a suboptimal Append path.

We have one such query in our regression set.

SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM (plt1_adv t1 LEFT JOIN plt2_adv t2 ON (t1.c = t2.c)) FULL JOIN plt3_adv t3 ON (t1.c = t3.c) WHERE coalesce(t1.a, 0    ) % 5 != 3 AND coalesce(t1.a, 0) % 5 != 4 ORDER BY t1.c, t1.a, t2.a, t3.a;

For this query, the cheapest Append of Joins path has cost 24.97..25.57 and the cheapest Join of Appends path has cost 21.29..21.81. The latter should be chosen even though enable_partitionwise_join is ON. But this function chooses the first.

The solution is to zap the pathlists only for simple partitioned relations like the attached patch.

With this patch above query does not choose non-partitionwise join path and partition_join test fails. That's expected. But we need to replace that query with some query which uses partitionwise join while maintaining the conditions of the test as explained in the comment above that query. I have tried a few variations but without success. Suggestions welcome.

The problem is reproducible on PG 15. The patch is based on 15_STABLE branch. But the problem exists in recent branches as well.



After sending email I found another thread where this problem was discussed [1]. Tom has the same explanation as mine however he favoured not doing anything about it. Since the rationale was based on the cost and not actual performance, it was fine at that time. At EDB we have a customer case where partitionwise join plan is worse than non-partitionwise join plan. The suboptimal plan is chosen because of the above code. I think we should fix the problem.


--
Best Wishes,
Ashutosh Bapat

Re: apply_scanjoin_target_to_paths and partitionwise join

От
Ashutosh Bapat
Дата:
Here's patch with

On Thu, Apr 11, 2024 at 12:24 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:


On Thu, Apr 11, 2024 at 12:07 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
Hi All,
Per below code and comment in apply_scanjoin_target_to_paths(), the function zaps all the paths of a partitioned relation.
/*
* If the rel is partitioned, we want to drop its existing paths and
* generate new ones.  This function would still be correct if we kept the
* existing paths: we'd modify them to generate the correct target above
* the partitioning Append, and then they'd compete on cost with paths
* generating the target below the Append
... snip ...
*/
if (rel_is_partitioned)
rel->pathlist = NIL;

Later the function adjusts the targets of paths in child relations and constructs Append paths from them. That works for simple partitioned relations but not for join between partitioned relations. When enable_partitionwise_join is true, the joinrel representing a join between partitioned relations may have join paths joining append paths and Append paths containing child join paths. Once we zap the pathlist, the only paths that can be computed again are the Append paths. If the optimal path, before applying the new target, was a join of append paths it will be lost forever. This will result in choosing a suboptimal Append path.

We have one such query in our regression set.

SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM (plt1_adv t1 LEFT JOIN plt2_adv t2 ON (t1.c = t2.c)) FULL JOIN plt3_adv t3 ON (t1.c = t3.c) WHERE coalesce(t1.a, 0    ) % 5 != 3 AND coalesce(t1.a, 0) % 5 != 4 ORDER BY t1.c, t1.a, t2.a, t3.a;

For this query, the cheapest Append of Joins path has cost 24.97..25.57 and the cheapest Join of Appends path has cost 21.29..21.81. The latter should be chosen even though enable_partitionwise_join is ON. But this function chooses the first.

The solution is to zap the pathlists only for simple partitioned relations like the attached patch.

With this patch above query does not choose non-partitionwise join path and partition_join test fails. That's expected. But we need to replace that query with some query which uses partitionwise join while maintaining the conditions of the test as explained in the comment above that query. I have tried a few variations but without success. Suggestions welcome.

Found a replacement for that query by using a 2-way join instead of 3-way join. The query still executes the referenced code in process_outer_partition() as mentioned in the comment. I did think about removing the original query. But it is the only example in our regression tests where partitionwise join is more costly than non-partitionwise join. So I have left it as is in the test. I am fine if others think that we should remove it.
 
Adding to the next commitfest but better to consider this for the next set of minor releases.

--
Best Wishes,
Ashutosh Bapat
Вложения

Re: apply_scanjoin_target_to_paths and partitionwise join

От
Jakub Wartak
Дата:
Hi Ashutosh & hackers,

On Mon, Apr 15, 2024 at 9:00 AM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> Here's patch with
>
[..]
> Adding to the next commitfest but better to consider this for the next set of minor releases.

1. The patch does not pass cfbot -
https://cirrus-ci.com/task/5486258451906560 on master due to test
failure "not ok 206   + partition_join"

2. Without the patch applied, the result of the meson test on master
was clean (no failures , so master is fine). After applying patch
there were expected some hunk failures (as the patch was created for
15_STABLE):

patching file src/backend/optimizer/plan/planner.c
Hunk #1 succeeded at 7567 (offset 468 lines).
Hunk #2 succeeded at 7593 (offset 468 lines).
patching file src/test/regress/expected/partition_join.out
Hunk #1 succeeded at 4777 (offset 56 lines).
Hunk #2 succeeded at 4867 (offset 56 lines).
patching file src/test/regress/sql/partition_join.sql
Hunk #1 succeeded at 1136 (offset 1 line).

3. Without patch there is performance regression/bug on master (cost
is higher with enable_partitionwise_join=on that without it):

data preparation:
-- Test the process_outer_partition() code path
CREATE TABLE plt1_adv (a int, b int, c text) PARTITION BY LIST (c);
CREATE TABLE plt1_adv_p1 PARTITION OF plt1_adv FOR VALUES IN ('0000',
'0001', '0002');
CREATE TABLE plt1_adv_p2 PARTITION OF plt1_adv FOR VALUES IN ('0003', '0004');
INSERT INTO plt1_adv SELECT i, i, to_char(i % 5, 'FM0000') FROM
generate_series(0, 24) i;
ANALYZE plt1_adv;

CREATE TABLE plt2_adv (a int, b int, c text) PARTITION BY LIST (c);
CREATE TABLE plt2_adv_p1 PARTITION OF plt2_adv FOR VALUES IN ('0002');
CREATE TABLE plt2_adv_p2 PARTITION OF plt2_adv FOR VALUES IN ('0003', '0004');
INSERT INTO plt2_adv SELECT i, i, to_char(i % 5, 'FM0000') FROM
generate_series(0, 24) i WHERE i % 5 IN (2, 3, 4);
ANALYZE plt2_adv;

CREATE TABLE plt3_adv (a int, b int, c text) PARTITION BY LIST (c);
CREATE TABLE plt3_adv_p1 PARTITION OF plt3_adv FOR VALUES IN ('0001');
CREATE TABLE plt3_adv_p2 PARTITION OF plt3_adv FOR VALUES IN ('0003', '0004');
INSERT INTO plt3_adv SELECT i, i, to_char(i % 5, 'FM0000') FROM
generate_series(0, 24) i WHERE i % 5 IN (1, 3, 4);
ANALYZE plt3_adv;

off:
EXPLAIN SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM (plt1_adv t1
LEFT JOIN plt2_adv t2 ON (t1.c = t2.c)) FULL JOIN plt3_adv t3 ON (t1.c
= t3.c) WHERE coalesce(t1.a, 0) % 5 != 3 AND coalesce(t1.a, 0) % 5 !=
4 ORDER BY t1.c, t1.a, t2.a, t3.a;
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Sort  (cost=22.02..22.58 rows=223 width=27)
   Sort Key: t1.c, t1.a, t2.a, t3.a
   ->  Hash Full Join  (cost=4.83..13.33 rows=223 width=27)
[..]


with enable_partitionwise_join=ON (see the jump from cost 22.02 -> 27.65):
EXPLAIN SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM (plt1_adv t1
LEFT JOIN plt2_adv t2 ON (t1.c = t2.c)) FULL JOIN plt3_adv t3 ON (t1.c
= t3.c) WHERE coalesce(t1.a, 0) % 5 != 3 AND coalesce(t1.a, 0) % 5 !=
4 ORDER BY t1.c, t1.a, t2.a, t3.a;
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Sort  (cost=27.65..28.37 rows=289 width=27)
   Sort Key: t1.c, t1.a, t2.a, t3.a
   ->  Append  (cost=2.23..15.83 rows=289 width=27)
         ->  Hash Full Join  (cost=2.23..4.81 rows=41 width=27)
[..]
         ->  Hash Full Join  (cost=2.45..9.57 rows=248 width=27)
[..]


However with the patch applied the plan with minimal cost is always
chosen ("22"):

explain SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM (plt1_adv t1 LEFT JOIN
plt2_adv t2 ON (t1.c = t2.c)) FULL JOIN plt3_adv t3 ON (t1.c = t3.c) WHERE
coalesce(t1.a, 0    ) % 5 != 3 AND coalesce(t1.a, 0) % 5 != 4 ORDER BY
t1.c, t1.a, t2.a, t3.a;
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Sort  (cost=22.02..22.58 rows=223 width=27)
   Sort Key: t1.c, t1.a, t2.a, t3.a
   ->  Hash Full Join  (cost=4.83..13.33 rows=223 width=27)
[..]


set enable_partitionwise_join to on;
explain SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM (plt1_adv t1 LEFT JOIN
plt2_adv t2 ON (t1.c = t2.c)) FULL JOIN plt3_adv t3 ON (t1.c = t3.c) WHERE
coalesce(t1.a, 0    ) % 5 != 3 AND coalesce(t1.a, 0) % 5 != 4 ORDER BY
t1.c, t1.a, t2.a, t3.a;
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Sort  (cost=22.02..22.58 rows=223 width=27)
   Sort Key: t1.c, t1.a, t2.a, t3.a
   ->  Hash Full Join  (cost=4.83..13.33 rows=223 width=27)
[..]

with the patch applied, the minimal cost (with toggle on or off) the
cost always stays the minimal from the available ones. We cannot
provide a reproducer for real performance regression, but for the
affected customer it took 530+s (with enable_partitionwise_join=on)
and without that GUC it it was ~23s.

4. meson test ends up with failures like below:

  4/290 postgresql:regress / regress/regress
                 ERROR          32.67s
  6/290 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade
                 ERROR          56.96s
 35/290 postgresql:recovery / recovery/027_stream_regress
                 ERROR          40.20s

(all due to "regression tests pass" failures)

the partition_join.sql is failing for test 206, so for this:

-- partitionwise join with fractional paths
CREATE TABLE fract_t (id BIGINT, PRIMARY KEY (id)) PARTITION BY RANGE (id);
CREATE TABLE fract_t0 PARTITION OF fract_t FOR VALUES FROM ('0') TO ('1000');
CREATE TABLE fract_t1 PARTITION OF fract_t FOR VALUES FROM ('1000') TO ('2000');

-- insert data
INSERT INTO fract_t (id) (SELECT generate_series(0, 1999));
ANALYZE fract_t;

-- verify plan; nested index only scans
SET max_parallel_workers_per_gather = 0;
SET enable_partitionwise_join = on;

the testsuite was expecting the below with enable_partitionwise_join = on;

EXPLAIN (COSTS OFF)
SELECT x.id, y.id FROM fract_t x LEFT JOIN fract_t y USING (id) ORDER
BY x.id ASC LIMIT 10;
                              QUERY PLAN
-----------------------------------------------------------------------
 Limit
   ->  Merge Append
         Sort Key: x.id
         ->  Merge Left Join
               Merge Cond: (x_1.id = y_1.id)
               ->  Index Only Scan using fract_t0_pkey on fract_t0 x_1
               ->  Index Only Scan using fract_t0_pkey on fract_t0 y_1
         ->  Merge Left Join
               Merge Cond: (x_2.id = y_2.id)
               ->  Index Only Scan using fract_t1_pkey on fract_t1 x_2
               ->  Index Only Scan using fract_t1_pkey on fract_t1 y_2

but actually with patch it gets this (here with costs):

EXPLAIN (COSTS) SELECT x.id, y.id FROM fract_t x LEFT JOIN fract_t y
USING (id) ORDER BY x.id ASC LIMIT 10;
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Limit  (cost=1.10..2.21 rows=10 width=16)
   ->  Merge Left Join  (cost=1.10..223.10 rows=2000 width=16)
         Merge Cond: (x.id = y.id)
         ->  Append  (cost=0.55..96.55 rows=2000 width=8)
[..]
         ->  Append  (cost=0.55..96.55 rows=2000 width=8)
[..]


if you run it without patch and again with enable_partitionwise_join=on:

EXPLAIN SELECT x.id, y.id FROM fract_t x LEFT JOIN fract_t y USING
(id) ORDER BY x.id ASC LIMIT 10;
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Limit  (cost=1.11..2.22 rows=10 width=16)
   ->  Merge Append  (cost=1.11..223.11 rows=2000 width=16)
         Sort Key: x.id
         ->  Merge Left Join  (cost=0.55..101.55 rows=1000 width=16)
[..]
         ->  Merge Left Join  (cost=0.55..101.55 rows=1000 width=16)
[..]

So with the patch that SQL does not use partitionwise join as it finds
it more optimal to stick to a plan with cost of "1.10..2.21" instead
of "1.11..2.22" (w/ partition_join), nitpicking but still a failure
technically. Perhaps it could be even removed? (it's pretty close to
noise?).

-J.



Re: apply_scanjoin_target_to_paths and partitionwise join

От
Ashutosh Bapat
Дата:


On Mon, May 6, 2024 at 4:26 PM Jakub Wartak <jakub.wartak@enterprisedb.com> wrote:
Hi Ashutosh & hackers,

On Mon, Apr 15, 2024 at 9:00 AM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> Here's patch with
>
[..]
> Adding to the next commitfest but better to consider this for the next set of minor releases.

1. The patch does not pass cfbot -
https://cirrus-ci.com/task/5486258451906560 on master due to test
failure "not ok 206   + partition_join"

So I need to create a patch for master first. I thought CFBot somehow knew that the patch was created for PG 15. :)
 

2. Without the patch applied, the result of the meson test on master
was clean (no failures , so master is fine). After applying patch
there were expected some hunk failures (as the patch was created for
15_STABLE):

patching file src/backend/optimizer/plan/planner.c
Hunk #1 succeeded at 7567 (offset 468 lines).
Hunk #2 succeeded at 7593 (offset 468 lines).
patching file src/test/regress/expected/partition_join.out
Hunk #1 succeeded at 4777 (offset 56 lines).
Hunk #2 succeeded at 4867 (offset 56 lines).
patching file src/test/regress/sql/partition_join.sql
Hunk #1 succeeded at 1136 (offset 1 line).

3. Without patch there is performance regression/bug on master (cost
is higher with enable_partitionwise_join=on that without it):

data preparation:
-- Test the process_outer_partition() code path
CREATE TABLE plt1_adv (a int, b int, c text) PARTITION BY LIST (c);
CREATE TABLE plt1_adv_p1 PARTITION OF plt1_adv FOR VALUES IN ('0000',
'0001', '0002');
CREATE TABLE plt1_adv_p2 PARTITION OF plt1_adv FOR VALUES IN ('0003', '0004');
INSERT INTO plt1_adv SELECT i, i, to_char(i % 5, 'FM0000') FROM
generate_series(0, 24) i;
ANALYZE plt1_adv;

CREATE TABLE plt2_adv (a int, b int, c text) PARTITION BY LIST (c);
CREATE TABLE plt2_adv_p1 PARTITION OF plt2_adv FOR VALUES IN ('0002');
CREATE TABLE plt2_adv_p2 PARTITION OF plt2_adv FOR VALUES IN ('0003', '0004');
INSERT INTO plt2_adv SELECT i, i, to_char(i % 5, 'FM0000') FROM
generate_series(0, 24) i WHERE i % 5 IN (2, 3, 4);
ANALYZE plt2_adv;

CREATE TABLE plt3_adv (a int, b int, c text) PARTITION BY LIST (c);
CREATE TABLE plt3_adv_p1 PARTITION OF plt3_adv FOR VALUES IN ('0001');
CREATE TABLE plt3_adv_p2 PARTITION OF plt3_adv FOR VALUES IN ('0003', '0004');
INSERT INTO plt3_adv SELECT i, i, to_char(i % 5, 'FM0000') FROM
generate_series(0, 24) i WHERE i % 5 IN (1, 3, 4);
ANALYZE plt3_adv;

off:
EXPLAIN SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM (plt1_adv t1
LEFT JOIN plt2_adv t2 ON (t1.c = t2.c)) FULL JOIN plt3_adv t3 ON (t1.c
= t3.c) WHERE coalesce(t1.a, 0) % 5 != 3 AND coalesce(t1.a, 0) % 5 !=
4 ORDER BY t1.c, t1.a, t2.a, t3.a;
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Sort  (cost=22.02..22.58 rows=223 width=27)
   Sort Key: t1.c, t1.a, t2.a, t3.a
   ->  Hash Full Join  (cost=4.83..13.33 rows=223 width=27)
[..]


with enable_partitionwise_join=ON (see the jump from cost 22.02 -> 27.65):
EXPLAIN SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM (plt1_adv t1
LEFT JOIN plt2_adv t2 ON (t1.c = t2.c)) FULL JOIN plt3_adv t3 ON (t1.c
= t3.c) WHERE coalesce(t1.a, 0) % 5 != 3 AND coalesce(t1.a, 0) % 5 !=
4 ORDER BY t1.c, t1.a, t2.a, t3.a;
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Sort  (cost=27.65..28.37 rows=289 width=27)
   Sort Key: t1.c, t1.a, t2.a, t3.a
   ->  Append  (cost=2.23..15.83 rows=289 width=27)
         ->  Hash Full Join  (cost=2.23..4.81 rows=41 width=27)
[..]
         ->  Hash Full Join  (cost=2.45..9.57 rows=248 width=27)
[..]


However with the patch applied the plan with minimal cost is always
chosen ("22"):

explain SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM (plt1_adv t1 LEFT JOIN
plt2_adv t2 ON (t1.c = t2.c)) FULL JOIN plt3_adv t3 ON (t1.c = t3.c) WHERE
coalesce(t1.a, 0    ) % 5 != 3 AND coalesce(t1.a, 0) % 5 != 4 ORDER BY
t1.c, t1.a, t2.a, t3.a;
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Sort  (cost=22.02..22.58 rows=223 width=27)
   Sort Key: t1.c, t1.a, t2.a, t3.a
   ->  Hash Full Join  (cost=4.83..13.33 rows=223 width=27)
[..]


set enable_partitionwise_join to on;
explain SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM (plt1_adv t1 LEFT JOIN
plt2_adv t2 ON (t1.c = t2.c)) FULL JOIN plt3_adv t3 ON (t1.c = t3.c) WHERE
coalesce(t1.a, 0    ) % 5 != 3 AND coalesce(t1.a, 0) % 5 != 4 ORDER BY
t1.c, t1.a, t2.a, t3.a;
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Sort  (cost=22.02..22.58 rows=223 width=27)
   Sort Key: t1.c, t1.a, t2.a, t3.a
   ->  Hash Full Join  (cost=4.83..13.33 rows=223 width=27)
[..]

with the patch applied, the minimal cost (with toggle on or off) the
cost always stays the minimal from the available ones. We cannot
provide a reproducer for real performance regression, but for the
affected customer it took 530+s (with enable_partitionwise_join=on) 
and without that GUC it it was ~23s.
 
Thanks for providing actual timing. That's a huge difference.


4. meson test ends up with failures like below:

  4/290 postgresql:regress / regress/regress
                 ERROR          32.67s
  6/290 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade
                 ERROR          56.96s
 35/290 postgresql:recovery / recovery/027_stream_regress
                 ERROR          40.20s

(all due to "regression tests pass" failures)

the partition_join.sql is failing for test 206, so for this:

-- partitionwise join with fractional paths
CREATE TABLE fract_t (id BIGINT, PRIMARY KEY (id)) PARTITION BY RANGE (id);
CREATE TABLE fract_t0 PARTITION OF fract_t FOR VALUES FROM ('0') TO ('1000');
CREATE TABLE fract_t1 PARTITION OF fract_t FOR VALUES FROM ('1000') TO ('2000');

-- insert data
INSERT INTO fract_t (id) (SELECT generate_series(0, 1999));
ANALYZE fract_t;

-- verify plan; nested index only scans
SET max_parallel_workers_per_gather = 0;
SET enable_partitionwise_join = on;

the testsuite was expecting the below with enable_partitionwise_join = on;

EXPLAIN (COSTS OFF)
SELECT x.id, y.id FROM fract_t x LEFT JOIN fract_t y USING (id) ORDER
BY x.id ASC LIMIT 10;
                              QUERY PLAN
-----------------------------------------------------------------------
 Limit
   ->  Merge Append
         Sort Key: x.id
         ->  Merge Left Join
               Merge Cond: (x_1.id = y_1.id)
               ->  Index Only Scan using fract_t0_pkey on fract_t0 x_1
               ->  Index Only Scan using fract_t0_pkey on fract_t0 y_1
         ->  Merge Left Join
               Merge Cond: (x_2.id = y_2.id)
               ->  Index Only Scan using fract_t1_pkey on fract_t1 x_2
               ->  Index Only Scan using fract_t1_pkey on fract_t1 y_2

but actually with patch it gets this (here with costs):

EXPLAIN (COSTS) SELECT x.id, y.id FROM fract_t x LEFT JOIN fract_t y
USING (id) ORDER BY x.id ASC LIMIT 10;
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Limit  (cost=1.10..2.21 rows=10 width=16)
   ->  Merge Left Join  (cost=1.10..223.10 rows=2000 width=16)
         Merge Cond: (x.id = y.id)
         ->  Append  (cost=0.55..96.55 rows=2000 width=8)
[..]
         ->  Append  (cost=0.55..96.55 rows=2000 width=8)
[..]


if you run it without patch and again with enable_partitionwise_join=on:

EXPLAIN SELECT x.id, y.id FROM fract_t x LEFT JOIN fract_t y USING
(id) ORDER BY x.id ASC LIMIT 10;
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Limit  (cost=1.11..2.22 rows=10 width=16)
   ->  Merge Append  (cost=1.11..223.11 rows=2000 width=16)
         Sort Key: x.id
         ->  Merge Left Join  (cost=0.55..101.55 rows=1000 width=16)
[..]
         ->  Merge Left Join  (cost=0.55..101.55 rows=1000 width=16)
[..]

So with the patch that SQL does not use partitionwise join as it finds
it more optimal to stick to a plan with cost of "1.10..2.21" instead
of "1.11..2.22" (w/ partition_join), nitpicking but still a failure
technically. Perhaps it could be even removed? (it's pretty close to
noise?).

I think we need to replace the failing query with something which uses partitionwise join even with the patch.

I will take a look at this after returning from a two week long vacation, unless someone else is interested in fixing this before that.
 
--
Best Wishes,
Ashutosh Bapat

Re: apply_scanjoin_target_to_paths and partitionwise join

От
Ashutosh Bapat
Дата:


On Mon, May 6, 2024 at 6:28 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:


On Mon, May 6, 2024 at 4:26 PM Jakub Wartak <jakub.wartak@enterprisedb.com> wrote:
Hi Ashutosh & hackers,

On Mon, Apr 15, 2024 at 9:00 AM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> Here's patch with
>
[..]
> Adding to the next commitfest but better to consider this for the next set of minor releases.

1. The patch does not pass cfbot -
https://cirrus-ci.com/task/5486258451906560 on master due to test
failure "not ok 206   + partition_join"

So I need to create a patch for master first. I thought CFBot somehow knew that the patch was created for PG 15. :)

PFA patch for master. That should fix CfBot.
 

4. meson test ends up with failures like below:

  4/290 postgresql:regress / regress/regress
                 ERROR          32.67s
  6/290 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade
                 ERROR          56.96s
 35/290 postgresql:recovery / recovery/027_stream_regress
                 ERROR          40.20s

(all due to "regression tests pass" failures)
 [...]

So with the patch that SQL does not use partitionwise join as it finds
it more optimal to stick to a plan with cost of "1.10..2.21" instead
of "1.11..2.22" (w/ partition_join), nitpicking but still a failure
technically. Perhaps it could be even removed? (it's pretty close to
noise?).


The test was added by 6b94e7a6da2f1c6df1a42efe64251f32a444d174 and later modified by 3c569049b7b502bb4952483d19ce622ff0af5fd6. The modification just avoided eliminating the join, so that change can be ignored. 6b94e7a6da2f1c6df1a42efe64251f32a444d174 added the tests to test fractional paths being considered when creating ordered append paths. Reading the commit message, I was expecting a test which did not use a join as well and also which used inheritance. But it seems that the queries added by that commit, test all the required scenarios and hence we see two queries involving join between partitioned tables. As the comment there says the intention is to verify index only scans and not exactly partitionwise join. So just fixing the expected output of one query looks fine. The other query will test partitionwise join and fractional paths anyway. I am including Tomas, Arne and Zhihong, who worked on the first commit, to comment on expected output changes.

I will create patches for the back-branches once the patch for master is in a committable state.

--
Best Wishes,
Ashutosh Bapat
Вложения

Re: apply_scanjoin_target_to_paths and partitionwise join

От
arne.roland@malkut.net
Дата:
Hi Ashutosh,

thanks for bringing this to my attention. I'll first share a few 
thoughts about the change and respond regarding the test below.

I clearly understand your intention with this patch. It's an issue I run 
into from time to time.

I did some testing with some benchmark sets back with pg 14. I did the 
following: I planned with and without the partitionwise join GUC 
(explain) and took the one with the lower cost to execute the query.

Interestingly, even discounting the overhead and additional planning 
time, the option with the lower cost turned out to be slower on our 
benchmark set back then. The median query with disabled GUC was quicker, 
but on average that was not the case. The observation is one, I'd 
generally describe as "The more options you consider, the more ways we 
have to be horribly wrong. More options for the planner are a great way 
to uncover the various shortcomings of it."

That might be specific to the benchmark I was working with at the time. 
But that made me drop the issue back then. That is ofc no valid reason 
not to go in the direction of making the planner to consider more 
options. :)

Maybe we can discuss that in person next week?

On 2024-05-22 07:57, Ashutosh Bapat wrote:
> On Mon, May 6, 2024 at 6:28 PM Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
>>> 4. meson test ends up with failures like below:
>>> 
>>> 4/290 postgresql:regress / regress/regress
>>> ERROR          32.67s
>>> 6/290 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade
>>> ERROR          56.96s
>>> 35/290 postgresql:recovery / recovery/027_stream_regress
>>> ERROR          40.20s
>>> 
>>> (all due to "regression tests pass" failures)
>>> [...]
> 
>>> So with the patch that SQL does not use partitionwise join as it
>>> finds
>>> it more optimal to stick to a plan with cost of "1.10..2.21"
>>> instead
>>> of "1.11..2.22" (w/ partition_join), nitpicking but still a
>>> failure
>>> technically. Perhaps it could be even removed? (it's pretty close
>>> to
>>> noise?).
> 
> The test was added by 6b94e7a6da2f1c6df1a42efe64251f32a444d174 and
> later modified by 3c569049b7b502bb4952483d19ce622ff0af5fd6. The
> modification just avoided eliminating the join, so that change can be
> ignored. 6b94e7a6da2f1c6df1a42efe64251f32a444d174 added the tests to
> test fractional paths being considered when creating ordered append
> paths. Reading the commit message, I was expecting a test which did
> not use a join as well and also which used inheritance. But it seems
> that the queries added by that commit, test all the required scenarios
> and hence we see two queries involving join between partitioned
> tables. As the comment there says the intention is to verify index
> only scans and not exactly partitionwise join. So just fixing the
> expected output of one query looks fine. The other query will test
> partitionwise join and fractional paths anyway. I am including Tomas,
> Arne and Zhihong, who worked on the first commit, to comment on
> expected output changes.

The test was put there to make sure a fractional join is considered in 
the case that a partitionwise join is considered. Because that wasn't 
the case before.

The important part for my use case back then was that we do Merge 
Join(s) at all. The test result after your patch still confirms that.

If we simply modify the test as such, we no longer confirm, whether the 
code path introduced in 6b94e7a6da2f1c6df1a42efe64251f32a444d174 is 
still working.

Maybe it's worthwhile to add something like

create index on fract_t0 ((id*id));

EXPLAIN (COSTS OFF)
SELECT * FROM fract_t x JOIN fract_t y USING (id) ORDER BY id * id DESC 
LIMIT 10;
                                   QUERY PLAN
-------------------------------------------------------------------------------
  Limit
    ->  Merge Append
          Sort Key: ((x.id * x.id)) DESC
          ->  Nested Loop
                ->  Index Scan Backward using fract_t0_expr_idx on 
fract_t0 x_1
                ->  Index Only Scan using fract_t0_pkey on fract_t0 y_1
                      Index Cond: (id = x_1.id)
          ->  Sort
                Sort Key: ((x_2.id * x_2.id)) DESC
                ->  Hash Join
                      Hash Cond: (x_2.id = y_2.id)
                      ->  Seq Scan on fract_t1 x_2
                      ->  Hash
                            ->  Seq Scan on fract_t1 y_2


I am not sure, whether it's worth the extra test cycles on every animal, 
but since we are not creating an extra table it might be ok.
I don't have a very strong feeling about the above test case.

> I will create patches for the back-branches once the patch for master
> is in a committable state.

I am not sure, whether it's really a bug. I personally wouldn't be brave 
enough to back patch this. I don't want to deal with complaining end 
users. Suddenly their optimizer, which always had horrible estimates, 
was actually able to do harmful stuff with them. Only due to a minor 
version upgrade. I think that's a bad idea to backpatch something with 
complex performance implications. Especially since they might even be 
based on potentially inaccurate data...

> 
> --
> 
> Best Wishes,
> Ashutosh Bapat

All the best
Arne



Re: apply_scanjoin_target_to_paths and partitionwise join

От
Robert Haas
Дата:
On Fri, May 24, 2024 at 2:02 PM <arne.roland@malkut.net> wrote:
> I am not sure, whether it's really a bug. I personally wouldn't be brave
> enough to back patch this. I don't want to deal with complaining end
> users. Suddenly their optimizer, which always had horrible estimates,
> was actually able to do harmful stuff with them. Only due to a minor
> version upgrade. I think that's a bad idea to backpatch something with
> complex performance implications. Especially since they might even be
> based on potentially inaccurate data...

+1.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: apply_scanjoin_target_to_paths and partitionwise join

От
Ashutosh Bapat
Дата:


On Fri, May 24, 2024 at 11:02 AM <arne.roland@malkut.net> wrote:
Hi Ashutosh,

thanks for bringing this to my attention. I'll first share a few
thoughts about the change and respond regarding the test below.

I clearly understand your intention with this patch. It's an issue I run
into from time to time.

I did some testing with some benchmark sets back with pg 14. I did the
following: I planned with and without the partitionwise join GUC
(explain) and took the one with the lower cost to execute the query.

Interestingly, even discounting the overhead and additional planning
time, the option with the lower cost turned out to be slower on our
benchmark set back then. The median query with disabled GUC was quicker,
but on average that was not the case. The observation is one, I'd
generally describe as "The more options you consider, the more ways we
have to be horribly wrong. More options for the planner are a great way
to uncover the various shortcomings of it."

That might be specific to the benchmark I was working with at the time.
But that made me drop the issue back then. That is ofc no valid reason
not to go in the direction of making the planner to consider more
options. :)

In summary, you are suggesting that partitionwise join performs better than plain join even if the latter one has lower cost. Hence fixing this issue has never become a priority for you. Am I right?

Plans with lower costs being slower is not new for optimizer. Partitionwise join just adds another case.
 

Maybe we can discuss that in person next week?

Sure.
 

On 2024-05-22 07:57, Ashutosh Bapat wrote:
>
> The test was added by 6b94e7a6da2f1c6df1a42efe64251f32a444d174 and
> later modified by 3c569049b7b502bb4952483d19ce622ff0af5fd6. The
> modification just avoided eliminating the join, so that change can be
> ignored. 6b94e7a6da2f1c6df1a42efe64251f32a444d174 added the tests to
> test fractional paths being considered when creating ordered append
> paths. Reading the commit message, I was expecting a test which did
> not use a join as well and also which used inheritance. But it seems
> that the queries added by that commit, test all the required scenarios
> and hence we see two queries involving join between partitioned
> tables. As the comment there says the intention is to verify index
> only scans and not exactly partitionwise join. So just fixing the
> expected output of one query looks fine. The other query will test
> partitionwise join and fractional paths anyway. I am including Tomas,
> Arne and Zhihong, who worked on the first commit, to comment on
> expected output changes.

The test was put there to make sure a fractional join is considered in
the case that a partitionwise join is considered. Because that wasn't
the case before.

The important part for my use case back then was that we do Merge
Join(s) at all. The test result after your patch still confirms that.

If we simply modify the test as such, we no longer confirm, whether the
code path introduced in 6b94e7a6da2f1c6df1a42efe64251f32a444d174 is
still working.

Maybe it's worthwhile to add something like

create index on fract_t0 ((id*id));

EXPLAIN (COSTS OFF)
SELECT * FROM fract_t x JOIN fract_t y USING (id) ORDER BY id * id DESC
LIMIT 10;
                                   QUERY PLAN
-------------------------------------------------------------------------------
  Limit
    ->  Merge Append
          Sort Key: ((x.id * x.id)) DESC
          ->  Nested Loop
                ->  Index Scan Backward using fract_t0_expr_idx on
fract_t0 x_1
                ->  Index Only Scan using fract_t0_pkey on fract_t0 y_1
                      Index Cond: (id = x_1.id)
          ->  Sort
                Sort Key: ((x_2.id * x_2.id)) DESC
                ->  Hash Join
                      Hash Cond: (x_2.id = y_2.id)
                      ->  Seq Scan on fract_t1 x_2
                      ->  Hash
                            ->  Seq Scan on fract_t1 y_2


I am not sure, whether it's worth the extra test cycles on every animal,
but since we are not creating an extra table it might be ok.
I don't have a very strong feeling about the above test case.

My patch removes redundant enable_partitionwise_join = on since that's done very early in the test. Apart from that it does not change the test. So if the expected output change is fine with you, I think we should leave the test as is. Plan outputs are sometimes fragile and thus make expected outputs flaky.


> I will create patches for the back-branches once the patch for master
> is in a committable state.

I am not sure, whether it's really a bug. I personally wouldn't be brave
enough to back patch this. I don't want to deal with complaining end
users. Suddenly their optimizer, which always had horrible estimates,
was actually able to do harmful stuff with them. Only due to a minor
version upgrade. I think that's a bad idea to backpatch something with
complex performance implications. Especially since they might even be
based on potentially inaccurate data...

Since it's a thinko I considered it as a bug. But I agree that it has the potential to disturb plans after upgrade and thus upset users. So I am fine if we don't backpatch. 

--
Best Wishes,
Ashutosh Bapat

Re: apply_scanjoin_target_to_paths and partitionwise join

От
arne.roland@malkut.net
Дата:
Hi Ashutosh!

On 2024-05-27 14:17, Ashutosh Bapat wrote:
> On Fri, May 24, 2024 at 11:02 AM <arne.roland@malkut.net> wrote:
> 
>> Hi Ashutosh,
>> 
>> thanks for bringing this to my attention. I'll first share a few
>> thoughts about the change and respond regarding the test below.
>> 
>> I clearly understand your intention with this patch. It's an issue I
>> run
>> into from time to time.
>> 
>> I did some testing with some benchmark sets back with pg 14. I did
>> the
>> following: I planned with and without the partitionwise join GUC
>> (explain) and took the one with the lower cost to execute the query.
>> 
>> Interestingly, even discounting the overhead and additional planning
>> 
>> time, the option with the lower cost turned out to be slower on our
>> benchmark set back then. The median query with disabled GUC was
>> quicker,
>> but on average that was not the case. The observation is one, I'd
>> generally describe as "The more options you consider, the more ways
>> we
>> have to be horribly wrong. More options for the planner are a great
>> way
>> to uncover the various shortcomings of it."
>> 
>> That might be specific to the benchmark I was working with at the
>> time.
>> But that made me drop the issue back then. That is ofc no valid
>> reason
>> not to go in the direction of making the planner to consider more
>> options. :)
> 
> In summary, you are suggesting that partitionwise join performs better
> than plain join even if the latter one has lower cost. Hence fixing
> this issue has never become a priority for you. Am I right?
> 
> Plans with lower costs being slower is not new for optimizer.
> Partitionwise join just adds another case.

Sorry for my confusing long text. I will try to recap my points 
concisely.

1. I think the order by pk frac limit plans had just to similar 
performance behaviour for me to bother.
But afaics the main point of your proposal is not related to frac plans 
at all.
2. We can't expect the optimizers to simply yield better results by 
being given more options to be wrong. (Let me give a simple example: 
This patch makes our lack of cross table cross column statistics worse. 
We give it more opportunity to pick something horrible.
3. I dislike, that this patch makes much harder to debug, why no 
partitionwise join is chosen.

> 
>> Maybe we can discuss that in person next week?
> 
> Sure.
> 
>> On 2024-05-22 07:57, Ashutosh Bapat wrote:
>>> 
>>> The test was added by 6b94e7a6da2f1c6df1a42efe64251f32a444d174 and
>>> later modified by 3c569049b7b502bb4952483d19ce622ff0af5fd6. The
>>> modification just avoided eliminating the join, so that change can
>> be
>>> ignored. 6b94e7a6da2f1c6df1a42efe64251f32a444d174 added the tests
>> to
>>> test fractional paths being considered when creating ordered
>> append
>>> paths. Reading the commit message, I was expecting a test which
>> did
>>> not use a join as well and also which used inheritance. But it
>> seems
>>> that the queries added by that commit, test all the required
>> scenarios
>>> and hence we see two queries involving join between partitioned
>>> tables. As the comment there says the intention is to verify index
>>> only scans and not exactly partitionwise join. So just fixing the
>>> expected output of one query looks fine. The other query will test
>>> partitionwise join and fractional paths anyway. I am including
>> Tomas,
>>> Arne and Zhihong, who worked on the first commit, to comment on
>>> expected output changes.
>> 
>> The test was put there to make sure a fractional join is considered
>> in
>> the case that a partitionwise join is considered. Because that
>> wasn't
>> the case before.
>> 
>> The important part for my use case back then was that we do Merge
>> Join(s) at all. The test result after your patch still confirms
>> that.
>> 
>> If we simply modify the test as such, we no longer confirm, whether
>> the
>> code path introduced in 6b94e7a6da2f1c6df1a42efe64251f32a444d174 is
>> still working.
>> 
>> Maybe it's worthwhile to add something like
>> 
>> create index on fract_t0 ((id*id));
>> 
>> EXPLAIN (COSTS OFF)
>> SELECT * FROM fract_t x JOIN fract_t y USING (id) ORDER BY id * id
>> DESC
>> LIMIT 10;
>> QUERY PLAN
>> 
> -------------------------------------------------------------------------------
>> Limit
>> ->  Merge Append
>> Sort Key: ((x.id [1] * x.id [1])) DESC
>> ->  Nested Loop
>> ->  Index Scan Backward using fract_t0_expr_idx on
>> fract_t0 x_1
>> ->  Index Only Scan using fract_t0_pkey on fract_t0
>> y_1
>> Index Cond: (id = x_1.id [2])
>> ->  Sort
>> Sort Key: ((x_2.id [3] * x_2.id [3])) DESC
>> ->  Hash Join
>> Hash Cond: (x_2.id [3] = y_2.id [4])
>> ->  Seq Scan on fract_t1 x_2
>> ->  Hash
>> ->  Seq Scan on fract_t1 y_2
>> 
>> I am not sure, whether it's worth the extra test cycles on every
>> animal,
>> but since we are not creating an extra table it might be ok.
>> I don't have a very strong feeling about the above test case.
> 
> My patch removes redundant enable_partitionwise_join = on since that's
> done very early in the test. Apart from that it does not change the
> test. So if the expected output change is fine with you, I think we
> should leave the test as is. Plan outputs are sometimes fragile and
> thus make expected outputs flaky.

If at all, we can add to that. That would indeed give us more code test 
coverage. I will refrain from commenting further, since that discussion 
would get completely disconnected from the patch at hand.

> 
>>> I will create patches for the back-branches once the patch for
>> master
>>> is in a committable state.
>> 
>> I am not sure, whether it's really a bug. I personally wouldn't be
>> brave
>> enough to back patch this. I don't want to deal with complaining end
>> 
>> users. Suddenly their optimizer, which always had horrible
>> estimates,
>> was actually able to do harmful stuff with them. Only due to a minor
>> 
>> version upgrade. I think that's a bad idea to backpatch something
>> with
>> complex performance implications. Especially since they might even
>> be
>> based on potentially inaccurate data...
> 
> Since it's a thinko I considered it as a bug. But I agree that it has
> the potential to disturb plans after upgrade and thus upset users. So
> I am fine if we don't backpatch.
> 
> --
> 
> Best Wishes,
> Ashutosh Bapat
> 
> 
> Links:
> ------
> [1] http://x.id
> [2] http://x_1.id
> [3] http://x_2.id
> [4] http://y_2.id

All the best
Arne