Re: apply_scanjoin_target_to_paths and partitionwise join

From Arne Roland
Subject Re: apply_scanjoin_target_to_paths and partitionwise join
Date
Msg-id 527dee29-2fd1-4a97-b1bc-d16b9db120c9@malkut.net
In response to Re: apply_scanjoin_target_to_paths and partitionwise join  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: apply_scanjoin_target_to_paths and partitionwise join
List pgsql-hackers


On 2025-10-29 19:55, Robert Haas wrote:
> On Wed, Oct 29, 2025 at 2:06 PM Arne Roland <arne.roland@malkut.net> wrote:
>> This virtually equivalent query issue occurs when the join condition is
>> (almost) unique. The different amount of tuples to process clearly
>> occurs when they are not.
> I'm having trouble interpreting this. If it's important, please
> clarify and show an example.

Thank you for asking. I hope my explanations are clear. If not, I am happy to explain any particular point in more detail.

The main factor in your example is that the number of rows handled by the (Merge) Append differs.
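
Since the DDL didn't make it into this mail, here is a minimal sketch of a setup that produces plans of this shape. The single duplicated value of a per partition and the partition bounds are my reconstruction, not the exact original script:

create table dupfest (a text) partition by range (a);
create table dupfest1 partition of dupfest for values from ('a') to ('b');
create table dupfest2 partition of dupfest for values from ('b') to ('c');
create table dupfest3 partition of dupfest for values from ('c') to ('d');
create index on dupfest (a); -- cascades to dupfest1_a_idx etc. on the partitions
insert into dupfest -- 10001 copies each of 'a', 'b' and 'c'
  select chr(ascii('a') + i / 10001) from generate_series(0, 30002) i;
analyze dupfest;
set enable_partitionwise_join = on;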

In the partitionwise join we process a lot of rows, namely 300060003:

Aggregate  (cost=12130962.32..12130962.33 rows=1 width=8)
  ->  Merge Append  (cost=0.88..8380212.28 rows=300060003 width=34)
        Sort Key: t1.a
        ->  Nested Loop  (cost=0.29..1500664.33 rows=100020001 width=34)
              Join Filter: (t1_1.a = t2_1.a)
              ->  Index Only Scan using dupfest1_a_idx on dupfest1 t1_1  (cost=0.29..194.30 rows=10001 width=2)
              ->  Materialize  (cost=0.00..195.01 rows=10001 width=2)
                    ->  Seq Scan on dupfest1 t2_1  (cost=0.00..145.01 rows=10001 width=2)
        ->  Nested Loop  (cost=0.29..1500664.33 rows=100020001 width=34)
              Join Filter: (t1_2.a = t2_2.a)
              ->  Index Only Scan using dupfest2_a_idx on dupfest2 t1_2  (cost=0.29..194.30 rows=10001 width=2)
              ->  Materialize  (cost=0.00..195.01 rows=10001 width=2)
                    ->  Seq Scan on dupfest2 t2_2  (cost=0.00..145.01 rows=10001 width=2)
        ->  Nested Loop  (cost=0.29..1500664.33 rows=100020001 width=34)
              Join Filter: (t1_3.a = t2_3.a)
              ->  Index Only Scan using dupfest3_a_idx on dupfest3 t1_3  (cost=0.29..194.30 rows=10001 width=2)
              ->  Materialize  (cost=0.00..195.01 rows=10001 width=2)
                    ->  Seq Scan on dupfest3 t2_3  (cost=0.00..145.01 rows=10001 width=2)
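
To spell out the arithmetic: these estimates are consistent with each partition holding 10001 rows of a single duplicated value of a, so every per-partition nested loop emits 10001 * 10001 = 100020001 rows, and the Merge Append has to handle 3 * 100020001 = 300060003 of them.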

In the non-partitionwise case the Appends handle far fewer rows, because the row explosion only happens above them: the join produces *more* rows than either of the partitioned tables had before, and the Appends sit below the join. Each Append only processes 30003 rows.

Aggregate  (cost=8253191.53..8253191.54 rows=1 width=8) (actual time=64208.334..64208.337 rows=1 loops=1)
  ->  Merge Join  (cost=1.71..4502441.21 rows=300060025 width=34) (actual time=28.900..51731.558 rows=300060003 loops=1)
        Merge Cond: (t1.a = t2.a)
        ->  Append  (cost=0.86..732.91 rows=30003 width=2) (actual time=0.036..7.044 rows=30003 loops=1)
              ->  Index Only Scan using dupfest1_a_idx on dupfest1 t1_1  (cost=0.29..194.30 rows=10001 width=2) (actual time=0.034..1.436 rows=10001 loops=1)
              ->  Index Only Scan using dupfest2_a_idx on dupfest2 t1_2  (cost=0.29..194.30 rows=10001 width=2) (actual time=0.011..1.513 rows=10001 loops=1)
              ->  Index Only Scan using dupfest3_a_idx on dupfest3 t1_3  (cost=0.29..194.30 rows=10001 width=2) (actual time=0.011..1.408 rows=10001 loops=1)
        ->  Materialize  (cost=0.86..807.92 rows=30003 width=2) (actual time=0.014..13787.902 rows=300050003 loops=1)
              ->  Append  (cost=0.86..732.91 rows=30003 width=2) (actual time=0.011..4.225 rows=30003 loops=1)
                    ->  Index Only Scan using dupfest1_a_idx on dupfest1 t2_1  (cost=0.29..194.30 rows=10001 width=2) (actual time=0.010..0.698 rows=10001 loops=1)
                    ->  Index Only Scan using dupfest2_a_idx on dupfest2 t2_2  (cost=0.29..194.30 rows=10001 width=2) (actual time=0.005..0.708 rows=10001 loops=1)
                    ->  Index Only Scan using dupfest3_a_idx on dupfest3 t2_3  (cost=0.29..194.30 rows=10001 width=2) (actual time=0.006..0.776 rows=10001 loops=1)

A very common case I have seen for partitionwise joins is a join across some foreign key structure; there are several design paradigms that make it common. If you join along a foreign key, the number of tuples doesn't increase. Consider the table definition

create table dupfest (a text, id bigint not null generated always as identity, primary key (a, id), foreign key (a, id) references dupfest (a, id)) partition by range(a);
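
To actually run this, you still need partitions and some data; the bounds and row count below are just an illustration, not from my original benchmark:

create table dupfest_p1 partition of dupfest for values from ('a') to ('n');
create table dupfest_p2 partition of dupfest for values from ('n') to ('zz');
insert into dupfest (a) -- every row satisfies the foreign key by referencing itself
  select chr(ascii('a') + i % 25) from generate_series(1, 100000) i;
analyze dupfest;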

With the query

select count(*) from (select * from dupfest t1, dupfest t2 where t1.a = t2.a and t1.id = t2.id order by t1.a offset 0);

If we join along a foreign key from t1 to t2, we know that the join can't contain more tuples than t1 did. This may seem like a very special case, but as far as enable_partitionwise_join = true is concerned, it's definitely a common use case.

If we remove the ORDER BY, these cases also show very similar performance behavior; the difference in time is below the noise threshold of my benchmark, regardless of the amount of data, as long as work_mem is sufficiently large:

select count(*) from (select * from dupfest t1, dupfest t2 where t1.a = t2.a and t1.id = t2.id offset 0);

If we load 10 times the data, we see only a marginal difference in the cost (and similar performance, unless I run EXPLAIN ANALYZE, which ruins the time readings again).

My second sentence just captured the mundane observation that if the join has significantly more tuples than any base relation, the placement of the (Merge) Append becomes more relevant. If I join everything with a generate_series(1, 30000), I get more tuples to process.
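
One way to construct such a case (a sketch, not the exact query I used):

select count(*) from (select * from dupfest t1, dupfest t2, generate_series(1, 30000) g where t1.a = t2.a and t1.id = t2.id order by t1.a offset 0);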

I'd like to make one more side note about this example: the planner penalizes the partitionwise join for having an extra node that emits N rows (three Hash Joins + an Append vs. two Appends + one Hash Join). This plan is chosen because of cpu_tuple_cost. I'm happy it picks the plan with the smaller memory footprint, but in my real-world experience, for a timing-based approach the default cpu_tuple_cost tends to be too high to allow a fair comparison between partitionwise and non-partitionwise joins.
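
For anyone who wants to reproduce that: cpu_tuple_cost is an ordinary GUC with a default of 0.01; the value below is only an illustration of the kind of experiment I mean, not a tuning recommendation:

set cpu_tuple_cost = 0.001; -- shrinks the per-row penalty on the extra Append node
explain select count(*) from (select * from dupfest t1, dupfest t2 where t1.a = t2.a and t1.id = t2.id order by t1.a offset 0);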

All the best
Arne

