Thread: An improvement on parallel DISTINCT


An improvement on parallel DISTINCT

From: Richard Guo
While reviewing Heikki's Omit-junk-columns patchset[1], I noticed that
root->upper_targets[] is used to set target for partial_distinct_rel,
which is not great because root->upper_targets[] is not supposed to be
used by the core code.  The comment in grouping_planner() says:

  * Save the various upper-rel PathTargets we just computed into
  * root->upper_targets[].  The core code doesn't use this, but it
  * provides a convenient place for extensions to get at the info.

Then while fixing this issue, I noticed an opportunity for improvement
in how we generate Gather/GatherMerge paths for the two-phase DISTINCT.
The Gather/GatherMerge paths are added by generate_gather_paths(), which
does not consider ordering that might be useful above the GatherMerge
node.  This can be improved by using generate_useful_gather_paths()
instead.  With this change I can see query plan improvement from the
regression test "select_distinct.sql".  For instance,

-- Test parallel DISTINCT
SET parallel_tuple_cost=0;
SET parallel_setup_cost=0;
SET min_parallel_table_scan_size=0;
SET max_parallel_workers_per_gather=2;

-- Ensure we get a parallel plan
EXPLAIN (costs off)
SELECT DISTINCT four FROM tenk1;

-- on master
EXPLAIN (costs off)
SELECT DISTINCT four FROM tenk1;
                     QUERY PLAN
----------------------------------------------------
 Unique
   ->  Sort
         Sort Key: four
         ->  Gather
               Workers Planned: 2
               ->  HashAggregate
                     Group Key: four
                     ->  Parallel Seq Scan on tenk1
(8 rows)

-- on patched
EXPLAIN (costs off)
SELECT DISTINCT four FROM tenk1;
                     QUERY PLAN
----------------------------------------------------
 Unique
   ->  Gather Merge
         Workers Planned: 2
         ->  Sort
               Sort Key: four
               ->  HashAggregate
                     Group Key: four
                     ->  Parallel Seq Scan on tenk1
(8 rows)

I believe the second plan is better.
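
(For reference, the gist of the change is a one-line switch in
create_partial_distinct_paths() in planner.c; this is a sketch, the
exact hunk and context may differ:

- generate_gather_paths(root, partial_distinct_rel, true);
+ generate_useful_gather_paths(root, partial_distinct_rel, true);

generate_useful_gather_paths() additionally builds Gather Merge paths
for sort orderings that could be useful above the gather, which is what
lets the Sort happen below the Gather Merge here.)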

Attached is a patch that includes this change and also eliminates the
usage of root->upper_targets[] in the core code.  It also makes some
tweaks for the comment.

Any thoughts?

[1] https://www.postgresql.org/message-id/flat/2ca5865b-4693-40e5-8f78-f3b45d5378fb%40iki.fi

Thanks
Richard
Attachments

Re: An improvement on parallel DISTINCT

From: David Rowley
On Wed, 27 Dec 2023 at 00:23, Richard Guo <guofenglinux@gmail.com> wrote:
> -- on master
> EXPLAIN (costs off)
> SELECT DISTINCT four FROM tenk1;
>                      QUERY PLAN
> ----------------------------------------------------
>  Unique
>    ->  Sort
>          Sort Key: four
>          ->  Gather
>                Workers Planned: 2
>                ->  HashAggregate
>                      Group Key: four
>                      ->  Parallel Seq Scan on tenk1
> (8 rows)
>
> -- on patched
> EXPLAIN (costs off)
> SELECT DISTINCT four FROM tenk1;
>                      QUERY PLAN
> ----------------------------------------------------
>  Unique
>    ->  Gather Merge
>          Workers Planned: 2
>          ->  Sort
>                Sort Key: four
>                ->  HashAggregate
>                      Group Key: four
>                      ->  Parallel Seq Scan on tenk1
> (8 rows)
>
> I believe the second plan is better.

I wonder if this change is worthwhile. The sort is only required at
all because the planner opted to HashAggregate in phase 1, which
outputs its rows unordered. If phase 1 were done by Group Aggregate,
then no sorting would be needed.  The only reason the planner didn't
Hash Aggregate for phase 2 is the order in which we generate the
distinct paths, together with STD_FUZZ_FACTOR.

Look at the costs of the above plan:

Unique  (cost=397.24..397.28 rows=4 width=4)

If I set enable_sort=0, then I get a cheaper plan:

 HashAggregate  (cost=397.14..397.18 rows=4 width=4)

If we add more rows then the cost of sorting will grow faster than the
cost of hash aggregate due to the O(N log2 N) part of our sort
costing.
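
(If memory serves, that term comes from the in-memory sort costing in
costsize.c, roughly:

    /* comparison_cost defaults to 2.0 * cpu_operator_cost */
    startup_cost += comparison_cost * tuples * LOG2(tuples);

so the sort's CPU cost grows as O(N log2 N), while HashAggregate's CPU
cost grows linearly in N.)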

If I drop the index on tenk1(hundred), I only need to go to the
"hundred" column to have it switch to Hash Aggregate in the 2nd phase.
This is because the number of distinct groups puts the costs of the
Group Aggregate and Hash Aggregate paths more than STD_FUZZ_FACTOR
apart.  Adjusting STD_FUZZ_FACTOR as follows means Hash Aggregate is
used for both phases.

-#define STD_FUZZ_FACTOR 1.01
+#define STD_FUZZ_FACTOR 1.0000001
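
(For anyone following along: compare_path_costs_fuzzily() in
pathnode.c treats two paths as costing the same unless one is more
than fuzz_factor times as expensive as the other, essentially:

    if (path1->total_cost > path2->total_cost * fuzz_factor)
        /* path1 is fuzzily worse on total cost */

so with the default STD_FUZZ_FACTOR of 1.01, total costs within 1% of
each other are considered equal.)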

In light of this, do you still think it's worthwhile making this change?

For me, I think all it's going to result in is extra planner work
without any performance gains.

> Attached is a patch that includes this change and also eliminates the
> usage of root->upper_targets[] in the core code.  It also makes some
> tweaks for the comment.

We should fix that.  We can consider it independently from the other
change you're proposing.

David



Re: An improvement on parallel DISTINCT

From: Richard Guo

On Fri, Feb 2, 2024 at 11:26 AM David Rowley <dgrowleyml@gmail.com> wrote:
> In light of this, do you still think it's worthwhile making this change?
>
> For me, I think all it's going to result in is extra planner work
> without any performance gains.

Hmm, with the query below, I can see that the new plan is cheaper than
the old plan, and the cost difference exceeds STD_FUZZ_FACTOR.

create table t (a int, b int);
insert into t select i%100000, i from generate_series(1,10000000)i;
analyze t;

-- on master
explain (costs on) select distinct a from t order by a limit 1;
                                            QUERY PLAN
--------------------------------------------------------------------------------------------------
 Limit  (cost=120188.50..120188.51 rows=1 width=4)
   ->  Sort  (cost=120188.50..120436.95 rows=99379 width=4)
         Sort Key: a
         ->  HashAggregate  (cost=118697.82..119691.61 rows=99379 width=4)
               Group Key: a
               ->  Gather  (cost=97331.33..118200.92 rows=198758 width=4)
                     Workers Planned: 2
                     ->  HashAggregate  (cost=96331.33..97325.12 rows=99379 width=4)
                           Group Key: a
                           ->  Parallel Seq Scan on t  (cost=0.00..85914.67 rows=4166667 width=4)
(10 rows)

-- on patched
explain (costs on) select distinct a from t order by a limit 1;
                                            QUERY PLAN
--------------------------------------------------------------------------------------------------
 Limit  (cost=106573.93..106574.17 rows=1 width=4)
   ->  Unique  (cost=106573.93..130260.88 rows=99379 width=4)
         ->  Gather Merge  (cost=106573.93..129763.98 rows=198758 width=4)
               Workers Planned: 2
               ->  Sort  (cost=105573.91..105822.35 rows=99379 width=4)
                     Sort Key: a
                     ->  HashAggregate  (cost=96331.33..97325.12 rows=99379 width=4)
                           Group Key: a
                           ->  Parallel Seq Scan on t  (cost=0.00..85914.67 rows=4166667 width=4)
(9 rows)

It seems that including a LIMIT clause can potentially favor the new
plan: because Gather Merge preserves the sort order, the Unique node
can emit its first rows without first sorting the whole gathered
result, which gives the patched plan a much lower startup cost.

Thanks
Richard

Re: An improvement on parallel DISTINCT

From: David Rowley
On Fri, 2 Feb 2024 at 20:47, Richard Guo <guofenglinux@gmail.com> wrote:
>
>
> On Fri, Feb 2, 2024 at 11:26 AM David Rowley <dgrowleyml@gmail.com> wrote:
>>
>> In light of this, do you still think it's worthwhile making this change?
>>
>> For me, I think all it's going to result in is extra planner work
>> without any performance gains.
>
>
> Hmm, with the query below, I can see that the new plan is cheaper than
> the old plan, and the cost difference exceeds STD_FUZZ_FACTOR.
>
> create table t (a int, b int);
> insert into t select i%100000, i from generate_series(1,10000000)i;
> analyze t;
>
> explain (costs on) select distinct a from t order by a limit 1;

OK, a LIMIT clause... I didn't think of that.  Given the test results
below, I'm pretty convinced we should make the change.

Performance testing on an AMD 3990x with work_mem=4MB and hash_mem_multiplier=2.

$ cat bench.sql
select distinct a from t order by a limit 1;
$ pgbench -n -T 60 -f bench.sql postgres

-- Master

max_parallel_workers_per_gather=2;
latency average = 470.310 ms
latency average = 468.673 ms
latency average = 469.463 ms

max_parallel_workers_per_gather=4;
latency average = 346.012 ms
latency average = 346.662 ms
latency average = 347.591 ms

max_parallel_workers_per_gather=8; + alter table t set (parallel_workers=8);
latency average = 300.298 ms
latency average = 300.029 ms
latency average = 300.314 ms

-- Patched

max_parallel_workers_per_gather=2;
latency average = 424.176 ms
latency average = 431.870 ms
latency average = 431.870 ms (9.36% faster than master)

max_parallel_workers_per_gather=4;
latency average = 279.837 ms
latency average = 280.893 ms
latency average = 281.518 ms (23.51% faster than master)

max_parallel_workers_per_gather=8; + alter table t set (parallel_workers=8);
latency average = 178.585 ms
latency average = 178.780 ms
latency average = 179.768 ms (67.68% faster than master)

So the gains increase with more parallel workers due to pushing more
work to the workers. Amdahl's law approves of this.
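
(As a reminder, Amdahl's law puts the best-case speedup at
1 / ((1 - p) + p / n) for a parallel fraction p on n workers, so
shrinking the serial portion of the plan pays off more as the worker
count grows.)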

I'll push the patch shortly.

David



Re: An improvement on parallel DISTINCT

From: David Rowley
On Fri, 2 Feb 2024 at 23:39, David Rowley <dgrowleyml@gmail.com> wrote:
> I'll push the patch shortly.

I've pushed the partial path sort part.

Now for the other stuff you had.   I didn't really like this part:

+ /*
+  * Set target for partial_distinct_rel as generate_useful_gather_paths
+  * requires that the input rel has a valid reltarget.
+  */
+ partial_distinct_rel->reltarget = cheapest_partial_path->pathtarget;

I think we should just make it work the same way as
create_grouping_paths(), where grouping_target is passed as a
parameter.

I've done it that way in the attached.
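
(Roughly, the signature becomes something like the following; this is
a sketch, the exact hunk is in the attached patch:

 static void create_partial_distinct_paths(PlannerInfo *root,
                                           RelOptInfo *input_rel,
-                                          RelOptInfo *final_distinct_rel);
+                                          RelOptInfo *final_distinct_rel,
+                                          PathTarget *target);

with the caller passing the distinct PathTarget down instead of the
function deriving a reltarget from the cheapest partial path.)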

David

Attachments

Re: An improvement on parallel DISTINCT

From: Richard Guo

On Fri, Feb 2, 2024 at 6:39 PM David Rowley <dgrowleyml@gmail.com> wrote:
> So the gains increase with more parallel workers due to pushing more
> work to the workers. Amdahl's law approves of this.
>
> I'll push the patch shortly.

Thanks for the detailed testing and pushing the patch!

Thanks
Richard

Re: An improvement on parallel DISTINCT

From: Richard Guo

On Fri, Feb 2, 2024 at 7:36 PM David Rowley <dgrowleyml@gmail.com> wrote:
> Now for the other stuff you had.  I didn't really like this part:
>
> + /*
> +  * Set target for partial_distinct_rel as generate_useful_gather_paths
> +  * requires that the input rel has a valid reltarget.
> +  */
> + partial_distinct_rel->reltarget = cheapest_partial_path->pathtarget;
>
> I think we should just make it work the same way as
> create_grouping_paths(), where grouping_target is passed as a
> parameter.
>
> I've done it that way in the attached.

The change looks good to me.

BTW, I kind of doubt that 'create_partial_distinct_paths' is a proper
function name given what it actually does.  It not only generates
distinct paths based on input_rel's partial paths, but also adds
Gather/GatherMerge on top of these partially distinct paths, followed by
a final unique/aggregate path to ensure uniqueness of the final result.
So maybe 'create_parallel_distinct_paths' or something like that would
be better?

I ask because I noticed that in create_partial_grouping_paths(), we
only generate partially aggregated paths, and the subsequent
FinalizeAggregate step is added in the caller.

Thanks
Richard

Re: An improvement on parallel DISTINCT

From: David Rowley
On Mon, 5 Feb 2024 at 14:42, Richard Guo <guofenglinux@gmail.com> wrote:
>
>
> On Fri, Feb 2, 2024 at 7:36 PM David Rowley <dgrowleyml@gmail.com> wrote:
>> I think we should just make it work the same way as
>> create_grouping_paths(), where grouping_target is passed as a
>> parameter.
>>
>> I've done it that way in the attached.
>
>
> The change looks good to me.

I pushed the PathTarget changes.

> BTW, I kind of doubt that 'create_partial_distinct_paths' is a proper
> function name given what it actually does.

I didn't make any changes here. I don't think messing with this is
worth the trouble.

David