Discussion: Eager aggregation, take 3


Eager aggregation, take 3

From
Richard Guo
Date:
Hi All,

Eager aggregation is a query optimization technique that partially
pushes a group-by past a join, and finalizes it once all the relations
are joined.  Eager aggregation reduces the number of input rows to the
join and thus may result in a better overall plan.  This technique is
thoroughly described in the 'Eager Aggregation and Lazy Aggregation'
paper [1].
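
To sketch the idea in SQL (hypothetical tables; in the patch the split
into partial and final aggregation happens inside the planner, not by
query rewriting), eager aggregation conceptually transforms the first
query below into the second:

create table t1 (a int, b int);
create table t2 (b int, c int);

-- aggregate applied after the join
select t1.a, sum(t2.c)
from t1 join t2 on t1.b = t2.b
group by t1.a;

-- conceptually equivalent form with t2 partially aggregated on its
-- join key before the join, reducing the join's input rows
select t1.a, sum(t2.sum_c)
from t1 join (select b, sum(c) as sum_c
              from t2
              group by b) t2 on t1.b = t2.b
group by t1.a;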

Back in 2017, a patch set was proposed by Antonin Houska to implement
eager aggregation in thread [2].  However, it was ultimately withdrawn
after entering the pattern of "please rebase thx" followed by rebasing
and getting no feedback until "please rebase again thx".  A second
attempt in 2022 unfortunately fell into the same pattern and was
eventually closed about one year ago [3].

That patch set included most of the concepts necessary to implement
eager aggregation.  However, as far as I can see, it has several weak
points that we need to address.  It introduces invasive changes to some
core planner functions, such as make_join_rel(), and with such changes
join_is_legal() would be performed three times for the same proposed
join, which is not great.  Another weak point is that the complexity of
the join search increases dramatically with the number of relations to
be joined.  This occurs because, when we generate partially aggregated
paths, each path of the input relation is considered as an input path
for the grouped paths.  As a result, the number of grouped paths we
generate grows exponentially, leading to a significant explosion in
computational complexity.  Other weak points include the lack of
support for outer joins and partitionwise joins.  And during my review
of the code, I came across several bugs (planning errors or crashes)
that need to be addressed.

I'd like to give implementing eager aggregation another take, while
borrowing many concepts and chunks of code from the previous patch
set.  Please see attached.  I have chosen to use the term 'Eager
Aggregation' from the paper [1] instead of 'Aggregation push-down', to
differentiate it from the aggregation push-down technique used for
FDWs.

The patch has been split into small pieces to make the review easier.

0001 introduces the RelInfoList structure, which encapsulates both a
list and a hash table, so that we can leverage the hash table for faster
lookups not only for join relations but also for upper relations.  With
eager aggregation, it is possible that we generate so many upper rels of
type UPPERREL_PARTIAL_GROUP_AGG that a hash table can help a lot with
lookups.

0002 introduces the RelAggInfo structure to store information needed to
create grouped paths for base and join rels.  It also revises the
RelInfoList related structures and functions so that they can be used
with RelAggInfos.

0003 checks if eager aggregation is applicable, and if so, collects
suitable aggregate expressions and grouping expressions in the query,
and records them in root->agg_clause_list and root->group_expr_list
respectively.

0004 implements the functions that check if eager aggregation is
applicable for a given relation, and if so, create the RelAggInfo
structure for the relation, using the infos about aggregate expressions
and grouping expressions we collected earlier.  In this patch, when we
check whether a target expression can act as a grouping expression, we
need to check whether this expression is known equal, due to ECs, to
other expressions that can act as grouping expressions.  This patch
leverages the function exprs_known_equal() to achieve that, after
enhancing it to consider the opfamily if provided.
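
For example (a sketch using the hypothetical tables from above): in the
query below, the EC derived from the join clause t1.b = t2.b tells us
that t2.b is known equal to the grouping expression t1.b, so t2.b can
act as the grouping key for a partial aggregation pushed down to t2:

select t1.b, sum(t2.c)
from t1 join t2 on t1.b = t2.b
group by t1.b;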

0005 implements the functions that generate paths for grouped relations
by adding sorted and hashed partial aggregation paths on top of paths of
the plain base or join relations.  For sorted partial aggregation
paths, we consider only suitably-sorted input paths, plus sorting the
cheapest-total path.  For hashed partial aggregation paths, we consider
only the cheapest-total path as input.  By not considering other paths
we keep the number of grouped paths as small as possible while still
achieving reasonable results.

0006 builds grouped relations for each base relation if possible, and
generates aggregation paths for the grouped base relations.

0007 builds grouped relations for each just-processed join relation if
possible, and generates aggregation paths for the grouped join
relations.  The changes made to make_join_rel() are relatively minor,
with the addition of a new function make_grouped_join_rel(), which finds
or creates a grouped relation for the just-processed joinrel, and
generates grouped paths by joining a grouped input relation with a
non-grouped input relation.

The other way to generate grouped paths is by adding sorted and hashed
partial aggregation paths on top of paths of the joinrel.  This occurs
in standard_join_search(), after we've run set_cheapest() for the
joinrel.  The reason for performing this step after set_cheapest() is
that we need to know the joinrel's cheapest paths (see 0005).

This patch also makes the grouped relation for the topmost join rel act
as the upper rel representing the result of partial aggregation, so that
we can add the final aggregation on top of that.  Additionally, this
patch extends the functionality of eager aggregation to work with
partitionwise join and geqo.

This patch also makes eager aggregation work with outer joins.  With
an outer join, the aggregate cannot be pushed down if any column
referenced by grouping expressions or aggregate functions is nullable
by an outer join above the relation to which we want to apply the
partial aggregation.  Thanks to Tom's outer-join-aware-Var
infrastructure, we can easily identify such situations and
subsequently refrain from pushing down the aggregates.

Starting from this patch, you should be able to see plans with eager
aggregation.
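
For example, a minimal way to see one (a sketch using the hypothetical
tables from above and the enable_eager_aggregate GUC added by this
patch set):

set enable_eager_aggregate = on;

explain (verbose, costs off)
select t1.a, sum(t2.c)
from t1 join t2 on t1.b = t2.b
group by t1.a;

-- expected plan shape: a Finalize Aggregate above the join, with a
-- Partial Aggregate grouped by t2.b on the t2 side of the join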

0008 adds test cases for eager aggregation.

0009 adds a section in README that describes this feature (copied from
previous patch set, with minor tweaks).

Thoughts and comments are welcome.

Attachments

Re: Eager aggregation, take 3

From
Andy Fan
Date:
Richard Guo <guofenglinux@gmail.com> writes:

> Hi All,
>
> Eager aggregation is a query optimization technique that partially
> pushes a group-by past a join, and finalizes it once all the relations
> are joined.  Eager aggregation reduces the number of input rows to the
> join and thus may result in a better overall plan.  This technique is
> thoroughly described in the 'Eager Aggregation and Lazy Aggregation'
> paper [1].

This is a really helpful but not easy task.  Even though I am not sure
when I can spend time to study this, I want to say "Thanks for working
on this!" first, and I hope we can really make progress on this topic.
Good luck!

-- 
Best Regards
Andy Fan




Re: Eager aggregation, take 3

From
Richard Guo
Date:

On Mon, Mar 4, 2024 at 7:49 PM Andy Fan <zhihuifan1213@163.com> wrote:
This is a really helpful but not easy task.  Even though I am not sure
when I can spend time to study this, I want to say "Thanks for working
on this!" first, and I hope we can really make progress on this topic.
Good luck!

Thanks.  I hope this take can go even further and ultimately find its
way to being committed.

This needs a rebase after dbbca2cf29.  I also revised the commit message
for 0007 and fixed a typo in 0009.

Thanks
Richard
Attachments

Re: Eager aggregation, take 3

From
Richard Guo
Date:

On Tue, Mar 5, 2024 at 7:19 PM Richard Guo <guofenglinux@gmail.com> wrote:
Here is another rebase, mainly to make the test cases more stable by
adding ORDER BY clauses to the test queries.  Also fixed more typos in
passing.

This needs another rebase after 97d85be365.  I also addressed several
issues that I identified during self-review, which include:

* In some cases GroupPathExtraData.agg_final_costs, which is the cost of
final aggregation, fails to be calculated.  This can lead to bogus cost
estimates and end up with an unexpected plan.

* If the cheapest partially grouped path is generated through eager
aggregation, the number of groups estimated for the final phase will be
different from the number of groups estimated for non-split aggregation.
That is to say, we should not use 'dNumGroups' for the final aggregation
in add_paths_to_grouping_rel().

* It is possible that we may generate dummy grouped join relations, and
that would trigger the Assert in make_grouped_join_rel().

* More typo fixes.

Thanks
Richard
Attachments

Re: Eager aggregation, take 3

From
Richard Guo
Date:
Here is an update of the patchset with the following changes:

* Fix an 'Aggref found where not expected' error caused by the PVC call
in is_var_in_aggref_only.  This would happen if we have Aggrefs
contained in other expressions.

* Use the joinrel's relids rather than the union of the relids of its
outer and inner rels to search for its grouped rel.  This is more
correct, as we need to take OJs into consideration.

* Remove RelAggInfo.agg_exprs as it is not used anymore.

Thanks
Richard
Attachments

Re: Eager aggregation, take 3

From
Richard Guo
Date:
Another rebase is needed after d1d286d83c.  Also, I realized that the
partially_grouped_rel generated by eager aggregation might be dummy,
such as in the query:

select count(t2.c) from t t1 join t t2 on t1.b = t2.b where false group by t1.a;

If somehow we choose this dummy path with a Finalize Agg Path on top of
it as the final cheapest path (a very rare case), we would encounter the
"Aggref found in non-Agg plan node" error.  The v7 patch fixes this
issue.

Thanks
Richard
Attachments

Re: Eager aggregation, take 3

From
Richard Guo
Date:
On Thu, Jun 13, 2024 at 4:07 PM Richard Guo <guofenglinux@gmail.com> wrote:
> I spent some time testing this patchset and found a few more issues.
> ...

> Hence here is the v8 patchset, with fixes for all the above issues.

I found an 'ORDER/GROUP BY expression not found in targetlist' error
with this patchset, with the query below:

create table t (a boolean);

set enable_eager_aggregate to on;

explain (costs off)
select min(1) from t t1 left join t t2 on t1.a group by (not (not
t1.a)), t1.a order by t1.a;
ERROR:  ORDER/GROUP BY expression not found in targetlist

This happens because the two grouping items are actually the same and
standard_qp_callback would remove one of them.  The fully-processed
groupClause is kept in root->processed_groupClause.  However, when
collecting grouping expressions in create_grouping_expr_infos, we are
checking parse->groupClause, which is incorrect.

The fix is straightforward: check root->processed_groupClause instead.

Here is a new rebase with this fix.

Thanks
Richard

Attachments

Re: Eager aggregation, take 3

From
Paul George
Date:
Richard:

Thanks for reviving this patch and for all of your work on it! Eager aggregation pushdown will be beneficial for my work and I'm hoping to see it land.


I was playing around with v9 of the patches and was specifically curious about this previous statement...

>This patch also makes eager aggregation work with outer joins.  With
>an outer join, the aggregate cannot be pushed down if any column
>referenced by grouping expressions or aggregate functions is nullable
>by an outer join above the relation to which we want to apply the
>partial aggregation.  Thanks to Tom's outer-join-aware-Var
>infrastructure, we can easily identify such situations and
>subsequently refrain from pushing down the aggregates.

 ...and this related comment in eager_aggregate.out:

>-- Ensure aggregation cannot be pushed down to the nullable side

While I'm new to this work and its subtleties, I'm wondering if this is too broad a condition.

I modified the first test query in eager_aggregate.sql to make it a LEFT JOIN and eager aggregation indeed did not happen, which is expected based on the comments upthread.

query:
SET enable_eager_aggregate=ON;
EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.a, sum(t2.c) FROM eager_agg_t1 t1 LEFT JOIN eager_agg_t2 t2 ON t1.b = t2.b GROUP BY t1.a ORDER BY t1.a;

plan:
                         QUERY PLAN                        
------------------------------------------------------------
 GroupAggregate
   Output: t1.a, sum(t2.c)
   Group Key: t1.a
   ->  Sort
         Output: t1.a, t2.c
         Sort Key: t1.a
         ->  Hash Right Join
               Output: t1.a, t2.c
               Hash Cond: (t2.b = t1.b)
               ->  Seq Scan on public.eager_agg_t2 t2
                     Output: t2.a, t2.b, t2.c
               ->  Hash
                     Output: t1.a, t1.b
                     ->  Seq Scan on public.eager_agg_t1 t1
                           Output: t1.a, t1.b
(15 rows)

(NOTE: I changed the aggregate from avg(...) to sum(...) for simplicity)

But, it seems that eager aggregation for the query above can be "replicated" as:

query:

EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.a, sum(t2.c)
FROM eager_agg_t1 t1
LEFT JOIN (
    SELECT b, sum(c) c
    FROM eager_agg_t2 t2p
    GROUP BY b
) t2 ON t1.b = t2.b
GROUP BY t1.a
ORDER BY t1.a;

The output of both the original query and this one match (and the plans with eager aggregation and the subquery are nearly identical if you restore the LEFT JOIN to a JOIN). I admittedly may be missing a subtlety, but does this mean that there are conditions under which eager aggregation can be pushed down to the nullable side?


-Paul-

On Sat, Jul 6, 2024 at 4:56 PM Richard Guo <guofenglinux@gmail.com> wrote:
On Thu, Jun 13, 2024 at 4:07 PM Richard Guo <guofenglinux@gmail.com> wrote:
> I spent some time testing this patchset and found a few more issues.
> ...

> Hence here is the v8 patchset, with fixes for all the above issues.

I found an 'ORDER/GROUP BY expression not found in targetlist' error
with this patchset, with the query below:

create table t (a boolean);

set enable_eager_aggregate to on;

explain (costs off)
select min(1) from t t1 left join t t2 on t1.a group by (not (not
t1.a)), t1.a order by t1.a;
ERROR:  ORDER/GROUP BY expression not found in targetlist

This happens because the two grouping items are actually the same and
standard_qp_callback would remove one of them.  The fully-processed
groupClause is kept in root->processed_groupClause.  However, when
collecting grouping expressions in create_grouping_expr_infos, we are
checking parse->groupClause, which is incorrect.

The fix is straightforward: check root->processed_groupClause instead.

Here is a new rebase with this fix.

Thanks
Richard

Re: Eager aggregation, take 3

From
Richard Guo
Date:
On Sun, Jul 7, 2024 at 10:45 AM Paul George <p.a.george19@gmail.com> wrote:
> Thanks for reviving this patch and for all of your work on it! Eager aggregation pushdown will be beneficial for my work and I'm hoping to see it land.

Thanks for looking at this patch!

> The output of both the original query and this one match (and the plans with eager aggregation and the subquery are nearly identical if you restore the LEFT JOIN to a JOIN). I admittedly may be missing a subtlety, but does this mean that there are conditions under which eager aggregation can be pushed down to the nullable side?

I think it's a very risky thing to push a partial aggregation down to
the nullable side of an outer join, because the NULL-extended rows
produced by the outer join would not be available when we perform the
partial aggregation, while with a non-eager-aggregation plan these
rows are available for the top-level aggregation.  This may put the
rows into groups in a different way than expected, or get wrong values
from the aggregate functions.  I've managed to compose an example:

create table t (a int, b int);
insert into t select 1, 1;

select t2.a, count(*) from t t1 left join t t2 on t2.b > 1 group by
t2.a having t2.a is null;
 a | count
---+-------
   |     1
(1 row)

This is the expected result, because after the outer join we have got
a NULL-extended row.

But if we somehow push down the partial aggregation to the nullable
side of this outer join, we would get a wrong result.

explain (costs off)
select t2.a, count(*) from t t1 left join t t2 on t2.b > 1 group by
t2.a having t2.a is null;
                QUERY PLAN
-------------------------------------------
 Finalize HashAggregate
   Group Key: t2.a
   ->  Nested Loop Left Join
         Filter: (t2.a IS NULL)
         ->  Seq Scan on t t1
         ->  Materialize
               ->  Partial HashAggregate
                     Group Key: t2.a
                     ->  Seq Scan on t t2
                           Filter: (b > 1)
(10 rows)

select t2.a, count(*) from t t1 left join t t2 on t2.b > 1 group by
t2.a having t2.a is null;
 a | count
---+-------
   |     0
(1 row)

I believe there are cases where pushing a partial aggregation down to
the nullable side of an outer join can be safe, but I doubt that there
is an easy way to identify these cases and do the push-down for them.
So for now I think we'd better refrain from doing that.

Thanks
Richard



Re: Eager aggregation, take 3

From
Paul George
Date:
Hey Richard,

Looking more closely at this example

>select t2.a, count(*) from t t1 left join t t2 on t2.b > 1 group by t2.a having t2.a is null;

I wonder if the inability to exploit eager aggregation is more based on the fact that COUNT(*) cannot be decomposed into an aggregation of PARTIAL COUNT(*)s (apologies if my terminology is off/made up...I'm new to the codebase). In other words, is it the case that a given aggregate function already has built-in protection against the error case you correctly pointed out?

To highlight this, in the simple example below we don't see aggregate pushdown even with an INNER JOIN when the agg function is COUNT(*) but we do when it's COUNT(t2.*):

-- same setup
drop table if exists t;
create table t(a int, b int, c int);
insert into t select i % 100, i % 10, i from generate_series(1, 1000) i;
analyze t;

-- query 1: COUNT(*) --> no pushdown

set enable_eager_aggregate=on;
explain (verbose, costs off) select t1.a, count(*) from t t1 join t t2 on t1.a=t2.a group by t1.a;

                QUERY PLAN                
-------------------------------------------
 HashAggregate
   Output: t1.a, count(*)
   Group Key: t1.a
   ->  Hash Join
         Output: t1.a
         Hash Cond: (t1.a = t2.a)
         ->  Seq Scan on public.t t1
               Output: t1.a, t1.b, t1.c
         ->  Hash
               Output: t2.a
               ->  Seq Scan on public.t t2
                     Output: t2.a
(12 rows)


-- query 2: COUNT(t2.*) --> agg pushdown

set enable_eager_aggregate=on;
explain (verbose, costs off) select t1.a, count(t2.*) from t t1 join t t2 on t1.a=t2.a group by t1.a;

                      QUERY PLAN                      
-------------------------------------------------------
 Finalize HashAggregate
   Output: t1.a, count(t2.*)
   Group Key: t1.a
   ->  Hash Join
         Output: t1.a, (PARTIAL count(t2.*))
         Hash Cond: (t1.a = t2.a)
         ->  Seq Scan on public.t t1
               Output: t1.a, t1.b, t1.c
         ->  Hash
               Output: t2.a, (PARTIAL count(t2.*))
               ->  Partial HashAggregate
                     Output: t2.a, PARTIAL count(t2.*)
                     Group Key: t2.a
                     ->  Seq Scan on public.t t2
                           Output: t2.*, t2.a
(15 rows)

...while it might be true that COUNT(*) ... INNER JOIN should allow eager agg pushdown (I haven't thought deeply about it, TBH), I did find this result pretty interesting.


-Paul

On Wed, Jul 10, 2024 at 1:27 AM Richard Guo <guofenglinux@gmail.com> wrote:
On Sun, Jul 7, 2024 at 10:45 AM Paul George <p.a.george19@gmail.com> wrote:
> Thanks for reviving this patch and for all of your work on it! Eager aggregation pushdown will be beneficial for my work and I'm hoping to see it land.

Thanks for looking at this patch!

> The output of both the original query and this one match (and the plans with eager aggregation and the subquery are nearly identical if you restore the LEFT JOIN to a JOIN). I admittedly may be missing a subtlety, but does this mean that there are conditions under which eager aggregation can be pushed down to the nullable side?

I think it's a very risky thing to push a partial aggregation down to
the nullable side of an outer join, because the NULL-extended rows
produced by the outer join would not be available when we perform the
partial aggregation, while with a non-eager-aggregation plan these
rows are available for the top-level aggregation.  This may put the
rows into groups in a different way than expected, or get wrong values
from the aggregate functions.  I've managed to compose an example:

create table t (a int, b int);
insert into t select 1, 1;

select t2.a, count(*) from t t1 left join t t2 on t2.b > 1 group by
t2.a having t2.a is null;
 a | count
---+-------
   |     1
(1 row)

This is the expected result, because after the outer join we have got
a NULL-extended row.

But if we somehow push down the partial aggregation to the nullable
side of this outer join, we would get a wrong result.

explain (costs off)
select t2.a, count(*) from t t1 left join t t2 on t2.b > 1 group by
t2.a having t2.a is null;
                QUERY PLAN
-------------------------------------------
 Finalize HashAggregate
   Group Key: t2.a
   ->  Nested Loop Left Join
         Filter: (t2.a IS NULL)
         ->  Seq Scan on t t1
         ->  Materialize
               ->  Partial HashAggregate
                     Group Key: t2.a
                     ->  Seq Scan on t t2
                           Filter: (b > 1)
(10 rows)

select t2.a, count(*) from t t1 left join t t2 on t2.b > 1 group by
t2.a having t2.a is null;
 a | count
---+-------
   |     0
(1 row)

I believe there are cases where pushing a partial aggregation down to
the nullable side of an outer join can be safe, but I doubt that there
is an easy way to identify these cases and do the push-down for them.
So for now I think we'd better refrain from doing that.

Thanks
Richard

Re: Eager aggregation, take 3

From
Robert Haas
Date:
On Wed, Aug 21, 2024 at 3:11 AM Richard Guo <guofenglinux@gmail.com> wrote:
> Attached is the updated version of the patchset that fixes this bug
> and includes further code refactoring.

Here are some initial, high-level thoughts about this patch set.

1. As far as I can see, there's no real performance testing on this
thread. I expect that it's possible to show an arbitrarily large gain
for the patch by finding a case where partial aggregation is way
better than anything we currently know, but that's not very
interesting. What I think would be useful to do is find a corpus of
existing queries on an existing data set and try them with and without
the patch and see which query plans change and whether they're
actually better. For example, maybe TPC-H or the subset of TPC-DS that
we can actually run would be a useful starting point. One could then
also measure how much the planning time increases with the patch to
get a sense of what the overhead of enabling this feature would be.
Even if it's disabled by default, people aren't going to want to
enable it if it causes planning times to become much longer on many
queries for which there is no benefit.

2. I think there might be techniques we could use to limit planning
effort at an earlier stage when the approach doesn't appear promising.
For example, if the proposed grouping column is already unique, the
exercise is pointless (I think). Ideally we'd like to detect that
without even creating the grouped_rel. But the proposed grouping
column might also be *mostly* unique. For example, consider a table
with a million rows and a column with 500,000 distinct values. I suspect it
will be difficult for partial aggregation to work out to a win in a
case like this, because I think that the cost of performing the
partial aggregation will not reduce the cost either of the final
aggregation or of the intervening join steps by enough to compensate.
It would be best to find a way to avoid generating a lot of rels and
paths in cases where there's really not much hope of a win.
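
To make that concrete, here is an illustrative setup (table and column
names made up) where each grouping value covers only about two rows, so
partial aggregation can at best halve the rows flowing into the join:

create table fact (dim_id int, val numeric);
insert into fact select i % 500000, i from generate_series(1, 1000000) i;
analyze fact;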

One could, perhaps, imagine going further with this by postponing
eager aggregation planning until after regular paths have been built,
so that we have good cardinality estimates. Suppose the query joins a
single fact table to a series of dimension tables. The final plan thus
uses the fact table as the driving table and joins to the dimension
tables one by one. Do we really need to consider partial aggregation
at every level? Perhaps just where there's been a significant row
count reduction since the last time we tried it, but at the next level
the row count will increase again?

Maybe there are other heuristics we could use in addition or instead.

3. In general, we are quite bad at estimating what will happen to the
row count after an aggregation, and we have no real idea what the
distribution of values will be. That might be a problem for this
patch, because it seems like the decisions we will make about where to
perform the partial aggregation might end up being quite random. At
the top of the join tree, I'll need to compare directly aggregating
the best join path with various paths that involve a finalize
aggregation step at the top and a partial aggregation step further
down. But my cost estimates and row counts for the partial aggregate
steps seem like they will often be quite poor, which means that the
plans that use those partial aggregate steps might also be quite poor.
Even if they're not, I fear that comparing the cost of those
PartialAggregate-Join(s)-FinalizeAggregate paths to the direct
Aggregate path will look too much like comparing random numbers. We
need to know whether the combination of the FinalizeAggregate step and
the PartialAggregate step will be more or less expensive than a plain
old Aggregate, but how can we tell that if we don't have accurate
cardinality estimates?

Thanks for working on this.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Eager aggregation, take 3

From
Tender Wang
Date:


On Wed, Aug 21, 2024 at 3:11 PM Richard Guo <guofenglinux@gmail.com> wrote:
On Fri, Aug 16, 2024 at 4:14 PM Richard Guo <guofenglinux@gmail.com> wrote:
> I had a self-review of this patchset and made some refactoring,
> especially to the function that creates the RelAggInfo structure for a
> given relation.  While there were no major changes, the code should
> now be simpler.

I found a bug in v10 patchset: when we generate the GROUP BY clauses
for the partial aggregation that is pushed down to a non-aggregated
relation, we may produce a clause with a tleSortGroupRef that
duplicates one already present in the query's groupClause, which would
cause problems.

Attached is the updated version of the patchset that fixes this bug
and includes further code refactoring.

Recently, I did some benchmark tests, mainly on tpch and tpcds.
The tpch tests have no plan diffs, so I did not continue to test on tpch.
The tpcds (10GB) tests have 22 plan diffs, as below:
4.sql, 5.sql, 8.sql,11.sql,19.sql,23.sql,31.sql, 33.sql,39.sql,45.sql,46.sql,47.sql,53.sql,
56.sql,57.sql,60.sql,63.sql,68.sql,74.sql,77.sql,80.sql,89.sql

I haven't looked at all of them. I just picked a few simple plan tests (e.g. 19.sql, 45.sql).
For example, on 19.sql, eager agg pushdown doesn't get a large gain, but shows a little
performance regression.

I will continue to benchmark this feature.

[1] https://github.com/tenderwg/eager_agg


--
Tender Wang

Re: Eager aggregation, take 3

From
Robert Haas
Date:
On Tue, Aug 27, 2024 at 11:57 PM Tender Wang <tndrwang@gmail.com> wrote:
> Recently, I did some benchmark tests, mainly on tpch and tpcds.
> tpch tests have no plan diffs, so I did not continue to test on tpch.

Interesting to know.

> tpcds (10GB) tests have 22 plan diffs as below:
> 4.sql, 5.sql, 8.sql,11.sql,19.sql,23.sql,31.sql, 33.sql,39.sql,45.sql,46.sql,47.sql,53.sql,
> 56.sql,57.sql,60.sql,63.sql,68.sql,74.sql,77.sql,80.sql,89.sql

OK.

> I haven't looked at all of them. I just picked a few simple plan tests (e.g. 19.sql, 45.sql).
> For example, 19.sql, eager agg pushdown doesn't get a large gain, but a little
> performance regression.

Yeah, this is one of the things I was worried about in my previous
reply to Richard. It would be worth Richard, or someone, probing into
exactly why that's happening. My fear is that we just don't have good
enough estimates to make good decisions, but there might well be
another explanation.

> I will continue to benchmark this feature.
>
> [1] https://github.com/tenderwg/eager_agg

Thanks!

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Eager aggregation, take 3

From
Richard Guo
Date:
On Fri, Aug 23, 2024 at 11:59 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Here are some initial, high-level thoughts about this patch set.

Thank you for your review and feedback!  It helps a lot in moving this
work forward.

> 1. As far as I can see, there's no real performance testing on this
> thread. I expect that it's possible to show an arbitrarily large gain
> for the patch by finding a case where partial aggregation is way
> better than anything we currently know, but that's not very
> interesting. What I think would be useful to do is find a corpus of
> existing queries on an existing data set and try them with and without
> the patch and see which query plans change and whether they're
> actually better. For example, maybe TPC-H or the subset of TPC-DS that
> we can actually run would be a useful starting point. One could then
> also measure how much the planning time increases with the patch to
> get a sense of what the overhead of enabling this feature would be.
> Even if it's disabled by default, people aren't going to want to
> enable it if it causes planning times to become much longer on many
> queries for which there is no benefit.

Right.  I haven’t had time to run any benchmarks yet, but that is
something I need to do.

> 2. I think there might be techniques we could use to limit planning
> effort at an earlier stage when the approach doesn't appear promising.
> For example, if the proposed grouping column is already unique, the
> exercise is pointless (I think). Ideally we'd like to detect that
> without even creating the grouped_rel. But the proposed grouping
> column might also be *mostly* unique. For example, consider a table
> with a million rows and a column with 500,000 distinct values. I suspect it
> will be difficult for partial aggregation to work out to a win in a
> case like this, because I think that the cost of performing the
> partial aggregation will not reduce the cost either of the final
> aggregation or of the intervening join steps by enough to compensate.
> It would be best to find a way to avoid generating a lot of rels and
> paths in cases where there's really not much hope of a win.
>
> One could, perhaps, imagine going further with this by postponing
> eager aggregation planning until after regular paths have been built,
> so that we have good cardinality estimates. Suppose the query joins a
> single fact table to a series of dimension tables. The final plan thus
> uses the fact table as the driving table and joins to the dimension
> tables one by one. Do we really need to consider partial aggregation
> at every level? Perhaps just where there's been a significant row
> count reduction since the last time we tried it, but at the next level
> the row count will increase again?
>
> Maybe there are other heuristics we could use in addition or instead.

Yeah, one of my concerns with this work is that it can use
significantly more CPU time and memory during planning once enabled.
It would be great if we had some efficient heuristics to limit the
effort.  I'll work on that next and see what happens.

> 3. In general, we are quite bad at estimating what will happen to the
> row count after an aggregation, and we have no real idea what the
> distribution of values will be. That might be a problem for this
> patch, because it seems like the decisions we will make about where to
> perform the partial aggregation might end up being quite random. At
> the top of the join tree, I'll need to compare directly aggregating
> the best join path with various paths that involve a finalize
> aggregation step at the top and a partial aggregation step further
> down. But my cost estimates and row counts for the partial aggregate
> steps seem like they will often be quite poor, which means that the
> plans that use those partial aggregate steps might also be quite poor.
> Even if they're not, I fear that comparing the cost of those
> PartialAggregate-Join(s)-FinalizeAggregate paths to the direct
> Aggregate path will look too much like comparing random numbers. We
> need to know whether the combination of the FinalizeAggregate step and
> the PartialAggregate step will be more or less expensive than a plain
> old Aggregate, but how can we tell that if we don't have accurate
> cardinality estimates?

Yeah, I'm concerned about this too.  In addition to the inaccuracies
in aggregation estimates, our estimates for joins are sometimes not
very accurate either.  All of this is likely to result in regressions
with eager aggregation in some cases.  Currently I don't have a good
answer to this problem.  Maybe we can run some benchmarks first and
investigate the regressions discovered on a case-by-case basis to
better understand the specific issues.

Thanks
Richard



Re: Eager aggregation, take 3

От
Richard Guo
Дата:
On Wed, Aug 28, 2024 at 11:57 AM Tender Wang <tndrwang@gmail.com> wrote:
> Recently, I did some benchmark tests, mainly on tpch and tpcds.
> tpch tests have no plan diffs, so I did not continue to test on tpch.
> tpcds (10GB) tests have 22 plan diffs as below:
> 4.sql, 5.sql, 8.sql,11.sql,19.sql,23.sql,31.sql, 33.sql,39.sql,45.sql,46.sql,47.sql,53.sql,
> 56.sql,57.sql,60.sql,63.sql,68.sql,74.sql,77.sql,80.sql,89.sql
>
> I haven't looked at all of them. I just picked a few simple plan tests (e.g. 19.sql, 45.sql).
> For example, 19.sql, eager agg pushdown doesn't get a large gain, but a little
> performance regression.
>
> I will continue to benchmark this feature.

Thank you for running the benchmarks.  That really helps a lot.

Thanks
Richard



Re: Eager aggregation, take 3

From
Richard Guo
Date:
On Wed, Aug 28, 2024 at 9:01 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Aug 27, 2024 at 11:57 PM Tender Wang <tndrwang@gmail.com> wrote:
> > I haven't looked at all of them. I just picked a few simple plan tests (e.g. 19.sql, 45.sql).
> > For example, 19.sql, eager agg pushdown doesn't get a large gain, but a little
> > performance regression.
>
> Yeah, this is one of the things I was worried about in my previous
> reply to Richard. It would be worth Richard, or someone, probing into
> exactly why that's happening. My fear is that we just don't have good
> enough estimates to make good decisions, but there might well be
> another explanation.

It's great that we have a query to probe into.  Your guess is likely
correct: it may be caused by poor estimates.

Tender, would you please help provide the outputs of

EXPLAIN (COSTS ON, ANALYZE)

on 19.sql with and without eager aggregation?

> > I will continue to do benchmark on this feature.

Thanks again for running the benchmarks.

Thanks
Richard



Re: Eager aggregation, take 3

From
Tender Wang
Date:


On Thu, Aug 29, 2024 at 10:46 AM Richard Guo <guofenglinux@gmail.com> wrote:
On Wed, Aug 28, 2024 at 9:01 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Aug 27, 2024 at 11:57 PM Tender Wang <tndrwang@gmail.com> wrote:
> > I haven't looked at all of them. I just picked a few simple plan tests (e.g. 19.sql, 45.sql).
> > For example, 19.sql, eager agg pushdown doesn't get a large gain, but a little
> > performance regression.
>
> Yeah, this is one of the things I was worried about in my previous
> reply to Richard. It would be worth Richard, or someone, probing into
> exactly why that's happening. My fear is that we just don't have good
> enough estimates to make good decisions, but there might well be
> another explanation.

It's great that we have a query to probe into.  Your guess is likely
correct: it may be caused by poor estimates.

Tender, would you please help provide the outputs of

EXPLAIN (COSTS ON, ANALYZE)

on 19.sql with and without eager aggregation?

Yeah, in [1], 19_off.out and 19_on.out are the outputs of EXPLAIN (COSTS OFF, ANALYZE).
I will run EXPLAIN (COSTS ON, ANALYZE) tests and upload the results later today.




--
Tender Wang

Re: Eager aggregation, take 3

From
Tender Wang
Date:


On Thu, Aug 29, 2024 at 10:46 AM Richard Guo <guofenglinux@gmail.com> wrote:
On Wed, Aug 28, 2024 at 9:01 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Aug 27, 2024 at 11:57 PM Tender Wang <tndrwang@gmail.com> wrote:
> > I haven't looked at all of them. I just picked a few simple plan tests (e.g. 19.sql, 45.sql).
> > For example, 19.sql, eager agg pushdown doesn't get a large gain, but a little
> > performance regression.
>
> Yeah, this is one of the things I was worried about in my previous
> reply to Richard. It would be worth Richard, or someone, probing into
> exactly why that's happening. My fear is that we just don't have good
> enough estimates to make good decisions, but there might well be
> another explanation.

It's great that we have a query to probe into.  Your guess is likely
correct: it may be caused by poor estimates.

Tender, would you please help provide the outputs of

EXPLAIN (COSTS ON, ANALYZE)

on 19.sql with and without eager aggregation?

I uploaded the EXPLAIN (COSTS ON, ANALYZE) results to [1].
I ran the same query three times, and I chose the result of the third run.
You can check 19_off_explain.out and 19_on_explain.out.




--
Tender Wang

Re: Eager aggregation, take 3

From
Robert Haas
Date:
On Wed, Aug 28, 2024 at 10:26 PM Richard Guo <guofenglinux@gmail.com> wrote:
> Yeah, I'm concerned about this too.  In addition to the inaccuracies
> in aggregation estimates, our estimates for joins are sometimes not
> very accurate either.  All of this is likely to result in regressions
> with eager aggregation in some cases.  Currently I don't have a good
> answer to this problem.  Maybe we can run some benchmarks first and
> investigate the regressions discovered on a case-by-case basis to better
> understand the specific issues.

While it's true that we can make mistakes during join estimation, I
believe aggregate estimation tends to be far worse.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Eager aggregation, take 3

From
Robert Haas
Date:
On Wed, Aug 28, 2024 at 11:38 PM Tender Wang <tndrwang@gmail.com> wrote:
> I upload EXPLAIN(COSTS ON, ANALYZE) test to [1].
> I ran the same query three times, and I chose the third time result.
> You can check 19_off_explain.out and 19_on_explain.out.

So, in 19_off_explain.out, we got this:

         ->  Finalize GroupAggregate  (cost=666986.48..667015.35
rows=187 width=142) (actual time=272.649..334.318 rows=900 loops=1)
               ->  Gather Merge  (cost=666986.48..667010.21 rows=187
width=142) (actual time=272.644..333.847 rows=901 loops=1)
                     ->  Partial GroupAggregate
(cost=665986.46..665988.60 rows=78 width=142) (actual
time=266.379..267.476 rows=300 loops=3)
                           ->  Sort  (cost=665986.46..665986.65
rows=78 width=116) (actual time=266.367..266.583 rows=5081 loops=3)

And in 19_on_explan.out, we got this:

         ->  Finalize GroupAggregate  (cost=666987.03..666989.77
rows=19 width=142) (actual time=285.018..357.374 rows=900 loops=1)
               ->  Gather Merge  (cost=666987.03..666989.25 rows=19
width=142) (actual time=285.000..352.793 rows=15242 loops=1)
                     ->  Sort  (cost=665987.01..665987.03 rows=8
width=142) (actual time=273.391..273.580 rows=5081 loops=3)
                           ->  Nested Loop  (cost=665918.00..665986.89
rows=8 width=142) (actual time=252.667..269.719 rows=5081 loops=3)
                                 ->  Nested Loop
(cost=665917.85..665985.43 rows=8 width=157) (actual
time=252.656..264.755 rows=5413 loops=3)
                                       ->  Partial GroupAggregate
(cost=665917.43..665920.10 rows=82 width=150) (actual
time=252.643..255.627 rows=5413 loops=3)
                                             ->  Sort
(cost=665917.43..665917.64 rows=82 width=124) (actual
time=252.636..252.927 rows=5413 loops=3)

So, the patch was expected to cause the number of rows passing through
the Gather Merge to decrease from 187 to 19, but actually caused the
number of rows passing through the Gather Merge to increase from 901
to 15242. When the PartialAggregate was positioned at the top of the
join tree, it reduced the number of rows from 5081 to 300; but when it
was pushed down below two joins, it didn't reduce the row count at
all, and the subsequent two joins reduced it by less than 10%.

Now, you could complain about the fact that the Parallel Hash Join
isn't well-estimated here, but my question is: why does the planner
think that the PartialAggregate should go specifically here? In both
plans, the PartialAggregate isn't expected to change the row count.
And if that is true, then it's going to be cheapest to do it at the
point where the joins have reduced the row count to the minimum value.
Here, that would be at the top of the plan tree, where we have only
5081 estimated rows, but instead, the patch chooses to do it as soon
as we have all of the grouping columns, when we still have 5413 rows.
I don't understand why that path wins on cost, unless it's just that
the paths compare fuzzily the same, in which case it kind of goes to
my earlier point about not really having the statistics to know which
way is actually going to be better.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Eager aggregation, take 3

From
Tender Wang
Date:


On Wed, Aug 21, 2024 at 3:11 PM Richard Guo <guofenglinux@gmail.com> wrote:
On Fri, Aug 16, 2024 at 4:14 PM Richard Guo <guofenglinux@gmail.com> wrote:
> I had a self-review of this patchset and made some refactoring,
> especially to the function that creates the RelAggInfo structure for a
> given relation.  While there were no major changes, the code should
> now be simpler.

I found a bug in v10 patchset: when we generate the GROUP BY clauses
for the partial aggregation that is pushed down to a non-aggregated
relation, we may produce a clause with a tleSortGroupRef that
duplicates one already present in the query's groupClause, which would
cause problems.

Attached is the updated version of the patchset that fixes this bug
and includes further code refactoring.

The v11-0002 git am failed on HEAD(6c2b5edecc).

tender@iZ2ze6la2dizi7df9q3xheZ:/workspace/postgres$ git am v11-0002-Implement-Eager-Aggregation.patch
Applying: Implement Eager Aggregation
error: patch failed: src/test/regress/parallel_schedule:119
error: src/test/regress/parallel_schedule: patch does not apply
Patch failed at 0001 Implement Eager Aggregation
hint: Use 'git am --show-current-patch=diff' to see the failed patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".
 


--
Thanks,
Tender Wang

Re: Eager aggregation, take 3

От
Tender Wang
Дата:


On Wed, Aug 21, 2024 at 3:11 PM Richard Guo <guofenglinux@gmail.com> wrote:
On Fri, Aug 16, 2024 at 4:14 PM Richard Guo <guofenglinux@gmail.com> wrote:
> I had a self-review of this patchset and made some refactoring,
> especially to the function that creates the RelAggInfo structure for a
> given relation.  While there were no major changes, the code should
> now be simpler.

I found a bug in v10 patchset: when we generate the GROUP BY clauses
for the partial aggregation that is pushed down to a non-aggregated
relation, we may produce a clause with a tleSortGroupRef that
duplicates one already present in the query's groupClause, which would
cause problems.

Attached is the updated version of the patchset that fixes this bug
and includes further code refactoring.

I reviewed the v11 patch set, and here are a few of my thoughts:

1.  In setup_eager_aggregation(), before calling create_agg_clause_infos(), it does
some checks to see if eager aggregation is available. Can we move those checks into a function,
for example, can_eager_agg(), like can_partial_agg() does?

2.  I found that outside of joinrel.c we all use IS_DUMMY_REL, but in joinrel.c, Tom always uses
is_dummy_rel(). Other committers use IS_DUMMY_REL.

3.  The attached patch does not consider FDWs when creating a path for grouped_rel or grouped_join.
Do we need to think about FDWs?

I haven't finished reviewing the patch set. I will continue to study this feature.

--
Thanks,
Tender Wang

Re: Eager aggregation, take 3

From
Tender Wang
Date:


On Wed, Aug 21, 2024 at 3:11 PM Richard Guo <guofenglinux@gmail.com> wrote:
On Fri, Aug 16, 2024 at 4:14 PM Richard Guo <guofenglinux@gmail.com> wrote:
> I had a self-review of this patchset and made some refactoring,
> especially to the function that creates the RelAggInfo structure for a
> given relation.  While there were no major changes, the code should
> now be simpler.

I found a bug in v10 patchset: when we generate the GROUP BY clauses
for the partial aggregation that is pushed down to a non-aggregated
relation, we may produce a clause with a tleSortGroupRef that
duplicates one already present in the query's groupClause, which would
cause problems.

Attached is the updated version of the patchset that fixes this bug
and includes further code refactoring.


I continued to review the v11 patches. Here are some of my thoughts.

1. In make_one_rel(), we have the below codes:
/*
* Build grouped base relations for each base rel if possible.
*/
setup_base_grouped_rels(root);

As far as I know, each base rel has only one grouped base relation, if possible.
The comment may be changed to "Build a grouped base relation for each base rel if possible."

2.  According to the comments of generate_grouped_paths(), we may generate paths for a grouped
relation on top of paths of a join relation. So the "rel_plain" argument in generate_grouped_paths() may be
confusing. "plain" usually means "base rel". How about renaming rel_plain to input_rel?

3. In create_partial_grouping_paths(), the partially_grouped_rel could have already been created due to eager
aggregation. If partially_grouped_rel exists, its reltarget has already been created. So do we need the logic below?

/*
* Build target list for partial aggregate paths.  These paths cannot just
* emit the same tlist as regular aggregate paths, because (1) we must
* include Vars and Aggrefs needed in HAVING, which might not appear in
* the result tlist, and (2) the Aggrefs must be set in partial mode.
*/
partially_grouped_rel->reltarget =
       make_partial_grouping_target(root, grouped_rel->reltarget,
                                                        extra->havingQual);


--
Thanks,
Tender Wang

Re: Eager aggregation, take 3

From
Tender Wang
Date:


On Wed, Sep 4, 2024 at 11:48 AM Tender Wang <tndrwang@gmail.com> wrote:


On Wed, Aug 21, 2024 at 3:11 PM Richard Guo <guofenglinux@gmail.com> wrote:
On Fri, Aug 16, 2024 at 4:14 PM Richard Guo <guofenglinux@gmail.com> wrote:
> I had a self-review of this patchset and made some refactoring,
> especially to the function that creates the RelAggInfo structure for a
> given relation.  While there were no major changes, the code should
> now be simpler.

I found a bug in v10 patchset: when we generate the GROUP BY clauses
for the partial aggregation that is pushed down to a non-aggregated
relation, we may produce a clause with a tleSortGroupRef that
duplicates one already present in the query's groupClause, which would
cause problems.

Attached is the updated version of the patchset that fixes this bug
and includes further code refactoring.

The v11-0002 git am failed on HEAD(6c2b5edecc).

tender@iZ2ze6la2dizi7df9q3xheZ:/workspace/postgres$ git am v11-0002-Implement-Eager-Aggregation.patch
Applying: Implement Eager Aggregation
error: patch failed: src/test/regress/parallel_schedule:119
error: src/test/regress/parallel_schedule: patch does not apply
Patch failed at 0001 Implement Eager Aggregation
hint: Use 'git am --show-current-patch=diff' to see the failed patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".


Since MERGE/SPLIT partition has been reverted, the tests  *partition_merge* and  *partition_split*  should be removed
from parallel_schedule. After doing the above, the 0002 patch can be applied.

--
Thanks,
Tender Wang

Re: Eager aggregation, take 3

From
Richard Guo
Date:
On Wed, Aug 28, 2024 at 9:01 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Aug 27, 2024 at 11:57 PM Tender Wang <tndrwang@gmail.com> wrote:
> > I haven't looked at all of them. I just picked a few simple plan tests (e.g. 19.sql, 45.sql).
> > For example, 19.sql, eager agg pushdown doesn't get a large gain, but a little
> > performance regression.
>
> Yeah, this is one of the things I was worried about in my previous
> reply to Richard. It would be worth Richard, or someone, probing into
> exactly why that's happening. My fear is that we just don't have good
> enough estimates to make good decisions, but there might well be
> another explanation.

Sorry, it took me some time to switch back to this thread.

I revisited the part about cost estimates for grouped paths in this
patch, and I found a big issue: the row estimate for a join path could
be significantly inaccurate if there is a grouped join path beneath
it.

The reason is that it is very tricky to set the size estimates for a
grouped join relation.  For a non-grouped join relation, we know that
all its paths have the same rowcount estimate (well, in theory).  But
this is not true for a grouped join relation.  Suppose we have a
grouped join relation for t1/t2 join.  There might be two paths for
it:

Aggregate
  -> Join
       -> Scan on t1
       -> Scan on t2

Or

Join
  -> Scan on t1
  -> Aggregate
       -> Scan on t2

These two paths can have very different rowcount estimates, and we
have no way of knowing which one to set for this grouped join
relation, because we do not know which path would be picked in the
final plan.  This issue can be illustrated with the query below.

create table t (a int, b int, c int);
insert into t select i%10, i%10, i%10 from generate_series(1,1000)i;
analyze t;

set enable_eager_aggregate to on;

explain (costs on)
select sum(t2.c) from t t1 join t t2 on t1.a = t2.a join t t3 on t2.b
= t3.b group by t3.a;
                                      QUERY PLAN
---------------------------------------------------------------------------------------
 Finalize HashAggregate  (cost=6840.60..6840.70 rows=10 width=12)
   Group Key: t3.a
   ->  Nested Loop  (cost=1672.00..1840.60 rows=1000000 width=12)
         Join Filter: (t2.b = t3.b)
         ->  Partial HashAggregate  (cost=1672.00..1672.10 rows=10 width=12)
               Group Key: t2.b
               ->  Hash Join  (cost=28.50..1172.00 rows=100000 width=8)
                     Hash Cond: (t1.a = t2.a)
                     ->  Seq Scan on t t1  (cost=0.00..16.00 rows=1000 width=4)
                     ->  Hash  (cost=16.00..16.00 rows=1000 width=12)
                           ->  Seq Scan on t t2  (cost=0.00..16.00
rows=1000 width=12)
         ->  Materialize  (cost=0.00..21.00 rows=1000 width=8)
               ->  Seq Scan on t t3  (cost=0.00..16.00 rows=1000 width=8)
(13 rows)

Look at the Nested Loop node:

   ->  Nested Loop  (cost=1672.00..1840.60 rows=1000000 width=12)

How can a 10-row outer path joining a 1000-row inner path generate
1000000 rows?  This is because we are using the plan of the first path
described above, and the rowcount estimate of the second path.  What a
kluge!

To address this issue, one solution I’m considering is to recalculate
the row count estimate for a grouped join path using its outer and
inner paths.  While this may seem expensive, it might not be that bad
since we will cache the results of the selectivity calculation.  In
fact, this is already the approach we take for parameterized join
paths (see get_parameterized_joinrel_size).

Any thoughts on this?

Thanks
Richard



Re: Eager aggregation, take 3

From
Richard Guo
Date:
On Thu, Sep 5, 2024 at 9:40 AM Tender Wang <tndrwang@gmail.com> wrote:
> 1.  in setup_eager_aggregation(), before calling create_agg_clause_infos(), it does
> some checks if eager aggregation is available. Can we move those checks into a function,
> for example, can_eager_agg(), like can_partial_agg() does?

We can do this, but I'm not sure this would be better.

> 2.  I found that outside of joinrel.c we all use IS_DUMMY_REL,  but in joinrel.c, Tom always uses
> is_dummy_rel(). Other commiters use IS_DUMMY_REL.

They are essentially the same: IS_DUMMY_REL() is a macro that wraps
is_dummy_rel().  I think they are interchangeable, and I don’t have a
preference for which one is better.

> 3.  The attached patch does not consider FDW when creating a path for grouped_rel or grouped_join.
> Do we need to think about FDW?

We may add support for foreign relations in the future, but for now, I
think we'd better not expand the scope too much until we ensure that
everything is working correctly.

Thanks
Richard



Re: Eager aggregation, take 3

От
Richard Guo
Дата:
On Wed, Sep 11, 2024 at 10:52 AM Tender Wang <tndrwang@gmail.com> wrote:
> 1. In make_one_rel(), we have the below codes:
> /*
> * Build grouped base relations for each base rel if possible.
> */
> setup_base_grouped_rels(root);
>
> As far as I know, each base rel only has one grouped base relation, if possible.
> The comments may be changed to "Build a grouped base relation for each base rel if possible."

Yeah, each base rel has only one grouped rel.  However, there is a
comment nearby stating 'consider_parallel flags for each base rel',
which confuses me about whether it should be singular or plural in
this context.  Perhaps someone more proficient in English could
clarify this.

> 2.  According to the comments of generate_grouped_paths(), we may generate paths for a grouped
> relation on top of paths of join relation. So the ”rel_plain" argument in generate_grouped_paths() may be
> confused. "plain" usually means "base rel" . How about Re-naming rel_plain to input_rel?

I don't think 'plain relation' necessarily means 'base relation'.  In
this context I think it can mean 'non-grouped relation'.  But maybe
I'm wrong.

> 3. In create_partial_grouping_paths(), The partially_grouped_rel could have been already created due to eager
> aggregation. If partially_grouped_rel exists,  its reltarget has been created. So do we need below logic?
>
> /*
> * Build target list for partial aggregate paths.  These paths cannot just
> * emit the same tlist as regular aggregate paths, because (1) we must
> * include Vars and Aggrefs needed in HAVING, which might not appear in
> * the result tlist, and (2) the Aggrefs must be set in partial mode.
> */
> partially_grouped_rel->reltarget =
>        make_partial_grouping_target(root, grouped_rel->reltarget,
>                                                         extra->havingQual);

Yeah, maybe we can avoid building the target list here for a
partially_grouped_rel that was generated by eager aggregation.

Thanks
Richard



Re: Eager aggregation, take 3

From
Richard Guo
Date:
On Fri, Sep 13, 2024 at 3:48 PM Tender Wang <tndrwang@gmail.com> wrote:
> Since MERGE/SPLIT partition has been reverted, the tests  *partition_merge* and  *partition_split*  should be removed
> from parallel_schedule. After doing the above, the 0002 patch can be applied.

Yeah, that's what I need to do.

Thanks
Richard



Re: Eager aggregation, take 3

From
Richard Guo
Date:
On Sat, Oct 5, 2024 at 6:23 PM Richard Guo <guofenglinux@gmail.com> wrote:
>
> On Fri, Sep 27, 2024 at 11:53 AM Richard Guo <guofenglinux@gmail.com> wrote:
> > Here is an updated version of this patch that fixes the rowcount
> > estimate issue along this routine. (see set_joinpath_size.)
>
> I have worked on inventing some heuristics to limit the planning
> effort of eager aggregation.  One simple yet effective approach I'm
> thinking of is to consider a grouped path as NOT useful if its row
> reduction ratio falls below a predefined minimum threshold.  Currently
> I'm using 0.5 as the threshold, but I'm open to other values.
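
As an illustration of the heuristic (a made-up table; the reduction
ratio is computed from the estimated number of output groups versus
input rows):

create table t (a int, b int, c int);
insert into t select i, i % 10, i % 600 from generate_series(1, 1000) i;
analyze t;

-- grouping t by b reduces 1000 rows to about 10 (reduction ratio
-- 0.99), so its grouped paths are considered useful; grouping by c
-- reduces 1000 rows only to about 600 (reduction ratio 0.4, below the
-- 0.5 threshold), so no grouped paths would be kept for it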

I ran the TPC-DS benchmark at scale 10 and observed eager aggregation
applied in several queries, including q4, q8, q11, q23, q31, q33, and
q77.  Notably, the regression in q19 that Tender identified with v11
has disappeared in v13.

Here’s a comparison of Execution Time and Planning Time for the seven
queries with eager aggregation disabled versus enabled (best of 3).

Execution Time:

        EAGER-AGG-OFF           EAGER-AGG-ON
q4      105787.963 ms           34807.938 ms
q8      1407.454 ms             1654.923 ms
q11     67899.213 ms            18670.086 ms
q23     45945.849 ms            42990.652 ms
q31     10463.536 ms            10244.175 ms
q33     2186.928 ms             2217.228 ms
q77     2360.565 ms             2416.674 ms

Planning Time:

        EAGER-AGG-OFF           EAGER-AGG-ON
q4      2.334 ms                2.602 ms
q8      0.685 ms                0.647 ms
q11     0.935 ms                1.094 ms
q23     2.666 ms                2.582 ms
q31     1.051 ms                1.206 ms
q33     1.248 ms                1.796 ms
q77     0.967 ms                0.962 ms


There are good performance improvements in q4 and q11 (3~4 times).
For the other queries, execution times remain largely unchanged,
falling within the margin of error, with no notable regressions
observed.

For the planning time, I do not see notable regressions for any of the
seven queries.

It seems that the new cost estimates and the new heuristic are working
pretty well.

Thanks
Richard



Re: Eager aggregation, take 3

From
jian he
Date:
On Sat, Oct 5, 2024 at 6:23 PM Richard Guo <guofenglinux@gmail.com> wrote:
>
> On Fri, Sep 27, 2024 at 11:53 AM Richard Guo <guofenglinux@gmail.com> wrote:
> > Here is an updated version of this patch that fixes the rowcount
> > estimate issue along this routine. (see set_joinpath_size.)
>


In the function setup_eager_aggregation, can we be more conservative
about the cases where eager aggregation can be applied?  I see the
following cases where eager aggregation is not OK.  We could return
earlier, so that we won't call create_grouping_expr_infos(root) and
create_agg_clause_infos(root), and thus avoid unintended consequences.

1. root->parse->resultRelation > 0
    just to be 100% sure we are only dealing with SELECT; or we can
add an Assert at the end of setup_eager_aggregation.
2. the join type is FULL JOIN (I am not sure about semijoin and
antijoin types).
3. root->parse->windowClause != NIL

I am not sure whether enable_eager_aggregate can be useful when a
LIMIT clause is present; the code comments do not mention it.

I am also not sure about locking clauses, since the code comments do
not mention them.
EXPLAIN (COSTS OFF, settings, verbose)
SELECT avg(t2.c)
FROM (select * from eager_agg_t1 for update) t1
JOIN (select * from eager_agg_t2 for update) t2 ON t1.b = t2.b
GROUP BY t1.a;
Can eager aggregation apply to the above query?



In struct PlannerInfo:
    /* list of AggClauseInfos */
    List       *agg_clause_list;
    /* list of GroupExprInfos */
    List       *group_expr_list;
    /* list of plain Vars contained in targetlist and havingQual */
    List       *tlist_vars;
We can add a comment that agg_clause_list and tlist_vars contain only
unique entries.


There is no doc entry in doc/src/sgml/config.sgml.  We can put one
after the varlistentry for enable_bitmapscan, and at least mention
that the default value of enable_eager_aggregate is
<literal>off</literal>.


There are no tests for aggregates with FILTER clauses; the patch
currently seems to support them.


some of the "foreach" can be rewritten to foreach_node
see
https://git.postgresql.org/cgit/postgresql.git/commit/?id=14dd0f27d7cd56ffae9ecdbe324965073d01a9ff
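
For example (a hypothetical fragment, assuming AggClauseInfo is a
proper Node type; not taken verbatim from the patch):

    /* before: classic foreach with an explicit ListCell and cast */
    ListCell   *lc;

    foreach(lc, root->agg_clause_list)
    {
        AggClauseInfo *ac_info = (AggClauseInfo *) lfirst(lc);

        /* ... use ac_info ... */
    }

    /* after: foreach_node declares and casts the loop variable itself */
    foreach_node(AggClauseInfo, ac_info, root->agg_clause_list)
    {
        /* ... use ac_info ... */
    }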



Re: Eager aggregation, take 3

From
jian he
Date:
        /*
         * Eager aggregation is only possible if equality of grouping keys, as
         * defined by the equality operator, implies bitwise equality.
         * Otherwise, if we put keys with different byte images into the same
         * group, we may lose some information that could be needed to
         * evaluate upper qual clauses.
         *
         * For example, the NUMERIC data type is not supported because values
         * that fall into the same group according to the equality operator
         * (e.g. 0 and 0.0) can have different scale.
         */
        tce = lookup_type_cache(exprType((Node *) tle->expr),
                                TYPECACHE_BTREE_OPFAMILY);
        if (!OidIsValid(tce->btree_opf) ||
            !OidIsValid(tce->btree_opintype))
            return;
        equalimageproc = get_opfamily_proc(tce->btree_opf,
                                           tce->btree_opintype,
                                           tce->btree_opintype,
                                           BTEQUALIMAGE_PROC);
        if (!OidIsValid(equalimageproc) ||
            !DatumGetBool(OidFunctionCall1Coll(equalimageproc,
                                               tce->typcollation,
                                               ObjectIdGetDatum(tce->btree_opintype))))
            return;

I am confused by BTEQUALIMAGE_PROC.

 *    To facilitate B-Tree deduplication, an operator class may choose to
 *    offer a fourth amproc procedure (BTEQUALIMAGE_PROC).  For full details,
 *    see doc/src/sgml/btree.sgml.
The above is the comment about BTEQUALIMAGE_PROC in src/include/access/nbtree.h.


equalimage
Optionally, a btree operator family may provide equalimage (“equality implies
image equality”) support functions, registered under support function number 4.
These functions allow the core code to determine when it is safe to apply the
btree deduplication optimization. Currently, equalimage functions are only
called when building or rebuilding an index.

The above is the description of BTEQUALIMAGE_PROC from
https://www.postgresql.org/docs/current/btree.html#BTREE-SUPPORT-FUNCS

Integers support eager aggregation:
select amproc.*, amproclefttype::regtype
from  pg_amproc amproc join pg_opfamily opf on amproc.amprocfamily = opf.oid
where  amproc.amprocnum = 4
and amproc.amproclefttype = amproc.amprocrighttype
and opf.opfmethod = 403
and amproc.amprocrighttype = 'int'::regtype;
returns

  oid  | amprocfamily | amproclefttype | amprocrighttype | amprocnum |    amproc    | amproclefttype
-------+--------------+----------------+-----------------+-----------+--------------+----------------
 10052 |         1976 |             23 |              23 |         4 | btequalimage | integer


But btequalimage returns true unconditionally.

So overall I doubt that the usage of the BTEQUALIMAGE_PROC flag here
is correct.



Re: Eager aggregation, take 3

From
Richard Guo
Date:
On Fri, Oct 18, 2024 at 12:44 PM jian he <jian.universality@gmail.com> wrote:
> 1. root->parse->resultRelation > 0
>     just be 100% sure we are only dealing with SELECT, or we can add
> Assert at the end of setup_eager_aggregation.

Can GROUP BY clauses be used in INSERT/UPDATE/DELETE/MERGE statements?
If not, I think there is no need to check 'resultRelation > 0', as
setup_eager_aggregation already checks for GROUP BY clauses.

> 2. join type is FULL JOIN, (i am not sure about other Semijoins and
> anti-semijoins  types).

The presence of a FULL JOIN does not preclude the use of eager
aggregation.  We still can push a partial aggregation down to a level
that is above the FULL JOIN.

> 3.  root->parse->windowClause != NIL

Why does the presence of windowClause prevent the use of eager
aggregation?

> lack doc entry in doc/src/sgml/config.sgml
> we can put after varlistentry enable_bitmapscan
> we can at least mention that
> enable_eager_aggregate, The default value is <literal>off</literal>.

Yeah, that's what I need to do.

Thanks
Richard



Re: Eager aggregation, take 3

From
Richard Guo
Date:
On Fri, Oct 18, 2024 at 10:22 PM jian he <jian.universality@gmail.com> wrote:
> So overall I doubt here BTEQUALIMAGE_PROC flag usage is correct.

The BTEQUALIMAGE_PROC flag is used to prevent eager aggregation for
types whose equality operators do not imply bitwise equality, such as
NUMERIC.

After a second thought, I think it should be OK to just check the
equality operator specified by the SortGroupClause for btree equality.
I’m not very sure about this point, though, and would appreciate any
inputs.

Thanks
Richard



Re: Eager aggregation, take 3

From
Robert Haas
Date:
On Wed, Sep 25, 2024 at 3:03 AM Richard Guo <guofenglinux@gmail.com> wrote:
> On Wed, Sep 11, 2024 at 10:52 AM Tender Wang <tndrwang@gmail.com> wrote:
> > 1. In make_one_rel(), we have the below codes:
> > /*
> > * Build grouped base relations for each base rel if possible.
> > */
> > setup_base_grouped_rels(root);
> >
> > As far as I know, each base rel only has one grouped base relation, if possible.
> > The comments may be changed to "Build a grouped base relation for each base rel if possible."
>
> Yeah, each base rel has only one grouped rel.  However, there is a
> comment nearby stating 'consider_parallel flags for each base rel',
> which confuses me about whether it should be singular or plural in
> this context.  Perhaps someone more proficient in English could
> clarify this.

It's not confusing the way you have it, but I think an English teacher
wouldn't like it, because part of the sentence is singular ("each base
rel") and the other part is plural ("grouped base relations").
Tender's proposed rewrite fixes that. Another way to fix it is to
write "Build group relations for base rels where possible".

> > 2.  According to the comments of generate_grouped_paths(), we may generate paths for a grouped
> > relation on top of paths of join relation. So the "rel_plain" argument in generate_grouped_paths() may be
> > confused. "plain" usually means "base rel" . How about Re-naming rel_plain to input_rel?
>
> I don't think 'plain relation' necessarily means 'base relation'.  In
> this context I think it can mean 'non-grouped relation'.  But maybe
> I'm wrong.

We use the term "plain relation" in several different ways. In the
header comments for addFkRecurseReferenced, it means a non-partitioned
relation. In the struct comments for RangeTblEntry, it means any sort
of named thing in pg_class that you can scan, so either a partitioned
or unpartitioned table but not a join or a table function or
something. AFAICT, the most common meaning of "plain relation" is a
pg_class entry where relkind==RELKIND_RELATION.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Eager aggregation, take 3

From
Robert Haas
Date:
On Tue, Sep 24, 2024 at 11:20 PM Richard Guo <guofenglinux@gmail.com> wrote:
> The reason is that it is very tricky to set the size estimates for a
> grouped join relation.  For a non-grouped join relation, we know that
> all its paths have the same rowcount estimate (well, in theory).  But
> this is not true for a grouped join relation.  Suppose we have a
> grouped join relation for t1/t2 join.  There might be two paths for
> it:

What exactly do you mean by "well, in theory" here? My understanding
of how things work today is that every relation is supposed to produce
a specific set of rows and every unparameterized path must produce
that set of rows. The order of the rows may vary but the set of rows
may not. With your proposed design here, that's no longer true.
Instead, what's promised is that the row sets will become equivalent
after a later FinalizeAggregate step. In a sense, this is like
parameterization or partial paths. Suppose I have:

SELECT * FROM foo, bar WHERE foo.x = bar.x;

While every unparameterized path for bar has the same row count,
there's also the possibility of performing an index scan on bar.x
parameterized by foo.x, and that path will have a far lower row count
than the unparameterized paths. Instead of producing all the same rows
as every other path, the parameterized path promises only that if run
repeatedly, with all relevant values of foo.x, you'll eventually get
all the same rows you would have gotten from the unparameterized path.
Because of this difference, parameterized paths need special handling
in many different parts of the code.

And the same thing is true of partial paths. They also do not promise
to generate all the same rows -- instead, they promise that when run
simultaneously across multiple workers, the total set of rows returned
across all invocations will be equal to what a normal path would have
produced. Here again, there's a need for special handling because
these paths behave differently than standard paths.

I think what you're doing here is roughly equivalent to either of
these two cases. It's more like the parameterized path case. Instead
of having a path for a relation which is parameterized by some input
parameter, you have a path for a relation, say bar, which is partially
aggregated by some grouping column. But there's no guarantee of how
much partial aggregation has been done. In your example, PartialAgg(t1
JOIN t2) is "more aggregated" than t1 JOIN PartialAgg(t2), so the row
counts are different. This makes me quite nervous. You can't compare a
parameterized path to an unparameterized path, but you can compare it
to another parameterized path if the parameterizations are the same.
You can't compare a partial path to a non-partial path, but you can
compare partial paths to each other. But with this work,
unparameterized, non-partial paths in the same RelOptInfo don't seem
like they are truly comparable. Maybe that's OK, but I'm not sure that
it isn't going to break other things.

You might for example imagine a design where PartialAgg(t1 JOIN t2)
and t1 JOIN PartialAgg(t2) get separate RelOptInfos. After all, there
are probably multiple ways to generate paths for each of those things,
and paths in each category can be compared to each other apples to
apples. What's less clear is whether it's fair to compare across the
two categories, and how many assumptions will be broken by doing so.
I'm not sure that it's right to have separate RelOptInfos; we
definitely don't want to create more RelOptInfos than necessary. At
the same time, if we mix together all of those paths into a single
RelOptInfo, we need to be confident that we're neither going to break
anything nor introduce too many special cases into hot code paths. For
instance, set_joinpath_size() represents an unwelcome complexity
increase that could impact performance generally, even apart from the
cases this patch intends to handle.

It's tempting to wonder if there's some way that we can avoid
generating paths for both PartialAgg(t1 JOIN t2) and t1 JOIN
PartialAgg(t2). Either the former has lower cardinality, or the latter
does. It seems likely that the lower-cardinality set is the winning
strategy. Even if the path has higher cost to generate, we save work
at every subsequent join level and at the final aggregation step. Are
there counterexamples where it's better to use a path from the
higher-cardinality set?

By the way, the work of figuring out what target list should be
produced by partial grouping is done by init_grouping_targets(), but
the comments seem to take it for granted that I know what result we're
trying to produce, and I don't. I think some more high-level
explanation of the goals of this code would be useful. It seems to me
that if I'm looking at a path for an ungrouped relation and it
produces a certain target list, then every column of that target list
is needed somewhere. If those columns are group keys, cool: we pass
those through. If they're inputs to the aggregates, cool: we feed them
to the aggregates. But if they are neither, then what? In the patch,
you either group on those columns or add them to the
possibly_dependent list depending on the result of
is_var_needed_by_join(). I can believe that there are some cases where
we can group on such columns and others where we can't, but find it
difficult to believe that this test reliably distinguishes between
those two cases. If it does, I don't understand why it does. Don't I
need to know something about how those columns are used in the upper
joins? Like, if those columns are connected by a chain of
binary-equality operators back to the user's choice of grouping
columns, that sounds good, but this test doesn't distinguish between
that case and an upper join on the < operator.
create_grouping_expr_infos() does reason based on whether there's an
equal-image operator available, but AIUI that's only reasoning about
the group columns the user mentioned, not the sort of implicit
grouping columns that init_grouping_targets() is creating.

I spent a lot of time thinking today about what makes it safe to push
down grouping or not. I'm not sure that I have a solid answer to that
question even yet. But it seems to me that there are at least two
cases that clearly won't fly. One is the case where the partial
aggregation would merge rows that need to be included in the final
aggregation with rows that should be filtered out. If the
partially-grouped relation simply has a filter qual, that's fine,
because it will be evaluated before the aggregation. But there might
be a qual that has to be evaluated later, either because (a) it
involves another rel, like this_rel.x + that_rel.y > 10 or (b) it
appears in the ON clause of an outer join and thus needs to be
deferred to the level of the OJ, e.g. A LEFT JOIN B ON a.x = b.x AND
b.y = 42. I wonder if you can comment on how these cases are handled.
Perhaps this coding around functional dependencies has something to do
with it, but it isn't clear to me.

Thanks,

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Eager aggregation, take 3

From
Robert Haas
Date:
On Sat, Oct 5, 2024 at 11:30 PM Richard Guo <guofenglinux@gmail.com> wrote:
> Here’s a comparison of Execution Time and Planning Time for the seven
> queries with eager aggregation disabled versus enabled (best of 3).
>
> Execution Time:
>
>         EAGER-AGG-OFF           EAGER-AGG-ON
>
> q4      105787.963 ms           34807.938 ms
> q8      1407.454 ms             1654.923 ms
> q11     67899.213 ms            18670.086 ms
> q23     45945.849 ms            42990.652 ms
> q31     10463.536 ms            10244.175 ms
> q33     2186.928 ms             2217.228 ms
> q77     2360.565 ms             2416.674 ms

Could you attach the EXPLAIN ANALYZE output for these queries, with
and without the patch?

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Eager aggregation, take 3

From
Robert Haas
Date:
On Tue, Oct 29, 2024 at 3:57 AM Richard Guo <guofenglinux@gmail.com> wrote:
> On Fri, Oct 18, 2024 at 10:22 PM jian he <jian.universality@gmail.com> wrote:
> > So overall I doubt here BTEQUALIMAGE_PROC flag usage is correct.
>
> The BTEQUALIMAGE_PROC flag is used to prevent eager aggregation for
> types whose equality operators do not imply bitwise equality, such as
> NUMERIC.
>
> After a second thought, I think it should be OK to just check the
> equality operator specified by the SortGroupClause for btree equality.
> I’m not very sure about this point, though, and would appreciate any
> inputs.

Well, the key thing here is the reasoning, which you don't really
spell out either here or in the patch. The patch's justification for
the use of BTEQUALIMAGE_PROC is that, if we use non-equal-image
operators, "we may lose some information that could be needed to
evaluate upper qual clauses." But there's no explanation of why that
would happen, and I think that makes it difficult to have a good
discussion.

Here's an example from the optimizer README:

# 3.      (A leftjoin B on (Pab)) leftjoin C on (Pbc)
#         = A leftjoin (B leftjoin C on (Pbc)) on (Pab)
#
# Identity 3 only holds if predicate Pbc must fail for all-null B rows
# (that is, Pbc is strict for at least one column of B).  If Pbc is not
# strict, the first form might produce some rows with nonnull C columns
# where the second form would make those entries null.

This isn't the clearest justification for a rule that I've ever read,
but it's something. Someone reading this can think about whether the
two sentences of justification are a correct argument that Pbc must be
strict in order for the identity to hold.

I think you should be trying to offer similar (or better)
justifications, and perhaps similar identities as well. It's a little
bit hard to think about how to write the identities for this patch,
although you could start with aggregate(X) =
finalizeagg(partialagg(X)) and then explain how the partialagg can be
commuted with certain joins and what is required for it to do so. The
argument needs to be detailed enough that we can evaluate whether it's
correct or not.
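
For example, the commuting rule might be sketched like this (just a
sketch, ignoring details such as requiring that all the aggregates
reference only columns of B):

# Agg[G](A join B on (Pab))
#     = FinalizeAgg[G](A join PartialAgg[G'](B) on (Pab))
#
# where G' is the set of columns of B referenced in G plus all columns
# of B referenced in Pab, and the grouping operators used for G' are
# compatible with the operators appearing in Pab.

Whether that statement is correct, and under which join types it
holds, is exactly what the justification needs to establish.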

Personally, I'm still stuck on the point I wrote about yesterday. When
you partially aggregate a set of rows, the resulting row serves as a
proxy for the original set of rows. So, all of those rows must have
the same destiny. As I mentioned then, if you ever partially aggregate
a row that should ultimately be filtered out with one that shouldn't,
that breaks everything. But also, I need all the rows that feed into
each partial group to be part of the same set of output rows. For
instance, suppose I run a report like this:

SELECT o.order_number, SUM(l.quantity) FROM orders o, order_lines l
WHERE o.order_id = l.order_id GROUP BY 1;

If I'm thinking of executing this as FinalizeAgg(order JOIN
PartialAgg(order_lines)), what must be true in order for that to be a
valid execution plan? I certainly can't aggregate on some unrelated
column, say part_number. If I do, then first of all I don't know how
to get an order_id column out so that I can still do the join. But
even if that were no issue, you might have two rows with different
values of the order_id column and the same value for part_number, and
that the partial groups that you've created on the order_lines table
don't line up properly with the groups that you need to create on the
orders table. Some particular order_id might need some of the rows
that went into a certain part_number group and not others; that's no
good.

After some thought, I think the process of deduction goes something
like this. We ultimately need to aggregate on a column in the orders
table, but we want to partially aggregate the order_lines table. So we
need a set of columns in the order_lines table that can act as a proxy
for o.order_number. Our proxy should be all the columns that appear in
the join clauses between orders and order_lines. That way, all the
rows in a single partially aggregate row will have the "same" order_id
according to the operator class implementing the = operator, so for a
given row in the "orders" table, either every row in the group has
o.order_id = l.order_id or none of them do; hence they're all part of
the same set of output rows.

It doesn't matter, for example, whether o.order_number is unique. If
it isn't, then we'll flatten together some different rows from the
orders table -- but each of those rows will match all the rows in
partial groupings where o.order_id = partially_grouped_l.order_id and
none of the rows where that isn't the case, so the optimization is
still valid. Likewise it doesn't matter whether o.order_id is unique:
if order_id = 1 occurs twice, then both of the rows will match the
partially aggregated group with order_id = 1, but that's fine. The
only thing that's a problem is if the same row from orders was going
to match some but not all of the partially aggregate rows from some
order_lines group, and that won't happen here.

Now consider this example:

SELECT o.order_number, SUM(l.quantity) FROM orders o, order_lines l
WHERE o.order_id = l.order_id  AND o.something < l.something GROUP BY
1;

Here, we cannot partially group order_lines on just l.order_id,
because we might have a row in the orders table with order_id = 1 and
something = 1 and two rows in the order_lines table that both have
order_id = 1 but one has something = 0 and the other has something =
2. The orders row needs to join to one of those but not the other, so
we can't put them in the same partial group. However, it seems like it
would be legal to group order_lines on (order_id, something), provided
that the operator classes we use for the grouping operation matches
the operator classes of the operators in the join clause - i.e. we
group on order_id using the operator class of = and on something using
the operator class of <. If the operator classes don't match in this
way, then it could happen that the grouping operator thinks the values
are the same but the join operator thinks they're different.
(Everything is also OK if the grouping operator tests
bitwise-equality, because then nothing can ever get merged that
shouldn't; but other notions of equality are also fine as long as
they're compatible with the operator actually used.)

Now let's consider this example, using an imaginary user-defined operator:

SELECT o.order_number, SUM(l.quantity) FROM orders o, order_lines l
WHERE o.order_id = l.order_id  AND o.something ?!?! l.something GROUP
BY 1;

Here, we can partially aggregate order_lines by (order_id, something)
as long as order_id is grouped using bitwise equality OR the same
operator class as the = operator used in the query, and something has
to use bitwise equality.

What about this:

SELECT o.order_number, SUM(l.quantity) FROM orders o LEFT JOIN
order_lines l ON o.order_id = l.order_id  AND l.something = 1 GROUP BY
1;

It's really important that we don't try to aggregate on just
l.order_id here. Some rows in the group might have l.something = 1 and
others not. It would be legal (but probably not efficient) to
aggregate order_lines on (order_id, something).

So it seems to me that the general rule here is that when the table
for which we need a surrogate key is directly joined to the table
where the original grouping column is located, we need to group on all
columns involved in the join clause, using either compatible operators
or bitwise equality operators. As a practical matter, we probably only
want to do the optimization when all of the join clauses are
equijoins. Then we don't need to worry about bitwise equality at all,
AFAICS; we just group using the operator class that contains the
operator specified by the user.

What if there are more than two tables involved, like this:

SELECT o.order_number, MAX(n.creation_time) FROM orders o, order_lines
l, order_line_notes n WHERE o.order_id = l.order_id AND o.note_id =
n.note_id GROUP BY 1;

Here, there's no direct connection between the table with the original
grouping column and the table we want to aggregate. Using the rule
mentioned above, we can deduce that l.order_id is a valid grouping
column for order_lines. Applying the same rule again, we can then
deduce that n.note_id is a valid grouping column for order_line_notes.
We could either group order_line_notes on n.note_id and then perform
the remaining joins; or we could join order_lines and order_line_notes
and then partially aggregate the result on l.order_id.

What if there are multiple grouping columns, like this:

SELECT foo.x, bar.y, SUM(baz.z) FROM foo, bar, baz WHERE foo.a = baz.a
AND bar.b = baz.b GROUP BY 1, 2;

In this case, we need to find a surrogate grouping column for foo.x
and also a surrogate grouping column for bar.y. We can group baz on a,
b; but not just on a or just on b.

Finally, I think this example is interesting:

SELECT p.title, SUM(c.stuff) FROM parent p, link l, child c WHERE p.x
= l.x AND l.y = c.y AND p.z = c.z GROUP BY 1;

Based on the rule that I articulated earlier, you might think we can
just look at the join clauses between parent and child and group the
child on c.z. However, that's not correct -- we'd have to group on
both c.x and c.z. I'm not sure how to adjust the rule so that it
produces the right answer here.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Eager aggregation, take 3

From
Robert Haas
Date:
On Tue, Oct 29, 2024 at 3:51 AM Richard Guo <guofenglinux@gmail.com> wrote:
> > 2. join type is FULL JOIN, (i am not sure about other Semijoins and
> > anti-semijoins  types).
>
> The presence of a FULL JOIN does not preclude the use of eager
> aggregation.  We still can push a partial aggregation down to a level
> that is above the FULL JOIN.

I think you can also push a partial aggregation step through a FULL
JOIN. Consider this:

SELECT p.name, string_agg(c.name, ', ') FROM parents p FULL JOIN
children c ON p.id = c.parent_id GROUP BY 1;

I don't see why it matters here that this is a FULL JOIN. If it were
an inner join, we'd have one group for every parent that has at least
one child. Since it's a full join, we'll also get one group for every
parent with no children, and also one group for the null parent. But
that should work fine with a partially aggregated c.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Eager aggregation, take 3

From
jian he
Date:
Hi.
Still trying to understand v13. I found a bug.

Minimal test:
drop table if exists t1, t2;
CREATE TABLE t1 (a int, b int, c int);
CREATE TABLE t2 (a int, b int, c int);
SET enable_eager_aggregate TO on;
explain(costs off, settings) SELECT avg(t2.a), t1.c FROM t1 JOIN t2 ON
t1.b = t2.b GROUP BY t1.c having grouping(t1.c) > 0;


In create_agg_clause_infos:
    foreach(lc, tlist_exprs)
    {
        Expr       *expr = (Expr *) lfirst(lc);
        if (IsA(expr, GroupingFunc))
            return;
    }
    if (root->parse->havingQual != NULL)
    {
        List       *having_exprs;
        having_exprs = pull_var_clause((Node *) root->parse->havingQual,
                                       PVC_INCLUDE_AGGREGATES |
                                       PVC_RECURSE_PLACEHOLDERS);
        if (having_exprs != NIL)
        {
            tlist_exprs = list_concat(tlist_exprs, having_exprs);
            list_free(having_exprs);
        }
    }

havingQual can contain a GroupingFunc.
If that happens, we get a segmentation fault.



Re: Eager aggregation, take 3

From
Richard Guo
Date:
On Tue, Oct 29, 2024 at 9:59 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Sep 25, 2024 at 3:03 AM Richard Guo <guofenglinux@gmail.com> wrote:
> > On Wed, Sep 11, 2024 at 10:52 AM Tender Wang <tndrwang@gmail.com> wrote:
> > > 1. In make_one_rel(), we have the below codes:
> > > /*
> > > * Build grouped base relations for each base rel if possible.
> > > */
> > > setup_base_grouped_rels(root);
> > >
> > > As far as I know, each base rel only has one grouped base relation, if possible.
> > > The comments may be changed to "Build a grouped base relation for each base rel if possible."
> >
> > Yeah, each base rel has only one grouped rel.  However, there is a
> > comment nearby stating 'consider_parallel flags for each base rel',
> > which confuses me about whether it should be singular or plural in
> > this context.  Perhaps someone more proficient in English could
> > clarify this.
>
> It's not confusing the way you have it, but I think an English teacher
> wouldn't like it, because part of the sentence is singular ("each base
> rel") and the other part is plural ("grouped base relations").
> Tender's proposed rewrite fixes that. Another way to fix it is to
> write "Build group relations for base rels where possible".

Thank you for the suggestion.  The new wording looks much better
grammatically.  It seems to me that we should address the nearby
comment too, which goes like "consider_parallel flags for each base
rel", as each rel has only one consider_parallel flag.

> > > 2.  According to the comments of generate_grouped_paths(), we may generate paths for a grouped
> > > relation on top of paths of join relation. So the "rel_plain" argument in generate_grouped_paths() may be
> > > confused. "plain" usually means "base rel" . How about Re-naming rel_plain to input_rel?
> >
> > I don't think 'plain relation' necessarily means 'base relation'.  In
> > this context I think it can mean 'non-grouped relation'.  But maybe
> > I'm wrong.
>
> We use the term "plain relation" in several different ways. In the
> header comments for addFkRecurseReferenced, it means a non-partitioned
> relation. In the struct comments for RangeTblEntry, it means any sort
> of named thing in pg_class that you can scan, so either a partitioned
> or unpartitioned table but not a join or a table function or
> something. AFAICT, the most common meaning of "plain relation" is a
> pg_class entry where relkind==RELKIND_RELATION.

Agreed.

Thanks
Richard



Re: Eager aggregation, take 3

From
Richard Guo
Date:
On Wed, Oct 30, 2024 at 5:06 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Sep 24, 2024 at 11:20 PM Richard Guo <guofenglinux@gmail.com> wrote:
> > The reason is that it is very tricky to set the size estimates for a
> > grouped join relation.  For a non-grouped join relation, we know that
> > all its paths have the same rowcount estimate (well, in theory).  But
> > this is not true for a grouped join relation.  Suppose we have a
> > grouped join relation for t1/t2 join.  There might be two paths for
> > it:
>
> What exactly do you mean by "well, in theory" here? My understanding
> of how things work today is that every relation is supposed to produce
> a specific set of rows and every unparameterized path must produce
> that set of rows. The order of the rows may vary but the set of rows
> may not. With your proposed design here, that's no longer true.
> Instead, what's promised is that the row sets will become equivalent
> after a later FinalizeAggregate step. In a sense, this is like
> parameterization or partial paths.

Yeah, you're correct that currently each relation is expected to
produce the same specific set of rows.  When I say "well, in theory" I
mean that for a join relation, all its unparameterized paths should
theoretically have the same row count estimate.  However, in practice,
because there is more than one way to build a joinrel from more than
two base relations, and the selectivity estimation routines don’t
handle all cases equally well, we might get different row count
estimates depending on the pair provided.

And yes, with the grouped relations proposed in this patch, the
situation changes.  For a grouped join relation, different paths can
produce very different row sets, depending on where the partial
aggregation is placed in the path tree.  This is also why we need to
recalculate the row count estimate for a grouped join path using its
outer and inner paths provided, rather than using path->parent->rows
directly.  This is very much like the parameterized path case.

> I think what you're doing here is roughly equivalent to either of
> these two cases. It's more like the parameterized path case. Instead
> of having a path for a relation which is parameterized by some input
> parameter, you have a path for a relation, say bar, which is partially
> aggregated by some grouping column. But there's no guarantee of how
> much partial aggregation has been done. In your example, PartialAgg(t1
> JOIN t2) is "more aggregated" than t1 JOIN PartialAgg(t2), so the row
> counts are different. This makes me quite nervous. You can't compare a
> parameterized path to an unparameterized path, but you can compare it
> to another parameterized path if the parameterizations are the same.
> You can't compare a partial path to a non-partial path, but you can
> compare partial paths to each other. But with this work,
> unparameterized, non-partial paths in the same RelOptInfo don't seem
> like they are truly comparable. Maybe that's OK, but I'm not sure that
> it isn't going to break other things.

Perhaps we could introduce a GroupPathInfo into the Path structure to
store common information for a grouped path, such as the location of
the partial aggregation (which could be the relids of the relation on
top of which we place the partial aggregation) and the estimated
rowcount for this grouped path, similar to how ParamPathInfo functions
for parameterized paths.  Then we should be able to compare the
grouped paths at the same location apples to apples.  I haven’t
thought this through in detail yet, though.
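
As a rough illustration, such a structure might look like this (a
sketch only; the field names are hypothetical):

typedef struct GroupPathInfo
{
    NodeTag     type;

    /* relids of the relation atop which partial aggregation is placed */
    Relids      agg_relids;

    /* estimated rowcount for grouped paths at this location */
    Cardinality grouped_rows;
} GroupPathInfo;

Two grouped paths would then be comparable when their agg_relids
match, just as parameterized paths are comparable when their required
outer relids match.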

> It's tempting to wonder if there's some way that we can avoid
> generating paths for both PartialAgg(t1 JOIN t2) and t1 JOIN
> PartialAgg(t2). Either the former has lower cardinality, or the latter
> does. It seems likely that the lower-cardinality set is the winning
> strategy. Even if the path has higher cost to generate, we save work
> at every subsequent join level and at the final aggregation step. Are
> there counterexamples where it's better to  use a path from the
> higher-cardinality set?

This is very appealing if we can retain only the lower-cardinality
path, as it would greatly simplify the overall design.  I haven't
looked for counterexamples yet, but I plan to do so later.

> By the way, the work of figuring out what target list should be
> produced by partial grouping is done by init_grouping_targets(), but
> the comments seem to take it for granted that I know what result we're
> trying to produce, and I don't. I think some more high-level
> explanation of the goals of this code would be useful. It seems to me
> that if I'm looking at a path for an ungrouped relation and it
> produces a certain target list, then every column of that target list
> is needed somewhere. If those columns are group keys, cool: we pass
> those through. If they're inputs to the aggregates, cool: we feed them
> to the aggregates. But if they are neither, then what? In the patch,
> you either group on those columns or add them to the
> possibly_dependent list depending on the result of
> is_var_needed_by_join(). I can believe that there are some cases where
> we can group on such columns and others where we can't, but find it
> difficult to believe that this test reliably distinguishes between
> those two cases. If it does, I don't understand why it does. Don't I
> need to know something about how those columns are used in the upper
> joins? Like, if those columns are connected by a chain of
> binary-equality operators back to the user's choice of grouping
> columns, that sounds good, but this test doesn't distinguish between
> that case and an upper join on the < operator.
> create_grouping_expr_infos() does reason based on whether there's an
> equal-image operator available, but AIUI that's only reasoning about
> the group columns the user mentioned, not the sort of implicit
> grouping columns that init_grouping_targets() is creating.

Yeah, this patch does not get it correct here.  Basically the logic is
that for the partial aggregation pushed down to a non-aggregated
relation, we need to consider all columns of that relation involved in
upper join clauses and include them in the grouping keys.  Currently,
the patch only checks whether a column is involved in upper join
clauses but does not verify how the column is used.  We need to ensure
that the operator used in the join clause is at least compatible with
the grouping operator; otherwise, the grouping operator might
interpret the values as the same while the join operator sees them as
different.

Thanks
Richard



Re: Eager aggregation, take 3

From
Richard Guo
Date:
On Thu, Oct 31, 2024 at 12:25 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Well, the key thing here is the reasoning, which you don't really
> spell out either here or in the patch. The patch's justification for
> the use of BTEQUALIMAGE_PROC Is that, if we use non-equal-image
> operators, "we may lose some information that could be needed to
> evaluate upper qual clauses." But there's no explanation of why that
> would happen, and I think that makes it difficult to have a good
> discussion.
>
> Here's an example from the optimizer README:
>
> # 3.      (A leftjoin B on (Pab)) leftjoin C on (Pbc)
> #         = A leftjoin (B leftjoin C on (Pbc)) on (Pab)
> #
> # Identity 3 only holds if predicate Pbc must fail for all-null B rows
> # (that is, Pbc is strict for at least one column of B).  If Pbc is not
> # strict, the first form might produce some rows with nonnull C columns
> # where the second form would make those entries null.
>
> This isn't the clearest justification for a rule that I've ever read,
> but it's something. Someone reading this can think about whether the
> two sentences of justification are a correct argument that Pbc must be
> strict in order for the identity to hold.
>
> I think you should be trying to offer similar (or better)
> justifications, and perhaps similar identities as well. It's a little
> bit hard to think about how to write the identities for this patch,
> although you could start with aggregate(X) =
> finalizeagg(partialagg(X)) and then explain how the partialagg can be
> commuted with certain joins and what is required for it to do so. The
> argument needs to be detailed enough that we can evaluate whether it's
> correct or not.
>
> Personally, I'm still stuck on the point I wrote about yesterday. When
> you partially aggregate a set of rows, the resulting row serves as a
> proxy for the original set of rows. So, all of those rows must have
> the same destiny. As I mentioned then, if you ever partially aggregate
> a row that should ultimately be filtered out with one that shouldn't,
> that breaks everything. But also, I need all the rows that feed into
> each partial group to be part of the same set of output rows. For
> instance, suppose I run a report like this:
>
> SELECT o.order_number, SUM(l.quantity) FROM orders o, order_lines l
> WHERE o.order_id = l.order_id GROUP BY 1;
>
> If I'm thinking of executing this as FinalizeAgg(order JOIN
> PartialAgg(order_lines)), what must be true in order for that to be a
> valid execution plan? I certainly can't aggregate on some unrelated
> column, say part_number. If I do, then first of all I don't know how
> to get an order_id column out so that I can still do the join. But
> even if that were no issue, you might have two rows with different
> values of the order_id column and the same value for part_number, and
> that the partial groups that you've created on the order_lines table
> don't line up properly with the groups that you need to create on the
> orders table. Some particular order_id might need some of the rows
> that went into a certain part_number group and not others; that's no
> good.
>
> After some thought, I think the process of deduction goes something
> like this. We ultimately need to aggregate on a column in the orders
> table, but we want to partially aggregate the order_lines table. So we
> need a set of columns in the order_lines table that can act as a proxy
> for o.order_number. Our proxy should be all the columns that appear in
> the join clauses between orders and order_lines. That way, all the
> rows in a single partially aggregate row will have the "same" order_id
> according to the operator class implementing the = operator, so for a
> given row in the "orders" table, either every row in the group has
> o.order_id = l.order_id or none of them do; hence they're all part of
> the same set of output rows.
>
> It doesn't matter, for example, whether o.order_number is unique. If
> it isn't, then we'll flatten together some different rows from the
> orders table -- but each of those rows will match all the rows in
> partial groupings where o.order_id = partially_grouped_l.order_id and
> none of the rows where that isn't the case, so the optimization is
> still valid. Likewise it doesn't matter whether o.order_id is unique:
> if order_id = 1 occurs twice, then both of the rows will match the
> partially aggregated group with order_id = 1, but that's fine. The
> only thing that's a problem is if the same row from orders was going
> to match some but not all of the partially aggregate rows from some
> order_lines group, and that won't happen here.
>
> Now consider this example:
>
> SELECT o.order_number, SUM(l.quantity) FROM orders o, order_lines l
> WHERE o.order_id = l.order_id  AND o.something < l.something GROUP BY
> 1;
>
> Here, we cannot partially group order_lines on just l.order_id,
> because we might have a row in the orders table with order_id = 1 and
> something = 1 and two rows in the order_lines table that both have
> order_id = 1 but one has something = 0 and the other has something =
> 2. The orders row needs to join to one of those but not the other, so
> we can't put them in the same partial group. However, it seems like it
> would be legal to group order_lines on (order_id, something), provided
> that the operator classes we use for the grouping operation matches
> the operator classes of the operators in the join clause - i.e. we
> group on order_id using the operator class of = and on something using
> the operator class of <. If the operator classes don't match in this
> way, then it could happen that the grouping operator thinks the values
> are the same but the join operator thinks they're different.
> (Everything is also OK if the grouping operator tests
> bitwise-equality, because then nothing can ever get merged that
> shouldn't; but other notions of equality are also fine as long as
> they're compatible with the operator actually used.)
>
> Now let's consider this example, using an imaginary user-defined operator:
>
> SELECT o.order_number, SUM(l.quantity) FROM orders o, order_lines l
> WHERE o.order_id = l.order_id  AND o.something ?!?! l.something GROUP
> BY 1;
>
> Here, we can partially aggregate order_lines by (order_id, something)
> as long as order_id is grouped using bitwise equality OR the same
> operator class as the = operator used in the query, and something has
> to use bitwise equality.
>
> What about this:
>
> SELECT o.order_number, SUM(l.quantity) FROM orders o LEFT JOIN
> order_lines l ON o.order_id = l.order_id  AND l.something = 1 GROUP BY
> 1;
>
> It's really important that we don't try to aggregate on just
> l.order_id here. Some rows in the group might have l.something = 1 and
> others not. It would be legal (but probably not efficient) to
> aggregate order_lines on (order_id, something).
>
> So it seems to me that the general rule here is that when the table
> for which we need a surrogate key is directly joined to the table
> where the original grouping column is located, we need to group on all
> columns involved in the join clause, using either compatible operators
> or bitwise equality operators. As a practical matter, we probably only
> want to do the optimization when all of the join clauses are
> equijoins. Then we don't need to worry about bitwise equality at all,
> AFAICS; we just group using the operator class that contains the
> operator specified by the user.
>
> What if there are more than two tables involved, like this:
>
> SELECT o.order_number, MAX(n.creation_time) FROM orders o, order_lines
> l, order_line_notes n WHERE o.order_id = l.order_id AND o.note_id =
> n.note_id GROUP BY 1;
>
> Here, there's no direct connection between the table with the original
> grouping column and the table we want to aggregate. Using the rule
> mentioned above, we can deduce that l.order_id is a valid grouping
> column for order_lines. Applying the same rule again, we can then
> deduce that n.note_id is a valid grouping column for note_id. We could
> either group note_id on n and then perform the remaining joins; or we
> could join order_lines and note_id and then partially aggregate the
> result on l.order_id.
>
> What if there are multiple grouping columns, like this:
>
> SELECT foo.x, bar.y, SUM(baz.z) FROM foo, bar, baz WHERE foo.a = baz.a
> AND bar.b = baz.b GROUP BY 1, 2;
>
> In this case, we need to find a surrogate grouping column for foo.x
> and also a surrogate grouping column for bar.y. We can group baz on a,
> b; but not just on a or just on b.
>
> Finally, I think this example is interesting:
>
> SELECT p.title, SUM(c.stuff) FROM parent p, link l, child c WHERE p.x
> = l.x AND l.y = c.y AND p.z = c.z GROUP BY 1;
>
> Based on the rule that I articulated earlier, you might think we can
> just look at the join clauses between parent and child and group the
> child on c.z. However, that's not correct -- we'd have to group on
> both c.x and c.z. I'm not sure how to adjust the rule so that it
> produces the right answer here.

Thank you for the thorough deduction process; this is something I
should have done before proposing the patch.  As we discussed
off-list, what I need to do next is to establish a theory that proves
the transformation proposed in this patch is correct in all cases.

What I have in mind now is that to push a partial aggregation down to
a relation, we need to group by all the columns of that relation
involved in the upper join clauses, using compatible operators.  This
is essential to ensure that an aggregated row from the partial
aggregation matches the other side of the join if and only if each row
in the partial group does, thereby ensuring that all rows in the same
partial group have the same 'destiny'.

Thanks
Richard



Re: Eager aggregation, take 3

From
Richard Guo
Date:
On Thu, Oct 31, 2024 at 9:16 PM jian he <jian.universality@gmail.com> wrote:
>
> hi.
> still trying to understand v13. found a bug.
>
> minimum test :
> drop table if exists t1, t2;
> CREATE TABLE t1 (a int, b int, c int);
> CREATE TABLE t2 (a int, b int, c int);
> SET enable_eager_aggregate TO on;
> explain(costs off, settings) SELECT avg(t2.a), t1.c FROM t1 JOIN t2 ON
> t1.b = t2.b GROUP BY t1.c having grouping(t1.c) > 0;
>
> havingQual can have GroupingFunc.
> if that happens, then segmentation fault.

Nice catch.  Thanks.

create_agg_clause_infos does check for GROUPING() expressions, but
not in the right place.  Will fix it in the next version.
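
A minimal sketch of the missing check, mirroring the existing
targetlist check (the helper name contains_grouping_func is
hypothetical):

    /*
     * Give up if the HAVING qual contains any GroupingFunc: such
     * expressions cannot be evaluated below the final aggregation step.
     */
    if (root->parse->havingQual != NULL &&
        contains_grouping_func((Node *) root->parse->havingQual))
        return;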

Thanks
Richard



Re: Eager aggregation, take 3

From
Robert Haas
Date:
On Fri, Nov 1, 2024 at 2:21 AM Richard Guo <guofenglinux@gmail.com> wrote:
> ... an aggregated row from the partial
> aggregation matches the other side of the join if and only if each row
> in the partial group does, thereby ensuring that all rows in the same
> partial group have the same 'destiny'.

Ah, I really like this turn of phrase! I think it's clearer and
simpler than what I said. And I think it implies that we don't need to
explicitly deduce surrogate grouping keys. For example if we have A
JOIN B JOIN C JOIN D JOIN E JOIN F, grouped by columns from A, we
don't need to work out surrogate grouping keys for B and then C and
then D and then E and then F. We can just look at F's join clauses and
that tells us how to group, independent of anything else.

But is there any hole in that approach? I think the question is
whether the current row could be used in some way that doesn't show up
in the join clauses. I can't think of any way for that to happen,
really. I believe that any outerjoin-delayed quals will show up as
join clauses, and stuff like ORDER BY and HAVING will happen after the
aggregation (at least logically) so it should be fine. Windowed
functions and ordered aggregates may be a blocker to the optimization,
though: if the window function needs access to the unaggregated rows,
or even just needs to know how many rows there are, then we'd better
not aggregate them before the window function runs; and if the
aggregate is ordered, we can only partially aggregate the data if it
is already ordered in a way that is compatible with the final, desired
ordering. Another case we might need to watch out for is RLS. RLS
wants to apply all the security quals before any non-leakproof
functions, and pushing down the aggregation might push an aggregate
function past security quals. Perhaps there are other cases to worry
about as well; this is all I can think of at the moment.

But regardless of those kinds of cases, the basic idea that we want
the partially aggregate rows to join if and only if the unaggregated
rows would have joined seems exactly correct to me, and that provides
theoretical justification for deriving the surrogate grouping key
directly from the join quals. Woot!

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Eager aggregation, take 3

From
jian he
Date:
On Thu, Aug 29, 2024 at 10:26 AM Richard Guo <guofenglinux@gmail.com> wrote:
>
>
> > 2. I think there might be techniques we could use to limit planning
> > effort at an earlier stage when the approach doesn't appear promising.
> > For example, if the proposed grouping column is already unique, the
> > exercise is pointless (I think). Ideally we'd like to detect that
> > without even creating the grouped_rel. But the proposed grouping
> > column might also be *mostly* unique. For example, consider a table
> > with a million rows and a column 500,000 distinct values. I suspect it
> > will be difficult for partial aggregation to work out to a win in a
> > case like this, because I think that the cost of performing the
> > partial aggregation will not reduce the cost either of the final
> > aggregation or of the intervening join steps by enough to compensate.
> > It would be best to find a way to avoid generating a lot of rels and
> > paths in cases where there's really not much hope of a win.
> >
> > One could, perhaps, imagine going further with this by postponing
> > eager aggregation planning until after regular paths have been built,
> > so that we have good cardinality estimates. Suppose the query joins a
> > single fact table to a series of dimension tables. The final plan thus
> > uses the fact table as the driving table and joins to the dimension
> > tables one by one. Do we really need to consider partial aggregation
> > at every level? Perhaps just where there's been a significant row
> > count reduction since the last time we tried it, but at the next level
> > the row count will increase again?
> >
> > Maybe there are other heuristics we could use in addition or instead.
>
> Yeah, one of my concerns with this work is that it can use
> significantly more CPU time and memory during planning once enabled.
> It would be great if we have some efficient heuristics to limit the
> effort.  I'll work on that next and see what happens.
>

In v13, the latest version, we can do:

    /* ... and initialize these targets */
    if (!init_grouping_targets(root, rel, target, agg_input,
                               &group_clauses, &group_exprs))
        return NULL;
    if (rel->reloptkind == RELOPT_BASEREL && group_exprs != NIL)
    {
        foreach_node(Var, var, group_exprs)
        {
            if (var->varno == rel->relid &&
                has_unique_index(rel, var->varattno))
                return NULL;
        }
    }

This works since in init_grouping_targets we have already asserted
that group_exprs is a list of Vars.


--------------------------------------------------------------------------------
Also, in create_rel_agg_info we call estimate_num_groups:

    result->group_exprs = group_exprs;
    result->grouped_rows = estimate_num_groups(root, result->group_exprs,
                                               rel->rows, NULL, NULL);
        /*
         * The grouped paths for the given relation are considered useful iff
         * the row reduction ratio is greater than EAGER_AGGREGATE_RATIO.
         */
        agg_info->agg_useful =
            (agg_info->grouped_rows <= rel->rows * (1 - EAGER_AGGREGATE_RATIO));

If there are too many Vars in group_exprs, then result->grouped_rows
will be less accurate, and therefore agg_info->agg_useful will be less
accurate.  Should we limit the number of Vars in group_exprs?


for example:
SET enable_eager_aggregate TO on;
drop table if exists eager_agg_t1,eager_agg_t2, eager_agg_t3;
CREATE TABLE eager_agg_t1 (a int, b int, c double precision);
CREATE TABLE eager_agg_t2 (a int, b int, c double precision);
INSERT INTO eager_agg_t1 SELECT i % 100, i, i FROM generate_series(1, 5)i;
INSERT INTO eager_agg_t2 SELECT i % 10, i, i FROM generate_series(1, 5)i;
INSERT INTO eager_agg_t2 SELECT i % 10, i, i FROM generate_series(-4, -2)i;
explain(costs off, verbose, settings)
SELECT t1.a, avg(t2.c) FROM eager_agg_t1 t1 JOIN eager_agg_t2 t2 ON
abs(t1.b) = abs(t2.b % 10 + t2.a) group by 1;
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Finalize HashAggregate
   Output: t1.a, avg(t2.c)
   Group Key: t1.a
   ->  Merge Join
         Output: t1.a, (PARTIAL avg(t2.c))
         Merge Cond: ((abs(((t2.b % 10) + t2.a))) = (abs(t1.b)))
         ->  Sort
               Output: t2.b, t2.a, (PARTIAL avg(t2.c)), (abs(((t2.b % 10) + t2.a)))
               Sort Key: (abs(((t2.b % 10) + t2.a)))
               ->  Partial HashAggregate
                     Output: t2.b, t2.a, PARTIAL avg(t2.c), abs(((t2.b % 10) + t2.a))
                     Group Key: t2.b, t2.a
                     ->  Seq Scan on public.eager_agg_t2 t2
                           Output: t2.a, t2.b, t2.c
         ->  Sort
               Output: t1.a, t1.b, (abs(t1.b))
               Sort Key: (abs(t1.b))
               ->  Seq Scan on public.eager_agg_t1 t1
                     Output: t1.a, t1.b, abs(t1.b)
 Settings: enable_eager_aggregate = 'on'
 Query Identifier: -734044107933323262



Re: Eager aggregation, take 3

From
Richard Guo
Date:
On Fri, Nov 1, 2024 at 9:42 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Nov 1, 2024 at 2:21 AM Richard Guo <guofenglinux@gmail.com> wrote:
> > ... an aggregated row from the partial
> > aggregation matches the other side of the join if and only if each row
> > in the partial group does, thereby ensuring that all rows in the same
> > partial group have the same 'destiny'.
>
> Ah, I really like this turn of phrase! I think it's clearer and
> simpler than what I said. And I think it implies that we don't need to
> explicitly deduce surrogate grouping keys. For example if we have  A
> JOIN B JOIN C JOIN D JOIN E JOIN F, grouped by columns from A, we
> don't need to work out surrogate grouping keys for B and then C and
> then D and then E and then F. We can just look at F's join clauses and
> that tells us how to group, independent of anything else.
>
> But is there any hole in that approach? I think the question is
> whether the current row could be used in some way that doesn't show up
> in the join clauses. I can't think of any way for that to happen,
> really. I believe that any outerjoin-delayed quals will show up as
> join clauses, and stuff like ORDER BY and HAVING will happen after the
> aggregation (at least logically) so it should be fine. Windowed
> functions and ordered aggregates may be a blocker to the optimization,
> though: if the window function needs access to the unaggregated rows,
> or even just needs to know how many rows there are, then we'd better
> not aggregate them before the window function runs; and if the
> aggregate is ordered, we can only partially aggregate the data if it
> is already ordered in a way that is compatible with the final, desired
> ordering. Another case we might need to watch out for is RLS. RLS
> wants to apply all the security quals before any non-leakproof
> functions, and pushing down the aggregation might push an aggregate
> function past security quals. Perhaps there are other cases to worry
> about as well; this is all I can think of at the moment.

Yeah, ordered aggregates could be a blocker.  I think it might be best
to prevent the use of eager aggregation if root->numOrderedAggs > 0
for now.

I've been thinking about the window functions case, as Jian He also
mentioned it some time ago.  It seems that the window function's
argument(s), as well as the PARTITION BY expression(s), are supposed
to appear in the GROUP BY clause or be used in an aggregate function.
And window functions are applied after the aggregation.  So it seems
that there is no problem with window functions.  But maybe I'm wrong.

I hadn't considered the RLS case before, but I think you're right.  To
simplify things, maybe for now we can just prevent pushing down the
aggregation if the query applies some RLS policy, by checking
query->hasRowSecurity.
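
To make that concrete, the guards would look roughly like this (a
minimal sketch; the enclosing function is hypothetical and the patch
may place these checks elsewhere):

    /*
     * Hypothetical guard: skip eager aggregation for ordered
     * aggregates and for RLS-using queries, per the above.
     */
    static bool
    eager_aggregation_possible(PlannerInfo *root)
    {
        /* Ordered aggregates cannot be safely partially aggregated. */
        if (root->numOrderedAggs > 0)
            return false;

        /* Conservatively skip queries subject to row-level security. */
        if (root->parse->hasRowSecurity)
            return false;

        return true;
    }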

> But regardless of those kinds of cases, the basic idea that we want
> the partially aggregate rows to join if and only if the unaggregated
> rows would have joined seems exactly correct to me, and that provides
> theoretical justification for deriving the surrogate grouping key
> directly from the join quals. Woot!

Thank you for the confirmation!

Thanks
Richard



Re: Eager aggregation, take 3

From
Robert Haas
Date:
On Wed, Nov 6, 2024 at 3:22 AM Richard Guo <guofenglinux@gmail.com> wrote:
> Yeah, ordered aggregates could be a blocker.  I think it might be best
> to prevent the use of eager aggregation if root->numOrderedAggs > 0
> for now.
>
> I've been thinking about the window functions case, as Jian He also
> mentioned it some time ago.  It seems that the window function's
> argument(s), as well as the PARTITION BY expression(s), are supposed
> to appear in the GROUP BY clause or be used in an aggregate function.
> And window functions are applied after the aggregation.  So it seems
> that there is no problem with window functions.  But maybe I'm wrong.
>
> I hadn't considered the RLS case before, but I think you're right.  To
> simplify things, maybe for now we can just prevent pushing down the
> aggregation if the query applies some RLS policy, by checking
> query->hasRowSecurity.

Particularly for the RLS case, I think we should be reluctant to
disable the optimization entirely just because there might be a
problem. We have existing infrastructure to keep security quals from
being applied too late, and I suspect it's mostly applicable to this
situation. Therefore, I suspect it might not be very much work to
allow this optimization even when RLS is in use, as long as it
wouldn't actually cause a violation of the RLS rules. If, upon
investigation, you find some reason why we can't assess accurately
whether pushing down a specific aggregate is a problem, then the
approach that you propose is reasonable, but I think the question
should be investigated. I don't like the idea of giving up on
RLS-using queries completely without even trying to figure out how
difficult it would be to do the right thing.

I have similar but weaker feelings about ordered aggregates. Consider:

explain select t1.id, array_agg(t2.v order by t3.o) from t1, t2, t3
where t1.id = t2.id and t2.id = t3.id group by 1;

We can't partially aggregate t2, but we could partially aggregate t2
join t3. So this case is a lot like:

explain select t1.id, array_agg(t2.v + t3.o) from t1, t2, t3 where
t1.id = t2.id and t2.id = t3.id group by 1;

I don't know whether the patch handles the second case correctly right
now, but that certainly seems like a case that has to work. We must be
able to determine in such a case that the partial aggregate has to be
above the t2-t3 join. And if we can determine that, then why can't
basically the same logic handle the first case? There are certainly
some differences. The first case not only needs the aggregate to be
above the t2-t3 join but also needs the input data to be sorted, so we
don't get the right behavior for ordered aggregates just by using the
contents of the ORDER BY clause to decide at what level the partial
aggregate can be applied. On the other hand, if we're looking at paths
for (T2 JOIN T3) to build paths for PartialAgg(T2 join T3), the
stipulation that we need to use ordered paths or sorting doesn't make
the code very much more complicated. I'm open to the conclusion that
this is too much complexity but I'd rather not dismiss it instantly.

Regarding window functions, you've said a few times now that you don't
see the problem, but the more I think about it, the more obvious it
seems to me that there are BIG problems. Consider this example from
the documentation:

SELECT depname, empno, salary, avg(salary) OVER (PARTITION BY depname)
FROM empsalary;

I get a query plan like this:

 WindowAgg  (cost=83.46..104.37 rows=1200 width=72)
   ->  Sort  (cost=83.37..86.37 rows=1200 width=40)
         Sort Key: depname
         ->  Seq Scan on empsalary  (cost=0.00..22.00 rows=1200 width=40)

Already we see warning signs here. The WindowAgg node needs the input
rows to be ordered, because it's going to average the salary for each
group of rows with the same depname. So we have the same kinds of
issues that we do for ordered aggregates, at the very least. But
window aggregates are not just ordering-sensitive. They are also
empowered to look at other rows in the frame. Consider the following
example:

create table names (n text);
insert into names values ('Tom'), ('Dick'), ('Harry');
select n, lag(n, 1) over () from names;

The result is:

   n   | lag
-------+------
 Tom   |
 Dick  | Tom
 Harry | Dick

I think it is pretty obvious that if any form of partial aggregation
had been applied here, it would be impossible to correctly evaluate
lag(). Or am I missing something?

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Eager aggregation, take 3

From
Richard Guo
Date:
On Wed, Nov 6, 2024 at 11:43 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Nov 6, 2024 at 3:22 AM Richard Guo <guofenglinux@gmail.com> wrote:
> > Yeah, ordered aggregates could be a blocker.  I think it might be best
> > to prevent the use of eager aggregation if root->numOrderedAggs > 0
> > for now.
> >
> > I've been thinking about the window functions case, as Jian He also
> > mentioned it some time ago.  It seems that the window function's
> > argument(s), as well as the PARTITION BY expression(s), are supposed
> > to appear in the GROUP BY clause or be used in an aggregate function.
> > And window functions are applied after the aggregation.  So it seems
> > that there is no problem with window functions.  But maybe I'm wrong.
> >
> > I hadn't considered the RLS case before, but I think you're right.  To
> > simplify things, maybe for now we can just prevent pushing down the
> > aggregation if the query applies some RLS policy, by checking
> > query->hasRowSecurity.
>
> Particularly for the RLS case, I think we should be reluctant to
> disable the optimization entirely just because there might be a
> problem. We have existing infrastructure to keep security quals from
> being applied too late, and I suspect it's mostly applicable to this
> situation. Therefore, I suspect it might not be very much work to
> allow this optimization even when RLS is in use, as long as it
> wouldn't actually cause a violation of the RLS rules. If, upon
> investigation, you find some reason why we can't assess accurately
> whether pushing down a specific aggregate is a problem, then the
> approach that you propose is reasonable, but I think the question
> should be investigated. I don't like the idea of giving up on
> RLS-using queries completely without even trying to figure out how
> difficult it would be to do the right thing.

That makes sense.  I shouldn’t be lazy and simply disable this
optimization for the RLS case.  I'm not familiar with the RLS stuff
but I’ll take some time to investigate it further.

> I have similar but weaker feelings about ordered aggregates. Consider:
>
> explain select t1.id, array_agg(t2.v order by t3.o) from t1, t2, t3
> where t1.id = t2.id and t2.id = t3.id group by 1;
>
> We can't partially aggregate t2, but we could partially aggregate t2
> join t3. So this case is a lot like:
>
> explain select t1.id, array_agg(t2.v + t3.o) from t1, t2, t3 where
> t1.id = t2.id and t2.id = t3.id group by 1;
>
> I don't know whether the patch handles the second case correctly right
> now, but that certainly seems like a case that has to work. We must be
> able to determine in such a case that the partial aggregate has to be
> above the t2-t3 join. And if we can determine that, then why can't
> basically the same logic handle the first case? There are certainly
> some differences. The first case not only needs the aggregate to be
> above the t2-t3 join but also needs the input data to be sorted, so we
> don't get the right behavior for ordered aggregates just by using the
> contents of the ORDER BY clause to decide at what level the partial
> aggregate can be applied. On the other hand, if we're looking at paths
> for (T2 JOIN T3) to build paths for PartialAgg(T2 join T3), the
> stipulation that we need to use ordered paths or sorting doesn't make
> the code very much more complicated. I'm open to the conclusion that
> this is too much complexity but I'd rather not dismiss it instantly.

It seems to me that a partially aggregated row might need to be
combined with other partially aggregated rows after the join, if they
belong to the same t1.id group.  IIUC, this implies that we cannot
perform partial aggregation on ordered input before the join,
otherwise we may get incorrect results during the final aggregation
phase.

> Regarding window functions, you've said a few times now that you don't
> see the problem, but the more I think about it, the more obvious it
> seems to me that there are BIG problems. Consider this example from
> the documentation:
>
> SELECT depname, empno, salary, avg(salary) OVER (PARTITION BY depname)
> FROM empsalary;
>
> I get a query plan like this:
>
>  WindowAgg  (cost=83.46..104.37 rows=1200 width=72)
>    ->  Sort  (cost=83.37..86.37 rows=1200 width=40)
>          Sort Key: depname
>          ->  Seq Scan on empsalary  (cost=0.00..22.00 rows=1200 width=40)
>
> Already we see warning signs here. The WindowAgg node needs the input
> rows to be ordered, because it's going to average the salary for each
> group of rows with the same depname. So we have the same kinds of
> issues that we do for ordered aggregates, at the very least. But
> window aggregates are not just ordering-sensitive. They are also
> empowered to look at other rows in the frame. Consider the following
> example:
>
> create table names (n text);
> insert into names values ('Tom'), ('Dick'), ('Harry');
> select n, lag(n, 1) over () from names;
>
> The result is:
>
>    n   | lag
> -------+------
>  Tom   |
>  Dick  | Tom
>  Harry | Dick
>
> I think it is pretty obvious that if any form of partial aggregation
> had been applied here, it would be impossible to correctly evaluate
> lag(). Or am I missing something?

Hmm, currently we only consider grouped aggregation for eager
aggregation.  For grouped aggregation, the window function's
arguments, as well as the PARTITION BY expressions, must appear in the
GROUP BY clause.  That is to say, the depname column in the first
query, or the n column in the second query, will not be aggregated
into the partial groups.  Instead, they will remain as they are as
input for the WindowAgg nodes.  It seems to me that this ensures
that we're good with window functions.  But maybe I'm wrong.

Thanks
Richard



Re: Eager aggregation, take 3

From
Robert Haas
Date:
On Sun, Nov 10, 2024 at 7:52 PM Richard Guo <guofenglinux@gmail.com> wrote:
> > I have similar but weaker feelings about ordered aggregates. Consider:
> >
> > explain select t1.id, array_agg(t2.v order by t3.o) from t1, t2, t3
> > where t1.id = t2.id and t2.id = t3.id group by 1;
> >
>
> It seems to me that a partially aggregated row might need to be
> combined with other partially aggregated rows after the join, if they
> belong to the same t1.id group.  IIUC, this implies that we cannot
> perform partial aggregation on ordered input before the join,
> otherwise we may get incorrect results during the final aggregation
> phase.

Hmm, I think you're right. I think that if the t1.id=t2.id join is one
to one, then it would work out fine, but that need not be the case.

> Hmm, currently we only consider grouped aggregation for eager
> aggregation.  For grouped aggregation, the window function's
> arguments, as well as the PARTITION BY expressions, must appear in the
> GROUP BY clause.  That is to say, the depname column in the first
> query, or the n column in the second query, will not be aggregated
> into the partial groups.  Instead, they will remain as they are as
> input for the WindowAgg nodes.  It seems to me that this ensures
> that we're good with window functions.  But maybe I'm wrong.

Unfortunately, I don't know what you mean by grouped aggregation. I
think of grouping and aggregation as synonyms, pretty much.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Eager aggregation, take 3

From
Richard Guo
Date:
On Tue, Nov 12, 2024 at 1:30 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Nov 10, 2024 at 7:52 PM Richard Guo <guofenglinux@gmail.com> wrote:
> > Hmm, currently we only consider grouped aggregation for eager
> > aggregation.  For grouped aggregation, the window function's
> > arguments, as well as the PARTITION BY expressions, must appear in the
> > GROUP BY clause.  That is to say, the depname column in the first
> > query, or the n column in the second query, will not be aggregated
> > into the partial groups.  Instead, they will remain as they are as
> > input for the WindowAgg nodes.  It seems to me that this ensures
> > that we're good with window functions.  But maybe I'm wrong.
>
> Unfortunately, I don't know what you mean by grouped aggregation. I
> think of grouping and aggregation as synonyms, pretty much.

Ah, sorry for the confusion.  By "grouped aggregation", I mean
aggregation with a GROUP BY clause, where we produce a result row for
each group.  This contrasts with plain aggregation, where there is a
single result row for the whole query.
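
In planner terms the distinction is simply whether the query has a
non-empty GROUP BY list; a trivial sketch of the check (ignoring
grouping sets):

    /*
     * "Grouped aggregation": groupClause is non-empty, one result row
     * per group.  "Plain aggregation": groupClause is NIL, a single
     * result row for the whole query.
     */
    bool        is_grouped_agg = (root->parse->groupClause != NIL);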

Thanks
Richard



Re: Eager aggregation, take 3

From
Robert Haas
Date:
On Sun, Nov 10, 2024 at 7:52 PM Richard Guo <guofenglinux@gmail.com> wrote:
> Hmm, currently we only consider grouped aggregation for eager
> aggregation.  For grouped aggregation, the window function's
> arguments, as well as the PARTITION BY expressions, must appear in the
> GROUP BY clause.  That is to say, the depname column in the first
> query, or the n column in the second query, will not be aggregated
> into the partial groups.  Instead, they will remain as they are as
> input for the WindowAgg nodes.  It seems to me that this ensures
> that we're good with window functions.  But maybe I'm wrong.

Returning to this point now that I understand what you meant by
grouped aggregation:

I still don't understand how you expect to be able to evaluate
functions like LEAD() and LAG() if any form of partial aggregation has
been done.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Eager aggregation, take 3

From
Richard Guo
Date:
On Wed, Dec 4, 2024 at 11:38 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Nov 10, 2024 at 7:52 PM Richard Guo <guofenglinux@gmail.com> wrote:
> > Hmm, currently we only consider grouped aggregation for eager
> > aggregation.  For grouped aggregation, the window function's
> > arguments, as well as the PARTITION BY expressions, must appear in the
> > GROUP BY clause.  That is to say, the depname column in the first
> > query, or the n column in the second query, will not be aggregated
> > into the partial groups.  Instead, they will remain as they are as
> > input for the WindowAgg nodes.  It seems to me that this ensures
> > that we're good with window functions.  But maybe I'm wrong.
>
> Returning to this point now that I understand what you meant by
> grouped aggregation:
>
> I still don't understand how you expect to be able to evaluate
> functions like LEAD() and LAG() if any form of partial aggregation has
> been done.

In grouped aggregation, the non-aggregate arguments of the window
function must appear in the GROUP BY clause, so they will not be
aggregated into the partial groups.  It seems to me that this ensures
that they remain available as valid inputs for the window function.

For the Aggref arguments of the window function, their final values
are calculated in the Finalize Agg node, meaning they, too, are good
to be used as inputs for the window function.

As an example, please consider

create table tbl (a int, b int, c int);
insert into tbl select i%3, i%3, i%3 from generate_series(1,1000)i;
analyze tbl;

explain (verbose, costs off)
select lead(t1.a+sum(t2.b)) over (), sum(t2.c) from
tbl t1 join tbl t2 on t1.b = t2.b group by t1.a;
                                  QUERY PLAN
------------------------------------------------------------------------------
 WindowAgg
   Output: lead((t1.a + (sum(t2.b)))) OVER (?), (sum(t2.c)), t1.a
   ->  Finalize HashAggregate
         Output: t1.a, sum(t2.b), sum(t2.c)
         Group Key: t1.a
         ->  Hash Join
               Output: t1.a, (PARTIAL sum(t2.b)), (PARTIAL sum(t2.c))
               Hash Cond: (t1.b = t2.b)
               ->  Seq Scan on public.tbl t1
                     Output: t1.a, t1.b, t1.c
               ->  Hash
                     Output: t2.b, (PARTIAL sum(t2.b)), (PARTIAL sum(t2.c))
                     ->  Partial HashAggregate
                           Output: t2.b, PARTIAL sum(t2.b), PARTIAL sum(t2.c)
                           Group Key: t2.b
                           ->  Seq Scan on public.tbl t2
                                 Output: t2.a, t2.b, t2.c
(17 rows)

It seems to me that both 't1.a' and 'sum(t2.b)' are valid inputs for
LEAD(), even though we have performed partial aggregation.

Am I missing something?

Thanks
Richard



Re: Eager aggregation, take 3

From
jian he
Date:
hi.
in create_grouping_expr_infos

        tce = lookup_type_cache(exprType((Node *) tle->expr),
                                TYPECACHE_BTREE_OPFAMILY);
        if (!OidIsValid(tce->btree_opf) ||
            !OidIsValid(tce->btree_opintype))
            return;
       ....
        /*
         * Get the operator in the btree's opfamily.
         */
        eq_op = get_opfamily_member(tce->btree_opf,
                                    tce->btree_opintype,
                                    tce->btree_opintype,
                                    BTEqualStrategyNumber);
        if (!OidIsValid(eq_op))
            return;
        eq_opfamilies = get_mergejoin_opfamilies(eq_op);
        if (!eq_opfamilies)
            return;
        btree_opfamily = linitial_oid(eq_opfamilies);


If eq_op is valid, then we don't need to call get_mergejoin_opfamilies,
since its output will be the same as tce->btree_opf, and we have
already checked that tce->btree_opf is valid.

In other words, I think eq_op being valid implies that tce->btree_opf
is the value (btree opfamily) we need.



Re: Eager aggregation, take 3

From
Richard Guo
Date:
On Thu, Jan 9, 2025 at 12:15 PM jian he <jian.universality@gmail.com> wrote:
> hi.
> in create_grouping_expr_infos
>
>         tce = lookup_type_cache(exprType((Node *) tle->expr),
>                                 TYPECACHE_BTREE_OPFAMILY);
>         if (!OidIsValid(tce->btree_opf) ||
>             !OidIsValid(tce->btree_opintype))
>             return;
>        ....
>         /*
>          * Get the operator in the btree's opfamily.
>          */
>         eq_op = get_opfamily_member(tce->btree_opf,
>                                     tce->btree_opintype,
>                                     tce->btree_opintype,
>                                     BTEqualStrategyNumber);
>         if (!OidIsValid(eq_op))
>             return;
>         eq_opfamilies = get_mergejoin_opfamilies(eq_op);
>         if (!eq_opfamilies)
>             return;
>         btree_opfamily = linitial_oid(eq_opfamilies);
>
>
> If eq_op is valid, then we don't need to call get_mergejoin_opfamilies,
> since its output will be the same as tce->btree_opf, and we have
> already checked that tce->btree_opf is valid.
>
> In other words, I think eq_op being valid implies that tce->btree_opf
> is the value (btree opfamily) we need.

Nice catch!  Actually, we can use tce->btree_opf directly, without
needing to check its equality operator, since we know it's a btree
opfamily and it's valid.  If it were a different opfamily (such as a
hash opfamily), we would need to look up its equality operator, and
select some btree opfamily that that operator is part of.  But in this
case, that's not necessary.
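
In code form the simplification would be roughly this (a sketch based
on the fragment you quoted):

    tce = lookup_type_cache(exprType((Node *) tle->expr),
                            TYPECACHE_BTREE_OPFAMILY);
    if (!OidIsValid(tce->btree_opf) ||
        !OidIsValid(tce->btree_opintype))
        return;

    /*
     * tce->btree_opf is already a valid btree opfamily for this type,
     * so there is no need to look up its equality operator and then
     * search for the mergejoinable opfamilies containing it.
     */
    btree_opfamily = tce->btree_opf;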

Thanks
Richard



Re: Eager aggregation, take 3

From
Robert Haas
Date:
On Sun, Jan 12, 2025 at 9:04 PM Richard Guo <guofenglinux@gmail.com> wrote:
> Attached is an updated version of this patch that addresses Jian's
> review comments, along with some more cosmetic tweaks.  I'm going to
> be looking at this patch again from the point of view of committing
> it, with the plan to commit it late this week or early next week,
> barring any further comments or objections.

I feel this is rushed. This is a pretty big patch touching a sensitive
area of the code. I'm the only senior hacker who has reviewed it, and
I would say that I've only reviewed it pretty lightly, and that the
concerns I raised were fairly substantial. I don't think it's
customary to go from that point to commit after one more patch
revision. This really deserves to be looked at by multiple senior
hackers familiar with the planner; or at least by Tom.

My core concerns here are still what they were in the first email I
posted to the thread: it's unclear that the cost model will deliver
meaningful answers for the grouped rels, and it doesn't seem like
you've done enough to limit the overhead of the feature.

With regard to the first, I reiterate that we are in general quite bad
at having meaningful statistics for anything above an aggregate, and
this patch greatly expands how much of a query could be above an
aggregate. I felt back in August when I did my first review, and still
feel now, that when faced with a query where aggregation could be done
at any of several levels, the chances of picking the right one are not
much better than random. Why do you think otherwise?

With regard to the second, I suggested several lines of thinking back
in August that could lead to limiting the number of grouped_rels that
we create, but it doesn't really look like much of anything has
changed. We're still creating a partially grouped rel for every
baserel in the query, and every joinrel in the query. I'm not very
happy with "let's just turn it off by default" as the answer to that
concern. A lot of people won't enable the feature, which will mean it
doesn't have much value to our users, and those who do will still see
a lot of overhead. Maybe I'm wrong, but I bet with some good
heuristics the planning cost of this could be reduced by an order of
magnitude or more. If that were done, we could imagine eventually (or
maybe even immediately) enabling this by default; without that, we
still have the burden of maintaining this code and keeping it working,
but almost nobody will benefit.

+      <term><varname>enable_eager_aggregate</varname> (<type>boolean</type>)
+       <para>
+        Enables or disables the query planner's ability to partially push
+        aggregation past a join, and finalize it once all the relations are
+        joined. The default is <literal>off</literal>.

I'm a bit concerned about the naming here. I feel like we're adding an
increasing number of planner features with an increasing number of
disabling GUCs that are all a bit random. I kind of wonder if this
should be called enable_incremental_aggregate. Maybe that's worse,
because "eager" is a word we're not using for anything yet, so using
it here improves greppability and perhaps understandability. On the
other hand, the aggregate that is pushed down by this feature is
always partial (I believe) so we still need a finalize step later,
which means we're aggregating incrementally. There's some nice parity
with incremental sort, too, perhaps.

+/* The original length and hashtable of a RelInfoList */
+typedef struct
+{
+ int savelength;
+ struct HTAB *savehash;
+} RelInfoListInfo;

Both the comment and the name of the data type are completely meaningless.

+ /*
+ * Try at least sorting the cheapest path and also try
+ * incrementally sorting any path which is partially sorted
+ * already (no need to deal with paths which have presorted
+ * keys when incremental sort is disabled unless it's the
+ * cheapest input path).
+ */

This would be the fifth copy of this comment. It's not entirely this
patch's fault, of course, but some kind of refactoring or cleanup is
probably needed here.

+ * cheapest_parameterized_paths also always includes the fewest-row
+ * unparameterized path, if there is one, for grouped relations.  Different
+ * paths of a grouped relation can have very different row counts, and in some
+ * cases the cheapest-total unparameterized path may not be the one with the
+ * fewest rows.

As I said back in October, this seems like mixing together in one
RelOptInfo paths that really belong to two different RelOptInfos.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Eager aggregation, take 3

From
Richard Guo
Date:
On Wed, Jan 15, 2025 at 12:07 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Jan 12, 2025 at 9:04 PM Richard Guo <guofenglinux@gmail.com> wrote:
> > Attached is an updated version of this patch that addresses Jian's
> > review comments, along with some more cosmetic tweaks.  I'm going to
> > be looking at this patch again from the point of view of committing
> > it, with the plan to commit it late this week or early next week,
> > barring any further comments or objections.
>
> I feel this is rushed. This is a pretty big patch touching a sensitive
> area of the code. I'm the only senior hacker who has reviewed it, and
> I would say that I've only reviewed it pretty lightly, and that the
> concerns I raised were fairly substantial. I don't think it's
> customary to go from that point to commit after one more patch
> revision. This really deserves to be looked at by multiple senior
> hackers familiar with the planner; or at least by Tom.

Thank you for your input.  In fact, there have been several changes
since your last review, as I mentioned in the off-list email.
However, I agree that it would be great if someone else, especially
Tom, could take a look at this patch.

> My core concerns here are still what they were in the first email I
> posted to the thread: it's unclear that the cost model will deliver
> meaningful answers for the grouped rels, and it doesn't seem like
> you've done enough to limit the overhead of the feature.
>
> With regard to the first, I reiterate that we are in general quite bad
> at having meaningful statistics for anything above an aggregate, and
> this patch greatly expands how much of a query could be above an
> aggregate. I felt back in August when I did my first review, and still
> feel now, that when faced with a query where aggregation could be done
> at any of several levels, the chances of picking the right one are not
> much better than random. Why do you think otherwise?

I understand that we're currently quite bad at estimating the number
of groups after aggregation.  In fact, it's not just aggregation
estimates — we're also bad at join estimates in some cases.  This is a
reality we have to face.  Here's what I think: we should be trying our
best to cost each node type as accurately as possible, and then build
the upper nodes based on those costs.  We should not conclude that,
because we are unable to accurately cost one node type, we should
avoid any cost-based optimizations above that node.

Actually, performing aggregation before joins is not a new concept;
consider JOIN_UNIQUE_OUTER/INNER, for example:

explain (costs off)
select * from t t1 join t t2 on t1.b = t2.b
where (t1.a, t1.b) in
    (select t3.a, t3.b from t t3, t t4 where t3.a > t4.a);
                      QUERY PLAN
------------------------------------------------------
 Hash Join
   Hash Cond: ((t2.b = t1.b) AND (t3.a = t1.a))
   ->  Hash Join
         Hash Cond: (t2.b = t3.b)
         ->  Seq Scan on t t2
         ->  Hash
               ->  HashAggregate
                     Group Key: t3.a, t3.b
                     ->  Nested Loop
                           Join Filter: (t3.a > t4.a)
                           ->  Seq Scan on t t3
                           ->  Materialize
                                 ->  Seq Scan on t t4
   ->  Hash
         ->  Seq Scan on t t1
(15 rows)

I believe the HashAggregate node in this plan faces the same problem
with inaccurate estimates.  However, I don't think it's reasonable to
say that, because we cannot accurately cost the Aggregate node, we
should disregard considering JOIN_UNIQUE_OUTER/INNER.

Back in August, I responded to this issue by "Maybe we can run some
benchmarks first and investigate the regressions discovered on a
case-by-case basis".  In October, I ran the TPC-DS benchmark at scale
10 and observed that eager aggregation was applied in 7 queries, with
no notable regressions.  In contrast, Q4 and Q11 showed performance
improvements of 3–4 times.  Please see [1].

> With regard to the second, I suggested several lines of thinking back
> in August that could lead to limiting the number of grouped_rels that
> we create, but it doesn't really look like much of anything has
> changed. We're still creating a partially grouped rel for every
> baserel in the query, and every joinrel in the query. I'm not very
> happy with "let's just turn it off by default" as the answer to that
> concern. A lot of people won't enable the feature, which will mean it
> doesn't have much value to our users, and those who do will still see
> a lot of overhead. Maybe I'm wrong, but I bet with some good
> heuristics the planning cost of this could be reduced by an order of
> magnitude or more. If that were done, we could imagine eventually (or
> maybe even immediately) enabling this by default; without that, we
> still have the burden of maintaining this code and keeping it working,
> but almost nobody will benefit.

Actually, I introduced the EAGER_AGGREGATE_RATIO mechanism in October
to limit the planning effort for eager aggregation.  For more details,
please see [2].

And I don't think it's correct to say that we create a partially
grouped rel for every baserel and every joinrel.  This patch includes
a bunch of logic to determine whether it's appropriate to create a
grouped rel for a base or join rel.  Furthermore, with the
EAGER_AGGREGATE_RATIO mechanism, even if creating a grouped rel is
possible, we will skip it if the grouped paths are considered not
useful.  All of these measures help reduce the number of grouped
paths as well as the grouped relations in many cases where eager
aggregation would not help a lot.

Based on the TPC-DS benchmark results, I don't see "a lot of overhead"
in the planning cost, at least for the 7 queries where eager
aggregation is applied.  As I said in [1], "For the planning time, I
do not see notable regressions for any of the seven queries".  In
fact, I initially thought that we might consider enabling this by
default, given the positive benchmark results, but I just couldn't
summon the courage to do it.  Perhaps we should reconsider enabling it
by default, so users can benefit from the new feature and help
identify any potential bugs.

> +      <term><varname>enable_eager_aggregate</varname> (<type>boolean</type>)
> +       <para>
> +        Enables or disables the query planner's ability to partially push
> +        aggregation past a join, and finalize it once all the relations are
> +        joined. The default is <literal>off</literal>.
>
> I'm a bit concerned about the naming here. I feel like we're adding an
> increasing number of planner features with an increasing number of
> disabling GUCs that are all a bit random. I kind of wonder if this
> should be called enable_incremental_aggregate. Maybe that's worse,
> because "eager" is a word we're not using for anything yet, so using
> it here improves greppability and perhaps understandability. On the
> other hand, the aggregate that is pushed down by this feature is
> always partial (I believe) so we still need a finalize step later,
> which means we're aggregating incrementally. There's some nice parity
> with incremental sort, too, perhaps.

As I mentioned in [3], the name "Eager Aggregation" is inherited from
the paper "Eager Aggregation and Lazy Aggregation" [4], from which
many of the ideas in this feature are derived.  Personally, I like
this name a lot, but I'm open to other names if others find it
unreasonable.

> +/* The original length and hashtable of a RelInfoList */
> +typedef struct
> +{
> + int savelength;
> + struct HTAB *savehash;
> +} RelInfoListInfo;
>
> Both the comment and the name of the data type are completely meaningless.

Thanks.  Will fix the comment and the name for this data type.

> + /*
> + * Try at least sorting the cheapest path and also try
> + * incrementally sorting any path which is partially sorted
> + * already (no need to deal with paths which have presorted
> + * keys when incremental sort is disabled unless it's the
> + * cheapest input path).
> + */
>
> This would be the fifth copy of this comment. It's not entirely this
> patch's fault, of course, but some kind of refactoring or cleanup is
> probably needed here.

Agreed.  However, I think it would be better to refactor this in a
separate patch.  This issue also exists on master, and I'd prefer to
avoid introducing such refactors in this already large patch.

> + * cheapest_parameterized_paths also always includes the fewest-row
> + * unparameterized path, if there is one, for grouped relations.  Different
> + * paths of a grouped relation can have very different row counts, and in some
> + * cases the cheapest-total unparameterized path may not be the one with the
> + * fewest rows.
>
> As I said back in October, this seems like mixing together in one
> RelOptInfo paths that really belong to two different RelOptInfos.

I understand what you said in October about the design where
"PartialAgg(t1 JOIN t2) and t1 JOIN PartialAgg(t2) get separate
RelOptInfos", because "it's less clear whether it's fair to compare
across the two categories".  I've shared my thoughts on this in [5].

Furthermore, even if we separate these grouped paths into two
different RelOptInfos, we still face the issue that "different paths
of a grouped relation can have very different row counts", and we need
a way to handle this.  One could argue that we can separate the
grouped paths where partial aggregation is placed at different
locations into different RelOptInfos, but this would lead to an
explosion in the number of RelOptInfos for grouped relations as we
climb up the join tree.  I think this is neither realistic nor
necessary.

[1] https://postgr.es/m/CAMbWs49DrR8Gkp3TUwFJV_1ShtmLzQUq3mOYD+GyF+Y3AmmrFw@mail.gmail.com
[2] https://postgr.es/m/CAMbWs48OS3Z0G5u3fhar1=H_ucmEcUaX0tRUNpcLQxHt=z4Y7w@mail.gmail.com
[3] https://postgr.es/m/CAMbWs48jzLrPt1J_00ZcPZXWUQKawQOFE8ROc-ADiYqsqrpBNw@mail.gmail.com
[4] https://www.vldb.org/conf/1995/P345.PDF
[5] https://postgr.es/m/CAMbWs49dLjSSQRWeud+KSN0G531ciZdYoLBd5qktXA+3JQm_UQ@mail.gmail.com

Thanks
Richard



Re: Eager aggregation, take 3

From
Robert Haas
Date:
On Wed, Jan 15, 2025 at 1:58 AM Richard Guo <guofenglinux@gmail.com> wrote:
> I understand that we're currently quite bad at estimating the number
> of groups after aggregation.  In fact, it's not just aggregation
> estimates — we're also bad at join estimates in some cases.  This is a
> reality we have to face.  Here's what I think: we should be trying our
> best to cost each node type as accurately as possible, and then build
> the upper nodes based on those costs.  We should not conclude that,
> because we are unable to accurately cost one node type, we should
> avoid any cost-based optimizations above that node.

Well, I agree with that last sentence, for sure. But I don't think
it's true that the situations with joins and aggregates are
comparable. We are much better able to estimate the number of rows
that will come out of a join than we are to estimate the number of
rows that come out of an aggregate. It's certainly true that in some
cases we get join estimates badly wrong, and I'd like to see us do
better there, but our estimates of the number of distinct values that
exist in a column are the least reliable part of our statistics system
by far.

Also, we look at the underlying statistics for a column variable even
after joins and aggregates and assume (not having any other
information) that the distribution after that operation is likely to
be similar to the distribution before that operation. Consider a table
A with columns x and y. Let's say x is a unique ID and y is a
dependent value with some distribution over a finite range of
possibilities (e.g. a person's age). If we join table A to some other
table B on A.x = B.x and filter out some of the rows via that join,
the distribution of values in column y is likely to be altered. If the
rows are removed at random, the original distribution will prevail,
but often it won't be random and so the distribution will change in a
way we can't predict. However, guessing that the pre-join distribution
of A.y still prevails isn't crazy, and it's better than assuming we can
say nothing about the distribution.

But now let's say that after joining to B, we perform an aggregation
operation, computing the minimum value of A.y for each value of B.z. At
this point, we have no usable statistics for either output column. The
result must be unique on B.z, and the distribution of MIN(A.y) is
going to be entirely different from the distribution of A.y. Any
future joins that we perform here will have to be estimated without
any MCVs, which is going to reduce the accuracy of the estimation by a
lot. In summary, the join makes relying on our MCV information less
likely to be accurate, but the aggregate makes it impossible to rely
on our MCV information at all. In terms of the accuracy of our
results, that is a lot worse.

> I believe the HashAggregate node in this plan faces the same problem
> with inaccurate estimates.  However, I don't think it's reasonable to
> say that, because we cannot accurately cost the Aggregate node, we
> should disregard considering JOIN_UNIQUE_OUTER/INNER.

Fair point.

> Back in August, I responded to this issue by "Maybe we can run some
> benchmarks first and investigate the regressions discovered on a
> case-by-case basis".  In October, I ran the TPC-DS benchmark at scale
> 10 and observed that eager aggregation was applied in 7 queries, with
> no notable regressions.  In contrast, Q4 and Q11 showed performance
> improvements of 3–4 times.  Please see [1].

I had forgotten about that, and again, fair point, but I'm concerned
that it might not be a broad enough base of queries to test against.
(7 isn't a very large number.)

> Actually, I introduced the EAGER_AGGREGATE_RATIO mechanism in October
> to limit the planning effort for eager aggregation.  For more details,
> please see [2].

OK, I missed this, but...

> And I don't think it's correct to say that we create a partially
> grouped rel for every baserel and every joinrel.  This patch includes
> a bunch of logic to determine whether it's appropriate to create a
> grouped rel for a base or join rel.  Furthermore, with the
> EAGER_AGGREGATE_RATIO mechanism, even if creating a grouped rel is
> possible, we will skip it if the grouped paths are considered not
> useful.  All of these measures help reduce the number of grouped
> paths as well as the grouped relations in many cases where eager
> aggregation would not help a lot.

...it looks to me like EAGER_AGGREGATE_RATIO is used to set the
RelAggInfo's agg_useful field, which seems like it happens after the
RelOptInfo has already been created. I had been looking for something
that would avoid creating the RelOptInfo in the first place and I
didn't see it.

> Based on the TPC-DS benchmark results, I don't see "a lot of overhead"
> in the planning cost, at least for the 7 queries where eager
> aggregation is applied.  As I said in [1], "For the planning time, I
> do not see notable regressions for any of the seven queries".  In
> fact, I initially thought that we might consider enabling this by
> default, given the positive benchmark results, but I just couldn't
> summon the courage to do it.  Perhaps we should reconsider enabling it
> by default, so users can benefit from the new feature and help
> identify any potential bugs.

If you're going to commit this, I think it would be a good idea to
enable it by default at least for now. If there are problems, it's
better to find out about them sooner rather than later. If they are
minor they can be fixed; if they are major, we can consider whether it
is better to fix them, disable the feature by default, or revert. We
can add an open item to reconsider the default setting during beta.

> > As I said back in October, this seems like mixing together in one
> > RelOptInfo paths that really belong to two different RelOptInfos.
>
> I understand what you said in October about the design where
> "PartialAgg(t1 JOIN t2) and t1 JOIN PartialAgg(t2) get separate
> RelOptInfos", because "it's less clear whether it's fair to compare
> across the two categories".  I've shared my thoughts on this in [5].
>
> Furthermore, even if we separate these grouped paths into two
> different RelOptInfos, we still face the issue that "different paths
> of a grouped relation can have very different row counts", and we need
> a way to handle this.  One could argue that we can separate the
> grouped paths where partial aggregation is placed at different
> locations into different RelOptInfos, but this would lead to an
> explosion in the number of RelOptInfos for grouped relations as we
> climb up the join tree.  I think this is neither realistic nor
> necessary.

It's possible you're right, but it does make me nervous. I do agree
that making the number of RelOptInfos explode would be really bad.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Eager aggregation, take 3

From
Richard Guo
Date:
On Wed, Jan 15, 2025 at 11:40 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jan 15, 2025 at 1:58 AM Richard Guo <guofenglinux@gmail.com> wrote:
> > I understand that we're currently quite bad at estimating the number
> > of groups after aggregation.  In fact, it's not just aggregation
> > estimates — we're also bad at join estimates in some cases.  This is a
> > reality we have to face.  Here's what I think: we should be trying our
> > best to cost each node type as accurately as possible, and then build
> > the upper nodes based on those costs.  We should not conclude that,
> > because we are unable to accurately cost one node type, we should
> > avoid any cost-based optimizations above that node.
>
> Well, I agree with that last sentence, for sure. But I don't think
> it's true that the situations with joins and aggregates are
> comparable. We are much better able to estimate the number of rows
> that will come out of a join than we are to estimate the number of
> rows that come out of an aggregate. It's certainly true that in some
> cases we get join estimates badly wrong, and I'd like to see us do
> better there, but our estimates of the number of distinct values that
> exist in a column are the least reliable part of our statistics system
> by far.

I totally understand that the situation with joins is better than with
aggregates, which is why I said that we're also bad at join estimates
"in some cases" - especially in the cases where we fall back to use
default selectivity estimates.  A simple example:

create table t1 (a int, b int);
create table t2 (a int, b int);

insert into t1 select i, i from generate_series(1,1000)i;
insert into t2 select i, i from generate_series(1000, 1999)i;

analyze t1, t2;

explain analyze select * from t1 join t2 on t1.a > t2.a;

And here is what I got:

 Nested Loop  (cost=0.00..15032.50 rows=333333 width=16)
              (actual time=392.841..392.854 rows=0 loops=1)

If this t1/t2 join is part of a larger SELECT query, I think the cost
estimates for the upper join nodes would likely be quite inaccurate.

> > I believe the HashAggregate node in this plan faces the same problem
> > with inaccurate estimates.  However, I don't think it's reasonable to
> > say that, because we cannot accurately cost the Aggregate node, we
> > should disregard considering JOIN_UNIQUE_OUTER/INNER.
>
> Fair point.
>
> > Back in August, I responded to this issue by "Maybe we can run some
> > benchmarks first and investigate the regressions discovered on a
> > case-by-case basis".  In October, I ran the TPC-DS benchmark at scale
> > 10 and observed that eager aggregation was applied in 7 queries, with
> > no notable regressions.  In contrast, Q4 and Q11 showed performance
> > improvements of 3–4 times.  Please see [1].
>
> I had forgotten about that, and again, fair point, but I'm concerned
> that it might not be a broad enough base of queries to test against.
> (7 isn't a very large number.)

Yeah, I know 7 is not a large number, but this is the result I got
from running the TPC-DS benchmark.  For the remaining 92 queries in
the benchmark, either the logic in this patch determines that eager
aggregation is not applicable, or the path with eager aggregation is
not the optimal one.  I'd be more than happy if a benchmark query
showed significant performance regression, so it would provide an
opportunity to investigate how the cost estimates are negatively
impacting the final plan and explore ways to avoid or improve that.
If anyone can provide such a benchmark query, I'd be very grateful.

Perhaps this is another reason why we should enable this feature by
default, so we can identify such regression issues sooner rather than
later.

> > Actually, I introduced the EAGER_AGGREGATE_RATIO mechanism in October
> > to limit the planning effort for eager aggregation.  For more details,
> > please see [2].
>
> OK, I missed this, but...
>
> > And I don't think it's correct to say that we create a partially
> > grouped rel for every baserel and every joinrel.  This patch includes
> > a bunch of logic to determine whether it's appropriate to create a
> > grouped rel for a base or join rel.  Furthermore, with the
> > EAGER_AGGREGATE_RATIO mechanism, even if creating a grouped rel is
> > possible, we will skip it if the grouped paths are considered not
> > useful.  All of these measures help reduce the number of grouped
> > paths as well as the grouped relations in many cases where eager
> > aggregation would not help a lot.
>
> ...it looks to me like EAGER_AGGREGATE_RATIO is used to set the
> RelAggInfo's agg_useful field, which seems like it happens after the
> RelOptInfo has already been created. I had been looking for something
> that would avoid creating the RelOptInfo in the first place and I
> didn't see it.

Well, from the perspective of planning effort, what really matters is
whether the RelOptInfo for the grouped relation is added to the
PlannerInfo, as it is only then available for further joining in the
join search routine, not whether the RelOptInfo is built or not.
Building the RelOptInfo for a grouped relation is simply a makeNode
call followed by a flat copy; it doesn't require going through the
full process of determining its target list, or constructing its
restrict and join clauses, or calculating size estimates, etc.
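
As a rough illustration of that flat copy (a sketch, not the patch's
exact code):

    /*
     * Build the grouped rel as "a makeNode call followed by a flat
     * copy"; the real code likely does some extra bookkeeping on top
     * of this.
     */
    RelOptInfo *grouped_rel = makeNode(RelOptInfo);

    memcpy(grouped_rel, rel, sizeof(RelOptInfo));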

Now, let's take a look at how the EAGER_AGGREGATE_RATIO mechanism is
used.  As you mentioned, EAGER_AGGREGATE_RATIO is used to set the
agg_useful field of the RelAggInfo.  For a base rel where we've
decided that aggregation can be pushed down, if agg_useful is false,
we skip building the grouped relation for it in the first place, not
to mention adding the grouped relation to the PlannerInfo.  For a join
rel where aggregation can be pushed down, if agg_useful is false, we
will create a temporary RelOptInfo for its grouped relation, but we
only add this RelOptInfo to the PlannerInfo if we can generate any
grouped paths by joining its input relations.  We could easily modify
make_grouped_join_rel() to create this temporary RelOptInfo only when
needed, but I'm not sure if that's necessary, since I don't have data
to suggest that the creation of this temporary RelOptInfo is a factor
in causing planning regressions.

> > Based on the TPC-DS benchmark results, I don't see "a lot of overhead"
> > in the planning cost, at least for the 7 queries where eager
> > aggregation is applied.  As I said in [1], "For the planning time, I
> > do not see notable regressions for any of the seven queries".  In
> > fact, I initially thought that we might consider enabling this by
> > default, given the positive benchmark results, but I just couldn't
> > summon the courage to do it.  Perhaps we should reconsider enabling it
> > by default, so users can benefit from the new feature and help
> > identify any potential bugs.
>
> If you're going to commit this, I think it would be a good idea to
> enable it by default at least for now. If there are problems, it's
> better to find out about them sooner rather than later. If they are
> minor they can be fixed; if they are major, we can consider whether it
> is better to fix them, disable the feature by default, or revert. We
> can add an open item to reconsider the default setting during beta.

Agreed.  And I like the suggestion of adding an open item about the
default setting during beta.

> > > As I said back in October, this seems like mixing together in one
> > > RelOptInfo paths that really belong to two different RelOptInfos.
> >
> > I understand what you said in October about the design where
> > "PartialAgg(t1 JOIN t2) and t1 JOIN PartialAgg(t2) get separate
> > RelOptInfos", because "it's less clear whether it's fair to compare
> > across the two categories".  I've shared my thoughts on this in [5].
> >
> > Furthermore, even if we separate these grouped paths into two
> > different RelOptInfos, we still face the issue that "different paths
> > of a grouped relation can have very different row counts", and we need
> > a way to handle this.  One could argue that we can separate the
> > grouped paths where partial aggregation is placed at different
> > locations into different RelOptInfos, but this would lead to an
> > explosion in the number of RelOptInfos for grouped relations as we
> > climb up the join tree.  I think this is neither realistic nor
> > necessary.
>
> It's possible you're right, but it does make me nervous. I do agree
> that making the number of RelOptInfos explode would be really bad.

Based on my explanation in [1], I think it's acceptable to compare
grouped paths for the same grouped rel, regardless of where the
partial aggregation is placed.

I fully understand that I could be wrong about this, but I don't think
it would break anything in regular planning (i.e., planning without
eager aggregation).  We would never compare a grouped path with a
non-grouped path during scan/join planning.  As far as I can see, the
only consequence in that case would be that we might fail to select
the optimal grouped path and miss out on fully leveraging the benefits
of eager aggregation.

Back in November, I considered the possibility of introducing a
GroupPathInfo into the Path structure to store the location of the
partial aggregation as well as the estimated rowcount for this grouped
path, similar to how ParamPathInfo functions for parameterized paths.
However, after some exploration, I determined that this was
unnecessary.

But in any case, I don't think it's an option to separate the grouped
paths of the same grouped relation into different RelOptInfos based on
the location of the partial aggregation within the path tree.

[1] https://postgr.es/m/CAMbWs49dLjSSQRWeud+KSN0G531ciZdYoLBd5qktXA+3JQm_UQ@mail.gmail.com

Thanks
Richard



Re: Eager aggregation, take 3

From
Tom Lane
Date:
I'm very sorry for not having had any time to look at this patch
before --- it's been on my radar screen for awhile, but $LIFE has
been rather demanding lately.

Anyway, I've now read through the mail thread and portions of the
v16 patch, and I have to concur with Robert's qualms about whether
this is ready.  A few observations:

* The README addition, and the basically identical text in the
commit message, don't even provide a reason to believe that the
transformation is correct let alone that it will result in faster
execution.  I don't understand why it's so hard to provide a solid
correctness argument.  This work was supposedly based on an academic
paper; surely that paper must have included a correctness proof?
PG might need a few refinements, like being specific about what we
expect from the equality operators.  But an EXPLAIN plan is not an
argument.

* As for the performance aspect, we're given

 Finalize HashAggregate
   Group Key: a.i
   ->  Nested Loop
         ->  Partial HashAggregate
               Group Key: b.j
               ->  Seq Scan on b
         ->  Index Only Scan using a_pkey on a
               Index Cond: (i = b.j)

As far as I can see, this will require aggregation to be performed
across every row of "b", whereas the naive way would have aggregated
across only rows having join partners in "a".  If most "b" rows lack
a join partner then this will be far slower than the naive way.
I do see that it can be better if most "b" rows have multiple join
partners, because we'll re-use partial aggregation results instead
of (effectively) recalculating them.  But the README text makes it
sound like this is an unconditional win, which is not the right
mindset.  (In fact, in this specific example where a.i is presumed
unique, how's it a win at all?)

* I'm also concerned about what happens with aggregates that can have
large partial-aggregation values, such as string_agg().  With the
existing usage of partial aggregation for parallel queries, it's
possible to be confident that there are not many partial-aggregation
values in existence at the same time.  I don't think that holds for
pushed-down aggregates: for example, I wouldn't be surprised if the
planner chooses a join plan that requires stuffing all those values
into a hash table, or "materializes" the output of the partial
aggregation step.  Do we have logic that will avoid blowing out
memory during such queries?

* I am just as worried as Robert is about the notion of different
paths for the same RelOptInfo having different rowcount estimates.
That is an extremely fundamental violation of basic planner
assumptions.  We did bend it for parameterized paths by restating
those assumptions as (from optimizer/README):

  To keep cost estimation rules relatively simple, we make an implementation
  restriction that all paths for a given relation of the same parameterization
  (i.e., the same set of outer relations supplying parameters) must have the
  same rowcount estimate.  This is justified by insisting that each such path
  apply *all* join clauses that are available with the named outer relations.

I don't see any corresponding statement here, and it's not clear
to me that the point has been thought through adequately.

Another aspect that bothers me is that a RelOptInfo is understood
to contain a bunch of paths that all yield the same data (the same
set of columns), and it seems like that might not be the case here.
Certainly partially-aggregated paths will output something different
than unaggregated ones, but mightn't different join orders mutate the
column set even further?

I think that we might be better off building a separate RelOptInfo for
each way of pushing down the aggregates, in order to preserve the
principle that all the paths in any one RelOptInfo have the same
output.  This'll mean more RelOptInfos, but not more paths, so
I doubt it adds that much performance overhead.

Richard Guo <guofenglinux@gmail.com> writes:
> Back in November, I considered the possibility of introducing a
> GroupPathInfo into the Path structure to store the location of the
> partial aggregation as well as the estimated rowcount for this grouped
> path, similar to how ParamPathInfo functions for parameterized paths.
> However, after some exploration, I determined that this was
> unnecessary.

Why did you determine that was unnecessary?  The principal function
of ParamPathInfo IMV is to ensure that we use exactly the same
rowcount estimate for all the paths that should have the same
estimate, and that problem seems to exist here as well.  If you
don't have a forcing mechanism then paths' estimates will diverge
as a result of things like different roundoff errors in different
join sequences.

Anyway, I agree with Robert that this isn't ready.  I don't feel
that I can even review it adequately without a lot better internal
documentation, specifically a clearer statement of what query shapes
the optimization applies to and what's the rationale for the
transformation being correct.  The commentary in pathnodes.h for the
new data structures is likewise so skimpy as to be near useless.

            regards, tom lane



Re: Eager aggregation, take 3

From
Richard Guo
Date:
On Fri, Jan 17, 2025 at 6:40 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> * The README addition, and the basically identical text in the
> commit message, don't even provide a reason to believe that the
> transformation is correct let alone that it will result in faster
> execution.  I don't understand why it's so hard to provide a solid
> correctness argument.  This work was supposedly based on an academic
> paper; surely that paper must have included a correctness proof?
> PG might need a few refinements, like being specific about what we
> expect from the equality operators.  But an EXPLAIN plan is not an
> argument.

Thank you for taking a look at this patch!

In README, I provided the justification for the correctness of this
transformation as follows:

  For the partial aggregation that is pushed down to a non-aggregated
  relation, we need to consider all expressions from this relation that
  are involved in upper join clauses and include them in the grouping
  keys, using compatible operators.  This is essential to ensure that an
  aggregated row from the partial aggregation matches the other side of
  the join if and only if each row in the partial group does.  This
  ensures that all rows within the same partial group share the same
  'destiny', which is crucial for maintaining correctness.

I believed that this explanation would make it clear why this
transformation is correct.
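
To make that concrete, assuming the README's example is of the shape
below (reusing the a/b schema that appears later in this message):

create table a (i int primary key, x int);
create table b (j int, y int);

select a.i, avg(b.y)
from a join b on a.i = b.j
group by a.i;

-- The partial aggregation on b groups by b.j, the expression from b
-- involved in the upper join clause, so a partially aggregated row
-- joins to "a" if and only if every row in its partial group does.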

Yeah, this work implements one of the transformations introduced in
paper "Eager Aggregation and Lazy Aggregation".  In the paper, Section
4 presents the formalism, Section 5 proves the main theorem, and
Section 6 introduces corollaries related to this specific
transformation.  I'm just not sure how to translate these theorems and
corollaries into natural language that would be suitable to be
included in the README.  I can give it another try if you find the
above justification not clear enough, but it would be really helpful
if I could get some assistance with this.

And I'd like to clarify that the EXPLAIN plan included in the README
is only meant to illustrate what this transformation looks like, and is
not intended to serve as an argument for its correctness.

> * As for the performance aspect, we're given
>
>  Finalize HashAggregate
>    Group Key: a.i
>    ->  Nested Loop
>          ->  Partial HashAggregate
>                Group Key: b.j
>                ->  Seq Scan on b
>          ->  Index Only Scan using a_pkey on a
>                Index Cond: (i = b.j)
>
> As far as I can see, this will require aggregation to be performed
> across every row of "b", whereas the naive way would have aggregated
> across only rows having join partners in "a".

Yes, that's correct.

> If most "b" rows lack
> a join partner then this will be far slower than the naive way.

No, this is not correct.  The partial aggregation may reduce the
number of input rows to the join, and the resulting data reduction
could justify the cost of performing the partial aggregation.  As an
example, please consider:

create table t1 (a int, b int, c int);
create table t2 (a int, b int, c int);

insert into t1 select i%3, i%3, i from generate_series(1,1000000)i;
insert into t2 select i%3+3, i%3+3, i from generate_series(1,1000000)i;

analyze t1, t2;

explain analyze
select sum(t2.c) from t1 join t2 on t1.b = t2.b group by t1.a;
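
Toggling the patch's enable_eager_aggregate GUC for comparison:

set enable_eager_aggregate to off;  -- without eager aggregation
set enable_eager_aggregate to on;   -- with eager aggregation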

So for this query, most (actually all) t2 rows lack a join partner.

Running it with and without eager aggregation, I got (best of 3):

-- with eager aggregation
 Execution Time: 496.856 ms

-- without eager aggregation
 Execution Time: 1723.844 ms

> I do see that it can be better if most "b" rows have multiple join
> partners, because we'll re-use partial aggregation results instead
> of (effectively) recalculating them.

Not only because we'll re-use partial aggregation results, but also
(and perhaps more importantly) because the number of input rows to the
join could be significantly reduced.

> But the README text makes it
> sound like this is an unconditional win, which is not the right
> mindset.

I'm sorry if the README text gives that impression.  The README says:

 If the partial aggregation on table B significantly reduces the number
 of input rows, the join above will be much cheaper, leading to a more
 efficient final plan.

Perhaps I should use "could" or "might" instead of "will" to make it
less misleading.

But as you can see from the implementation, the decision is entirely
based on cost, not on rules.  There is no part of the code that ever
assumes this transformation is an unconditional win.

> (In fact, in this specific example where a.i is presumed
> unique, how's it a win at all?)

It seems to me that whether it's a win depends on whether b.j is a
column with low cardinality (i.e., relatively few unique values).  I
don't really see how a.i being unique would change that.  Please
see the example below:

create table a (i int primary key, x int);
create table b (j int, y int);

insert into a select i, i%3 from generate_series(1,10000)i;
insert into b select i%3, i from generate_series(1,10000)i;

analyze a, b;

set enable_eager_aggregate to off;

EXPLAIN (ANALYZE, COSTS OFF)
SELECT a.i, avg(b.y)
FROM a JOIN b ON a.i > b.j
GROUP BY a.i;
                                            QUERY PLAN
--------------------------------------------------------------------------------------------------
 HashAggregate (actual time=100257.254..100268.841 rows=10000 loops=1)
   Group Key: a.i
   Batches: 1  Memory Usage: 2193kB
   Buffers: shared hit=133
   ->  Nested Loop (actual time=2.629..40849.630 rows=99990000 loops=1)
         Buffers: shared hit=133
         ->  Seq Scan on b (actual time=0.450..10.066 rows=10000 loops=1)
               Buffers: shared hit=45
         ->  Memoize (actual time=0.002..0.752 rows=9999 loops=10000)
               Cache Key: b.j
               Cache Mode: binary
               Hits: 9997  Misses: 3  Evictions: 0  Overflows: 0  Memory Usage: 1055kB
               Buffers: shared hit=88
               ->  Index Only Scan using a_pkey on a (actual time=0.752..8.100 rows=9999 loops=3)
                     Index Cond: (i > b.j)
                     Heap Fetches: 0
                     Buffers: shared hit=88
 Planning Time: 1.681 ms
 Execution Time: 100273.011 ms
(19 rows)

set enable_eager_aggregate to on;

EXPLAIN (ANALYZE, COSTS OFF)
SELECT a.i, avg(b.y)
FROM a JOIN b ON a.i > b.j
GROUP BY a.i;
                                         QUERY PLAN
--------------------------------------------------------------------------------------------
 Finalize HashAggregate (actual time=77.701..90.680 rows=10000 loops=1)
   Group Key: a.i
   Batches: 1  Memory Usage: 2193kB
   Buffers: shared hit=133
   ->  Nested Loop (actual time=27.586..52.352 rows=29997 loops=1)
         Buffers: shared hit=133
         ->  Partial HashAggregate (actual time=27.408..27.419 rows=3 loops=1)
               Group Key: b.j
               Batches: 1  Memory Usage: 24kB
               Buffers: shared hit=45
               ->  Seq Scan on b (actual time=0.173..3.767 rows=10000 loops=1)
                     Buffers: shared hit=45
         ->  Index Only Scan using a_pkey on a (actual time=0.108..5.277 rows=9999 loops=3)
               Index Cond: (i > b.j)
               Heap Fetches: 0
               Buffers: shared hit=88
 Planning Time: 1.739 ms
 Execution Time: 93.003 ms
(18 rows)

There is a performance improvement of ~1000 times, even though a.i is
unique.

# select 100273.011/93.003;
       ?column?
-----------------------
 1078.1696396890422890
(1 row)

(I used 'a.i > b.j' instead of 'a.i = b.j' to make the performance
difference more noticeable.  I believe this is fine, as it doesn't
undermine the fact that a.i is unique.)

> * I'm also concerned about what happens with aggregates that can have
> large partial-aggregation values, such as string_agg().  With the
> existing usage of partial aggregation for parallel queries, it's
> possible to be confident that there are not many partial-aggregation
> values in existence at the same time.  I don't think that holds for
> pushed-down aggregates: for example, I wouldn't be surprised if the
> planner chooses a join plan that requires stuffing all those values
> into a hash table, or "materializes" the output of the partial
> aggregation step.  Do we have logic that will avoid blowing out
> memory during such queries?

Good point!  Thank you for bringing this up.  I hadn't considered it
before, and it seems no one else has raised this issue.  I'll look
into it.

> * I am just as worried as Robert is about the notion of different
> paths for the same RelOptInfo having different rowcount estimates.
> That is an extremely fundamental violation of basic planner
> assumptions.  We did bend it for parameterized paths by restating
> those assumptions as (from optimizer/README):
>
>   To keep cost estimation rules relatively simple, we make an implementation
>   restriction that all paths for a given relation of the same parameterization
>   (i.e., the same set of outer relations supplying parameters) must have the
>   same rowcount estimate.  This is justified by insisting that each such path
>   apply *all* join clauses that are available with the named outer relations.
>
> I don't see any corresponding statement here, and it's not clear
> to me that the point has been thought through adequately.
>
> Another aspect that bothers me is that a RelOptInfo is understood
> to contain a bunch of paths that all yield the same data (the same
> set of columns), and it seems like that might not be the case here.
> Certainly partially-aggregated paths will output something different
> than unaggregated ones, but mightn't different join orders mutate the
> column set even further?
>
> I think that we might be better off building a separate RelOptInfo for
> each way of pushing down the aggregates, in order to preserve the
> principle that all the paths in any one RelOptInfo have the same
> output.  This'll mean more RelOptInfos, but not more paths, so
> I doubt it adds that much performance overhead.

Hmm, IIUC, this means that we would separate the grouped paths of the
same grouped relation into different RelOptInfos based on the location
of the partial aggregation within the path tree.  Let's define the
"location" as the relids of the relation on top of which we place the
partial aggregation.  For grouped relation {A B C D}, if we perform
some aggregation on C, we would end up with 8 different grouped paths:

{A B D PartialAgg(C)}
{B D PartialAgg(A C)}
{A D PartialAgg(B C)}
{A B PartialAgg(D C)}
{D PartialAgg(A B C)}
{B PartialAgg(A D C)}
{A PartialAgg(B D C)}
{PartialAgg(A B D C)}

That means we would need to create 8 RelOptInfos for this grouped
relation.  If my math doesn't fail me, for a relation containing n
base rels, we would need to create 2^(n-1) different RelOptInfos,
since each of the remaining n-1 base rels can independently sit
either below or above the partial aggregation.

When building grouped relation {A B C D E} by joining {A B C D} with
{E}, we would need to call make_grouped_join_rel() 8 times, each time
joining {E} with one of the 8 RelOptInfos mentioned above.  And at
last, considering other join orders such as joining {A B C E} with
{D}, this new grouped relation would end up with 16 new RelOptInfos.

And then we proceed with building grouped relation {A B C D E F}, and
end up with 32 new RelOptInfos, and this process continues...

It seems to me that this would not only result in more RelOptInfos; it
can also lead to more paths.  Consider two grouped paths, say P1 and
P2, for the same grouped relation, but with different locations of the
partial aggregation.  Suppose P1 is cheaper, at least as well ordered,
generates no more rows, requires no outer rels not required by P2, and
is no less parallel-safe.  If these two paths are kept in the same
RelOptInfo, P2 will be discarded and not considered in further
planning.  However, if P1 and P2 are separated into different
RelOptInfos, and P2 happens to have survived the add_path() tournament
for the RelOptInfo it is in, then it will be considered in subsequent
planning steps.

So in any case, this doesn't seem like a feasible approach to me.

I also have some thoughts on grouped paths and parameterized paths,
but I've run out of time for today.  I'll send a separate email.

I'm really glad you're taking a look at this patch.  Thank you!

Thanks
Richard



Re: Eager aggregation, take 3

From
Robert Haas
Date:
On Thu, Jan 16, 2025 at 3:18 AM Richard Guo <guofenglinux@gmail.com> wrote:
> If this t1/t2 join is part of a larger SELECT query, I think the cost
> estimates for the upper join nodes would likely be quite inaccurate.

That's definitely true. However, the question is not whether the
planner has problems today (it definitely does) but whether it's OK to
make this change without improving our ability to estimate the effects
of aggregation operations. I understand that you (quite rightly) don't
want to get sucked into fixing unrelated planner problems, and I'm
also not sure to what extent these problems are actually fixable.
However, major projects sometimes require such work. For instance,
commit 5edc63bda68a77c4d38f0cbeae1c4271f9ef4100 was motivated by the
discovery that it was too easy to get a Parallel Bitmap Heap Scan plan
even when it wasn't best. The fact that the costing wasn't right
wasn't the fault of parallel query, but parallel query still needed to
do something about it to get good results.

> Yeah, I know 7 is not a large number, but this is the result I got
> from running the TPC-DS benchmark.  For the remaining 92 queries in
> the benchmark, either the logic in this patch determines that eager
> aggregation is not applicable, or the path with eager aggregation is
> not the optimal one.  I'd be more than happy if a benchmark query
> showed significant performance regression, so it would provide an
> opportunity to investigate how the cost estimates are negatively
> impacting the final plan and explore ways to avoid or improve that.
> If anyone can provide such a benchmark query, I'd be very grateful.

Yes, having more people test this and look for regressions would be
quite valuable.

> Well, from the perspective of planning effort, what really matters is
> whether the RelOptInfo for the grouped relation is added to the
> PlannerInfo, as it is only then available for further joining in the
> join search routine, not whether the RelOptInfo is built or not.
> Building the RelOptInfo for a grouped relation is simply a makeNode
> call followed by a flat copy; it doesn't require going through the
> full process of determining its target list, or constructing its
> restrict and join clauses, or calculating size estimates, etc.

That's probably mostly true, but the overhead of memory allocations in
planner routines is not trivial. There are previous cases where we
changed things, or declined to change them, purely on the basis of the
number of palloc cycles involved.

> > It's possible you're right, but it does make me nervous. I do agree
> > that making the number of RelOptInfos explode would be really bad.
>
> Based on my explanation in [1], I think it's acceptable to compare
> grouped paths for the same grouped rel, regardless of where the
> partial aggregation is placed.
>
> I fully understand that I could be wrong about this, but I don't think
> it would break anything in regular planning (i.e., planning without
> eager aggregation).

I think you might be taking too narrow a view of the problem. As Tom
says, the issue is that this breaks a bunch of assumptions that hold
elsewhere. One place that shows up in the patch is in the special-case
logic you've added to set_cheapest(), but I fear that won't be the end
of it. It seems a bit surprising to me that you didn't also need to
adjust add_path(), for example. Even if you don't, there's lots of
places that rely on the assumption that all paths for a RelOptInfo are
returning the same set of rows. If it turns out that a bunch of those
places need to be adjusted to work with this, then the code could
potentially end up quite messy, and that might also have performance
consequences, even when this feature is disabled. Many of the code
paths that deal with paths in the planner are quite hot.

To say that another way, I'm not so much worried about the possibility
that the patch contains a bug. Patches contain bugs all the time and
we can just fix them. It's not wonderful, but that's how software
development goes. What I am worried about is whether the architecture
is right. If we go with one RelOptInfo when the "right answer" is
many, or for that matter if we go with many when the right answer is
one, those are things that cannot be easily and reasonably patched
post-commit, and especially not post-release.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Eager aggregation, take 3

From
Richard Guo
Date:
On Sat, Jan 18, 2025 at 6:16 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jan 16, 2025 at 3:18 AM Richard Guo <guofenglinux@gmail.com> wrote:
> > If this t1/t2 join is part of a larger SELECT query, I think the cost
> > estimates for the upper join nodes would likely be quite inaccurate.
>
> That's definitely true. However, the question is not whether the
> planner has problems today (it definitely does) but whether it's OK to
> make this change without improving our ability to estimate the effects
> of aggregation operations. I understand that you (quite rightly) don't
> want to get sucked into fixing unrelated planner problems, and I'm
> also not sure to what extent these problems are actually fixable.
> However, major projects sometimes require such work. For instance,
> commit 5edc63bda68a77c4d38f0cbeae1c4271f9ef4100 was motivated by the
> discovery that it was too easy to get a Parallel Bitmap Heap Scan plan
> even when it wasn't best. The fact that the costing wasn't right
> wasn't the fault of parallel query, but parallel query still needed to
> do something about it to get good results.

Yeah, it's true that we have problems in aggregate estimates today.
And it has been the case for a long time.  In the past, we made some
improvements in this area, such as in 84f9a35e3, where we adopted a
new formula based on the random selection probability,
inspired by two papers from Yao and Dell'Era.  But we still have
problems with aggregate estimates.  I'm not sure when we could fix
these problems, but I doubt that it will happen in the near future.
(Sorry to be pessimistic.)

If, at last, the conclusion of this discussion is that we should not
apply this change until we fix those problems in aggregate estimates,
I'd be very sad.  This conclusion is absolutely correct, for sure, in
an ideal world, but in the real world, it feels like a death sentence
for this patch, and for any future patches that attempt to apply some
optimizations above aggregate nodes - unless, of course, the day
arrives when we finally fix those aggregate estimate problems, which
doesn't seem likely in the near future.

And if that's the case, can I then argue that the same principle
should apply to joins?  Specifically, should we refrain from applying
any optimizations above join nodes until we've fixed the join estimate
problems, particularly in cases where we fall back on default
selectivity estimates?

Please do not get me wrong.  I'm not saying that we should not fix the
problems in our current aggregate estimates.  I think, as I said
previously, that the realistic approach is to first identify some
real-world queries where this patch causes significant performance
regressions.  This would give us the opportunity to investigate these
regressions and understand how the bad cost estimates contributed to
them.  From there, we could figure out where to start fixing the cost
estimates.  And if we find that the problem is not entirely fixable,
we could then explore the possibility of introducing new heuristics to
avoid the performance regressions as much as possible.  In my opinion,
it's not realistically possible to make cost estimation perfect in all cases.
In a sense, cost estimation is an art of compromise.

I believe this is also the approach that commit 5edc63bda followed.
First, it was found that Bitmap Heap Scans caused performance
regressions in many TPCH queries in cases where work_mem was low.
Then, this issue was thoroughly discussed, and eventually it was
figured out that the impact of lossy pages needed to be accounted for
when estimating the cost of bitmap scans, which became 5edc63bda.

> > Well, from the perspective of planning effort, what really matters is
> > whether the RelOptInfo for the grouped relation is added to the
> > PlannerInfo, as it is only then available for further joining in the
> > join search routine, not whether the RelOptInfo is built or not.
> > Building the RelOptInfo for a grouped relation is simply a makeNode
> > call followed by a flat copy; it doesn't require going through the
> > full process of determining its target list, or constructing its
> > restrict and join clauses, or calculating size estimates, etc.
>
> That's probably mostly true, but the overhead of memory allocations in
> planner routines is not trivial. There are previous cases where we
> changed things, or declined to change them, purely on the basis of
> the number of palloc cycles involved.

Hmm, I think you are right.  I will modify make_grouped_join_rel() to
create the RelOptInfo for a grouped join relation only if we can
generate any grouped paths by joining its input relations.

> > > It's possible you're right, but it does make me nervous. I do agree
> > > that making the number of RelOptInfos explode would be really bad.
> >
> > Based on my explanation in [1], I think it's acceptable to compare
> > grouped paths for the same grouped rel, regardless of where the
> > partial aggregation is placed.
> >
> > I fully understand that I could be wrong about this, but I don't think
> > it would break anything in regular planning (i.e., planning without
> > eager aggregation).
>
> I think you might be taking too narrow a view of the problem. As Tom
> says, the issue is that this breaks a bunch of assumptions that hold
> elsewhere. One place that shows up in the patch is in the special-case
> logic you've added to set_cheapest(), but I fear that won't be the end
> of it. It seems a bit surprising to me that you didn't also need to
> adjust add_path(), for example. Even if you don't, there's lots of
> places that rely on the assumption that all paths for a RelOptInfo are
> returning the same set of rows. If it turns out that a bunch of those
> places need to be adjusted to work with this, then the code could
> potentially end up quite messy, and that might also have performance
> consequences, even when this feature is disabled. Many of the code
> paths that deal with paths in the planner are quite hot.

Yeah, one of the basic assumptions in the planner is that all paths
for a given RelOptInfo return the same set of rows.  One exception
to this is parameterized paths.  As an example, please consider:

create table t (a int, b int);
create table t3 (a int, b int);

insert into t select i, i from generate_series(1,1000)i;
insert into t3 select i, i from generate_series(1,1000)i;

create index on t3(a, b);
analyze t, t3;

explain (costs off)
select * from t t1 join t t2 on true join t3 on t3.a > t1.a and t3.b > t2.b;

With gdb, I found the following 4 paths in the pathlist of RelOptInfo
of {t3}:

   {INDEXPATH
   :path.pathtype 341
   :parent_relids (b 4)
   :required_outer (b 1 2)
   :path.parallel_aware false
   :path.parallel_safe true
   :path.parallel_workers 0
   :path.rows 111
   :path.disabled_nodes 0
   :path.startup_cost 0.275
   :path.total_cost 4.755000000000001

   {INDEXPATH
   :path.pathtype 341
   :parent_relids (b 4)
   :required_outer (b 1)
   :path.parallel_aware false
   :path.parallel_safe true
   :path.parallel_workers 0
   :path.rows 333
   :path.disabled_nodes 0
   :path.startup_cost 0.275
   :path.total_cost 6.1425

   {INDEXPATH
   :path.pathtype 341
   :parent_relids (b 4)
   :required_outer (b 2)
   :path.parallel_aware false
   :path.parallel_safe true
   :path.parallel_workers 0
   :path.rows 333
   :path.disabled_nodes 0
   :path.startup_cost 0.275
   :path.total_cost 11.145

   {PATH
   :pathtype 338
   :parent_relids (b 4)
   :required_outer (b)
   :parallel_aware false
   :parallel_safe true
   :parallel_workers 0
   :rows 1000
   :disabled_nodes 0
   :startup_cost 0
   :total_cost 15

None of them return the same set of rows.  This is fine because we
have revised the assumption to be that all paths for a RelOptInfo of
the same parameterization return the same set of rows.  That is to
say, it's OK for paths of the same RelOptInfo to return different sets
of rows if they have different parameterizations.

Now we have the grouped paths.  I had previously considered further
revising this assumption to be that all paths for a RelOptInfo of the
same parameterization and the same location of partial aggregation
return the same set of rows.  That's why, back in November, I proposed
the idea of introducing a GroupPathInfo into the Path structure to
store the location of the partial aggregation and the estimated
rowcount for each grouped path, similar to how ParamPathInfo functions
for parameterized paths.

However, I gave up on this idea in December after realizing an
important difference from the parameterized path case.  For a
parameterized path, the fewer the required outer rels, the better, as
more outer rels imply more join restrictions.  In other words, the
number of required outer rels is an important factor when comparing
two paths in add_path().  For a grouped path, however, the location of
partial aggregation does not impose such restrictions for further
planning.  As long as one grouped path is cheaper than another based
on the current merits of add_path(), we don't really care where the
partial aggregation is placed within the path tree.

I can take up the idea of GroupPathInfo again.  Before I start
implementing it (which is not trivial), I'd like to hear others'
thoughts on this approach - whether it's necessary and whether this is
the right direction to pursue.

> To say that another way, I'm not so much worried about the possibility
> that the patch contains a bug. Patches contain bugs all the time and
> we can just fix them. It's not wonderful, but that's how software
> development goes. What I am worried about is whether the architecture
> is right. If we go with one RelOptInfo when the "right answer" is
> many, or for that matter if we go with many when the right answer is
> one, those are things that cannot be easily and reasonably patched
> post-commit, and especially not post-release.

Fair point.  We should make sure the architecture of this patch is
solid before committing it.

Regarding whether we should use a single RelOptInfo or separate
RelOptInfos for the same grouped relation: If we choose to separate
the grouped paths of the same grouped relation into different
RelOptInfos based on the location of the partial aggregation within
the path tree, then, based on my calculation from the previous email,
for a relation containing n base rels, we would need to create 2^(n-1)
different RelOptInfos, not to mention that this can also lead to more
paths.  I still struggle to see how this is feasible.  Could you
please elaborate on why you believe this is a viable option?

Thanks
Richard



Re: Eager aggregation, take 3

From
Robert Haas
Date:
On Sun, Jan 19, 2025 at 7:53 AM Richard Guo <guofenglinux@gmail.com> wrote:
> If, at last, the conclusion of this discussion is that we should not
> apply this change until we fix those problems in aggregate estimates,
> I'd be very sad.  This conclusion is absolutely correct, for sure, in
> an ideal world, but in the real world, it feels like a death sentence
> for this patch, and for any future patches that attempt to apply some
> optimizations above aggregate nodes - unless, of course, the day
> arrives when we finally fix those aggregate estimate problems, which
> doesn't seem likely in the near future.

Well, such conclusions should be based on evidence. So far, the
evidence you've presented suggests that the optimization works, so
there's no reason to leap to the conclusion that we shouldn't move
forward. On the other hand, the amount of evidence you've presented
does not seem to me to be all that large. And I'm not sure that you've
gone looking for adversarial cases.

> And if that's the case, can I then argue that the same principle
> should apply to joins?  Specifically, should we refrain from applying
> any optimizations above join nodes until we've fixed the join estimate
> problems, particularly in cases where we fall back on default
> selectivity estimates?

I am having a hard time figuring out how to write back to this. I
mean, I don't think that what you write here is a serious proposal,
and I think you already know that I was not proposing any such thing.
But it upsets me that you think that this hypothetical argument is
equivalent to the ones I've actually been making. Apparently, you
consider my concerns quite groundless and foolish.

> Yeah, one of the basic assumptions in the planner is that all paths
> for a given RelOptInfo return the same set of rows.  One exception
> to this is parameterized paths.

Good point. I had not considered this parallel.

> Now we have the grouped paths.  I had previously considered further
> revising this assumption to be that all paths for a RelOptInfo of the
> same parameterization and the same location of partial aggregation
> return the same set of rows.  That's why, back in November, I proposed
> the idea of introducing a GroupPathInfo into the Path structure to
> store the location of the partial aggregation and the estimated
> rowcount for each grouped path, similar to how ParamPathInfo functions
> for parameterized paths.

Interesting.

> However, I gave up on this idea in December after realizing an
> important difference from the parameterized path case.  For a
> parameterized path, the fewer the required outer rels, the better, as
> more outer rels imply more join restrictions.  In other words, the
> number of required outer rels is an important factor when comparing
> two paths in add_path().  For a grouped path, however, the location of
> partial aggregation does not impose such restrictions for further
> planning.  As long as one grouped path is cheaper than another based
> on the current merits of add_path(), we don't really care where the
> partial aggregation is placed within the path tree.
>
> I can take up the idea of GroupPathInfo again.  Before I start
> implementing it (which is not trivial), I'd like to hear others'
> thoughts on this approach - whether it's necessary and whether this is
> the right direction to pursue.

Yes, I would, too. Tom, do you have any thoughts on this point? Anybody else?

An advantage of this approach could be that it would avoid any
explosion in the number of RelOptInfo structures, since presumably all
the partially aggregated paths could be attached to the same
RelOptInfo as the unaggregated paths, just with a GroupPathInfo to
mark them as partially aggregated. I have to admit that I'm not sure
it was the right idea to mix parameterized and unparameterized paths
in the same path list, and I'm even less sure that it would be a good
idea to mix in partially-aggregated paths. That's because a
parameterized path behaves like a regular path with a join
order/method restriction: as long as we only create valid joins from
parameterized paths, we'll eventually end up with unparameterized
paths without doing anything else. A partially aggregated path behaves
more like a partial path, which requires a Gather or Gather Merge node
to terminate parallelism. Likewise, a partially aggregated path will
require a FinalizeAggregate step to complete the aggregation. Maybe
that's the wrong way of thinking about it, though, since the
FinalizeAggregate node must (I think) go at the top of the join tree,
whereas a Gather can go anywhere.

I felt it best when implementing parallel query to put partial paths
into a separate list, rather than mixing them into the regular path
list. I am vaguely under the impression that Tom thinks that was a
poor decision on my part. And I can sort of see that there is a
problem brewing here. If we handled this case like that one, then we'd
go from 2 lists to 4: normal paths, paths needing a FinalizeAggregate,
paths needing a Gather(Merge), paths needing both. And if we handled
one more future thing in the same way, then the number of combinations
doubles again to 8. Clearly, that way lies madness. On the other hand,
there's another kind of madness in thinking that we can just stick a
whole bunch of paths that are different from each other in an
increasing number of ways into a single path list and suffer no
adverse consequences. The growing complexity of add_path() is one
fairly obvious one.

So I don't quite know which way to jump here. It now seems to me that
we have three similar features with three different designs.
Parameterization added non-comparable paths to the same path list;
parallel query added them to a different path list in the same
RelOptInfo; and this patch currently adds them a separate RelOptInfo.
That's quite a bit of diversity. Really, if we wanted to stick
strictly to the idea of paths associated with the same RelOptInfo
being directly comparable, then parameterization should have spawned a
separate RelOptInfo for each workable parameterization, but that
wasn't done, possibly (though I'm not sure) for the same reasons that
you don't want to do it here.

> Regarding whether we should use a single RelOptInfo or separate
> RelOptInfos for the same grouped relation: If we choose to separate
> the grouped paths of the same grouped relation into different
> RelOptInfos based on the location of the partial aggregation within
> the path tree, then, based on my calculation from the previous email,
> for a relation containing n base rels, we would need to create 2^(n-1)
> different RelOptInfos, not to mention that this can also lead to more
> paths.  I still struggle to see how this is feasible.  Could you
> please elaborate on why you believe this is a viable option?

I agree that creating an exponential number of RelOptInfos is not
going to work out well. I haven't been quite as certain as you seem to
be that it's an unavoidable reality, but maybe it is. For instance, my
intuition is that if PartialAgg(t1) JOIN t2 and PartialAgg(t1 JOIN t2)
produce very different numbers of rows, we could probably just take
the one with the smaller row count regardless of cost, because the
whole selling point of this optimization is that we reduce the number
of rows that are being fed to higher level plan nodes. I don't quite
see how it can make sense to keep a less costly path that produces
more rows on the theory that maybe it's going to work out better later
on. Why is the path cheaper, after all? It feels like the savings must
come from not reducing the row count so much, but that is a cost we're
going to have to repay at a higher plan level. Moreover, we'll be
repaying it with interest, because more rows will have filtered
through every level of plan over which we postponed partial
aggregation.

I admit it's not so clear-cut when the row counts are close. If
PartialAgg(t1 JOIN t2) JOIN t3 has a very similar row count to
PartialAgg(t1 JOIN t3) JOIN t2, can we categorically pick whichever one has the
lower row count and forget about the other? I'm not sure. But I have
an uncomfortable feeling that if we can't, we're going to have an
explosion in the number of paths we have to generate even if we avoid
an explosion in the number of RelOptInfos we generate.

For example, consider:

SELECT ... FROM fact f, dim1, dim2, dim3, dim4
WHERE f.dim1_id = dim1.id AND f.dim2_id = dim2.id
AND f.dim3_id = dim3.id AND f.dim4_id = dim4.id
GROUP BY f.something;

Let's assume that each dimN table has PRIMARY KEY (id). Because of the
primary keys, it's only sensible to consider partial aggregation for
subsets of rels that include f; and it doesn't make sense to consider
partially aggregating after joining all 5 tables because at that point
we should just do a single-step aggregation. So, the partially
grouped-rel for {f,dim1,dim2,dim3,dim4} can contain paths generated in
15 different ways, because we can join f to any proper subset of
{dim1,dim2,dim3,dim4} before partially aggregating and then to the
remainder after partially aggregating. But that feels like we're
re-performing essentially the same join search 16 times which seems
super-expensive. I can't quite say that the work is useless or that I
have a better idea, but I guess there will be a lot of cases where all
16 join searches produce the same results, or most of them do. It
doesn't feel to me like checking through all of those possibilities is
a good expenditure of planner effort.

I took a look at the paper you linked in the original post, but
unfortunately it doesn't seem to say much about how to search the plan
space efficiently. I wonder whether other systems perform a search as
exhaustive as the one you are proposing to perform here, or
whether they apply some heuristics to limit the search space, and if
so, what those heuristics are.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Eager aggregation, take 3

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> So I don't quite know which way to jump here. It now seems to me that
> we have three similar features with three different designs.
> Parameterization added non-comparable paths to the same path list;
> parallel query added them to a different path list in the same
> RelOptInfo; and this patch currently adds them to a separate RelOptInfo.

Yeah, this.  I don't think that either of those first two decisions
was wrong, but it does seem annoying that this patch wants to do it
yet a third way.  Still, it may be the right thing.  Bear with me a
moment:

We dealt with parameterized paths being in the same list as
non-parameterized paths by treating the set of parameter rels as a
figure-of-merit that add_path can compare.  This works because if,
say, a nonparameterized path dominates a parameterized one on every
other figure of merit then there's no point in keeping the
parameterized one.  It is squirrely that the parameterized paths
typically don't yield the same number of rows as others for the same
RelOptInfo, but at least so far that hasn't broken anything.  I think
it's important that the parameterized paths do yield the same column
set as other paths for the rel; and the rows they do yield will be a
subset of the rows that nonparameterized paths yield.

On the other hand, it's not sensible for partial paths to compete
in an add_path tournament with non-partial ones.  If they did, neither
group could be allowed to dominate the other group, so add_path would
just be wasting its time making those path comparisons.  So I do think
it was right to put them in a separate path list.  Importantly, they
generate the same column set and some subset of the same rows that
the non-partial ones do, which I think is what justifies putting
them into the same RelOptInfo.

However, a partial-aggregation path does not generate the same data
as an unaggregated path, no matter how fuzzy you are willing to be
about the concept.  So I'm having a very hard time accepting that
it ought to be part of the same RelOptInfo, and thus I don't really
buy that annotating paths with a GroupPathInfo is the way forward.

What this line of analysis doesn't tell us though is whether paths
that did their partial aggregations at different join levels can be
considered as enough alike that they should compete on cost terms.
If they are, we need to put them into the same RelOptInfo.  So
while I want to have separate RelOptInfos for partially aggregated
paths, I'm unclear on how many of those we need or what their
identifying property is.

Also: we avoid generating parameterized partial paths, because
combining those things would be too much of a mess.  There's some
handwaving in the comments for add_partial_path to the effect that
it wouldn't be a win anyway, but I think the real reason is that
it'd be far too complicated for the potential value.  Can we make
a similar argument for partial aggregation?  I sure hope so.

> I agree that creating an exponential number of RelOptInfos is not
> going to work out well.

FWIW, I'm way more concerned about the number of Paths considered
than I am about the number of RelOptInfos.  This relates to your
question about whether we want to use some heuristics to limit
the planner's search space.

            regards, tom lane



Re: Eager aggregation, take 3

From
Robert Haas
Date:
On Mon, Jan 20, 2025 at 12:57 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> However, a partial-aggregation path does not generate the same data
> as an unaggregated path, no matter how fuzzy you are willing to be
> about the concept.  So I'm having a very hard time accepting that
> it ought to be part of the same RelOptInfo, and thus I don't really
> buy that annotating paths with a GroupPathInfo is the way forward.

Seems like a fair argument. I'm not sure it's dispositive if practical
considerations merited the opposite treatment, but that doesn't seem
to be the case.

> What this line of analysis doesn't tell us though is whether paths
> that did their partial aggregations at different join levels can be
> considered as enough alike that they should compete on cost terms.
> If they are, we need to put them into the same RelOptInfo.  So
> while I want to have separate RelOptInfos for partially aggregated
> paths, I'm unclear on how many of those we need or what their
> identifying property is.
>
> Also: we avoid generating parameterized partial paths, because
> combining those things would be too much of a mess.  There's some
> handwaving in the comments for add_partial_path to the effect that
> it wouldn't be a win anyway, but I think the real reason is that
> it'd be far too complicated for the potential value.  Can we make
> a similar argument for partial aggregation?  I sure hope so.

I think your hopes will be dashed in this instance. Suppose we have:

SELECT m.mapped_value, SUM(g.summable_quantity)
FROM mapping_table m JOIN gigantic_dataset g
ON m.raw_value = g.raw_value GROUP BY 1;

If the mapping_table is small, we can do something like this:

FinalizeAggregate
-> Gather
  -> PartialAggregate
    -> Hash Join

But if mapping_table is big but relatively few of the keys appear as
raw values in gigantic_dataset, being able to do the PartialAggregate
before the join would be rather nice; and you wouldn't want to give up
on parallel query in such a case.
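
That is, presumably a plan shape along these lines (sketch only):

FinalizeAggregate
-> Hash Join
  -> Gather
    -> PartialAggregate
      -> Parallel Seq Scan on gigantic_dataset
  -> Hash
    -> Seq Scan on mapping_table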

P.S. I'm not so sure you're right about the reason why this is
supported. We can create a partial path for a joinrel by picking a
partial path on one side and a non-partial path on the other side, so
we can get NestLoops below Gather just fine using the parameterized
paths that we're generating anyway to support non-parallel cases. But
what would the plan look like if we were using a partial,
parameterized path? That path would have to be on the inner side of a
nested loop, and as far as I can see it would need to have a Gather
node on top of it and below the Nested Loop, so you're talking about
something like this:

Nested Loop
-> Seq Scan on something
-> Gather
  -> Nested Loop
    -> Index Scan on otherthing
       Index Cond: otherthing.x = something.x
    -> Whatever Scan on whatever

But putting Gather on the inner side of a nested loop like that would
mean repeatedly starting up workers and shutting them down again which
seems no fun at all. If there's some way of using a partial,
parameterized path that doesn't involve sticking a Gather on the inner
side of a Nested Loop, then the technique might have some promise and
the existing comment (which I probably wrote) is likely bunk.

> > I agree that creating an exponential number of RelOptInfos is not
> > going to work out well.
>
> FWIW, I'm way more concerned about the number of Paths considered
> than I am about the number of RelOptInfos.  This relates to your
> question about whether we want to use some heuristics to limit
> the planner's search space.

I had that instinct, too, but I'm not 100% sure whether it was a
correct instinct. If we create too many Paths, it's possible that most
of them will be thrown away before we really do anything with them, in
which case they cost CPU cycles but there's no memory accumulation.
Mixing paths that perform the partial aggregation at different levels
into the same RelOptInfo also increases the chances that you're going
to throw away a lot of stuff early. On the other hand, if we create
too many RelOptInfos, that memory can't be freed until the end of the
planning cycle. If you wouldn't have minded waiting a long time for
the planner, but you do mind running out of memory, the second one is
worse. But of course, the best option is to consider neither too many
Paths nor too many RelOptInfos.

I have heard rumors that in some other systems, they decide on the
best serial plan first and then insert parallel operators afterward.
Something like that could potentially be done here, too: only explore
eager aggregation for join orders that are sub-parts of the best
non-eagerly-aggregated join order. But I am sort of hesitant to
propose it as a development direction because we've never done
anything like that before and I don't think it would be at all easy to
implement. Nonetheless, I can't help feeling like we're kidding
ourselves a little bit, not just with this patch but in general. We
talk about "pushing down" aggregates or sorts or operations that can
be done on foreign nodes, but that implies that we start with them at
the top and then try to move them downward. In fact, we consider them
everywhere and expect the pushed-down versions to win out on cost.
While that feels sensible to some degree, it means every major new
query planning technique tends to multiply the amount of planner work
we're doing rather than adding to it. I'm fairly sure that the best
parallel plan need not be a parallelized version of the best
non-parallel plan but it often is, and the more things parallelism
supports, the more likely it is that it will be (I think). With eager
aggregation, it feels like we're multiplying the number of times that
we replan the same join tree by a number that is potentially MUCH
larger than 2, yet it seems to me that much of that extra work is
unlikely to find anything. Even if we find a way to make it work here
without too much pain, I wonder what happens when the next interesting
optimization comes along. Multiplication by a constant greater than or
equal to two isn't an operation one can do too many times, generally
speaking, without ending up with a big number.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Eager aggregation, take 3

From
Richard Guo
Date:
On Tue, Jan 21, 2025 at 1:28 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Jan 19, 2025 at 7:53 AM Richard Guo <guofenglinux@gmail.com> wrote:
> > If, at last, the conclusion of this discussion is that we should not
> > apply this change until we fix those problems in aggregate estimates,
> > I'd be very sad.  This conclusion is absolutely correct, for sure, in
> > an ideal world, but in the real world, it feels like a death sentence
> > for this patch, and for any future patches that attempt to apply some
> > optimizations above aggregate nodes - unless, of course, the day
> > arrives when we finally fix those aggregate estimate problems, which
> > doesn't seem likely in the near future.
>
> Well, such conclusions should be based on evidence. So far, the
> evidence you've presented suggests that the optimization works, so
> there's no reason to leap to the conclusion that we shouldn't move
> forward. On the other hand, the amount of evidence you've presented
> does not seem to me to be all that large. And I'm not sure that you've
> gone looking for adversarial cases.
>
> > And if that's the case, can I then argue that the same principle
> > should apply to joins?  Specifically, should we refrain from applying
> > any optimizations above join nodes until we've fixed the join estimate
> > problems, particularly in cases where we fall back on default
> > selectivity estimates?
>
> I am having a hard time figuring out how to write back to this. I
> mean, I don't think that what you write here is a serious proposal,
> and I think you already know that I was not proposing any such thing.
> But it upsets me that you think that this hypothetical argument is
> equivalent to the ones I've actually been making. Apparently, you
> consider my concerns quite groundless and foolish.

I'm really sorry if my previous response upset you or gave the wrong
impression.  That was never my intention, and I certainly do not
consider your concerns to be groundless or foolish.  I can see how my
message may have come across differently than I intended.  To clarify,
I wasn't suggesting that your concerns about the estimates weren't
valid.  Rather, I was trying to express that it might be more
effective to fix the cost estimates based on specific regressions.

> > Regarding whether we should use a single RelOptInfo or separate
> > RelOptInfos for the same grouped relation: If we choose to separate
> > the grouped paths of the same grouped relation into different
> > RelOptInfos based on the location of the partial aggregation within
> > the path tree, then, based on my calculation from the previous email,
> > for a relation containing n base rels, we would need to create 2^(n-1)
> > different RelOptInfos, not to mention that this can also lead to more
> > paths.  I still struggle to see how this is feasible.  Could you
> > please elaborate on why you believe this is a viable option?
>
> I agree that creating an exponential number of RelOptInfos is not
> going to work out well. I haven't been quite as certain as you seem to
> be that it's an unavoidable reality, but maybe it is. For instance, my
> intuition is that if PartialAgg(t1) JOIN t2 and PartialAgg(t1 JOIN t2)
> produce very different numbers of rows, we could probably just take
> the one with the smaller row count regardless of cost, because the
> whole selling point of this optimization is that we reduce the number
> of rows that are being fed to higher level plan nodes. I don't quite
> see how it can make sense to keep a less costly path that produces
> more rows on the theory that maybe it's going to work out better later
> on. Why is the path cheaper, after all? It feels like the savings must
> come from not reducing the row count so much, but that is a cost we're
> going to have to repay at a higher plan level. Moreover, we'll be
> repaying it with interest, because more rows will have filtered
> through every level of plan over which we postponed partial
> aggregation.

I've been thinking about this proposal, and it's quite appealing.  It
would significantly reduce both the planning effort and implementation
complexity, while still yielding reasonable planning results.

One concern I have with this proposal is that, as we climb higher and
higher in the join tree, the assumption that a path with a smaller row
count and higher cost is better than one with a larger row count and
lower cost may gradually cease to hold.  It's true that a path with a
smaller row count is generally better for upper join nodes, as it
feeds them fewer rows.  However, as fewer and fewer upper join nodes
remain, the efficiency gained from the smaller row count may no longer
justify the high cost of the path itself.

Here's an example I found that can help illustrate what I mean.

create table t (a int, b int, c int);
insert into t select i%3, i%3, i from generate_series(1,500)i;
analyze t;
set enable_eager_aggregate to on;

And here are two plans for the same query:

-- Plan 1
explain (costs on)
select sum(t4.c) from t t1 join
  (t t2 join t t3 on t2.b != t3.b join t t4 on t3.b = t4.b)
  on t1.b = t2.b
group by t1.a;
                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Finalize HashAggregate  (cost=4135.19..4135.22 rows=3 width=12)
   Group Key: t1.a
   ->  Hash Join  (cost=1392.13..3301.85 rows=166668 width=12)
         Hash Cond: (t2.b = t1.b)
         ->  Nested Loop  (cost=1377.88..1409.66 rows=1000 width=12)
               Join Filter: (t2.b <> t3.b)
               ->  Partial HashAggregate  (cost=1377.88..1377.91 rows=3 width=12)
                     Group Key: t3.b
                     ->  Hash Join  (cost=14.25..961.22 rows=83334 width=8)
                           Hash Cond: (t3.b = t4.b)
                           ->  Seq Scan on t t3  (cost=0.00..8.00 rows=500 width=4)
                           ->  Hash  (cost=8.00..8.00 rows=500 width=8)
                                 ->  Seq Scan on t t4  (cost=0.00..8.00 rows=500 width=8)
               ->  Materialize  (cost=0.00..10.50 rows=500 width=4)
                     ->  Seq Scan on t t2  (cost=0.00..8.00 rows=500 width=4)
         ->  Hash  (cost=8.00..8.00 rows=500 width=8)
               ->  Seq Scan on t t1  (cost=0.00..8.00 rows=500 width=8)
(17 rows)

-- Plan 2
explain (costs on)
select sum(t4.c) from t t1 join
  (t t2 join t t3 on t2.b != t3.b join t t4 on t3.b = t4.b)
  on t1.b = t2.b
group by t1.a;
                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 Finalize HashAggregate  (cost=455675.44..455675.47 rows=3 width=12)
   Group Key: t1.a
   ->  Hash Join  (cost=455658.07..455672.94 rows=500 width=12)
         Hash Cond: (t1.b = t2.b)
         ->  Seq Scan on t t1  (cost=0.00..8.00 rows=500 width=8)
         ->  Hash  (cost=455658.03..455658.03 rows=3 width=12)
               ->  Partial HashAggregate  (cost=455658.00..455658.03 rows=3 width=12)
                     Group Key: t2.b
                     ->  Hash Join  (cost=14.25..316768.56 rows=27777887 width=8)
                           Hash Cond: (t3.b = t4.b)
                           ->  Nested Loop  (cost=0.00..3767.25 rows=166666 width=8)
                                 Join Filter: (t2.b <> t3.b)
                                 ->  Seq Scan on t t2  (cost=0.00..8.00 rows=500 width=4)
                                 ->  Materialize  (cost=0.00..10.50 rows=500 width=4)
                                       ->  Seq Scan on t t3  (cost=0.00..8.00 rows=500 width=4)
                           ->  Hash  (cost=8.00..8.00 rows=500 width=8)
                                 ->  Seq Scan on t t4  (cost=0.00..8.00 rows=500 width=8)
(17 rows)

For the grouped relation {t2 t3 t4}, Plan 1 chose the path
"PartialAgg(t3/t4) JOIN t2", while Plan 2 chose the path
"PartialAgg(t2/t3/t4)".

The first path has a larger row count (1000) and a lower cost (1409.66).
The second path has a smaller row count (3) and a higher cost (455658.03).

Executing these two plans shows that Plan 2 is slower than Plan 1.

-- Plan 1
 Execution Time: 286.860 ms

-- Plan 2
 Execution Time: 27109.744 ms

I think we may need to take the position in the join tree into account
when applying this heuristic.  At lower levels, we should prefer paths
with smaller row counts, while at higher levels, we should prefer
paths with lower costs.  However, it's unclear to me how we should
define "lower" and "higher" - how low is 'low' and how high is 'high'.

> I admit it's not so clear-cut when the row counts are close. If
> PartialAgg(t1 JOIN t2) JOIN t3 has a very similar row count to PartialAgg(t1
> JOIN t3) JOIN t2, can we categorically pick whichever one has the
> lower row count and forget about the other? I'm not sure. But I have
> an uncomfortable feeling that if we can't, we're going to have an
> explosion in the number of paths we have to generate even if we avoid
> an explosion in the number of RelOptInfos we generate.
>
> For example, consider:
>
> SELECT ... FROM fact f, dim1, dim2, dim3, dim4
> WHERE f.dim1_id = dim1.id AND f.dim2_id = dim2.id
> AND f.dim3_id = dim3.id AND f.dim4_id = dim4.id
> GROUP BY f.something;
>
> Let's assume that each dimN table has PRIMARY KEY (id). Because of the
> primary keys, it's only sensible to consider partial aggregation for
> subsets of rels that include f; and it doesn't make sense to consider
> partially aggregating after joining all 5 tables because at that point
> we should just do a single-step aggregation. So, the partially
> grouped-rel for {f,dim1,dim2,dim3,dim4} can contain paths generated in
> 15 different ways, because we can join f to any proper subset of
> {dim1,dim2,dim3,dim4} before partially aggregating and then to the
> remainder after partially aggregating. But that feels like we're
> re-performing essentially the same join search 16 times which seems
> super-expensive. I can't quite say that the work is useless or that I
> have a better idea, but I guess there will be a lot of cases where all
> 16 join searches produce the same results, or most of them do. It
> doesn't feel to me like checking through all of those possibilities is
> a good expenditure of planner effort.

Yeah, you're right that the join search process for grouped paths
basically mirrors what we do for non-grouped paths, which indeed
involves a lot of planner effort.  I've been exploring potential
heuristics to limit the search space for grouped paths, but so far, I
haven't found any effective solutions.  Currently, the heuristic used
in the patch is to only consider grouped paths that dramatically
reduce the number of rows.  All others are just discarded.  The
rationale is that if a grouped path does not reduce the number of rows
enough, it is highly unlikely to result in a competitive final plan
during the upper planning stages, so it doesn't make much sense to
consider it.  The current threshold is set to 50%, meaning that if the
number of rows returned by PartialAgg(t1 JOIN t2) is not less than 50%
of the rows returned by (t1 JOIN t2), no Aggregate paths will be
generated on top of the t1/t2 join.  If we notice significant
regressions in planning time, we might consider further increasing
this threshold, say, to 80%, so that only grouped paths that reduce
the rows by more than 80% will be considered.  This heuristic also
ensures that, once a plan with eager aggregation is chosen, it is
highly likely to result in performance improvements, due to the
significant data reduction before joins.
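
To make that concrete, here is a minimal C sketch of such a cutoff
(the names and the constant are illustrative only, not the patch's
actual code); raising the threshold to 80% would simply mean lowering
the fraction to 0.2:

/*
 * Hypothetical sketch of the row-reduction cutoff described above.
 * Returns true if the grouped (partially aggregated) relation is
 * estimated to shrink the row count enough to be worth considering.
 */
#define EAGER_AGG_ROWS_FRACTION 0.5     /* the 50% threshold */

static bool
grouped_rel_is_useful(double input_rows, double grouped_rows)
{
    /* Be defensive about bogus row estimates. */
    if (input_rows <= 0.0 || grouped_rows <= 0.0)
        return false;

    /* Keep grouped paths only if they cut the row count enough. */
    return grouped_rows < input_rows * EAGER_AGG_ROWS_FRACTION;
}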

> I took a look at the paper you linked in the original post, but
> unfortunately it doesn't seem to say much about how to search the plan
> space efficiently. I wonder if other systems perform a search that is as
> exhaustive as the one that you are proposing to perform here or
> whether they apply some heuristics to limit the search space, and if
> so, what those heuristics are.

Unfortunately, I don't have much knowledge about other systems.  It
would be really helpful if anyone could share some insights on how
other systems handle this.

Thanks
Richard



Re: Eager aggregation, take 3

From
Richard Guo
Date:
On Tue, Jan 21, 2025 at 2:57 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> However, a partial-aggregation path does not generate the same data
> as an unaggregated path, no matter how fuzzy you are willing to be
> about the concept.  So I'm having a very hard time accepting that
> it ought to be part of the same RelOptInfo, and thus I don't really
> buy that annotating paths with a GroupPathInfo is the way forward.

Agreed.  I think one point I failed to make myself clear on is that
I've never intended to put a partial-aggregation path and an
unaggregated path into the same RelOptInfo.  One of the basic designs
of this patch is that partial-aggregation paths are placed in a
separate category of RelOptInfos, which I call "grouped relations"
(though I admit that's not the best name).  This ensures that we never
compare a partial-aggregation path with an unaggregated path during
scan/join planning, because I am certain that the two categories of
paths are not comparable.

Regarding the GroupPathInfo proposal, my intention is to add a valid
GroupPathInfo only for the partial-aggregation paths.  The goal is to
ensure that partial-aggregation paths within this category are
compared only if their partial aggregations are at the same location.

To be honest, I still doubt that this is necessary.  I have two main
reasons for this.

1.
For a partial-aggregation path, the location where we place the
partial aggregation does not impose any restrictions on further
planning.  This is different from the parameterized path case.  If two
parameterized paths are equal on every other figure of merit, we will
choose the one with fewer required outer rels, as it means fewer join
restrictions on upper planning.  However, for partial-aggregation
paths, we do not have a preference regarding the location of the
partial aggregation.  For instance, for path "A JOIN PartialAgg(B)
JOIN C" and path "PartialAgg(A JOIN B) JOIN C", if one path dominates
the other on every figure of merit, it seems to me that there's no
point in keeping the less favorable one, although they have their
partial aggregations at different join levels.

2.
A partial-aggregation path of a rel essentially yields an aggregated
form of that rel's row set.  The difference between the row sets
yielded by paths with different locations of partial aggregation is
primarily about the different degrees to which the rows are
aggregated.  These sets are fundamentally homogeneous.

In summary, in my own opinion, I think the partial-aggregation paths
of the same "grouped relation" are comparable, regardless of the
position of the partial aggregation within the path tree.  So I think
we should put them into the same RelOptInfo.

Of course, I could be very wrong about this.  I would greatly
appreciate hearing others' thoughts on this.

Thanks
Richard



Re: Eager aggregation, take 3

From
Robert Haas
Date:
On Tue, Jan 21, 2025 at 3:33 AM Richard Guo <guofenglinux@gmail.com> wrote:
> I've been thinking about this proposal, and it's quite appealing.  It
> would significantly reduce both the planning effort and implementation
> complexity, while still yielding reasonable planning results.
>
> One concern I have with this proposal is that, as we climb up higher
> and higher in the join tree, the assumption that a path with smaller
> row count and higher cost is better than one with larger row count and
> lower cost may gradually no longer hold.  It's true that a path with a
> smaller row count is generally better for upper join nodes, as it
> feeds fewer rows to upper join nodes.  However, as there are fewer and
> fewer upper join nodes left, the efficiency gained from the smaller
> row count could likely no longer justify the high cost of that path
> itself.
>
> Here's an example I found that can help illustrate what I mean.

Thanks for the example. What seems to be happening here is that each
of the three joins increases the number of rows by a multiple of
either 166 or 333. Aggregating reduces the number of rows to 3. I am
not sure that we should be too concerned about this kind of case,
because I don't think it will be common to have multiple joins that
dramatically increase the row count. If you did have that, you must
want to aggregate multiple times. We don't have the code for an
IntermediateAggregate or CombineAggregate node right now, I believe,
but in this query it would likely make sense to apply such a step
after every join; then you'd never have more than three rows.

Honestly, I'm not sure how much we should worry about a case like
this. I think that if a user is writing queries that use joins to
vastly inflate the row count and then aggregate the result, perhaps
they need to think about rewriting the queries. In this instance, it
feels a bit like the user is emulating multiplication using an
iterated SUM(), which is probably never going to work out all that
well.

But I bet it's possible to construct an example using only
row-reducing joins. Let's say we start with 10k rows that aggregate to
10 rows; after performing a join, we end up with 9k rows that
aggregate to 9 rows. So if we partially aggregate first, we have to
aggregate 1000 extra rows, but if we join first, we have to join 1000
extra rows. I don't think we can say a priori which will be cheaper,
but my idea would make the path that partially aggregates after the
join win unconditionally.

> Yeah, you're right that the join search process for grouped paths
> basically mirrors what we do for non-grouped paths, which indeed
> involves a lot of planner effort.  I've been exploring potential
> heuristics to limit the search space for grouped paths, but so far, I
> haven't found any effective solutions.  Currently, the heuristic used
> in the patch is to only consider grouped paths that dramatically
> reduce the number of rows.  All others are just discarded.  The
> rationale is that if a grouped path does not reduce the number of rows
> enough, it is highly unlikely to result in a competitive final plan
> during the upper planning stages, so it doesn't make much sense to
> consider it.  The current threshold is set to 50%, meaning that if the
> number of rows returned by PartialAgg(t1 JOIN t2) is not less than 50%
> of the rows returned by (t1 JOIN t2), no Aggregate paths will be
> generated on top of the t1/t2 join.  If we notice significant
> regressions in planning time, we might consider further increasing
> this threshold, say, to 80%, so that only grouped paths that reduce
> the rows by more than 80% will be considered.  This heuristic also
> ensures that, once a plan with eager aggregation is chosen, it is
> highly likely to result in performance improvements, due to the
> significant data reduction before joins.

To be honest, I was quite surprised this was a percentage like 50% or
80% and not a multiple like 2 or 5. And I had thought the multiplier
might even be larger, like 10 or more. The thing is, 50% means we only
have to form 2-item groups in order to justify aggregating twice.
Maybe SUM() is cheap enough to justify that treatment, but a more
expensive aggregate might not be, especially things like string_agg()
or array_agg() where aggregation creates bigger objects.

Another thing to consider is that when the number of groups is small
enough that we don't need to do a Sort+GroupAggregate, it doesn't seem
so bad to perform marginally-useful partial aggregation, but sometimes
that won't be the case. For example, imagine that the user wants to
join orders to order_lines and then compute SUM(order_lines.quantity)
for each orders.customer_id. If the size of the order_lines table is
large relative to work_mem, we're going to need to sort it in order
to partially aggregate, which is expensive. If it turns out that the
orders table is also quite big, then maybe we'll end up performing a
merge join and the same sort order can be used for both operations,
but if not, we could've just done a hash join with orders as the build
table. In that kind of case, partial aggregation has to save quite a
lot to justify itself.

Now, maybe we shouldn't worry about that when applying this heuristic
cutoff; after all, it's the job of the cost model to understand that
sorting is expensive, and this cutoff should just be there to make
sure we don't even try the cost model in cases where it's clearly
unpromising. But I do suspect that in queries where the average group
size is 2, this will often be a marginal technique. In addition to the
problems already mentioned, it could be that the average group size is
2 but a lot of groups are actually of size 1 and then there are some
larger groups. In such cases I'm even less sure that the partial
aggregation technique will be a winner. Building many 1-element groups
sounds inefficient.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Eager aggregation, take 3

From
Richard Guo
Date:
On Wed, Jan 22, 2025 at 1:36 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Thanks for the example. What seems to be happening here is that each
> of the three joins increases the number of rows by a multiple of
> either 166 or 333. Aggregating reduces the number of rows to 3. I am
> not sure that we should be too concerned about this kind of case,
> because I don't think it will be common to have multiple joins that
> dramatically increase the row count. If you did have that, you must
> want to aggregate multiple times. We don't have the code for an
> IntermediateAggregate or CombineAggregate node right now, I believe,
> but in this query it would likely make sense to apply such a step
> after every join; then you'd never have more than three rows.

Haha, I did once think about the concept of multi-stage aggregations
while working on this patch.  While testing this patch and trying to
figure out where placing the partial aggregation would bring the most
benefit, I noticed that a potentially effective approach could be
this: every time the row count increases to a certain point as we join
more and more tables, we perform one aggregation to deflate it, and
then wait for it to grow again before deflating it once more.

This approach would require injecting multiple intermediate
aggregation nodes into the path tree, for which we currently lack the
necessary architecture.  As a result, I didn't pursue this idea
further.  However, I'm really glad you mentioned this approach, though
it's still unclear whether it's a feasible or reasonable idea.

> Honestly, I'm not sure how much we should worry about a case like
> this. I think that if a user is writing queries that use joins to
> vastly inflate the row count and then aggregate the result, perhaps
> they need to think about rewriting the queries. In this instance, it
> feels a bit like the user is emulating multiplication using an
> iterated SUM(), which is probably never going to work out all that
> well.

I don't have much experience with end-user scenarios, so I'm not sure
if it's common to have queries where the row count increases with more
and more tables joined.

> But I bet it's possible to construct an example using only
> row-reducing joins. Let's say we start with 10k rows that aggregate to
> 10 rows; after performing a join, we end up with 9k rows that
> aggregate to 9 rows. So if we partially aggregate first, we have to
> aggregate 1000 extra rows, but if we join first, we have to join 1000
> extra rows. I don't think we can say a priori which will be cheaper,
> but my idea would make the path that partially aggregates after the
> join win unconditionally.

Yeah, this is the concern I raised upthread: the efficiency gained
from a path having a smaller row count may not always justify the high
cost of the path itself, especially as we move higher in the join
tree.

> To be honest, I was quite surprised this was a percentage like 50% or
> 80% and not a multiple like 2 or 5. And I had thought the multiplier
> might even be larger, like 10 or more. The thing is, 50% means we only
> have to form 2-item groups in order to justify aggregating twice.
> Maybe SUM() is cheap enough to justify that treatment, but a more
> expensive aggregate might not be, especially things like string_agg()
> or array_agg() where aggregation creates bigger objects.

Hmm, if I understand correctly, the "percentage" and the "multiple"
work in the same way.  Percentage 50% and multiple 2 both mean that
the average group size is 2, and percentage 90% and multiple 10 both
mean that the average group size is 10.  In general, this relationship
should hold: percentage = 1 - 1/multiple.  However, I might not have
grasped your point correctly.

> Another thing to consider is that when the number of groups is small
> enough that we don't need to do a Sort+GroupAggregate, it doesn't seem
> so bad to perform marginally-useful partial aggregation, but sometimes
> that won't be the case. For example, imagine that the user wants to
> join orders to order_lines and then compute SUM(order_lines.quantity)
> for each orders.customer_id. If the size of the order_lines table is
> large relative to work_mem, we're going to need to sort it in order
> to partially aggregate, which is expensive. If it turns out that the
> orders table is also quite big, then maybe we'll end up performing a
> merge join and the same sort order can be used for both operations,
> but if not, we could've just done a hash join with orders as the build
> table. In that kind of case, partial aggregation has to save quite a
> lot to justify itself.
>
> Now, maybe we shouldn't worry about that when applying this heuristic
> cutoff; after all, it's the job of the cost model to understand that
> sorting is expensive, and this cutoff should just be there to make
> sure we don't even try the cost model in cases where it's clearly
> unpromising. But I do suspect that in queries where the average group
> size is 2, this will often be a marginal technique. In addition to the
> problems already mentioned, it could be that the average group size is
> 2 but a lot of groups are actually of size 1 and then there are some
> larger groups. In such cases I'm even less sure that the partial
> aggregation technique will be a winner. Building many 1-element groups
> sounds inefficient.

Yeah, as you summarized, this heuristic is primarily used to discard
unpromising paths, ensuring they aren't considered further.  For the
paths that pass this heuristic, the cost model will then determine the
appropriate aggregation and join methods.  If we take this into
consideration when applying the heuristic, it seems to me that we
would essentially be duplicating the work that the cost model
performs, which doesn't seem necessary.

I think you are right that in cases where a lot of groups are actually
of size 1 and then there are some larger groups, the partial
aggregation may not be a win.  Perhaps we can do better here if we
have techniques to estimate the distribution of data across
different groups or to predict how skewed the data might be.  It seems
that we don't have such techniques at the moment.  This also reminds
me of a similar challenge when calculating the startup cost of
incremental sort.  I looked into cost_incremental_sort() and found
that we're currently using the average group size to estimate the
startup cost (please correct me if I'm wrong).

    group_tuples = input_tuples / input_groups;

I think this may also suffer from data skew across different groups.
With the mentioned techniques, I believe we could improve the cost
estimation for incremental sort as well.

If I understand correctly, your main concern is the threshold being
set to 2, rather than the heuristic itself, right?  Do you think
increasing this threshold to 10 or a larger value would help mitigate
the issue?

Thanks
Richard



Re: Eager aggregation, take 3

From
Robert Haas
Date:
On Wed, Jan 22, 2025 at 1:48 AM Richard Guo <guofenglinux@gmail.com> wrote:
> This approach would require injecting multiple intermediate
> aggregation nodes into the path tree, for which we currently lack the
> necessary architecture.  As a result, I didn't pursue this idea
> further.  However, I'm really glad you mentioned this approach, though
> it's still unclear whether it's a feasible or reasonable idea.

I think the biggest question in my mind is really whether we can
accurately judge when such a strategy is likely to be a win. In this
instance it looks like we could have figured it out, but as we've
discussed, I fear a lot of estimates will be inaccurate. If we knew
they were going to be good, then I see no reason not to apply the
technique when it's sensible.

> I don't have much experience with end-user scenarios, so I'm not sure
> if it's common to have queries where the row count increases with more
> and more tables joined.

I don't think it's very common to see it increase as dramatically as
in your test case.

> > To be honest, I was quite surprised this was a percentage like 50% or
> > 80% and not a multiple like 2 or 5. And I had thought the multiplier
> > might even be larger, like 10 or more. The thing is, 50% means we only
> > have to form 2-item groups in order to justify aggregating twice.
> > Maybe SUM() is cheap enough to justify that treatment, but a more
> > expensive aggregate might not be, especially things like string_agg()
> > or array_agg() where aggregation creates bigger objects.
>
> Hmm, if I understand correctly, the "percentage" and the "multiple"
> work in the same way.  Percentage 50% and multiple 2 both mean that
> the average group size is 2, and percentage 90% and multiple 10 both
> mean that the average group size is 10.  In general, this relationship
> should hold: percentage = 1 - 1/multiple.  However, I might not have
> grasped your point correctly.

Yes, they're equivalent. However, a percentage to me suggests that we
think that the meaningful values might be something like 20%, 50%,
80%; whereas with a multiplier someone might be more inclined to think
of values like 10, 100, 1000. You can definitely write those values as
90%, 99%, 99.9%; however, it seems less natural to me to express it
that way when we think the value will be quite close to 1. The fact
that you chose a percentage suggested to me that you were aiming for a
less-strict threshold than I had supposed we would want.

> Yeah, as you summarized, this heuristic is primarily used to discard
> unpromising paths, ensuring they aren't considered further.  For the
> paths that pass this heuristic, the cost model will then determine the
> appropriate aggregation and join methods.  If we take this into
> consideration when applying the heuristic, it seems to me that we
> would essentially be duplicating the work that the cost model
> performs, which doesn't seem necessary.

Well, I think we do ideally want heuristics that can reject
unpromising paths earlier. The planning cost of this is really quite
high. But I'm not sure how far we can get with this particular
heuristic. True, we could raise it to a larger value, and that might
help to rule out unpromising paths earlier. But I fear you'll quickly
find examples where it also rules out promising paths early. A good
heuristic is easy to compute and highly accurate. This heuristic is
easy to compute, but the accuracy is questionable.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Eager aggregation, take 3

From
Richard Guo
Date:
I've switched back to this thread and will begin by working through
the key concerns that were previously raised.

The first concern is the lack of a proof demonstrating the correctness
of this transformation.  To address this, I plan to include a detailed
proof in the README, along the lines of the following.

====== proof start ======
To prove that the transformation is correct, we partition the tables
in the FROM clause into two groups: those that contain at least one
aggregation column, and those that do not contain any aggregation
columns.  Each group can be treated as a single relation formed by the
Cartesian product of the tables within that group.  Therefore, without
loss of generality, we can assume that the FROM clause contains
exactly two relations, R1 and R2, where R1 represents the relation
containing all aggregation columns, and R2 represents the relation
without any aggregation columns.

Let the query be of the form:

SELECT G, AGG(A)
FROM R1 JOIN R2 ON J
GROUP BY G;

where G is the set of grouping keys that may include columns from R1
and/or R2; AGG(A) is an aggregate function over columns A from R1; J
is the join condition between R1 and R2.

The transformation of eager aggregation is:

    GROUP BY G, AGG(A) on (R1 JOIN R2 ON J)
    =
    GROUP BY G, AGG(agg_A) on ((GROUP BY G1, AGG(A) AS agg_A on R1) JOIN R2 ON J)

This equivalence holds under the following conditions:

1) AGG is decomposable, meaning that it can be computed in two stages:
a partial aggregation followed by a final aggregation;
2) The set G1 used in the pre-aggregation of R1 includes:
    * all columns from R1 that are part of the grouping keys G, and
    * all columns from R1 that appear in the join condition J.
3) The grouping operator for any column in G1 must be compatible with
the operator used for that column in the join condition J.

Since G1 includes all columns from R1 that appear in either the
grouping keys G or the join condition J, all rows within each partial
group have identical values for both the grouping keys and the
join-relevant columns from R1, assuming compatible operators are used.
As a result, the rows within a partial group are indistinguishable in
terms of their contribution to the aggregation and their behavior in
the join.  This ensures that all rows in the same partial group share
the same "destiny": they either all match or all fail to match a given
row in R2.  Because the aggregate function AGG is decomposable,
aggregating the partial results after the join yields the same final
result as performing the entire aggregation after the join, thereby
preserving query semantics.

Q.E.D.
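
As an illustration, here is a minimal SQL sketch of this equivalence,
using hypothetical tables R1(a, j) and R2(g, j), with G = {R2.g} and
J being R1.j = R2.j, so that G1 = {R1.j}:

-- Original form: aggregate after the join.
SELECT R2.g, SUM(R1.a)
FROM R1 JOIN R2 ON R1.j = R2.j
GROUP BY R2.g;

-- Eagerly aggregated form: partially aggregate R1 on G1 = {j} first,
-- then combine the partial sums after the join.  SUM is decomposable
-- (a sum of per-group sums), so both queries return the same result.
SELECT R2.g, SUM(p.agg_a)
FROM (SELECT j, SUM(a) AS agg_a FROM R1 GROUP BY j) p
JOIN R2 ON p.j = R2.j
GROUP BY R2.g;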

The second concern is that a RelOptInfo representing a grouped
relation may include paths that produce different row sets due to
partial aggregation being applied at different join levels.  This
potentially violates a fundamental assumption in the planner.

Additionally, the patch currently performs an exhaustive search by
exploring partial aggregation at every possible join level, leading to
excessive planning effort, which may not be justified by the
cost-benefit ratio.

To address these concerns, I'm thinking that maybe we can adopt a
strategy where partial aggregation is only pushed to the lowest
possible level in the join tree that is deemed useful.  In other
words, if we can build a grouped path like "AGG(B) JOIN A" -- and
AGG(B) yields a significant reduction in row count -- we skip
exploring alternatives like "AGG(A JOIN B)".
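
A minimal sketch of how this rule could look in the planner (all
names here are hypothetical, not necessarily the patch's actual code):

#include "nodes/pathnodes.h"    /* Relids, bms_equal() */

/*
 * Hypothetical sketch: record the lowest join level (as a set of
 * relids) at which partial aggregation was deemed useful, and
 * generate grouped paths only when building exactly that level,
 * so "AGG(A JOIN B)" is skipped once "AGG(B) JOIN A" is available.
 */
typedef struct GroupedRelInfo
{
    Relids      apply_at;       /* relids of the chosen join level */
} GroupedRelInfo;

static bool
should_generate_grouped_paths(GroupedRelInfo *ginfo, Relids cur_relids)
{
    return bms_equal(ginfo->apply_at, cur_relids);
}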

This is somewhat analogous to how we handle qual clauses: we only push
a qual clause down to the lowest scan or join level that includes all
the relations it references -- following the "filter early, join late"
principle.  For example, if predicate Pb only references B, we only
consider "A JOIN sigma[Pb](B)" and skip "sigma[Pb](A JOIN B)".  (Note
that if Pb involves costly functions and the join is highly selective,
we may want to apply the predicate after the join.)

This ensures that all grouped paths for the same grouped relation
produce the same set of rows (e.g., consider "A JOIN AGG(B) JOIN C"
vs. "AGG(B) JOIN C JOIN A").  As a result, we avoid the complexity of
comparing costs between different grouped paths of the same grouped
relation, and also eliminate the need for special handling of row
estimates on join paths.  It also significantly reduces planning
effort.

While this approach may miss potentially more efficient plans where
applying partial aggregation at a higher join level would yield better
performance, it strikes a practical balance: we can still find plans
that outperform those without eager aggregation, without incurring
excessive planning overhead.  As discussed earlier, it's uncommon in
practice to encounter multiple joins that dramatically inflate row
counts.  So in most cases, pushing partial aggregation to the lowest
level where it offers a significant row count reduction tends to be
the most efficient strategy.

I think this heuristic serves as a good starting point, and we can
look into extending it with more advanced strategies as the feature
evolves.

Any thoughts?

Thanks
Richard



Re: Eager aggregation, take 3

From
Richard Guo
Date:
On Thu, Jun 26, 2025 at 11:01 AM Richard Guo <guofenglinux@gmail.com> wrote:
> Here is the patch based on the proposed ideas.  It includes the proof
> of correctness in the README and implements the strategy of pushing
> partial aggregation only to the lowest applicable join level where it
> is deemed useful.  This is done by introducing a "Relids apply_at"
> field to track that level and ensuring that partial aggregation is
> applied only at the recorded "apply_at" level.
>
> Additionally, this patch changes how grouped relations are stored.
> Since each grouped relation represents a partially aggregated version
> of a non-grouped relation, we now associate each grouped relation with
> the RelOptInfo of the corresponding non-grouped relation.  This
> eliminates the need for a dedicated list of all grouped relations and
> avoids list searches when retrieving a grouped relation.
>
> It also addresses other previously raised concerns, such as the
> potential memory blowout risks with large partial-aggregation values,
> and includes improvements to comments and the commit message.
>
> Another change is that this feature is now enabled by default.

This patch no longer applies; here's a rebased version.  Nothing
essential has changed.

Thanks
Richard

Attachments

Re: Eager aggregation, take 3

From
Richard Guo
Date:
On Thu, Jul 24, 2025 at 12:21 PM Richard Guo <guofenglinux@gmail.com> wrote:
> This patch no longer applies; here's a rebased version.  Nothing
> essential has changed.

Based on some off-list testing by Matheus (CC'ed), several TPC-DS
queries that used to apply eager aggregation no longer do, which
suggests that the v18 patch is too strict about when eager aggregation
can be used.

I looked into query 4 and query 11, and found two reasons why they no
longer apply eager aggregation with v18.

* The has_internal_aggtranstype() check.

To avoid potential memory blowout risks from large partial aggregation
values, v18 avoids applying eager aggregation if any aggregate uses an
INTERNAL transition type, as this typically indicates a large internal
data structure (as in string_agg or array_agg).  However, this also
excludes aggregates like avg(numeric) and sum(numeric), which are
actually safe to use with eager aggregation.

What we really want to exclude are aggregate functions that can
produce large transition values by accumulating or concatenating input
rows.  So I'm wondering if we could instead check the transfn_oid
directly and explicitly exclude only F_ARRAY_AGG_TRANSFN and
F_STRING_AGG_TRANSFN.  We don't need to worry about json_agg,
jsonb_agg, or xmlagg, since they don't support partial aggregation
anyway.
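
A minimal sketch of what that check could look like (the function
name is hypothetical; the F_* constants are the generated pg_proc OID
macros):

#include "catalog/pg_proc_d.h"  /* F_ARRAY_AGG_TRANSFN, F_STRING_AGG_TRANSFN */

/*
 * Hypothetical sketch: instead of rejecting every INTERNAL transition
 * type, reject only transition functions known to accumulate or
 * concatenate input rows into an unbounded transition value.
 */
static bool
agg_transfn_safe_for_eager_agg(Oid transfn_oid)
{
    switch (transfn_oid)
    {
        case F_ARRAY_AGG_TRANSFN:
        case F_STRING_AGG_TRANSFN:
            return false;       /* transition value grows with input */
        default:
            return true;
    }
}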

* The EAGER_AGG_MIN_GROUP_SIZE threshold

This threshold defines the minimum average group size required to
consider applying eager aggregation.  It was previously set to 2, but
in v18 it was increased to 20 to be cautious about planning overhead.
This change was a snap decision though, without any profiling or data
to back it.

Looking at TPC-DS queries 4 and 11, a threshold of 10 is the minimum
needed to consider eager aggregation for them.  The resulting plans
show nice performance improvements without any measurable increase in
planning time.  So, I'm inclined to lower the threshold to 10 for now.
(Wondering whether we should make this threshold a GUC, so users can
adjust it based on their needs.)


With these two changes, here are the planning and execution time for
queries 4 and 11 (scale factor 1) on my snail-paced machine, with and
without eager aggregation.

query 4:
-- without eager aggregation
 Planning Time: 6.765 ms
 Execution Time: 34941.713 ms
-- with eager aggregation
 Planning Time: 6.674 ms
 Execution Time: 13994.183 ms

query 11:
-- without eager aggregation
 Planning Time: 3.757 ms
 Execution Time: 20888.076 ms
-- with eager aggregation
 Planning Time: 3.747 ms
 Execution Time: 7449.522 ms

Any comments on these two changes?

Thanks
Richard



Re: Eager aggregation, take 3

From
"Matheus Alcantara"
Date:
On Wed Aug 6, 2025 at 4:52 AM -03, Richard Guo wrote:
> On Thu, Jul 24, 2025 at 12:21 PM Richard Guo <guofenglinux@gmail.com> wrote:
>> This patch no longer applies; here's a rebased version.  Nothing
>> essential has changed.
>
> Based on some off-list testing by Matheus (CC'ed), several TPC-DS
> queries that used to apply eager aggregation no longer do, which
> suggests that the v18 patch is too strict about when eager aggregation
> can be used.
>
> I looked into query 4 and query 11, and found two reasons why they no
> longer apply eager aggregation with v18.
>
> * The has_internal_aggtranstype() check.
>
> To avoid potential memory blowout risks from large partial aggregation
> values, v18 avoids applying eager aggregation if any aggregate uses an
> INTERNAL transition type, as this typically indicates a large internal
> data structure (as in string_agg or array_agg).  However, this also
> excludes aggregates like avg(numeric) and sum(numeric), which are
> actually safe to use with eager aggregation.
>
> What we really want to exclude are aggregate functions that can
> produce large transition values by accumulating or concatenating input
> rows.  So I'm wondering if we could instead check the transfn_oid
> directly and explicitly exclude only F_ARRAY_AGG_TRANSFN and
> F_STRING_AGG_TRANSFN.  We don't need to worry about json_agg,
> jsonb_agg, or xmlagg, since they don't support partial aggregation
> anyway.
>
I think it makes sense to me. I'm just wondering if we should follow an
"allow" or "don't-allow" strategy. I mean, instead of a list of aggregate
functions that are not allowed, we could list the functions that are
actually allowed to use eager aggregation; that way we ensure that eager
aggregation works properly for the functions that are enabled.

> * The EAGER_AGG_MIN_GROUP_SIZE threshold
>
> This threshold defines the minimum average group size required to
> consider applying eager aggregation.  It was previously set to 2, but
> in v18 it was increased to 20 to be cautious about planning overhead.
> This change was a snap decision though, without any profiling or data
> to back it.
>
> Looking at TPC-DS queries 4 and 11, a threshold of 10 is the minimum
> needed to consider eager aggregation for them.  The resulting plans
> show nice performance improvements without any measurable increase in
> planning time.  So, I'm inclined to lower the threshold to 10 for now.
> (Wondering whether we should make this threshold a GUC, so users can
> adjust it based on their needs.)
>
Having a GUC may sound like a good idea to me TBH. This threshold may
vary from workload to workload (?).

>
> With these two changes, here are the planning and execution time for
> queries 4 and 11 (scale factor 1) on my snail-paced machine, with and
> without eager aggregation.
>
> query 4:
> -- without eager aggregation
>  Planning Time: 6.765 ms
>  Execution Time: 34941.713 ms
> -- with eager aggregation
>  Planning Time: 6.674 ms
>  Execution Time: 13994.183 ms
>
> query 11:
> -- without eager aggregation
>  Planning Time: 3.757 ms
>  Execution Time: 20888.076 ms
> -- with eager aggregation
>  Planning Time: 3.747 ms
>  Execution Time: 7449.522 ms
>
> Any comments on these two changes?
>
It sounds like a good way to go for me, looking forward to the next
patch version to perform some other tests.

Thanks

--
Matheus Alcantara



Re: Eager aggregation, take 3

From
Richard Guo
Date:
On Wed, Aug 6, 2025 at 10:44 PM Matheus Alcantara
<matheusssilv97@gmail.com> wrote:
> On Wed Aug 6, 2025 at 4:52 AM -03, Richard Guo wrote:
> > * The has_internal_aggtranstype() check.
> >
> > To avoid potential memory blowout risks from large partial aggregation
> > values, v18 avoids applying eager aggregation if any aggregate uses an
> > INTERNAL transition type, as this typically indicates a large internal
> > data structure (as in string_agg or array_agg).  However, this also
> > excludes aggregates like avg(numeric) and sum(numeric), which are
> > actually safe to use with eager aggregation.
> >
> > What we really want to exclude are aggregate functions that can
> > produce large transition values by accumulating or concatenating input
> > rows.  So I'm wondering if we could instead check the transfn_oid
> > directly and explicitly exclude only F_ARRAY_AGG_TRANSFN and
> > F_STRING_AGG_TRANSFN.  We don't need to worry about json_agg,
> > jsonb_agg, or xmlagg, since they don't support partial aggregation
> > anyway.

> I think it makes sense to me. I'm just wondering if we should follow an
> "allow" or "don't-allow" strategy. I mean, instead of a list of aggregate
> functions that are not allowed, we could list the functions that are
> actually allowed to use eager aggregation; that way we ensure that eager
> aggregation works properly for the functions that are enabled.

I ended up still checking for INTERNAL transition types, but
explicitly exempted aggregates that use the F_NUMERIC_AVG_ACCUM
transition function, assuming that avg(numeric) and sum(numeric) are
safe in this context.  This might still be overly strict, but I prefer
to be on the safe side for now.
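
For concreteness, a sketch of this revised check (the function name
is hypothetical; INTERNALOID and F_NUMERIC_AVG_ACCUM are the core
type and function OID macros):

#include "catalog/pg_proc_d.h"  /* F_NUMERIC_AVG_ACCUM */
#include "catalog/pg_type_d.h"  /* INTERNALOID */

/*
 * Hypothetical sketch: still treat an INTERNAL transition type as
 * unsafe, but exempt the numeric avg/sum transition function, whose
 * transition state stays small.
 */
static bool
agg_trans_state_is_safe(Oid aggtranstype, Oid transfn_oid)
{
    if (aggtranstype != INTERNALOID)
        return true;            /* ordinary, typed transition state */

    /* avg(numeric) and sum(numeric) share this transition function. */
    return transfn_oid == F_NUMERIC_AVG_ACCUM;
}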

> > * The EAGER_AGG_MIN_GROUP_SIZE threshold
> >
> > This threshold defines the minimum average group size required to
> > consider applying eager aggregation.  It was previously set to 2, but
> > in v18 it was increased to 20 to be cautious about planning overhead.
> > This change was a snap decision though, without any profiling or data
> > to back it.
> >
> > Looking at TPC-DS queries 4 and 11, a threshold of 10 is the minimum
> > needed to consider eager aggregation for them.  The resulting plans
> > show nice performance improvements without any measurable increase in
> > planning time.  So, I'm inclined to lower the threshold to 10 for now.
> > (Wondering whether we should make this threshold a GUC, so users can
> > adjust it based on their needs.)

> Having a GUC may sound like a good idea to me TBH. This threshold may
> vary from workload to workload (?).

I've made this threshold a GUC, with a default value of 8 (further
benchmark testing showed that a value of 10 is still too strict for
TPC-DS query 4).
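
For reference, here is a sketch of how such a GUC entry might look in
the planner's ConfigureNamesReal table (the GUC name and variable are
illustrative, not necessarily what the patch uses):

/* illustrative planner knob, default 8, as discussed above */
double      eager_aggregate_min_group_size = 8.0;

/* fragment of the ConfigureNamesReal[] table in guc_tables.c */
{
    {"eager_aggregate_min_group_size", PGC_USERSET, QUERY_TUNING_OTHER,
        gettext_noop("Sets the minimum average group size required to "
                     "consider eager aggregation."),
        NULL
    },
    &eager_aggregate_min_group_size,
    8.0, 1.0, DBL_MAX,
    NULL, NULL, NULL
},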

> > Any comments on these two changes?

> It sounds like a good way to go for me, looking forward to the next
> patch version to perform some other tests.

OK.  Here it is.

Thanks
Richard

Attachments

Re: Eager aggregation, take 3

From
"Matheus Alcantara"
Date:
On 08/08/25 22:32, Richard Guo wrote:
>> It sounds like a good way to go for me, looking forward to the next
>> patch version to perform some other tests.
>
> OK.  Here it is.
>
Thanks! I can confirm now that I can see the eager aggregate in action
in some of these queries that I've tested on the TPC-DS benchmark.

A few questions regarding the new version:

I've noticed that when a query has a WHERE clause that filters columns
of the relation being aggregated using the "=" operator, the Partial
and Finalize aggregation nodes are not present in the EXPLAIN results,
even though setup_eager_aggregation() passes all of its if checks and
RelAggInfo->agg_useful is true. For example, consider this query, used
in the eager aggregation paper, which involves some tables from the
TPC-H benchmark:

tpch=# show enable_eager_aggregate ;
 enable_eager_aggregate
------------------------
 on
(1 row)

tpch=# set max_parallel_workers_per_gather to 0;
SET

tpch=# EXPLAIN(COSTS OFF) SELECT O_CLERK,
       SUM(L_EXTENDEDPRICE * (1 - L_DISCOUNT)) AS LOSS
FROM LINEITEM
JOIN ORDERS ON L_ORDERKEY = O_ORDERKEY
WHERE L_RETURNFLAG = 'R'
GROUP BY O_CLERK;
                          QUERY PLAN
--------------------------------------------------------------
 HashAggregate
   Group Key: orders.o_clerk
   ->  Hash Join
         Hash Cond: (lineitem.l_orderkey = orders.o_orderkey)
         ->  Seq Scan on lineitem
               Filter: (l_returnflag = 'R'::bpchar)
         ->  Hash
               ->  Seq Scan on orders
(8 rows)

Debugging this query shows that all the if conditions in
setup_eager_aggregation() return false (so it does not bail out early),
and create_agg_clause_infos() and create_grouping_expr_infos() are
called. RelAggInfo->agg_useful is also being set to true, so I would
expect to see Finalize and Partial agg nodes. Is this correct, or am I
missing something here?

Removing the WHERE clause I can see the Finalize and Partial agg nodes:

tpch=# EXPLAIN(COSTS OFF) SELECT O_CLERK,
       SUM(L_EXTENDEDPRICE * (1 - L_DISCOUNT)) AS LOSS
FROM LINEITEM
JOIN ORDERS ON L_ORDERKEY = O_ORDERKEY
GROUP BY O_CLERK;
                              QUERY PLAN
----------------------------------------------------------------------
 Finalize HashAggregate
   Group Key: orders.o_clerk
   ->  Merge Join
         Merge Cond: (lineitem.l_orderkey = orders.o_orderkey)
         ->  Partial GroupAggregate
               Group Key: lineitem.l_orderkey
               ->  Index Scan using idx_lineitem_orderkey on lineitem
         ->  Index Scan using orders_pkey on orders
(8 rows)

This can also be reproduced by adding a WHERE clause to some of the
tests in eager_aggregate.sql:

postgres=# EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.a, avg(t2.c)
FROM eager_agg_t1 t1
JOIN eager_agg_t2 t2
    ON t1.b = t2.b
WHERE t2.c = 5
GROUP BY t1.a
ORDER BY t1.a;
                            QUERY PLAN
------------------------------------------------------------------
 GroupAggregate
   Output: t1.a, avg(t2.c)
   Group Key: t1.a
   ->  Sort
         Output: t1.a, t2.c
         Sort Key: t1.a
         ->  Hash Join
               Output: t1.a, t2.c
               Hash Cond: (t1.b = t2.b)
               ->  Seq Scan on public.eager_agg_t1 t1
                     Output: t1.a, t1.b, t1.c
               ->  Hash
                     Output: t2.c, t2.b
                     ->  Seq Scan on public.eager_agg_t2 t2
                           Output: t2.c, t2.b
                           Filter: (t2.c = '5'::double precision)
(16 rows)


Note that if I use ">" operator for example, this doesn't happen:
SELECT t1.a, avg(t2.c)
FROM eager_agg_t1 t1
JOIN eager_agg_t2 t2
    ON t1.b = t2.b
WHERE t2.c > 5
GROUP BY t1.a
ORDER BY t1.a;
                               QUERY PLAN
------------------------------------------------------------------------
 Finalize GroupAggregate
   Output: t1.a, avg(t2.c)
   Group Key: t1.a
   ->  Sort
         Output: t1.a, (PARTIAL avg(t2.c))
         Sort Key: t1.a
         ->  Hash Join
               Output: t1.a, (PARTIAL avg(t2.c))
               Hash Cond: (t1.b = t2.b)
               ->  Seq Scan on public.eager_agg_t1 t1
                     Output: t1.a, t1.b, t1.c
               ->  Hash
                     Output: t2.b, (PARTIAL avg(t2.c))
                     ->  Partial HashAggregate
                           Output: t2.b, PARTIAL avg(t2.c)
                           Group Key: t2.b
                           ->  Seq Scan on public.eager_agg_t2 t2
                                 Output: t2.a, t2.b, t2.c
                                 Filter: (t2.c > '5'::double precision)
(19 rows)


Is this behavior correct? If it's correct, would it be possible to check
for this limitation in setup_eager_aggregation() and maybe skip all the
other work?

--
Matheus Alcantara



Re: Eager aggregation, take 3

From
Richard Guo
Date:
On Fri, Aug 15, 2025 at 4:22 AM Matheus Alcantara
<matheusssilv97@gmail.com> wrote:
> Debugging this query shows that all the if conditions in
> setup_eager_aggregation() return false (so it does not bail out early),
> and create_agg_clause_infos() and create_grouping_expr_infos() are
> called. RelAggInfo->agg_useful is also being set to true, so I would
> expect to see Finalize and Partial agg nodes. Is this correct, or am I
> missing something here?

Well, just because eager aggregation *can* be applied does not mean
that it *will* be; it depends on whether it produces a lower-cost
execution plan.  This transformation is cost-based, so it's not the
right mindset to assume that it will always be applied when possible.

In your case, with the filter "t2.c = 5", the row estimate for t2 is
just 1 after the filter has been applied.  The planner decides that
adding a partial aggregation on top of such a small result set doesn't
offer much benefit, which seems reasonable to me.

->  Hash  (cost=18.50..18.50 rows=1 width=12)
          (actual time=0.864..0.865 rows=1.00 loops=1)
      Buckets: 1024  Batches: 1  Memory Usage: 9kB
      ->  Seq Scan on eager_agg_t2 t2  (cost=0.00..18.50 rows=1 width=12)
                                       (actual time=0.060..0.851 rows=1.00 loops=1)
            Filter: (c = '5'::double precision)
            Rows Removed by Filter: 999


With the filter "t2.c > 5", the row estimate for t2 is 995 after
filtering.  A partial aggregation can reduce that to 10 rows, so the
planner decides that adding a partial aggregation is beneficial -- and
does so.  That also seems reasonable to me.

->  Partial HashAggregate  (cost=23.48..23.58 rows=10 width=36)
                           (actual time=2.427..2.438 rows=10.00 loops=1)
      Group Key: t2.b
      Batches: 1  Memory Usage: 32kB
      ->  Seq Scan on eager_agg_t2 t2  (cost=0.00..18.50 rows=995 width=12)
                                       (actual time=0.053..0.989 rows=995.00 loops=1)
            Filter: (c > '5'::double precision)
            Rows Removed by Filter: 5

> Is this behavior correct? If it's correct, would it be possible to check
> for this limitation in setup_eager_aggregation() and maybe skip all the
> other work?

Hmm, I wouldn't consider this a limitation; it's just the result of
the planner's cost-based tournament for path selection.

Thanks
Richard



Re: Eager aggregation, take 3

From
Richard Guo
Date:
On Sat, Aug 9, 2025 at 10:32 AM Richard Guo <guofenglinux@gmail.com> wrote:
> OK.  Here it is.

This patch needs a rebase; here it is.  No changes were made.

- Richard

Attachments

Re: Eager aggregation, take 3

From
Richard Guo
Date:
On Mon, Sep 1, 2025 at 10:32 AM Richard Guo <guofenglinux@gmail.com> wrote:
> This patch needs a rebase; here it is.  No changes were made.

Here is a rebase after the GUC tables change.

- Richard

Attachments

Re: Eager aggregation, take 3

From
Robert Haas
Date:
Sorry for the slow response.

On Fri, Jun 13, 2025 at 3:42 AM Richard Guo <guofenglinux@gmail.com> wrote:
> The transformation of eager aggregation is:
>
>     GROUP BY G, AGG(A) on (R1 JOIN R2 ON J)
>     =
>     GROUP BY G, AGG(agg_A) on ((GROUP BY G1, AGG(A) AS agg_A on R1) JOIN R2 ON J)
>
> This equivalence holds under the following conditions:
>
> 1) AGG is decomposable, meaning that it can be computed in two stages:
> a partial aggregation followed by a final aggregation;
> 2) The set G1 used in the pre-aggregation of R1 includes:
>     * all columns from R1 that are part of the grouping keys G, and
>     * all columns from R1 that appear in the join condition J.
> 3) The grouping operator for any column in G1 must be compatible with
> the operator used for that column in the join condition J.

This proof seems to ignore join-order constraints. I'm not sure to
what degree that influences the ultimate outcome here, but given A
LEFT JOIN (B INNER JOIN C), we cannot simply decide that A and C
comprise R1 and B comprises R2, because it is not actually possible to
do the A-C join first and treat the result as a relation to be joined
to B. That said, I do very much like the explicit enumeration of
criteria that must be met for the optimization to be valid. That makes
it a lot easier to evaluate whether the theory of the patch is
correct.

> To address these concerns, I'm thinking that maybe we can adopt a
> strategy where partial aggregation is only pushed to the lowest
> possible level in the join tree that is deemed useful.  In other
> words, if we can build a grouped path like "AGG(B) JOIN A" -- and
> AGG(B) yields a significant reduction in row count -- we skip
> exploring alternatives like "AGG(A JOIN B)".

I really like this idea. I believe we need some heuristic here and
this seems like a reasonable one. I think there could be a better one,
potentially. For instance, it would be reasonable (in my opinion) to
do some kind of evaluation of AGG(A JOIN B) vs. AGG(B) JOIN A that
does not involve performing full path generation for both cases; e.g.
one could try to decide considering only row counts, for instance.
However, I'm not saying that would work better than your proposal
here, or that it should be a requirement for this to be committed;
it's just an idea. IMHO, the requirement to have something committable
is that there is SOME heuristic limiting the search space and at the
same time the patch can still be demonstrated to give SOME benefit. I
think what you propose here meets those criteria. I also like the fact
that it's simple and easy to understand. If it does go wrong, it will
not be too difficult for someone to understand why it has gone wrong,
which is very desirable.

> I think this heuristic serves as a good starting point, and we can
> look into extending it with more advanced strategies as the feature
> evolves.

So IOW, +1 to what you say here.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Eager aggregation, take 3

From
Robert Haas
Date:
On Wed, Aug 6, 2025 at 3:52 AM Richard Guo <guofenglinux@gmail.com> wrote:
> To avoid potential memory blowout risks from large partial aggregation
> values, v18 avoids applying eager aggregation if any aggregate uses an
> INTERNAL transition type, as this typically indicates a large internal
> data structure (as in string_agg or array_agg).  However, this also
> excludes aggregates like avg(numeric) and sum(numeric), which are
> actually safe to use with eager aggregation.
>
> What we really want to exclude are aggregate functions that can
> produce large transition values by accumulating or concatenating input
> rows.  So I'm wondering if we could instead check the transfn_oid
> directly and explicitly exclude only F_ARRAY_AGG_TRANSFN and
> F_STRING_AGG_TRANSFN.  We don't need to worry about json_agg,
> jsonb_agg, or xmlagg, since they don't support partial aggregation
> anyway.

This strategy seems fairly unfriendly towards out-of-core code. Can
you come up with something that allows the author of a SQL-callable
function to include or exclude the function by a choice that is under
their control, rather than hard-coding something in PostgreSQL itself?

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Eager aggregation, take 3

From
Robert Haas
Date:
On Fri, Sep 5, 2025 at 3:35 AM Richard Guo <guofenglinux@gmail.com> wrote:
> Here is a rebase after the GUC tables change.

I spent a bit of time scrolling through this today. Here are a few
observations/review comments.

It looks as though this will create a bunch of RelOptInfo objects that
don't end up getting used for anything once the apply_at test in
generate_grouped_paths() fails. It seems to me that it would be better
to altogether avoid generating the RelOptInfo in that case.

I think it would be worth considering generating the partially grouped
relations in a second pass. Right now, as you progress from the bottom
of the join tree towards the top, you create grouped rels as you go.
But you could equally well finish planning everything up to the
scan/join target first and then go back and add grouped_rels to
relations where it seems worthwhile. I don't know if this would really
make a big difference as you have things today, but I think it might
provide a better structure for the future, because you would then
have a lot more information with which to judge where to do
aggregation. For instance, you could look at the row counts of any
number of those ungrouped-rels before deciding where to put the
partial aggregation. That seems like it could be pretty valuable.

I haven't done a detailed comparison of generate_grouped_paths() to
other parts of the code, but I have an uncomfortable feeling that it
might be rather similar to some existing code that probably already
exists in multiple, slightly-different versions. Is there any
refactoring we could do here?

Do you need a test of this feature in combination with GEQO? You have
code for it but I don't immediately see a test. I didn't check
carefully, though.

Overall I like the direction this is heading. I don't feel
well-qualified to evaluate whether all of the things that you're doing
are completely safe. The logic in is_var_in_aggref_only() and
is_var_needed_by_join() scares me a bit because I worry that the
checks are somehow non-exhaustive, but I don't know of a specific
hazard. That said, I think that modulo such issues, this has a good
chance of significantly improving performance for certain query
shapes.

One thing to check might be whether you can construct any cases where
the strategy is applied too boldly. Given the safeguards you've put in
place, that seems a little hard to construct. The most obvious
thing that occurs to me is an aggregate where combining is more
expensive than aggregating, so that the partial aggregation gives the
appearance of saving more work than it really does, but I can't
immediately think of a problem case. Another case could be where the
row counts are off, leading to us mistakenly believing that we're
going to reduce the number of rows that need to be processed when we
really don't. Of course, such a case would arguably be a fault of the
bad row-count estimate rather than this patch, but if the patch has
that problem frequently, it might need to be addressed. Still, I have
a feeling that the testing you've already been doing might have
surfaced such cases if they were common. Have you looked into how many
queries in the regression tests, or in TPC-H/DS, expend significant
planning effort on this strategy before discarding it? That might be a
good way to get a sense of whether the patch is too aggressive, not
aggressive enough, a mix of the two, or just right.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Eager aggregation, take 3

From
Robert Haas
Date:
On Wed, Aug 6, 2025 at 3:52 AM Richard Guo <guofenglinux@gmail.com> wrote:
> Looking at TPC-DS queries 4 and 11, a threshold of 10 is the minimum
> needed to consider eager aggregation for them.  The resulting plans
> show nice performance improvements without any measurable increase in
> planning time.  So, I'm inclined to lower the threshold to 10 for now.
> (Wondering whether we should make this threshold a GUC, so users can
> adjust it based on their needs.)

Like Matheus, I think a GUC is reasonable. A significant danger here
appears to be the possibility of a performance cliff, where queries
are optimized very differently when the ratio is 9.99 vs. 10.01, say. It
would be nice if there were some way to mitigate that danger, but at
least a GUC avoids chaining the performance of the whole system to a
hard-coded value.

It might be worth considering whether there are heuristics other than
the group size that could help here. Possibly that's just making
things more complicated to no benefit. It seems to me, for example,
that reducing 100 rows to 10 is quite different from reducing a
million rows to 100,000. On the whole, the latter seems more likely to
work out well, but it's tricky, because the effort expended per group
can be arbitrarily high. I think we do want to let the cost model make
most of the decisions, and just use this threshold to prune ideas that
are obviously bad at an early stage. That said, it's worth thinking
about how this interacts with the just-considered-one-eager-agg
strategy. Does this threshold apply before or after that rule?

For instance, consider AGG(FACT_TABLE JOIN DIMENSION_TABLE), like a
count of orders grouped by customer name. Aggregating on the dimension
table (in this case, the list of customers) is probably useless, but
aggregating on the join column of the fact table has a good chance of
being useful. If we consider only one of those strategies, we want it
to be the right one. This threshold could be the thing that helps us
to get it right.
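
(For concreteness, a minimal sketch of the query shape I have in mind;
the table and column names are hypothetical:)

    SELECT c.name, count(*)
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    GROUP BY c.name;

    -- The useful eager-aggregation candidate here is a partial count()
    -- over orders grouped by o.customer_id, pushed below the join;
    -- pre-aggregating customers would buy nothing.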

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Eager aggregation, take 3

From
Richard Guo
Date:
On Fri, Sep 5, 2025 at 10:10 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jun 13, 2025 at 3:42 AM Richard Guo <guofenglinux@gmail.com> wrote:
> > The transformation of eager aggregation is:
> >
> >     GROUP BY G, AGG(A) on (R1 JOIN R2 ON J)
> >     =
> >     GROUP BY G, AGG(agg_A) on ((GROUP BY G1, AGG(A) AS agg_A on R1)
> > JOIN R2 ON J)
> >
> > This equivalence holds under the following conditions:
> >
> > 1) AGG is decomposable, meaning that it can be computed in two stages:
> > a partial aggregation followed by a final aggregation;
> > 2) The set G1 used in the pre-aggregation of R1 includes:
> >     * all columns from R1 that are part of the grouping keys G, and
> >     * all columns from R1 that appear in the join condition J.
> > 3) The grouping operator for any column in G1 must be compatible with
> > the operator used for that column in the join condition J.

> This proof seems to ignore join-order constraints. I'm not sure to
> what degree that influences the ultimate outcome here, but given A
> LEFT JOIN (B INNER JOIN C), we cannot simply decide that A and C
> comprise R1 and B comprises R2, because it is not actually possible to
> do the A-C join first and treat the result as a relation to be joined
> to B. That said, I do very much like the explicit enumeration of
> criteria that must be met for the optimization to be valid. That makes
> it a lot easier to evaluate whether the theory of the patch is
> correct.

Thanks for pointing this out.  I should have clarified that the proof
is intended for the inner join case.  My plan was to first establish
the correctness for inner joins, and then extend the proof to cover
outer joins, but I failed to make that clear.

In the case where there are any outer joins, the situation becomes
more complex due to join order constraints and the semantics of
null-extension in outer joins.  If the relations that contain at least
one aggregation column cannot be treated as a single relation because
of the join order constraints, partial aggregation paths will not be
generated, and thus the transformation is not applicable.

Otherwise, to preserve correctness, we need to add an additional
condition: R1 must not be on the nullable side of any outer join.
This ensures that partial aggregation over R1 does not suppress any
null-extended rows that would be introduced by outer joins.

I'll update the proof in README to cover the outer join case.
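
(To make the inner-join case concrete, the transformation can be
written out in SQL roughly as follows, using hypothetical tables t1
and t2; sum() is decomposable, and G1 = {g, j} contains both the
grouping key and the join column from t1, per the conditions above:)

    -- original form
    SELECT t1.g, sum(t1.a)
    FROM t1 JOIN t2 ON t1.j = t2.j
    GROUP BY t1.g;

    -- eagerly aggregated form (conceptually; the planner builds this
    -- internally as Partial/Finalize Aggregate nodes)
    SELECT p.g, sum(p.psum)
    FROM (SELECT g, j, sum(a) AS psum
          FROM t1
          GROUP BY g, j) p
    JOIN t2 ON p.j = t2.j
    GROUP BY p.g;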

> > To address these concerns, I'm thinking that maybe we can adopt a
> > strategy where partial aggregation is only pushed to the lowest
> > possible level in the join tree that is deemed useful.  In other
> > words, if we can build a grouped path like "AGG(B) JOIN A" -- and
> > AGG(B) yields a significant reduction in row count -- we skip
> > exploring alternatives like "AGG(A JOIN B)".

> I really like this idea. I believe we need some heuristic here and
> this seems like a reasonable one. I think there could be a better one,
> potentially. For instance, it would be reasonable (in my opinion) to
> do some kind of evaluation of AGG(A JOIN B) vs. AGG(B) JOIN A that
> does not involve performing full path generation for both cases; e.g.
> one could try to decide considering only row counts, for instance.
> However, I'm not saying that would work better than your proposal
> here, or that it should be a requirement for this to be committed;
> it's just an idea. IMHO, the requirement to have something committable
> is that there is SOME heuristic limiting the search space and at the
> same time the patch can still be demonstrated to give SOME benefit. I
> think what you propose here meets those criteria. I also like the fact
> that it's simple and easy to understand. If it does go wrong, it will
> not be too difficult for someone to understand why it has gone wrong,
> which is very desirable.

> > I think this heuristic serves as a good starting point, and we can
> > look into extending it with more advanced strategies as the feature
> > evolves.

> So IOW, +1 to what you say here.

Thanks for liking this idea.  Another way this heuristic makes life
easier is that it ensures all grouped paths for the same grouped
relation produce the same set of rows.  This means we don't need all
the hacks for comparing costs between grouped paths, nor do we have to
resolve disputes about how many RelOptInfos to create for a single
grouped relation.  I'd prefer to keep this property for now and
explore more complex heuristics in the future.

- Richard



Re: Eager aggregation, take 3

From
Richard Guo
Date:
On Fri, Sep 5, 2025 at 10:12 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Aug 6, 2025 at 3:52 AM Richard Guo <guofenglinux@gmail.com> wrote:
> > What we really want to exclude are aggregate functions that can
> > produce large transition values by accumulating or concatenating input
> > rows.  So I'm wondering if we could instead check the transfn_oid
> > directly and explicitly exclude only F_ARRAY_AGG_TRANSFN and
> > F_STRING_AGG_TRANSFN.  We don't need to worry about json_agg,
> > jsonb_agg, or xmlagg, since they don't support partial aggregation
> > anyway.

> This strategy seems fairly unfriendly towards out-of-core code. Can
> you come up with something that allows the author of a SQL-callable
> function to include or exclude the function by a choice that is under
> their control, rather than hard-coding something in PostgreSQL itself?

Yeah, ideally we should tell whether an aggregate's transition state
may grow unbounded just by looking at system catalogs.  Unfortunately,
after trying for a while, it seems to me that the current catalog
doesn't provide enough information.

I once considered adding a flag (e.g., aggtransbounded) to catalog
pg_aggregate to indicate whether the transition state size is bounded.
This flag could be specified by users when creating aggregate
functions, and then leveraged by features such as eager aggregation.

However, adding new information to system catalogs involves a lot of
discussions and changes, including updates to DDL commands, dump and
restore processes, and upgrade procedures.  Therefore, to keep the
focus of this patch on the eager aggregation feature itself, I prefer
to treat this enhancement as future work.

- Richard



Re: Eager aggregation, take 3

From
Richard Guo
Date:
On Fri, Sep 5, 2025 at 11:37 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I spent a bit of time scrolling through this today. Here are a few
> observations/review comments.

Thanks for all the comments.

> It looks as though this will create a bunch of RelOptInfo objects that
> don't end up getting used for anything once the apply_at test in
> generate_grouped_paths() fails. It seems to me that it would be better
> to altogether avoid generating the RelOptInfo in that case.

Hmm, that's not the case.  make_grouped_join_rel() guarantees that for
a given relation, if its grouped paths are not considered useful, and
no grouped paths can be built by joining grouped input relations, then
its grouped relation will not be created.  IOW, we only create a
grouped RelOptInfo if we've determined that we can generate useful
grouped paths for it.

In the case you mentioned, where the apply_at test in
generate_grouped_paths() fails, it must mean that grouped paths can be
built by joining its outer and inner relations.  Also, note that calls
to generate_grouped_paths() are always followed by calls to
set_cheapest().  If we failed to generate any grouped paths for a
grouped relation, the set_cheapest() call should already have reported
an error.

> I think it would be worth considering generating the partially grouped
> relations in a second pass. Right now, as you progress from the bottom
of the join tree towards the top, you create grouped rels as you go.
> But you could equally well finish planning everything up to the
> scan/join target first and then go back and add grouped_rels to
> relations where it seems worthwhile.

Hmm, I don't think so.  I think the presence of eager aggregation
could change the best join order.  For example, without eager
aggregation, the optimizer might find that (A JOIN B) JOIN C is the best
join order.  But with eager aggregation on B, the optimizer could
prefer A JOIN (AGG(B) JOIN C).  I'm not sure how we could find the
best join order with eager aggregation applied without building the
join tree from the bottom up.

> I haven't done a detailed comparison of generate_grouped_paths() to
> other parts of the code, but I have an uncomfortable feeling that it
> might be rather similar to some existing code that probably already
> exists in multiple, slightly-different versions. Is there any
> refactoring we could do here?

Yeah, we currently have several functions that do similar, but not
exactly the same, things.  Maybe some refactoring is possible -- maybe
not -- I haven't looked into it closely yet.  However, I'd prefer to
address that in a separate patch if possible, since this issue also
exists on master, and I want to avoid introducing such changes in this
already large patch.

> Do you need a test of this feature in combination with GEQO? You have
> code for it but I don't immediately see a test. I didn't check
> carefully, though.

Good point.  I have manually tested GEQO by setting geqo_threshold
to 2 and running the regression tests to check for any planning
errors, crashes, or incorrect results.  However, I'm not sure where
test cases for GEQO should be added.  I searched the regression tests
and found only one explicit GEQO test, added back in 2009 (commit
a43b190e3).  It's not quite clear to me what the current policy is for
adding GEQO test cases.

Anyway, I will add some test cases in eager_aggregate.sql with
geqo_threshold set to 2.
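
(Roughly along these lines -- a sketch only, with a made-up query;
geqo, geqo_threshold and geqo_seed are the actual GUCs involved:)

    SET geqo = on;
    SET geqo_threshold = 2;
    SET geqo_seed = 0;  -- keep the randomized join search reproducible
    EXPLAIN (COSTS OFF)
    SELECT t1.g, avg(t2.x)
    FROM t1 JOIN t2 ON t1.id = t2.id
    GROUP BY t1.g;
    RESET geqo_seed;
    RESET geqo_threshold;
    RESET geqo;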

> Overall I like the direction this is heading. I don't feel
> well-qualified to evaluate whether all of the things that you're doing
> are completely safe. The logic in is_var_in_aggref_only() and
> is_var_needed_by_join() scares me a bit because I worry that the
> checks are somehow non-exhaustive, but I don't know of a specific
> hazard. That said, I think that modulo such issues, this has a good
> chance of significantly improving performance for certain query
> shapes.
>
> One thing to check might be whether you can construct any cases where
> the strategy is applied too boldly. Given the safeguards you've put in
> place, that seems a little hard to construct. The most obvious
> thing that occurs to me is an aggregate where combining is more
> expensive than aggregating, so that the partial aggregation gives the
> appearance of saving more work than it really does, but I can't
> immediately think of a problem case. Another case could be where the
> row counts are off, leading to us mistakenly believing that we're
> going to reduce the number of rows that need to be processed when we
> really don't. Of course, such a case would arguably be a fault of the
> bad row-count estimate rather than this patch, but if the patch has
> that problem frequently, it might need to be addressed. Still, I have
> a feeling that the testing you've already been doing might have
> surfaced such cases if they were common. Have you looked into how many
> queries in the regression tests, or in TPC-H/DS, expend significant
> planning effort on this strategy before discarding it? That might be a
> good way to get a sense of whether the patch is too aggressive, not
> aggressive enough, a mix of the two, or just right.

I previously looked into the TPC-DS queries where eager aggregation
was applied and didn't observe any regressions in planning time or
execution time.  I can run TPC-DS again to check the planning time for
the remaining queries.

- Richard



Re: Eager aggregation, take 3

From
Richard Guo
Date:
On Fri, Sep 5, 2025 at 11:50 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Like Matheus, I think a GUC is reasonable. A significant danger here
> appears to be the possibility of a performance cliff, where queries
> are optimized very differently when the ratio is 9.99 vs. 10.01, say. It
> would be nice if there were some way to mitigate that danger, but at
> least a GUC avoids chaining the performance of the whole system to a
> hard-coded value.

Yeah, I think the performance cliff issue does exist.  It might be
mitigated by carefully selecting the threshold value to ensure that
small differences in the average group size near the boundary don't
cause big performance swings with and without eager aggregation, but
this doesn't seem like an easy task.

How is this issue avoided in other thresholds?  For example, with
min_parallel_table_scan_size, is there a performance cliff when the
table size is 7.99MB vs. 8.01MB, where a parallel scan is considered
in the latter case but not the former?

> It might be worth considering whether there are heuristics other than
> the group size that could help here. Possibly that's just making
> things more complicated to no benefit. It seems to me, for example,
> that reducing 100 rows to 10 is quite different from reducing a
> million rows to 100,000. On the whole, the latter seems more likely to
> work out well, but it's tricky, because the effort expended per group
> can be arbitrarily high. I think we do want to let the cost model make
> most of the decisions, and just use this threshold to prune ideas that
> are obviously bad at an early stage. That said, it's worth thinking
> about how this interacts with the just-considered-one-eager-agg
> strategy. Does this threshold apply before or after that rule?

If I understand correctly, this means that we need to explore each
join level to find the optimal position for applying partial
aggregation.  For example, suppose Agg(B) reduces 100 rows to 10, and
Agg(A JOIN B) reduces a million rows to 100,000; it might be better to
apply partial aggregation at the (A JOIN B) level rather than just
over B.  However, that's not always the case: the Agg(B) option can
reduce the number of input rows to the join earlier, potentially
outperforming the Agg(A JOIN B) approach.  Therefore, we need to
consider both options and compare their costs.

This is actually what the patch used to do before I introduced the
always-push-to-lowest heuristic.

> For instance, consider AGG(FACT_TABLE JOIN DIMENSION_TABLE), like a
> count of orders grouped by customer name. Aggregating on the dimension
> table (in this case, the list of customers) is probably useless, but
> aggregating on the join column of the fact table has a good chance of
> being useful. If we consider only one of those strategies, we want it
> to be the right one. This threshold could be the thing that helps us
> to get it right.

Now I see what you meant.  However, in the current implementation, we
only push partial aggregation down to relations that contain all the
aggregation columns.  So, in the case you mentioned, if the
aggregation columns come from the dimension table, unfortunately, we
don't have the option to partially aggregate the fact table.

The paper does discuss several other transformations, such as "Eager
Count", "Double Eager", and "Eager Split", that can perform partial
aggregation on relations that don't contain aggregation columns, or
even on both sides of the join.  However, those are beyond the scope
of this patch.

- Richard



Re: Eager aggregation, take 3

From
Robert Haas
Date:
On Tue, Sep 9, 2025 at 5:20 AM Richard Guo <guofenglinux@gmail.com> wrote:
> Yeah, ideally we should tell whether an aggregate's transition state
> may grow unbounded just by looking at system catalogs.  Unfortunately,
> after trying for a while, it seems to me that the current catalog
> doesn't provide enough information.
>
> I once considered adding a flag (e.g., aggtransbounded) to catalog
> pg_aggregate to indicate whether the transition state size is bounded.
> This flag could be specified by users when creating aggregate
> functions, and then leveraged by features such as eager aggregation.
>
> However, adding new information to system catalogs involves a lot of
> discussions and changes, including updates to DDL commands, dump and
> restore processes, and upgrade procedures.  Therefore, to keep the
> focus of this patch on the eager aggregation feature itself, I prefer
> to treat this enhancement as future work.

I don't really like that. I think there's a lot of danger of that
future work never getting done, and thus leaving us stuck more-or-less
permanently with a system that's not really extensible. Data type and
function extensibility is one of the strongest areas of PostgreSQL,
and we should try hard to avoid situations where we regress it. I'm
not sure whether the aggtransbounded flag is exactly the right thing
here, but I don't think adding a new catalog column is an unreasonable
amount of work for a feature of this type.

Having said that, I wonder whether there's some way that we could use
the aggtransspace property for this. For instance, for stanullfrac, we
use values >0 to mean absolute quantities and values <0 to mean
proportions. The current definition of aggtransspace assigns no meaning
to values <0, and the current coding seems to assume that sizes are
fixed regardless of how many inputs are supplied. Maybe we could
define aggtransspace<0 to mean that the number of bytes used per input
value is the additive inverse of the value, or something like that.
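
(Under that proposed convention -- to be clear, not something the
catalogs support today, where SSPACE is only documented for
non-negative sizes -- an aggregate whose state grows with its input
might be declared along these lines, with the planner then estimating
space_used = -aggtransspace * rowcount:)

    CREATE AGGREGATE my_concat(text) (
        SFUNC = textcat,   -- built-in text || text
        STYPE = text,
        INITCOND = '',
        SSPACE = -32       -- hypothetical: ~32 bytes of state growth per input row
    );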

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Eager aggregation, take 3

From
Robert Haas
Date:
On Tue, Sep 9, 2025 at 6:30 AM Richard Guo <guofenglinux@gmail.com> wrote:
> > I think it would be worth considering generating the partially grouped
> > relations in a second pass. Right now, as you progress from the bottom
> > of the join tree towards the top, you create grouped rels as you go.
> > But you could equally well finish planning everything up to the
> > scan/join target first and then go back and add grouped_rels to
> > relations where it seems worthwhile.
>
> Hmm, I don't think so.  I think the presence of eager aggregation
> could change the best join order.  For example, without eager
> aggregation, the optimizer might find that (A JOIN B) JOIN C is the best
> join order.  But with eager aggregation on B, the optimizer could
> prefer A JOIN (AGG(B) JOIN C).  I'm not sure how we could find the
> best join order with eager aggregation applied without building the
> join tree from the bottom up.

Oh, that is a problem, yes. :-(

> > I haven't done a detailed comparison of generate_grouped_paths() to
> > other parts of the code, but I have an uncomfortable feeling that it
> > might be rather similar to some existing code that probably already
> > exists in multiple, slightly-different versions. Is there any
> > refactoring we could do here?
>
> Yeah, we currently have several functions that do similar, but not
> exactly the same, things.  Maybe some refactoring is possible -- maybe
> not -- I haven't looked into it closely yet.  However, I'd prefer to
> address that in a separate patch if possible, since this issue also
> exists on master, and I want to avoid introducing such changes in this
> already large patch.

Well, it's not just a matter of "this already exists" -- it gets
harder and harder to unify things the more near-copies you add.

> Good point.  I have manually tested GEQO by setting geqo_threshold
> to 2 and running the regression tests to check for any planning
> errors, crashes, or incorrect results.  However, I'm not sure where
> test cases for GEQO should be added.  I searched the regression tests
> and found only one explicit GEQO test, added back in 2009 (commit
> a43b190e3).  It's not quite clear to me what the current policy is for
> adding GEQO test cases.
>
> Anyway, I will add some test cases in eager_aggregate.sql with
> geqo_threshold set to 2.

Sounds good. I think GEQO is mostly-unmaintained these days, but if
we're updating the code, I think it is good to add tests. Being that
the code is so old, it probably lacks adequate test coverage.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Eager aggregation, take 3

From
Richard Guo
Date:
On Tue, Sep 9, 2025 at 11:20 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Having said that, I wonder whether there's some way that we could use
> the aggtransspace property for this. For instance, for stanullfrac, we
> use values >0 to mean absolute quantities and values <0 to mean
> proportions. The current definition of aggtransspace assigns no meaning
> to values <0, and the current coding seems to assume that sizes are
> fixed regardless of how many inputs are supplied. Maybe we could
> define aggtransspace<0 to mean that the number of bytes used per input
> value is the additive inverse of the value, or something like that.

I really like this idea.  Currently, aggtransspace represents an
estimate of the transition state size provided by the aggregate
definition.  If it's set to zero, a default estimate based on the
state data type is used.  Negative values currently have no defined
meaning.  I think it makes perfect sense to reuse this field so that
a negative value indicates that the transition state data can grow
unboundedly in size.

Attached 0002 implements this idea.  It requires fewer code changes
than I expected.  This is mainly because our current code uses
aggtransspace in such a way that if it's a positive value, that value
is used as it's provided by the aggregate definition; otherwise, some
heuristics are applied to estimate the size.  For the aggregates that
accumulate input rows (e.g., array_agg, string_agg), I don't currently
have a better heuristic for estimating their size, so I've chosen to
keep the current logic.  This won't regress anything in estimating
transition state data size.
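
(Assuming 0002's convention of marking such aggregates with a negative
aggtransspace, the affected entries should then be visible straight
from the catalog, e.g.:)

    SELECT aggfnoid, aggtransspace
    FROM pg_aggregate
    WHERE aggtransspace < 0;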

- Richard

Attachments

Re: Eager aggregation, take 3

From
Robert Haas
Date:
On Fri, Sep 12, 2025 at 5:34 AM Richard Guo <guofenglinux@gmail.com> wrote:
> I really like this idea.  Currently, aggtransspace represents an
> estimate of the transition state size provided by the aggregate
> definition.  If it's set to zero, a default estimate based on the
> state data type is used.  Negative values currently have no defined
> meaning.  I think it makes perfect sense to reuse this field so that
> a negative value indicates that the transition state data can grow
> unboundedly in size.
>
> Attached 0002 implements this idea.  It requires fewer code changes
> than I expected.  This is mainly because our current code uses
> aggtransspace in such a way that if it's a positive value, that value
> is used as it's provided by the aggregate definition; otherwise, some
> heuristics are applied to estimate the size.  For the aggregates that
> accumulate input rows (e.g., array_agg, string_agg), I don't currently
> have a better heuristic for estimating their size, so I've chosen to
> keep the current logic.  This won't regress anything in estimating
> transition state data size.

This might be OK, but it's not what I was suggesting: I was suggesting
trying to do a calculation like space_used = -aggtransspace *
rowcount, not just using a <0 value as a sentinel.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Eager aggregation, take 3

From
Richard Guo
Date:
On Sat, Sep 13, 2025 at 3:48 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Sep 12, 2025 at 5:34 AM Richard Guo <guofenglinux@gmail.com> wrote:
> > I really like this idea.  Currently, aggtransspace represents an
> > estimate of the transition state size provided by the aggregate
> > definition.  If it's set to zero, a default estimate based on the
> > state data type is used.  Negative values currently have no defined
> > meaning.  I think it makes perfect sense to reuse this field so that
> > a negative value indicates that the transition state data can grow
> > unboundedly in size.
> >
> > Attached 0002 implements this idea.  It requires fewer code changes
> > than I expected.  This is mainly because our current code uses
> > aggtransspace in such a way that if it's a positive value, that value
> > is used as it's provided by the aggregate definition; otherwise, some
> > heuristics are applied to estimate the size.  For the aggregates that
> > accumulate input rows (e.g., array_agg, string_agg), I don't currently
> > have a better heuristic for estimating their size, so I've chosen to
> > keep the current logic.  This won't regress anything in estimating
> > transition state data size.

> This might be OK, but it's not what I was suggesting: I was suggesting
> trying to do a calculation like space_used = -aggtransspace *
> rowcount, not just using a <0 value as a sentinel.

I've considered your suggestion, but I'm not sure I'll adopt it in the
end.  Here's why:

1) At the point where we check whether any aggregates might pose a
risk of excessive memory usage during partial aggregation, row count
information is not yet available.  You could argue that we could
reorganize the logic to perform this check after we've had the row
count, but that seems quite tricky.  If I understand correctly, the
"rowcount" in this context actually means the number of rows within
one partial group.  That would require us to first decide on the
grouping expressions for the partial aggregation, then compute the
group row counts, then estimate space usage, and only then decide
whether memory usage is excessive and fall back.  This would come
quite late in planning and adds nontrivial overhead, compared to the
current approach which checks at the very beginning.

2) Even if we were able to estimate space usage based on the number of
rows per partial group and determined that memory usage seems
acceptable, we still couldn't guarantee that the transition state data
won't grow excessively after further joins.  Joins can multiply
partial aggregates, potentially causing a blowup in memory usage even
if the initial estimate seemed safe.

3) I don't think "-aggtransspace * rowcount" reflects the true memory
footprint for aggregates that accumulate input rows.  For example,
what if we have an aggregate like string_agg(somecolumn, 'a very long
delimiter')?

4) AFAICS, the main downside of the current approach compared to yours
is that it avoids pushing down aggregates like string_agg() that
accumulate input rows, whereas your suggestion might allow pushing
them down in some cases where we *think* it wouldn't blow up memory.
You might argue that the current implementation is over-conservative.
But I prefer to start safe.

That said, I appreciate you proposing the idea of reusing
aggtransspace, although I ended up using it in a different way than
you suggested.

- Richard



Re: Eager aggregation, take 3

From
Richard Guo
Date:
I've run TPC-DS again to compare planning times with and without eager
aggregation.  Out of 99 queries, only one query (query 64) shows a
noticeable increase in planning time.  This query performs inner joins
across 38 tables.  This is a very large search space.  (I'm talking
about the standard join search method, not the GEQO.)

If my math doesn't fail me, the maximum number of different join
orders when joining n tables is: Catalan(n − 1) × n!.  For n = 38,
this number is astronomically large.  In practice, query 64 joins 19
tables twice (due to a CTE), which still results in about 3.4E28
different join orders.
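
(As a sanity check of the formula at a small n: for n = 3, it gives
Catalan(2) × 3! = 2 × 6 = 12 distinct join orders, i.e. two binary
tree shapes times the six orderings of the three tables.)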

Of course, in practice, with the help of join_collapse_limit and other
heuristics, the effective search space is reduced a lot, but even
then, it remains very large.  Given this, I'm not too surprised that
query 64 shows an increase in planning time when eager aggregation is
applied -- exploring the best join order in such a space is inherently
expensive.

That said, I've identified a few performance hotspots that can be
optimized to help reduce planning time:

1) the exprs_known_equal() call in get_expression_sortgroupref(),
which is used to check if a given expression is known equal to a
grouping expression due to ECs.  We can optimize this by storing the
EC of each grouping expression, and then get_expression_sortgroupref()
would only need to search the relevant EC, rather than scanning all of
them.

2) the estimate_num_groups() call in create_rel_agg_info().  We can
optimize this by avoiding unnecessary calls to estimate_num_groups()
where possible.

Attached is an updated version of the patch with these optimizations
applied.  With this patch, the planning times for query 64, with and
without eager aggregation, are:

-- with eager aggregation
 Planning Time: 9432.042 ms
-- without eager aggregation
 Planning Time: 7196.999 ms

I think the increase in planning time is acceptable given the large
search space involved, though I may be biased.

- Richard

Attachments

Re: Eager aggregation, take 3

From
Richard Guo
Date:
On Thu, Sep 25, 2025 at 1:23 PM Richard Guo <guofenglinux@gmail.com> wrote:
> Attached is an updated version of the patch with these optimizations
> applied.

FWIW, I plan to do another self-review of this patch soon, with the
goal of assessing whether it's ready to be pushed.  If anyone has any
concerns about any part of the patch or would like to review it, I
would greatly appreciate hearing from you.

- Richard



Re: Eager aggregation, take 3

From
Matheus Alcantara
Date:
[ getting back to testing this patch ...]

In response to my last email, you replied:
>> Debugging this query shows that all if conditions on
>> setup_eager_aggregation() return false and create_agg_clause_infos()
>> and create_grouping_expr_infos() are called. The RelAggInfo->agg_useful
>> is also being set to true so I would expect to see Finalize and Partial
>> agg nodes, is this correct or am I missing something here?
>
> Well, just because eager aggregation *can* be applied does not mean
> that it *will* be; it depends on whether it produces a lower-cost
> execution plan.  This transformation is cost-based, so it's not the
> right mindset to assume that it will always be applied when possible.
>
Sorry for the noise here. I didn't consider the costs.

On Sun Sep 28, 2025 at 11:09 PM -03, Richard Guo wrote:
> On Thu, Sep 25, 2025 at 1:23 PM Richard Guo <guofenglinux@gmail.com> wrote:
>> Attached is an updated version of the patch with these optimizations
>> applied.
>
> FWIW, I plan to do another self-review of this patch soon, with the
> goal of assessing whether it's ready to be pushed.  If anyone has any
> concerns about any part of the patch or would like to review it, I
> would greatly appreciate hearing from you.
>
I spent some time testing patch v23 using the TPC-DS benchmark and am
seeing worse execution times when using eager aggregation.
The most interesting cases are:

Query    | planning time | execution time
query 31 |        -2.03% |        -99.56%
query 71 |       -15.51% |        -68.88%
query 20 |       -10.77% |        -32.40%
query 26 |       -28.01% |        -32.35%
query 85 |       -10.57% |        -31.91%
query 77 |       -30.07% |        -31.38%
query 69 |       -32.79% |        -29.21%
query 32 |       -68.48% |        -27.89%
query 57 |        -7.99% |        -27.32%
query 91 |       -24.81% |        -26.20%
query 23 |       -11.72% |        -18.24%

Query 31 seems bad; I don't know if I'm doing something completely
wrong, but I've just set up a TPC-DS database and then executed the
query on master and with the v23 patch, and I got these results:

Master:
    Planning Time: 3.191 ms
    Execution Time: 16950.619 ms

Patch:
    Planning Time: 3.257 ms
    Execution Time: 3848355.646 ms

Note that I've executed ANALYZE before running the queries on both
scenarios (master and patched).

I'm attaching an EXPLAIN(ANALYZE) output for the query 31 from master
and with the patch applied.

Please let me know if there is any other test that I can run to
benchmark this patch.

--
Matheus Alcantara

Attachments

Re: Eager aggregation, take 3

From
Richard Guo
Date:
On Thu, Oct 2, 2025 at 8:55 AM Matheus Alcantara
<matheusssilv97@gmail.com> wrote:
> Query 31 seems bad; I don't know if I'm doing something completely
> wrong, but I've just set up a TPC-DS database and then executed the
> query on master and with the v23 patch, and I got these results:
>
> Master:
>     Planning Time: 3.191 ms
>     Execution Time: 16950.619 ms
>
> Patch:
>     Planning Time: 3.257 ms
>     Execution Time: 3848355.646 ms

Thanks for reporting this.  It does seem odd.  I checked the TPC-DS
benchmarking on v13 and found that the execution time for query 31,
with and without eager aggregation, is as follows:

       EAGER-AGG-OFF           EAGER-AGG-ON
q31     10463.536 ms            10244.175 ms

There appears to be a regression between v13 and v23.  Looking into
it...

- Richard



Re: Eager aggregation, take 3

From
Richard Guo
Date:
On Thu, Oct 2, 2025 at 10:13 AM Richard Guo <guofenglinux@gmail.com> wrote:
> On Thu, Oct 2, 2025 at 8:55 AM Matheus Alcantara
> <matheusssilv97@gmail.com> wrote:
> > Query 31 seems bad; I don't know if I'm doing something completely
> > wrong, but I've just set up a TPC-DS database and then executed the
> > query on master and with the v23 patch, and I got these results:
> >
> > Master:
> >     Planning Time: 3.191 ms
> >     Execution Time: 16950.619 ms
> >
> > Patch:
> >     Planning Time: 3.257 ms
> >     Execution Time: 3848355.646 ms

> Thanks for reporting this.  It does seem odd.  I checked the TPC-DS
> benchmarking on v13 and found that the execution time for query 31,
> with and without eager aggregation, is as follows:
>
>        EAGER-AGG-OFF           EAGER-AGG-ON
> q31     10463.536 ms            10244.175 ms
>
> There appears to be a regression between v13 and v23.  Looking into
> it...

I noticed something interesting while comparing the two EXPLAIN
(ANALYZE) outputs: the patched version uses parallel plans, whereas
the master does not.  To rule that out as a factor, I ran "SET
max_parallel_workers_per_gather TO 0;" and re-ran query 31 on both
master and the patched version.  This time, I got a positive result.

-- on master
 Planning Time: 5.281 ms
 Execution Time: 7222.665 ms

-- on patched
 Planning Time: 4.855 ms
 Execution Time: 5977.287 ms

It seems eager aggregation doesn't cope well with parallel plans for
this query.  Looking into it.

- Richard



Re: Eager aggregation, take 3

From
Richard Guo
Date:
On Thu, Oct 2, 2025 at 10:39 AM Richard Guo <guofenglinux@gmail.com> wrote:
> It seems eager aggregation doesn't cope well with parallel plans for
> this query.  Looking into it.

It turns out that this is not related to parallel plans but rather to
poor size estimates.

Looking at query 31, it involves joining 6 base relations, all of
which are CTE references (i.e., RTE_CTE relations) to two different
CTEs.  Each CTE involves aggregations and GROUP BY clauses.
Unfortunately, our size estimates for CTE relations are quite poor,
especially when the CTE uses GROUP BY.  In these cases, we don't have
any ANALYZE statistics available (cf. examine_simple_variable).  As a
result, when computing the selectivity of the CTE relation's qual
clauses, we have to fall back on default values.  For example, for
quals like "CTE.var = const", which are used a lot in query 31, the
selectivity is computed as "1.0 / DEFAULT_NUM_DISTINCT(200)", with the
assumption that there are DEFAULT_NUM_DISTINCT distinct values in the
relation, and that these values are equally common (cf. var_eq_const).

The consequence is that the size estimates are significantly different
from the actual values.  For example, from the EXPLAIN(ANALYZE) output
provided by Matheus:

->  CTE Scan on ws ws3  (cost=0.00..1797.35 rows=2 width=110)
                 (actual time=0.001..74.725 rows=1261.00 loops=1)
      Filter: ((d_year = 1999) AND (d_qoy = 3))

Interestingly, with eager aggregation applied, the row count estimates
for the two CTE plans actually become closer to the actual values.

-- without eager aggregation
CTE ws
  ->  HashAggregate  (cost=96009.03..114825.35 rows=718952 width=54)
                (actual time=977.215..1014.889 rows=23320.00 loops=1)

-- with eager aggregation
CTE ws
  ->  Finalize GroupAggregate  (cost=52144.19..62314.79 rows=71894 width=54)
                          (actual time=275.121..340.107 rows=23312.00 loops=1)

However, due to the highly underestimated selectivity for the qual
clauses, the row count estimates for CTE Scan nodes become worse.
This is because:

-- without eager aggregation
718952 * (1.0/200) * (1.0/200) ~= 18

-- with eager aggregation
71894 * (1.0/200) * (1.0/200) ~= 2

... while the actual row count is 1261.00 as shown above.

That is to say, on master, the CTE plan rows are overestimated while
the selectivity estimates are severely underestimated.  With eager
aggregation, the CTE plan rows become closer to the actual values, but
the selectivity estimates remain equally underestimated.  As a result,
the row count estimates for the CTE Scan nodes worsen with eager
aggregation.  This causes the join order in the final plan to change
when eager aggregation is applied, leading to longer execution times
in this case.


Another point to note is that, due to severely underestimated
selectivity estimates (0.000025, sometimes 0.000000125), the size
estimates for the CTE relations are very small, causing the planner to
tend to choose nestloops.  I tried manually disabling nestloop, and
here is what I got for query 31.

-- on master, set enable_nestloop to on;
 Planning Time: 4.613 ms
 Execution Time: 7142.090 ms

-- on master, set enable_nestloop to off;
 Planning Time: 4.315 ms
 Execution Time: 2262.330 ms

-- on patched, set enable_nestloop to off;
 Planning Time: 4.321 ms
 Execution Time: 1214.376 ms

That is, on master, simply disabling nestloop makes query 31 run more
than 3 times faster.  Enabling eager aggregation on top of that
improves performance further, making it run 1.86 times faster relative
to the nested-loop-disabled baseline.

I manually disabled nested loops for other TPC-DS queries on master
and discovered some additional interesting findings.

For query 4, on master:

-- set enable_nestloop to on
 Planning Time: 3.054 ms
 Execution Time: 3231356.258 ms

-- set enable_nestloop to off
 Planning Time: 4.291 ms
 Execution Time: 12751.170 ms

That is, on master, simply disabling nestloop makes query 4 run more
than 253 times faster.

For query 11, on master:

-- set enable_nestloop to on
 Planning Time: 1.435 ms
 Execution Time: 1824860.937 ms

-- set enable_nestloop to off
 Planning Time: 2.479 ms
 Execution Time: 7984.360 ms

Disabling nestloop makes query 11 run more than 228 times faster.

I believe you can find more such queries in TPC-DS if you keep
looking.  Given this, I don't think it makes much sense to debug a
performance regression on TPC-DS with nestloop enabled.

Matheus, I wonder if you could help run TPC-DS again with this patch,
this time with nested loops disabled for all queries.
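
(That is, with something like the following set in each session before
running the queries:)

    SET enable_nestloop = off;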

- Richard



Re: Eager aggregation, take 3

From
Matheus Alcantara
Date:
On Thu Oct 2, 2025 at 5:49 AM -03, Richard Guo wrote:
> On Thu, Oct 2, 2025 at 10:39 AM Richard Guo <guofenglinux@gmail.com> wrote:
>> It seems eager aggregation doesn't cope well with parallel plans for
>> this query.  Looking into it.
>
> It turns out that this is not related to parallel plans but rather to
> poor size estimates.
>
> [ ... ]

> Matheus, I wonder if you could help run TPC-DS again with this patch,
> this time with nested loops disabled for all queries.
>
Thanks for all the details. I've disabled the nested loops and executed
the benchmark again, and the results look much better! I see a 55%
improvement on query 31 on my machine now (macOS, M3 Max).

The only query where I see a considerable regression is query 23, which
shows a 23% worse execution time. I'm attaching the EXPLAIN (ANALYZE)
output from master and from the patched version in case it's
interesting.

I'm also attaching a CSV with the planning time and execution time from
master and the patched version for all queries. It contains the %
difference between the executions; negative numbers mean that the
patched version using eager aggregation is faster. (I loaded this CSV
into a Postgres table and played with some queries to analyze the
results.)

I'm just wondering: is there anything that can be done in the planner
to prevent this type of situation?

--
Matheus Alcantara

Attachments