Discussion: DML and column count in aggregated subqueries


DML and column count in aggregated subqueries

From
Andres Freund
Date:
Hi,

while hacking up nodeAgg.c to use a dedicated slot/desc for the
hashtable, an assert I placed brought up a weird case:

regression[29535][1]=# EXPLAIN VERBOSE update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
┌───────────────────────────────────────────────────────────────────────────────────────┐
│                                      QUERY PLAN                                       │
├───────────────────────────────────────────────────────────────────────────────────────┤
│ Update on public.bar  (cost=42.75..109.25 rows=1130 width=20)                         │
│   ->  Hash Join  (cost=42.75..109.25 rows=1130 width=20)                              │
│         Output: bar.f1, (bar.f2 + 100), bar.ctid, foo.ctid                            │
│         Hash Cond: (bar.f1 = foo.f1)                                                  │
│         ->  Seq Scan on public.bar  (cost=0.00..32.60 rows=2260 width=14)             │
│               Output: bar.f1, bar.f2, bar.ctid                                        │
│         ->  Hash  (cost=40.25..40.25 rows=200 width=10)                               │
│               Output: foo.ctid, foo.f1                                                │
│               ->  HashAggregate  (cost=38.25..40.25 rows=200 width=10)                │
│                     Output: foo.ctid, foo.f1                                          │
│                     Group Key: foo.f1                                                 │
│                     ->  Seq Scan on public.foo  (cost=0.00..32.60 rows=2260 width=10) │
│                           Output: foo.ctid, foo.f1                                    │
└───────────────────────────────────────────────────────────────────────────────────────┘
(13 rows)

This doesn't look right. The ctid shouldn't be in the aggregate output -
after all, it's pretty much meaningless here.

Note that in this case find_hash_columns(aggstate) will return two
columns, but node->numCols will be 1 (and thus node->grpOperators will
have only one element).

So it seems we're doing something wrong here.

Casting a wider net: find_hash_columns() and its subroutines seem like
pretty much dead code, including outdated comments (look for "SQL99
semantics"). The main explanation says:

/*
 * Create a list of the tuple columns that actually need to be stored in
 * hashtable entries.  The incoming tuples from the child plan node will
 * contain grouping columns, other columns referenced in our targetlist and
 * qual, columns used to compute the aggregate functions, and perhaps just
 * junk columns we don't use at all.  Only columns of the first two types
 * need to be stored in the hashtable, and getting rid of the others can
 * make the table entries significantly smaller.  To avoid messing up Var
 * numbering, we keep the same tuple descriptor for hashtable entries as the
 * incoming tuples have, but set unwanted columns to NULL in the tuples that
 * go into the table.
 */

but that seems bogus - when are we referencing "other columns referenced
in our targetlist and qual" from the representative tuple that aren't
also grouped upon? This method would select a pretty arbitrary value for
those, since we don't hash them, so it'll just be whatever the first
tuple for the group contained?

Greetings,

Andres Freund



Re: DML and column count in aggregated subqueries

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> this doesn't look right. The ctid shouldn't be in the aggregate output -
> after all it's pretty much meaningless here.

I suspect it's being added to support EvalPlanQual row re-fetches.

> Casting a wider net: find_hash_columns() and it's subroutines seem like
> pretty much dead code, including outdated comments (look for "SQL99
> semantics").

Hm, maybe.  In principle the planner could do that instead, but it doesn't
look like it actually does at the moment; I'm not seeing any distinction
in tlist-building for AGG_HASHED vs other cases in create_agg_plan().
As an example:

regression=# explain verbose select avg(ten), hundred from tenk1 group by 2;
                                QUERY PLAN
---------------------------------------------------------------------------
 HashAggregate  (cost=508.00..509.25 rows=100 width=36)
   Output: avg(ten), hundred
   Group Key: tenk1.hundred
   ->  Seq Scan on public.tenk1  (cost=0.00..458.00 rows=10000 width=8)
         Output: unique1, unique2, two, four, ten, twenty, hundred, thousand, twothousand, fivethous, tenthous, odd, even, stringu1, stringu2, string4
(5 rows)

If you wanted to forgo find_hash_columns() then it would be important
for the seqscan to output a minimal tlist rather than try to optimize
away its projection step.

I'm not that excited about making such a change in terms of performance:
you'd essentially be skipping a handmade projection step inside nodeAgg
at the cost of probably adding one at the input node, as in this example.
But it might be worth doing anyway just on the grounds that this ought to
be the planner's domain not a custom hack in the executor.
        regards, tom lane



Re: DML and column count in aggregated subqueries

From
Andres Freund
Date:
On 2016-10-31 09:35:57 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > this doesn't look right. The ctid shouldn't be in the aggregate output -
> > after all it's pretty much meaningless here.
> 
> I suspect it's being added to support EvalPlanQual row re-fetches.

Hm, that doesn't seem particularly meaningful though? I wonder if it's
not actually more an accident than anything. Looks like
preprocess_rowmarks() adds them pretty unconditionally to everything for DML.


> > Casting a wider net: find_hash_columns() and it's subroutines seem like
> > pretty much dead code, including outdated comments (look for "SQL99
> > semantics").
> 
> Hm, maybe.  In principle the planner could do that instead, but it doesn't
> look like it actually does at the moment; I'm not seeing any distinction
> in tlist-building for AGG_HASHED vs other cases in create_agg_plan().

Isn't the important part what's inside Agg->numCols/grpColIdx/grpOperators?
Those should be minimal - and it looks like we do the right thing there
already (and if not, we'd be in trouble anyway afaics).

We already only store the columns that find_hash_columns() figures out to
be required in the hashtable, but afaics we could instead just iterate
over grpColIdx[] and store only the ones listed there.


The reason I'm looking into this in the first place is that the slot
access inside execGrouping turns out to be pretty expensive, especially
because all the tuples contain nulls, so we can't even skip columns and
such. I wrote code to create a new tupledesc containing only the columns
referenced by grpColIdx, and asserted that it's the same set that
find_hash_columns() computes - but that's not always the case, though
afaics only spuriously.

Greetings,

Andres Freund