Thread: Use BumpContext contexts for TupleHashTables' tablecxt


Use BumpContext contexts for TupleHashTables' tablecxt

From:
Tom Lane
Date:
[ starting a new thread to keep this separate from the estimation
issue ]

I looked at the callers of BuildTupleHashTable, and realized that
(a) every one of them can use a BumpContext for the "tablecxt",
and (b) Jeff Davis already noticed that for nodeAgg.c, in commit
cc721c459.  So we have precedent for the idea working.  Here's
a fleshed-out patch to fix the remaining callers.
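
To make the usage pattern concrete, here is a minimal sketch (invented
names, not code from the patch; assumes the usual backend includes such
as "postgres.h" and "utils/memutils.h") of the only two things anyone
does with this context:

MemoryContext tuplescxt = BumpContextCreate(CurrentMemoryContext,
                                            "hashed tuples",
                                            ALLOCSET_DEFAULT_SIZES);

/* execGrouping.c only ever allocates new hash entries in it ... */
void       *entry = MemoryContextAlloc(tuplescxt, (Size) 64);

/* ... and callers only ever reset the whole context at once; nothing
 * needs pfree(), which is exactly the contract a bump allocator
 * supports. */
MemoryContextReset(tuplescxt);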

            regards, tom lane

From 15ef2e50085ac83728cb9189e482964ff02e5aae Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sun, 26 Oct 2025 16:46:57 -0400
Subject: [PATCH v1] Use BumpContext contexts for TupleHashTables' tablecxt.

execGrouping.c itself does nothing with this context except to
allocate new hash entries in it, and the callers do nothing with it
except to reset the whole context.  So this is an ideal use-case for
a BumpContext, and the hash tables are frequently big enough for the
savings to be significant.

Commit cc721c459 already taught nodeAgg.c this idea, but neglected
the other callers of BuildTupleHashTable.

(There is a side issue of the extent to which this change invalidates
the planner's estimates of hashtable memory consumption.  However,
those estimates are already pretty bad, so improving them seems
like it can be a separate project.)

Reported-by: Jeff Janes <jeff.janes@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAMkU=1zia0JfW_QR8L5xA2vpa0oqVuiapm78h=WpNsHH13_9uw@mail.gmail.com
---
 src/backend/executor/execGrouping.c       |  1 +
 src/backend/executor/nodeRecursiveunion.c |  9 +++++----
 src/backend/executor/nodeSetOp.c          | 10 ++++++----
 src/backend/executor/nodeSubplan.c        |  6 +++---
 4 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index 75087204f0c..75023182325 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -147,6 +147,7 @@ execTuplesHashPrepare(int numCols,
  *    additionalsize: size of data that may be stored along with the hash entry
  *    metacxt: memory context for long-lived allocation, but not per-entry data
  *    tablecxt: memory context in which to store table entries
+ *        (it's usually desirable for tablecxt to be a BumpContext)
  *    tempcxt: short-lived context for evaluation hash and comparison functions
  *    use_variable_hash_iv: if true, adjust hash IV per-parallel-worker
  *
diff --git a/src/backend/executor/nodeRecursiveunion.c b/src/backend/executor/nodeRecursiveunion.c
index 40f66fd0680..6d920955fc5 100644
--- a/src/backend/executor/nodeRecursiveunion.c
+++ b/src/backend/executor/nodeRecursiveunion.c
@@ -209,7 +209,8 @@ ExecInitRecursiveUnion(RecursiveUnion *node, EState *estate, int eflags)
      * If hashing, we need a per-tuple memory context for comparisons, and a
      * longer-lived context to store the hash table.  The table can't just be
      * kept in the per-query context because we want to be able to throw it
-     * away when rescanning.
+     * away when rescanning.  We can use a BumpContext to save storage,
+     * because we will have no need to delete individual table entries.
      */
     if (node->numCols > 0)
     {
@@ -218,9 +219,9 @@ ExecInitRecursiveUnion(RecursiveUnion *node, EState *estate, int eflags)
                                   "RecursiveUnion",
                                   ALLOCSET_DEFAULT_SIZES);
         rustate->tableContext =
-            AllocSetContextCreate(CurrentMemoryContext,
-                                  "RecursiveUnion hash table",
-                                  ALLOCSET_DEFAULT_SIZES);
+            BumpContextCreate(CurrentMemoryContext,
+                              "RecursiveUnion hash table",
+                              ALLOCSET_DEFAULT_SIZES);
     }

     /*
diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index 4068481a523..ce42ea6a549 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -589,13 +589,15 @@ ExecInitSetOp(SetOp *node, EState *estate, int eflags)
     /*
      * If hashing, we also need a longer-lived context to store the hash
      * table.  The table can't just be kept in the per-query context because
-     * we want to be able to throw it away in ExecReScanSetOp.
+     * we want to be able to throw it away in ExecReScanSetOp.  We can use a
+     * BumpContext to save storage, because we will have no need to delete
+     * individual table entries.
      */
     if (node->strategy == SETOP_HASHED)
         setopstate->tableContext =
-            AllocSetContextCreate(CurrentMemoryContext,
-                                  "SetOp hash table",
-                                  ALLOCSET_DEFAULT_SIZES);
+            BumpContextCreate(CurrentMemoryContext,
+                              "SetOp hash table",
+                              ALLOCSET_DEFAULT_SIZES);

     /*
      * initialize child nodes
diff --git a/src/backend/executor/nodeSubplan.c b/src/backend/executor/nodeSubplan.c
index 53fb56f7388..8930e0398a7 100644
--- a/src/backend/executor/nodeSubplan.c
+++ b/src/backend/executor/nodeSubplan.c
@@ -891,9 +891,9 @@ ExecInitSubPlan(SubPlan *subplan, PlanState *parent)

         /* We need a memory context to hold the hash table(s) */
         sstate->hashtablecxt =
-            AllocSetContextCreate(CurrentMemoryContext,
-                                  "Subplan HashTable Context",
-                                  ALLOCSET_DEFAULT_SIZES);
+            BumpContextCreate(CurrentMemoryContext,
+                              "Subplan HashTable Context",
+                              ALLOCSET_DEFAULT_SIZES);
         /* and a short-lived exprcontext for function evaluation */
         sstate->innerecontext = CreateExprContext(estate);

--
2.43.7


Re: Use BumpContext contexts for TupleHashTables' tablecxt

From:
David Rowley
Date:
On Mon, 27 Oct 2025 at 09:55, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> [ starting a new thread to keep this separate from the estimation
> issue ]
>
> I looked at the callers of BuildTupleHashTable, and realized that
> (a) every one of them can use a BumpContext for the "tablecxt",
> and (b) Jeff Davis already noticed that for nodeAgg.c, in commit
> cc721c459.  So we have precedent for the idea working.  Here's
> a fleshed-out patch to fix the remaining callers.

Yeah, this should give a decent performance improvement for larger workloads.

I can't help wondering if we can improve the memory context parameter
names in BuildTupleHashTable(). Every time I see "tablecxt" I have to
remind myself that it's not for the bucket array, just the stuff we
have the buckets point to. Would "hashedtuplecxt" be better?

If we did that, then the context names could use some of the same
treatment, "RecursiveUnion hashed tuples" and "SetOp hashed tuples" or
something.

The field name for TupleHashTableData.tablecxt and the comment (/*
memory context containing table */) also lead me to think it's for
the bucket array.

Maybe I'm unique in having that problem...

David



Re: Use BumpContext contexts for TupleHashTables' tablecxt

From:
Tom Lane
Date:
David Rowley <dgrowleyml@gmail.com> writes:
> I can't help wonder if we can improve the memory context parameter
> names in BuildTupleHashTable(). Every time I see "tablecxt" I have to
> remind myself that it's not for the bucket array, just the stuff we
> have the buckets point to. Would "hashedtuplecxt" be better?

I agree these names are not great.  I think they might be leftovers
from a time when there actually was a complete hash-table structure
in that context.

Related to this, while I was chasing Jeff's complaint I realized that
the none-too-small simplehash table for this is getting made in the
query's ExecutorState.  That's pretty awful from the standpoint of
being able to blame memory consumption on the hash node.  I'm not
sure though if we want to go so far as to make another context just
for the simplehash table.  We could keep it in that same "tablecxt"
at the price of destroying and rebuilding the simplehash table, not
just resetting it, at each node rescan.  But that's not ideal either.

            regards, tom lane



Re: Use BumpContext contexts for TupleHashTables' tablecxt

From:
David Rowley
Date:
On Mon, 27 Oct 2025 at 11:11, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Related to this, while I was chasing Jeff's complaint I realized that
> the none-too-small simplehash table for this is getting made in the
> query's ExecutorState.  That's pretty awful from the standpoint of
> being able to blame memory consumption on the hash node.  I'm not
> sure though if we want to go so far as to make another context just
> for the simplehash table.  We could keep it in that same "tablecxt"
> at the price of destroying and rebuilding the simplehash table, not
> just resetting it, at each node rescan.  But that's not ideal either.

I don't think you could do that and keep your patch as it is, since
SH_GROW() needs to pfree the old bucket array after rehashing, which
bump won't like.
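
To spell that out with a sketch (invented names, not code from the
tree; bump.c simply raises an error on pfree):

MemoryContext cxt = BumpContextCreate(CurrentMemoryContext,
                                      "bump demo",
                                      ALLOCSET_DEFAULT_SIZES);
void       *chunk = MemoryContextAlloc(cxt, (Size) 64);

pfree(chunk);    /* elog(ERROR): not supported by the bump allocator */

So if the bucket array lived in the bump context, the first SH_GROW()
that had to enlarge the table would fail.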

David



Re: Use BumpContext contexts for TupleHashTables' tablecxt

From:
Tom Lane
Date:
David Rowley <dgrowleyml@gmail.com> writes:
> On Mon, 27 Oct 2025 at 11:11, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> ... We could keep it in that same "tablecxt"

> I don't think you could do that and keep your patch as it is, since
> SH_GROW() needs to pfree the old bucket array after rehashing, which
> bump won't like.

Ah, good point.  Nevermind that idea then ...

            regards, tom lane



Re: Use BumpContext contexts for TupleHashTables' tablecxt

From:
David Rowley
Date:
On Mon, 27 Oct 2025 at 10:46, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Mon, 27 Oct 2025 at 09:55, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > [ starting a new thread to keep this separate from the estimation
> > issue ]
> >
> > I looked at the callers of BuildTupleHashTable, and realized that
> > (a) every one of them can use a BumpContext for the "tablecxt",
> > and (b) Jeff Davis already noticed that for nodeAgg.c, in commit
> > cc721c459.  So we have precedent for the idea working.  Here's
> > a fleshed-out patch to fix the remaining callers.
>
> Yeah, this should give a decent performance improvement for larger workloads.

I just did a quick test of this with the best-case I could imagine,
where all rows are filtered, thus reducing the additional overhead of
going into other nodes. Patched came out about 9% faster than master
(without MEMORY_CONTEXT_CHECKING).

Test 1)
Setup 1 million rows:
create table t1 as select generate_Series(1,1_000_000)a;
vacuum freeze analyze t1;
set work_mem = '1GB';
set jit=0;
\timing on

Master:
select * from t1 except select * from t1;
Time: 343.660 ms
Time: 341.049 ms
Time: 352.762 ms

Patched:
select * from t1 except select * from t1;
Time: 312.576 ms
Time: 317.629 ms
Time: 323.980 ms (+8.73%)

Test 2)
Setup 10 million rows:
create table t2 as select generate_Series(1,10_000_000)a;
vacuum freeze analyze t2;
set work_mem = '1GB';
set jit=0;
\timing on

Master:
select * from t2 except select * from t2;
Time: 4737.736 ms
Time: 4660.170 ms
Time: 4749.984 ms

Patched:
select * from t2 except select * from t2;
Time: 4307.161 ms
Time: 4339.134 ms
Time: 4279.902 ms (+9.45%)

David



Re: Use BumpContext contexts for TupleHashTables' tablecxt

From:
Chao Li
Date:

> On Oct 27, 2025, at 04:55, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>
> I looked at the callers of BuildTupleHashTable, and realized that
> (a) every one of them can use a BumpContext for the "tablecxt",
> and (b) Jeff Davis already noticed that for nodeAgg.c, in commit
> cc721c459.  So we have precedent for the idea working.  Here's
> a fleshed-out patch to fix the remaining callers.
>
> regards, tom lane
>
> From 15ef2e50085ac83728cb9189e482964ff02e5aae Mon Sep 17 00:00:00 2001
> From: Tom Lane <tgl@sss.pgh.pa.us>
> Date: Sun, 26 Oct 2025 16:46:57 -0400
> Subject: [PATCH v1] Use BumpContext contexts for TupleHashTables' tablecxt.
>
> execGrouping.c itself does nothing with this context except to
> allocate new hash entries in it, and the callers do nothing with it
> except to reset the whole context.  So this is an ideal use-case for
> a BumpContext, and the hash tables are frequently big enough for the
> savings to be significant.
>

I just did a test. My test procedure was:

1. Set the following options in postgresql.conf:
```
work_mem = '512MB'
max_parallel_workers = 0
shared_buffers = '1GB'
log_statement_stats = on
```

2. Prepare some data
```
CREATE TABLE t1 AS
  SELECT g AS id, md5(g::text) AS txt
  FROM generate_series(1, 5e6) g;

CREATE TABLE t2 AS
  SELECT g AS id, md5(g::text) AS txt
  FROM generate_series(1, 5e6) g
  WHERE g % 2 = 0;

VACUUM FREEZE t1;
VACUUM FREEZE t2;
```

3. Run
```
\timing on
EXPLAIN (ANALYZE, BUFFERS)
  SELECT * FROM t1
  INTERSECT
  SELECT * FROM t2;
```

Here is the test result:

Without the patch:

- stats:
```
2025-10-27 11:04:43.145 CST [88599] LOG:  QUERY STATISTICS
2025-10-27 11:04:43.145 CST [88599] DETAIL:  ! system usage stats:
    !    1.609178 s user, 0.051652 s system, 1.661308 s elapsed
    !    [1.612991 s user, 0.056877 s system total]
    !    601392 kB max resident size
    !    0/0 [0/0] filesystem blocks in/out
    !    1/37016 [1/37621] page faults/reclaims, 0 [0] swaps
    !    0 [0] signals rcvd, 0/0 [2/1] messages rcvd/sent
    !    0/18 [0/44] voluntary/involuntary context switches
```

- SQL execution:
```
evantest=# EXPLAIN (ANALYZE, BUFFERS)
  SELECT * FROM t1
  INTERSECT
  SELECT * FROM t2;
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 HashSetOp Intersect  (cost=174993.73..181243.73 rows=2500000 width=39) (actual time=1500.129..1565.293 rows=2500000.00 loops=1)
   Buffers: shared read=62501
   ->  Seq Scan on t2  (cost=0.00..45834.00 rows=2500000 width=39) (actual time=0.323..68.214 rows=2500000.00 loops=1)
         Buffers: shared read=20834
   ->  Seq Scan on t1  (cost=0.00..91662.15 rows=4999515 width=39) (actual time=0.077..113.602 rows=5000000.00 loops=1)
         Buffers: shared read=41667
 Planning:
   Buffers: shared hit=75 read=25
 Planning Time: 1.800 ms
 Execution Time: 1655.722 ms
(10 rows)

Time: 1661.561 ms (00:01.662)
```

With the patch:

- stats:
```
2025-10-27 11:02:14.656 CST [87387] LOG:  QUERY STATISTICS
2025-10-27 11:02:14.656 CST [87387] DETAIL:  ! system usage stats:
    !    1.625830 s user, 0.046199 s system, 1.672351 s elapsed
    !    [1.630466 s user, 0.052191 s system total]
    !    487264 kB max resident size
    !    0/0 [0/0] filesystem blocks in/out
    !    1/29891 [1/30489] page faults/reclaims, 0 [0] swaps
    !    0 [0] signals rcvd, 0/0 [2/1] messages rcvd/sent
    !    0/11 [0/36] voluntary/involuntary context switches
```

- SQL execution:
```
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 HashSetOp Intersect  (cost=174993.73..181243.73 rows=2500000 width=39) (actual time=1471.164..1602.767 rows=2500000.00 loops=1)
   Buffers: shared read=62501
   ->  Seq Scan on t2  (cost=0.00..45834.00 rows=2500000 width=39) (actual time=0.296..67.955 rows=2500000.00 loops=1)
         Buffers: shared read=20834
   ->  Seq Scan on t1  (cost=0.00..91662.15 rows=4999515 width=39) (actual time=0.088..114.052 rows=5000000.00 loops=1)
         Buffers: shared read=41667
 Planning:
   Buffers: shared hit=75 read=25
 Planning Time: 1.875 ms
 Execution Time: 1666.881 ms
(10 rows)

Time: 1672.574 ms (00:01.673)
```

Looks like:

* No obvious execution time improvement
* Max resident size reduced from 600MB to 487MB, ~19% reduction
* Page reclaims dropped from 37k -> 30k, ~19% reduction

I ran the test for several rounds, and the results are consistent: roughly a 19% memory usage reduction.
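
As a very rough sanity check on where the saving comes from (numbers
approximate, assuming a 64-bit non-assert build): aset.c rounds each
hashed-tuple allocation up to a power of two and adds a per-chunk
header, while bump.c charges only MAXALIGN of the request with no
per-chunk header.  For rows of width 39 like these, that's several
tens of bytes saved per entry, and across 2.5 million hashed tuples
it lands in the same ballpark as the 601392 kB -> 487264 kB drop
above.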

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/



Re: Use BumpContext contexts for TupleHashTables' tablecxt

From:
Tom Lane
Date:
David Rowley <dgrowleyml@gmail.com> writes:
>> On Mon, 27 Oct 2025 at 09:55, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I looked at the callers of BuildTupleHashTable, and realized that
>>> (a) every one of them can use a BumpContext for the "tablecxt",
>>> and (b) Jeff Davis already noticed that for nodeAgg.c, in commit
>>> cc721c459.  So we have precedent for the idea working.  Here's
>>> a fleshed-out patch to fix the remaining callers.

> I just did a quick test of this with the best-case I could imagine,
> where all rows are filtered, thus reducing the additional overhead of
> going into other nodes. Patched came out about 9% faster than master
> (without MEMORY_CONTEXT_CHECKING).

Hmm, I wasn't really expecting any direct time saving; the point
was about cutting memory consumption.  So Chao Li's nearby results
are in line with mine.

            regards, tom lane



Re: Use BumpContext contexts for TupleHashTables' tablecxt

From:
David Rowley
Date:
On Mon, 27 Oct 2025 at 16:55, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Hmm, I wasn't really expecting any direct time saving; the point
> was about cutting memory consumption.  So Chao Li's nearby results
> are in line with mine.

It's for the same reason that Hash Join starts to run more slowly once
the hash table is larger than L3. Because the memory access pattern
when probing the hash table can't be predicted by the CPU, larger
tables will start having to shuffle cachelines in from RAM more often.
The same happens with smaller tables when having to go from L2 out to
L3 (and even L1d out to L2). If you graphed various different table
sizes, you'd see the performance dropping off per hash lookup as the
memory usage crosses cache size boundaries.

What you've done by using bump is made it so that more tuples will fit
in the same amount of memory, therefore increasing the chances that
useful cachelines are found.

If you happened to always probe the hash table in hash key order, then
this probably wouldn't happen (or at least to a lesser extent) as the
hardware prefetcher would see the forward pattern and prefetch the
memory.

David



Re: Use BumpContext contexts for TupleHashTables' tablecxt

From:
Tom Lane
Date:
I wrote:
> David Rowley <dgrowleyml@gmail.com> writes:
>> I can't help wondering if we can improve the memory context parameter
>> names in BuildTupleHashTable(). Every time I see "tablecxt" I have to
>> remind myself that it's not for the bucket array, just the stuff we
>> have the buckets point to. Would "hashedtuplecxt" be better?

> I agree these names are not great.  I think they might be leftovers
> from a time when there actually was a complete hash-table structure
> in that context.

Looking closer, it seems like a lot of the confusion here arose when
TupleHashTables were rebased onto simplehash.h.  That patch seems to
have paid about zero attention to updating comments that it'd
falsified or at least made misleading.  Here's a v2 that tries to
clean up some of the mess.

I'm also proposing to move the MemoryContextReset of that context
into ResetTupleHashTable instead of documenting that we expect the
caller to do that.  The old way just seems like a bizarre artifact.

> Related to this, while I was chasing Jeff's complaint I realized that
> the none-too-small simplehash table for this is getting made in the
> query's ExecutorState.  That's pretty awful from the standpoint of
> being able to blame memory consumption on the hash node.

I didn't do anything about this.  I briefly considered having
BuildTupleHashTable construct a per-hashtable context, but backed
off after seeing that nodeAgg.c goes to some effort to use the same
metacxt and tuplescxt for several hashtables.  I'm not sure if that
had measured performance benefits behind it, but it well might have.
I now think that if we care, the right thing is to make the remaining
callers construct extra memory contexts for their hash tables.  That
could be left for another day.

            regards, tom lane

From bf6b2ee1c941811efc4bb836627e6f2dcc8ae2e8 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Wed, 29 Oct 2025 15:31:59 -0400
Subject: [PATCH v2] Use BumpContext contexts in TupleHashTables, and do some
 code cleanup.

For all extant uses of TupleHashTables, execGrouping.c itself does
nothing with the "tablecxt" except to allocate new hash entries in it,
and the callers do nothing with it except to reset the whole context.
So this is an ideal use-case for a BumpContext, and the hash tables
are frequently big enough for the savings to be significant.

(Commit cc721c459 already taught nodeAgg.c this idea, but neglected
the other callers of BuildTupleHashTable.)

While at it, let's clean up some ill-advised leftovers from rebasing
TupleHashTables on simplehash.h:

* Many comments and variable names were based on the idea that the
tablecxt holds the whole TupleHashTable, whereas now it only holds the
hashed tuples (plus any caller-defined "additional storage").  Rename
to names like tuplescxt and tuplesContext, and adjust the comments.
Also adjust the memory context names to be like "<Foo> hashed tuples".

* Make ResetTupleHashTable() reset the tuplescxt rather than relying
on the caller to do so; that was a recipe for leaks.  This is less
efficient in the case where nodeAgg.c uses the same tuplescxt for
several different hashtables, but only microscopically so because
mcxt.c will short-circuit the extra resets via its isReset flag.
I judge the extra safety and intellectual cleanliness well worth
those few cycles.

* Remove the long-obsolete "allow_jit" check added by ac88807f9;
instead, just Assert that metacxt and tuplescxt are different.
We need that anyway for this definition of ResetTupleHashTable() to
be safe.

A loose end not addressed in this version is that the "entrysize"
calculation in BuildTupleHashTable is ridiculous on its face:
"sizeof(TupleHashEntryData) + additionalsize" corresponds neither
to the size of the simplehash entries nor to the total space needed
per tuple.  We lack info about the expected tuple width, so computing
the latter would require an API change.  I kind of wonder why
BuildTupleHashTable is second-guessing its caller's nbuckets choice
at all; wasn't that computed earlier with more information?

There is also a side issue of the extent to which this change
invalidates the planner's estimates of hashtable memory consumption.
However, those estimates are already pretty bad, so improving them
seems like it can be a separate project.

Reported-by: Jeff Janes <jeff.janes@gmail.com>
Reported-by: David Rowley <dgrowleyml@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAMkU=1zia0JfW_QR8L5xA2vpa0oqVuiapm78h=WpNsHH13_9uw@mail.gmail.com
Discussion: https://postgr.es/m/2268409.1761512111@sss.pgh.pa.us
---
 src/backend/executor/execGrouping.c       | 57 ++++++++++++++---------
 src/backend/executor/nodeAgg.c            | 31 ++++++------
 src/backend/executor/nodeRecursiveunion.c | 23 ++++-----
 src/backend/executor/nodeSetOp.c          | 22 ++++-----
 src/backend/executor/nodeSubplan.c        | 17 ++++---
 src/include/executor/executor.h           |  2 +-
 src/include/nodes/execnodes.h             | 20 +++++---
 7 files changed, 90 insertions(+), 82 deletions(-)

diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index 75087204f0c..621d382405c 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -145,8 +145,8 @@ execTuplesHashPrepare(int numCols,
  *    collations: collations to use in comparisons
  *    nbuckets: initial estimate of hashtable size
  *    additionalsize: size of data that may be stored along with the hash entry
- *    metacxt: memory context for long-lived allocation, but not per-entry data
- *    tablecxt: memory context in which to store table entries
+ *    metacxt: memory context for long-lived data and the simplehash table
+ *    tuplescxt: memory context in which to store the hashed tuples themselves
  *    tempcxt: short-lived context for evaluation hash and comparison functions
  *    use_variable_hash_iv: if true, adjust hash IV per-parallel-worker
  *
@@ -157,6 +157,13 @@ execTuplesHashPrepare(int numCols,
  * Note that the keyColIdx, hashfunctions, and collations arrays must be
  * allocated in storage that will live as long as the hashtable does.
  *
+ * The metacxt and tuplescxt are separate because it's usually desirable for
+ * tuplescxt to be a BumpContext to avoid memory wastage, while metacxt must
+ * support pfree in case the simplehash table needs to be enlarged.  (We could
+ * simplify the API of TupleHashTables by managing the tuplescxt internally.
+ * But that would be disadvantageous to nodeAgg.c and nodeSubplan.c, which use
+ * a single tuplescxt for multiple TupleHashTables that are reset together.)
+ *
  * LookupTupleHashEntry, FindTupleHashEntry, and related functions may leak
  * memory in the tempcxt.  It is caller's responsibility to reset that context
  * reasonably often, typically once per tuple.  (We do it that way, rather
@@ -175,7 +182,7 @@ BuildTupleHashTable(PlanState *parent,
                     long nbuckets,
                     Size additionalsize,
                     MemoryContext metacxt,
-                    MemoryContext tablecxt,
+                    MemoryContext tuplescxt,
                     MemoryContext tempcxt,
                     bool use_variable_hash_iv)
 {
@@ -183,14 +190,24 @@ BuildTupleHashTable(PlanState *parent,
     Size        entrysize;
     Size        hash_mem_limit;
     MemoryContext oldcontext;
-    bool        allow_jit;
     uint32        hash_iv = 0;

     Assert(nbuckets > 0);
+
+    /* tuplescxt must be separate, else ResetTupleHashTable breaks things */
+    Assert(metacxt != tuplescxt);
+
+    /* ensure additionalsize is maxalign'ed */
     additionalsize = MAXALIGN(additionalsize);
-    entrysize = sizeof(TupleHashEntryData) + additionalsize;

-    /* Limit initial table size request to not more than hash_mem */
+    /*
+     * Limit initial table size request to not more than hash_mem.
+     *
+     * XXX this calculation seems pretty misguided, as it counts only overhead
+     * and not the tuples themselves.  But we have no knowledge of the
+     * expected tuple width here.
+     */
+    entrysize = sizeof(TupleHashEntryData) + additionalsize;
     hash_mem_limit = get_hash_memory_limit() / entrysize;
     if (nbuckets > hash_mem_limit)
         nbuckets = hash_mem_limit;
@@ -202,7 +219,7 @@ BuildTupleHashTable(PlanState *parent,
     hashtable->numCols = numCols;
     hashtable->keyColIdx = keyColIdx;
     hashtable->tab_collations = collations;
-    hashtable->tablecxt = tablecxt;
+    hashtable->tuplescxt = tuplescxt;
     hashtable->tempcxt = tempcxt;
     hashtable->additionalsize = additionalsize;
     hashtable->tableslot = NULL;    /* will be made on first lookup */
@@ -230,16 +247,6 @@ BuildTupleHashTable(PlanState *parent,
     hashtable->tableslot = MakeSingleTupleTableSlot(CreateTupleDescCopy(inputDesc),
                                                     &TTSOpsMinimalTuple);

-    /*
-     * If the caller fails to make the metacxt different from the tablecxt,
-     * allowing JIT would lead to the generated functions to a) live longer
-     * than the query or b) be re-generated each time the table is being
-     * reset. Therefore prevent JIT from being used in that case, by not
-     * providing a parent node (which prevents accessing the JitContext in the
-     * EState).
-     */
-    allow_jit = (metacxt != tablecxt);
-
     /* build hash ExprState for all columns */
     hashtable->tab_hash_expr = ExecBuildHash32FromAttrs(inputDesc,
                                                         inputOps,
@@ -247,7 +254,7 @@ BuildTupleHashTable(PlanState *parent,
                                                         collations,
                                                         numCols,
                                                         keyColIdx,
-                                                        allow_jit ? parent : NULL,
+                                                        parent,
                                                         hash_iv);

     /* build comparator for all columns */
@@ -256,7 +263,7 @@ BuildTupleHashTable(PlanState *parent,
                                                     &TTSOpsMinimalTuple,
                                                     numCols,
                                                     keyColIdx, eqfuncoids, collations,
-                                                    allow_jit ? parent : NULL);
+                                                    parent);

     /*
      * While not pretty, it's ok to not shut down this context, but instead
@@ -273,13 +280,17 @@ BuildTupleHashTable(PlanState *parent,

 /*
  * Reset contents of the hashtable to be empty, preserving all the non-content
- * state. Note that the tablecxt passed to BuildTupleHashTable() should
- * also be reset, otherwise there will be leaks.
+ * state.
+ *
+ * Note: in usages where several TupleHashTables share a tuplescxt, we will be
+ * redundantly resetting the tuplescxt.  But because of mcxt.c's isReset flag,
+ * that's cheap enough that we need not avoid it.
  */
 void
 ResetTupleHashTable(TupleHashTable hashtable)
 {
     tuplehash_reset(hashtable->hashtab);
+    MemoryContextReset(hashtable->tuplescxt);
 }

 /*
@@ -489,10 +500,10 @@ LookupTupleHashEntry_internal(TupleHashTable hashtable, TupleTableSlot *slot,
             /* created new entry */
             *isnew = true;

-            MemoryContextSwitchTo(hashtable->tablecxt);
+            MemoryContextSwitchTo(hashtable->tuplescxt);

             /*
-             * Copy the first tuple into the table context, and request
+             * Copy the first tuple into the tuples context, and request
              * additionalsize extra bytes before the allocation.
              *
              * The caller can get a pointer to the additional data with
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 64643c3943a..5b247f4f966 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -1457,7 +1457,7 @@ find_cols_walker(Node *node, FindColsContext *context)
  * We have a separate hashtable and associated perhash data structure for each
  * grouping set for which we're doing hashing.
  *
- * The contents of the hash tables always live in the hashcontext's per-tuple
+ * The contents of the hash tables live in the aggstate's hash_tuplescxt
  * memory context (there is only one of these for all tables together, since
  * they are all reset at the same time).
  */
@@ -1509,7 +1509,7 @@ build_hash_table(AggState *aggstate, int setno, long nbuckets)
 {
     AggStatePerHash perhash = &aggstate->perhash[setno];
     MemoryContext metacxt = aggstate->hash_metacxt;
-    MemoryContext tablecxt = aggstate->hash_tablecxt;
+    MemoryContext tuplescxt = aggstate->hash_tuplescxt;
     MemoryContext tmpcxt = aggstate->tmpcontext->ecxt_per_tuple_memory;
     Size        additionalsize;

@@ -1535,7 +1535,7 @@ build_hash_table(AggState *aggstate, int setno, long nbuckets)
                                              nbuckets,
                                              additionalsize,
                                              metacxt,
-                                             tablecxt,
+                                             tuplescxt,
                                              tmpcxt,
                                              DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit));
 }
@@ -1868,7 +1868,7 @@ hash_agg_check_limits(AggState *aggstate)
     uint64        ngroups = aggstate->hash_ngroups_current;
     Size        meta_mem = MemoryContextMemAllocated(aggstate->hash_metacxt,
                                                      true);
-    Size        entry_mem = MemoryContextMemAllocated(aggstate->hash_tablecxt,
+    Size        entry_mem = MemoryContextMemAllocated(aggstate->hash_tuplescxt,
                                                       true);
     Size        tval_mem = MemoryContextMemAllocated(aggstate->hashcontext->ecxt_per_tuple_memory,
                                                      true);
@@ -1959,7 +1959,7 @@ hash_agg_update_metrics(AggState *aggstate, bool from_tape, int npartitions)
     meta_mem = MemoryContextMemAllocated(aggstate->hash_metacxt, true);

     /* memory for hash entries */
-    entry_mem = MemoryContextMemAllocated(aggstate->hash_tablecxt, true);
+    entry_mem = MemoryContextMemAllocated(aggstate->hash_tuplescxt, true);

     /* memory for byref transition states */
     hashkey_mem = MemoryContextMemAllocated(aggstate->hashcontext->ecxt_per_tuple_memory, true);
@@ -2042,11 +2042,11 @@ hash_create_memory(AggState *aggstate)
     /* and no smaller than ALLOCSET_DEFAULT_INITSIZE */
     maxBlockSize = Max(maxBlockSize, ALLOCSET_DEFAULT_INITSIZE);

-    aggstate->hash_tablecxt = BumpContextCreate(aggstate->ss.ps.state->es_query_cxt,
-                                                "HashAgg table context",
-                                                ALLOCSET_DEFAULT_MINSIZE,
-                                                ALLOCSET_DEFAULT_INITSIZE,
-                                                maxBlockSize);
+    aggstate->hash_tuplescxt = BumpContextCreate(aggstate->ss.ps.state->es_query_cxt,
+                                                 "HashAgg hashed tuples",
+                                                 ALLOCSET_DEFAULT_MINSIZE,
+                                                 ALLOCSET_DEFAULT_INITSIZE,
+                                                 maxBlockSize);

 }

@@ -2707,7 +2707,6 @@ agg_refill_hash_table(AggState *aggstate)

     /* free memory and reset hash tables */
     ReScanExprContext(aggstate->hashcontext);
-    MemoryContextReset(aggstate->hash_tablecxt);
     for (int setno = 0; setno < aggstate->num_hashes; setno++)
         ResetTupleHashTable(aggstate->perhash[setno].hashtable);

@@ -4433,13 +4432,12 @@ ExecEndAgg(AggState *node)
         MemoryContextDelete(node->hash_metacxt);
         node->hash_metacxt = NULL;
     }
-    if (node->hash_tablecxt != NULL)
+    if (node->hash_tuplescxt != NULL)
     {
-        MemoryContextDelete(node->hash_tablecxt);
-        node->hash_tablecxt = NULL;
+        MemoryContextDelete(node->hash_tuplescxt);
+        node->hash_tuplescxt = NULL;
     }

-
     for (transno = 0; transno < node->numtrans; transno++)
     {
         AggStatePerTrans pertrans = &node->pertrans[transno];
@@ -4555,8 +4553,7 @@ ExecReScanAgg(AggState *node)
         node->hash_ngroups_current = 0;

         ReScanExprContext(node->hashcontext);
-        MemoryContextReset(node->hash_tablecxt);
-        /* Rebuild an empty hash table */
+        /* Rebuild empty hash table(s) */
         build_hash_tables(node);
         node->table_filled = false;
         /* iterator will be reset when the table is filled */
diff --git a/src/backend/executor/nodeRecursiveunion.c b/src/backend/executor/nodeRecursiveunion.c
index 40f66fd0680..00269d254e0 100644
--- a/src/backend/executor/nodeRecursiveunion.c
+++ b/src/backend/executor/nodeRecursiveunion.c
@@ -53,7 +53,7 @@ build_hash_table(RecursiveUnionState *rustate)
                                              node->numGroups,
                                              0,
                                              rustate->ps.state->es_query_cxt,
-                                             rustate->tableContext,
+                                             rustate->tuplesContext,
                                              rustate->tempContext,
                                              false);
 }
@@ -197,7 +197,7 @@ ExecInitRecursiveUnion(RecursiveUnion *node, EState *estate, int eflags)
     rustate->hashfunctions = NULL;
     rustate->hashtable = NULL;
     rustate->tempContext = NULL;
-    rustate->tableContext = NULL;
+    rustate->tuplesContext = NULL;

     /* initialize processing state */
     rustate->recursing = false;
@@ -209,7 +209,8 @@ ExecInitRecursiveUnion(RecursiveUnion *node, EState *estate, int eflags)
      * If hashing, we need a per-tuple memory context for comparisons, and a
      * longer-lived context to store the hash table.  The table can't just be
      * kept in the per-query context because we want to be able to throw it
-     * away when rescanning.
+     * away when rescanning.  We can use a BumpContext to save storage,
+     * because we will have no need to delete individual table entries.
      */
     if (node->numCols > 0)
     {
@@ -217,10 +218,10 @@ ExecInitRecursiveUnion(RecursiveUnion *node, EState *estate, int eflags)
             AllocSetContextCreate(CurrentMemoryContext,
                                   "RecursiveUnion",
                                   ALLOCSET_DEFAULT_SIZES);
-        rustate->tableContext =
-            AllocSetContextCreate(CurrentMemoryContext,
-                                  "RecursiveUnion hash table",
-                                  ALLOCSET_DEFAULT_SIZES);
+        rustate->tuplesContext =
+            BumpContextCreate(CurrentMemoryContext,
+                              "RecursiveUnion hashed tuples",
+                              ALLOCSET_DEFAULT_SIZES);
     }

     /*
@@ -291,8 +292,8 @@ ExecEndRecursiveUnion(RecursiveUnionState *node)
     /* free subsidiary stuff including hashtable */
     if (node->tempContext)
         MemoryContextDelete(node->tempContext);
-    if (node->tableContext)
-        MemoryContextDelete(node->tableContext);
+    if (node->tuplesContext)
+        MemoryContextDelete(node->tuplesContext);

     /*
      * close down subplans
@@ -328,10 +329,6 @@ ExecReScanRecursiveUnion(RecursiveUnionState *node)
     if (outerPlan->chgParam == NULL)
         ExecReScan(outerPlan);

-    /* Release any hashtable storage */
-    if (node->tableContext)
-        MemoryContextReset(node->tableContext);
-
     /* Empty hashtable if needed */
     if (plan->numCols > 0)
         ResetTupleHashTable(node->hashtable);
diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index 4068481a523..97f975b1c09 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -106,7 +106,7 @@ build_hash_table(SetOpState *setopstate)
                                                 node->numGroups,
                                                 sizeof(SetOpStatePerGroupData),
                                                 setopstate->ps.state->es_query_cxt,
-                                                setopstate->tableContext,
+                                                setopstate->tuplesContext,
                                                 econtext->ecxt_per_tuple_memory,
                                                 false);
 }
@@ -589,13 +589,15 @@ ExecInitSetOp(SetOp *node, EState *estate, int eflags)
     /*
      * If hashing, we also need a longer-lived context to store the hash
      * table.  The table can't just be kept in the per-query context because
-     * we want to be able to throw it away in ExecReScanSetOp.
+     * we want to be able to throw it away in ExecReScanSetOp.  We can use a
+     * BumpContext to save storage, because we will have no need to delete
+     * individual table entries.
      */
     if (node->strategy == SETOP_HASHED)
-        setopstate->tableContext =
-            AllocSetContextCreate(CurrentMemoryContext,
-                                  "SetOp hash table",
-                                  ALLOCSET_DEFAULT_SIZES);
+        setopstate->tuplesContext =
+            BumpContextCreate(CurrentMemoryContext,
+                              "SetOp hashed tuples",
+                              ALLOCSET_DEFAULT_SIZES);

     /*
      * initialize child nodes
@@ -681,8 +683,8 @@ void
 ExecEndSetOp(SetOpState *node)
 {
     /* free subsidiary stuff including hashtable */
-    if (node->tableContext)
-        MemoryContextDelete(node->tableContext);
+    if (node->tuplesContext)
+        MemoryContextDelete(node->tuplesContext);

     ExecEndNode(outerPlanState(node));
     ExecEndNode(innerPlanState(node));
@@ -721,10 +723,6 @@ ExecReScanSetOp(SetOpState *node)
             return;
         }

-        /* Release any hashtable storage */
-        if (node->tableContext)
-            MemoryContextReset(node->tableContext);
-
         /* And rebuild an empty hashtable */
         ResetTupleHashTable(node->hashtable);
         node->table_filled = false;
diff --git a/src/backend/executor/nodeSubplan.c b/src/backend/executor/nodeSubplan.c
index 53fb56f7388..9f6e45bcb0b 100644
--- a/src/backend/executor/nodeSubplan.c
+++ b/src/backend/executor/nodeSubplan.c
@@ -506,7 +506,6 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
      * saves a needless fetch inner op step for the hashing ExprState created
      * in BuildTupleHashTable().
      */
-    MemoryContextReset(node->hashtablecxt);
     node->havehashrows = false;
     node->havenullrows = false;

@@ -528,7 +527,7 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
                                               nbuckets,
                                               0,
                                               node->planstate->state->es_query_cxt,
-                                              node->hashtablecxt,
+                                              node->tuplesContext,
                                               innerecontext->ecxt_per_tuple_memory,
                                               false);

@@ -557,7 +556,7 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
                                                   nbuckets,
                                                   0,
                                                   node->planstate->state->es_query_cxt,
-                                                  node->hashtablecxt,
+                                                  node->tuplesContext,
                                                   innerecontext->ecxt_per_tuple_memory,
                                                   false);
     }
@@ -838,7 +837,7 @@ ExecInitSubPlan(SubPlan *subplan, PlanState *parent)
     sstate->projRight = NULL;
     sstate->hashtable = NULL;
     sstate->hashnulls = NULL;
-    sstate->hashtablecxt = NULL;
+    sstate->tuplesContext = NULL;
     sstate->innerecontext = NULL;
     sstate->keyColIdx = NULL;
     sstate->tab_eq_funcoids = NULL;
@@ -889,11 +888,11 @@ ExecInitSubPlan(SubPlan *subplan, PlanState *parent)
                    *righttlist;
         ListCell   *l;

-        /* We need a memory context to hold the hash table(s) */
-        sstate->hashtablecxt =
-            AllocSetContextCreate(CurrentMemoryContext,
-                                  "Subplan HashTable Context",
-                                  ALLOCSET_DEFAULT_SIZES);
+        /* We need a memory context to hold the hash table(s)' tuples */
+        sstate->tuplesContext =
+            BumpContextCreate(CurrentMemoryContext,
+                              "SubPlan hashed tuples",
+                              ALLOCSET_DEFAULT_SIZES);
         /* and a short-lived exprcontext for function evaluation */
         sstate->innerecontext = CreateExprContext(estate);

diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 0ba86c2ad72..a6046b35966 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -141,7 +141,7 @@ extern TupleHashTable BuildTupleHashTable(PlanState *parent,
                                           long nbuckets,
                                           Size additionalsize,
                                           MemoryContext metacxt,
-                                          MemoryContext tablecxt,
+                                          MemoryContext tuplescxt,
                                           MemoryContext tempcxt,
                                           bool use_variable_hash_iv);
 extern TupleHashEntry LookupTupleHashEntry(TupleHashTable hashtable,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index a36653c37f9..18ae8f0d4bb 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -844,9 +844,15 @@ typedef struct ExecAuxRowMark
 typedef struct TupleHashEntryData *TupleHashEntry;
 typedef struct TupleHashTableData *TupleHashTable;

+/*
+ * TupleHashEntryData is a slot in the tuplehash_hash table.  If it's
+ * populated, it contains a pointer to a MinimalTuple that can also have
+ * associated "additional data".  That's stored in the TupleHashTable's
+ * tuplescxt.
+ */
 typedef struct TupleHashEntryData
 {
-    MinimalTuple firstTuple;    /* copy of first tuple in this group */
+    MinimalTuple firstTuple;    /* -> copy of first tuple in this group */
     uint32        status;            /* hash status */
     uint32        hash;            /* hash value (cached) */
 } TupleHashEntryData;
@@ -861,13 +867,13 @@ typedef struct TupleHashEntryData

 typedef struct TupleHashTableData
 {
-    tuplehash_hash *hashtab;    /* underlying hash table */
+    tuplehash_hash *hashtab;    /* underlying simplehash hash table */
     int            numCols;        /* number of columns in lookup key */
     AttrNumber *keyColIdx;        /* attr numbers of key columns */
     ExprState  *tab_hash_expr;    /* ExprState for hashing table datatype(s) */
     ExprState  *tab_eq_func;    /* comparator for table datatype(s) */
     Oid           *tab_collations; /* collations for hash and comparison */
-    MemoryContext tablecxt;        /* memory context containing table */
+    MemoryContext tuplescxt;    /* memory context storing hashed tuples */
     MemoryContext tempcxt;        /* context for function evaluations */
     Size        additionalsize; /* size of additional data */
     TupleTableSlot *tableslot;    /* slot for referencing table entries */
@@ -1017,7 +1023,7 @@ typedef struct SubPlanState
     TupleHashTable hashnulls;    /* hash table for rows with null(s) */
     bool        havehashrows;    /* true if hashtable is not empty */
     bool        havenullrows;    /* true if hashnulls is not empty */
-    MemoryContext hashtablecxt; /* memory context containing hash tables */
+    MemoryContext tuplesContext;    /* context containing hash tables' tuples */
     ExprContext *innerecontext; /* econtext for computing inner tuples */
     int            numCols;        /* number of columns being hashed */
     /* each of the remaining fields is an array of length numCols: */
@@ -1566,7 +1572,7 @@ typedef struct RecursiveUnionState
     FmgrInfo   *hashfunctions;    /* per-grouping-field hash fns */
     MemoryContext tempContext;    /* short-term context for comparisons */
     TupleHashTable hashtable;    /* hash table for tuples already seen */
-    MemoryContext tableContext; /* memory context containing hash table */
+    MemoryContext tuplesContext;    /* context containing hash table's tuples */
 } RecursiveUnionState;

 /* ----------------
@@ -2567,7 +2573,7 @@ typedef struct AggState
     bool        table_filled;    /* hash table filled yet? */
     int            num_hashes;
     MemoryContext hash_metacxt; /* memory for hash table bucket array */
-    MemoryContext hash_tablecxt;    /* memory for hash table entries */
+    MemoryContext hash_tuplescxt;    /* memory for hash table tuples */
     struct LogicalTapeSet *hash_tapeset;    /* tape set for hash spill tapes */
     struct HashAggSpill *hash_spills;    /* HashAggSpill for each grouping set,
                                          * exists only during first pass */
@@ -2866,7 +2872,7 @@ typedef struct SetOpState
     Oid           *eqfuncoids;        /* per-grouping-field equality fns */
     FmgrInfo   *hashfunctions;    /* per-grouping-field hash fns */
     TupleHashTable hashtable;    /* hash table with one entry per group */
-    MemoryContext tableContext; /* memory context containing hash table */
+    MemoryContext tuplesContext;    /* context containing hash table's tuples */
     bool        table_filled;    /* hash table filled yet? */
     TupleHashIterator hashiter; /* for iterating through hash table */
 } SetOpState;
--
2.43.7


Re: Use BumpContext contexts for TupleHashTables' tablecxt

From:
David Rowley
Date:
On Thu, 30 Oct 2025 at 08:51, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I wrote:
> > David Rowley <dgrowleyml@gmail.com> writes:
> >> I can't help wondering if we can improve the memory context parameter
> >> names in BuildTupleHashTable(). Every time I see "tablecxt" I have to
> >> remind myself that it's not for the bucket array, just the stuff we
> >> have the buckets point to. Would "hashedtuplecxt" be better?
>
> > I agree these names are not great.  I think they might be leftovers
> > from a time when there actually was a complete hash-table structure
> > in that context.
>
> Looking closer, it seems like a lot of the confusion here arose when
> TupleHashTables were rebased onto simplehash.h.  That patch seems to
> have paid about zero attention to updating comments that it'd
> falsified or at least made misleading.  Here's a v2 that tries to
> clean up some of the mess.

Thanks for doing that cleanup. A good improvement. Just a couple of
notes from the review for your consideration:

1) The comment in ExecEndSetOp() says: "/* free subsidiary stuff
including hashtable */". How about adjusting that to "hashtable data"?
Is it worth mentioning there that we leave the destruction of the hash
table itself to when the ExecutorState context is reset?
(Apparently execGrouping.c does not expose any way to do
tuplehash_destroy() anyway)

2) I'm not sure if it makes it neater or not, but did you consider
moving the creation of the setopstate->tuplesContext context into
build_hash_table()? The motivation there is that it keeps the
hash-related initialisations together.

3) Not for this patch, but wanted to note the observation: I see we
always create the hash table during plan startup in nodeSetOp.c. I
suspect this could cause the same issue for Setops as I fixed in
Memoize in 57f59396b, which fixed a slow EXPLAIN in [1]. That was
mostly down to a bug that overestimated the number of buckets, but
there's legitimate reasons to want lots of buckets too... a big table
and lots of work_mem.

create table t (a int);
insert into t values(1);
alter table t alter column a set (n_distinct = -1); -- all values distinct
analyze t;
-- hack to fool the planner into thinking it's a big table.
update pg_class set reltuples = 1e9 where oid = 't'::regclass;
set work_mem = '4GB';
\timing on
explain (summary on) select a from t except select a from t;

 HashSetOp Except  (cost=25000002.00..27500002.00 rows=1000000000 width=4)
   Disabled: true
   ->  Seq Scan on t  (cost=0.00..10000001.00 rows=1000000000 width=4)
   ->  Seq Scan on t t_1  (cost=0.00..10000001.00 rows=1000000000 width=4)
 Planning Time: 0.091 ms <-- not a long time
(5 rows)

Time: 1658.866 ms (00:01.659) <-- a long time.

> I'm also proposing to move the MemoryContextReset of that context
> into ResetTupleHashTable instead of documenting that we expect the
> caller to do that.  The old way just seems like a bizarre artifact.
>
> > Related to this, while I was chasing Jeff's complaint I realized that
> > the none-too-small simplehash table for this is getting made in the
> > query's ExecutorState.  That's pretty awful from the standpoint of
> > being able to blame memory consumption on the hash node.
>
> I didn't do anything about this.  I briefly considered having
> BuildTupleHashTable construct a per-hashtable context, but backed
> off after seeing that nodeAgg.c goes to some effort to use the same
> metacxt and tuplescxt for several hashtables.  I'm not sure if that
> had measured performance benefits behind it, but it well might have.
> I now think that if we care, the right thing is to make the remaining
> callers construct extra memory contexts for their hash tables.  That
> could be left for another day.

Reminds me of 590b045c3, but I think maybe not as bad, since the bucket
array for the hash table is more likely to be a large allocation,
which aset.c will put on a dedicated AllocBlock rather than sharing an
AllocBlock with other chunks. ResetTupleHashTable() also just results
in a SH_RESET(), which doesn't do any pfreeing work (just memset()).
i.e. shouldn't cause fragmentation due to many rescans. Before
590b045c3, we used to pfree() all tuples at tuplestore_end(), which
happens in a Materialize node's rescan. That resulted in bad
fragmentation of the ExecutorState context. Nothing like that is
happening here, so I'm not that worried about it, though I agree that
it isn't ideal.

David

[1] https://postgr.es/m/CAFj8pRAMp%3DQsMi6sPQJ4W3hczoFJRvyXHJV3AZAZaMyTVM312Q%40mail.gmail.com



Re: Use BumpContext contexts for TupleHashTables' tablecxt

From:
Jeff Davis
Date:
On Sun, 2025-10-26 at 18:11 -0400, Tom Lane wrote:
> Related to this, while I was chasing Jeff's complaint I realized that
> the none-too-small simplehash table for this is getting made in the
> query's ExecutorState.  That's pretty awful from the standpoint of
> being able to blame memory consumption on the hash node.  I'm not
> sure though if we want to go so far as to make another context just
> for the simplehash table.  We could keep it in that same "tablecxt"
> at the price of destroying and rebuilding the simplehash table, not
> just resetting it, at each node rescan.  But that's not ideal either.

I had investigated the idea of destroying/rebuilding the simplehash
table regardless because, in some cases, it can crowd out space for the
transition states, and we end up with a mostly-empty bucket array that
causes recursive spilling.

I'm not sure if that's a practical problem -- all the heuristics are
designed to avoid that situation -- but I can create the situation with
injection points.

Also, in the case of spilled grouping sets, once one group is
completely done with all spilled data it would be good to destroy that
hash table. By adjusting the order we process the spilled data, we
could minimize the creation/destruction of the bucket array.

However, rebuilding it has a cost, so we should only do that when there
is really a problem. I came to the conclusion it would require
significant refactoring to make that work well, and I didn't get around
to it yet.

Regards,
    Jeff Davis




Re: Use BumpContext contexts for TupleHashTables' tablecxt

From:
Tom Lane
Date:
David Rowley <dgrowleyml@gmail.com> writes:
> On Thu, 30 Oct 2025 at 08:51, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Here's a v2 that tries to clean up some of the mess.

> Thanks for doing that cleanup. A good improvement.

Thanks for reviewing!

> Just a couple of
> notes from the review for your consideration:

> 1) The comment in ExecEndSetOp() says: "/* free subsidiary stuff
> including hashtable */". How about adjusting that to "hashtable data"?

Agreed, also in ExecEndRecursiveUnion.

> Is it worth mentioning there that we leave it up to the destruction of
> the hash table itself to when the ExecutorState context is reset?
> (Apparently execGrouping.c does not expose any way to do
> tuplehash_destroy() anyway)

Yeah.  I think this is better documented in execGrouping.c.  I propose
adding this:

@@ -169,6 +169,13 @@ execTuplesHashPrepare(int numCols,
  * reasonably often, typically once per tuple.  (We do it that way, rather
  * than managing an extra context within the hashtable, because in many cases
  * the caller can specify a tempcxt that it needs to reset per-tuple anyway.)
+ *
+ * We don't currently provide DestroyTupleHashTable functionality; the hash
+ * table will be cleaned up at destruction of the metacxt.  (Some callers
+ * bother to delete the tuplescxt explicitly, though it'd be sufficient to
+ * ensure it's a child of the metacxt.)  There's not much point in working
+ * harder than this so long as the expression-evaluation infrastructure
+ * behaves similarly.
  */
 TupleHashTable
 BuildTupleHashTable(PlanState *parent,


> 2) I'm not sure if it makes it neater or not, but did you consider
> moving the creation of the setopstate->tuplesContext context into
> build_hash_table()? The motivation there is that it keeps the
> hash-related initialisations together.

Hmm, could do that, but all of these callers currently follow the
pattern of creating these contexts in the plan nodes' Init functions
and destroying them in the End functions.  I don't see a big advantage
in changing that.

> 3) Not for this patch, but wanted to note the observation: I see we
> always create the hash table during plan startup in nodeSetOp.c. I
> suspect this could cause the same issue for Setops as I fixed in
> Memoize in 57f59396b, which fixed a slow EXPLAIN in [1].

Fair point.  I'd be inclined to just skip the hashtable creation
if (eflags & EXEC_FLAG_EXPLAIN_ONLY), like nodeAgg.c did.
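
Roughly like this in ExecInitSetOp() (untested sketch):

    if (node->strategy == SETOP_HASHED &&
        !(eflags & EXEC_FLAG_EXPLAIN_ONLY))
        build_hash_table(setopstate);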

I wonder though if skipping the initialization of the hashtable
execution expressions has any visible impact on what EXPLAIN can
print.  I have a vague recollection that there were some places
where we backed off similar optimizations because of concerns
like that.

>> I now think that if we care, the right thing is to make the remaining
>> callers construct extra memory contexts for their hash tables.  That
>> could be left for another day.

> Reminds me of 590b045c3, but I think maybe not as bad, since the bucket
> array for the hash table is more likely to be a large allocation,
> which aset.c will put on a dedicated AllocBlock rather than sharing an
> AllocBlock with other chunks. ResetTupleHashTable() also just results
> in a SH_RESET(), which doesn't do any pfreeing work (just memset()).
> i.e. shouldn't cause fragmentation due to many rescans.

My concern here was just whether you could tell by MemoryContextStats
inspection how much to blame on the hash table.

            regards, tom lane



Re: Use BumpContext contexts for TupleHashTables' tablecxt

From:
David Rowley
Date:
On Thu, 30 Oct 2025 at 13:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> David Rowley <dgrowleyml@gmail.com> writes:
> > Is it worth mentioning there that we leave the destruction of the
> > hash table itself to when the ExecutorState context is reset?
> > (Apparently execGrouping.c does not expose any way to do
> > tuplehash_destroy() anyway)
>
> Yeah.  I think this is better documented in execGrouping.c.  I propose
> adding this:
>
> @@ -169,6 +169,13 @@ execTuplesHashPrepare(int numCols,
>   * reasonably often, typically once per tuple.  (We do it that way, rather
>   * than managing an extra context within the hashtable, because in many cases
>   * the caller can specify a tempcxt that it needs to reset per-tuple anyway.)
> + *
> + * We don't currently provide DestroyTupleHashTable functionality; the hash
> + * table will be cleaned up at destruction of the metacxt.  (Some callers
> + * bother to delete the tuplescxt explicitly, though it'd be sufficient to
> + * ensure it's a child of the metacxt.)  There's not much point in working
> + * harder than this so long as the expression-evaluation infrastructure
> + * behaves similarly.
>   */
>  TupleHashTable
>  BuildTupleHashTable(PlanState *parent,

Looks good.

> > 2) I'm not sure if it makes it neater or not, but did you consider
> > moving the creation of the setopstate->tuplesContext context into
> > build_hash_table()? The motivation there is that it keeps the
> > hash-related initialisations together.
>
> Hmm, could do that, but all of these callers currently follow the
> pattern of creating these contexts in the plan nodes' Init functions
> and destroying them in the End functions.  I don't see a big advantage
> in changing that.

All good. I'm fine either way.

> > 3) Not for this patch, but wanted to note the observation: I see we
> > always create the hash table during plan startup in nodeSetOp.c. I
> > suspect this could cause the same issue for Setops as I fixed in
> > Memoize in 57f59396b, which fixed a slow EXPLAIN in [1].
>
> Fair point.  I'd be inclined to just skip the hashtable creation
> if (eflags & EXEC_FLAG_EXPLAIN_ONLY), like nodeAgg.c did.
>
> I wonder though if skipping the initialization of the hashtable
> execution expressions has any visible impact on what EXPLAIN can
> print.  I have a vague recollection that there were some places
> where we backed off similar optimizations because of concerns
> like that.

I recall some discussion around JIT and EXPLAIN (ANALYZE OFF). I seem
to have a branch here that disables it completely, but I recall Andres
or you rejecting the idea. I can't seem to find the thread. I agree it
would be quite strange if the JIT function count were to differ
between EXPLAIN and EXPLAIN ANALYZE, but if we have to contort our
code too much to make that work, then I do question the usefulness of
showing the JIT details in EXPLAIN (ANALYZE OFF). Anyway, maybe off
topic for this thread.

I'm fine with the patch, assuming it now includes the new
BuildTupleHashTable() comment and adjusted
ExecEndRecursiveUnion/ExecEndSetOp comments.

David



Re: Use BumpContext contexts for TupleHashTables' tablecxt

From:
Tom Lane
Date:
David Rowley <dgrowleyml@gmail.com> writes:
> I'm fine with the patch, assuming it now includes the new
> BuildTupleHashTable() comment and adjusted
> ExecEndRecursiveUnion/ExecEndSetOp comments.

OK, pushed after a tiny bit more comment-smithing.  Thanks
for reviewing!

            regards, tom lane