Discussion: Should HashSetOp go away
Many years ago I ran into some problems doing maintenance tasks that checked for short identifiers which existed in one table but not another. The planner would choose HashSetOp, which was memory inefficient, and it was also unaware of how memory inefficient it was, leading it to blow past work_mem by many times over. So if you set work_mem to a large value in order to get maintenance operations over with quickly while the system is basically single-user, it could cause crashes. It certainly isn't the only part of PostgreSQL which is memory inefficient, but it did seem particularly egregious.
I noticed some changes in this code in v18, so I wanted to revisit the issue. Under commit 27627929528e, it looks like it got 25% more memory efficient, but it thinks it got 40% more efficient, so the memory use got better but the estimation actually got worse.
Using the data:
create table jj as select lpad(x::text,15,'0') from generate_series(1,10000000) f(x);
And the dummy query:
select lpad from jj except select lpad from jj;
In v17 it needs work_mem of at least 270MB to choose HashSetOp, and then actually uses 1.3GB (as determined by 'max resident size' from log_executor_stats, which is not perfect but should be pretty close; I intentionally used a small shared_buffers so that it didn't contribute much to the memory usage). In 18devel at commit 276279 it needs work_mem of at least 160MB while actually using 1.0GB. Under 18.0, the numbers are slightly, but not meaningfully, different from commit 276279.
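For reference, a minimal sketch of how such a measurement can be reproduced (the settings shown are illustrative rather than a transcript of the original session; shared_buffers has to be made small in postgresql.conf before starting the server):

set work_mem = '270MB';          -- at/above the threshold where HashSetOp gets chosen
set log_executor_stats = on;     -- reports 'max resident size' after execution
set client_min_messages = log;   -- also show those stats in this session
select lpad from jj except select lpad from jj;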
I was thinking of ways to improve the memory usage (or at least its estimation) but decided maybe it would be better if HashSetOp went away entirely. As far as I can tell HashSetOp has nothing to recommend it other than the fact that it already exists. If we instead used an elaboration on Hash Anti Join, then it would automatically get spilling to disk, parallel operations, better estimation, and the benefits of whatever micro optimizations people lavish on the highly used HashJoin machinery but not the obscure, little-used HashSetOp.
We would need to elaborate the HashAntiJoin so that it can deduplicate one input (in the case of EXCEPT) or count the other input (in the case of EXCEPT ALL).
Is there some reason this is not feasible?
Yes, I could (and did) rewrite my query to force it to use the AntiJoin, but why should people need to do that when the planner can do it instead?
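A rewrite along these lines forces the anti-join plan (a sketch, not necessarily the exact query I used; it relies on the lpad column never being NULL here, since a fully general EXCEPT rewrite would need IS NOT DISTINCT FROM semantics, and the DISTINCT supplies EXCEPT's duplicate elimination):

select distinct a.lpad
from jj a
where not exists (select 1 from jj b where b.lpad = a.lpad);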
Cheers,
Jeff
Jeff Janes <jeff.janes@gmail.com> writes:
> I noticed some changes in this code v18, so wanted to revisit the issue.
> Under commit 27627929528e, it looks like it got 25% more memory efficient,
> but it thinks it got 40% more efficient, so the memory use got better but
> the estimation actually got worse.
Hmm, so why not fix that estimation?
> I was thinking of ways to improve the memory usage (or at least its
> estimation) but decided maybe it would be better if HashSetOp went away
> entirely. As far as I can tell HashSetOp has nothing to recommend it other
> than the fact that it already exists. If we instead used an elaboration on
> Hash Anti Join, then it would automatically get spilling to disk, parallel
> operations, better estimation, and the benefits of whatever micro
> optimizations people lavish on the highly used HashJoin machinery but not
> the obscure, little-used HashSetOp.
This seems like a pretty bad solution. It would imply exporting the
complexities of duplicate-counting for EXCEPT ALL and INTERSECT ALL
modes into the hash-join logic. We don't need that extra complexity
there (it's more than enough of a mess already), and we don't need
whatever performance hit ordinary hash joins would take.
Also, I doubt the problem is confined to nodeSetOp. I think this is
fundamentally a complaint about BuildTupleHashTable and friends being
unable to spill to disk. Since we also use that logic for hashed
aggregates, RecursiveUnion, and hashed SubPlans, getting rid of
nodeSetOp isn't going to move the needle very far.
regards, tom lane
I wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> I noticed some changes in this code v18, so wanted to revisit the issue.
>> Under commit 27627929528e, it looks like it got 25% more memory efficient,
>> but it thinks it got 40% more efficient, so the memory use got better but
>> the estimation actually got worse.
> Hmm, so why not fix that estimation?
So I poked at this a little bit, and found a few factors affecting it:
* Tuple hash tables are typically given license to use twice work_mem;
see hash_mem_multiplier. Not sure if you were accounting for that
in your test.
* create_setop_path's required-space estimate of entrysize * numGroups
was rather lame before commit 5dfc19814, and it's even more so
afterwards. It's basically only accounting for the tuples themselves,
and not either the hashtable overhead or the SetOpStatePerGroupData
counter space. With wide tuples that might disappear into the noise,
but with only 16-ish data bytes per tuple it's all about the overhead.
On my machine this example uses 80 bytes per tuple in the "SetOp hash
table" context and another 16 or more in the simplehash hashtable.
So about triple what the planner thought (rough numbers are worked out just below this list).
* We can buy some of this back nearly for free, by switching the
SetOp hash table context to be a BumpContext not an AllocSetContext.
That doesn't move the needle too much in this example with
MEMORY_CONTEXT_CHECKING on, but with it off, the SetOp hash table
consumption drops to 48 bytes/tuple. (Also, for other tuple widths,
AllocSetContext's round-up-to-power-of-2 behavior could hurt a lot
more than it does here.)
* To do better, we probably need to take this computation out of the
planner and have execGrouping.c expose a function to estimate the
TupleHashTable size for N entries and such-and-such average data
width. That in turn will require simplehash.h to expose a function
for estimating the size of its tables, because AFAICS its callers are
not supposed to know such details.
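Putting rough numbers on the second point (back-of-the-envelope figures assuming a 64-bit build, ~16 data bytes per tuple, and the current create_setop_path formula, so indicative rather than exact):

  planner's estimate per entry:
      MAXALIGN(16) + MAXALIGN(SizeofMinimalTupleHeader) = 16 + 16 = 32 bytes
  observed per entry:
      ~80 bytes in the "SetOp hash table" context
    + ~16 bytes in the simplehash array
    = ~96 bytes, i.e. roughly 3x the estimate

One way to see the per-context consumption directly is to run the EXCEPT query in one session and, from a second session, dump that backend's memory contexts to the server log (available since v14):

select pg_log_backend_memory_contexts(<pid of the backend running the query>);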
Attached is a quick-hack patch to use a BumpContext for this purpose
in nodeSetOp.c. We should probably look at whether other users of
BuildTupleHashTable can do similarly. I haven't thought hard about
what better-factorized space estimation would look like.
regards, tom lane
diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index 4068481a523..ce42ea6a549 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -589,13 +589,15 @@ ExecInitSetOp(SetOp *node, EState *estate, int eflags)
/*
* If hashing, we also need a longer-lived context to store the hash
* table. The table can't just be kept in the per-query context because
- * we want to be able to throw it away in ExecReScanSetOp.
+ * we want to be able to throw it away in ExecReScanSetOp. We can use a
+ * BumpContext to save storage, because we will have no need to delete
+ * individual table entries.
*/
if (node->strategy == SETOP_HASHED)
setopstate->tableContext =
- AllocSetContextCreate(CurrentMemoryContext,
- "SetOp hash table",
- ALLOCSET_DEFAULT_SIZES);
+ BumpContextCreate(CurrentMemoryContext,
+ "SetOp hash table",
+ ALLOCSET_DEFAULT_SIZES);
/*
* initialize child nodes
On Mon, 27 Oct 2025 at 07:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Jeff Janes <jeff.janes@gmail.com> writes:
> > I was thinking of ways to improve the memory usage (or at least its
> > estimation) but decided maybe it would be better if HashSetOp went away
> > entirely. As far as I can tell HashSetOp has nothing to recommend it other
> > than the fact that it already exists. If we instead used an elaboration on
> > Hash Anti Join, then it would automatically get spilling to disk, parallel
> > operations, better estimation, and the benefits of whatever micro
> > optimizations people lavish on the highly used HashJoin machinery but not
> > the obscure, little-used HashSetOp.
>
> This seems like a pretty bad solution. It would imply exporting the
> complexities of duplicate-counting for EXCEPT ALL and INTERSECT ALL
> modes into the hash-join logic. We don't need that extra complexity
> there (it's more than enough of a mess already), and we don't need
> whatever performance hit ordinary hash joins would take.

If Hash Joins did support IS NOT DISTINCT FROM clauses, then at least
the non-ALL cases could be done with Hash Semi Join and Hash Anti Join
for INTERSECT and EXCEPT, respectively, followed by a HashAgg. I doubt
it would be any faster for the general case, but at least it would
allow those setop queries to run when the inputs don't fit in memory.
It's not ideal though, as when the planner underestimates, Hashed
Setops could still blow up.

David
David Rowley <dgrowleyml@gmail.com> writes:
> If Hash Joins did support IS NOT DISTINCT FROM clauses, then at least
> the non-ALL cases could be done with Hash Semi Join and Hash Anti Join
> for INTERSECT and EXCEPT, respectively, followed by a HashAgg. I doubt
> it would be any faster for the general case, but at least it would
> allow those setop queries to run when the inputs don't fit in memory.
> It's not ideal though, as when the planner underestimates, Hashed
> Setops could still blow up.
Yeah. As I hinted before, I think a better answer would be to teach
TupleHashTables to be able to spill to disk at need. No idea how
much work that would be, but it would fix all users of that code
not just one of them.
regards, tom lane
I wrote:
> * create_setop_path's required-space estimate of entrysize * numGroups
> was rather lame before commit 5dfc19814, and it's even more so
> afterwards. It's basically only accounting for the tuples themselves,
> and not either the hashtable overhead or the SetOpStatePerGroupData
> counter space. With wide tuples that might disappear into the noise,
> but with only 16-ish data bytes per tuple it's all about the overhead.
> On my machine this example uses 80 bytes per tuple in the "SetOp hash
> table" context and another 16 or more in the simplehash hashtable.
> So about triple what the planner thought.
> * To do better, we probably need to take this computation out of the
> planner and have execGrouping.c expose a function to estimate the
> TupleHashTable size for N entries and such-and-such average data
> width. That in turn will require simplehash.h to expose a function
> for estimating the size of its tables, because AFAICS its callers are
> not supposed to know such details.
Here's a pair of patches to try to do better. The first one
is concerned with getting more realistic size estimates for
TupleHashTables in the planner. The second is some mop-up
that's been pending for a long time in the same area, namely
getting rid of "long int" field types in Plan nodes.
With 0001, the planner's estimate of the amount of space needed
for your example query seems to be pretty dead-on, at least in
non-debug builds.
regards, tom lane
From 83b4b2ce3732996ad941d3a2f376df0cf9e5669b Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Thu, 30 Oct 2025 18:37:06 -0400
Subject: [PATCH v1 1/2] Improve planner's estimates of tuple hash table sizes.
For several types of plan nodes that use TupleHashTables, the
planner estimated the expected size of the table as basically
numEntries * (MAXALIGN(dataWidth) + MAXALIGN(SizeofHeapTupleHeader)).
This is pretty far off, especially for small data widths, because
it doesn't account for the overhead of the simplehash.h hash table
nor for any per-tuple "additional space" the plan node may request.
Jeff Janes noted a case where the estimate was off by about a factor
of three, even though the obvious hazards such as inaccurate estimates
of numEntries or dataWidth didn't apply.
To improve matters, create functions provided by the relevant executor
modules that can estimate the required sizes with reasonable accuracy.
(We're still not accounting for effects like allocator padding, but
this at least gets the first-order effects correct.)
I added functions that can estimate the tuple table sizes for
nodeSetOp and nodeSubplan; these rely on an estimator for
TupleHashTables in general, and that in turn relies on one for
simplehash.h hash tables. That feels like kind of a lot of mechanism,
but if we take any short-cuts we're violating modularity boundaries.
The other places that use TupleHashTables are nodeAgg, which took
pains to get its numbers right already, and nodeRecursiveunion.
I did not try to improve the situation for nodeRecursiveunion because
there's nothing to improve: we are not making an estimate of the hash
table size, and it wouldn't help us to do so because we have no
non-hashed alternative implementation. On top of that, our estimate
of the number of entries to be hashed in that module is so suspect
that we'd likely often choose the wrong implementation if we did have
two ways to do it.
Reported-by: Jeff Janes <jeff.janes@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAMkU=1zia0JfW_QR8L5xA2vpa0oqVuiapm78h=WpNsHH13_9uw@mail.gmail.com
---
src/backend/executor/execGrouping.c | 48 +++++++++++++++++++++++
src/backend/executor/nodeSetOp.c | 9 +++++
src/backend/executor/nodeSubplan.c | 53 +++++++++++++++++++++++++-
src/backend/optimizer/plan/subselect.c | 45 +++++++++++-----------
src/backend/optimizer/util/pathnode.c | 12 +++---
src/include/executor/executor.h | 3 ++
src/include/executor/nodeSetOp.h | 2 +
src/include/executor/nodeSubplan.h | 4 ++
src/include/lib/simplehash.h | 47 ++++++++++++++++++++++-
9 files changed, 192 insertions(+), 31 deletions(-)
diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index b4bdaa3c305..d51b422cee9 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -14,6 +14,7 @@
*/
#include "postgres.h"
+#include "access/htup_details.h"
#include "access/parallel.h"
#include "common/hashfn.h"
#include "executor/executor.h"
@@ -302,6 +303,53 @@ ResetTupleHashTable(TupleHashTable hashtable)
MemoryContextReset(hashtable->tuplescxt);
}
+/*
+ * Estimate the amount of space needed for a TupleHashTable with nentries
+ * entries, if the tuples have average data width tupleWidth and the caller
+ * requires additionalsize extra space per entry.
+ *
+ * Return SIZE_MAX if it'd overflow size_t.
+ *
+ * nentries is "double" because this is meant for use by the planner,
+ * which typically works with double rowcount estimates. So we'd need to
+ * clamp to integer somewhere and that might as well be here. We do expect
+ * the value not to be NaN or negative, though.
+ */
+Size
+EstimateTupleHashTableSpace(double nentries,
+ Size tupleWidth,
+ Size additionalsize)
+{
+ Size sh_space;
+ double tuples_space;
+
+ /* First estimate the space needed for the simplehash table */
+ sh_space = tuplehash_estimate_space(nentries);
+
+ /* Give up if that's already too big */
+ if (sh_space >= SIZE_MAX)
+ return sh_space;
+
+ /*
+ * Compute space needed for hashed tuples with additional data. nentries
+ * must be somewhat sane, so it should be safe to compute this product.
+ * (Note that this is only accurate if MEMORY_CONTEXT_CHECKING is off,
+ * else bump.c will add a MemoryChunk header to each tuple. However, it
+ * seems undesirable for debug builds to make different planning choices
+ * than production builds, so we assume the production behavior always.)
+ */
+ tuples_space = nentries * (MAXALIGN(SizeofMinimalTupleHeader) +
+ MAXALIGN(tupleWidth) +
+ MAXALIGN(additionalsize));
+
+ /* Check for size_t overflow */
+ if (sh_space + tuples_space >= SIZE_MAX)
+ return SIZE_MAX;
+
+ /* We don't bother estimating size of the miscellaneous overhead data */
+ return (Size) (sh_space + tuples_space);
+}
+
/*
* Find or create a hashtable entry for the tuple group containing the
* given tuple. The tuple must be the same type as the hashtable entries.
diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index 7b223a7ca3a..5aabed18a09 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -111,6 +111,15 @@ build_hash_table(SetOpState *setopstate)
false);
}
+/* Planner support routine to estimate space needed for hash table */
+Size
+EstimateSetOpHashTableSpace(double nentries, Size tupleWidth)
+{
+ return EstimateTupleHashTableSpace(nentries,
+ tupleWidth,
+ sizeof(SetOpStatePerGroupData));
+}
+
/*
* We've completed processing a tuple group. Decide how many copies (if any)
* of its representative row to emit, and store the count into numOutput.
diff --git a/src/backend/executor/nodeSubplan.c b/src/backend/executor/nodeSubplan.c
index 9f6e45bcb0b..96d900a7783 100644
--- a/src/backend/executor/nodeSubplan.c
+++ b/src/backend/executor/nodeSubplan.c
@@ -525,7 +525,7 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
node->tab_hash_funcs,
node->tab_collations,
nbuckets,
- 0,
+ 0, /* no additional data */
node->planstate->state->es_query_cxt,
node->tuplesContext,
innerecontext->ecxt_per_tuple_memory,
@@ -554,7 +554,7 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
node->tab_hash_funcs,
node->tab_collations,
nbuckets,
- 0,
+ 0, /* no additional data */
node->planstate->state->es_query_cxt,
node->tuplesContext,
innerecontext->ecxt_per_tuple_memory,
@@ -636,6 +636,55 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
MemoryContextSwitchTo(oldcontext);
}
+/* Planner support routine to estimate space needed for hash table(s) */
+Size
+EstimateSubplanHashTableSpace(double nentries,
+ Size tupleWidth,
+ bool unknownEqFalse)
+{
+ Size tab1space,
+ tab2space;
+
+ /* Estimate size of main hashtable */
+ tab1space = EstimateTupleHashTableSpace(nentries,
+ tupleWidth,
+ 0 /* no additional data */ );
+
+ /* Give up if that's already too big */
+ if (tab1space >= SIZE_MAX)
+ return tab1space;
+
+ /* Done if we don't need a hashnulls table */
+ if (unknownEqFalse)
+ return tab1space;
+
+ /*
+ * Adjust the rowcount estimate in the same way as above, except that we
+ * don't bother with the special case for a single hash column. (We could
+ * handle that, but it'd be notationally painful for our caller to provide
+ * the column count, and this table adds not that much to the total
+ * estimate anyway.)
+ */
+ nentries /= 16;
+ if (nentries < 1)
+ nentries = 1;
+
+ /*
+ * It might be sane to also reduce the tupleWidth, but on the other hand
+ * we are not accounting for the space taken by the tuples' null bitmaps.
+ * Leave it alone for now.
+ */
+ tab2space = EstimateTupleHashTableSpace(nentries,
+ tupleWidth,
+ 0 /* no additional data */ );
+
+ /* Guard against overflow */
+ if (tab2space >= SIZE_MAX - tab1space)
+ return SIZE_MAX;
+
+ return tab1space + tab2space;
+}
+
/*
* execTuplesUnequal
* Return true if two tuples are definitely unequal in the indicated
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 14192a13236..ff63d20f8d5 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -20,6 +20,7 @@
#include "catalog/pg_operator.h"
#include "catalog/pg_type.h"
#include "executor/executor.h"
+#include "executor/nodeSubplan.h"
#include "miscadmin.h"
#include "nodes/makefuncs.h"
#include "nodes/nodeFuncs.h"
@@ -79,8 +80,8 @@ static Node *convert_testexpr(PlannerInfo *root,
List *subst_nodes);
static Node *convert_testexpr_mutator(Node *node,
convert_testexpr_context *context);
-static bool subplan_is_hashable(Plan *plan);
-static bool subpath_is_hashable(Path *path);
+static bool subplan_is_hashable(Plan *plan, bool unknownEqFalse);
+static bool subpath_is_hashable(Path *path, bool unknownEqFalse);
static bool testexpr_is_hashable(Node *testexpr, List *param_ids);
static bool test_opexpr_is_hashable(OpExpr *testexpr, List *param_ids);
static bool hash_ok_operator(OpExpr *expr);
@@ -283,7 +284,7 @@ make_subplan(PlannerInfo *root, Query *orig_subquery,
best_path = final_rel->cheapest_total_path;
/* Now we can check if it'll fit in hash_mem */
- if (subpath_is_hashable(best_path))
+ if (subpath_is_hashable(best_path, true))
{
SubPlan *hashplan;
AlternativeSubPlan *asplan;
@@ -524,7 +525,7 @@ build_subplan(PlannerInfo *root, Plan *plan, Path *path,
*/
if (subLinkType == ANY_SUBLINK &&
splan->parParam == NIL &&
- subplan_is_hashable(plan) &&
+ subplan_is_hashable(plan, unknownEqFalse) &&
testexpr_is_hashable(splan->testexpr, splan->paramIds))
splan->useHashTable = true;
@@ -711,19 +712,19 @@ convert_testexpr_mutator(Node *node,
* is suitable for hashing. We only look at the subquery itself.
*/
static bool
-subplan_is_hashable(Plan *plan)
+subplan_is_hashable(Plan *plan, bool unknownEqFalse)
{
- double subquery_size;
+ Size hashtablesize;
/*
- * The estimated size of the subquery result must fit in hash_mem. (Note:
- * we use heap tuple overhead here even though the tuples will actually be
- * stored as MinimalTuples; this provides some fudge factor for hashtable
- * overhead.)
+ * The estimated size of the hashtable holding the subquery result must
+ * fit in hash_mem. (Note: reject on equality, to ensure that an estimate
+ * of SIZE_MAX disables hashing regardless of the hash_mem limit.)
*/
- subquery_size = plan->plan_rows *
- (MAXALIGN(plan->plan_width) + MAXALIGN(SizeofHeapTupleHeader));
- if (subquery_size > get_hash_memory_limit())
+ hashtablesize = EstimateSubplanHashTableSpace(plan->plan_rows,
+ plan->plan_width,
+ unknownEqFalse);
+ if (hashtablesize >= get_hash_memory_limit())
return false;
return true;
@@ -735,19 +736,19 @@ subplan_is_hashable(Plan *plan)
* Identical to subplan_is_hashable, but work from a Path for the subplan.
*/
static bool
-subpath_is_hashable(Path *path)
+subpath_is_hashable(Path *path, bool unknownEqFalse)
{
- double subquery_size;
+ Size hashtablesize;
/*
- * The estimated size of the subquery result must fit in hash_mem. (Note:
- * we use heap tuple overhead here even though the tuples will actually be
- * stored as MinimalTuples; this provides some fudge factor for hashtable
- * overhead.)
+ * The estimated size of the hashtable holding the subquery result must
+ * fit in hash_mem. (Note: reject on equality, to ensure that an estimate
+ * of SIZE_MAX disables hashing regardless of the hash_mem limit.)
*/
- subquery_size = path->rows *
- (MAXALIGN(path->pathtarget->width) + MAXALIGN(SizeofHeapTupleHeader));
- if (subquery_size > get_hash_memory_limit())
+ hashtablesize = EstimateSubplanHashTableSpace(path->rows,
+ path->pathtarget->width,
+ unknownEqFalse);
+ if (hashtablesize >= get_hash_memory_limit())
return false;
return true;
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 44ac5312edd..e4fd6950fad 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -17,6 +17,7 @@
#include <math.h>
#include "access/htup_details.h"
+#include "executor/nodeSetOp.h"
#include "foreign/fdwapi.h"
#include "miscadmin.h"
#include "nodes/extensible.h"
@@ -3461,7 +3462,7 @@ create_setop_path(PlannerInfo *root,
}
else
{
- Size hashentrysize;
+ Size hashtablesize;
/*
* In hashed mode, we must read all the input before we can emit
@@ -3490,11 +3491,12 @@ create_setop_path(PlannerInfo *root,
/*
* Also disable if it doesn't look like the hashtable will fit into
- * hash_mem.
+ * hash_mem. (Note: reject on equality, to ensure that an estimate of
+ * SIZE_MAX disables hashing regardless of the hash_mem limit.)
*/
- hashentrysize = MAXALIGN(leftpath->pathtarget->width) +
- MAXALIGN(SizeofMinimalTupleHeader);
- if (hashentrysize * numGroups > get_hash_memory_limit())
+ hashtablesize = EstimateSetOpHashTableSpace(numGroups,
+ leftpath->pathtarget->width);
+ if (hashtablesize >= get_hash_memory_limit())
pathnode->path.disabled_nodes++;
}
pathnode->path.rows = outputRows;
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 8e7a5453064..086f52cff3d 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -157,6 +157,9 @@ extern TupleHashEntry FindTupleHashEntry(TupleHashTable hashtable,
ExprState *eqcomp,
ExprState *hashexpr);
extern void ResetTupleHashTable(TupleHashTable hashtable);
+extern Size EstimateTupleHashTableSpace(double nentries,
+ Size tupleWidth,
+ Size additionalsize);
#ifndef FRONTEND
/*
diff --git a/src/include/executor/nodeSetOp.h b/src/include/executor/nodeSetOp.h
index 024c6ba1fce..302936df8be 100644
--- a/src/include/executor/nodeSetOp.h
+++ b/src/include/executor/nodeSetOp.h
@@ -20,4 +20,6 @@ extern SetOpState *ExecInitSetOp(SetOp *node, EState *estate, int eflags);
extern void ExecEndSetOp(SetOpState *node);
extern void ExecReScanSetOp(SetOpState *node);
+extern Size EstimateSetOpHashTableSpace(double nentries, Size tupleWidth);
+
#endif /* NODESETOP_H */
diff --git a/src/include/executor/nodeSubplan.h b/src/include/executor/nodeSubplan.h
index a1cafbcc694..301c29d1f24 100644
--- a/src/include/executor/nodeSubplan.h
+++ b/src/include/executor/nodeSubplan.h
@@ -20,6 +20,10 @@ extern SubPlanState *ExecInitSubPlan(SubPlan *subplan, PlanState *parent);
extern Datum ExecSubPlan(SubPlanState *node, ExprContext *econtext, bool *isNull);
+extern Size EstimateSubplanHashTableSpace(double nentries,
+ Size tupleWidth,
+ bool unknownEqFalse);
+
extern void ExecReScanSetParamPlan(SubPlanState *node, PlanState *parent);
extern void ExecSetParamPlan(SubPlanState *node, ExprContext *econtext);
diff --git a/src/include/lib/simplehash.h b/src/include/lib/simplehash.h
index 9622131ede6..ae20b74df4d 100644
--- a/src/include/lib/simplehash.h
+++ b/src/include/lib/simplehash.h
@@ -125,6 +125,7 @@
#define SH_ITERATE SH_MAKE_NAME(iterate)
#define SH_ALLOCATE SH_MAKE_NAME(allocate)
#define SH_FREE SH_MAKE_NAME(free)
+#define SH_ESTIMATE_SPACE SH_MAKE_NAME(estimate_space)
#define SH_STAT SH_MAKE_NAME(stat)
/* internal helper functions (no externally visible prototypes) */
@@ -242,7 +243,10 @@ SH_SCOPE void SH_START_ITERATE_AT(SH_TYPE * tb, SH_ITERATOR * iter, uint32 at);
/* <element> *<prefix>_iterate(<prefix>_hash *tb, <prefix>_iterator *iter) */
SH_SCOPE SH_ELEMENT_TYPE *SH_ITERATE(SH_TYPE * tb, SH_ITERATOR * iter);
-/* void <prefix>_stat(<prefix>_hash *tb */
+/* size_t <prefix>_estimate_space(double nentries) */
+SH_SCOPE size_t SH_ESTIMATE_SPACE(double nentries);
+
+/* void <prefix>_stat(<prefix>_hash *tb) */
SH_SCOPE void SH_STAT(SH_TYPE * tb);
#endif /* SH_DECLARE */
@@ -305,7 +309,7 @@ SH_SCOPE void SH_STAT(SH_TYPE * tb);
/*
* Compute allocation size for hashtable. Result can be passed to
- * SH_UPDATE_PARAMETERS.
+ * SH_UPDATE_PARAMETERS. (Keep SH_ESTIMATE_SPACE in sync with this!)
*/
static inline uint64
SH_COMPUTE_SIZE(uint64 newsize)
@@ -1068,6 +1072,44 @@ SH_ITERATE(SH_TYPE * tb, SH_ITERATOR * iter)
return NULL;
}
+/*
+ * Estimate the amount of space needed for a hashtable with nentries entries.
+ * Return SIZE_MAX if that's too many entries.
+ *
+ * nentries is "double" because this is meant for use by the planner,
+ * which typically works with double rowcount estimates. So we'd need to
+ * clamp to integer somewhere and that might as well be here. We do expect
+ * the value not to be NaN or negative, though.
+ */
+SH_SCOPE size_t
+SH_ESTIMATE_SPACE(double nentries)
+{
+ uint64 size;
+ uint64 space;
+
+ /* fail immediately if we'd overrun SH_MAX_SIZE entries */
+ if (nentries >= SH_MAX_SIZE * SH_FILLFACTOR)
+ return SIZE_MAX;
+
+ /* should be safe to convert to uint64 */
+ size = (uint64) nentries;
+
+ /* supporting zero sized hashes would complicate matters */
+ size = Max(size, 2);
+
+ /* round up size to the next power of 2, that's how bucketing works */
+ size = pg_nextpower2_64(size);
+
+ /* calculate space needed for ->data */
+ space = ((uint64) sizeof(SH_ELEMENT_TYPE)) * size;
+
+ /* verify that allocation of ->data is possible on this platform */
+ if (space >= SIZE_MAX / 2)
+ return SIZE_MAX;
+
+ return (size_t) space + sizeof(SH_TYPE);
+}
+
/*
* Report some statistics about the state of the hashtable. For
* debugging/profiling purposes only.
@@ -1195,6 +1237,7 @@ SH_STAT(SH_TYPE * tb)
#undef SH_ITERATE
#undef SH_ALLOCATE
#undef SH_FREE
+#undef SH_ESTIMATE_SPACE
#undef SH_STAT
/* internal function names */
--
2.43.7
From dfeb506e7bc5838d43a59f0f79078a5c60b41011 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Fri, 31 Oct 2025 13:56:06 -0400
Subject: [PATCH v1 2/2] Change "long" numGroups fields to be Cardinality
(i.e., double).
We've been nibbling away at removing uses of "long" for a long time,
since its width is platform-dependent. Here's one more: change the
remaining "long" fields in Plan nodes to Cardinality, since the three
surviving examples all represent group-count estimates. The upstream
planner code was converted to Cardinality some time ago; for example
the corresponding fields in Path nodes are type Cardinality, as are
the arguments of the make_foo_path functions. Downstream in the
executor, it turns out that these all feed to the table-size argument
of BuildTupleHashTable. Change that to "double" as well, and fix it
so that it safely clamps out-of-range values to the uint32 limit of
simplehash.h, as was not being done before.
Essentially, this is removing all the artificial datatype-dependent
limitations on these values from upstream processing, and applying
just one clamp at the moment where we're forced to do so by the
datatype choices of simplehash.h.
Also, remove BuildTupleHashTable's misguided attempt to enforce
work_mem/hash_mem_limit. It doesn't have enough information
(particularly not the expected tuple width) to do that accurately,
and it has no real business second-guessing the caller's choice.
For all these plan types, it's really the planner's responsibility
to not choose a hashed implementation if the hashtable is expected
to exceed hash_mem_limit. The previous patch improved the
accuracy of those estimates, and even if BuildTupleHashTable had
more information it should arrive at the same conclusions.
Reported-by: Jeff Janes <jeff.janes@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAMkU=1zia0JfW_QR8L5xA2vpa0oqVuiapm78h=WpNsHH13_9uw@mail.gmail.com
---
src/backend/executor/execGrouping.c | 36 ++++++++++++-----------
src/backend/executor/nodeAgg.c | 24 +++++++--------
src/backend/executor/nodeRecursiveunion.c | 1 -
src/backend/executor/nodeSetOp.c | 1 -
src/backend/executor/nodeSubplan.c | 19 +++++-------
src/backend/optimizer/path/costsize.c | 26 ----------------
src/backend/optimizer/plan/createplan.c | 26 +++++-----------
src/include/executor/executor.h | 2 +-
src/include/nodes/plannodes.h | 6 ++--
src/include/optimizer/optimizer.h | 1 -
src/include/optimizer/planmain.h | 2 +-
11 files changed, 50 insertions(+), 94 deletions(-)
diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index d51b422cee9..b21821df1db 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -14,6 +14,8 @@
*/
#include "postgres.h"
+#include <math.h>
+
#include "access/htup_details.h"
#include "access/parallel.h"
#include "common/hashfn.h"
@@ -144,7 +146,7 @@ execTuplesHashPrepare(int numCols,
* eqfuncoids: OIDs of equality comparison functions to use
* hashfunctions: FmgrInfos of datatype-specific hashing functions to use
* collations: collations to use in comparisons
- * nbuckets: initial estimate of hashtable size
+ * nelements: initial estimate of hashtable size
* additionalsize: size of data that may be stored along with the hash entry
* metacxt: memory context for long-lived data and the simplehash table
* tuplescxt: memory context in which to store the hashed tuples themselves
@@ -187,7 +189,7 @@ BuildTupleHashTable(PlanState *parent,
const Oid *eqfuncoids,
FmgrInfo *hashfunctions,
Oid *collations,
- long nbuckets,
+ double nelements,
Size additionalsize,
MemoryContext metacxt,
MemoryContext tuplescxt,
@@ -195,12 +197,24 @@ BuildTupleHashTable(PlanState *parent,
bool use_variable_hash_iv)
{
TupleHashTable hashtable;
- Size entrysize;
- Size hash_mem_limit;
+ uint32 nbuckets;
MemoryContext oldcontext;
uint32 hash_iv = 0;
- Assert(nbuckets > 0);
+ /*
+ * tuplehash_create requires a uint32 element count, so we had better
+ * clamp the given nelements to fit in that. As long as we have to do
+ * that, we might as well protect against completely insane input like
+ * zero or NaN. But it is not our job here to enforce issues like staying
+ * within hash_mem: the caller should have done that, and we don't have
+ * enough info to second-guess.
+ */
+ if (isnan(nelements) || nelements <= 0)
+ nbuckets = 1;
+ else if (nelements >= PG_UINT32_MAX)
+ nbuckets = PG_UINT32_MAX;
+ else
+ nbuckets = (uint32) nelements;
/* tuplescxt must be separate, else ResetTupleHashTable breaks things */
Assert(metacxt != tuplescxt);
@@ -208,18 +222,6 @@ BuildTupleHashTable(PlanState *parent,
/* ensure additionalsize is maxalign'ed */
additionalsize = MAXALIGN(additionalsize);
- /*
- * Limit initial table size request to not more than hash_mem.
- *
- * XXX this calculation seems pretty misguided, as it counts only overhead
- * and not the tuples themselves. But we have no knowledge of the
- * expected tuple width here.
- */
- entrysize = sizeof(TupleHashEntryData) + additionalsize;
- hash_mem_limit = get_hash_memory_limit() / entrysize;
- if (nbuckets > hash_mem_limit)
- nbuckets = hash_mem_limit;
-
oldcontext = MemoryContextSwitchTo(metacxt);
hashtable = (TupleHashTable) palloc(sizeof(TupleHashTableData));
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 759ffeed2e6..058171d5133 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -402,12 +402,12 @@ static void find_cols(AggState *aggstate, Bitmapset **aggregated,
Bitmapset **unaggregated);
static bool find_cols_walker(Node *node, FindColsContext *context);
static void build_hash_tables(AggState *aggstate);
-static void build_hash_table(AggState *aggstate, int setno, long nbuckets);
+static void build_hash_table(AggState *aggstate, int setno, double nbuckets);
static void hashagg_recompile_expressions(AggState *aggstate, bool minslot,
bool nullcheck);
static void hash_create_memory(AggState *aggstate);
-static long hash_choose_num_buckets(double hashentrysize,
- long ngroups, Size memory);
+static double hash_choose_num_buckets(double hashentrysize,
+ double ngroups, Size memory);
static int hash_choose_num_partitions(double input_groups,
double hashentrysize,
int used_bits,
@@ -1469,7 +1469,7 @@ build_hash_tables(AggState *aggstate)
for (setno = 0; setno < aggstate->num_hashes; ++setno)
{
AggStatePerHash perhash = &aggstate->perhash[setno];
- long nbuckets;
+ double nbuckets;
Size memory;
if (perhash->hashtable != NULL)
@@ -1478,8 +1478,6 @@ build_hash_tables(AggState *aggstate)
continue;
}
- Assert(perhash->aggnode->numGroups > 0);
-
memory = aggstate->hash_mem_limit / aggstate->num_hashes;
/* choose reasonable number of buckets per hashtable */
@@ -1505,7 +1503,7 @@ build_hash_tables(AggState *aggstate)
* Build a single hashtable for this grouping set.
*/
static void
-build_hash_table(AggState *aggstate, int setno, long nbuckets)
+build_hash_table(AggState *aggstate, int setno, double nbuckets)
{
AggStatePerHash perhash = &aggstate->perhash[setno];
MemoryContext metacxt = aggstate->hash_metacxt;
@@ -2053,11 +2051,11 @@ hash_create_memory(AggState *aggstate)
/*
* Choose a reasonable number of buckets for the initial hash table size.
*/
-static long
-hash_choose_num_buckets(double hashentrysize, long ngroups, Size memory)
+static double
+hash_choose_num_buckets(double hashentrysize, double ngroups, Size memory)
{
- long max_nbuckets;
- long nbuckets = ngroups;
+ double max_nbuckets;
+ double nbuckets = ngroups;
max_nbuckets = memory / hashentrysize;
@@ -2065,7 +2063,7 @@ hash_choose_num_buckets(double hashentrysize, long ngroups, Size memory)
* Underestimating is better than overestimating. Too many buckets crowd
* out space for group keys and transition state values.
*/
- max_nbuckets >>= 1;
+ max_nbuckets /= 2;
if (nbuckets > max_nbuckets)
nbuckets = max_nbuckets;
@@ -3686,7 +3684,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
if (use_hashing)
{
Plan *outerplan = outerPlan(node);
- uint64 totalGroups = 0;
+ double totalGroups = 0;
aggstate->hash_spill_rslot = ExecInitExtraTupleSlot(estate, scanDesc,
&TTSOpsMinimalTuple);
diff --git a/src/backend/executor/nodeRecursiveunion.c b/src/backend/executor/nodeRecursiveunion.c
index ebb7919b49b..cd0ad51dcd2 100644
--- a/src/backend/executor/nodeRecursiveunion.c
+++ b/src/backend/executor/nodeRecursiveunion.c
@@ -35,7 +35,6 @@ build_hash_table(RecursiveUnionState *rustate)
TupleDesc desc = ExecGetResultType(outerPlanState(rustate));
Assert(node->numCols > 0);
- Assert(node->numGroups > 0);
/*
* If both child plans deliver the same fixed tuple slot type, we can tell
diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index 5aabed18a09..9e0f9274fb1 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -88,7 +88,6 @@ build_hash_table(SetOpState *setopstate)
TupleDesc desc = ExecGetResultType(outerPlanState(setopstate));
Assert(node->strategy == SETOP_HASHED);
- Assert(node->numGroups > 0);
/*
* If both child plans deliver the same fixed tuple slot type, we can tell
diff --git a/src/backend/executor/nodeSubplan.c b/src/backend/executor/nodeSubplan.c
index 96d900a7783..aa9fa482797 100644
--- a/src/backend/executor/nodeSubplan.c
+++ b/src/backend/executor/nodeSubplan.c
@@ -34,7 +34,6 @@
#include "miscadmin.h"
#include "nodes/makefuncs.h"
#include "nodes/nodeFuncs.h"
-#include "optimizer/optimizer.h"
#include "utils/array.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
@@ -481,7 +480,7 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
int ncols = node->numCols;
ExprContext *innerecontext = node->innerecontext;
MemoryContext oldcontext;
- long nbuckets;
+ double nentries;
TupleTableSlot *slot;
Assert(subplan->subLinkType == ANY_SUBLINK);
@@ -509,9 +508,7 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
node->havehashrows = false;
node->havenullrows = false;
- nbuckets = clamp_cardinality_to_long(planstate->plan->plan_rows);
- if (nbuckets < 1)
- nbuckets = 1;
+ nentries = planstate->plan->plan_rows;
if (node->hashtable)
ResetTupleHashTable(node->hashtable);
@@ -524,7 +521,7 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
node->tab_eq_funcoids,
node->tab_hash_funcs,
node->tab_collations,
- nbuckets,
+ nentries,
0, /* no additional data */
node->planstate->state->es_query_cxt,
node->tuplesContext,
@@ -534,12 +531,12 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
if (!subplan->unknownEqFalse)
{
if (ncols == 1)
- nbuckets = 1; /* there can only be one entry */
+ nentries = 1; /* there can only be one entry */
else
{
- nbuckets /= 16;
- if (nbuckets < 1)
- nbuckets = 1;
+ nentries /= 16;
+ if (nentries < 1)
+ nentries = 1;
}
if (node->hashnulls)
@@ -553,7 +550,7 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
node->tab_eq_funcoids,
node->tab_hash_funcs,
node->tab_collations,
- nbuckets,
+ nentries,
0, /* no additional data */
node->planstate->state->es_query_cxt,
node->tuplesContext,
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 94077e6a006..8335cf5b5c5 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -257,32 +257,6 @@ clamp_width_est(int64 tuple_width)
return (int32) tuple_width;
}
-/*
- * clamp_cardinality_to_long
- * Cast a Cardinality value to a sane long value.
- */
-long
-clamp_cardinality_to_long(Cardinality x)
-{
- /*
- * Just for paranoia's sake, ensure we do something sane with negative or
- * NaN values.
- */
- if (isnan(x))
- return LONG_MAX;
- if (x <= 0)
- return 0;
-
- /*
- * If "long" is 64 bits, then LONG_MAX cannot be represented exactly as a
- * double. Casting it to double and back may well result in overflow due
- * to rounding, so avoid doing that. We trust that any double value that
- * compares strictly less than "(double) LONG_MAX" will cast to a
- * representable "long" value.
- */
- return (x < (double) LONG_MAX) ? (long) x : LONG_MAX;
-}
-
/*
* cost_seqscan
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 63fe6637155..8af091ba647 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -226,7 +226,7 @@ static RecursiveUnion *make_recursive_union(List *tlist,
Plan *righttree,
int wtParam,
List *distinctList,
- long numGroups);
+ Cardinality numGroups);
static BitmapAnd *make_bitmap_and(List *bitmapplans);
static BitmapOr *make_bitmap_or(List *bitmapplans);
static NestLoop *make_nestloop(List *tlist,
@@ -301,7 +301,7 @@ static Gather *make_gather(List *qptlist, List *qpqual,
int nworkers, int rescan_param, bool single_copy, Plan *subplan);
static SetOp *make_setop(SetOpCmd cmd, SetOpStrategy strategy,
List *tlist, Plan *lefttree, Plan *righttree,
- List *groupList, long numGroups);
+ List *groupList, Cardinality numGroups);
static LockRows *make_lockrows(Plan *lefttree, List *rowMarks, int epqParam);
static Result *make_gating_result(List *tlist, Node *resconstantqual,
Plan *subplan);
@@ -2564,7 +2564,6 @@ create_setop_plan(PlannerInfo *root, SetOpPath *best_path, int flags)
List *tlist = build_path_tlist(root, &best_path->path);
Plan *leftplan;
Plan *rightplan;
- long numGroups;
/*
* SetOp doesn't project, so tlist requirements pass through; moreover we
@@ -2575,16 +2574,13 @@ create_setop_plan(PlannerInfo *root, SetOpPath *best_path, int flags)
rightplan = create_plan_recurse(root, best_path->rightpath,
flags | CP_LABEL_TLIST);
- /* Convert numGroups to long int --- but 'ware overflow! */
- numGroups = clamp_cardinality_to_long(best_path->numGroups);
-
plan = make_setop(best_path->cmd,
best_path->strategy,
tlist,
leftplan,
rightplan,
best_path->groupList,
- numGroups);
+ best_path->numGroups);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -2604,7 +2600,6 @@ create_recursiveunion_plan(PlannerInfo *root, RecursiveUnionPath *best_path)
Plan *leftplan;
Plan *rightplan;
List *tlist;
- long numGroups;
/* Need both children to produce same tlist, so force it */
leftplan = create_plan_recurse(root, best_path->leftpath, CP_EXACT_TLIST);
@@ -2612,15 +2607,12 @@ create_recursiveunion_plan(PlannerInfo *root, RecursiveUnionPath *best_path)
tlist = build_path_tlist(root, &best_path->path);
- /* Convert numGroups to long int --- but 'ware overflow! */
- numGroups = clamp_cardinality_to_long(best_path->numGroups);
-
plan = make_recursive_union(tlist,
leftplan,
rightplan,
best_path->wtParam,
best_path->distinctList,
- numGroups);
+ best_path->numGroups);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -5845,7 +5837,7 @@ make_recursive_union(List *tlist,
Plan *righttree,
int wtParam,
List *distinctList,
- long numGroups)
+ Cardinality numGroups)
{
RecursiveUnion *node = makeNode(RecursiveUnion);
Plan *plan = &node->plan;
@@ -6582,15 +6574,11 @@ Agg *
make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
- List *groupingSets, List *chain, double dNumGroups,
+ List *groupingSets, List *chain, Cardinality numGroups,
Size transitionSpace, Plan *lefttree)
{
Agg *node = makeNode(Agg);
Plan *plan = &node->plan;
- long numGroups;
-
- /* Reduce to long, but 'ware overflow! */
- numGroups = clamp_cardinality_to_long(dNumGroups);
node->aggstrategy = aggstrategy;
node->aggsplit = aggsplit;
@@ -6822,7 +6810,7 @@ make_gather(List *qptlist,
static SetOp *
make_setop(SetOpCmd cmd, SetOpStrategy strategy,
List *tlist, Plan *lefttree, Plan *righttree,
- List *groupList, long numGroups)
+ List *groupList, Cardinality numGroups)
{
SetOp *node = makeNode(SetOp);
Plan *plan = &node->plan;
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 086f52cff3d..fa2b657fb2f 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -138,7 +138,7 @@ extern TupleHashTable BuildTupleHashTable(PlanState *parent,
const Oid *eqfuncoids,
FmgrInfo *hashfunctions,
Oid *collations,
- long nbuckets,
+ double nelements,
Size additionalsize,
MemoryContext metacxt,
MemoryContext tuplescxt,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 7cdd2b51c94..c4393a94321 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -475,7 +475,7 @@ typedef struct RecursiveUnion
Oid *dupCollations pg_node_attr(array_size(numCols));
/* estimated number of groups in input */
- long numGroups;
+ Cardinality numGroups;
} RecursiveUnion;
/* ----------------
@@ -1206,7 +1206,7 @@ typedef struct Agg
Oid *grpCollations pg_node_attr(array_size(numCols));
/* estimated number of groups in input */
- long numGroups;
+ Cardinality numGroups;
/* for pass-by-ref transition data */
uint64 transitionSpace;
@@ -1446,7 +1446,7 @@ typedef struct SetOp
bool *cmpNullsFirst pg_node_attr(array_size(numCols));
/* estimated number of groups in left input */
- long numGroups;
+ Cardinality numGroups;
} SetOp;
/* ----------------
diff --git a/src/include/optimizer/optimizer.h b/src/include/optimizer/optimizer.h
index a34113903c0..d0aa8ab0c1c 100644
--- a/src/include/optimizer/optimizer.h
+++ b/src/include/optimizer/optimizer.h
@@ -82,7 +82,6 @@ extern PGDLLIMPORT int effective_cache_size;
extern double clamp_row_est(double nrows);
extern int32 clamp_width_est(int64 tuple_width);
-extern long clamp_cardinality_to_long(Cardinality x);
/* in path/indxpath.c: */
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index 09b48b26f8f..00addf15992 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -55,7 +55,7 @@ extern Sort *make_sort_from_sortclauses(List *sortcls, Plan *lefttree);
extern Agg *make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
- List *groupingSets, List *chain, double dNumGroups,
+ List *groupingSets, List *chain, Cardinality numGroups,
Size transitionSpace, Plan *lefttree);
extern Limit *make_limit(Plan *lefttree, Node *limitOffset, Node *limitCount,
LimitOption limitOption, int uniqNumCols,
--
2.43.7
I wrote:
> Here's a pair of patches to try to do better. The first one
> is concerned with getting more realistic size estimates for
> TupleHashTables in the planner. The second is some mop-up
> that's been pending for a long time in the same area, namely
> getting rid of "long int" field types in Plan nodes.
Meh ... cfbot found a compiler warning that I'd not seen locally.
v2 attached silences that, and I twiddled a couple of comments.
regards, tom lane
From 51766859aa3f694bc6c26e8569f6ebff61aeac8b Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Fri, 31 Oct 2025 15:14:26 -0400
Subject: [PATCH v2 1/2] Improve planner's estimates of tuple hash table sizes.
For several types of plan nodes that use TupleHashTables, the
planner estimated the expected size of the table as basically
numEntries * (MAXALIGN(dataWidth) + MAXALIGN(SizeofHeapTupleHeader)).
This is pretty far off, especially for small data widths, because
it doesn't account for the overhead of the simplehash.h hash table
nor for any per-tuple "additional space" the plan node may request.
Jeff Janes noted a case where the estimate was off by about a factor
of three, even though the obvious hazards such as inaccurate estimates
of numEntries or dataWidth didn't apply.
To improve matters, create functions provided by the relevant executor
modules that can estimate the required sizes with reasonable accuracy.
(We're still not accounting for effects like allocator padding, but
this at least gets the first-order effects correct.)
I added functions that can estimate the tuple table sizes for
nodeSetOp and nodeSubplan; these rely on an estimator for
TupleHashTables in general, and that in turn relies on one for
simplehash.h hash tables. That feels like kind of a lot of mechanism,
but if we take any short-cuts we're violating modularity boundaries.
The other places that use TupleHashTables are nodeAgg, which took
pains to get its numbers right already, and nodeRecursiveunion.
I did not try to improve the situation for nodeRecursiveunion because
there's nothing to improve: we are not making an estimate of the hash
table size, and it wouldn't help us to do so because we have no
non-hashed alternative implementation. On top of that, our estimate
of the number of entries to be hashed in that module is so suspect
that we'd likely often choose the wrong implementation if we did have
two ways to do it.
Reported-by: Jeff Janes <jeff.janes@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAMkU=1zia0JfW_QR8L5xA2vpa0oqVuiapm78h=WpNsHH13_9uw@mail.gmail.com
---
src/backend/executor/execGrouping.c | 59 ++++++++++++++++++++++++++
src/backend/executor/nodeSetOp.c | 9 ++++
src/backend/executor/nodeSubplan.c | 53 ++++++++++++++++++++++-
src/backend/optimizer/plan/subselect.c | 45 ++++++++++----------
src/backend/optimizer/util/pathnode.c | 12 +++---
src/include/executor/executor.h | 3 ++
src/include/executor/nodeSetOp.h | 2 +
src/include/executor/nodeSubplan.h | 4 ++
src/include/lib/simplehash.h | 47 +++++++++++++++++++-
9 files changed, 203 insertions(+), 31 deletions(-)
diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index b4bdaa3c305..e1a3a813dd9 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -14,6 +14,7 @@
*/
#include "postgres.h"
+#include "access/htup_details.h"
#include "access/parallel.h"
#include "common/hashfn.h"
#include "executor/executor.h"
@@ -302,6 +303,64 @@ ResetTupleHashTable(TupleHashTable hashtable)
MemoryContextReset(hashtable->tuplescxt);
}
+/*
+ * Estimate the amount of space needed for a TupleHashTable with nentries
+ * entries, if the tuples have average data width tupleWidth and the caller
+ * requires additionalsize extra space per entry.
+ *
+ * Return SIZE_MAX if it'd overflow size_t.
+ *
+ * nentries is "double" because this is meant for use by the planner,
+ * which typically works with double rowcount estimates. So we'd need to
+ * clamp to integer somewhere and that might as well be here. We do expect
+ * the value not to be NaN or negative, else the result will be garbage.
+ */
+Size
+EstimateTupleHashTableSpace(double nentries,
+ Size tupleWidth,
+ Size additionalsize)
+{
+ Size sh_space;
+ double tuples_space;
+
+ /* First estimate the space needed for the simplehash table */
+ sh_space = tuplehash_estimate_space(nentries);
+
+ /* Give up if that's already too big */
+ if (sh_space >= SIZE_MAX)
+ return sh_space;
+
+ /*
+ * Compute space needed for hashed tuples with additional data. nentries
+ * must be somewhat sane, so it should be safe to compute this product.
+ *
+ * We assume that the hashed tuples will be kept in a BumpContext so that
+ * there is not additional per-tuple overhead.
+ *
+ * (Note that this is only accurate if MEMORY_CONTEXT_CHECKING is off,
+ * else bump.c will add a MemoryChunk header to each tuple. However, it
+ * seems undesirable for debug builds to make different planning choices
+ * than production builds, so we assume the production behavior always.)
+ */
+ tuples_space = nentries * (MAXALIGN(SizeofMinimalTupleHeader) +
+ MAXALIGN(tupleWidth) +
+ MAXALIGN(additionalsize));
+
+ /*
+ * Check for size_t overflow. This coding is trickier than it may appear,
+ * because on 64-bit machines SIZE_MAX cannot be represented exactly as a
+ * double. We must cast it explicitly to suppress compiler warnings about
+ * an inexact conversion, and we must trust that any double value that
+ * compares strictly less than "(double) SIZE_MAX" will cast to a
+ * representable size_t value.
+ */
+ if (sh_space + tuples_space >= (double) SIZE_MAX)
+ return SIZE_MAX;
+
+ /* We don't bother estimating size of the miscellaneous overhead data */
+ return (Size) (sh_space + tuples_space);
+}
+
/*
* Find or create a hashtable entry for the tuple group containing the
* given tuple. The tuple must be the same type as the hashtable entries.
diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index 7b223a7ca3a..5aabed18a09 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -111,6 +111,15 @@ build_hash_table(SetOpState *setopstate)
false);
}
+/* Planner support routine to estimate space needed for hash table */
+Size
+EstimateSetOpHashTableSpace(double nentries, Size tupleWidth)
+{
+ return EstimateTupleHashTableSpace(nentries,
+ tupleWidth,
+ sizeof(SetOpStatePerGroupData));
+}
+
/*
* We've completed processing a tuple group. Decide how many copies (if any)
* of its representative row to emit, and store the count into numOutput.
diff --git a/src/backend/executor/nodeSubplan.c b/src/backend/executor/nodeSubplan.c
index 9f6e45bcb0b..96d900a7783 100644
--- a/src/backend/executor/nodeSubplan.c
+++ b/src/backend/executor/nodeSubplan.c
@@ -525,7 +525,7 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
node->tab_hash_funcs,
node->tab_collations,
nbuckets,
- 0,
+ 0, /* no additional data */
node->planstate->state->es_query_cxt,
node->tuplesContext,
innerecontext->ecxt_per_tuple_memory,
@@ -554,7 +554,7 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
node->tab_hash_funcs,
node->tab_collations,
nbuckets,
- 0,
+ 0, /* no additional data */
node->planstate->state->es_query_cxt,
node->tuplesContext,
innerecontext->ecxt_per_tuple_memory,
@@ -636,6 +636,55 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
MemoryContextSwitchTo(oldcontext);
}
+/* Planner support routine to estimate space needed for hash table(s) */
+Size
+EstimateSubplanHashTableSpace(double nentries,
+ Size tupleWidth,
+ bool unknownEqFalse)
+{
+ Size tab1space,
+ tab2space;
+
+ /* Estimate size of main hashtable */
+ tab1space = EstimateTupleHashTableSpace(nentries,
+ tupleWidth,
+ 0 /* no additional data */ );
+
+ /* Give up if that's already too big */
+ if (tab1space >= SIZE_MAX)
+ return tab1space;
+
+ /* Done if we don't need a hashnulls table */
+ if (unknownEqFalse)
+ return tab1space;
+
+ /*
+ * Adjust the rowcount estimate in the same way as above, except that we
+ * don't bother with the special case for a single hash column. (We could
+ * handle that, but it'd be notationally painful for our caller to provide
+ * the column count, and this table adds not that much to the total
+ * estimate anyway.)
+ */
+ nentries /= 16;
+ if (nentries < 1)
+ nentries = 1;
+
+ /*
+ * It might be sane to also reduce the tupleWidth, but on the other hand
+ * we are not accounting for the space taken by the tuples' null bitmaps.
+ * Leave it alone for now.
+ */
+ tab2space = EstimateTupleHashTableSpace(nentries,
+ tupleWidth,
+ 0 /* no additional data */ );
+
+ /* Guard against overflow */
+ if (tab2space >= SIZE_MAX - tab1space)
+ return SIZE_MAX;
+
+ return tab1space + tab2space;
+}
+
/*
* execTuplesUnequal
* Return true if two tuples are definitely unequal in the indicated
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 14192a13236..ff63d20f8d5 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -20,6 +20,7 @@
#include "catalog/pg_operator.h"
#include "catalog/pg_type.h"
#include "executor/executor.h"
+#include "executor/nodeSubplan.h"
#include "miscadmin.h"
#include "nodes/makefuncs.h"
#include "nodes/nodeFuncs.h"
@@ -79,8 +80,8 @@ static Node *convert_testexpr(PlannerInfo *root,
List *subst_nodes);
static Node *convert_testexpr_mutator(Node *node,
convert_testexpr_context *context);
-static bool subplan_is_hashable(Plan *plan);
-static bool subpath_is_hashable(Path *path);
+static bool subplan_is_hashable(Plan *plan, bool unknownEqFalse);
+static bool subpath_is_hashable(Path *path, bool unknownEqFalse);
static bool testexpr_is_hashable(Node *testexpr, List *param_ids);
static bool test_opexpr_is_hashable(OpExpr *testexpr, List *param_ids);
static bool hash_ok_operator(OpExpr *expr);
@@ -283,7 +284,7 @@ make_subplan(PlannerInfo *root, Query *orig_subquery,
best_path = final_rel->cheapest_total_path;
/* Now we can check if it'll fit in hash_mem */
- if (subpath_is_hashable(best_path))
+ if (subpath_is_hashable(best_path, true))
{
SubPlan *hashplan;
AlternativeSubPlan *asplan;
@@ -524,7 +525,7 @@ build_subplan(PlannerInfo *root, Plan *plan, Path *path,
*/
if (subLinkType == ANY_SUBLINK &&
splan->parParam == NIL &&
- subplan_is_hashable(plan) &&
+ subplan_is_hashable(plan, unknownEqFalse) &&
testexpr_is_hashable(splan->testexpr, splan->paramIds))
splan->useHashTable = true;
@@ -711,19 +712,19 @@ convert_testexpr_mutator(Node *node,
* is suitable for hashing. We only look at the subquery itself.
*/
static bool
-subplan_is_hashable(Plan *plan)
+subplan_is_hashable(Plan *plan, bool unknownEqFalse)
{
- double subquery_size;
+ Size hashtablesize;
/*
- * The estimated size of the subquery result must fit in hash_mem. (Note:
- * we use heap tuple overhead here even though the tuples will actually be
- * stored as MinimalTuples; this provides some fudge factor for hashtable
- * overhead.)
+ * The estimated size of the hashtable holding the subquery result must
+ * fit in hash_mem. (Note: reject on equality, to ensure that an estimate
+ * of SIZE_MAX disables hashing regardless of the hash_mem limit.)
*/
- subquery_size = plan->plan_rows *
- (MAXALIGN(plan->plan_width) + MAXALIGN(SizeofHeapTupleHeader));
- if (subquery_size > get_hash_memory_limit())
+ hashtablesize = EstimateSubplanHashTableSpace(plan->plan_rows,
+ plan->plan_width,
+ unknownEqFalse);
+ if (hashtablesize >= get_hash_memory_limit())
return false;
return true;
@@ -735,19 +736,19 @@ subplan_is_hashable(Plan *plan)
* Identical to subplan_is_hashable, but work from a Path for the subplan.
*/
static bool
-subpath_is_hashable(Path *path)
+subpath_is_hashable(Path *path, bool unknownEqFalse)
{
- double subquery_size;
+ Size hashtablesize;
/*
- * The estimated size of the subquery result must fit in hash_mem. (Note:
- * we use heap tuple overhead here even though the tuples will actually be
- * stored as MinimalTuples; this provides some fudge factor for hashtable
- * overhead.)
+ * The estimated size of the hashtable holding the subquery result must
+ * fit in hash_mem. (Note: reject on equality, to ensure that an estimate
+ * of SIZE_MAX disables hashing regardless of the hash_mem limit.)
*/
- subquery_size = path->rows *
- (MAXALIGN(path->pathtarget->width) + MAXALIGN(SizeofHeapTupleHeader));
- if (subquery_size > get_hash_memory_limit())
+ hashtablesize = EstimateSubplanHashTableSpace(path->rows,
+ path->pathtarget->width,
+ unknownEqFalse);
+ if (hashtablesize >= get_hash_memory_limit())
return false;
return true;
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 44ac5312edd..e4fd6950fad 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -17,6 +17,7 @@
#include <math.h>
#include "access/htup_details.h"
+#include "executor/nodeSetOp.h"
#include "foreign/fdwapi.h"
#include "miscadmin.h"
#include "nodes/extensible.h"
@@ -3461,7 +3462,7 @@ create_setop_path(PlannerInfo *root,
}
else
{
- Size hashentrysize;
+ Size hashtablesize;
/*
* In hashed mode, we must read all the input before we can emit
@@ -3490,11 +3491,12 @@ create_setop_path(PlannerInfo *root,
/*
* Also disable if it doesn't look like the hashtable will fit into
- * hash_mem.
+ * hash_mem. (Note: reject on equality, to ensure that an estimate of
+ * SIZE_MAX disables hashing regardless of the hash_mem limit.)
*/
- hashentrysize = MAXALIGN(leftpath->pathtarget->width) +
- MAXALIGN(SizeofMinimalTupleHeader);
- if (hashentrysize * numGroups > get_hash_memory_limit())
+ hashtablesize = EstimateSetOpHashTableSpace(numGroups,
+ leftpath->pathtarget->width);
+ if (hashtablesize >= get_hash_memory_limit())
pathnode->path.disabled_nodes++;
}
pathnode->path.rows = outputRows;
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 8e7a5453064..086f52cff3d 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -157,6 +157,9 @@ extern TupleHashEntry FindTupleHashEntry(TupleHashTable hashtable,
ExprState *eqcomp,
ExprState *hashexpr);
extern void ResetTupleHashTable(TupleHashTable hashtable);
+extern Size EstimateTupleHashTableSpace(double nentries,
+ Size tupleWidth,
+ Size additionalsize);
#ifndef FRONTEND
/*
diff --git a/src/include/executor/nodeSetOp.h b/src/include/executor/nodeSetOp.h
index 024c6ba1fce..302936df8be 100644
--- a/src/include/executor/nodeSetOp.h
+++ b/src/include/executor/nodeSetOp.h
@@ -20,4 +20,6 @@ extern SetOpState *ExecInitSetOp(SetOp *node, EState *estate, int eflags);
extern void ExecEndSetOp(SetOpState *node);
extern void ExecReScanSetOp(SetOpState *node);
+extern Size EstimateSetOpHashTableSpace(double nentries, Size tupleWidth);
+
#endif /* NODESETOP_H */
diff --git a/src/include/executor/nodeSubplan.h b/src/include/executor/nodeSubplan.h
index a1cafbcc694..301c29d1f24 100644
--- a/src/include/executor/nodeSubplan.h
+++ b/src/include/executor/nodeSubplan.h
@@ -20,6 +20,10 @@ extern SubPlanState *ExecInitSubPlan(SubPlan *subplan, PlanState *parent);
extern Datum ExecSubPlan(SubPlanState *node, ExprContext *econtext, bool *isNull);
+extern Size EstimateSubplanHashTableSpace(double nentries,
+ Size tupleWidth,
+ bool unknownEqFalse);
+
extern void ExecReScanSetParamPlan(SubPlanState *node, PlanState *parent);
extern void ExecSetParamPlan(SubPlanState *node, ExprContext *econtext);
diff --git a/src/include/lib/simplehash.h b/src/include/lib/simplehash.h
index 9622131ede6..82d66135ddf 100644
--- a/src/include/lib/simplehash.h
+++ b/src/include/lib/simplehash.h
@@ -125,6 +125,7 @@
#define SH_ITERATE SH_MAKE_NAME(iterate)
#define SH_ALLOCATE SH_MAKE_NAME(allocate)
#define SH_FREE SH_MAKE_NAME(free)
+#define SH_ESTIMATE_SPACE SH_MAKE_NAME(estimate_space)
#define SH_STAT SH_MAKE_NAME(stat)
/* internal helper functions (no externally visible prototypes) */
@@ -242,7 +243,10 @@ SH_SCOPE void SH_START_ITERATE_AT(SH_TYPE * tb, SH_ITERATOR * iter, uint32 at);
/* <element> *<prefix>_iterate(<prefix>_hash *tb, <prefix>_iterator *iter) */
SH_SCOPE SH_ELEMENT_TYPE *SH_ITERATE(SH_TYPE * tb, SH_ITERATOR * iter);
-/* void <prefix>_stat(<prefix>_hash *tb */
+/* size_t <prefix>_estimate_space(double nentries) */
+SH_SCOPE size_t SH_ESTIMATE_SPACE(double nentries);
+
+/* void <prefix>_stat(<prefix>_hash *tb) */
SH_SCOPE void SH_STAT(SH_TYPE * tb);
#endif /* SH_DECLARE */
@@ -305,7 +309,7 @@ SH_SCOPE void SH_STAT(SH_TYPE * tb);
/*
* Compute allocation size for hashtable. Result can be passed to
- * SH_UPDATE_PARAMETERS.
+ * SH_UPDATE_PARAMETERS. (Keep SH_ESTIMATE_SPACE in sync with this!)
*/
static inline uint64
SH_COMPUTE_SIZE(uint64 newsize)
@@ -1068,6 +1072,44 @@ SH_ITERATE(SH_TYPE * tb, SH_ITERATOR * iter)
return NULL;
}
+/*
+ * Estimate the amount of space needed for a hashtable with nentries entries.
+ * Return SIZE_MAX if that's too many entries.
+ *
+ * nentries is "double" because this is meant for use by the planner,
+ * which typically works with double rowcount estimates. So we'd need to
+ * clamp to integer somewhere and that might as well be here. We do expect
+ * the value not to be NaN or negative, else the result will be garbage.
+ */
+SH_SCOPE size_t
+SH_ESTIMATE_SPACE(double nentries)
+{
+ uint64 size;
+ uint64 space;
+
+ /* fail immediately if we'd overrun SH_MAX_SIZE entries */
+ if (nentries >= SH_MAX_SIZE * SH_FILLFACTOR)
+ return SIZE_MAX;
+
+ /* should be safe to convert to uint64 */
+ size = (uint64) nentries;
+
+ /* supporting zero sized hashes would complicate matters */
+ size = Max(size, 2);
+
+ /* round up size to the next power of 2, that's how bucketing works */
+ size = pg_nextpower2_64(size);
+
+ /* calculate space needed for ->data */
+ space = ((uint64) sizeof(SH_ELEMENT_TYPE)) * size;
+
+ /* verify that allocation of ->data is possible on this platform */
+ if (space >= SIZE_MAX / 2)
+ return SIZE_MAX;
+
+ return (size_t) space + sizeof(SH_TYPE);
+}
+
/*
* Report some statistics about the state of the hashtable. For
* debugging/profiling purposes only.
@@ -1195,6 +1237,7 @@ SH_STAT(SH_TYPE * tb)
#undef SH_ITERATE
#undef SH_ALLOCATE
#undef SH_FREE
+#undef SH_ESTIMATE_SPACE
#undef SH_STAT
/* internal function names */
--
2.43.7
From cc7a0b06d13fd47d6c78a74291b470e9c07e48af Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Fri, 31 Oct 2025 15:18:20 -0400
Subject: [PATCH v2 2/2] Change "long" numGroups fields to be Cardinality
(i.e., double).
We've been nibbling away at removing uses of "long" for a long time,
since its width is platform-dependent. Here's one more: change the
remaining "long" fields in Plan nodes to Cardinality, since the three
surviving examples all represent group-count estimates. The upstream
planner code was converted to Cardinality some time ago; for example
the corresponding fields in Path nodes are type Cardinality, as are
the arguments of the make_foo_path functions. Downstream in the
executor, it turns out that these all feed to the table-size argument
of BuildTupleHashTable. Change that to "double" as well, and fix it
so that it safely clamps out-of-range values to the uint32 limit of
simplehash.h, as was not being done before.
Essentially, this is removing all the artificial datatype-dependent
limitations on these values from upstream processing, and applying
just one clamp at the moment where we're forced to do so by the
datatype choices of simplehash.h.
Also, remove BuildTupleHashTable's misguided attempt to enforce
work_mem/hash_mem_limit. It doesn't have enough information
(particularly not the expected tuple width) to do that accurately,
and it has no real business second-guessing the caller's choice.
For all these plan types, it's really the planner's responsibility
to not choose a hashed implementation if the hashtable is expected
to exceed hash_mem_limit. The previous patch improved the
accuracy of those estimates, and even if BuildTupleHashTable had
more information it should arrive at the same conclusions.
Reported-by: Jeff Janes <jeff.janes@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAMkU=1zia0JfW_QR8L5xA2vpa0oqVuiapm78h=WpNsHH13_9uw@mail.gmail.com
---
src/backend/executor/execGrouping.c | 36 ++++++++++++-----------
src/backend/executor/nodeAgg.c | 24 +++++++--------
src/backend/executor/nodeRecursiveunion.c | 1 -
src/backend/executor/nodeSetOp.c | 1 -
src/backend/executor/nodeSubplan.c | 19 +++++-------
src/backend/optimizer/path/costsize.c | 26 ----------------
src/backend/optimizer/plan/createplan.c | 26 +++++-----------
src/include/executor/executor.h | 2 +-
src/include/nodes/plannodes.h | 6 ++--
src/include/optimizer/optimizer.h | 1 -
src/include/optimizer/planmain.h | 2 +-
11 files changed, 50 insertions(+), 94 deletions(-)
diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index e1a3a813dd9..8b64a625ca5 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -14,6 +14,8 @@
*/
#include "postgres.h"
+#include <math.h>
+
#include "access/htup_details.h"
#include "access/parallel.h"
#include "common/hashfn.h"
@@ -144,7 +146,7 @@ execTuplesHashPrepare(int numCols,
* eqfuncoids: OIDs of equality comparison functions to use
* hashfunctions: FmgrInfos of datatype-specific hashing functions to use
* collations: collations to use in comparisons
- * nbuckets: initial estimate of hashtable size
+ * nelements: initial estimate of hashtable size
* additionalsize: size of data that may be stored along with the hash entry
* metacxt: memory context for long-lived data and the simplehash table
* tuplescxt: memory context in which to store the hashed tuples themselves
@@ -187,7 +189,7 @@ BuildTupleHashTable(PlanState *parent,
const Oid *eqfuncoids,
FmgrInfo *hashfunctions,
Oid *collations,
- long nbuckets,
+ double nelements,
Size additionalsize,
MemoryContext metacxt,
MemoryContext tuplescxt,
@@ -195,12 +197,24 @@ BuildTupleHashTable(PlanState *parent,
bool use_variable_hash_iv)
{
TupleHashTable hashtable;
- Size entrysize;
- Size hash_mem_limit;
+ uint32 nbuckets;
MemoryContext oldcontext;
uint32 hash_iv = 0;
- Assert(nbuckets > 0);
+ /*
+ * tuplehash_create requires a uint32 element count, so we had better
+ * clamp the given nelements to fit in that. As long as we have to do
+ * that, we might as well protect against completely insane input like
+ * zero or NaN. But it is not our job here to enforce issues like staying
+ * within hash_mem: the caller should have done that, and we don't have
+ * enough info to second-guess.
+ */
+ if (isnan(nelements) || nelements <= 0)
+ nbuckets = 1;
+ else if (nelements >= PG_UINT32_MAX)
+ nbuckets = PG_UINT32_MAX;
+ else
+ nbuckets = (uint32) nelements;
/* tuplescxt must be separate, else ResetTupleHashTable breaks things */
Assert(metacxt != tuplescxt);
@@ -208,18 +222,6 @@ BuildTupleHashTable(PlanState *parent,
/* ensure additionalsize is maxalign'ed */
additionalsize = MAXALIGN(additionalsize);
- /*
- * Limit initial table size request to not more than hash_mem.
- *
- * XXX this calculation seems pretty misguided, as it counts only overhead
- * and not the tuples themselves. But we have no knowledge of the
- * expected tuple width here.
- */
- entrysize = sizeof(TupleHashEntryData) + additionalsize;
- hash_mem_limit = get_hash_memory_limit() / entrysize;
- if (nbuckets > hash_mem_limit)
- nbuckets = hash_mem_limit;
-
oldcontext = MemoryContextSwitchTo(metacxt);
hashtable = (TupleHashTable) palloc(sizeof(TupleHashTableData));
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 759ffeed2e6..058171d5133 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -402,12 +402,12 @@ static void find_cols(AggState *aggstate, Bitmapset **aggregated,
Bitmapset **unaggregated);
static bool find_cols_walker(Node *node, FindColsContext *context);
static void build_hash_tables(AggState *aggstate);
-static void build_hash_table(AggState *aggstate, int setno, long nbuckets);
+static void build_hash_table(AggState *aggstate, int setno, double nbuckets);
static void hashagg_recompile_expressions(AggState *aggstate, bool minslot,
bool nullcheck);
static void hash_create_memory(AggState *aggstate);
-static long hash_choose_num_buckets(double hashentrysize,
- long ngroups, Size memory);
+static double hash_choose_num_buckets(double hashentrysize,
+ double ngroups, Size memory);
static int hash_choose_num_partitions(double input_groups,
double hashentrysize,
int used_bits,
@@ -1469,7 +1469,7 @@ build_hash_tables(AggState *aggstate)
for (setno = 0; setno < aggstate->num_hashes; ++setno)
{
AggStatePerHash perhash = &aggstate->perhash[setno];
- long nbuckets;
+ double nbuckets;
Size memory;
if (perhash->hashtable != NULL)
@@ -1478,8 +1478,6 @@ build_hash_tables(AggState *aggstate)
continue;
}
- Assert(perhash->aggnode->numGroups > 0);
-
memory = aggstate->hash_mem_limit / aggstate->num_hashes;
/* choose reasonable number of buckets per hashtable */
@@ -1505,7 +1503,7 @@ build_hash_tables(AggState *aggstate)
* Build a single hashtable for this grouping set.
*/
static void
-build_hash_table(AggState *aggstate, int setno, long nbuckets)
+build_hash_table(AggState *aggstate, int setno, double nbuckets)
{
AggStatePerHash perhash = &aggstate->perhash[setno];
MemoryContext metacxt = aggstate->hash_metacxt;
@@ -2053,11 +2051,11 @@ hash_create_memory(AggState *aggstate)
/*
* Choose a reasonable number of buckets for the initial hash table size.
*/
-static long
-hash_choose_num_buckets(double hashentrysize, long ngroups, Size memory)
+static double
+hash_choose_num_buckets(double hashentrysize, double ngroups, Size memory)
{
- long max_nbuckets;
- long nbuckets = ngroups;
+ double max_nbuckets;
+ double nbuckets = ngroups;
max_nbuckets = memory / hashentrysize;
@@ -2065,7 +2063,7 @@ hash_choose_num_buckets(double hashentrysize, long ngroups, Size memory)
* Underestimating is better than overestimating. Too many buckets crowd
* out space for group keys and transition state values.
*/
- max_nbuckets >>= 1;
+ max_nbuckets /= 2;
if (nbuckets > max_nbuckets)
nbuckets = max_nbuckets;
@@ -3686,7 +3684,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
if (use_hashing)
{
Plan *outerplan = outerPlan(node);
- uint64 totalGroups = 0;
+ double totalGroups = 0;
aggstate->hash_spill_rslot = ExecInitExtraTupleSlot(estate, scanDesc,
&TTSOpsMinimalTuple);
diff --git a/src/backend/executor/nodeRecursiveunion.c b/src/backend/executor/nodeRecursiveunion.c
index ebb7919b49b..cd0ad51dcd2 100644
--- a/src/backend/executor/nodeRecursiveunion.c
+++ b/src/backend/executor/nodeRecursiveunion.c
@@ -35,7 +35,6 @@ build_hash_table(RecursiveUnionState *rustate)
TupleDesc desc = ExecGetResultType(outerPlanState(rustate));
Assert(node->numCols > 0);
- Assert(node->numGroups > 0);
/*
* If both child plans deliver the same fixed tuple slot type, we can tell
diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index 5aabed18a09..9e0f9274fb1 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -88,7 +88,6 @@ build_hash_table(SetOpState *setopstate)
TupleDesc desc = ExecGetResultType(outerPlanState(setopstate));
Assert(node->strategy == SETOP_HASHED);
- Assert(node->numGroups > 0);
/*
* If both child plans deliver the same fixed tuple slot type, we can tell
diff --git a/src/backend/executor/nodeSubplan.c b/src/backend/executor/nodeSubplan.c
index 96d900a7783..aa9fa482797 100644
--- a/src/backend/executor/nodeSubplan.c
+++ b/src/backend/executor/nodeSubplan.c
@@ -34,7 +34,6 @@
#include "miscadmin.h"
#include "nodes/makefuncs.h"
#include "nodes/nodeFuncs.h"
-#include "optimizer/optimizer.h"
#include "utils/array.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
@@ -481,7 +480,7 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
int ncols = node->numCols;
ExprContext *innerecontext = node->innerecontext;
MemoryContext oldcontext;
- long nbuckets;
+ double nentries;
TupleTableSlot *slot;
Assert(subplan->subLinkType == ANY_SUBLINK);
@@ -509,9 +508,7 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
node->havehashrows = false;
node->havenullrows = false;
- nbuckets = clamp_cardinality_to_long(planstate->plan->plan_rows);
- if (nbuckets < 1)
- nbuckets = 1;
+ nentries = planstate->plan->plan_rows;
if (node->hashtable)
ResetTupleHashTable(node->hashtable);
@@ -524,7 +521,7 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
node->tab_eq_funcoids,
node->tab_hash_funcs,
node->tab_collations,
- nbuckets,
+ nentries,
0, /* no additional data */
node->planstate->state->es_query_cxt,
node->tuplesContext,
@@ -534,12 +531,12 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
if (!subplan->unknownEqFalse)
{
if (ncols == 1)
- nbuckets = 1; /* there can only be one entry */
+ nentries = 1; /* there can only be one entry */
else
{
- nbuckets /= 16;
- if (nbuckets < 1)
- nbuckets = 1;
+ nentries /= 16;
+ if (nentries < 1)
+ nentries = 1;
}
if (node->hashnulls)
@@ -553,7 +550,7 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
node->tab_eq_funcoids,
node->tab_hash_funcs,
node->tab_collations,
- nbuckets,
+ nentries,
0, /* no additional data */
node->planstate->state->es_query_cxt,
node->tuplesContext,
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 94077e6a006..8335cf5b5c5 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -257,32 +257,6 @@ clamp_width_est(int64 tuple_width)
return (int32) tuple_width;
}
-/*
- * clamp_cardinality_to_long
- * Cast a Cardinality value to a sane long value.
- */
-long
-clamp_cardinality_to_long(Cardinality x)
-{
- /*
- * Just for paranoia's sake, ensure we do something sane with negative or
- * NaN values.
- */
- if (isnan(x))
- return LONG_MAX;
- if (x <= 0)
- return 0;
-
- /*
- * If "long" is 64 bits, then LONG_MAX cannot be represented exactly as a
- * double. Casting it to double and back may well result in overflow due
- * to rounding, so avoid doing that. We trust that any double value that
- * compares strictly less than "(double) LONG_MAX" will cast to a
- * representable "long" value.
- */
- return (x < (double) LONG_MAX) ? (long) x : LONG_MAX;
-}
-
/*
* cost_seqscan
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 63fe6637155..8af091ba647 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -226,7 +226,7 @@ static RecursiveUnion *make_recursive_union(List *tlist,
Plan *righttree,
int wtParam,
List *distinctList,
- long numGroups);
+ Cardinality numGroups);
static BitmapAnd *make_bitmap_and(List *bitmapplans);
static BitmapOr *make_bitmap_or(List *bitmapplans);
static NestLoop *make_nestloop(List *tlist,
@@ -301,7 +301,7 @@ static Gather *make_gather(List *qptlist, List *qpqual,
int nworkers, int rescan_param, bool single_copy, Plan *subplan);
static SetOp *make_setop(SetOpCmd cmd, SetOpStrategy strategy,
List *tlist, Plan *lefttree, Plan *righttree,
- List *groupList, long numGroups);
+ List *groupList, Cardinality numGroups);
static LockRows *make_lockrows(Plan *lefttree, List *rowMarks, int epqParam);
static Result *make_gating_result(List *tlist, Node *resconstantqual,
Plan *subplan);
@@ -2564,7 +2564,6 @@ create_setop_plan(PlannerInfo *root, SetOpPath *best_path, int flags)
List *tlist = build_path_tlist(root, &best_path->path);
Plan *leftplan;
Plan *rightplan;
- long numGroups;
/*
* SetOp doesn't project, so tlist requirements pass through; moreover we
@@ -2575,16 +2574,13 @@ create_setop_plan(PlannerInfo *root, SetOpPath *best_path, int flags)
rightplan = create_plan_recurse(root, best_path->rightpath,
flags | CP_LABEL_TLIST);
- /* Convert numGroups to long int --- but 'ware overflow! */
- numGroups = clamp_cardinality_to_long(best_path->numGroups);
-
plan = make_setop(best_path->cmd,
best_path->strategy,
tlist,
leftplan,
rightplan,
best_path->groupList,
- numGroups);
+ best_path->numGroups);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -2604,7 +2600,6 @@ create_recursiveunion_plan(PlannerInfo *root, RecursiveUnionPath *best_path)
Plan *leftplan;
Plan *rightplan;
List *tlist;
- long numGroups;
/* Need both children to produce same tlist, so force it */
leftplan = create_plan_recurse(root, best_path->leftpath, CP_EXACT_TLIST);
@@ -2612,15 +2607,12 @@ create_recursiveunion_plan(PlannerInfo *root, RecursiveUnionPath *best_path)
tlist = build_path_tlist(root, &best_path->path);
- /* Convert numGroups to long int --- but 'ware overflow! */
- numGroups = clamp_cardinality_to_long(best_path->numGroups);
-
plan = make_recursive_union(tlist,
leftplan,
rightplan,
best_path->wtParam,
best_path->distinctList,
- numGroups);
+ best_path->numGroups);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -5845,7 +5837,7 @@ make_recursive_union(List *tlist,
Plan *righttree,
int wtParam,
List *distinctList,
- long numGroups)
+ Cardinality numGroups)
{
RecursiveUnion *node = makeNode(RecursiveUnion);
Plan *plan = &node->plan;
@@ -6582,15 +6574,11 @@ Agg *
make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
- List *groupingSets, List *chain, double dNumGroups,
+ List *groupingSets, List *chain, Cardinality numGroups,
Size transitionSpace, Plan *lefttree)
{
Agg *node = makeNode(Agg);
Plan *plan = &node->plan;
- long numGroups;
-
- /* Reduce to long, but 'ware overflow! */
- numGroups = clamp_cardinality_to_long(dNumGroups);
node->aggstrategy = aggstrategy;
node->aggsplit = aggsplit;
@@ -6822,7 +6810,7 @@ make_gather(List *qptlist,
static SetOp *
make_setop(SetOpCmd cmd, SetOpStrategy strategy,
List *tlist, Plan *lefttree, Plan *righttree,
- List *groupList, long numGroups)
+ List *groupList, Cardinality numGroups)
{
SetOp *node = makeNode(SetOp);
Plan *plan = &node->plan;
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 086f52cff3d..fa2b657fb2f 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -138,7 +138,7 @@ extern TupleHashTable BuildTupleHashTable(PlanState *parent,
const Oid *eqfuncoids,
FmgrInfo *hashfunctions,
Oid *collations,
- long nbuckets,
+ double nelements,
Size additionalsize,
MemoryContext metacxt,
MemoryContext tuplescxt,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 7cdd2b51c94..c4393a94321 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -475,7 +475,7 @@ typedef struct RecursiveUnion
Oid *dupCollations pg_node_attr(array_size(numCols));
/* estimated number of groups in input */
- long numGroups;
+ Cardinality numGroups;
} RecursiveUnion;
/* ----------------
@@ -1206,7 +1206,7 @@ typedef struct Agg
Oid *grpCollations pg_node_attr(array_size(numCols));
/* estimated number of groups in input */
- long numGroups;
+ Cardinality numGroups;
/* for pass-by-ref transition data */
uint64 transitionSpace;
@@ -1446,7 +1446,7 @@ typedef struct SetOp
bool *cmpNullsFirst pg_node_attr(array_size(numCols));
/* estimated number of groups in left input */
- long numGroups;
+ Cardinality numGroups;
} SetOp;
/* ----------------
diff --git a/src/include/optimizer/optimizer.h b/src/include/optimizer/optimizer.h
index a34113903c0..d0aa8ab0c1c 100644
--- a/src/include/optimizer/optimizer.h
+++ b/src/include/optimizer/optimizer.h
@@ -82,7 +82,6 @@ extern PGDLLIMPORT int effective_cache_size;
extern double clamp_row_est(double nrows);
extern int32 clamp_width_est(int64 tuple_width);
-extern long clamp_cardinality_to_long(Cardinality x);
/* in path/indxpath.c: */
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index 09b48b26f8f..00addf15992 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -55,7 +55,7 @@ extern Sort *make_sort_from_sortclauses(List *sortcls, Plan *lefttree);
extern Agg *make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
- List *groupingSets, List *chain, double dNumGroups,
+ List *groupingSets, List *chain, Cardinality numGroups,
Size transitionSpace, Plan *lefttree);
extern Limit *make_limit(Plan *lefttree, Node *limitOffset, Node *limitCount,
LimitOption limitOption, int uniqNumCols,
--
2.43.7
On Sat, 1 Nov 2025 at 08:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
> > Here's a pair of patches to try to do better. The first one
> > is concerned with getting more realistic size estimates for
> > TupleHashTables in the planner. The second is some mop-up
> > that's been pending for a long time in the same area, namely
> > getting rid of "long int" field types in Plan nodes.
I had a look at the v2 patches. Mostly quibbles, but #4 seems like an oversight.
0001:
1) For the following:
+ tuples_space = nentries * (MAXALIGN(SizeofMinimalTupleHeader) +
+ MAXALIGN(tupleWidth) +
+ MAXALIGN(additionalsize));
If I'm not mistaken, technically that should be
MAXALIGN(SizeofMinimalTupleHeader + tupleWidth) +
MAXALIGN(additionalsize), but in reality it should come out the same
since SizeofMinimalTupleHeader is 16. If that were to change then I
believe the extra MAXALIGN would start overestimating the memory.
2) Would it be better to reference the function name
"buildSubPlanHash" instead of "above" in:
+ * Adjust the rowcount estimate in the same way as above, except that we
I think "above" is ok when it's the same function, but when it's
talking about another function, it's a recipe for becoming outdated.
It'd be better to use the function name so we can grep for it when we do
refactoring work, else we end up with commits like e3a0304eb...
3) Quite a collection of naming styles here.
+Size
+EstimateTupleHashTableSpace(double nentries,
+ Size tupleWidth,
+ Size additionalsize)
+{
+ Size sh_space;
+ double tuples_space;
4) I think this is missing a "/ SH_FILLFACTOR"
+ /* should be safe to convert to uint64 */
+ size = (uint64) nentries;
i.e. do what SH_CREATE does.
0002:
5) Is it worth switching "Max(nbuckets, 1);" to just "nbuckets" in
hash_choose_num_buckets()? It looks like BuildTupleHashTable() will do
that part for us.
David
David Rowley <dgrowleyml@gmail.com> writes:
> I had a look at the v2 patches. Mostly quibbles, but #4 seems like an oversight.
Thanks for reviewing!
> 1) For the following:
> + tuples_space = nentries * (MAXALIGN(SizeofMinimalTupleHeader) +
> + MAXALIGN(tupleWidth) +
> + MAXALIGN(additionalsize));
> If I'm not mistaken, technically that should be
> MAXALIGN(SizeofMinimalTupleHeader + tupleWidth) +
> MAXALIGN(additionalsize),
No, I think it's correct as written: the data payload of a tuple
must always start on a MAXALIGN boundary. As you say, it doesn't
matter as long as SizeofMinimalTupleHeader is 16, but I think this
way is formally correct. (It would matter more if we were trying
to account for tuples' null bitmaps ...)
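To make that concrete, here's a tiny standalone sketch of the arithmetic
(nothing from the tree: the 8-byte alignment and the hypothetical 20-byte
header are assumptions chosen just to show where the two formulas could
diverge; the real SizeofMinimalTupleHeader is 16 and already maxaligned,
so today they agree):

#include <stdio.h>
#include <stddef.h>

/* illustrative stand-in for MAXALIGN, assuming MAXIMUM_ALIGNOF == 8 */
#define MAXALIGN(LEN)   (((size_t) (LEN) + 7) & ~((size_t) 7))

int
main(void)
{
	size_t		header = 20;	/* hypothetical header size, not the real 16 */
	size_t		width = 12;		/* some example data width */

	/* patch's accounting: payload starts on its own MAXALIGN boundary */
	printf("separate: %zu\n", MAXALIGN(header) + MAXALIGN(width));	/* 24 + 16 = 40 */

	/* reviewed alternative: header and payload maxaligned together */
	printf("combined: %zu\n", MAXALIGN(header + width));	/* MAXALIGN(32) = 32 */

	return 0;
}

Since the tuple's data payload really does begin on a MAXALIGN boundary past
the header, the 40-byte figure is the one that would match reality under that
hypothetical header size.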
> 2) Would it be better to reference the function name
> "buildSubPlanHash" instead of "above" in:
> + * Adjust the rowcount estimate in the same way as above, except that we
Done.
> 3) Quite a collection of naming styles here.
Yeah :-( ... we work in an old and none-too-consistent code base.
Do you have any specific suggestions about which of these functions
might fit its surroundings better with a different caps style?
> 4) I think this is missing a "/ SH_FILLFACTOR"
> + /* should be safe to convert to uint64 */
> + size = (uint64) nentries;
> i.e do what SH_CREATE does.
Oh! I think I got confused because some of that logic is in
SH_CREATE and some in SH_COMPUTE_SIZE :-(. Good catch.
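For anyone following along, here's a rough self-contained sketch of what the
v3 sizing path does with the fill-factor scaling added (the 0.9 fill factor
and 48-byte element size are stand-in assumptions for SH_FILLFACTOR and
sizeof(SH_ELEMENT_TYPE), not values pulled from any particular
instantiation):

#include <stdint.h>
#include <stdio.h>

/* round up to the next power of two, as the bucketing requires */
static uint64_t
next_pow2(uint64_t x)
{
	uint64_t	p = 2;

	while (p < x)
		p <<= 1;
	return p;
}

int
main(void)
{
	double		fillfactor = 0.9;	/* stand-in for SH_FILLFACTOR */
	size_t		elemsize = 48;		/* stand-in for sizeof(SH_ELEMENT_TYPE) */
	double		nentries = 1000.0;	/* planner's rowcount estimate */

	/* v2 skipped this step; v3 scales the request the way SH_CREATE does */
	double		scaled = nentries / fillfactor;			/* ~1111 */
	uint64_t	buckets = next_pow2((uint64_t) scaled);	/* 2048 */

	printf("buckets = %llu, ->data space = %llu bytes\n",
		   (unsigned long long) buckets,
		   (unsigned long long) (buckets * elemsize));
	return 0;
}

Without the division, the same request would round up to only 1024 buckets,
so the estimate would undershoot the table the executor actually builds.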
> 5) Is it switching "Max(nbuckets, 1);" to "nbuckets" in
> hash_choose_num_buckets(). Looks like BuildTupleHashTable() will do
> that part for us.
Yeah, we could do that. I was trying to not touch more of nodeAgg
than I had to, but it seems sensible to not duplicate something
that BuildTupleHashTable will do.
v3 attached.
regards, tom lane
From f708938d51015664acfac90763cc469e644811db Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sun, 2 Nov 2025 11:13:59 -0500
Subject: [PATCH v3 1/2] Improve planner's estimates of tuple hash table sizes.
For several types of plan nodes that use TupleHashTables, the
planner estimated the expected size of the table as basically
numEntries * (MAXALIGN(dataWidth) + MAXALIGN(SizeofHeapTupleHeader)).
This is pretty far off, especially for small data widths, because
it doesn't account for the overhead of the simplehash.h hash table
nor for any per-tuple "additional space" the plan node may request.
Jeff Janes noted a case where the estimate was off by about a factor
of three, even though the obvious hazards such as inaccurate estimates
of numEntries or dataWidth didn't apply.
To improve matters, create functions provided by the relevant executor
modules that can estimate the required sizes with reasonable accuracy.
(We're still not accounting for effects like allocator padding, but
this at least gets the first-order effects correct.)
I added functions that can estimate the tuple table sizes for
nodeSetOp and nodeSubplan; these rely on an estimator for
TupleHashTables in general, and that in turn relies on one for
simplehash.h hash tables. That feels like kind of a lot of mechanism,
but if we take any short-cuts we're violating modularity boundaries.
The other places that use TupleHashTables are nodeAgg, which took
pains to get its numbers right already, and nodeRecursiveunion.
I did not try to improve the situation for nodeRecursiveunion because
there's nothing to improve: we are not making an estimate of the hash
table size, and it wouldn't help us to do so because we have no
non-hashed alternative implementation. On top of that, our estimate
of the number of entries to be hashed in that module is so suspect
that we'd likely often choose the wrong implementation if we did have
two ways to do it.
Reported-by: Jeff Janes <jeff.janes@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CAMkU=1zia0JfW_QR8L5xA2vpa0oqVuiapm78h=WpNsHH13_9uw@mail.gmail.com
---
src/backend/executor/execGrouping.c | 59 ++++++++++++++++++++++++++
src/backend/executor/nodeSetOp.c | 9 ++++
src/backend/executor/nodeSubplan.c | 53 ++++++++++++++++++++++-
src/backend/optimizer/plan/subselect.c | 45 ++++++++++----------
src/backend/optimizer/util/pathnode.c | 12 +++---
src/include/executor/executor.h | 3 ++
src/include/executor/nodeSetOp.h | 2 +
src/include/executor/nodeSubplan.h | 4 ++
src/include/lib/simplehash.h | 50 +++++++++++++++++++++-
9 files changed, 206 insertions(+), 31 deletions(-)
diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index b4bdaa3c305..e1a3a813dd9 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -14,6 +14,7 @@
*/
#include "postgres.h"
+#include "access/htup_details.h"
#include "access/parallel.h"
#include "common/hashfn.h"
#include "executor/executor.h"
@@ -302,6 +303,64 @@ ResetTupleHashTable(TupleHashTable hashtable)
MemoryContextReset(hashtable->tuplescxt);
}
+/*
+ * Estimate the amount of space needed for a TupleHashTable with nentries
+ * entries, if the tuples have average data width tupleWidth and the caller
+ * requires additionalsize extra space per entry.
+ *
+ * Return SIZE_MAX if it'd overflow size_t.
+ *
+ * nentries is "double" because this is meant for use by the planner,
+ * which typically works with double rowcount estimates. So we'd need to
+ * clamp to integer somewhere and that might as well be here. We do expect
+ * the value not to be NaN or negative, else the result will be garbage.
+ */
+Size
+EstimateTupleHashTableSpace(double nentries,
+ Size tupleWidth,
+ Size additionalsize)
+{
+ Size sh_space;
+ double tuples_space;
+
+ /* First estimate the space needed for the simplehash table */
+ sh_space = tuplehash_estimate_space(nentries);
+
+ /* Give up if that's already too big */
+ if (sh_space >= SIZE_MAX)
+ return sh_space;
+
+ /*
+ * Compute space needed for hashed tuples with additional data. nentries
+ * must be somewhat sane, so it should be safe to compute this product.
+ *
+ * We assume that the hashed tuples will be kept in a BumpContext so that
+ * there is not additional per-tuple overhead.
+ *
+ * (Note that this is only accurate if MEMORY_CONTEXT_CHECKING is off,
+ * else bump.c will add a MemoryChunk header to each tuple. However, it
+ * seems undesirable for debug builds to make different planning choices
+ * than production builds, so we assume the production behavior always.)
+ */
+ tuples_space = nentries * (MAXALIGN(SizeofMinimalTupleHeader) +
+ MAXALIGN(tupleWidth) +
+ MAXALIGN(additionalsize));
+
+ /*
+ * Check for size_t overflow. This coding is trickier than it may appear,
+ * because on 64-bit machines SIZE_MAX cannot be represented exactly as a
+ * double. We must cast it explicitly to suppress compiler warnings about
+ * an inexact conversion, and we must trust that any double value that
+ * compares strictly less than "(double) SIZE_MAX" will cast to a
+ * representable size_t value.
+ */
+ if (sh_space + tuples_space >= (double) SIZE_MAX)
+ return SIZE_MAX;
+
+ /* We don't bother estimating size of the miscellaneous overhead data */
+ return (Size) (sh_space + tuples_space);
+}
+
/*
* Find or create a hashtable entry for the tuple group containing the
* given tuple. The tuple must be the same type as the hashtable entries.
diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index 7b223a7ca3a..5aabed18a09 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -111,6 +111,15 @@ build_hash_table(SetOpState *setopstate)
false);
}
+/* Planner support routine to estimate space needed for hash table */
+Size
+EstimateSetOpHashTableSpace(double nentries, Size tupleWidth)
+{
+ return EstimateTupleHashTableSpace(nentries,
+ tupleWidth,
+ sizeof(SetOpStatePerGroupData));
+}
+
/*
* We've completed processing a tuple group. Decide how many copies (if any)
* of its representative row to emit, and store the count into numOutput.
diff --git a/src/backend/executor/nodeSubplan.c b/src/backend/executor/nodeSubplan.c
index 9f6e45bcb0b..1cd0988bb49 100644
--- a/src/backend/executor/nodeSubplan.c
+++ b/src/backend/executor/nodeSubplan.c
@@ -525,7 +525,7 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
node->tab_hash_funcs,
node->tab_collations,
nbuckets,
- 0,
+ 0, /* no additional data */
node->planstate->state->es_query_cxt,
node->tuplesContext,
innerecontext->ecxt_per_tuple_memory,
@@ -554,7 +554,7 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
node->tab_hash_funcs,
node->tab_collations,
nbuckets,
- 0,
+ 0, /* no additional data */
node->planstate->state->es_query_cxt,
node->tuplesContext,
innerecontext->ecxt_per_tuple_memory,
@@ -636,6 +636,55 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
MemoryContextSwitchTo(oldcontext);
}
+/* Planner support routine to estimate space needed for hash table(s) */
+Size
+EstimateSubplanHashTableSpace(double nentries,
+ Size tupleWidth,
+ bool unknownEqFalse)
+{
+ Size tab1space,
+ tab2space;
+
+ /* Estimate size of main hashtable */
+ tab1space = EstimateTupleHashTableSpace(nentries,
+ tupleWidth,
+ 0 /* no additional data */ );
+
+ /* Give up if that's already too big */
+ if (tab1space >= SIZE_MAX)
+ return tab1space;
+
+ /* Done if we don't need a hashnulls table */
+ if (unknownEqFalse)
+ return tab1space;
+
+ /*
+ * Adjust the rowcount estimate in the same way that buildSubPlanHash
+ * will, except that we don't bother with the special case for a single
+ * hash column. (We skip that detail because it'd be notationally painful
+ * for our caller to provide the column count, and this table has
+ * relatively little impact on the total estimate anyway.)
+ */
+ nentries /= 16;
+ if (nentries < 1)
+ nentries = 1;
+
+ /*
+ * It might be sane to also reduce the tupleWidth, but on the other hand
+ * we are not accounting for the space taken by the tuples' null bitmaps.
+ * Leave it alone for now.
+ */
+ tab2space = EstimateTupleHashTableSpace(nentries,
+ tupleWidth,
+ 0 /* no additional data */ );
+
+ /* Guard against overflow */
+ if (tab2space >= SIZE_MAX - tab1space)
+ return SIZE_MAX;
+
+ return tab1space + tab2space;
+}
+
/*
* execTuplesUnequal
* Return true if two tuples are definitely unequal in the indicated
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 14192a13236..ff63d20f8d5 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -20,6 +20,7 @@
#include "catalog/pg_operator.h"
#include "catalog/pg_type.h"
#include "executor/executor.h"
+#include "executor/nodeSubplan.h"
#include "miscadmin.h"
#include "nodes/makefuncs.h"
#include "nodes/nodeFuncs.h"
@@ -79,8 +80,8 @@ static Node *convert_testexpr(PlannerInfo *root,
List *subst_nodes);
static Node *convert_testexpr_mutator(Node *node,
convert_testexpr_context *context);
-static bool subplan_is_hashable(Plan *plan);
-static bool subpath_is_hashable(Path *path);
+static bool subplan_is_hashable(Plan *plan, bool unknownEqFalse);
+static bool subpath_is_hashable(Path *path, bool unknownEqFalse);
static bool testexpr_is_hashable(Node *testexpr, List *param_ids);
static bool test_opexpr_is_hashable(OpExpr *testexpr, List *param_ids);
static bool hash_ok_operator(OpExpr *expr);
@@ -283,7 +284,7 @@ make_subplan(PlannerInfo *root, Query *orig_subquery,
best_path = final_rel->cheapest_total_path;
/* Now we can check if it'll fit in hash_mem */
- if (subpath_is_hashable(best_path))
+ if (subpath_is_hashable(best_path, true))
{
SubPlan *hashplan;
AlternativeSubPlan *asplan;
@@ -524,7 +525,7 @@ build_subplan(PlannerInfo *root, Plan *plan, Path *path,
*/
if (subLinkType == ANY_SUBLINK &&
splan->parParam == NIL &&
- subplan_is_hashable(plan) &&
+ subplan_is_hashable(plan, unknownEqFalse) &&
testexpr_is_hashable(splan->testexpr, splan->paramIds))
splan->useHashTable = true;
@@ -711,19 +712,19 @@ convert_testexpr_mutator(Node *node,
* is suitable for hashing. We only look at the subquery itself.
*/
static bool
-subplan_is_hashable(Plan *plan)
+subplan_is_hashable(Plan *plan, bool unknownEqFalse)
{
- double subquery_size;
+ Size hashtablesize;
/*
- * The estimated size of the subquery result must fit in hash_mem. (Note:
- * we use heap tuple overhead here even though the tuples will actually be
- * stored as MinimalTuples; this provides some fudge factor for hashtable
- * overhead.)
+ * The estimated size of the hashtable holding the subquery result must
+ * fit in hash_mem. (Note: reject on equality, to ensure that an estimate
+ * of SIZE_MAX disables hashing regardless of the hash_mem limit.)
*/
- subquery_size = plan->plan_rows *
- (MAXALIGN(plan->plan_width) + MAXALIGN(SizeofHeapTupleHeader));
- if (subquery_size > get_hash_memory_limit())
+ hashtablesize = EstimateSubplanHashTableSpace(plan->plan_rows,
+ plan->plan_width,
+ unknownEqFalse);
+ if (hashtablesize >= get_hash_memory_limit())
return false;
return true;
@@ -735,19 +736,19 @@ subplan_is_hashable(Plan *plan)
* Identical to subplan_is_hashable, but work from a Path for the subplan.
*/
static bool
-subpath_is_hashable(Path *path)
+subpath_is_hashable(Path *path, bool unknownEqFalse)
{
- double subquery_size;
+ Size hashtablesize;
/*
- * The estimated size of the subquery result must fit in hash_mem. (Note:
- * we use heap tuple overhead here even though the tuples will actually be
- * stored as MinimalTuples; this provides some fudge factor for hashtable
- * overhead.)
+ * The estimated size of the hashtable holding the subquery result must
+ * fit in hash_mem. (Note: reject on equality, to ensure that an estimate
+ * of SIZE_MAX disables hashing regardless of the hash_mem limit.)
*/
- subquery_size = path->rows *
- (MAXALIGN(path->pathtarget->width) + MAXALIGN(SizeofHeapTupleHeader));
- if (subquery_size > get_hash_memory_limit())
+ hashtablesize = EstimateSubplanHashTableSpace(path->rows,
+ path->pathtarget->width,
+ unknownEqFalse);
+ if (hashtablesize >= get_hash_memory_limit())
return false;
return true;
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 44ac5312edd..e4fd6950fad 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -17,6 +17,7 @@
#include <math.h>
#include "access/htup_details.h"
+#include "executor/nodeSetOp.h"
#include "foreign/fdwapi.h"
#include "miscadmin.h"
#include "nodes/extensible.h"
@@ -3461,7 +3462,7 @@ create_setop_path(PlannerInfo *root,
}
else
{
- Size hashentrysize;
+ Size hashtablesize;
/*
* In hashed mode, we must read all the input before we can emit
@@ -3490,11 +3491,12 @@ create_setop_path(PlannerInfo *root,
/*
* Also disable if it doesn't look like the hashtable will fit into
- * hash_mem.
+ * hash_mem. (Note: reject on equality, to ensure that an estimate of
+ * SIZE_MAX disables hashing regardless of the hash_mem limit.)
*/
- hashentrysize = MAXALIGN(leftpath->pathtarget->width) +
- MAXALIGN(SizeofMinimalTupleHeader);
- if (hashentrysize * numGroups > get_hash_memory_limit())
+ hashtablesize = EstimateSetOpHashTableSpace(numGroups,
+ leftpath->pathtarget->width);
+ if (hashtablesize >= get_hash_memory_limit())
pathnode->path.disabled_nodes++;
}
pathnode->path.rows = outputRows;
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 8e7a5453064..086f52cff3d 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -157,6 +157,9 @@ extern TupleHashEntry FindTupleHashEntry(TupleHashTable hashtable,
ExprState *eqcomp,
ExprState *hashexpr);
extern void ResetTupleHashTable(TupleHashTable hashtable);
+extern Size EstimateTupleHashTableSpace(double nentries,
+ Size tupleWidth,
+ Size additionalsize);
#ifndef FRONTEND
/*
diff --git a/src/include/executor/nodeSetOp.h b/src/include/executor/nodeSetOp.h
index 024c6ba1fce..302936df8be 100644
--- a/src/include/executor/nodeSetOp.h
+++ b/src/include/executor/nodeSetOp.h
@@ -20,4 +20,6 @@ extern SetOpState *ExecInitSetOp(SetOp *node, EState *estate, int eflags);
extern void ExecEndSetOp(SetOpState *node);
extern void ExecReScanSetOp(SetOpState *node);
+extern Size EstimateSetOpHashTableSpace(double nentries, Size tupleWidth);
+
#endif /* NODESETOP_H */
diff --git a/src/include/executor/nodeSubplan.h b/src/include/executor/nodeSubplan.h
index a1cafbcc694..301c29d1f24 100644
--- a/src/include/executor/nodeSubplan.h
+++ b/src/include/executor/nodeSubplan.h
@@ -20,6 +20,10 @@ extern SubPlanState *ExecInitSubPlan(SubPlan *subplan, PlanState *parent);
extern Datum ExecSubPlan(SubPlanState *node, ExprContext *econtext, bool *isNull);
+extern Size EstimateSubplanHashTableSpace(double nentries,
+ Size tupleWidth,
+ bool unknownEqFalse);
+
extern void ExecReScanSetParamPlan(SubPlanState *node, PlanState *parent);
extern void ExecSetParamPlan(SubPlanState *node, ExprContext *econtext);
diff --git a/src/include/lib/simplehash.h b/src/include/lib/simplehash.h
index 9622131ede6..031a377da84 100644
--- a/src/include/lib/simplehash.h
+++ b/src/include/lib/simplehash.h
@@ -125,6 +125,7 @@
#define SH_ITERATE SH_MAKE_NAME(iterate)
#define SH_ALLOCATE SH_MAKE_NAME(allocate)
#define SH_FREE SH_MAKE_NAME(free)
+#define SH_ESTIMATE_SPACE SH_MAKE_NAME(estimate_space)
#define SH_STAT SH_MAKE_NAME(stat)
/* internal helper functions (no externally visible prototypes) */
@@ -242,7 +243,10 @@ SH_SCOPE void SH_START_ITERATE_AT(SH_TYPE * tb, SH_ITERATOR * iter, uint32 at);
/* <element> *<prefix>_iterate(<prefix>_hash *tb, <prefix>_iterator *iter) */
SH_SCOPE SH_ELEMENT_TYPE *SH_ITERATE(SH_TYPE * tb, SH_ITERATOR * iter);
-/* void <prefix>_stat(<prefix>_hash *tb */
+/* size_t <prefix>_estimate_space(double nentries) */
+SH_SCOPE size_t SH_ESTIMATE_SPACE(double nentries);
+
+/* void <prefix>_stat(<prefix>_hash *tb) */
SH_SCOPE void SH_STAT(SH_TYPE * tb);
#endif /* SH_DECLARE */
@@ -305,7 +309,7 @@ SH_SCOPE void SH_STAT(SH_TYPE * tb);
/*
* Compute allocation size for hashtable. Result can be passed to
- * SH_UPDATE_PARAMETERS.
+ * SH_UPDATE_PARAMETERS. (Keep SH_ESTIMATE_SPACE in sync with this!)
*/
static inline uint64
SH_COMPUTE_SIZE(uint64 newsize)
@@ -1068,6 +1072,47 @@ SH_ITERATE(SH_TYPE * tb, SH_ITERATOR * iter)
return NULL;
}
+/*
+ * Estimate the amount of space needed for a hashtable with nentries entries.
+ * Return SIZE_MAX if that's too many entries.
+ *
+ * nentries is "double" because this is meant for use by the planner,
+ * which typically works with double rowcount estimates. So we'd need to
+ * clamp to integer somewhere and that might as well be here. We do expect
+ * the value not to be NaN or negative, else the result will be garbage.
+ */
+SH_SCOPE size_t
+SH_ESTIMATE_SPACE(double nentries)
+{
+ uint64 size;
+ uint64 space;
+
+ /* scale request by SH_FILLFACTOR, as SH_CREATE does */
+ nentries = nentries / SH_FILLFACTOR;
+
+ /* fail if we'd overrun SH_MAX_SIZE entries */
+ if (nentries >= SH_MAX_SIZE)
+ return SIZE_MAX;
+
+ /* should be safe to convert to uint64 */
+ size = (uint64) nentries;
+
+ /* supporting zero sized hashes would complicate matters */
+ size = Max(size, 2);
+
+ /* round up size to the next power of 2, that's how bucketing works */
+ size = pg_nextpower2_64(size);
+
+ /* calculate space needed for ->data */
+ space = ((uint64) sizeof(SH_ELEMENT_TYPE)) * size;
+
+ /* verify that allocation of ->data is possible on this platform */
+ if (space >= SIZE_MAX / 2)
+ return SIZE_MAX;
+
+ return (size_t) space + sizeof(SH_TYPE);
+}
+
/*
* Report some statistics about the state of the hashtable. For
* debugging/profiling purposes only.
@@ -1195,6 +1240,7 @@ SH_STAT(SH_TYPE * tb)
#undef SH_ITERATE
#undef SH_ALLOCATE
#undef SH_FREE
+#undef SH_ESTIMATE_SPACE
#undef SH_STAT
/* internal function names */
--
2.43.7
From a935da643f0fe5da46ca97aa2a0ee8acc6b35b24 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sun, 2 Nov 2025 11:21:21 -0500
Subject: [PATCH v3 2/2] Change "long" numGroups fields to be Cardinality
(i.e., double).
We've been nibbling away at removing uses of "long" for a long time,
since its width is platform-dependent. Here's one more: change the
remaining "long" fields in Plan nodes to Cardinality, since the three
surviving examples all represent group-count estimates. The upstream
planner code was converted to Cardinality some time ago; for example
the corresponding fields in Path nodes are type Cardinality, as are
the arguments of the make_foo_path functions. Downstream in the
executor, it turns out that these all feed to the table-size argument
of BuildTupleHashTable. Change that to "double" as well, and fix it
so that it safely clamps out-of-range values to the uint32 limit of
simplehash.h, as was not being done before.
Essentially, this is removing all the artificial datatype-dependent
limitations on these values from upstream processing, and applying
just one clamp at the moment where we're forced to do so by the
datatype choices of simplehash.h.
Also, remove BuildTupleHashTable's misguided attempt to enforce
work_mem/hash_mem_limit. It doesn't have enough information
(particularly not the expected tuple width) to do that accurately,
and it has no real business second-guessing the caller's choice.
For all these plan types, it's really the planner's responsibility
to not choose a hashed implementation if the hashtable is expected
to exceed hash_mem_limit. The previous patch improved the
accuracy of those estimates, and even if BuildTupleHashTable had
more information it should arrive at the same conclusions.
Reported-by: Jeff Janes <jeff.janes@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CAMkU=1zia0JfW_QR8L5xA2vpa0oqVuiapm78h=WpNsHH13_9uw@mail.gmail.com
---
src/backend/executor/execGrouping.c | 36 ++++++++++++-----------
src/backend/executor/nodeAgg.c | 30 ++++++++++---------
src/backend/executor/nodeRecursiveunion.c | 1 -
src/backend/executor/nodeSetOp.c | 1 -
src/backend/executor/nodeSubplan.c | 19 +++++-------
src/backend/optimizer/path/costsize.c | 26 ----------------
src/backend/optimizer/plan/createplan.c | 26 +++++-----------
src/include/executor/executor.h | 2 +-
src/include/nodes/plannodes.h | 6 ++--
src/include/optimizer/optimizer.h | 1 -
src/include/optimizer/planmain.h | 2 +-
11 files changed, 55 insertions(+), 95 deletions(-)
diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index e1a3a813dd9..8b64a625ca5 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -14,6 +14,8 @@
*/
#include "postgres.h"
+#include <math.h>
+
#include "access/htup_details.h"
#include "access/parallel.h"
#include "common/hashfn.h"
@@ -144,7 +146,7 @@ execTuplesHashPrepare(int numCols,
* eqfuncoids: OIDs of equality comparison functions to use
* hashfunctions: FmgrInfos of datatype-specific hashing functions to use
* collations: collations to use in comparisons
- * nbuckets: initial estimate of hashtable size
+ * nelements: initial estimate of hashtable size
* additionalsize: size of data that may be stored along with the hash entry
* metacxt: memory context for long-lived data and the simplehash table
* tuplescxt: memory context in which to store the hashed tuples themselves
@@ -187,7 +189,7 @@ BuildTupleHashTable(PlanState *parent,
const Oid *eqfuncoids,
FmgrInfo *hashfunctions,
Oid *collations,
- long nbuckets,
+ double nelements,
Size additionalsize,
MemoryContext metacxt,
MemoryContext tuplescxt,
@@ -195,12 +197,24 @@ BuildTupleHashTable(PlanState *parent,
bool use_variable_hash_iv)
{
TupleHashTable hashtable;
- Size entrysize;
- Size hash_mem_limit;
+ uint32 nbuckets;
MemoryContext oldcontext;
uint32 hash_iv = 0;
- Assert(nbuckets > 0);
+ /*
+ * tuplehash_create requires a uint32 element count, so we had better
+ * clamp the given nelements to fit in that. As long as we have to do
+ * that, we might as well protect against completely insane input like
+ * zero or NaN. But it is not our job here to enforce issues like staying
+ * within hash_mem: the caller should have done that, and we don't have
+ * enough info to second-guess.
+ */
+ if (isnan(nelements) || nelements <= 0)
+ nbuckets = 1;
+ else if (nelements >= PG_UINT32_MAX)
+ nbuckets = PG_UINT32_MAX;
+ else
+ nbuckets = (uint32) nelements;
/* tuplescxt must be separate, else ResetTupleHashTable breaks things */
Assert(metacxt != tuplescxt);
@@ -208,18 +222,6 @@ BuildTupleHashTable(PlanState *parent,
/* ensure additionalsize is maxalign'ed */
additionalsize = MAXALIGN(additionalsize);
- /*
- * Limit initial table size request to not more than hash_mem.
- *
- * XXX this calculation seems pretty misguided, as it counts only overhead
- * and not the tuples themselves. But we have no knowledge of the
- * expected tuple width here.
- */
- entrysize = sizeof(TupleHashEntryData) + additionalsize;
- hash_mem_limit = get_hash_memory_limit() / entrysize;
- if (nbuckets > hash_mem_limit)
- nbuckets = hash_mem_limit;
-
oldcontext = MemoryContextSwitchTo(metacxt);
hashtable = (TupleHashTable) palloc(sizeof(TupleHashTableData));
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 759ffeed2e6..0b02fd32107 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -402,12 +402,12 @@ static void find_cols(AggState *aggstate, Bitmapset **aggregated,
Bitmapset **unaggregated);
static bool find_cols_walker(Node *node, FindColsContext *context);
static void build_hash_tables(AggState *aggstate);
-static void build_hash_table(AggState *aggstate, int setno, long nbuckets);
+static void build_hash_table(AggState *aggstate, int setno, double nbuckets);
static void hashagg_recompile_expressions(AggState *aggstate, bool minslot,
bool nullcheck);
static void hash_create_memory(AggState *aggstate);
-static long hash_choose_num_buckets(double hashentrysize,
- long ngroups, Size memory);
+static double hash_choose_num_buckets(double hashentrysize,
+ double ngroups, Size memory);
static int hash_choose_num_partitions(double input_groups,
double hashentrysize,
int used_bits,
@@ -1469,7 +1469,7 @@ build_hash_tables(AggState *aggstate)
for (setno = 0; setno < aggstate->num_hashes; ++setno)
{
AggStatePerHash perhash = &aggstate->perhash[setno];
- long nbuckets;
+ double nbuckets;
Size memory;
if (perhash->hashtable != NULL)
@@ -1478,8 +1478,6 @@ build_hash_tables(AggState *aggstate)
continue;
}
- Assert(perhash->aggnode->numGroups > 0);
-
memory = aggstate->hash_mem_limit / aggstate->num_hashes;
/* choose reasonable number of buckets per hashtable */
@@ -1505,7 +1503,7 @@ build_hash_tables(AggState *aggstate)
* Build a single hashtable for this grouping set.
*/
static void
-build_hash_table(AggState *aggstate, int setno, long nbuckets)
+build_hash_table(AggState *aggstate, int setno, double nbuckets)
{
AggStatePerHash perhash = &aggstate->perhash[setno];
MemoryContext metacxt = aggstate->hash_metacxt;
@@ -2053,11 +2051,11 @@ hash_create_memory(AggState *aggstate)
/*
* Choose a reasonable number of buckets for the initial hash table size.
*/
-static long
-hash_choose_num_buckets(double hashentrysize, long ngroups, Size memory)
+static double
+hash_choose_num_buckets(double hashentrysize, double ngroups, Size memory)
{
- long max_nbuckets;
- long nbuckets = ngroups;
+ double max_nbuckets;
+ double nbuckets = ngroups;
max_nbuckets = memory / hashentrysize;
@@ -2065,12 +2063,16 @@ hash_choose_num_buckets(double hashentrysize, long ngroups, Size memory)
* Underestimating is better than overestimating. Too many buckets crowd
* out space for group keys and transition state values.
*/
- max_nbuckets >>= 1;
+ max_nbuckets /= 2;
if (nbuckets > max_nbuckets)
nbuckets = max_nbuckets;
- return Max(nbuckets, 1);
+ /*
+ * BuildTupleHashTable will clamp any obviously-insane result, so we don't
+ * need to be too careful here.
+ */
+ return nbuckets;
}
/*
@@ -3686,7 +3688,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
if (use_hashing)
{
Plan *outerplan = outerPlan(node);
- uint64 totalGroups = 0;
+ double totalGroups = 0;
aggstate->hash_spill_rslot = ExecInitExtraTupleSlot(estate, scanDesc,
&TTSOpsMinimalTuple);
diff --git a/src/backend/executor/nodeRecursiveunion.c b/src/backend/executor/nodeRecursiveunion.c
index ebb7919b49b..cd0ad51dcd2 100644
--- a/src/backend/executor/nodeRecursiveunion.c
+++ b/src/backend/executor/nodeRecursiveunion.c
@@ -35,7 +35,6 @@ build_hash_table(RecursiveUnionState *rustate)
TupleDesc desc = ExecGetResultType(outerPlanState(rustate));
Assert(node->numCols > 0);
- Assert(node->numGroups > 0);
/*
* If both child plans deliver the same fixed tuple slot type, we can tell
diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index 5aabed18a09..9e0f9274fb1 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -88,7 +88,6 @@ build_hash_table(SetOpState *setopstate)
TupleDesc desc = ExecGetResultType(outerPlanState(setopstate));
Assert(node->strategy == SETOP_HASHED);
- Assert(node->numGroups > 0);
/*
* If both child plans deliver the same fixed tuple slot type, we can tell
diff --git a/src/backend/executor/nodeSubplan.c b/src/backend/executor/nodeSubplan.c
index 1cd0988bb49..c8b7bd9eb66 100644
--- a/src/backend/executor/nodeSubplan.c
+++ b/src/backend/executor/nodeSubplan.c
@@ -34,7 +34,6 @@
#include "miscadmin.h"
#include "nodes/makefuncs.h"
#include "nodes/nodeFuncs.h"
-#include "optimizer/optimizer.h"
#include "utils/array.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
@@ -481,7 +480,7 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
int ncols = node->numCols;
ExprContext *innerecontext = node->innerecontext;
MemoryContext oldcontext;
- long nbuckets;
+ double nentries;
TupleTableSlot *slot;
Assert(subplan->subLinkType == ANY_SUBLINK);
@@ -509,9 +508,7 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
node->havehashrows = false;
node->havenullrows = false;
- nbuckets = clamp_cardinality_to_long(planstate->plan->plan_rows);
- if (nbuckets < 1)
- nbuckets = 1;
+ nentries = planstate->plan->plan_rows;
if (node->hashtable)
ResetTupleHashTable(node->hashtable);
@@ -524,7 +521,7 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
node->tab_eq_funcoids,
node->tab_hash_funcs,
node->tab_collations,
- nbuckets,
+ nentries,
0, /* no additional data */
node->planstate->state->es_query_cxt,
node->tuplesContext,
@@ -534,12 +531,12 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
if (!subplan->unknownEqFalse)
{
if (ncols == 1)
- nbuckets = 1; /* there can only be one entry */
+ nentries = 1; /* there can only be one entry */
else
{
- nbuckets /= 16;
- if (nbuckets < 1)
- nbuckets = 1;
+ nentries /= 16;
+ if (nentries < 1)
+ nentries = 1;
}
if (node->hashnulls)
@@ -553,7 +550,7 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
node->tab_eq_funcoids,
node->tab_hash_funcs,
node->tab_collations,
- nbuckets,
+ nentries,
0, /* no additional data */
node->planstate->state->es_query_cxt,
node->tuplesContext,
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 94077e6a006..8335cf5b5c5 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -257,32 +257,6 @@ clamp_width_est(int64 tuple_width)
return (int32) tuple_width;
}
-/*
- * clamp_cardinality_to_long
- * Cast a Cardinality value to a sane long value.
- */
-long
-clamp_cardinality_to_long(Cardinality x)
-{
- /*
- * Just for paranoia's sake, ensure we do something sane with negative or
- * NaN values.
- */
- if (isnan(x))
- return LONG_MAX;
- if (x <= 0)
- return 0;
-
- /*
- * If "long" is 64 bits, then LONG_MAX cannot be represented exactly as a
- * double. Casting it to double and back may well result in overflow due
- * to rounding, so avoid doing that. We trust that any double value that
- * compares strictly less than "(double) LONG_MAX" will cast to a
- * representable "long" value.
- */
- return (x < (double) LONG_MAX) ? (long) x : LONG_MAX;
-}
-
/*
* cost_seqscan
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 63fe6637155..8af091ba647 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -226,7 +226,7 @@ static RecursiveUnion *make_recursive_union(List *tlist,
Plan *righttree,
int wtParam,
List *distinctList,
- long numGroups);
+ Cardinality numGroups);
static BitmapAnd *make_bitmap_and(List *bitmapplans);
static BitmapOr *make_bitmap_or(List *bitmapplans);
static NestLoop *make_nestloop(List *tlist,
@@ -301,7 +301,7 @@ static Gather *make_gather(List *qptlist, List *qpqual,
int nworkers, int rescan_param, bool single_copy, Plan *subplan);
static SetOp *make_setop(SetOpCmd cmd, SetOpStrategy strategy,
List *tlist, Plan *lefttree, Plan *righttree,
- List *groupList, long numGroups);
+ List *groupList, Cardinality numGroups);
static LockRows *make_lockrows(Plan *lefttree, List *rowMarks, int epqParam);
static Result *make_gating_result(List *tlist, Node *resconstantqual,
Plan *subplan);
@@ -2564,7 +2564,6 @@ create_setop_plan(PlannerInfo *root, SetOpPath *best_path, int flags)
List *tlist = build_path_tlist(root, &best_path->path);
Plan *leftplan;
Plan *rightplan;
- long numGroups;
/*
* SetOp doesn't project, so tlist requirements pass through; moreover we
@@ -2575,16 +2574,13 @@ create_setop_plan(PlannerInfo *root, SetOpPath *best_path, int flags)
rightplan = create_plan_recurse(root, best_path->rightpath,
flags | CP_LABEL_TLIST);
- /* Convert numGroups to long int --- but 'ware overflow! */
- numGroups = clamp_cardinality_to_long(best_path->numGroups);
-
plan = make_setop(best_path->cmd,
best_path->strategy,
tlist,
leftplan,
rightplan,
best_path->groupList,
- numGroups);
+ best_path->numGroups);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -2604,7 +2600,6 @@ create_recursiveunion_plan(PlannerInfo *root, RecursiveUnionPath *best_path)
Plan *leftplan;
Plan *rightplan;
List *tlist;
- long numGroups;
/* Need both children to produce same tlist, so force it */
leftplan = create_plan_recurse(root, best_path->leftpath, CP_EXACT_TLIST);
@@ -2612,15 +2607,12 @@ create_recursiveunion_plan(PlannerInfo *root, RecursiveUnionPath *best_path)
tlist = build_path_tlist(root, &best_path->path);
- /* Convert numGroups to long int --- but 'ware overflow! */
- numGroups = clamp_cardinality_to_long(best_path->numGroups);
-
plan = make_recursive_union(tlist,
leftplan,
rightplan,
best_path->wtParam,
best_path->distinctList,
- numGroups);
+ best_path->numGroups);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -5845,7 +5837,7 @@ make_recursive_union(List *tlist,
Plan *righttree,
int wtParam,
List *distinctList,
- long numGroups)
+ Cardinality numGroups)
{
RecursiveUnion *node = makeNode(RecursiveUnion);
Plan *plan = &node->plan;
@@ -6582,15 +6574,11 @@ Agg *
make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
- List *groupingSets, List *chain, double dNumGroups,
+ List *groupingSets, List *chain, Cardinality numGroups,
Size transitionSpace, Plan *lefttree)
{
Agg *node = makeNode(Agg);
Plan *plan = &node->plan;
- long numGroups;
-
- /* Reduce to long, but 'ware overflow! */
- numGroups = clamp_cardinality_to_long(dNumGroups);
node->aggstrategy = aggstrategy;
node->aggsplit = aggsplit;
@@ -6822,7 +6810,7 @@ make_gather(List *qptlist,
static SetOp *
make_setop(SetOpCmd cmd, SetOpStrategy strategy,
List *tlist, Plan *lefttree, Plan *righttree,
- List *groupList, long numGroups)
+ List *groupList, Cardinality numGroups)
{
SetOp *node = makeNode(SetOp);
Plan *plan = &node->plan;
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 086f52cff3d..fa2b657fb2f 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -138,7 +138,7 @@ extern TupleHashTable BuildTupleHashTable(PlanState *parent,
const Oid *eqfuncoids,
FmgrInfo *hashfunctions,
Oid *collations,
- long nbuckets,
+ double nelements,
Size additionalsize,
MemoryContext metacxt,
MemoryContext tuplescxt,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 7cdd2b51c94..c4393a94321 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -475,7 +475,7 @@ typedef struct RecursiveUnion
Oid *dupCollations pg_node_attr(array_size(numCols));
/* estimated number of groups in input */
- long numGroups;
+ Cardinality numGroups;
} RecursiveUnion;
/* ----------------
@@ -1206,7 +1206,7 @@ typedef struct Agg
Oid *grpCollations pg_node_attr(array_size(numCols));
/* estimated number of groups in input */
- long numGroups;
+ Cardinality numGroups;
/* for pass-by-ref transition data */
uint64 transitionSpace;
@@ -1446,7 +1446,7 @@ typedef struct SetOp
bool *cmpNullsFirst pg_node_attr(array_size(numCols));
/* estimated number of groups in left input */
- long numGroups;
+ Cardinality numGroups;
} SetOp;
/* ----------------
diff --git a/src/include/optimizer/optimizer.h b/src/include/optimizer/optimizer.h
index a34113903c0..d0aa8ab0c1c 100644
--- a/src/include/optimizer/optimizer.h
+++ b/src/include/optimizer/optimizer.h
@@ -82,7 +82,6 @@ extern PGDLLIMPORT int effective_cache_size;
extern double clamp_row_est(double nrows);
extern int32 clamp_width_est(int64 tuple_width);
-extern long clamp_cardinality_to_long(Cardinality x);
/* in path/indxpath.c: */
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index 09b48b26f8f..00addf15992 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -55,7 +55,7 @@ extern Sort *make_sort_from_sortclauses(List *sortcls, Plan *lefttree);
extern Agg *make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
- List *groupingSets, List *chain, double dNumGroups,
+ List *groupingSets, List *chain, Cardinality numGroups,
Size transitionSpace, Plan *lefttree);
extern Limit *make_limit(Plan *lefttree, Node *limitOffset, Node *limitCount,
LimitOption limitOption, int uniqNumCols,
--
2.43.7
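[For illustration: a minimal standalone sketch of the pitfall described in the
clamp_cardinality_to_long() comment removed above. The row-count value is
hypothetical; the patch sidesteps the issue entirely by keeping numGroups as a
Cardinality (i.e. a double) end to end instead of clamping it to long.]

#include <limits.h>
#include <stdio.h>

int
main(void)
{
	/*
	 * On LP64 platforms LONG_MAX (2^63 - 1) has more significant bits than a
	 * double's 53-bit mantissa, so round-to-nearest converts it to exactly
	 * 2^63.  Casting that double back to long would overflow, which is why
	 * the removed function compared with a strict "<" rather than
	 * round-tripping the value.
	 */
	double		limit = (double) LONG_MAX;
	double		estimate = 9.5e18;	/* a hypothetical row-count estimate */
	long		clamped;

	printf("LONG_MAX          = %ld\n", LONG_MAX);	/* 9223372036854775807 */
	printf("(double) LONG_MAX = %.1f\n", limit);	/* 9223372036854775808.0 */

	/* the guard the removed function relied on */
	clamped = (estimate < limit) ? (long) estimate : LONG_MAX;
	printf("clamped           = %ld\n", clamped);	/* LONG_MAX */

	return 0;
}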
On Mon, 3 Nov 2025 at 05:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> v3 attached.

I reviewed the changes. Looks good to me.

David
David Rowley <dgrowleyml@gmail.com> writes:
> On Mon, 3 Nov 2025 at 05:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> v3 attached.
> I reviewed the changes. Looks good to me.
Thanks again! I'll get that pushed shortly.
regards, tom lane