Discussion: Improve hash join's handling of tuples with null join keys

Improve hash join's handling of tuples with null join keys

From: Tom Lane
Date:
The attached patch is a response to the discussion at [1], where
it emerged that lots of rows with null join keys can send a hash
join into too-many-batches hell, if they are on the outer side
of the join so that they must be null-extended not just discarded.
This isn't really surprising given that such rows will certainly
end up in the same hash bucket, and no amount of splitting can
reduce the size of that bucket.  (I'm a bit surprised that the
growEnabled heuristic didn't kick in, but it seems it didn't,
at least not up to several million batches.)
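
To make the shape of the problem concrete, the data involved looks
roughly like this (a made-up example, not the actual schema from the
report at [1]):

    -- most of one input has a NULL join key, yet every such row must
    -- still come out null-extended because of the full join
    create table t_nulls as
      select case when i % 10 = 0 then i end as id,
             repeat('x', 100) as pad
      from generate_series(1, 1000000) i;
    create table t_plain as
      select i as id from generate_series(1, 100000) i;
    analyze t_nulls, t_plain;

    select count(*) from t_plain p full join t_nulls n using (id);

All the NULL-keyed rows hash to the same value, so whichever input gets
hashed ends up with one enormous bucket that no amount of batch
splitting can shrink.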

Thinking about that, it occurred to me to wonder why we are putting
null-keyed tuples into the hash table at all.  They cannot match
anything, so all we really have to do with them is emit one
null-extended copy.  Awhile later I had the attached, which shoves
such rows into a tuplestore that's separate from the hash table
proper, ensuring that they can't bollix our algorithms for when to
grow the hash table.  (For tuples coming from the right input, we
need to use a tuplestore in case we're asked to rescan the existing
hashtable.  For tuples coming from the left input, we could
theoretically emit 'em and forget 'em immediately, but that'd require
some major code restructuring so I decided to just use a tuplestore
there too.)

This passes check-world, and I've extended a couple of existing test
cases to ensure that the new code paths are exercised.  I've not done
any real performance testing, though.

            regards, tom lane

[1] https://www.postgresql.org/message-id/flat/18909-e5e1b702c9441b8a%40postgresql.org

From cd1cf545583a328e756a07818669c1667fb51bb8 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Mon, 5 May 2025 16:02:46 -0400
Subject: [PATCH v1] Improve hash join's handling of tuples with null join
 keys.

In a plain join, we can just summarily discard an input tuple
with null join key(s), since it cannot match anything from
the other side of the join (assuming a strict join operator).
However, if the tuple comes from the outer side of an outer join
then we have to emit it with null-extension of the other side.

Up to now, hash joins did that by inserting the tuple into the hash
table as though it were a normal tuple.  This is unnecessarily
inefficient though, since the required processing is far simpler than
for a potentially-matchable tuple.  Worse, if there are a lot of such
tuples they will bloat the hash bucket they go into, possibly causing
useless repeated attempts to split that bucket or increase the number
of batches.  We have a report of a large join vainly creating many
thousands of batches when faced with such input.

This patch improves the situation by keeping such tuples out of the
hash table altogether, instead pushing them into a separate tuplestore
from which we return them later.  (One might consider trying to return
them immediately; but that would require substantial refactoring, and
it doesn't work anyway for the case where we rescan an unmodified hash
table.)  This works even in parallel hash joins, because whichever
worker reads a null-keyed tuple can just return it; there's no need
for consultation with other workers.  Thus the tuplestores are local
storage even in a parallel join.
---
 src/backend/executor/execExpr.c         |  20 +-
 src/backend/executor/nodeHash.c         |  66 +++++--
 src/backend/executor/nodeHashjoin.c     | 247 +++++++++++++++++++++---
 src/backend/utils/sort/tuplestore.c     |  32 +++
 src/include/executor/executor.h         |   2 +-
 src/include/executor/hashjoin.h         |   9 +
 src/include/executor/nodeHash.h         |   1 +
 src/include/nodes/execnodes.h           |   8 +
 src/include/utils/tuplestore.h          |   3 +
 src/test/regress/expected/join.out      |   7 +-
 src/test/regress/expected/join_hash.out |  15 +-
 src/test/regress/sql/join.sql           |   4 +-
 src/test/regress/sql/join_hash.sql      |   1 +
 13 files changed, 356 insertions(+), 59 deletions(-)

diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index f1569879b52..2cd5eb07985 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -4282,25 +4282,25 @@ ExecBuildHash32FromAttrs(TupleDesc desc, const TupleTableSlotOps *ops,
  * 'hash_exprs'.  When multiple expressions are present, the hash values
  * returned by each hash function are combined to produce a single hash value.
  *
+ * If any hash_expr yields NULL and the corresponding hash function is strict,
+ * the created ExprState will return NULL.
+ *
  * desc: tuple descriptor for the to-be-hashed expressions
  * ops: TupleTableSlotOps for the TupleDesc
  * hashfunc_oids: Oid for each hash function to call, one for each 'hash_expr'
- * collations: collation to use when calling the hash function.
- * hash_expr: list of expressions to hash the value of
- * opstrict: array corresponding to the 'hashfunc_oids' to store op_strict()
+ * collations: collation to use when calling the hash function
+ * hash_exprs: list of expressions to hash the value of
+ * opstrict: strictness flag for each hash function
  * parent: PlanState node that the 'hash_exprs' will be evaluated at
  * init_value: Normally 0, but can be set to other values to seed the hash
  * with some other value.  Using non-zero is slightly less efficient but can
  * be useful.
- * keep_nulls: if true, evaluation of the returned ExprState will abort early
- * returning NULL if the given hash function is strict and the Datum to hash
- * is null.  When set to false, any NULL input Datums are skipped.
  */
 ExprState *
 ExecBuildHash32Expr(TupleDesc desc, const TupleTableSlotOps *ops,
                     const Oid *hashfunc_oids, const List *collations,
                     const List *hash_exprs, const bool *opstrict,
-                    PlanState *parent, uint32 init_value, bool keep_nulls)
+                    PlanState *parent, uint32 init_value)
 {
     ExprState  *state = makeNode(ExprState);
     ExprEvalStep scratch = {0};
@@ -4377,8 +4377,8 @@ ExecBuildHash32Expr(TupleDesc desc, const TupleTableSlotOps *ops,
         fmgr_info(funcid, finfo);

         /*
-         * Build the steps to evaluate the hash function's argument have it so
-         * the value of that is stored in the 0th argument of the hash func.
+         * Build the steps to evaluate the hash function's argument, placing
+         * the value in the 0th argument of the hash func.
          */
         ExecInitExprRec(expr,
                         state,
@@ -4413,7 +4413,7 @@ ExecBuildHash32Expr(TupleDesc desc, const TupleTableSlotOps *ops,
         scratch.d.hashdatum.fcinfo_data = fcinfo;
         scratch.d.hashdatum.fn_addr = finfo->fn_addr;

-        scratch.opcode = opstrict[i] && !keep_nulls ? strict_opcode : opcode;
+        scratch.opcode = opstrict[i] ? strict_opcode : opcode;
         scratch.d.hashdatum.jumpdone = -1;

         ExprEvalPushStep(state, &scratch);
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 8d2201ab67f..003814a4d31 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -154,8 +154,11 @@ MultiExecPrivateHash(HashState *node)
     econtext = node->ps.ps_ExprContext;

     /*
-     * Get all tuples from the node below the Hash node and insert into the
-     * hash table (or temp files).
+     * Get all tuples from the node below the Hash node and insert the
+     * potentially-matchable ones into the hash table (or temp files).  Tuples
+     * that can't possibly match because they have null join keys are dumped
+     * into a separate tuplestore, or just summarily discarded if we don't
+     * need to emit them with null-extension.
      */
     for (;;)
     {
@@ -175,6 +178,7 @@ MultiExecPrivateHash(HashState *node)

         if (!isnull)
         {
+            /* normal case with a non-null join key */
             uint32        hashvalue = DatumGetUInt32(hashdatum);
             int            bucketNumber;

@@ -193,6 +197,14 @@ MultiExecPrivateHash(HashState *node)
             }
             hashtable->totalTuples += 1;
         }
+        else if (node->keep_null_tuples)
+        {
+            /* null join key, but we must save tuple to be emitted later */
+            if (node->null_tuple_store == NULL)
+                node->null_tuple_store = ExecHashBuildNullTupleStore(hashtable);
+            tuplestore_puttupleslot(node->null_tuple_store, slot);
+        }
+        /* else we can discard the tuple immediately */
     }

     /* resize the hash table if needed (NTUP_PER_BUCKET exceeded) */
@@ -223,7 +235,6 @@ MultiExecParallelHash(HashState *node)
     HashJoinTable hashtable;
     TupleTableSlot *slot;
     ExprContext *econtext;
-    uint32        hashvalue;
     Barrier    *build_barrier;
     int            i;

@@ -283,6 +294,7 @@ MultiExecParallelHash(HashState *node)
             for (;;)
             {
                 bool        isnull;
+                uint32        hashvalue;

                 slot = ExecProcNode(outerNode);
                 if (TupIsNull(slot))
@@ -296,8 +308,19 @@ MultiExecParallelHash(HashState *node)
                                                                      &isnull));

                 if (!isnull)
+                {
+                    /* normal case with a non-null join key */
                     ExecParallelHashTableInsert(hashtable, slot, hashvalue);
-                hashtable->partialTuples++;
+                    hashtable->partialTuples++;
+                }
+                else if (node->keep_null_tuples)
+                {
+                    /* null join key, but save tuple to be emitted later */
+                    if (node->null_tuple_store == NULL)
+                        node->null_tuple_store = ExecHashBuildNullTupleStore(hashtable);
+                    tuplestore_puttupleslot(node->null_tuple_store, slot);
+                }
+                /* else we can discard the tuple immediately */
             }

             /*
@@ -405,14 +428,10 @@ ExecInitHash(Hash *node, EState *estate, int eflags)

     Assert(node->plan.qual == NIL);

-    /*
-     * Delay initialization of hash_expr until ExecInitHashJoin().  We cannot
-     * build the ExprState here as we don't yet know the join type we're going
-     * to be hashing values for and we need to know that before calling
-     * ExecBuildHash32Expr as the keep_nulls parameter depends on the join
-     * type.
-     */
+    /* these fields will be filled by ExecInitHashJoin() */
     hashstate->hash_expr = NULL;
+    hashstate->null_tuple_store = NULL;
+    hashstate->keep_null_tuples = false;

     return hashstate;
 }
@@ -2748,6 +2767,31 @@ ExecHashRemoveNextSkewBucket(HashJoinTable hashtable)
     }
 }

+/*
+ * Build a tuplestore suitable for holding null-keyed input tuples.
+ * (This function doesn't care whether it's for outer or inner tuples.)
+ *
+ * Note that in a parallel hash join, each worker has its own tuplestore(s)
+ * for these.  There's no need to interact with other workers to decide
+ * what to do with them.  So they're always in private storage.
+ */
+Tuplestorestate *
+ExecHashBuildNullTupleStore(HashJoinTable hashtable)
+{
+    Tuplestorestate *tstore;
+    MemoryContext oldcxt;
+
+    /*
+     * We keep the tuplestore in the hashCxt to ensure it won't go away too
+     * soon.  Size it at work_mem/16 so that it doesn't bloat the node's space
+     * consumption too much.
+     */
+    oldcxt = MemoryContextSwitchTo(hashtable->hashCxt);
+    tstore = tuplestore_begin_heap(false, false, work_mem / 16);
+    MemoryContextSwitchTo(oldcxt);
+    return tstore;
+}
+
 /*
  * Reserve space in the DSM segment for instrumentation data.
  */
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 5661ad76830..6a42041c927 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -182,7 +182,9 @@
 #define HJ_SCAN_BUCKET            3
 #define HJ_FILL_OUTER_TUPLE        4
 #define HJ_FILL_INNER_TUPLES    5
-#define HJ_NEED_NEW_BATCH        6
+#define HJ_FILL_OUTER_NULL_TUPLES    6
+#define HJ_FILL_INNER_NULL_TUPLES    7
+#define HJ_NEED_NEW_BATCH        8

 /* Returns true if doing null-fill on outer relation */
 #define HJ_FILL_OUTER(hjstate)    ((hjstate)->hj_NullInnerTupleSlot != NULL)
@@ -346,9 +348,16 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
                 /*
                  * If the inner relation is completely empty, and we're not
                  * doing a left outer join, we can quit without scanning the
-                 * outer relation.
+                 * outer relation.  (If the inner relation contains only
+                 * null-keyed tuples that we need to emit, we'll fall through
+                 * and do the outer-relation scan.  In principle we could go
+                 * emit those tuples then quit, but it would complicate the
+                 * state machine logic.  The case seems rare enough to not be
+                 * worth optimizing.)
                  */
-                if (hashtable->totalTuples == 0 && !HJ_FILL_OUTER(node))
+                if (hashtable->totalTuples == 0 &&
+                    hashNode->null_tuple_store == NULL &&
+                    !HJ_FILL_OUTER(node))
                 {
                     if (parallel)
                     {
@@ -440,12 +449,17 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
                         if (parallel)
                         {
                             /*
-                             * Only one process is currently allow to handle
+                             * Only one process is currently allowed to handle
                              * each batch's unmatched tuples, in a parallel
-                             * join.
+                             * join.  However, each process must deal with any
+                             * null-keyed tuples it found.
                              */
                             if (ExecParallelPrepHashTableForUnmatched(node))
                                 node->hj_JoinState = HJ_FILL_INNER_TUPLES;
+                            else if (node->hj_NullOuterTupleStore)
+                                node->hj_JoinState = HJ_FILL_OUTER_NULL_TUPLES;
+                            else if (hashNode->null_tuple_store)
+                                node->hj_JoinState = HJ_FILL_INNER_NULL_TUPLES;
                             else
                                 node->hj_JoinState = HJ_NEED_NEW_BATCH;
                         }
@@ -456,7 +470,14 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
                         }
                     }
                     else
-                        node->hj_JoinState = HJ_NEED_NEW_BATCH;
+                    {
+                        /* might have outer null-keyed tuples to fill */
+                        Assert(hashNode->null_tuple_store == NULL);
+                        if (node->hj_NullOuterTupleStore)
+                            node->hj_JoinState = HJ_FILL_OUTER_NULL_TUPLES;
+                        else
+                            node->hj_JoinState = HJ_NEED_NEW_BATCH;
+                    }
                     continue;
                 }

@@ -632,8 +653,13 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
                 if (!(parallel ? ExecParallelScanHashTableForUnmatched(node, econtext)
                       : ExecScanHashTableForUnmatched(node, econtext)))
                 {
-                    /* no more unmatched tuples */
-                    node->hj_JoinState = HJ_NEED_NEW_BATCH;
+                    /* no more unmatched tuples, but maybe there are nulls */
+                    if (node->hj_NullOuterTupleStore)
+                        node->hj_JoinState = HJ_FILL_OUTER_NULL_TUPLES;
+                    else if (hashNode->null_tuple_store)
+                        node->hj_JoinState = HJ_FILL_INNER_NULL_TUPLES;
+                    else
+                        node->hj_JoinState = HJ_NEED_NEW_BATCH;
                     continue;
                 }

@@ -649,6 +675,93 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
                     InstrCountFiltered2(node, 1);
                 break;

+            case HJ_FILL_OUTER_NULL_TUPLES:
+
+                /*
+                 * We have finished a batch, but we are doing left/full join,
+                 * so any null-keyed outer tuples have to be emitted before we
+                 * continue to the next batch.
+                 *
+                 * (We could delay this till the end of the join, but there
+                 * seems little percentage in that.)
+                 *
+                 * We have to use tuplestore_gettupleslot_force because
+                 * hj_OuterTupleSlot may not be able to store a MinimalTuple.
+                 */
+                while (tuplestore_gettupleslot_force(node->hj_NullOuterTupleStore,
+                                                     true, false,
+                                                     node->hj_OuterTupleSlot))
+                {
+                    /*
+                     * Generate a fake join tuple with nulls for the inner
+                     * tuple, and return it if it passes the non-join quals.
+                     */
+                    econtext->ecxt_outertuple = node->hj_OuterTupleSlot;
+                    econtext->ecxt_innertuple = node->hj_NullInnerTupleSlot;
+
+                    if (otherqual == NULL || ExecQual(otherqual, econtext))
+                        return ExecProject(node->js.ps.ps_ProjInfo);
+                    else
+                        InstrCountFiltered2(node, 1);
+
+                    ResetExprContext(econtext);
+
+                    /* allow this loop to be cancellable */
+                    CHECK_FOR_INTERRUPTS();
+                }
+
+                /* We don't need the tuplestore any more, so discard it. */
+                tuplestore_end(node->hj_NullOuterTupleStore);
+                node->hj_NullOuterTupleStore = NULL;
+
+                /* Fill inner tuples too if it's a full join, else advance. */
+                if (hashNode->null_tuple_store)
+                    node->hj_JoinState = HJ_FILL_INNER_NULL_TUPLES;
+                else
+                    node->hj_JoinState = HJ_NEED_NEW_BATCH;
+                break;
+
+            case HJ_FILL_INNER_NULL_TUPLES:
+
+                /*
+                 * We have finished a batch, but we are doing
+                 * right/right-anti/full join, so any null-keyed inner tuples
+                 * have to be emitted before we continue to the next batch.
+                 *
+                 * (We could delay this till the end of the join, but there
+                 * seems little percentage in that.)
+                 */
+                while (tuplestore_gettupleslot(hashNode->null_tuple_store,
+                                               true, false,
+                                               node->hj_HashTupleSlot))
+                {
+                    /*
+                     * Generate a fake join tuple with nulls for the outer
+                     * tuple, and return it if it passes the non-join quals.
+                     */
+                    econtext->ecxt_outertuple = node->hj_NullOuterTupleSlot;
+                    econtext->ecxt_innertuple = node->hj_HashTupleSlot;
+
+                    if (otherqual == NULL || ExecQual(otherqual, econtext))
+                        return ExecProject(node->js.ps.ps_ProjInfo);
+                    else
+                        InstrCountFiltered2(node, 1);
+
+                    ResetExprContext(econtext);
+
+                    /* allow this loop to be cancellable */
+                    CHECK_FOR_INTERRUPTS();
+                }
+
+                /*
+                 * Ideally we'd discard the tuplestore now, but we can't
+                 * because we might need it for rescans.
+                 */
+
+                /* Now we can advance to the next batch. */
+                node->hj_JoinState = HJ_NEED_NEW_BATCH;
+                break;
+
             case HJ_NEED_NEW_BATCH:

                 /*
@@ -831,10 +944,7 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)

         /*
          * Build ExprStates to obtain hash values for either side of the join.
-         * This must be done here as ExecBuildHash32Expr needs to know how to
-         * handle NULL inputs and the required handling of that depends on the
-         * jointype.  We don't know the join type in ExecInitHash() and we
-         * must build the ExprStates before ExecHashTableCreate() so we
+         * Note: must build the ExprStates before ExecHashTableCreate() so we
          * properly attribute any SubPlans that exist in the hash expressions
          * to the correct PlanState.
          */
@@ -846,7 +956,7 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)

         /*
          * Determine the hash function for each side of the join for the given
-         * hash operator.
+         * join operator, and detect whether the join operator is strict.
          */
         foreach(lc, node->hashoperators)
         {
@@ -864,11 +974,7 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)

         /*
          * Build an ExprState to generate the hash value for the expressions
-         * on the outer of the join.  This ExprState must finish generating
-         * the hash value when HJ_FILL_OUTER() is true.  Otherwise,
-         * ExecBuildHash32Expr will set up the ExprState to abort early if it
-         * finds a NULL.  In these cases, we don't need to store these tuples
-         * in the hash table as the jointype does not require it.
+         * on the outer side of the join.
          */
         hjstate->hj_OuterHash =
             ExecBuildHash32Expr(hjstate->js.ps.ps_ResultTupleDesc,
@@ -878,8 +984,7 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
                                 node->hashkeys,
                                 hash_strict,
                                 &hjstate->js.ps,
-                                0,
-                                HJ_FILL_OUTER(hjstate));
+                                0);

         /* As above, but for the inner side of the join */
         hashstate->hash_expr =
@@ -890,8 +995,11 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
                                 hash->hashkeys,
                                 hash_strict,
                                 &hashstate->ps,
-                                0,
-                                HJ_FILL_INNER(hjstate));
+                                0);
+
+        /* Remember whether we need to save tuples with null join keys */
+        hjstate->hj_KeepNullTuples = HJ_FILL_OUTER(hjstate);
+        hashstate->keep_null_tuples = HJ_FILL_INNER(hjstate);

         /*
          * Set up the skew table hash function while we have a record of the
@@ -924,6 +1032,7 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
      * initialize hash-specific info
      */
     hjstate->hj_HashTable = NULL;
+    hjstate->hj_NullOuterTupleStore = NULL;
     hjstate->hj_FirstOuterTupleSlot = NULL;

     hjstate->hj_CurHashValue = 0;
@@ -947,6 +1056,23 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
 void
 ExecEndHashJoin(HashJoinState *node)
 {
+    HashState  *hashNode = castNode(HashState, innerPlanState(node));
+
+    /*
+     * Free tuple stores if we made them (must do this before
+     * ExecHashTableDestroy deletes hashCxt)
+     */
+    if (node->hj_NullOuterTupleStore)
+    {
+        tuplestore_end(node->hj_NullOuterTupleStore);
+        node->hj_NullOuterTupleStore = NULL;
+    }
+    if (hashNode->null_tuple_store)
+    {
+        tuplestore_end(hashNode->null_tuple_store);
+        hashNode->null_tuple_store = NULL;
+    }
+
     /*
      * Free hash table
      */
@@ -1015,11 +1141,19 @@ ExecHashJoinOuterGetTuple(PlanState *outerNode,

             if (!isnull)
             {
+                /* normal case with a non-null join key */
                 /* remember outer relation is not empty for possible rescan */
                 hjstate->hj_OuterNotEmpty = true;

                 return slot;
             }
+            else if (hjstate->hj_KeepNullTuples)
+            {
+                /* null join key, but we must save tuple to be emitted later */
+                if (hjstate->hj_NullOuterTupleStore == NULL)
+                    hjstate->hj_NullOuterTupleStore = ExecHashBuildNullTupleStore(hashtable);
+                tuplestore_puttupleslot(hjstate->hj_NullOuterTupleStore, slot);
+            }

             /*
              * That tuple couldn't match because of a NULL, so discard it and
@@ -1087,7 +1221,17 @@ ExecParallelHashJoinOuterGetTuple(PlanState *outerNode,
                                                                   &isnull));

             if (!isnull)
+            {
+                /* normal case with a non-null join key */
                 return slot;
+            }
+            else if (hjstate->hj_KeepNullTuples)
+            {
+                /* null join key, but we must save tuple to be emitted later */
+                if (hjstate->hj_NullOuterTupleStore == NULL)
+                    hjstate->hj_NullOuterTupleStore = ExecHashBuildNullTupleStore(hashtable);
+                tuplestore_puttupleslot(hjstate->hj_NullOuterTupleStore, slot);
+            }

             /*
              * That tuple couldn't match because of a NULL, so discard it and
@@ -1496,6 +1640,17 @@ ExecReScanHashJoin(HashJoinState *node)
     PlanState  *outerPlan = outerPlanState(node);
     PlanState  *innerPlan = innerPlanState(node);

+    /*
+     * We're always going to rescan the outer rel, so drop the associated
+     * null-keys tuplestore; we'll rebuild it during the rescan.  (Must do
+     * this before ExecHashTableDestroy deletes hashCxt.)
+     */
+    if (node->hj_NullOuterTupleStore)
+    {
+        tuplestore_end(node->hj_NullOuterTupleStore);
+        node->hj_NullOuterTupleStore = NULL;
+    }
+
     /*
      * In a multi-batch join, we currently have to do rescans the hard way,
      * primarily because batch temp files may have already been released. But
@@ -1505,6 +1660,10 @@ ExecReScanHashJoin(HashJoinState *node)
      */
     if (node->hj_HashTable != NULL)
     {
+        HashState  *hashNode = castNode(HashState, innerPlan);
+
+        Assert(hashNode->hashtable == node->hj_HashTable);
+
         if (node->hj_HashTable->nbatch == 1 &&
             innerPlan->chgParam == NULL)
         {
@@ -1529,15 +1688,20 @@ ExecReScanHashJoin(HashJoinState *node)
              */
             node->hj_OuterNotEmpty = false;

+            /*
+             * Also, rewind inner null-key tuplestore so that we can return
+             * those tuples again.
+             */
+            if (hashNode->null_tuple_store)
+                tuplestore_rescan(hashNode->null_tuple_store);
+
             /* ExecHashJoin can skip the BUILD_HASHTABLE step */
             node->hj_JoinState = HJ_NEED_NEW_OUTER;
         }
         else
         {
             /* must destroy and rebuild hash table */
-            HashState  *hashNode = castNode(HashState, innerPlan);

-            Assert(hashNode->hashtable == node->hj_HashTable);
             /* accumulate stats from old hash table, if wanted */
             /* (this should match ExecShutdownHash) */
             if (hashNode->ps.instrument && !hashNode->hinstrument)
@@ -1546,6 +1710,14 @@ ExecReScanHashJoin(HashJoinState *node)
             if (hashNode->hinstrument)
                 ExecHashAccumInstrumentation(hashNode->hinstrument,
                                              hashNode->hashtable);
+
+            /* free inner null-key tuplestore before ExecHashTableDestroy */
+            if (hashNode->null_tuple_store)
+            {
+                tuplestore_end(hashNode->null_tuple_store);
+                hashNode->null_tuple_store = NULL;
+            }
+
             /* for safety, be sure to clear child plan node's pointer too */
             hashNode->hashtable = NULL;

@@ -1601,7 +1773,6 @@ ExecParallelHashJoinPartitionOuter(HashJoinState *hjstate)
     ExprContext *econtext = hjstate->js.ps.ps_ExprContext;
     HashJoinTable hashtable = hjstate->hj_HashTable;
     TupleTableSlot *slot;
-    uint32        hashvalue;
     int            i;

     Assert(hjstate->hj_FirstOuterTupleSlot == NULL);
@@ -1610,6 +1781,7 @@ ExecParallelHashJoinPartitionOuter(HashJoinState *hjstate)
     for (;;)
     {
         bool        isnull;
+        uint32        hashvalue;

         slot = ExecProcNode(outerState);
         if (TupIsNull(slot))
@@ -1624,6 +1796,7 @@ ExecParallelHashJoinPartitionOuter(HashJoinState *hjstate)

         if (!isnull)
         {
+            /* normal case with a non-null join key */
             int            batchno;
             int            bucketno;
             bool        shouldFree;
@@ -1637,6 +1810,15 @@ ExecParallelHashJoinPartitionOuter(HashJoinState *hjstate)
             if (shouldFree)
                 heap_free_minimal_tuple(mintup);
         }
+        else if (hjstate->hj_KeepNullTuples)
+        {
+            /* null join key, but we must save tuple to be emitted later */
+            if (hjstate->hj_NullOuterTupleStore == NULL)
+                hjstate->hj_NullOuterTupleStore = ExecHashBuildNullTupleStore(hashtable);
+            tuplestore_puttupleslot(hjstate->hj_NullOuterTupleStore, slot);
+        }
+        /* else we can just discard the tuple immediately */
+
         CHECK_FOR_INTERRUPTS();
     }

@@ -1715,6 +1897,7 @@ ExecHashJoinReInitializeDSM(HashJoinState *state, ParallelContext *pcxt)
 {
     int            plan_node_id = state->js.ps.plan->plan_node_id;
     ParallelHashJoinState *pstate;
+    HashState  *hashNode;

     /* Nothing to do if we failed to create a DSM segment. */
     if (pcxt->seg == NULL)
@@ -1744,6 +1927,20 @@ ExecHashJoinReInitializeDSM(HashJoinState *state, ParallelContext *pcxt)
     /* Clear any shared batch files. */
     SharedFileSetDeleteAll(&pstate->fileset);

+    /* We'd better clear our local null-key tuplestores, too. */
+    if (state->hj_NullOuterTupleStore)
+    {
+        tuplestore_end(state->hj_NullOuterTupleStore);
+        state->hj_NullOuterTupleStore = NULL;
+    }
+    hashNode = (HashState *) innerPlanState(state);
+    if (hashNode->null_tuple_store)
+    {
+        tuplestore_end(hashNode->null_tuple_store);
+        hashNode->null_tuple_store = NULL;
+    }
+
+
     /* Reset build_barrier to PHJ_BUILD_ELECT so we can go around again. */
     BarrierInit(&pstate->build_barrier, 0);
 }
diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index c9aecab8d66..0e19ecc2f8d 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -1152,6 +1152,38 @@ tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
     }
 }

+/*
+ * tuplestore_gettupleslot_force - exported function to fetch a tuple
+ *
+ * This is identical to tuplestore_gettupleslot except the given slot can be
+ * any kind of slot; it need not be one that will accept a MinimalTuple.
+ */
+bool
+tuplestore_gettupleslot_force(Tuplestorestate *state, bool forward,
+                              bool copy, TupleTableSlot *slot)
+{
+    MinimalTuple tuple;
+    bool        should_free;
+
+    tuple = (MinimalTuple) tuplestore_gettuple(state, forward, &should_free);
+
+    if (tuple)
+    {
+        if (copy && !should_free)
+        {
+            tuple = heap_copy_minimal_tuple(tuple, 0);
+            should_free = true;
+        }
+        ExecForceStoreMinimalTuple(tuple, slot, should_free);
+        return true;
+    }
+    else
+    {
+        ExecClearTuple(slot);
+        return false;
+    }
+}
+
 /*
  * tuplestore_advance - exported function to adjust position without fetching
  *
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index ae99407db89..24010a7fdd6 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -366,7 +366,7 @@ extern ExprState *ExecBuildHash32Expr(TupleDesc desc,
                                       const List *collations,
                                       const List *hash_exprs,
                                       const bool *opstrict, PlanState *parent,
-                                      uint32 init_value, bool keep_nulls);
+                                      uint32 init_value);
 extern ExprState *ExecBuildGroupingEqual(TupleDesc ldesc, TupleDesc rdesc,
                                          const TupleTableSlotOps *lops, const TupleTableSlotOps *rops,
                                          int numCols,
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index ecff4842fd3..5f59b61f671 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -68,6 +68,15 @@
  * inner batch file.  Subsequently, while reading either inner or outer batch
  * files, we might find tuples that no longer belong to the current batch;
  * if so, we just dump them out to the correct batch file.
+ *
+ * If an input tuple has a null join key, then it cannot match anything from
+ * the other side of the join.  Normally we can just discard such a tuple
+ * immediately, but if it comes from the outer side of an outer join then we
+ * must emit it with null-extension of the other side.  For various reasons
+ * it's not convenient to do that immediately on seeing the tuple, so we dump
+ * the tuple into a tuplestore and emit it later.  (In the unlikely but
+ * supported case of a non-strict join operator, we treat null keys as normal
+ * data.)
  * ----------------------------------------------------------------
  */

diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index 3c1a09415aa..55b89febd1a 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -64,6 +64,7 @@ extern void ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
                                     int *numbatches,
                                     int *num_skew_mcvs);
 extern int    ExecHashGetSkewBucket(HashJoinTable hashtable, uint32 hashvalue);
+extern Tuplestorestate *ExecHashBuildNullTupleStore(HashJoinTable hashtable);
 extern void ExecHashEstimate(HashState *node, ParallelContext *pcxt);
 extern void ExecHashInitializeDSM(HashState *node, ParallelContext *pcxt);
 extern void ExecHashInitializeWorker(HashState *node, ParallelWorkerContext *pwcxt);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 5b6cadb5a6c..d31c6c2e59e 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2239,8 +2239,11 @@ typedef struct MergeJoinState
  *        hj_NullOuterTupleSlot    prepared null tuple for right/right-anti/full
  *                                outer joins
  *        hj_NullInnerTupleSlot    prepared null tuple for left/full outer joins
+ *        hj_NullOuterTupleStore    tuplestore holding outer tuples that have
+ *                                null join keys (but must be emitted anyway)
  *        hj_FirstOuterTupleSlot    first tuple retrieved from outer plan
  *        hj_JoinState            current state of ExecHashJoin state machine
+ *        hj_KeepNullTuples        true to keep outer tuples with null join keys
  *        hj_MatchedOuter            true if found a join match for current outer
  *        hj_OuterNotEmpty        true if outer relation known not empty
  * ----------------
@@ -2264,8 +2267,10 @@ typedef struct HashJoinState
     TupleTableSlot *hj_HashTupleSlot;
     TupleTableSlot *hj_NullOuterTupleSlot;
     TupleTableSlot *hj_NullInnerTupleSlot;
+    Tuplestorestate *hj_NullOuterTupleStore;
     TupleTableSlot *hj_FirstOuterTupleSlot;
     int            hj_JoinState;
+    bool        hj_KeepNullTuples;
     bool        hj_MatchedOuter;
     bool        hj_OuterNotEmpty;
 } HashJoinState;
@@ -2815,6 +2820,9 @@ typedef struct HashState
     FmgrInfo   *skew_hashfunction;    /* lookup data for skew hash function */
     Oid            skew_collation; /* collation to call skew_hashfunction with */

+    Tuplestorestate *null_tuple_store;    /* where to put null-keyed tuples */
+    bool        keep_null_tuples;    /* do we need to save such tuples? */
+
     /*
      * In a parallelized hash join, the leader retains a pointer to the
      * shared-memory stats area in its shared_info field, and then copies the
diff --git a/src/include/utils/tuplestore.h b/src/include/utils/tuplestore.h
index 865ba7b8265..b9e152c1701 100644
--- a/src/include/utils/tuplestore.h
+++ b/src/include/utils/tuplestore.h
@@ -73,6 +73,9 @@ extern bool tuplestore_in_memory(Tuplestorestate *state);
 extern bool tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
                                     bool copy, TupleTableSlot *slot);

+extern bool tuplestore_gettupleslot_force(Tuplestorestate *state, bool forward,
+                                          bool copy, TupleTableSlot *slot);
+
 extern bool tuplestore_advance(Tuplestorestate *state, bool forward);

 extern bool tuplestore_skiptuples(Tuplestorestate *state,
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index f35a0b18c37..334d38b1052 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -4590,7 +4590,7 @@ order by fault;
 explain (costs off)
 select * from
 (values (1, array[10,20]), (2, array[20,30])) as v1(v1x,v1ys)
-left join (values (1, 10), (2, 20)) as v2(v2x,v2y) on v2x = v1x
+left join (values (1, 10), (2, 20), (2, null)) as v2(v2x,v2y) on v2x = v1x
 left join unnest(v1ys) as u1(u1y) on u1y = v2y;
                          QUERY PLAN
 -------------------------------------------------------------
@@ -4606,13 +4606,14 @@ left join unnest(v1ys) as u1(u1y) on u1y = v2y;

 select * from
 (values (1, array[10,20]), (2, array[20,30])) as v1(v1x,v1ys)
-left join (values (1, 10), (2, 20)) as v2(v2x,v2y) on v2x = v1x
+left join (values (1, 10), (2, 20), (2, null)) as v2(v2x,v2y) on v2x = v1x
 left join unnest(v1ys) as u1(u1y) on u1y = v2y;
  v1x |  v1ys   | v2x | v2y | u1y
 -----+---------+-----+-----+-----
    1 | {10,20} |   1 |  10 |  10
    2 | {20,30} |   2 |  20 |  20
-(2 rows)
+   2 | {20,30} |   2 |     |
+(3 rows)

 --
 -- test handling of potential equivalence clauses above outer joins
diff --git a/src/test/regress/expected/join_hash.out b/src/test/regress/expected/join_hash.out
index 4fc34a0e72a..3df9f653d35 100644
--- a/src/test/regress/expected/join_hash.out
+++ b/src/test/regress/expected/join_hash.out
@@ -53,6 +53,7 @@ $$;
 -- estimated size.
 create table simple as
   select generate_series(1, 20000) AS id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+insert into simple values (null, null);
 alter table simple set (parallel_workers = 2);
 analyze simple;
 -- Make a relation whose size we will under-estimate.  We want stats
@@ -308,7 +309,7 @@ $$);
 select count(*) from simple r full outer join simple s using (id);
  count
 -------
- 20000
+ 20002
 (1 row)

 rollback to settings;
@@ -786,7 +787,7 @@ explain (costs off)
 select  count(*) from simple r full outer join simple s using (id);
  count
 -------
- 20000
+ 20002
 (1 row)

 rollback to settings;
@@ -809,7 +810,7 @@ explain (costs off)
 select  count(*) from simple r full outer join simple s using (id);
  count
 -------
- 20000
+ 20002
 (1 row)

 rollback to settings;
@@ -834,7 +835,7 @@ explain (costs off)
 select  count(*) from simple r full outer join simple s using (id);
  count
 -------
- 20000
+ 20002
 (1 row)

 rollback to settings;
@@ -857,7 +858,7 @@ explain (costs off)
 select  count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
  count
 -------
- 40000
+ 40002
 (1 row)

 rollback to settings;
@@ -880,7 +881,7 @@ explain (costs off)
 select  count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
  count
 -------
- 40000
+ 40002
 (1 row)

 rollback to settings;
@@ -905,7 +906,7 @@ explain (costs off)
 select  count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
  count
 -------
- 40000
+ 40002
 (1 row)

 rollback to settings;
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index cc5128add4d..c4946d39e77 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -1554,12 +1554,12 @@ order by fault;
 explain (costs off)
 select * from
 (values (1, array[10,20]), (2, array[20,30])) as v1(v1x,v1ys)
-left join (values (1, 10), (2, 20)) as v2(v2x,v2y) on v2x = v1x
+left join (values (1, 10), (2, 20), (2, null)) as v2(v2x,v2y) on v2x = v1x
 left join unnest(v1ys) as u1(u1y) on u1y = v2y;

 select * from
 (values (1, array[10,20]), (2, array[20,30])) as v1(v1x,v1ys)
-left join (values (1, 10), (2, 20)) as v2(v2x,v2y) on v2x = v1x
+left join (values (1, 10), (2, 20), (2, null)) as v2(v2x,v2y) on v2x = v1x
 left join unnest(v1ys) as u1(u1y) on u1y = v2y;

 --
diff --git a/src/test/regress/sql/join_hash.sql b/src/test/regress/sql/join_hash.sql
index 6b0688ab0a6..11e3a164c76 100644
--- a/src/test/regress/sql/join_hash.sql
+++ b/src/test/regress/sql/join_hash.sql
@@ -57,6 +57,7 @@ $$;
 -- estimated size.
 create table simple as
   select generate_series(1, 20000) AS id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+insert into simple values (null, null);
 alter table simple set (parallel_workers = 2);
 analyze simple;

--
2.43.5


Re: Improve hash join's handling of tuples with null join keys

From: Tomas Vondra
Date:
On 5/6/25 01:11, Tom Lane wrote:
> The attached patch is a response to the discussion at [1], where
> it emerged that lots of rows with null join keys can send a hash
> join into too-many-batches hell, if they are on the outer side
> of the join so that they must be null-extended not just discarded.
> This isn't really surprising given that such rows will certainly
> end up in the same hash bucket, and no amount of splitting can
> reduce the size of that bucket.  (I'm a bit surprised that the
> growEnabled heuristic didn't kick in, but it seems it didn't,
> at least not up to several million batches.)
> 

I don't think that's too surprising - growEnabled depends on all tuples
getting into the same batch during a split. But even if there are many
duplicate values, real-world data sets often have a couple more tuples
that just happen to fall into that bucket too. And then during the split
some get into one batch and some get into another.

My personal experience is that the growEnabled heuristic is overly
sensitive, and probably does not trigger very often. It can also trigger
too early, but that's (much) harder to hit.

I have suggested making growEnabled less strict in [2], i.e. calculating
the threshold as a percentage of the batch and not disabling growth
permanently. But it was orthogonal to what that thread did.

But more importantly, wasn't the issue discussed in [1] about parallel
hash joins? I got quite confused while reading the thread ... I'm asking
because growEnabled is checked only in ExecHashIncreaseNumBatches, not
in ExecParallelHashIncreaseNumBatches. So AFAICS the parallel hash joins
don't use growEnabled at all, no?

[2]
https://www.postgresql.org/message-id/7bed6c08-72a0-4ab9-a79c-e01fcdd0940f%40vondra.me
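
The (non-parallel) check I mean is roughly this bit at the end of
ExecHashIncreaseNumBatches (paraphrasing nodeHash.c from memory, not
quoting it exactly):

    if (nfreed == 0 || nfreed == ninmemory)
    {
        /*
         * Repartitioning moved either none or all of the in-memory
         * tuples to the new batch, so further splitting looks futile:
         * permanently disable nbatch growth for this hash table.
         */
        hashtable->growEnabled = false;
    }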


> Thinking about that, it occurred to me to wonder why we are putting
> null-keyed tuples into the hash table at all.  They cannot match
> anything, so all we really have to do with them is emit one
> null-extended copy.  Awhile later I had the attached, which shoves
> such rows into a tuplestore that's separate from the hash table
> proper, ensuring that they can't bollix our algorithms for when to
> grow the hash table.  (For tuples coming from the right input, we
> need to use a tuplestore in case we're asked to rescan the existing
> hashtable.  For tuples coming from the left input, we could
> theoretically emit 'em and forget 'em immediately, but that'd require
> some major code restructuring so I decided to just use a tuplestore
> there too.)
> 
> This passes check-world, and I've extended a couple of existing test
> cases to ensure that the new code paths are exercised.  I've not done
> any real performance testing, though.
> 

Are you planning to? If not, I can try to collect some numbers, but I
can't promise that before pgconf.dev.

I'd be surprised if this was a regression; the hash table lookups are
not exactly free. And even if it was a minor regression, it'd affect
only cases with many NULL keys, but it improves robustness.

BTW do you consider this to be a bugfix for PG18? Or would it have to
wait for PG19 at this point?


regards

-- 
Tomas Vondra




Re: Improve hash join's handling of tuples with null join keys

From: Tom Lane
Date:
Tomas Vondra <tomas@vondra.me> writes:
> My personal experience is that the growEnabled heuristics is overly
> sensitive, and probably does not trigger very often.

Yeah, it would be good to make it not quite all-or-nothing.

> But more importantly, wasn't the issue discussed in [1] about parallel
> hash joins?

I'm not clear on that either; it seemed that the OP was able to
trigger it in some non-parallel cases too.  But we don't have a
reproducer so I can't say for sure.  Building a reproducer would
be a useful exercise for testing this.  There might well be some
parallel-specific misbehavior that would be worth ameliorating
independently of this work, in case of a lot of non-null duplicate
keys.

>> This passes check-world, and I've extended a couple of existing test
>> cases to ensure that the new code paths are exercised.  I've not done
>> any real performance testing, though.

> Are you planning to? If not, I can try to collect some numbers, but I
> can't promise that before pgconf.dev.

If you have time after the conference, please feel free.

> BTW do you consider this to be a bugfix for PG18? Or would it have to
> wait for PG19 at this point?

This has been like this forever I suspect --- certainly for as long
as we've had PHJ, and probably longer.  So I'm seeing it as new work
for v19, not something we'd attempt to back-patch.

            regards, tom lane



Re: Improve hash join's handling of tuples with null join keys

From: Thomas Munro
Date:
On Tue, May 6, 2025 at 12:12 PM Tomas Vondra <tomas@vondra.me> wrote:
> On 5/6/25 01:11, Tom Lane wrote:
> > The attached patch is a response to the discussion at [1], where
> > it emerged that lots of rows with null join keys can send a hash
> > join into too-many-batches hell, if they are on the outer side
> > of the join so that they must be null-extended not just discarded.
> > This isn't really surprising given that such rows will certainly
> > end up in the same hash bucket, and no amount of splitting can
> > reduce the size of that bucket.  (I'm a bit surprised that the
> > growEnabled heuristic didn't kick in, but it seems it didn't,
> > at least not up to several million batches.)

Good idea.  I haven't reviewed it properly, but one observation is
that trapping the null-keys tuples in per-worker tuple stores creates
unfairness.  That could be fixed by using a SharedTuplestore instead,
but unfortunately SharedTuplestore always spills to disk at the
moment, so maybe I should think about how to give it some memory for
small sets like regular Tuplestore.  Will look more closely after
Montreal.

> I don't think that's too surprising - growEnabled depends on all tuples
> getting into the same batch during a split. But even if there are many
> duplicate values, real-world data sets often have a couple more tuples
> that just happen to fall into that bucket too. And then during the split
> some get into one batch and some get into another.

Yeah.

> My personal experience is that the growEnabled heuristic is overly
> sensitive, and probably does not trigger very often. It can also trigger
> too early, but that's (much) harder to hit.
>
> I have suggested making growEnabled less strict in [2], i.e. calculating
> the threshold as a percentage of the batch and not disabling growth
> permanently. But it was orthogonal to what that thread did.

+1, I also wrote a thread with a draft patch like that at some point,
which I'll also try to dig up after Montreal.

> But more importantly, wasn't the issue discussed in [1] about parallel
> hash joins? I got quite confused while reading the thread ... I'm asking
> because growEnabled is checked only in ExecHashIncreaseNumBatches, not
> in ExecParallelHashIncreaseNumBatches. So AFAICS the parallel hash joins
> don't use growEnabled at all, no?

There is an equivalent mechanism, but it's slightly more complicated
as it requires consensus (see code near PHJ_GROW_BATCHES_DECIDE).
It's possible that it's not working quite as well as it should.  It's
definitely less deterministic in some edge cases since tuples are
packed into chunks differently so the memory used can vary slightly
run-to-run, but the tuple count should be stable.  I've made a note to
review that logic again too.

Note also that v16 is the first release that could put NULLs in a shared
memory hash table (11c2d6fdf5 enabled Parallel Hash Right|Full Join).
Non-parallel HJ has had that for a long time, but it couldn't be used in
a parallel query, so I guess it's possible that this stuff is coming up
now because, given the plans available in older releases, hash joins
weren't often picked for problems that would generate interesting numbers
of NULLs likely to exceed limits.  See also the related bug fix in
98c7c7152, spotted soon after this plan type escaped into the field.

While thinking about that, I wanted to note that we have more things
to improve in PHRJ: (1) Parallelism of unmatched scan: a short but not
entirely satisfying patch was already shared on the PHRJ thread but
not committed with the main feature.  I already had some inklings of
how to do much better which I recently described in a bit more detail
on the PBHS thread in vapourware form, where parallel fairness came up
again.  "La perfection est le mortel ennemi du bien" or whatever it is
they say in the language of Montreal, but really the easy patch for
unmatched scan parallelism wasn't actually bon enough, because it was
non-deterministic how many processes could participate due to deadlock
avoidance arcana, creating run-to-run variation that I'd expect Tomáš
to find empirically and reject in one of his benchmarking expeditions
:-).  (2) Bogus asymmetries in estimations/planning: I wrote some
analysis of why we don't use PHRJ as much as we could/should near
Richard Guo's work on anti/semi joins which went in around the same
time.  My idea there is to try to debogify the parallel degree logic
more generally, it's just that PHRJ brought key aspects of it into
relief for me, i.e. the bogosity of the rule-based "driving table" concept.
I'll try to write these projects up on the wiki, instead of in random
threads :-)

In other words, if you just use local Tuplestores as you showed, it
would actually be an improvement in fairness over the status quo, due
to (1) not being solved yet... but it will be solved, hence mentioning
it in this context.



Re: Improve hash join's handling of tuples with null join keys

From: Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> On Tue, May 6, 2025 at 12:12 PM Tomas Vondra <tomas@vondra.me> wrote:
>> On 5/6/25 01:11, Tom Lane wrote:
>>> The attached patch is a response to the discussion at [1], where
>>> it emerged that lots of rows with null join keys can send a hash
>>> join into too-many-batches hell, if they are on the outer side
>>> of the join so that they must be null-extended not just discarded.

> Good idea.  I haven't reviewed it properly, but one observation is
> that trapping the null-keys tuples in per-worker tuple stores creates
> unfairness.  That could be fixed by using a SharedTuplestore instead,
> but unfortunately SharedTuplestore always spills to disk at the
> moment, so maybe I should think about how to give it some memory for
> small sets like regular Tuplestore.  Will look more closely after
> Montreal.

Hmm ... I'm unpersuaded that "fairness" is an argument for adding
overhead to the processing of these tuples.  It's very hard to see
how shoving them into a shared tuplestore can beat not shoving them
into a shared tuplestore.  But if you want to poke at that idea,
feel free.

In the meantime, I noticed that my patch was intermittently failing
in CI, and was able to reproduce that locally.  It turns out I'd
missed the point that we might accumulate some null-keyed tuples
into local tuplestores during a parallel HJ_BUILD_HASHTABLE step.
Ordinarily that doesn't matter because we'll dump them anyway at
conclusion of the first batch.  But with the right timing, we might
collect some tuples and yet, by the time we're ready to process a
batch, there are none left to do.  Then the state machine fell out
without ever dumping those tuples.  (For some reason this is way
easier to reproduce under FreeBSD than Linux --- scheduler quirk
I guess.)

v2 attached fixes that, and improves some comments.

            regards, tom lane

From 84db9c67b9fc1b33d12d8249407eeaa578eb33c0 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Mon, 2 Jun 2025 12:30:15 -0400
Subject: [PATCH v2] Improve hash join's handling of tuples with null join
 keys.

In a plain join, we can just summarily discard an input tuple
with null join key(s), since it cannot match anything from
the other side of the join (assuming a strict join operator).
However, if the tuple comes from the outer side of an outer join
then we have to emit it with null-extension of the other side.

Up to now, hash joins did that by inserting the tuple into the hash
table as though it were a normal tuple.  This is unnecessarily
inefficient though, since the required processing is far simpler than
for a potentially-matchable tuple.  Worse, if there are a lot of such
tuples they will bloat the hash bucket they go into, possibly causing
useless repeated attempts to split that bucket or increase the number
of batches.  We have a report of a large join vainly creating many
thousands of batches when faced with such input.

This patch improves the situation by keeping such tuples out of the
hash table altogether, instead pushing them into a separate tuplestore
from which we return them later.  (One might consider trying to return
them immediately; but that would require substantial refactoring, and
it doesn't work anyway for the case where we rescan an unmodified hash
table.)  This works even in parallel hash joins, because whichever
worker reads a null-keyed tuple can just return it; there's no need
for consultation with other workers.  Thus the tuplestores are local
storage even in a parallel join.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/3061845.1746486714@sss.pgh.pa.us
---
 src/backend/executor/execExpr.c         |  22 +-
 src/backend/executor/nodeHash.c         |  66 +++++-
 src/backend/executor/nodeHashjoin.c     | 282 ++++++++++++++++++++----
 src/backend/utils/sort/tuplestore.c     |  32 +++
 src/include/executor/executor.h         |   2 +-
 src/include/executor/hashjoin.h         |   9 +
 src/include/executor/nodeHash.h         |   1 +
 src/include/nodes/execnodes.h           |   8 +
 src/include/utils/tuplestore.h          |   3 +
 src/test/regress/expected/join.out      |   7 +-
 src/test/regress/expected/join_hash.out |  15 +-
 src/test/regress/sql/join.sql           |   4 +-
 src/test/regress/sql/join_hash.sql      |   1 +
 13 files changed, 381 insertions(+), 71 deletions(-)

diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index f1569879b52..07868371aea 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -4282,25 +4282,27 @@ ExecBuildHash32FromAttrs(TupleDesc desc, const TupleTableSlotOps *ops,
  * 'hash_exprs'.  When multiple expressions are present, the hash values
  * returned by each hash function are combined to produce a single hash value.
  *
+ * If any hash_expr yields NULL and the corresponding hash operator is strict,
+ * the created ExprState will return NULL.  (If the operator is not strict,
+ * we treat NULL values as having a hash value of zero.  The hash functions
+ * themselves are always treated as strict.)
+ *
  * desc: tuple descriptor for the to-be-hashed expressions
  * ops: TupleTableSlotOps for the TupleDesc
  * hashfunc_oids: Oid for each hash function to call, one for each 'hash_expr'
- * collations: collation to use when calling the hash function.
- * hash_expr: list of expressions to hash the value of
- * opstrict: array corresponding to the 'hashfunc_oids' to store op_strict()
+ * collations: collation to use when calling the hash function
+ * hash_exprs: list of expressions to hash the value of
+ * opstrict: strictness flag for each hash function's comparison operator
  * parent: PlanState node that the 'hash_exprs' will be evaluated at
  * init_value: Normally 0, but can be set to other values to seed the hash
  * with some other value.  Using non-zero is slightly less efficient but can
  * be useful.
- * keep_nulls: if true, evaluation of the returned ExprState will abort early
- * returning NULL if the given hash function is strict and the Datum to hash
- * is null.  When set to false, any NULL input Datums are skipped.
  */
 ExprState *
 ExecBuildHash32Expr(TupleDesc desc, const TupleTableSlotOps *ops,
                     const Oid *hashfunc_oids, const List *collations,
                     const List *hash_exprs, const bool *opstrict,
-                    PlanState *parent, uint32 init_value, bool keep_nulls)
+                    PlanState *parent, uint32 init_value)
 {
     ExprState  *state = makeNode(ExprState);
     ExprEvalStep scratch = {0};
@@ -4377,8 +4379,8 @@ ExecBuildHash32Expr(TupleDesc desc, const TupleTableSlotOps *ops,
         fmgr_info(funcid, finfo);

         /*
-         * Build the steps to evaluate the hash function's argument have it so
-         * the value of that is stored in the 0th argument of the hash func.
+         * Build the steps to evaluate the hash function's argument, placing
+         * the value in the 0th argument of the hash func.
          */
         ExecInitExprRec(expr,
                         state,
@@ -4413,7 +4415,7 @@ ExecBuildHash32Expr(TupleDesc desc, const TupleTableSlotOps *ops,
         scratch.d.hashdatum.fcinfo_data = fcinfo;
         scratch.d.hashdatum.fn_addr = finfo->fn_addr;

-        scratch.opcode = opstrict[i] && !keep_nulls ? strict_opcode : opcode;
+        scratch.opcode = opstrict[i] ? strict_opcode : opcode;
         scratch.d.hashdatum.jumpdone = -1;

         ExprEvalPushStep(state, &scratch);
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 8d2201ab67f..003814a4d31 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -154,8 +154,11 @@ MultiExecPrivateHash(HashState *node)
     econtext = node->ps.ps_ExprContext;

     /*
-     * Get all tuples from the node below the Hash node and insert into the
-     * hash table (or temp files).
+     * Get all tuples from the node below the Hash node and insert the
+     * potentially-matchable ones into the hash table (or temp files).  Tuples
+     * that can't possibly match because they have null join keys are dumped
+     * into a separate tuplestore, or just summarily discarded if we don't
+     * need to emit them with null-extension.
      */
     for (;;)
     {
@@ -175,6 +178,7 @@ MultiExecPrivateHash(HashState *node)

         if (!isnull)
         {
+            /* normal case with a non-null join key */
             uint32        hashvalue = DatumGetUInt32(hashdatum);
             int            bucketNumber;

@@ -193,6 +197,14 @@ MultiExecPrivateHash(HashState *node)
             }
             hashtable->totalTuples += 1;
         }
+        else if (node->keep_null_tuples)
+        {
+            /* null join key, but we must save tuple to be emitted later */
+            if (node->null_tuple_store == NULL)
+                node->null_tuple_store = ExecHashBuildNullTupleStore(hashtable);
+            tuplestore_puttupleslot(node->null_tuple_store, slot);
+        }
+        /* else we can discard the tuple immediately */
     }

     /* resize the hash table if needed (NTUP_PER_BUCKET exceeded) */
@@ -223,7 +235,6 @@ MultiExecParallelHash(HashState *node)
     HashJoinTable hashtable;
     TupleTableSlot *slot;
     ExprContext *econtext;
-    uint32        hashvalue;
     Barrier    *build_barrier;
     int            i;

@@ -283,6 +294,7 @@ MultiExecParallelHash(HashState *node)
             for (;;)
             {
                 bool        isnull;
+                uint32        hashvalue;

                 slot = ExecProcNode(outerNode);
                 if (TupIsNull(slot))
@@ -296,8 +308,19 @@ MultiExecParallelHash(HashState *node)
                                                                      &isnull));

                 if (!isnull)
+                {
+                    /* normal case with a non-null join key */
                     ExecParallelHashTableInsert(hashtable, slot, hashvalue);
-                hashtable->partialTuples++;
+                    hashtable->partialTuples++;
+                }
+                else if (node->keep_null_tuples)
+                {
+                    /* null join key, but save tuple to be emitted later */
+                    if (node->null_tuple_store == NULL)
+                        node->null_tuple_store = ExecHashBuildNullTupleStore(hashtable);
+                    tuplestore_puttupleslot(node->null_tuple_store, slot);
+                }
+                /* else we can discard the tuple immediately */
             }

             /*
@@ -405,14 +428,10 @@ ExecInitHash(Hash *node, EState *estate, int eflags)

     Assert(node->plan.qual == NIL);

-    /*
-     * Delay initialization of hash_expr until ExecInitHashJoin().  We cannot
-     * build the ExprState here as we don't yet know the join type we're going
-     * to be hashing values for and we need to know that before calling
-     * ExecBuildHash32Expr as the keep_nulls parameter depends on the join
-     * type.
-     */
+    /* these fields will be filled by ExecInitHashJoin() */
     hashstate->hash_expr = NULL;
+    hashstate->null_tuple_store = NULL;
+    hashstate->keep_null_tuples = false;

     return hashstate;
 }
@@ -2748,6 +2767,31 @@ ExecHashRemoveNextSkewBucket(HashJoinTable hashtable)
     }
 }

+/*
+ * Build a tuplestore suitable for holding null-keyed input tuples.
+ * (This function doesn't care whether it's for outer or inner tuples.)
+ *
+ * Note that in a parallel hash join, each worker has its own tuplestore(s)
+ * for these.  There's no need to interact with other workers to decide
+ * what to do with them.  So they're always in private storage.
+ */
+Tuplestorestate *
+ExecHashBuildNullTupleStore(HashJoinTable hashtable)
+{
+    Tuplestorestate *tstore;
+    MemoryContext oldcxt;
+
+    /*
+     * We keep the tuplestore in the hashCxt to ensure it won't go away too
+     * soon.  Size it at work_mem/16 so that it doesn't bloat the node's space
+     * consumption too much.
+     */
+    oldcxt = MemoryContextSwitchTo(hashtable->hashCxt);
+    tstore = tuplestore_begin_heap(false, false, work_mem / 16);
+    MemoryContextSwitchTo(oldcxt);
+    return tstore;
+}
+
 /*
  * Reserve space in the DSM segment for instrumentation data.
  */
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 5661ad76830..3878214f8a2 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -182,7 +182,9 @@
 #define HJ_SCAN_BUCKET            3
 #define HJ_FILL_OUTER_TUPLE        4
 #define HJ_FILL_INNER_TUPLES    5
-#define HJ_NEED_NEW_BATCH        6
+#define HJ_FILL_OUTER_NULL_TUPLES    6
+#define HJ_FILL_INNER_NULL_TUPLES    7
+#define HJ_NEED_NEW_BATCH        8

 /* Returns true if doing null-fill on outer relation */
 #define HJ_FILL_OUTER(hjstate)    ((hjstate)->hj_NullInnerTupleSlot != NULL)
@@ -346,9 +348,16 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
                 /*
                  * If the inner relation is completely empty, and we're not
                  * doing a left outer join, we can quit without scanning the
-                 * outer relation.
+                 * outer relation.  (If the inner relation contains only
+                 * null-keyed tuples that we need to emit, we'll fall through
+                 * and do the outer-relation scan.  In principle we could go
+                 * emit those tuples then quit, but it would complicate the
+                 * state machine logic.  The case seems rare enough to not be
+                 * worth optimizing.)
                  */
-                if (hashtable->totalTuples == 0 && !HJ_FILL_OUTER(node))
+                if (hashtable->totalTuples == 0 &&
+                    hashNode->null_tuple_store == NULL &&
+                    !HJ_FILL_OUTER(node))
                 {
                     if (parallel)
                     {
@@ -395,21 +404,24 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
                             ExecParallelHashJoinPartitionOuter(node);
                         BarrierArriveAndWait(build_barrier,
                                              WAIT_EVENT_HASH_BUILD_HASH_OUTER);
-                    }
-                    else if (BarrierPhase(build_barrier) == PHJ_BUILD_FREE)
-                    {
-                        /*
-                         * If we attached so late that the job is finished and
-                         * the batch state has been freed, we can return
-                         * immediately.
-                         */
-                        return NULL;
+                        Assert(BarrierPhase(build_barrier) == PHJ_BUILD_RUN);
                     }

-                    /* Each backend should now select a batch to work on. */
-                    Assert(BarrierPhase(build_barrier) == PHJ_BUILD_RUN);
+                    /*
+                     * Each backend should now select a batch to work on.
+                     * However, if we've already collected some null-keyed
+                     * tuples, dump them first.  (That is critical when we
+                     * arrive late enough that no more batches are available;
+                     * otherwise we'd fail to dump those tuples at all.)
+                     */
                     hashtable->curbatch = -1;
-                    node->hj_JoinState = HJ_NEED_NEW_BATCH;
+
+                    if (node->hj_NullOuterTupleStore)
+                        node->hj_JoinState = HJ_FILL_OUTER_NULL_TUPLES;
+                    else if (hashNode->null_tuple_store)
+                        node->hj_JoinState = HJ_FILL_INNER_NULL_TUPLES;
+                    else
+                        node->hj_JoinState = HJ_NEED_NEW_BATCH;

                     continue;
                 }
@@ -440,12 +452,17 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
                         if (parallel)
                         {
                             /*
-                             * Only one process is currently allow to handle
+                             * Only one process is currently allowed to handle
                              * each batch's unmatched tuples, in a parallel
-                             * join.
+                             * join.  However, each process must deal with any
+                             * null-keyed tuples it found.
                              */
                             if (ExecParallelPrepHashTableForUnmatched(node))
                                 node->hj_JoinState = HJ_FILL_INNER_TUPLES;
+                            else if (node->hj_NullOuterTupleStore)
+                                node->hj_JoinState = HJ_FILL_OUTER_NULL_TUPLES;
+                            else if (hashNode->null_tuple_store)
+                                node->hj_JoinState = HJ_FILL_INNER_NULL_TUPLES;
                             else
                                 node->hj_JoinState = HJ_NEED_NEW_BATCH;
                         }
@@ -456,7 +473,14 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
                         }
                     }
                     else
-                        node->hj_JoinState = HJ_NEED_NEW_BATCH;
+                    {
+                        /* might have outer null-keyed tuples to fill */
+                        Assert(hashNode->null_tuple_store == NULL);
+                        if (node->hj_NullOuterTupleStore)
+                            node->hj_JoinState = HJ_FILL_OUTER_NULL_TUPLES;
+                        else
+                            node->hj_JoinState = HJ_NEED_NEW_BATCH;
+                    }
                     continue;
                 }

@@ -632,8 +656,13 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
                 if (!(parallel ? ExecParallelScanHashTableForUnmatched(node, econtext)
                       : ExecScanHashTableForUnmatched(node, econtext)))
                 {
-                    /* no more unmatched tuples */
-                    node->hj_JoinState = HJ_NEED_NEW_BATCH;
+                    /* no more unmatched tuples, but maybe there are nulls */
+                    if (node->hj_NullOuterTupleStore)
+                        node->hj_JoinState = HJ_FILL_OUTER_NULL_TUPLES;
+                    else if (hashNode->null_tuple_store)
+                        node->hj_JoinState = HJ_FILL_INNER_NULL_TUPLES;
+                    else
+                        node->hj_JoinState = HJ_NEED_NEW_BATCH;
                     continue;
                 }

@@ -649,6 +678,93 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
                     InstrCountFiltered2(node, 1);
                 break;

+            case HJ_FILL_OUTER_NULL_TUPLES:
+
+                /*
+                 * We have finished a batch, but we are doing left/full join,
+                 * so any null-keyed outer tuples have to be emitted before we
+                 * continue to the next batch.
+                 *
+                 * (We could delay this till the end of the join, but there
+                 * seems little percentage in that.)
+                 *
+                 * We have to use tuplestore_gettupleslot_force because
+                 * hj_OuterTupleSlot may not be able to store a MinimalTuple.
+                 */
+                while (tuplestore_gettupleslot_force(node->hj_NullOuterTupleStore,
+                                                     true, false,
+                                                     node->hj_OuterTupleSlot))
+                {
+                    /*
+                     * Generate a fake join tuple with nulls for the inner
+                     * tuple, and return it if it passes the non-join quals.
+                     */
+                    econtext->ecxt_outertuple = node->hj_OuterTupleSlot;
+                    econtext->ecxt_innertuple = node->hj_NullInnerTupleSlot;
+
+                    if (otherqual == NULL || ExecQual(otherqual, econtext))
+                        return ExecProject(node->js.ps.ps_ProjInfo);
+                    else
+                        InstrCountFiltered2(node, 1);
+
+                    ResetExprContext(econtext);
+
+                    /* allow this loop to be cancellable */
+                    CHECK_FOR_INTERRUPTS();
+                }
+
+                /* We don't need the tuplestore any more, so discard it. */
+                tuplestore_end(node->hj_NullOuterTupleStore);
+                node->hj_NullOuterTupleStore = NULL;
+
+                /* Fill inner tuples too if it's a full join, else advance. */
+                if (hashNode->null_tuple_store)
+                    node->hj_JoinState = HJ_FILL_INNER_NULL_TUPLES;
+                else
+                    node->hj_JoinState = HJ_NEED_NEW_BATCH;
+                break;
+
+            case HJ_FILL_INNER_NULL_TUPLES:
+
+                /*
+                 * We have finished a batch, but we are doing
+                 * right/right-anti/full join, so any null-keyed inner tuples
+                 * have to be emitted before we continue to the next batch.
+                 *
+                 * (We could delay this till the end of the join, but there
+                 * seems little percentage in that.)
+                 */
+                while (tuplestore_gettupleslot(hashNode->null_tuple_store,
+                                               true, false,
+                                               node->hj_HashTupleSlot))
+                {
+                    /*
+                     * Generate a fake join tuple with nulls for the outer
+                     * tuple, and return it if it passes the non-join quals.
+                     */
+                    econtext->ecxt_outertuple = node->hj_NullOuterTupleSlot;
+                    econtext->ecxt_innertuple = node->hj_HashTupleSlot;
+
+                    if (otherqual == NULL || ExecQual(otherqual, econtext))
+                        return ExecProject(node->js.ps.ps_ProjInfo);
+                    else
+                        InstrCountFiltered2(node, 1);
+
+                    ResetExprContext(econtext);
+
+                    /* allow this loop to be cancellable */
+                    CHECK_FOR_INTERRUPTS();
+                }
+
+                /*
+                 * Ideally we'd discard the tuplestore now, but we can't
+                 * because we might need it for rescans.
+                 */
+
+                /* Now we can advance to the next batch. */
+                node->hj_JoinState = HJ_NEED_NEW_BATCH;
+                break;
+
             case HJ_NEED_NEW_BATCH:

                 /*
@@ -831,10 +947,7 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)

         /*
          * Build ExprStates to obtain hash values for either side of the join.
-         * This must be done here as ExecBuildHash32Expr needs to know how to
-         * handle NULL inputs and the required handling of that depends on the
-         * jointype.  We don't know the join type in ExecInitHash() and we
-         * must build the ExprStates before ExecHashTableCreate() so we
+         * Note: must build the ExprStates before ExecHashTableCreate() so we
          * properly attribute any SubPlans that exist in the hash expressions
          * to the correct PlanState.
          */
@@ -846,7 +959,7 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)

         /*
          * Determine the hash function for each side of the join for the given
-         * hash operator.
+         * join operator, and detect whether the join operator is strict.
          */
         foreach(lc, node->hashoperators)
         {
@@ -864,11 +977,7 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)

         /*
          * Build an ExprState to generate the hash value for the expressions
-         * on the outer of the join.  This ExprState must finish generating
-         * the hash value when HJ_FILL_OUTER() is true.  Otherwise,
-         * ExecBuildHash32Expr will set up the ExprState to abort early if it
-         * finds a NULL.  In these cases, we don't need to store these tuples
-         * in the hash table as the jointype does not require it.
+         * on the outer side of the join.
          */
         hjstate->hj_OuterHash =
             ExecBuildHash32Expr(hjstate->js.ps.ps_ResultTupleDesc,
@@ -878,8 +987,7 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
                                 node->hashkeys,
                                 hash_strict,
                                 &hjstate->js.ps,
-                                0,
-                                HJ_FILL_OUTER(hjstate));
+                                0);

         /* As above, but for the inner side of the join */
         hashstate->hash_expr =
@@ -890,8 +998,11 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
                                 hash->hashkeys,
                                 hash_strict,
                                 &hashstate->ps,
-                                0,
-                                HJ_FILL_INNER(hjstate));
+                                0);
+
+        /* Remember whether we need to save tuples with null join keys */
+        hjstate->hj_KeepNullTuples = HJ_FILL_OUTER(hjstate);
+        hashstate->keep_null_tuples = HJ_FILL_INNER(hjstate);

         /*
          * Set up the skew table hash function while we have a record of the
@@ -924,6 +1035,7 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
      * initialize hash-specific info
      */
     hjstate->hj_HashTable = NULL;
+    hjstate->hj_NullOuterTupleStore = NULL;
     hjstate->hj_FirstOuterTupleSlot = NULL;

     hjstate->hj_CurHashValue = 0;
@@ -947,6 +1059,23 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
 void
 ExecEndHashJoin(HashJoinState *node)
 {
+    HashState  *hashNode = castNode(HashState, innerPlanState(node));
+
+    /*
+     * Free tuple stores if we made them (must do this before
+     * ExecHashTableDestroy deletes hashCxt)
+     */
+    if (node->hj_NullOuterTupleStore)
+    {
+        tuplestore_end(node->hj_NullOuterTupleStore);
+        node->hj_NullOuterTupleStore = NULL;
+    }
+    if (hashNode->null_tuple_store)
+    {
+        tuplestore_end(hashNode->null_tuple_store);
+        hashNode->null_tuple_store = NULL;
+    }
+
     /*
      * Free hash table
      */
@@ -1015,11 +1144,19 @@ ExecHashJoinOuterGetTuple(PlanState *outerNode,

             if (!isnull)
             {
+                /* normal case with a non-null join key */
                 /* remember outer relation is not empty for possible rescan */
                 hjstate->hj_OuterNotEmpty = true;

                 return slot;
             }
+            else if (hjstate->hj_KeepNullTuples)
+            {
+                /* null join key, but we must save tuple to be emitted later */
+                if (hjstate->hj_NullOuterTupleStore == NULL)
+                    hjstate->hj_NullOuterTupleStore = ExecHashBuildNullTupleStore(hashtable);
+                tuplestore_puttupleslot(hjstate->hj_NullOuterTupleStore, slot);
+            }

             /*
              * That tuple couldn't match because of a NULL, so discard it and
@@ -1087,7 +1224,17 @@ ExecParallelHashJoinOuterGetTuple(PlanState *outerNode,
                                                                   &isnull));

             if (!isnull)
+            {
+                /* normal case with a non-null join key */
                 return slot;
+            }
+            else if (hjstate->hj_KeepNullTuples)
+            {
+                /* null join key, but we must save tuple to be emitted later */
+                if (hjstate->hj_NullOuterTupleStore == NULL)
+                    hjstate->hj_NullOuterTupleStore = ExecHashBuildNullTupleStore(hashtable);
+                tuplestore_puttupleslot(hjstate->hj_NullOuterTupleStore, slot);
+            }

             /*
              * That tuple couldn't match because of a NULL, so discard it and
@@ -1274,6 +1421,14 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
     int            start_batchno;
     int            batchno;

+    /*
+     * If we are a very slow worker, MultiExecParallelHash could have observed
+     * build_barrier phase PHJ_BUILD_FREE and not bothered to set up batch
+     * accessors.  In that case we must be done.
+     */
+    if (hashtable->batches == NULL)
+        return false;
+
     /*
      * If we were already attached to a batch, remember not to bother checking
      * it again, and detach from it (possibly freeing the hash table if we are
@@ -1496,6 +1651,17 @@ ExecReScanHashJoin(HashJoinState *node)
     PlanState  *outerPlan = outerPlanState(node);
     PlanState  *innerPlan = innerPlanState(node);

+    /*
+     * We're always going to rescan the outer rel, so drop the associated
+     * null-keys tuplestore; we'll rebuild it during the rescan.  (Must do
+     * this before ExecHashTableDestroy deletes hashCxt.)
+     */
+    if (node->hj_NullOuterTupleStore)
+    {
+        tuplestore_end(node->hj_NullOuterTupleStore);
+        node->hj_NullOuterTupleStore = NULL;
+    }
+
     /*
      * In a multi-batch join, we currently have to do rescans the hard way,
      * primarily because batch temp files may have already been released. But
@@ -1505,6 +1671,10 @@ ExecReScanHashJoin(HashJoinState *node)
      */
     if (node->hj_HashTable != NULL)
     {
+        HashState  *hashNode = castNode(HashState, innerPlan);
+
+        Assert(hashNode->hashtable == node->hj_HashTable);
+
         if (node->hj_HashTable->nbatch == 1 &&
             innerPlan->chgParam == NULL)
         {
@@ -1529,15 +1699,20 @@ ExecReScanHashJoin(HashJoinState *node)
              */
             node->hj_OuterNotEmpty = false;

+            /*
+             * Also, rewind inner null-key tuplestore so that we can return
+             * those tuples again.
+             */
+            if (hashNode->null_tuple_store)
+                tuplestore_rescan(hashNode->null_tuple_store);
+
             /* ExecHashJoin can skip the BUILD_HASHTABLE step */
             node->hj_JoinState = HJ_NEED_NEW_OUTER;
         }
         else
         {
             /* must destroy and rebuild hash table */
-            HashState  *hashNode = castNode(HashState, innerPlan);

-            Assert(hashNode->hashtable == node->hj_HashTable);
             /* accumulate stats from old hash table, if wanted */
             /* (this should match ExecShutdownHash) */
             if (hashNode->ps.instrument && !hashNode->hinstrument)
@@ -1546,6 +1721,14 @@ ExecReScanHashJoin(HashJoinState *node)
             if (hashNode->hinstrument)
                 ExecHashAccumInstrumentation(hashNode->hinstrument,
                                              hashNode->hashtable);
+
+            /* free inner null-key tuplestore before ExecHashTableDestroy */
+            if (hashNode->null_tuple_store)
+            {
+                tuplestore_end(hashNode->null_tuple_store);
+                hashNode->null_tuple_store = NULL;
+            }
+
             /* for safety, be sure to clear child plan node's pointer too */
             hashNode->hashtable = NULL;

@@ -1601,7 +1784,6 @@ ExecParallelHashJoinPartitionOuter(HashJoinState *hjstate)
     ExprContext *econtext = hjstate->js.ps.ps_ExprContext;
     HashJoinTable hashtable = hjstate->hj_HashTable;
     TupleTableSlot *slot;
-    uint32        hashvalue;
     int            i;

     Assert(hjstate->hj_FirstOuterTupleSlot == NULL);
@@ -1610,6 +1792,7 @@ ExecParallelHashJoinPartitionOuter(HashJoinState *hjstate)
     for (;;)
     {
         bool        isnull;
+        uint32        hashvalue;

         slot = ExecProcNode(outerState);
         if (TupIsNull(slot))
@@ -1624,6 +1807,7 @@ ExecParallelHashJoinPartitionOuter(HashJoinState *hjstate)

         if (!isnull)
         {
+            /* normal case with a non-null join key */
             int            batchno;
             int            bucketno;
             bool        shouldFree;
@@ -1637,6 +1821,15 @@ ExecParallelHashJoinPartitionOuter(HashJoinState *hjstate)
             if (shouldFree)
                 heap_free_minimal_tuple(mintup);
         }
+        else if (hjstate->hj_KeepNullTuples)
+        {
+            /* null join key, but we must save tuple to be emitted later */
+            if (hjstate->hj_NullOuterTupleStore == NULL)
+                hjstate->hj_NullOuterTupleStore = ExecHashBuildNullTupleStore(hashtable);
+            tuplestore_puttupleslot(hjstate->hj_NullOuterTupleStore, slot);
+        }
+        /* else we can just discard the tuple immediately */
+
         CHECK_FOR_INTERRUPTS();
     }

@@ -1715,6 +1908,7 @@ ExecHashJoinReInitializeDSM(HashJoinState *state, ParallelContext *pcxt)
 {
     int            plan_node_id = state->js.ps.plan->plan_node_id;
     ParallelHashJoinState *pstate;
+    HashState  *hashNode;

     /* Nothing to do if we failed to create a DSM segment. */
     if (pcxt->seg == NULL)
@@ -1744,6 +1938,20 @@ ExecHashJoinReInitializeDSM(HashJoinState *state, ParallelContext *pcxt)
     /* Clear any shared batch files. */
     SharedFileSetDeleteAll(&pstate->fileset);

+    /* We'd better clear our local null-key tuplestores, too. */
+    if (state->hj_NullOuterTupleStore)
+    {
+        tuplestore_end(state->hj_NullOuterTupleStore);
+        state->hj_NullOuterTupleStore = NULL;
+    }
+    hashNode = (HashState *) innerPlanState(state);
+    if (hashNode->null_tuple_store)
+    {
+        tuplestore_end(hashNode->null_tuple_store);
+        hashNode->null_tuple_store = NULL;
+    }
+
+
     /* Reset build_barrier to PHJ_BUILD_ELECT so we can go around again. */
     BarrierInit(&pstate->build_barrier, 0);
 }
diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index c9aecab8d66..0e19ecc2f8d 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -1152,6 +1152,38 @@ tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
     }
 }

+/*
+ * tuplestore_gettupleslot_force - exported function to fetch a tuple
+ *
+ * This is identical to tuplestore_gettupleslot except the given slot can be
+ * any kind of slot; it need not be one that will accept a MinimalTuple.
+ */
+bool
+tuplestore_gettupleslot_force(Tuplestorestate *state, bool forward,
+                              bool copy, TupleTableSlot *slot)
+{
+    MinimalTuple tuple;
+    bool        should_free;
+
+    tuple = (MinimalTuple) tuplestore_gettuple(state, forward, &should_free);
+
+    if (tuple)
+    {
+        if (copy && !should_free)
+        {
+            tuple = heap_copy_minimal_tuple(tuple, 0);
+            should_free = true;
+        }
+        ExecForceStoreMinimalTuple(tuple, slot, should_free);
+        return true;
+    }
+    else
+    {
+        ExecClearTuple(slot);
+        return false;
+    }
+}
+
 /*
  * tuplestore_advance - exported function to adjust position without fetching
  *
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 104b059544d..2a926c0dc35 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -338,7 +338,7 @@ extern ExprState *ExecBuildHash32Expr(TupleDesc desc,
                                       const List *collations,
                                       const List *hash_exprs,
                                       const bool *opstrict, PlanState *parent,
-                                      uint32 init_value, bool keep_nulls);
+                                      uint32 init_value);
 extern ExprState *ExecBuildGroupingEqual(TupleDesc ldesc, TupleDesc rdesc,
                                          const TupleTableSlotOps *lops, const TupleTableSlotOps *rops,
                                          int numCols,
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index ecff4842fd3..5f59b61f671 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -68,6 +68,15 @@
  * inner batch file.  Subsequently, while reading either inner or outer batch
  * files, we might find tuples that no longer belong to the current batch;
  * if so, we just dump them out to the correct batch file.
+ *
+ * If an input tuple has a null join key, then it cannot match anything from
+ * the other side of the join.  Normally we can just discard such a tuple
+ * immediately, but if it comes from the outer side of an outer join then we
+ * must emit it with null-extension of the other side.  For various reasons
+ * it's not convenient to do that immediately on seeing the tuple, so we dump
+ * the tuple into a tuplestore and emit it later.  (In the unlikely but
+ * supported case of a non-strict join operator, we treat null keys as normal
+ * data.)
  * ----------------------------------------------------------------
  */

diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index 3c1a09415aa..55b89febd1a 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -64,6 +64,7 @@ extern void ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
                                     int *numbatches,
                                     int *num_skew_mcvs);
 extern int    ExecHashGetSkewBucket(HashJoinTable hashtable, uint32 hashvalue);
+extern Tuplestorestate *ExecHashBuildNullTupleStore(HashJoinTable hashtable);
 extern void ExecHashEstimate(HashState *node, ParallelContext *pcxt);
 extern void ExecHashInitializeDSM(HashState *node, ParallelContext *pcxt);
 extern void ExecHashInitializeWorker(HashState *node, ParallelWorkerContext *pwcxt);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 2492282213f..97cd872842b 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2236,8 +2236,11 @@ typedef struct MergeJoinState
  *        hj_NullOuterTupleSlot    prepared null tuple for right/right-anti/full
  *                                outer joins
  *        hj_NullInnerTupleSlot    prepared null tuple for left/full outer joins
+ *        hj_NullOuterTupleStore    tuplestore holding outer tuples that have
+ *                                null join keys (but must be emitted anyway)
  *        hj_FirstOuterTupleSlot    first tuple retrieved from outer plan
  *        hj_JoinState            current state of ExecHashJoin state machine
+ *        hj_KeepNullTuples        true to keep outer tuples with null join keys
  *        hj_MatchedOuter            true if found a join match for current outer
  *        hj_OuterNotEmpty        true if outer relation known not empty
  * ----------------
@@ -2261,8 +2264,10 @@ typedef struct HashJoinState
     TupleTableSlot *hj_HashTupleSlot;
     TupleTableSlot *hj_NullOuterTupleSlot;
     TupleTableSlot *hj_NullInnerTupleSlot;
+    Tuplestorestate *hj_NullOuterTupleStore;
     TupleTableSlot *hj_FirstOuterTupleSlot;
     int            hj_JoinState;
+    bool        hj_KeepNullTuples;
     bool        hj_MatchedOuter;
     bool        hj_OuterNotEmpty;
 } HashJoinState;
@@ -2812,6 +2817,9 @@ typedef struct HashState
     FmgrInfo   *skew_hashfunction;    /* lookup data for skew hash function */
     Oid            skew_collation; /* collation to call skew_hashfunction with */

+    Tuplestorestate *null_tuple_store;    /* where to put null-keyed tuples */
+    bool        keep_null_tuples;    /* do we need to save such tuples? */
+
     /*
      * In a parallelized hash join, the leader retains a pointer to the
      * shared-memory stats area in its shared_info field, and then copies the
diff --git a/src/include/utils/tuplestore.h b/src/include/utils/tuplestore.h
index 865ba7b8265..b9e152c1701 100644
--- a/src/include/utils/tuplestore.h
+++ b/src/include/utils/tuplestore.h
@@ -73,6 +73,9 @@ extern bool tuplestore_in_memory(Tuplestorestate *state);
 extern bool tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
                                     bool copy, TupleTableSlot *slot);

+extern bool tuplestore_gettupleslot_force(Tuplestorestate *state, bool forward,
+                                          bool copy, TupleTableSlot *slot);
+
 extern bool tuplestore_advance(Tuplestorestate *state, bool forward);

 extern bool tuplestore_skiptuples(Tuplestorestate *state,
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index f35a0b18c37..334d38b1052 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -4590,7 +4590,7 @@ order by fault;
 explain (costs off)
 select * from
 (values (1, array[10,20]), (2, array[20,30])) as v1(v1x,v1ys)
-left join (values (1, 10), (2, 20)) as v2(v2x,v2y) on v2x = v1x
+left join (values (1, 10), (2, 20), (2, null)) as v2(v2x,v2y) on v2x = v1x
 left join unnest(v1ys) as u1(u1y) on u1y = v2y;
                          QUERY PLAN
 -------------------------------------------------------------
@@ -4606,13 +4606,14 @@ left join unnest(v1ys) as u1(u1y) on u1y = v2y;

 select * from
 (values (1, array[10,20]), (2, array[20,30])) as v1(v1x,v1ys)
-left join (values (1, 10), (2, 20)) as v2(v2x,v2y) on v2x = v1x
+left join (values (1, 10), (2, 20), (2, null)) as v2(v2x,v2y) on v2x = v1x
 left join unnest(v1ys) as u1(u1y) on u1y = v2y;
  v1x |  v1ys   | v2x | v2y | u1y
 -----+---------+-----+-----+-----
    1 | {10,20} |   1 |  10 |  10
    2 | {20,30} |   2 |  20 |  20
-(2 rows)
+   2 | {20,30} |   2 |     |
+(3 rows)

 --
 -- test handling of potential equivalence clauses above outer joins
diff --git a/src/test/regress/expected/join_hash.out b/src/test/regress/expected/join_hash.out
index 4fc34a0e72a..3df9f653d35 100644
--- a/src/test/regress/expected/join_hash.out
+++ b/src/test/regress/expected/join_hash.out
@@ -53,6 +53,7 @@ $$;
 -- estimated size.
 create table simple as
   select generate_series(1, 20000) AS id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+insert into simple values (null, null);
 alter table simple set (parallel_workers = 2);
 analyze simple;
 -- Make a relation whose size we will under-estimate.  We want stats
@@ -308,7 +309,7 @@ $$);
 select count(*) from simple r full outer join simple s using (id);
  count
 -------
- 20000
+ 20002
 (1 row)

 rollback to settings;
@@ -786,7 +787,7 @@ explain (costs off)
 select  count(*) from simple r full outer join simple s using (id);
  count
 -------
- 20000
+ 20002
 (1 row)

 rollback to settings;
@@ -809,7 +810,7 @@ explain (costs off)
 select  count(*) from simple r full outer join simple s using (id);
  count
 -------
- 20000
+ 20002
 (1 row)

 rollback to settings;
@@ -834,7 +835,7 @@ explain (costs off)
 select  count(*) from simple r full outer join simple s using (id);
  count
 -------
- 20000
+ 20002
 (1 row)

 rollback to settings;
@@ -857,7 +858,7 @@ explain (costs off)
 select  count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
  count
 -------
- 40000
+ 40002
 (1 row)

 rollback to settings;
@@ -880,7 +881,7 @@ explain (costs off)
 select  count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
  count
 -------
- 40000
+ 40002
 (1 row)

 rollback to settings;
@@ -905,7 +906,7 @@ explain (costs off)
 select  count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
  count
 -------
- 40000
+ 40002
 (1 row)

 rollback to settings;
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index cc5128add4d..c4946d39e77 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -1554,12 +1554,12 @@ order by fault;
 explain (costs off)
 select * from
 (values (1, array[10,20]), (2, array[20,30])) as v1(v1x,v1ys)
-left join (values (1, 10), (2, 20)) as v2(v2x,v2y) on v2x = v1x
+left join (values (1, 10), (2, 20), (2, null)) as v2(v2x,v2y) on v2x = v1x
 left join unnest(v1ys) as u1(u1y) on u1y = v2y;

 select * from
 (values (1, array[10,20]), (2, array[20,30])) as v1(v1x,v1ys)
-left join (values (1, 10), (2, 20)) as v2(v2x,v2y) on v2x = v1x
+left join (values (1, 10), (2, 20), (2, null)) as v2(v2x,v2y) on v2x = v1x
 left join unnest(v1ys) as u1(u1y) on u1y = v2y;

 --
diff --git a/src/test/regress/sql/join_hash.sql b/src/test/regress/sql/join_hash.sql
index 6b0688ab0a6..11e3a164c76 100644
--- a/src/test/regress/sql/join_hash.sql
+++ b/src/test/regress/sql/join_hash.sql
@@ -57,6 +57,7 @@ $$;
 -- estimated size.
 create table simple as
   select generate_series(1, 20000) AS id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+insert into simple values (null, null);
 alter table simple set (parallel_workers = 2);
 analyze simple;

--
2.43.5


Re: Improve hash join's handling of tuples with null join keys

From
Chao Li
Date:
I downloaded the patch and tested all join types: inner, left, right, full, semi and anti. Basically my tests all
passed. However, I didn't test any parallel-query cases.

I have two nit comments:

1. In hashjoin.h, lines 76-78, the added comment says "(In the unlikely but supported case of a non-strict join
operator, we treat null keys as normal data.)".  But I don't see where non-strict joins are handled, so how does this
patch affect non-strict joins?
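(To make "non-strict" concrete: strictness is a property of the function behind the operator, which can be checked in
the catalogs.  Purely as an illustration — this uses the ordinary integer equality operator, whose function is strict:

    select o.oprname, p.proname, p.proisstrict
    from pg_operator o
    join pg_proc p on p.oid = o.oprcode
    where o.oprname = '='
      and o.oprleft = 'integer'::regtype
      and o.oprright = 'integer'::regtype;

For operators like this the strict path always applies; a non-strict hash-joinable operator would show
proisstrict = false for its function.)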
 

2. Two new join states are added: HJ_FILL_OUTER_NULL_TUPLES and HJ_FILL_INNER_NULL_TUPLES. There are existing join
states HJ_FILL_OUTER_TUPLE and HJ_FILL_INNER_TUPLES, and they all use "FILL", but I think "FILL" in
HJ_FILL_OUTER_NULL_TUPLES means something different from "FILL" in HJ_FILL_OUTER_TUPLE: HJ_FILL_OUTER_TUPLE means that
when returning an outer tuple we need to null-extend the inner side, while HJ_FILL_OUTER_NULL_TUPLES means returning
outer tuples that have null join keys. I would suggest something like "APPEND" for the new states:
HJ_APPEND_OUTER_NULL_TUPLES and HJ_APPEND_INNER_NULL_TUPLES.

The new status of this patch is: Waiting on Author

Re: Improve hash join's handling of tuples with null join keys

From
Chao Li
Date:


On Aug 13, 2025, at 17:16, Chao Li <li.evan.chao@gmail.com> wrote:

I downloaded the patch and tested all join types: inner, left, right, full, semi and anti. Basically my tests all passed. However, I didn't test any case of parallel query.

I have two nit comments:

1. In hashjoin.h, line 76-78, the added comment says "(In the unlikely but supported case of a non-strict join operator, we treat null keys as normal data.)". But I don't see where non-strict join is handled. So, how this patch impact non-strict joins?


I take back this comment, but I have a new, related comment.

@@ -1015,11 +1144,19 @@ ExecHashJoinOuterGetTuple(PlanState *outerNode,

                        if (!isnull)
                        {
+                               /* normal case with a non-null join key */
                                /* remember outer relation is not empty for possible rescan */
                                hjstate->hj_OuterNotEmpty = true;

                                return slot;
                        }
+                       else if (hjstate->hj_KeepNullTuples)
+                       {
+                               /* null join key, but we must save tuple to be emitted later */
+                               if (hjstate->hj_NullOuterTupleStore == NULL)
+                                       hjstate->hj_NullOuterTupleStore = ExecHashBuildNullTupleStore(hashtable);
+                               tuplestore_puttupleslot(hjstate->hj_NullOuterTupleStore, slot);
+                       }

When an outer tuple contains a null join key, without this patch the “isnull” flag is false, so the tuple is still returned; for an outer join, the tuple slot is passed up to the parent node immediately.

With this patch, “isnull” now becomes true because of the strict-opcode change, so the outer null-join-key tuple must be stored in a tuplestore. When the outer table contains a lot of null-join-key tuples, the tuplestore could grow very large; in that case, it would be hard to say this patch really benefits.

I am thinking: can we do only the first half of this patch, i.e., put only the inner table's null-join-key tuples into a tuplestore? That way the hash table still gets the improvement, the outer table's logic stays the same, and overall the patch is a pure improvement without the potential memory burden from hj_NullOuterTupleStore.

—————————

I also got an idea for improving the hash logic.

/*
 * If the outer relation is completely empty, and it's not
 * right/right-anti/full join, we can quit without building
 * the hash table.  However, for an inner join it is only a
 * win to check this when the outer relation's startup cost is
 * less than the projected cost of building the hash table.
 * Otherwise it's best to build the hash table first and see
 * if the inner relation is empty.  (When it's a left join, we
 * should always make this check, since we aren't going to be
 * able to skip the join on the strength of an empty inner
 * relation anyway.)
 */
if (HJ_FILL_INNER(node))
{
    /* no chance to not build the hash table */
    node->hj_FirstOuterTupleSlot = NULL;
}

Based on this patch, if we are doing a left join and the outer table is empty, then all tuples from the inner table should be returned. In that case we can skip building a hash table; instead, we can put all inner-table tuples into hashtable.innerNullTupleStore. Building a tuplestore should be cheaper than building a hash table, so this would give a little more performance improvement.

Regards,

Chao Li (Evan)
--------------------
HighGo Software Co., Ltd.
https://www.highgo.com/



Re: Improve hash join's handling of tuples with null join keys

From
Tom Lane
Date:
Chao Li <li.evan.chao@gmail.com> writes:
> With this patch, “isnull” now becomes true because of the strict-opcode change, so the outer null-join-key tuple
> must be stored in a tuplestore. When the outer table contains a lot of null-join-key tuples, the tuplestore could
> grow very large; in that case, it would be hard to say this patch really benefits.

What's your point?  If we don't divert those tuples into the
tuplestore, then they will end up in the main hash table instead,
and the consequences of bloat there are far worse.

> Based on this patch, if we are doing a left join and the outer table is empty, then all tuples from the inner table
> should be returned. In that case we can skip building a hash table; instead, we can put all inner-table tuples into
> hashtable.innerNullTupleStore. Building a tuplestore should be cheaper than building a hash table, so this would
> give a little more performance improvement.

I think that would make the logic completely unintelligible.  Also,
a totally-empty input relation is not a common situation.  We try to
optimize such cases when it's simple to do so, but we shouldn't let
that drive the fundamental design.

            regards, tom lane



Re: Improve hash join's handling of tuples with null join keys

From
Chao Li
Date:


On Aug 16, 2025, at 00:52, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Chao Li <li.evan.chao@gmail.com> writes:
With this patch, “isnull” now becomes true because of the change of strict op. Then the outer null join key tuple must be stored in a tuplestore. When an outer table contains a lot of null join key tuples, then the tuplestore could bump to very large, in that case, it would be hard to say this patch really benefits.

What's your point?  If we don't divert those tuples into the
tuplestore, then they will end up in the main hash table instead,
and the consequences of bloat there are far worse.

I might not have stated it clearly. In this comment, I meant the outer table. For example:

SELECT a.*, b.* from a RIGHT JOIN b on a.id = b.a_id;

Let's say table a is used to build the hash table, and table b is the outer table.

And say table b has 1000 tuples whose a_id is NULL.
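Concretely, something like the following setup (table definitions and row counts are invented just for illustration, and I'm assuming the planner hashes a):

    create table a (id int);
    create table b (a_id int, val text);
    insert into a select g from generate_series(1, 10000) g;
    insert into b select g, 'x' from generate_series(1, 10000) g;
    insert into b select null, 'x' from generate_series(1, 1000);   -- 1000 rows with a NULL join key
    analyze a;
    analyze b;
    -- then run the RIGHT JOIN query above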

Before this patch, when such a tuple (with a_id null) is fetched from table b, it is returned to the parent node immediately.

With this patch, all such tuples will be put into hj_NullOuterTupleStore and only returned after all non-null tuples have been processed.

My comment was trying to say that if there are a lot of null-join-key tuples in the outer table, then hj_NullOuterTupleStore might use a lot of memory or spill to disk, which might become a performance burden. So I was thinking we could keep the original logic for the outer table and return null-join-key tuples immediately.



Based on this patch, if we are doing a left join, and outer table is empty, then all tuples from the inner table should be returned. In that case, we can skip building a hash table, instead, we can put all inner table tuples into hashtable.innerNullTupleStore. Building a tuplestore should be cheaper than building a hash table, so this way makes a little bit more performance improvement.

I think that would make the logic completely unintelligible.  Also,
a totally-empty input relation is not a common situation.  We try to
optimize such cases when it's simple to do so, but we shouldn't let
that drive the fundamental design.


I absolutely agree we should not touch the fundamental design for a tiny optimization; that's why I said “based on this patch”.

With this patch, you have introduced a change in MultiExecPrivateHash():

else if (node->keep_null_tuples)
{
    /* null join key, but we must save tuple to be emitted later */
    if (node->null_tuple_store == NULL)
        node->null_tuple_store = ExecHashBuildNullTupleStore(hashtable);
    tuplestore_puttupleslot(node->null_tuple_store, slot);
}

We can simply add a new flag to HashTable, say skip_building_hash. For a right join (joining to the hash side) with an empty outer table, set the flag to true; then in MultiExecPrivateHash(), if skip_building_hash is true, put all tuples directly into node->null_tuple_store without building a hash table.

Then in ExecHashJoinImpl(), after "(void) MultiExecProcNode()" is called, if hashtable->skip_building_hash is true, directly set node->hj_JoinState = HJ_FILL_INNER_NULL_TUPLES.

So the tiny optimization is entirely based on this patch: it depends on the null_tuple_store (if you take this suggestion, maybe rename that field) and the new state HJ_FILL_INNER_NULL_TUPLES.

Best regards,
==
Chao Li (Evan)
--------------------
HighGo Software Co., Ltd.
https://www.highgo.com/



Re: Improve hash join's handling of tuples with null join keys

From
Tom Lane
Date:
Chao Li <li.evan.chao@gmail.com> writes:
> My comment was trying to say that if there are a lot of null-join-key tuples in the outer table, then
> hj_NullOuterTupleStore might use a lot of memory or spill to disk, which might become a performance burden. So I
> was thinking we could keep the original logic for the outer table and return null-join-key tuples immediately.

I don't think that works for the parallel-hash-join case, at least not
for the multi-batch code path.  That path insists on putting every
potentially-outputtable tuple into some batch's shared tuplestore, cf
ExecParallelHashJoinPartitionOuter.  We can make that function put
the tuple into a different tuplestore instead, but I think it's quite
unreasonable to think of returning the tuple immediately from there.
It certainly wouldn't be "keeping the original logic".

Yeah, we could make multi-batch PHJ do this differently from the other
cases, but I don't want to go there: too much complication and risk of
bugs for what is a purely hypothetical performance issue.  Besides
which, if the join is large enough to be worth worrying over, it's
most likely taking that code path anyhow.


> We can simply add a new flag to HashTable, say skip_building_hash. For a right join (joining to the hash side)
> with an empty outer table, set the flag to true; then in MultiExecPrivateHash(), if skip_building_hash is true,
> put all tuples directly into node->null_tuple_store without building a hash table.
> Then in ExecHashJoinImpl(), after "(void) MultiExecProcNode()" is called, if hashtable->skip_building_hash is true,
> directly set node->hj_JoinState = HJ_FILL_INNER_NULL_TUPLES.

I'm not excited about this idea either.  It's completely abusing the
data structure, because the "null_tuple_store" is now being used for
tuples that (probably) don't have null join keys.  The fact that you
could cram it in with not very many lines of code does not mean that
the result will be understandable or maintainable --- and certainly,
hash join is on the hairy edge of being too complicated already.

            regards, tom lane



Re: Improve hash join's handling of tuples with null join keys

From
Chao Li
Date:


On Aug 19, 2025, at 05:37, Tom Lane <tgl@sss.pgh.pa.us> wrote:


Yeah, we could make multi-batch PHJ do this differently from the other
cases, but I don't want to go there: too much complication and risk of
bugs for what is a purely hypothetical performance issue.  Besides
which, if the join is large enough to be worth worrying over, it's
most likely taking that code path anyhow.


We can simply add a new flag to HashTable, say skip_building_hash. For a right join (joining to the hash side) with an empty outer table, set the flag to true; then in MultiExecPrivateHash(), if skip_building_hash is true, put all tuples directly into node->null_tuple_store without building a hash table.
Then in ExecHashJoinImpl(), after "(void) MultiExecProcNode()" is called, if hashtable->skip_building_hash is true, directly set node->hj_JoinState = HJ_FILL_INNER_NULL_TUPLES.

I'm not excited about this idea either.  It's completely abusing the
data structure, because the "null_tuple_store" is now being used for
tuples that (probably) don't have null join keys.  The fact that you
could cram it in with not very many lines of code does not mean that
the result will be understandable or maintainable --- and certainly,
hash join is on the hairy edge of being too complicated already.

regards, tom lane

Thanks for the explanation. That resolves both of my comments, then.

--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/




Re: Improve hash join's handling of tuples with null join keys

From
Tom Lane
Date:
Bug #19030 [1] seems to be a fresh report of the problem this patch
aims to solve.  While answering that, I realized that the v2 patch
causes null-keyed inner rows to not be included in EXPLAIN ANALYZE's
report of the number of rows output by the Hash node.  Now on the
one hand, what it's reporting is an accurate reflection of the
number of rows in the hash table, which perhaps is useful.  On the
other hand, it's almost surely going to confuse users, and it's
different from the number we produced before.  Should we try to
preserve the old behavior here?  (I've not looked at what code
changes would be needed for that.)
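
To make the question concrete, a minimal sketch of the sort of query involved (the table here is hypothetical,
not taken from the bug report, and I'm assuming the planner picks a hash join for the full join):

    create table t (id int);
    insert into t select case when g % 2 = 0 then null else g end
      from generate_series(1, 10000) g;
    analyze t;
    explain (analyze, costs off, timing off, summary off)
    select count(*) from t r full outer join t s using (id);

With v2, the actual row count reported for the Hash node no longer includes the null-keyed inner rows, since
those now go into the tuplestore instead of the hash table.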

            regards, tom lane

[1] https://www.postgresql.org/message-id/flat/19030-944dd78d7ef94c0f%40postgresql.org



Re: Improve hash join's handling of tuples with null join keys

From
Marc-Olaf Jaschke
Date:
Tom Lane wrote:

> Bug #19030 [1] seems to be a fresh report of the problem this patch
> aims to solve.

> [1] https://www.postgresql.org/message-id/flat/19030-944dd78d7ef94c0f%40postgresql.org
>

I can confirm that the patch fixes the issue (Bug #19030). The memory usage remains within the expected range of
work_mem.
This also applies to parallel hash joins.
The query also runs significantly faster.
I also tested cases with multiple left joins.
I have only observed this problem when there are many null values in the join column.
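
A rough sketch of the kind of setup that shows the problem, in case it helps others reproduce it (schema and sizes
are invented here, not taken from the bug report; the key ingredient is that the relation with many null join keys
is preserved by the join and ends up on the hashed side of the plan):

    set work_mem = '4MB';
    create table nullish (k int, pad text);
    insert into nullish select null, repeat('x', 50) from generate_series(1, 500000);
    insert into nullish select g, repeat('x', 50) from generate_series(1, 100000) g;
    create table big (k int);
    insert into big select g % 50000 from generate_series(1, 2000000) g;
    analyze nullish;
    analyze big;
    -- expecting a Hash Right Join with "nullish" on the hashed side
    explain (analyze, buffers)
    select count(*) from nullish left join big using (k);

Without the patch, the null-keyed rows all land in the same hash bucket, which no amount of batch splitting can
shrink; with the patch they are diverted to a tuplestore and memory stays in line with work_mem.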

regards
Marc-Olaf Jaschke


Re: Improve hash join's handling of tuples with null join keys

From
Tom Lane
Date:
Marc-Olaf Jaschke <moj@dshare.de> writes:
> I can confirm that the patch fixes the issue (Bug #19030). The memory usage remains within the expected range of
> work_mem.
> This also applies to parallel hash joins.
> The query also runs significantly faster.
> I also tested cases with multiple left joins.
> I have only observed this problem when there are many null values in the join column.

Thanks for testing!

            regards, tom lane