Discussion: BUG #19040: Memory leak in hashed subplan node due to missing hashtempcxt reset
BUG #19040: Memory leak in hashed subplan node due to missing hashtempcxt reset
The following bug has been logged on the website:
Bug reference: 19040
Logged by: haiyang li
Email address: mohen.lhy@alibaba-inc.com
PostgreSQL version: 18beta3
Operating system: centos7 5.10.84 x86_64
Description:
Hello, all!
I found a query that consumes a lot of memory and triggers the OOM killer.
The memory leak occurs in the hashed subplan node.
I was able to create a reproducible test case on a machine with the default
config and PostgreSQL 18beta3:
CREATE TABLE test1(a numeric, b int);

INSERT INTO test1
SELECT i, i FROM generate_series(1, 30000000) i;  -- Make the running time longer

EXPLAIN ANALYZE
SELECT * FROM test1
WHERE a NOT IN (SELECT i FROM generate_series(1, 10000) i);
plan:
                                                              QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on test1  (cost=125.00..612432.24 rows=15000108 width=10) (actual time=135.191..25832.808 rows=29990000 loops=1)
   Filter: (NOT (hashed SubPlan 1))
   Rows Removed by Filter: 10000
   SubPlan 1
     ->  Function Scan on generate_series i  (cost=0.00..100.00 rows=10000 width=4) (actual time=36.999..38.296 rows=10000 loops=1)
 Planning Time: 0.280 ms
 JIT:
   Functions: 15
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 1.155 ms, Inlining 25.929 ms, Optimization 60.700 ms, Emission 23.018 ms, Total 110.802 ms
 Execution Time: 28217.026 ms
(11 rows)
I observed that the process's RES (resident memory) was increasing rapidly
during SQL execution by using the 'top -p <pid>' command.
Furthermore, during SQL execution, I ran 'select pg_log_backend_memory_contexts(<pid>)'
to print memory context statistics. The context with abnormally high memory
usage was "Subplan HashTable Temp Context". The key part of the log is as follows:
...
LOG: level: 5; Subplan HashTable Temp Context: 514834432 total in 62849 blocks; 973712 free (60695 chunks); 513860720 used
LOG: level: 5; Subplan HashTable Context: 524288 total in 7 blocks; 123808 free (5 chunks); 400480 used
...
Grand total: 518275344 bytes in 63116 blocks; 2025560 free (60976 chunks); 516249784 used
...
If I change the SQL from "a NOT IN" to "b NOT IN" and repeat the same steps,
I cannot observe abnormally high memory usage. Likewise, the key part of the
log is as follows:
...
LOG: level: 5; Subplan HashTable Temp Context: 1024 total in 1 blocks; 784 free (0 chunks); 240 used
LOG: level: 5; Subplan HashTable Context: 524288 total in 7 blocks; 123808 free (5 chunks); 400480 used
...
Grand total: 3441936 bytes in 268 blocks; 1050520 free (281 chunks); 2391416 used
...
While analyzing the source code, I found that the hashed subplan node fails
to reset the 'hashtempcxt' context after probing the hash table for each slot.
When variable-length datatypes (e.g., numeric) are processed, this can trigger
calls to 'detoast_attr', which allocate memory in hashtempcxt. Without a reset,
this memory is not reclaimed until the context itself is destroyed, resulting
in a memory leak when processing large numbers of slots.
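For illustration, here is a minimal sketch of the probe path involved
(simplified and abbreviated from ExecHashSubPlan; it only shows where the
allocations land and is not an actual patch):

    /*
     * Probe path, once per outer row. FindTupleHashEntry runs the hash and
     * equality functions with CurrentMemoryContext switched to
     * node->hashtempcxt, so anything they palloc there (for example a
     * numeric de-toasted by detoast_attr) survives until that context is
     * reset or deleted -- and nothing on this path resets it.
     */
    slot = ExecProject(node->projLeft);     /* build the LHS tuple */
    if (node->havehashrows &&
        FindTupleHashEntry(node->hashtable,
                           slot,
                           node->cur_eq_comp,
                           node->lhs_hash_expr) != NULL)
    {
        ExecClearTuple(slot);
        return BoolGetDatum(true);          /* hashtempcxt never reset */
    }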
A patch implementing this fix will be included in the follow-up email.
--
Thanks,
Haiyang Li
Re: BUG #19040: Memory leak in hashed subplan node due to missing hashtempcxt reset
On Sep 3, 2025, at 09:43, ocean_li_996 <ocean_li_996@163.com> wrote:
> I've attached 'v01_fix_memory_leak_in_hashed_subplan_node.patch' to address this.

It seems this issue has been around for many years. I took a quick
look at the patch for fixing it. Why don't we reset the temp context
in the LookupTupleHashEntry, TupleHashTableHash, LookupTupleHashEntryHash,
and FindTupleHashEntry functions? This seems more robust.

Furthermore,
the added test case doesn't seem to detect whether there's a memory leak.
Fei Changhong
feichanghong <feichanghong@qq.com> writes:
> It seems this issue has been around for many years. I took a quick
> look at the patch for fixing it. Why don't we reset the temp context
> in the LookupTupleHashEntry, TupleHashTableHash, LookupTupleHashEntryHash,
> and FindTupleHashEntry functions? This seems more robust.

No, I disagree with that. MemoryContextReset is not free. The
existing code seems to expect that the hash tempcxt will be a
per-tuple context or similar, which will be reset once per executor
cycle by existing mechanisms. I wonder why ExecInitSubPlan is
making a separate context for this at all, rather than using some
surrounding short-lived context.

If we do keep a separate context, I agree with the idea of one
MemoryContextReset in the exit of ExecHashSubPlan, but the proposed
patch seems like a mess. I think what we ought to do is refactor
ExecHashSubPlan so that there's exactly one "ExecClearTuple(slot)"
down at the bottom, and then we can add a MemoryContextReset after it.

> Furthermore,
> the added test case doesn't seem to detect whether there's a memory leak.

Yeah, test cases for memory leaks are problematic. The only way they
can really "detect" one is if they run so long as to be pretty much
guaranteed to hit OOM, which is (a) impossible to quantify across
a range of platforms and (b) not something we'd care to commit anyway.
It's good if we have an example that one can watch to confirm
it-leaks-or-not by monitoring the process's memory consumption,
but I don't foresee committing it.

regards, tom lane
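For concreteness, the single-exit shape described above would look roughly
like this (a sketch only, with the LHS projection and null handling elided;
not the patch that was eventually posted):

    bool        result = false;

    /* ... project the LHS into 'slot' and handle the all/partly-null cases ... */

    if (node->havehashrows &&
        FindTupleHashEntry(node->hashtable,
                           slot,
                           node->cur_eq_comp,
                           node->lhs_hash_expr) != NULL)
        result = true;
    else if (node->havenullrows &&
             findPartialMatch(node->hashnulls, slot, node->cur_eq_funcs))
        *isNull = true;

    /* exactly one exit: clear the projected tuple, then reset the temp context */
    ExecClearTuple(slot);
    MemoryContextReset(node->hashtempcxt);
    return BoolGetDatum(result);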
Tom Lane writes:
> If we do keep a separate context, I agree with the idea of one
> MemoryContextReset in the exit of ExecHashSubPlan, but the proposed
> patch seems like a mess. I think what we ought to do is refactor
> ExecHashSubPlan so that there's exactly one "ExecClearTuple(slot)"
> down at the bottom, and then we can add a MemoryContextReset after it.
The proposed patch was inspired by the approach used in ExecRecursiveUnion.
Refactoring ExecHashSubPlan would be a better long-term solution.
> It's good if we have an example that one can watch to confirm
> it-leaks-or-not by monitoring the process's memory consumption,
> but I don't foresee committing it.
Should we omit the test case, or add one in the same form as in the patch?
--
Thanks,
Haiyang Li
"=?UTF-8?B?5p2O5rW35rSLKOmZjOeXlSk=?=" <mohen.lhy@alibaba-inc.com> writes: > On 2025-09-03 23:35 Tom Lane <tgl@sss.pgh.pa.us> writes: >> I wonder why ExecInitSubPlan is making a separate context for this at all, >> rather than using some surrounding short-lived context. > This behavior was introduced by commit 133924e to fix one bug. AFAICS, the > tempcxt is only used by hashfunc evaluation. We should reset tempcxt after per > hashtable find if tempcxt is a separate context. I thought it unlikely that this leak has been there unreported since 2010, so I tried the test case in older branches and soon found that v10 and older don't exhibit the leak, or at least it's much less virulent there. A "git bisect" session found that the behavior changed at bf6c614a2f2c58312b3be34a47e7fb7362e07bcb is the first bad commit commit bf6c614a2f2c58312b3be34a47e7fb7362e07bcb Author: Andres Freund <andres@anarazel.de> Date: Thu Feb 15 21:55:31 2018 -0800 Do execGrouping.c via expression eval machinery, take two. Before that commit, TupleHashTableMatch reset hashtable->tempcxt (via execTuplesMatch). Now what it resets is a different context hashtable->exprcontext->ecxt_per_tuple_memory, and so there's no reset of hashtable->tempcxt anywhere in this particular loop. So that leads me to not trust fixing this in nodeSubplan.c, because that's just one caller that can reach TupleHashTableMatch: assuming that there are no other similar leaks seems dangerous. I experimented with putting MemoryContextReset(hashtable->tempcxt) into TupleHashTableMatch, and that does stop this leak. But even though that's a one-line fix, I don't like that solution either, for two reasons: 1. TupleHashTableMatch is likely to be invoked multiple times per hashtable lookup, so that this way results in many more resets than are really necessary. 2. It's not entirely clear that this way can't clobber data we still need, since the hash function outputs are surely longer-lived than individual tuple matching operations. After contemplating things for awhile, I think that feichanghong's idea is the right one after all: in each of the functions that switch into hashtable->tempcxt, let's do a reset on the way out, as attached. That's straightforward and visibly related to the required data lifespan. Interestingly, both in pre-v11 branches and with the one-line fix, I notice that the test query's memory consumption bounces around a little (10MB or so), while it seems absolutely steady with the attached. I interpret that to mean that we weren't resetting the tempcxt quite often enough, so that there was some amount of leakage in between calls of TupleHashTableMatch, even though we got there often enough to avoid disaster in this particular test. That's another reason to prefer this way over other solutions, I think. 
regards, tom lane diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c index b5400749353..600d1ae5a96 100644 --- a/src/backend/executor/execGrouping.c +++ b/src/backend/executor/execGrouping.c @@ -316,6 +316,7 @@ LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot, Assert(entry == NULL || entry->hash == local_hash); MemoryContextSwitchTo(oldContext); + MemoryContextReset(hashtable->tempcxt); return entry; } @@ -338,6 +339,7 @@ TupleHashTableHash(TupleHashTable hashtable, TupleTableSlot *slot) hash = TupleHashTableHash_internal(hashtable->hashtab, NULL); MemoryContextSwitchTo(oldContext); + MemoryContextReset(hashtable->tempcxt); return hash; } @@ -365,6 +367,7 @@ LookupTupleHashEntryHash(TupleHashTable hashtable, TupleTableSlot *slot, Assert(entry == NULL || entry->hash == hash); MemoryContextSwitchTo(oldContext); + MemoryContextReset(hashtable->tempcxt); return entry; } @@ -398,7 +401,9 @@ FindTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot, /* Search the hash table */ key = NULL; /* flag to reference inputslot */ entry = tuplehash_lookup(hashtable->hashtab, key); + MemoryContextSwitchTo(oldContext); + MemoryContextReset(hashtable->tempcxt); return entry; }
Re: BUG #19040: Memory leak in hashed subplan node due to missing hashtempcxt reset
On Sep 7, 2025, at 16:24, 李海洋(陌痕) <mohen.lhy@alibaba-inc.com> wrote:
> On 2025-09-06 20:31:53 Tom Lane <tgl@sss.pgh.pa.us> writes:
>> After contemplating things for awhile, I think that feichanghong's
>> idea is the right one after all: in each of the functions that switch
>> into hashtable->tempcxt, let's do a reset on the way out, as attached.
>> That's straightforward and visibly related to the required data
>> lifespan.
> I have considered this approach as well, but my concern is that "tempcxt"
> is not always an independent memory context. In some cases, it references
> another context — for example, in nodeSetOp.c's "build_hash_table", "tempcxt"
> points to "setopstate->ps.ps_ExprContext->ecxt_per_tuple_memory". There is
> similar usage in nodeAgg.c as well. I'm not entirely sure that this approach
> would not discard data we still need, because the lifespan of
> "ps_ExprContext->ecxt_per_tuple_memory" seems to be longer than "tempcxt".

Yes, I agree with that.

> Should we make tempcxt a completely independent memory context?

It looks fine. Perhaps we don't need to pass tempcxt to BuildTupleHashTable,
but rather create a new one within it. After each switch to tempcxt, we
should clean it up using MemoryContextReset.
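To illustrate the two caller patterns being contrasted here (argument lists
abbreviated to the memory-context parameters; a sketch, not the literal
source):

    /* nodeSubplan.c: the hash table gets its own dedicated temp context */
    hashtable = BuildTupleHashTable(...,
                                    node->hashtablecxt,     /* tablecxt */
                                    node->hashtempcxt,      /* tempcxt */
                                    false);

    /*
     * nodeSetOp.c / nodeAgg.c style, as described above: tempcxt aliases an
     * ExprContext's per-tuple memory, which the node already resets once per
     * tuple. An unconditional reset inside the execGrouping.c routines could
     * therefore free data such a caller still expects to be live.
     */
    hashtable = BuildTupleHashTable(...,
                                    tablecxt,
                                    setopstate->ps.ps_ExprContext->ecxt_per_tuple_memory,
                                    false);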
Fei Changhong
feichanghong <feichanghong@qq.com> writes:
>> On Sep 7, 2025, at 16:24, 李海洋(陌痕) <mohen.lhy@alibaba-inc.com> wrote:
>> On 2025-09-06 20:31:53 Tom Lane <tgl@sss.pgh.pa.us> writes:
>>> After contemplating things for awhile, I think that feichanghong's
>>> idea is the right one after all: in each of the functions that switch
>>> into hashtable->tempcxt, let's do a reset on the way out, as attached.
>> I have considered this approach as well, but my concern is that "tempcxt"
>> is not always an independent memory context. In some cases, it references
>> another context — for example, in nodeSetOp.c's "build_hash_table", "tempcxt"
>> points to "setopstate->ps.ps_ExprContext->ecxt_per_tuple_memory". There is
>> similar usage in nodeAgg.c as well. I'm not entirely sure that this approach
>> would not discard data we still need, because the lifespan of
>> "ps_ExprContext->ecxt_per_tuple_memory" seems to be longer than "tempcxt".
> Yes, I agree with that.

Yeah, that is a fair point. The existing API is that the caller is
responsible for resetting tempcxt sufficiently often, and it looks
like nodeSubplan.c is the only place that gets this wrong. Let's
just fix nodeSubplan.c, add a comment documenting this requirement,
and call it good.

>> Should we make tempcxt a completely independent memory context?
> It looks fine. Perhaps we don't need to pass tempcxt to BuildTupleHashTable,
> but rather create a new one within it. After each switch to tempcxt, we should
> clean it up using MemoryContextReset.

I thought about that too, but that would result in two short-lived
contexts and two reset operations per tuple cycle where only one
is needed. I'm rather tempted to fix nodeSubplan.c by making it
use innerecontext->ecxt_per_tuple_memory instead of a separate
hash tmpcontext. That context is getting reset already, at least in
buildSubPlanHash(). That's probably too risky for the back branches
but we could do it in HEAD.

regards, tom lane
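Sketched as a typical caller loop, that contract reads roughly like this
(generic names, not lifted from any particular executor node):

    for (;;)
    {
        TupleTableSlot *slot = ExecProcNode(outerPlan);
        bool        isnew;
        uint32      hash;

        if (TupIsNull(slot))
            break;
        (void) LookupTupleHashEntry(hashtable, slot, &isnew, &hash);

        /* caller's duty: reset the tempcxt it handed to BuildTupleHashTable */
        MemoryContextReset(tempcxt);
    }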
I wrote:
> I thought about that too, but that would result in two short-lived
> contexts and two reset operations per tuple cycle where only one
> is needed. I'm rather tempted to fix nodeSubplan.c by making it
> use innerecontext->ecxt_per_tuple_memory instead of a separate
> hash tmpcontext. That context is getting reset already, at least in
> buildSubPlanHash(). That's probably too risky for the back branches
> but we could do it in HEAD.

Concretely, I'm thinking about the attached. 0001 is the same
logic as in the v02 patch, but I felt we could make the code
be shorter and prettier instead of longer and uglier. That's
meant for back-patch, and then 0002 is for master only.

regards, tom lane

From 498a709bbb1e4942e15c8b1c82b181d9d2d3a686 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Mon, 8 Sep 2025 16:22:29 -0400
Subject: [PATCH v04 1/2] Fix memory leakage in nodeSubplan.c.

If the hash functions used for hashing tuples leak any memory, we
failed to clean that up, resulting in query-lifespan memory leakage
in queries using hashed subplans. One way that could happen is if
the values being hashed require de-toasting, since most of our hash
functions don't trouble to clean up de-toasted inputs.

Prior to commit bf6c614a2, this leakage was largely masked because
TupleHashTableMatch would reset hashtable->tempcxt (via
execTuplesMatch). But it doesn't do that anymore, and that's not
really the right place for this anyway: doing it there could reset
the tempcxt many times per hash lookup, or not at all. Instead put
reset calls into ExecHashSubPlan and buildSubPlanHash. Along the way
to that, rearrange ExecHashSubPlan so that there's just one place to
call MemoryContextReset instead of several.

This amounts to accepting the de-facto API spec that the caller of
the TupleHashTable routines is responsible for resetting the tempcxt
adequately often. Although the other callers seem to get this right,
it was not documented anywhere, so add a comment about it.

Bug: #19040
Reported-by: Haiyang Li <mohen.lhy@alibaba-inc.com>
Author: Haiyang Li <mohen.lhy@alibaba-inc.com>
Reviewed-by: Fei Changhong <feichanghong@qq.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/19040-c9b6073ef814f48c@postgresql.org
Backpatch-through: 13
---
 src/backend/executor/execGrouping.c |  6 +++
 src/backend/executor/nodeSubplan.c  | 70 +++++++++++------------------
 2 files changed, 33 insertions(+), 43 deletions(-)

diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index b5400749353..75087204f0c 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -156,6 +156,12 @@ execTuplesHashPrepare(int numCols,
  *
  * Note that the keyColIdx, hashfunctions, and collations arrays must be
  * allocated in storage that will live as long as the hashtable does.
+ *
+ * LookupTupleHashEntry, FindTupleHashEntry, and related functions may leak
+ * memory in the tempcxt. It is caller's responsibility to reset that context
+ * reasonably often, typically once per tuple. (We do it that way, rather
+ * than managing an extra context within the hashtable, because in many cases
+ * the caller can specify a tempcxt that it needs to reset per-tuple anyway.)
  */
 TupleHashTable
 BuildTupleHashTable(PlanState *parent,
diff --git a/src/backend/executor/nodeSubplan.c b/src/backend/executor/nodeSubplan.c
index f7f6fc2da0b..8e55dcc159b 100644
--- a/src/backend/executor/nodeSubplan.c
+++ b/src/backend/executor/nodeSubplan.c
@@ -102,6 +102,7 @@ ExecHashSubPlan(SubPlanState *node,
 				ExprContext *econtext,
 				bool *isNull)
 {
+	bool		result = false;
 	SubPlan    *subplan = node->subplan;
 	PlanState  *planstate = node->planstate;
 	TupleTableSlot *slot;
@@ -132,14 +133,6 @@ ExecHashSubPlan(SubPlanState *node,
 	node->projLeft->pi_exprContext = econtext;
 	slot = ExecProject(node->projLeft);
 
-	/*
-	 * Note: because we are typically called in a per-tuple context, we have
-	 * to explicitly clear the projected tuple before returning. Otherwise,
-	 * we'll have a double-free situation: the per-tuple context will probably
-	 * be reset before we're called again, and then the tuple slot will think
-	 * it still needs to free the tuple.
-	 */
-
 	/*
 	 * If the LHS is all non-null, probe for an exact match in the main hash
 	 * table. If we find one, the result is TRUE. Otherwise, scan the
@@ -161,19 +154,10 @@ ExecHashSubPlan(SubPlanState *node,
 							   slot,
 							   node->cur_eq_comp,
 							   node->lhs_hash_expr) != NULL)
-		{
-			ExecClearTuple(slot);
-			return BoolGetDatum(true);
-		}
-		if (node->havenullrows &&
-			findPartialMatch(node->hashnulls, slot, node->cur_eq_funcs))
-		{
-			ExecClearTuple(slot);
+			result = true;
+		else if (node->havenullrows &&
+				 findPartialMatch(node->hashnulls, slot, node->cur_eq_funcs))
 			*isNull = true;
-			return BoolGetDatum(false);
-		}
-		ExecClearTuple(slot);
-		return BoolGetDatum(false);
 	}
 
 	/*
@@ -186,34 +170,31 @@ ExecHashSubPlan(SubPlanState *node,
 	 * aren't provably unequal to the LHS; if so, the result is UNKNOWN.
 	 * Otherwise, the result is FALSE.
 	 */
-	if (node->hashnulls == NULL)
-	{
-		ExecClearTuple(slot);
-		return BoolGetDatum(false);
-	}
-	if (slotAllNulls(slot))
-	{
-		ExecClearTuple(slot);
+	else if (node->hashnulls == NULL)
+		 /* just return FALSE */ ;
+	else if (slotAllNulls(slot))
 		*isNull = true;
-		return BoolGetDatum(false);
-	}
 	/* Scan partly-null table first, since more likely to get a match */
-	if (node->havenullrows &&
-		findPartialMatch(node->hashnulls, slot, node->cur_eq_funcs))
-	{
-		ExecClearTuple(slot);
+	else if (node->havenullrows &&
+			 findPartialMatch(node->hashnulls, slot, node->cur_eq_funcs))
 		*isNull = true;
-		return BoolGetDatum(false);
-	}
-	if (node->havehashrows &&
-		findPartialMatch(node->hashtable, slot, node->cur_eq_funcs))
-	{
-		ExecClearTuple(slot);
+	else if (node->havehashrows &&
+			 findPartialMatch(node->hashtable, slot, node->cur_eq_funcs))
 		*isNull = true;
-		return BoolGetDatum(false);
-	}
+
+	/*
+	 * Note: because we are typically called in a per-tuple context, we have
+	 * to explicitly clear the projected tuple before returning. Otherwise,
+	 * we'll have a double-free situation: the per-tuple context will probably
+	 * be reset before we're called again, and then the tuple slot will think
+	 * it still needs to free the tuple.
+	 */
 	ExecClearTuple(slot);
-	return BoolGetDatum(false);
+
+	/* Also must reset the hashtempcxt after each hashtable lookup. */
+	MemoryContextReset(node->hashtempcxt);
+
+	return BoolGetDatum(result);
 }
 
 /*
@@ -642,6 +623,9 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
 		 * during ExecProject.
 		 */
 		ResetExprContext(innerecontext);
+
+		/* Also must reset the hashtempcxt after each hashtable lookup. */
+		MemoryContextReset(node->hashtempcxt);
 	}
 
 	/*
-- 
2.43.7

From 3df0846dfe41955353287b4b33caf43e1dd2e399 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Mon, 8 Sep 2025 16:41:03 -0400
Subject: [PATCH v04 2/2] Eliminate duplicative hashtempcxt in nodeSubplan.c.

Instead of building a separate memory context that's used just for
running hash functions, make the hash functions run in the per-tuple
context of the node's innerecontext. This saves a little space at
runtime, and it avoids needing to reset two contexts instead of one
inside buildSubPlanHash's main loop.

Per discussion of bug #19040, although this is not directly a fix
for that.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/19040-c9b6073ef814f48c@postgresql.org
---
 src/backend/executor/nodeSubplan.c | 19 +++++--------------
 src/include/nodes/execnodes.h      |  1 -
 2 files changed, 5 insertions(+), 15 deletions(-)

diff --git a/src/backend/executor/nodeSubplan.c b/src/backend/executor/nodeSubplan.c
index 8e55dcc159b..53fb56f7388 100644
--- a/src/backend/executor/nodeSubplan.c
+++ b/src/backend/executor/nodeSubplan.c
@@ -191,8 +191,8 @@ ExecHashSubPlan(SubPlanState *node,
 	 */
 	ExecClearTuple(slot);
 
-	/* Also must reset the hashtempcxt after each hashtable lookup. */
-	MemoryContextReset(node->hashtempcxt);
+	/* Also must reset the innerecontext after each hashtable lookup. */
+	ResetExprContext(node->innerecontext);
 
 	return BoolGetDatum(result);
 }
@@ -529,7 +529,7 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
 										  0,
 										  node->planstate->state->es_query_cxt,
 										  node->hashtablecxt,
-										  node->hashtempcxt,
+										  innerecontext->ecxt_per_tuple_memory,
 										  false);
 
 	if (!subplan->unknownEqFalse)
@@ -558,7 +558,7 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
 											  0,
 											  node->planstate->state->es_query_cxt,
 											  node->hashtablecxt,
-											  node->hashtempcxt,
+											  innerecontext->ecxt_per_tuple_memory,
 											  false);
 	}
 	else
@@ -620,12 +620,9 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
 
 		/*
 		 * Reset innerecontext after each inner tuple to free any memory used
-		 * during ExecProject.
+		 * during ExecProject and hashtable lookup.
 		 */
 		ResetExprContext(innerecontext);
-
-		/* Also must reset the hashtempcxt after each hashtable lookup. */
-		MemoryContextReset(node->hashtempcxt);
 	}
 
 	/*
@@ -842,7 +839,6 @@ ExecInitSubPlan(SubPlan *subplan, PlanState *parent)
 	sstate->hashtable = NULL;
 	sstate->hashnulls = NULL;
 	sstate->hashtablecxt = NULL;
-	sstate->hashtempcxt = NULL;
 	sstate->innerecontext = NULL;
 	sstate->keyColIdx = NULL;
 	sstate->tab_eq_funcoids = NULL;
@@ -898,11 +894,6 @@ ExecInitSubPlan(SubPlan *subplan, PlanState *parent)
 		sstate->hashtablecxt =
 			AllocSetContextCreate(CurrentMemoryContext,
 								  "Subplan HashTable Context",
 								  ALLOCSET_DEFAULT_SIZES);
-		/* and a small one for the hash tables to use as temp storage */
-		sstate->hashtempcxt =
-			AllocSetContextCreate(CurrentMemoryContext,
-								  "Subplan HashTable Temp Context",
-								  ALLOCSET_SMALL_SIZES);
 		/* and a short-lived exprcontext for function evaluation */
 		sstate->innerecontext = CreateExprContext(estate);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index de782014b2d..71857feae48 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1020,7 +1020,6 @@ typedef struct SubPlanState
 	bool		havehashrows;	/* true if hashtable is not empty */
 	bool		havenullrows;	/* true if hashnulls is not empty */
 	MemoryContext hashtablecxt; /* memory context containing hash tables */
-	MemoryContext hashtempcxt;	/* temp memory context for hash tables */
 	ExprContext *innerecontext; /* econtext for computing inner tuples */
 	int			numCols;		/* number of columns being hashed */
 	/* each of the remaining fields is an array of length numCols: */
-- 
2.43.7
Re: BUG #19040: Memory leak in hashed subplan node due to missing hashtempcxt reset
On Sep 9, 2025, at 04:46, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Concretely, I'm thinking about the attached. 0001 is the same
> logic as in the v02 patch, but I felt we could make the code
> be shorter and prettier instead of longer and uglier. That's
> meant for back-patch, and then 0002 is for master only.

For v04-0002, I checked the commit history, and before commit 133924e1,
buildSubPlanHash was using ecxt_per_tuple_memory. It seems that the
"Subplan HashTable Temp Context" was introduced in order to fix a certain bug.
Fei Changhong
feichanghong <feichanghong@qq.com> writes:
> For v04-0002, I checked the commit history, and before commit 133924e1,
> buildSubPlanHash was using ecxt_per_tuple_memory. It seems that the
> "Subplan HashTable Temp Context" was introduced in order to fix a certain bug.

It was a different ExprContext's ecxt_per_tuple_memory, though.
This one is owned locally by the TupleHashTable.

regards, tom lane
"=?UTF-8?B?5p2O5rW35rSLKOmZjOeXlSk=?=" <mohen.lhy@alibaba-inc.com> writes: > For v04-0002, I checked the commit history, and before commit 133924e1, > buildSubPlanHash was using ecxt_per_tuple_memory. It seems that the > "Subplan HashTable Temp Context" was introduced in order to fix a certain bug. You're quite right that 0002 looks suspiciously like it's reverting 133924e1. However, it doesn't cause the test case added by that commit to fail (even under Valgrind). I think the reason is what you say next: > However, the changed behavior of TupleHashTableMatch introduced by commit > bf6c614a (noted in [1]) may make the condition: > ``` > However, the hashtable routines feel free to reset their temp context at any time, > which'd lead to destroying input data that was still needed. > ``` > no longer holds true. Then, the lifespan of tempcxt in buildHashtable is similar > to that of innercontext->ecxt_per_tuple_memory, so it makes sense to merge the two, > I think. Looking around in 133924e1^, the only place in execGrouping.c that would reset the hashtable's tempcxt was execTuplesMatch (which was called by TupleHashTableMatch). But bf6c614a removed execTuplesMatch. The modern execGrouping.c code resets hashtable->tempcxt nowhere. Instead, TupleHashTableMatch applies ExecQualAndReset to hashtable->exprcontext, so that what is reset is the ecxt_per_tuple_memory of that econtext --- but that's manufactured by CreateStandaloneExprContext in BuildTupleHashTable, and has no connection at all to any context that nodeSubplan.c will touch. It is certainly not the same ecxt_per_tuple_memory that pre-133924e1 was resetting. So basically I'm saying that bf6c614a made it okay to revert 133924e1. > BTW, I ran the test case supported in commit 133924e1 on version not contained commit > 133924e1 (tag REL8_0_26). I did not find any problems. But i can not find more information > about this issue. Digging in the archives, I found the discussion leading to 133924e1 at https://www.postgresql.org/message-id/flat/i2jnbo%241lcj%241%40news.hub.org As for trying it on 8.0.26, the commit message for 133924e1 says specifically that the problem isn't there pre-8.1. I did try to duplicate your test using 133924e1^, and it didn't fail for me either. But I'm not too excited about that, because building PG versions this old on modern platforms is really hard. I had to compile with -O0 to get a build that worked at all, and that's already a significant difference from the code we would have been testing back in 2010. It may be that the reason for non-reproduction is buried somewhere in that, or in the different compiler toolchain. I'm not sure it's worth a lot of effort to dig deeply into the details. regards, tom lane
Re: BUG #19040: Memory leak in hashed subplan node due to missing hashtempcxt reset
On Sep 10, 2025, at 09:55, 李海洋(陌痕) <mohen.lhy@alibaba-inc.com> wrote:
> On 2025-09-09 22:24:22 Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Digging in the archives, I found the discussion leading to 133924e1 at
>> https://www.postgresql.org/message-id/flat/i2jnbo%241lcj%241%40news.hub.org
> Having understood the background of this issue, and given that v04-0002 only
> applies to master, v04-0002 LGTM, again. Thanks for your analysis.

Thanks for the analysis! v04-0002 looks fine to me — please just apply it on
HEAD.
Fei Changhong
feichanghong <feichanghong@qq.com> writes:
> Thanks for the analysis! v04-0002 looks fine to me — please just apply it on
> HEAD.

Done that way. I thought a bit about your suggestion of adding some
kind of assertion check for memory leaks, but it looks too messy and
specialized as-is. Maybe with reflection we can find a more generic
idea.

regards, tom lane