Обсуждение: "type with xxxx does not exist" when doing ExecMemoize()
Hi,
I met Memoize node failed When I used sqlancer test postgres.
database0=# explain select t0.c0 from t0 join t5 on t0.c0 = (t5.c0 - t5.c0);
QUERY PLAN
--------------------------------------------------------------------------------------
Nested Loop (cost=0.17..21.20 rows=4 width=32)
-> Seq Scan on t5 (cost=0.00..1.04 rows=4 width=14)
-> Memoize (cost=0.17..6.18 rows=1 width=32)
Cache Key: (t5.c0 - t5.c0)
Cache Mode: logical
-> Index Only Scan using t0_c0_key on t0 (cost=0.15..6.17 rows=1 width=32)
Index Cond: (c0 = (t5.c0 - t5.c0))
(7 rows)
QUERY PLAN
--------------------------------------------------------------------------------------
Nested Loop (cost=0.17..21.20 rows=4 width=32)
-> Seq Scan on t5 (cost=0.00..1.04 rows=4 width=14)
-> Memoize (cost=0.17..6.18 rows=1 width=32)
Cache Key: (t5.c0 - t5.c0)
Cache Mode: logical
-> Index Only Scan using t0_c0_key on t0 (cost=0.15..6.17 rows=1 width=32)
Index Cond: (c0 = (t5.c0 - t5.c0))
(7 rows)
database0=# select t0.c0 from t0 join t5 on t0.c0 = (t5.c0 - t5.c0);
ERROR: type with OID 2139062143 does not exist
ERROR: type with OID 2139062143 does not exist
How to repeat:
The attached database0.log (created by sqlancer) included statements to repeat this issue.
Firstly, create database test;
then;
psql postgres
\i /xxx/database0.log
I analyzed aboved issue this weekend. And I found that
After called ResetExprContext() in MemoizeHash_hash(), the data in mstate->probeslot was corrputed.
in prepare_probe_slot: the data as below:
(gdb) p *(DatumGetRangeTypeP(pslot->tts_values[0]))
$1 = {vl_len_ = 36, rangetypid = 3904}
after called ResetExprContext() in MemoizeHash_hash:
(gdb) p *(DatumGetRangeTypeP(pslot->tts_values[0]))
$3 = {vl_len_ = 264, rangetypid = 2139062143}
I think in prepare_probe_slot(), should called datumCopy as the attached patch does.
Any thoughts? Thanks.
--
Tender Wang
OpenPie: https://en.openpie.com/Вложения
On 25/2/2024 20:32, Tender Wang wrote: > I think in prepare_probe_slot(), should called datumCopy as the attached > patch does. > > Any thoughts? Thanks. Thanks for the report. I think it is better to invent a Runtime Memory Context; likewise, it is already designed in IndexScan and derivatives. Here, you just allocate the value in some upper memory context. Also, I'm curious why such a trivial error hasn't been found for a long time -- regards, Andrei Lepikhov Postgres Professional
On 26/2/2024 09:52, Andrei Lepikhov wrote: > On 25/2/2024 20:32, Tender Wang wrote: >> I think in prepare_probe_slot(), should called datumCopy as the >> attached patch does. >> >> Any thoughts? Thanks. > Thanks for the report. > I think it is better to invent a Runtime Memory Context; likewise, it is > already designed in IndexScan and derivatives. Here, you just allocate > the value in some upper memory context. > Also, I'm curious why such a trivial error hasn't been found for a long > time Hmmm. I see the problem (test.sql in attachment for reproduction and results). We only detect it by the number of Hits: Cache Key: t1.x, (t1.t)::numeric Cache Mode: logical Hits: 0 Misses: 30 Evictions: 0 Overflows: 0 Memory Usage: 8kB We see no hits in logical mode and 100 hits in binary mode. We see 15 hits for both logical and binary mode if parameters are integer numbers - no problems with resetting expression context. Your patch resolves the issue for logical mode - I see 15 hits for integer and complex keys. But I still see 100 hits in binary mode. Maybe we still have a problem? What's more, why the Memoize node doesn't see the problem at all? -- regards, Andrei Lepikhov Postgres Professional
Вложения
On 26/2/2024 12:44, Tender Wang wrote: > > > Andrei Lepikhov <a.lepikhov@postgrespro.ru > <mailto:a.lepikhov@postgrespro.ru>> 于2024年2月26日周一 10:57写道: > > On 25/2/2024 20:32, Tender Wang wrote: > > I think in prepare_probe_slot(), should called datumCopy as the > attached > > patch does. > > > > Any thoughts? Thanks. > Thanks for the report. > I think it is better to invent a Runtime Memory Context; likewise, > it is > already designed in IndexScan and derivatives. Here, you just allocate > the value in some upper memory context. > Also, I'm curious why such a trivial error hasn't been found for a > long time > > > Make sense. I found MemoizeState already has a MemoryContext, so I used it. > I update the patch. This approach is better for me. In the next version of this patch, I included a test case. I am still unsure about the context chosen and the stability of the test case. Richard, you recently fixed some Memoize issues, could you look at this problem and patch? -- regards, Andrei Lepikhov Postgres Professional
Вложения
On Mon, Feb 26, 2024 at 3:54 PM Andrei Lepikhov <a.lepikhov@postgrespro.ru> wrote:
On 26/2/2024 12:44, Tender Wang wrote:
> Make sense. I found MemoizeState already has a MemoryContext, so I used it.
> I update the patch.
This approach is better for me. In the next version of this patch, I
included a test case. I am still unsure about the context chosen and the
stability of the test case. Richard, you recently fixed some Memoize
issues, could you look at this problem and patch?
I looked at this issue a bit. It seems to me what happens is that at
first the memory areas referenced by probeslot->tts_values[] are
allocated in the per tuple context (see prepare_probe_slot). And then
in MemoizeHash_hash, after we've calculated the hashkey, we will reset
the per tuple context. However, later in MemoizeHash_equal, we still
need to reference the values in probeslot->tts_values[], which have been
cleared.
Actually the context would always be reset in MemoizeHash_equal, for
both binary and logical mode. So I kind of wonder if it's necessary to
reset the context in MemoizeHash_hash.
The ResetExprContext call in MemoizeHash_hash was introduced in
0b053e78b to fix a memory leak issue.
commit 0b053e78b5990cd01e7169ee5bd2bb8e4045deea
Author: David Rowley <drowley@postgresql.org>
Date: Thu Oct 5 20:30:47 2023 +1300
Fix memory leak in Memoize code
It seems to me that switching to the per-tuple memory context is
sufficient to fix the memory leak. Calling ResetExprContext in
MemoizeHash_hash each time seems too aggressive.
I tried to remove the ResetExprContext call in MemoizeHash_hash and did
not see the memory leak with the repro query in [1].
diff --git a/src/backend/executor/nodeMemoize.c b/src/backend/executor/nodeMemoize.c
index 18870f10e1..f2f025520d 100644
--- a/src/backend/executor/nodeMemoize.c
+++ b/src/backend/executor/nodeMemoize.c
@@ -207,7 +207,6 @@ MemoizeHash_hash(struct memoize_hash *tb, const MemoizeKey *key)
}
}
- ResetExprContext(econtext);
MemoryContextSwitchTo(oldcontext);
return murmurhash32(hashkey);
}
Looping in David to have a look.
[1] https://www.postgresql.org/message-id/flat/83281eed63c74e4f940317186372abfd%40cft.ru
Thanks
Richard
first the memory areas referenced by probeslot->tts_values[] are
allocated in the per tuple context (see prepare_probe_slot). And then
in MemoizeHash_hash, after we've calculated the hashkey, we will reset
the per tuple context. However, later in MemoizeHash_equal, we still
need to reference the values in probeslot->tts_values[], which have been
cleared.
Actually the context would always be reset in MemoizeHash_equal, for
both binary and logical mode. So I kind of wonder if it's necessary to
reset the context in MemoizeHash_hash.
The ResetExprContext call in MemoizeHash_hash was introduced in
0b053e78b to fix a memory leak issue.
commit 0b053e78b5990cd01e7169ee5bd2bb8e4045deea
Author: David Rowley <drowley@postgresql.org>
Date: Thu Oct 5 20:30:47 2023 +1300
Fix memory leak in Memoize code
It seems to me that switching to the per-tuple memory context is
sufficient to fix the memory leak. Calling ResetExprContext in
MemoizeHash_hash each time seems too aggressive.
I tried to remove the ResetExprContext call in MemoizeHash_hash and did
not see the memory leak with the repro query in [1].
diff --git a/src/backend/executor/nodeMemoize.c b/src/backend/executor/nodeMemoize.c
index 18870f10e1..f2f025520d 100644
--- a/src/backend/executor/nodeMemoize.c
+++ b/src/backend/executor/nodeMemoize.c
@@ -207,7 +207,6 @@ MemoizeHash_hash(struct memoize_hash *tb, const MemoizeKey *key)
}
}
- ResetExprContext(econtext);
MemoryContextSwitchTo(oldcontext);
return murmurhash32(hashkey);
}
Looping in David to have a look.
[1] https://www.postgresql.org/message-id/flat/83281eed63c74e4f940317186372abfd%40cft.ru
Thanks
Richard
On 26/2/2024 18:34, Richard Guo wrote: > > On Mon, Feb 26, 2024 at 3:54 PM Andrei Lepikhov > <a.lepikhov@postgrespro.ru <mailto:a.lepikhov@postgrespro.ru>> wrote: > > On 26/2/2024 12:44, Tender Wang wrote: > > Make sense. I found MemoizeState already has a MemoryContext, so > I used it. > > I update the patch. > This approach is better for me. In the next version of this patch, I > included a test case. I am still unsure about the context chosen and > the > stability of the test case. Richard, you recently fixed some Memoize > issues, could you look at this problem and patch? > > > I looked at this issue a bit. It seems to me what happens is that at > first the memory areas referenced by probeslot->tts_values[] are > allocated in the per tuple context (see prepare_probe_slot). And then > in MemoizeHash_hash, after we've calculated the hashkey, we will reset > the per tuple context. However, later in MemoizeHash_equal, we still > need to reference the values in probeslot->tts_values[], which have been > cleared. Agree > > Actually the context would always be reset in MemoizeHash_equal, for > both binary and logical mode. So I kind of wonder if it's necessary to > reset the context in MemoizeHash_hash. I can only provide one thought against this solution: what if we have a lot of unique hash values, maybe all of them? In that case, we still have a kind of 'leak' David fixed by the commit 0b053e78b5. Also, I have a segfault report of one client. As I see, it was caused by too long text column in the table slot. As I see, key value, stored in the Memoize hash table, was corrupted, and the most plain reason is this bug. Should we add a test on this bug, and what do you think about the one proposed in v3? -- regards, Andrei Lepikhov Postgres Professional
The attached patch is a new version based on v3(not including Andrei's the test case). There is no need to call datumCopy when
isnull is true.
I have not added a new runtime memoryContext so far. Continue to use mstate->tableContext, I'm not sure the memory used of probeslot will affect mstate->mem_limit.
Maybe adding a new memoryContext is better. I think I should spend a little time to learn nodeMemoize.c more deeply.
Andrei Lepikhov <a.lepikhov@postgrespro.ru> 于2024年2月26日周一 20:29写道:
On 26/2/2024 18:34, Richard Guo wrote:
>
> On Mon, Feb 26, 2024 at 3:54 PM Andrei Lepikhov
> <a.lepikhov@postgrespro.ru <mailto:a.lepikhov@postgrespro.ru>> wrote:
>
> On 26/2/2024 12:44, Tender Wang wrote:
> > Make sense. I found MemoizeState already has a MemoryContext, so
> I used it.
> > I update the patch.
> This approach is better for me. In the next version of this patch, I
> included a test case. I am still unsure about the context chosen and
> the
> stability of the test case. Richard, you recently fixed some Memoize
> issues, could you look at this problem and patch?
>
>
> I looked at this issue a bit. It seems to me what happens is that at
> first the memory areas referenced by probeslot->tts_values[] are
> allocated in the per tuple context (see prepare_probe_slot). And then
> in MemoizeHash_hash, after we've calculated the hashkey, we will reset
> the per tuple context. However, later in MemoizeHash_equal, we still
> need to reference the values in probeslot->tts_values[], which have been
> cleared.
Agree
>
> Actually the context would always be reset in MemoizeHash_equal, for
> both binary and logical mode. So I kind of wonder if it's necessary to
> reset the context in MemoizeHash_hash.
I can only provide one thought against this solution: what if we have a
lot of unique hash values, maybe all of them? In that case, we still
have a kind of 'leak' David fixed by the commit 0b053e78b5.
Also, I have a segfault report of one client. As I see, it was caused by
too long text column in the table slot. As I see, key value, stored in
the Memoize hash table, was corrupted, and the most plain reason is this
bug. Should we add a test on this bug, and what do you think about the
one proposed in v3?
--
regards,
Andrei Lepikhov
Postgres Professional
Tender Wang
OpenPie: https://en.openpie.com/Вложения
On 28/2/2024 13:53, Tender Wang wrote: > The attached patch is a new version based on v3(not including Andrei's > the test case). There is no need to call datumCopy when > isnull is true. > > I have not added a new runtime memoryContext so far. Continue to use > mstate->tableContext, I'm not sure the memory used of probeslot will > affect mstate->mem_limit. > Maybe adding a new memoryContext is better. I think I should spend a > little time to learn nodeMemoize.c more deeply. I am curious about your reasons to stay with tableContext. In terms of memory allocation, Richard's approach looks better. Also, You don't need to initialize tts_values[i] at all if tts_isnull[i] set to true. -- regards, Andrei Lepikhov Postgres Professional
I read Memoize code and how other node use ResetExprContext() recently.
The comments about per_tuple_memory said that :
* ecxt_per_tuple_memory is a short-term context for expression results.
* As the name suggests, it will typically be reset once per tuple,
* before we begin to evaluate expressions for that tuple. Each
* ExprContext normally has its very own per-tuple memory context.
* As the name suggests, it will typically be reset once per tuple,
* before we begin to evaluate expressions for that tuple. Each
* ExprContext normally has its very own per-tuple memory context.
So ResetExprContext() should called once per tuple, but not in Hash and Equal function just as Richard said before.
In ExecResult() and ExecProjectSet(), they call ResetExprContext() once when enter these functions.
So I think ExecMemoize() can do the same way.
The attached patch includes below modifications:
1.
When I read the code in nodeMemoize.c, I found a typos: outer should be inner,
if I don't misunderstand the intend of Memoize.
2.
I found that almost executor node call CHECK_FOR_INTERRUPTS(), so I add it.
Is it right to add it for ExecMemoize()?
3.
I remove ResetExprContext() from Hash and Equal funciton. And I call it when enter
ExecMemoize() just like ExecPrejectSet() does.
ExecQualAndReset() is replaed with ExecQual().
4.
This patch doesn't include test case. I use the Andrei's test case, but I don't repeat the aboved issue.
I may need to spend some more time to think about how to repeat this issue easily.
I don't want to continue to do work based on v3 patch. As Andrei Lepikhov said, using mstate->tableContext for probeslot
is not good. v5 looks more simple.
Вложения
Hi,
When I think about how to add a test case for v5 version patch, and I want to test if v5 version patch has memory leak.
This thread [1] provided a way how to repeat the memory leak, so I used it to test v5 patch. I didn't found memory leak on
v5 patch.
But I found other interesting issue. When changed whereClause in [1], the query reported below error:
"ERROR could not find memoization table entry"
the query:
EXPLAIN analyze
select sum(q.id_table1)
from (
SELECT t2.*
FROM table1 t1
JOIN table2 t2
ON (t2.id_table1 + t2.id_table1) = t1.id) q;
select sum(q.id_table1)
from (
SELECT t2.*
FROM table1 t1
JOIN table2 t2
ON (t2.id_table1 + t2.id_table1) = t1.id) q;
But on v5 patch, it didn't report error.
I guess it is the same reason that data in probeslot was reset in Hash function.
I debug the above query, and get this:
before
(gdb) p *(DatumGetNumeric(mstate->probeslot->tts_values[0]))
$1 = {vl_len_ = 48, choice = {n_header = 32770, n_long = {n_sign_dscale = 32770, n_weight = 60, n_data = 0x564632ebd708}, n_short = {n_header = 32770, n_data = 0x564632ebd706}}}
after
(gdb) p *(DatumGetNumeric(mstate->probeslot->tts_values[0]))
$2 = {vl_len_ = 264, choice = {n_header = 32639, n_long = {n_sign_dscale = 32639, n_weight = 32639, n_data = 0x564632ebd6a8}, n_short = {n_header = 32639, n_data = 0x564632ebd6a6}}}
So after call ResetExprContext() in Hash function, the data in probeslot is corrupted. It is not sure what error will happen when executing on corrupted data.
During debug, I learned that numeric_add doesn't have type check like rangetype, so aboved query will not report "type with xxx does not exist".
And I realize that the test case added by Andrei Lepikhov in v3 is right. So in v6 patch I add Andrei Lepikhov's test case. Thanks a lot.
Now I think the v6 version patch seems to be complete now.
Tender Wang
OpenPie: https://en.openpie.com/Вложения
On 1/3/2024 14:18, Tender Wang wrote: > During debug, I learned that numeric_add doesn't have type check like > rangetype, so aboved query will not report "type with xxx does not exist". > > And I realize that the test case added by Andrei Lepikhov in v3 is > right. So in v6 patch I add Andrei Lepikhov's test case. Thanks a lot. > > Now I think the v6 version patch seems to be complete now. I've passed through the patch, and it looks okay. Although I am afraid of the same problems that future changes can cause and how to detect them, it works correctly. -- regards, Andrei Lepikhov Postgres Professional
Andrei Lepikhov <a.lepikhov@postgrespro.ru> 于2024年3月5日周二 17:36写道:
On 1/3/2024 14:18, Tender Wang wrote:
> During debug, I learned that numeric_add doesn't have type check like
> rangetype, so aboved query will not report "type with xxx does not exist".
>
> And I realize that the test case added by Andrei Lepikhov in v3 is
> right. So in v6 patch I add Andrei Lepikhov's test case. Thanks a lot.
>
> Now I think the v6 version patch seems to be complete now.
I've passed through the patch, and it looks okay. Although I am afraid
of the same problems that future changes can cause and how to detect
them, it works correctly.
Thanks for reviewing it, and I add it to commitfest 2024-07.
Tender Wang
OpenPie: https://en.openpie.com/On 6/3/2024 10:10, Tender Wang wrote: > > > Andrei Lepikhov <a.lepikhov@postgrespro.ru > <mailto:a.lepikhov@postgrespro.ru>> 于2024年3月5日周二 17:36写道: > > On 1/3/2024 14:18, Tender Wang wrote: > > During debug, I learned that numeric_add doesn't have type check > like > > rangetype, so aboved query will not report "type with xxx does > not exist". > > > > And I realize that the test case added by Andrei Lepikhov in v3 is > > right. So in v6 patch I add Andrei Lepikhov's test case. Thanks > a lot. > > > > Now I think the v6 version patch seems to be complete now. > I've passed through the patch, and it looks okay. Although I am afraid > of the same problems that future changes can cause and how to detect > them, it works correctly. > > > Thanks for reviewing it, and I add it to commitfest 2024-07. I think, it is a bug. Should it be fixed (and back-patched) earlier? -- regards, Andrei Lepikhov Postgres Professional
Andrei Lepikhov <a.lepikhov@postgrespro.ru> 于2024年3月6日周三 11:37写道:
I think, it is a bug. Should it be fixed (and back-patched) earlier?
Agreed. Need David to review it as he knows this area best.
Tender Wang
OpenPie: https://en.openpie.com/On Thu, 7 Mar 2024 at 15:24, Tender Wang <tndrwang@gmail.com> wrote: > > Andrei Lepikhov <a.lepikhov@postgrespro.ru> 于2024年3月6日周三 11:37写道: >> I think, it is a bug. Should it be fixed (and back-patched) earlier? > > Agreed. Need David to review it as he knows this area best. This is on my list of things to do. Just not at the top yet. David
On Thu, 7 Mar 2024 at 22:50, David Rowley <dgrowleyml@gmail.com> wrote: > > On Thu, 7 Mar 2024 at 15:24, Tender Wang <tndrwang@gmail.com> wrote: > > > > Andrei Lepikhov <a.lepikhov@postgrespro.ru> 于2024年3月6日周三 11:37写道: > >> I think, it is a bug. Should it be fixed (and back-patched) earlier? > > > > Agreed. Need David to review it as he knows this area best. > > This is on my list of things to do. Just not at the top yet. I've gone over this patch and I'm happy with the changes to nodeMemoize.c. The thing I did change was the newly added test. The problem there was the test was passing for me with and without the code fix. I ended up changing the test so the cache hits and misses are reported. That required moving the test to above where the work_mem is set to 64KB so we can be certain the values will all be cached and the cache hits are predictable. My other changes were just cosmetic. Thanks for working on this fix. I've pushed the patch. David
David Rowley <dgrowleyml@gmail.com> 于2024年3月11日周一 13:25写道:
On Thu, 7 Mar 2024 at 22:50, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Thu, 7 Mar 2024 at 15:24, Tender Wang <tndrwang@gmail.com> wrote:
> >
> > Andrei Lepikhov <a.lepikhov@postgrespro.ru> 于2024年3月6日周三 11:37写道:
> >> I think, it is a bug. Should it be fixed (and back-patched) earlier?
> >
> > Agreed. Need David to review it as he knows this area best.
>
> This is on my list of things to do. Just not at the top yet.
I've gone over this patch and I'm happy with the changes to
nodeMemoize.c. The thing I did change was the newly added test. The
problem there was the test was passing for me with and without the
code fix. I ended up changing the test so the cache hits and misses
are reported. That required moving the test to above where the
work_mem is set to 64KB so we can be certain the values will all be
cached and the cache hits are predictable.
My other changes were just cosmetic.
Thanks for working on this fix. I've pushed the patch.
David
Thanks for pushing the patch.
-- Tender Wang
OpenPie: https://en.openpie.com/