Обсуждение: BUG #19026: ResourceOwnerForget can't find owner for invalid plancache

Поиск
Список
Период
Сортировка

BUG #19026: ResourceOwnerForget can't find owner for invalid plancache

От
PG Bug reporting form
Дата:
The following bug has been logged on the website:

Bug reference:      19026
Logged by:          Alexander Lakhin
Email address:      exclusion@gmail.com
PostgreSQL version: 18beta3
Operating system:   Ubuntu 24.04
Description:

The following script:
create function part_hashint4_noop(value int4, seed int8)
    returns int8 as $$
create table t();
    select value + seed;
    $$ language sql strict immutable parallel safe;

create operator class part_test_int4_ops for type int4 using hash as
  function 2 part_hashint4_noop(int4, int8);

create table pt(i int) partition by hash (i part_test_int4_ops);
create table p1 partition of pt for values with (modulus 4, remainder 0);

insert into pt values (1);
insert into pt values (1);

triggers an expected error and then an internal one:
ERROR:  CREATE TABLE is not allowed in a non-volatile function
CONTEXT:  SQL function "part_hashint4_noop" statement 1

ERROR:  XX000: plancache reference 0x643ae41a7d98 is not owned by resource
owner Portal
CONTEXT:  SQL function "part_hashint4_noop" during startup
LOCATION:  ResourceOwnerForget, resowner.c:618

I also encountered a gibberish owner name (with more complicated queries)
like:
2025-08-19 16:32:22.878 EEST|postgres|regress002|68a47cdb.2d8921|ERROR:
plancache reference 0x5d56f0675c50 is not owned by resource owner
?^O^^?H??^H?????H??^H??^O^^?AUATUSH??^HI??H?w(H?=??

Reproduced starting from 0313c5dc6.


Re: BUG #19026: ResourceOwnerForget can't find owner for invalid plancache

От
Dilip Kumar
Дата:
On Wed, Aug 20, 2025 at 4:01 PM PG Bug reporting form
<noreply@postgresql.org> wrote:
>
> The following bug has been logged on the website:
>
> Bug reference:      19026
> Logged by:          Alexander Lakhin
> Email address:      exclusion@gmail.com
> PostgreSQL version: 18beta3
> Operating system:   Ubuntu 24.04
> Description:
>
> The following script:
> create function part_hashint4_noop(value int4, seed int8)
>     returns int8 as $$
> create table t();
>     select value + seed;
>     $$ language sql strict immutable parallel safe;
>
> create operator class part_test_int4_ops for type int4 using hash as
>   function 2 part_hashint4_noop(int4, int8);
>
> create table pt(i int) partition by hash (i part_test_int4_ops);
> create table p1 partition of pt for values with (modulus 4, remainder 0);
>
> insert into pt values (1);
> insert into pt values (1);
>
> triggers an expected error and then an internal one:
> ERROR:  CREATE TABLE is not allowed in a non-volatile function
> CONTEXT:  SQL function "part_hashint4_noop" statement 1
>
> ERROR:  XX000: plancache reference 0x643ae41a7d98 is not owned by resource
> owner Portal
> CONTEXT:  SQL function "part_hashint4_noop" during startup
> LOCATION:  ResourceOwnerForget, resowner.c:618

The problem is in init_execution_state() we store the

On error resource owner will be released, but that references are
still maintained in the plan and cowner in fcache, but on error the
transaction will be aborted and this resource owner will be released.
So next time when we try to clean up, it will access invalid memory.
So a simple fix would be to cleanup on error as attached POC.

--
Regards,
Dilip Kumar
Google

Вложения

Re: BUG #19026: ResourceOwnerForget can't find owner for invalid plancache

От
Kirill Reshke
Дата:
Hi!


First of all, Alexander, thank you for the excellent bug report!

On Wed, 20 Aug 2025 at 16:21, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> > ERROR:  XX000: plancache reference 0x643ae41a7d98 is not owned by resource
> > owner Portal
> > CONTEXT:  SQL function "part_hashint4_noop" during startup
> > LOCATION:  ResourceOwnerForget, resowner.c:618
>
> The problem is in init_execution_state() we store the
>
> On error resource owner will be released, but that references are
> still maintained in the plan and cowner in fcache, but on error the
> transaction will be aborted and this resource owner will be released.
> So next time when we try to clean up, it will access invalid memory.
> So a simple fix would be to cleanup on error as attached POC.
>
> --
> Regards,
> Dilip Kumar
> Google

As for the patch: should we add some regression test for this?
Also, I'm not terribly sure  what this fix does is the right thing to do.
Doesn’t it break some layer of abstraction here? My understanding is
that on transaction rollback, all resources should be freed in the
ResourceOwnerRelease
function and friends. Namely, the ReleaseCachedPlan call we make
before elog(ERROR) is breaking logic. Am I wrong?
For this sql cache there exists sql_exec_error_callback, which is
probably a better place to clean up on error.

How about attached?


--
Best regards,
Kirill Reshke

Вложения

Re: BUG #19026: ResourceOwnerForget can't find owner for invalid plancache

От
Dilip Kumar
Дата:
On Wed, Aug 20, 2025 at 5:13 PM Kirill Reshke <reshkekirill@gmail.com> wrote:
>
> Hi!
>
>
> First of all, Alexander, thank you for the excellent bug report!
>
> On Wed, 20 Aug 2025 at 16:21, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > > ERROR:  XX000: plancache reference 0x643ae41a7d98 is not owned by resource
> > > owner Portal
> > > CONTEXT:  SQL function "part_hashint4_noop" during startup
> > > LOCATION:  ResourceOwnerForget, resowner.c:618
> >
> > The problem is in init_execution_state() we store the
> >
> > On error resource owner will be released, but that references are
> > still maintained in the plan and cowner in fcache, but on error the
> > transaction will be aborted and this resource owner will be released.
> > So next time when we try to clean up, it will access invalid memory.
> > So a simple fix would be to cleanup on error as attached POC.
> >
> > --
> > Regards,
> > Dilip Kumar
> > Google
>
> As for the patch: should we add some regression test for this?
> Also, I'm not terribly sure  what this fix does is the right thing to do.
> Doesn’t it break some layer of abstraction here? My understanding is
> that on transaction rollback, all resources should be freed in the
> ResourceOwnerRelease
> function and friends. Namely, the ReleaseCachedPlan call we make
> before elog(ERROR) is breaking logic. Am I wrong?
> For this sql cache there exists sql_exec_error_callback, which is
> probably a better place to clean up on error.
>
> How about attached?

Thanks, right it's better to clean up in sql_exec_error_callback.  But
we should still release the cached plan before setting it to NULL in
order to leak the cache plan as I was doing in my patch.  So I have
merged the logic of both the ideas and came up with this.

I will add a test for this and send it in sometime.

--
Regards,
Dilip Kumar
Google

Вложения

Re: BUG #19026: ResourceOwnerForget can't find owner for invalid plancache

От
Kirill Reshke
Дата:
On Wed, 20 Aug 2025 at 18:25, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Aug 20, 2025 at 5:13 PM Kirill Reshke <reshkekirill@gmail.com> wrote:
> > How about attached?
>
> Thanks, right it's better to clean up in sql_exec_error_callback.  But
> we should still release the cached plan before setting it to NULL in
> order to leak the cache plan as I was doing in my patch.

I am under the impression that we don’t. Resource owner will release
all cached plans in the rollback tx handler.
This can be confirmed via gdb session: print  fcache->cplan in
sql_exec_error_callback and after this plan in ReleaseCachedPlan in
this repro.
These two pointers will be equal

--
Best regards,
Kirill Reshke



Re: BUG #19026: ResourceOwnerForget can't find owner for invalid plancache

От
Dilip Kumar
Дата:
On Wed, Aug 20, 2025 at 7:05 PM Kirill Reshke <reshkekirill@gmail.com> wrote:
>
> On Wed, 20 Aug 2025 at 18:25, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Wed, Aug 20, 2025 at 5:13 PM Kirill Reshke <reshkekirill@gmail.com> wrote:
> > > How about attached?
> >
> > Thanks, right it's better to clean up in sql_exec_error_callback.  But
> > we should still release the cached plan before setting it to NULL in
> > order to leak the cache plan as I was doing in my patch.
>
> I am under the impression that we don’t. Resource owner will release
> all cached plans in the rollback tx handler.
> This can be confirmed via gdb session: print  fcache->cplan in
> sql_exec_error_callback and after this plan in ReleaseCachedPlan in
> this repro.
> These two pointers will be equal
>

That's a valid point.  Thanks.  I think we can add the same test case
to what is given in this example, maybe in plpgsql.sql or maybe some
other file which is more relevant.  So we can continue with the v1
patch only you have sent, maybe you can add a one liner comment to it.

--
Regards,
Dilip Kumar
Google



Re: BUG #19026: ResourceOwnerForget can't find owner for invalid plancache

От
Tom Lane
Дата:
Dilip Kumar <dilipbalaut@gmail.com> writes:
> That's a valid point.  Thanks.  I think we can add the same test case
> to what is given in this example, maybe in plpgsql.sql or maybe some
> other file which is more relevant.  So we can continue with the v1
> patch only you have sent, maybe you can add a one liner comment to it.

I do not like either of these proposed patches.  Dilip's suffers from
an extremely myopic idea of which error reports could trigger the
problem, while Kirill's is abusing (IMV) the purpose of an error
context callback.  Those exist to add detail to an error report, not
to clean up state outside the error system, and elog.c doesn't exactly
guarantee that they will be invoked.

The reason this broke at 0313c5dc6 is that that enabled
SQLFunctionCaches to be re-used for the life of the associated
FmgrInfo, and when we are talking about an opclass support procedure,
that FmgrInfo is in the relcache so it is likely to last for the
life of the session.  So the presented test case causes us to error
out of execution of the SQL function during the first INSERT, but
its SQLFunctionCache still exists and has fcache->cplan != NULL,
even though error cleanup would've released the reference count
already.  When we come back to this point in the second INSERT,
init_execution_state is fooled into trying to release the
already-released cplan.

In practice, fcache->cplan will never be not-null after successful
completion of a SQL function, so one idea is to simply clear it
unconditionally as soon as we know we're starting a fresh execution,
more or less as in alternative-1 attached.  However that leaves me
a bit unsatisfied, because it doesn't protect against the case of
erroring out of a set-returning function: if we come in and see
eslist != NULL, we'll pick right back up attempting to execute
plans that probably aren't there anymore.  I think that that case
is unreachable today because we don't allow any opclass support
functions to be SRFs, and AFAIK there are no other cases where an
FmgrInfo would be re-used after a failed query.  Still, I'm inclined
to go with something more like alternative-2, which feels a little
more future-proof.

            regards, tom lane

diff --git a/src/backend/executor/functions.c b/src/backend/executor/functions.c
index 359aafea681..08cf9dce2af 100644
--- a/src/backend/executor/functions.c
+++ b/src/backend/executor/functions.c
@@ -612,7 +612,14 @@ init_sql_fcache(FunctionCallInfo fcinfo, bool lazyEvalOK)
     fcache->lazyEvalOK = lazyEvalOK;
     fcache->lazyEval = false;

-    /* Also reset data about where we are in the function. */
+    /*
+     * Also reset data about where we are in the function.  Notice we just
+     * clear cplan without doing ReleaseCachedPlan.  The only way that cplan
+     * could be non-NULL here is if we errored out of a previous execution of
+     * this SQLFunctionCache, in which case error abort would have released
+     * the plan reference, so we mustn't do so again.
+     */
+    fcache->cplan = NULL;
     fcache->eslist = NULL;
     fcache->next_query_index = 0;
     fcache->error_query_index = 0;
diff --git a/src/backend/executor/functions.c b/src/backend/executor/functions.c
index 359aafea681..92a7be09dd3 100644
--- a/src/backend/executor/functions.c
+++ b/src/backend/executor/functions.c
@@ -143,6 +143,7 @@ typedef struct SQLFunctionCache
 {
     SQLFunctionHashEntry *func; /* associated SQLFunctionHashEntry */

+    bool        active;            /* are we executing this cache entry? */
     bool        lazyEvalOK;        /* true if lazyEval is safe */
     bool        shutdown_reg;    /* true if registered shutdown callback */
     bool        lazyEval;        /* true if using lazyEval for result query */
@@ -556,6 +557,22 @@ init_sql_fcache(FunctionCallInfo fcinfo, bool lazyEvalOK)
         finfo->fn_extra = fcache;
     }

+    /*
+     * If the SQLFunctionCache is marked as active, we must have errored out
+     * of a prior execution.  Reset state.  (It might seem that we could also
+     * reach this during recursive invocation of a SQL function, but we won't
+     * because that case won't involve re-use of the same FmgrInfo.)
+     *
+     * In particular, we must clear fcache->cplan without doing
+     * ReleaseCachedPlan, because error cleanup from the prior execution would
+     * have taken care of releasing that plan.
+     */
+    if (fcache->active)
+    {
+        fcache->cplan = NULL;
+        fcache->eslist = NULL;
+    }
+
     /*
      * If we are resuming execution of a set-returning function, just keep
      * using the same cache.  We do not ask funccache.c to re-validate the
@@ -1597,6 +1614,9 @@ fmgr_sql(PG_FUNCTION_ARGS)
      */
     fcache = init_sql_fcache(fcinfo, lazyEvalOK);

+    /* Mark fcache as active */
+    fcache->active = true;
+
     /* Remember info that we might need later to construct tuplestore */
     fcache->tscontext = tscontext;
     fcache->randomAccess = randomAccess;
@@ -1853,6 +1873,9 @@ fmgr_sql(PG_FUNCTION_ARGS)
     if (es == NULL)
         fcache->eslist = NULL;

+    /* Mark fcache as inactive */
+    fcache->active = false;
+
     error_context_stack = sqlerrcontext.previous;

     return result;

Re: BUG #19026: ResourceOwnerForget can't find owner for invalid plancache

От
Tom Lane
Дата:
I wrote:
> In practice, fcache->cplan will never be not-null after successful
> completion of a SQL function, so one idea is to simply clear it
> unconditionally as soon as we know we're starting a fresh execution,
> more or less as in alternative-1 attached.  However that leaves me
> a bit unsatisfied, because it doesn't protect against the case of
> erroring out of a set-returning function: if we come in and see
> eslist != NULL, we'll pick right back up attempting to execute
> plans that probably aren't there anymore.  I think that that case
> is unreachable today because we don't allow any opclass support
> functions to be SRFs, and AFAIK there are no other cases where an
> FmgrInfo would be re-used after a failed query.  Still, I'm inclined
> to go with something more like alternative-2, which feels a little
> more future-proof.

After closer inspection: Alexander's test case doesn't expose the full
scope of the problem.  If the function suffers an error at run-time
rather than early in setup, then we will fall out with fcache->eslist
being non-null as well as fcache->cplan.  Then on the next call,
init_sql_fcache will believe that it is resuming execution of a
set-returning function, and we'll merrily try to execute an executor
state tree that's not there anymore.

So we have to do something more like my alternative-2 than any of
the other proposals, and after reviewing the code I believe it
had better reset the tstore and shutdown_reg fields as well.
Hence, v3 attached, now with regression test case.

            regards, tom lane

diff --git a/src/backend/executor/functions.c b/src/backend/executor/functions.c
index 359aafea681..97455b1ed4a 100644
--- a/src/backend/executor/functions.c
+++ b/src/backend/executor/functions.c
@@ -143,6 +143,7 @@ typedef struct SQLFunctionCache
 {
     SQLFunctionHashEntry *func; /* associated SQLFunctionHashEntry */

+    bool        active;            /* are we executing this cache entry? */
     bool        lazyEvalOK;        /* true if lazyEval is safe */
     bool        shutdown_reg;    /* true if registered shutdown callback */
     bool        lazyEval;        /* true if using lazyEval for result query */
@@ -556,6 +557,28 @@ init_sql_fcache(FunctionCallInfo fcinfo, bool lazyEvalOK)
         finfo->fn_extra = fcache;
     }

+    /*
+     * If the SQLFunctionCache is marked as active, we must have errored out
+     * of a prior execution.  Reset state.  (It might seem that we could also
+     * reach this during recursive invocation of a SQL function, but we won't
+     * because that case won't involve re-use of the same FmgrInfo.)
+     */
+    if (fcache->active)
+    {
+        /*
+         * In general, this stanza should clear all the same fields that
+         * ShutdownSQLFunction would.  Note we must clear fcache->cplan
+         * without doing ReleaseCachedPlan, because error cleanup from the
+         * prior execution would have taken care of releasing that plan.
+         * Likewise, if tstore is still set then it is pointing at garbage.
+         */
+        fcache->cplan = NULL;
+        fcache->eslist = NULL;
+        fcache->tstore = NULL;
+        fcache->shutdown_reg = false;
+        fcache->active = false;
+    }
+
     /*
      * If we are resuming execution of a set-returning function, just keep
      * using the same cache.  We do not ask funccache.c to re-validate the
@@ -1597,6 +1620,9 @@ fmgr_sql(PG_FUNCTION_ARGS)
      */
     fcache = init_sql_fcache(fcinfo, lazyEvalOK);

+    /* Mark fcache as active */
+    fcache->active = true;
+
     /* Remember info that we might need later to construct tuplestore */
     fcache->tscontext = tscontext;
     fcache->randomAccess = randomAccess;
@@ -1853,6 +1879,9 @@ fmgr_sql(PG_FUNCTION_ARGS)
     if (es == NULL)
         fcache->eslist = NULL;

+    /* Mark fcache as inactive */
+    fcache->active = false;
+
     error_context_stack = sqlerrcontext.previous;

     return result;
diff --git a/src/test/regress/expected/create_function_sql.out b/src/test/regress/expected/create_function_sql.out
index 963b6f863ff..da112608d66 100644
--- a/src/test/regress/expected/create_function_sql.out
+++ b/src/test/regress/expected/create_function_sql.out
@@ -733,6 +733,22 @@ SELECT double_append(array_append(ARRAY[q1], q2), q3)
  {4,5,6,4,5,6}
 (2 rows)

+-- Check that we can re-use a SQLFunctionCache after a run-time error.
+-- This function will fail with zero-divide at run time (not plan time).
+CREATE FUNCTION part_hashint4_error(value int4, seed int8) RETURNS int8
+LANGUAGE SQL STRICT IMMUTABLE PARALLEL SAFE AS
+$$ SELECT value + seed + random()::int/0 $$;
+-- Put it into an operator class so that FmgrInfo will be cached in relcache.
+CREATE OPERATOR CLASS part_test_int4_ops_bad FOR TYPE int4 USING hash AS
+  FUNCTION 2 part_hashint4_error(int4, int8);
+CREATE TABLE pt(i int) PARTITION BY hash (i part_test_int4_ops_bad);
+CREATE TABLE p1 PARTITION OF pt FOR VALUES WITH (modulus 4, remainder 0);
+INSERT INTO pt VALUES (1);
+ERROR:  division by zero
+CONTEXT:  SQL function "part_hashint4_error" statement 1
+INSERT INTO pt VALUES (1);
+ERROR:  division by zero
+CONTEXT:  SQL function "part_hashint4_error" statement 1
 -- Things that shouldn't work:
 CREATE FUNCTION test1 (int) RETURNS int LANGUAGE SQL
     AS 'SELECT ''not an integer'';';
@@ -773,7 +789,7 @@ CONTEXT:  SQL function "test1" during startup
 RESET check_function_bodies;
 -- Cleanup
 DROP SCHEMA temp_func_test CASCADE;
-NOTICE:  drop cascades to 35 other objects
+NOTICE:  drop cascades to 38 other objects
 DETAIL:  drop cascades to function functest_a_1(text,date)
 drop cascades to function functest_a_2(text[])
 drop cascades to function functest_a_3()
@@ -808,6 +824,9 @@ drop cascades to function create_and_insert()
 drop cascades to table ddl_test
 drop cascades to function alter_and_insert()
 drop cascades to function double_append(anyarray,anyelement)
+drop cascades to function part_hashint4_error(integer,bigint)
+drop cascades to operator family part_test_int4_ops_bad for access method hash
+drop cascades to table pt
 drop cascades to function test1(anyelement)
 DROP USER regress_unpriv_user;
 RESET search_path;
diff --git a/src/test/regress/sql/create_function_sql.sql b/src/test/regress/sql/create_function_sql.sql
index 6d1c102d780..3d5f2a92093 100644
--- a/src/test/regress/sql/create_function_sql.sql
+++ b/src/test/regress/sql/create_function_sql.sql
@@ -432,6 +432,23 @@ $$ SELECT array_append($1, $2) || array_append($1, $2) $$;
 SELECT double_append(array_append(ARRAY[q1], q2), q3)
   FROM (VALUES(1,2,3), (4,5,6)) v(q1,q2,q3);

+-- Check that we can re-use a SQLFunctionCache after a run-time error.
+
+-- This function will fail with zero-divide at run time (not plan time).
+CREATE FUNCTION part_hashint4_error(value int4, seed int8) RETURNS int8
+LANGUAGE SQL STRICT IMMUTABLE PARALLEL SAFE AS
+$$ SELECT value + seed + random()::int/0 $$;
+
+-- Put it into an operator class so that FmgrInfo will be cached in relcache.
+CREATE OPERATOR CLASS part_test_int4_ops_bad FOR TYPE int4 USING hash AS
+  FUNCTION 2 part_hashint4_error(int4, int8);
+
+CREATE TABLE pt(i int) PARTITION BY hash (i part_test_int4_ops_bad);
+CREATE TABLE p1 PARTITION OF pt FOR VALUES WITH (modulus 4, remainder 0);
+
+INSERT INTO pt VALUES (1);
+INSERT INTO pt VALUES (1);
+
 -- Things that shouldn't work:

 CREATE FUNCTION test1 (int) RETURNS int LANGUAGE SQL

Re: BUG #19026: ResourceOwnerForget can't find owner for invalid plancache

От
Kirill Reshke
Дата:
On Wed, 20 Aug 2025 at 21:57, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> The reason this broke at 0313c5dc6 is that that enabled
> SQLFunctionCaches to be re-used for the life of the associated
> FmgrInfo, and when we are talking about an opclass support procedure,
> that FmgrInfo is in the relcache so it is likely to last for the
> life of the session.  So the presented test case causes us to error
> out of execution of the SQL function during the first INSERT, but
> its SQLFunctionCache still exists and has fcache->cplan != NULL,
> even though error cleanup would've released the reference count
> already.  When we come back to this point in the second INSERT,
> init_execution_state is fooled into trying to release the
> already-released cplan.

Thank you for this explanation. I understood how things got buggy and
how they should be fixed.
The v3 LGTM.

-- 
Best regards,
Kirill Reshke