Обсуждение: [PATCH] plpython function causes server panic

Поиск

Список

Период

Сортировка

[PATCH] plpython function causes server panic

От

Hao Zhang

Дата:

30 ноября 2023 г., 10:15:55

Hi hackers,

I found a problem when executing the plpython function:

After the plpython function returns an error, in the same session, if we continue to execute

plpython function, the server panic will be caused.

Reproduce

preparation

SET max_parallel_workers_per_gather=4;
SET parallel_setup_cost=1;
SET min_parallel_table_scan_size ='4kB';

CREATE TABLE t(i int);

INSERT INTO t SELECT generate_series(1, 10000);

CREATE EXTENSION plpython3u;
CREATE OR REPLACE FUNCTION test_func() RETURNS SETOF int AS
$$
plpy.execute("select pg_backend_pid()")

for i in range(0, 5):
yield (i)

$$ LANGUAGE plpython3u parallel safe;

execute the function twice in the same session

postgres=# SELECT test_func() from t where i>10 and i<100;
ERROR: error fetching next item from iterator
DETAIL: Exception: cannot start subtransactions during a parallel operation
CONTEXT: Traceback (most recent call last):
PL/Python function "test_func"

postgres=# SELECT test_func();
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
The connection to the server was lost. Attempting reset: Failed.

Analysis

There is an SPI call in test_func(): plpy.execute().
Then the server will start a subtransaction by PLy_spi_subtransaction_begin(); BUT! The Python processor does not know whether an error happened during PLy_spi_subtransaction_begin().
If there is an error that occurs in PLy_spi_subtransaction_begin(), the SPI call will be terminated but the python error indicator won't be set and the PyObject won't be free.
Then the next plpython UDF in the same session will fail due to the wrong Python environment.

Solution

Use try-catch to catch the error that occurs in PLy_spi_subtransaction_begin(), and set the python error indicator.

With Regards

Hao Zhang

Вложения

v1-0001-Fix-bug-plpython-function-causes-server-panic.patch

Re: [PATCH] plpython function causes server panic

От

Tom Lane

Дата:

02 декабря 2023 г., 04:04:15

Hao Zhang <zhrt1446384557@gmail.com> writes:
> I found a problem when executing the plpython function:
> After the plpython function returns an error, in the same session, if we
> continue to execute
> plpython function, the server panic will be caused.

Thanks for the report!  I see the problem is that we're not expecting
BeginInternalSubTransaction to fail.  However, I'm not sure I like
this solution, mainly because it's only covering a fraction of the
problem.  There are similarly unsafe usages in plperl, pltcl, and
very possibly a lot of third-party PLs.  I wonder if there's a way
to deal with this issue without changing these API assumptions.

The only readily-reachable error case in BeginInternalSubTransaction
is this specific one about IsInParallelMode, which was added later
than the original design and evidently with not a lot of thought or
testing.  The comment for it speculates about whether we could get
rid of it, so I wonder if our thoughts about this ought to go in that
direction.

In any case, if we do proceed along the lines of catching errors
from BeginInternalSubTransaction, I think your patch is a bit shy
of a load because it doesn't do all the same things that other callers
of PLy_spi_exception_set do.  Looking at that made me wonder why
the PLy_spi_exceptions lookup business was being duplicated by every
caller rather than being done once in PLy_spi_exception_set.  So
0001 attached is a quick refactoring patch to remove that code
duplication, and then 0002 is your patch adapted to that.

I also attempted to include a test case in 0002, but I'm not very
satisfied with that.  Your original test case seemed pretty expensive
for the amount of code coverage it adds, so I tried to make it happen
with debug_parallel_query instead.  That does exercise the new code,
but it does not exhibit the crash if run against unpatched code.
That's because with this test case the error is only thrown in worker
processes not the leader, so we don't end up with corrupted Python
state in the leader.  That result also points up that the original
test case isn't very reliable for this either: you have to have
parallel_leader_participation on, and you have to have the leader
process at least one row, which makes it pretty timing-sensitive.
On top of all that, the test would become useless if we do eventually
get rid of the !IsInParallelMode restriction.  So I'm kind of inclined
to not bother with a test case if this gets to be committed in this
form.

Thoughts anyone?

            regards, tom lane

diff --git a/src/pl/plpython/plpy_spi.c b/src/pl/plpython/plpy_spi.c
index ff87b27de0..64ca47f218 100644
--- a/src/pl/plpython/plpy_spi.c
+++ b/src/pl/plpython/plpy_spi.c
@@ -28,7 +28,7 @@
 static PyObject *PLy_spi_execute_query(char *query, long limit);
 static PyObject *PLy_spi_execute_fetch_result(SPITupleTable *tuptable,
                                               uint64 rows, int status);
-static void PLy_spi_exception_set(PyObject *excclass, ErrorData *edata);
+static void PLy_spi_exception_set(ErrorData *edata);


 /* prepare(query="select * from foo")
@@ -470,8 +470,6 @@ PLy_commit(PyObject *self, PyObject *args)
     PG_CATCH();
     {
         ErrorData  *edata;
-        PLyExceptionEntry *entry;
-        PyObject   *exc;

         /* Save error info */
         MemoryContextSwitchTo(oldcontext);
@@ -481,17 +479,8 @@ PLy_commit(PyObject *self, PyObject *args)
         /* was cleared at transaction end, reset pointer */
         exec_ctx->scratch_ctx = NULL;

-        /* Look up the correct exception */
-        entry = hash_search(PLy_spi_exceptions, &(edata->sqlerrcode),
-                            HASH_FIND, NULL);
-
-        /*
-         * This could be a custom error code, if that's the case fallback to
-         * SPIError
-         */
-        exc = entry ? entry->exc : PLy_exc_spi_error;
         /* Make Python raise the exception */
-        PLy_spi_exception_set(exc, edata);
+        PLy_spi_exception_set(edata);
         FreeErrorData(edata);

         return NULL;
@@ -517,8 +506,6 @@ PLy_rollback(PyObject *self, PyObject *args)
     PG_CATCH();
     {
         ErrorData  *edata;
-        PLyExceptionEntry *entry;
-        PyObject   *exc;

         /* Save error info */
         MemoryContextSwitchTo(oldcontext);
@@ -528,17 +515,8 @@ PLy_rollback(PyObject *self, PyObject *args)
         /* was cleared at transaction end, reset pointer */
         exec_ctx->scratch_ctx = NULL;

-        /* Look up the correct exception */
-        entry = hash_search(PLy_spi_exceptions, &(edata->sqlerrcode),
-                            HASH_FIND, NULL);
-
-        /*
-         * This could be a custom error code, if that's the case fallback to
-         * SPIError
-         */
-        exc = entry ? entry->exc : PLy_exc_spi_error;
         /* Make Python raise the exception */
-        PLy_spi_exception_set(exc, edata);
+        PLy_spi_exception_set(edata);
         FreeErrorData(edata);

         return NULL;
@@ -594,8 +572,6 @@ void
 PLy_spi_subtransaction_abort(MemoryContext oldcontext, ResourceOwner oldowner)
 {
     ErrorData  *edata;
-    PLyExceptionEntry *entry;
-    PyObject   *exc;

     /* Save error info */
     MemoryContextSwitchTo(oldcontext);
@@ -607,17 +583,8 @@ PLy_spi_subtransaction_abort(MemoryContext oldcontext, ResourceOwner oldowner)
     MemoryContextSwitchTo(oldcontext);
     CurrentResourceOwner = oldowner;

-    /* Look up the correct exception */
-    entry = hash_search(PLy_spi_exceptions, &(edata->sqlerrcode),
-                        HASH_FIND, NULL);
-
-    /*
-     * This could be a custom error code, if that's the case fallback to
-     * SPIError
-     */
-    exc = entry ? entry->exc : PLy_exc_spi_error;
     /* Make Python raise the exception */
-    PLy_spi_exception_set(exc, edata);
+    PLy_spi_exception_set(edata);
     FreeErrorData(edata);
 }

@@ -626,12 +593,21 @@ PLy_spi_subtransaction_abort(MemoryContext oldcontext, ResourceOwner oldowner)
  * internal query and error position.
  */
 static void
-PLy_spi_exception_set(PyObject *excclass, ErrorData *edata)
+PLy_spi_exception_set(ErrorData *edata)
 {
+    PLyExceptionEntry *entry;
+    PyObject   *excclass;
     PyObject   *args = NULL;
     PyObject   *spierror = NULL;
     PyObject   *spidata = NULL;

+    /* Look up the correct exception */
+    entry = hash_search(PLy_spi_exceptions, &(edata->sqlerrcode),
+                        HASH_FIND, NULL);
+
+    /* Fall back to SPIError if no entry */
+    excclass = entry ? entry->exc : PLy_exc_spi_error;
+
     args = Py_BuildValue("(s)", edata->message);
     if (!args)
         goto failure;
diff --git a/src/pl/plpython/expected/plpython_spi.out b/src/pl/plpython/expected/plpython_spi.out
index 8853e2540d..26e1cc3c9a 100644
--- a/src/pl/plpython/expected/plpython_spi.out
+++ b/src/pl/plpython/expected/plpython_spi.out
@@ -464,3 +464,29 @@ SELECT plan_composite_args();
  (3,label)
 (1 row)

+--
+-- Error cases
+--
+-- Check recovery from a subtransaction start failure; to make one
+-- happen repeatably, try to do a SPI operation in an allegedly
+-- parallel-safe function
+CREATE FUNCTION falsely_marked_parallel_safe() RETURNS SETOF int AS
+$$
+plpy.execute("select 2+2")
+for i in range(0, 5):
+    yield (i)
+$$ LANGUAGE plpython3u parallel safe;
+SET debug_parallel_query = 1;
+SELECT falsely_marked_parallel_safe();
+ERROR:  error fetching next item from iterator
+DETAIL:  spiexceptions.InvalidTransactionState: cannot start subtransactions during a parallel operation
+CONTEXT:  Traceback (most recent call last):
+PL/Python function "falsely_marked_parallel_safe"
+parallel worker
+SELECT falsely_marked_parallel_safe();
+ERROR:  error fetching next item from iterator
+DETAIL:  spiexceptions.InvalidTransactionState: cannot start subtransactions during a parallel operation
+CONTEXT:  Traceback (most recent call last):
+PL/Python function "falsely_marked_parallel_safe"
+parallel worker
+RESET debug_parallel_query;
diff --git a/src/pl/plpython/plpy_cursorobject.c b/src/pl/plpython/plpy_cursorobject.c
index 57e8f8ec21..8bfe05815b 100644
--- a/src/pl/plpython/plpy_cursorobject.c
+++ b/src/pl/plpython/plpy_cursorobject.c
@@ -98,7 +98,8 @@ PLy_cursor_query(const char *query)
     oldcontext = CurrentMemoryContext;
     oldowner = CurrentResourceOwner;

-    PLy_spi_subtransaction_begin(oldcontext, oldowner);
+    if (!PLy_spi_subtransaction_begin(oldcontext, oldowner))
+        return NULL;

     PG_TRY();
     {
@@ -196,7 +197,8 @@ PLy_cursor_plan(PyObject *ob, PyObject *args)
     oldcontext = CurrentMemoryContext;
     oldowner = CurrentResourceOwner;

-    PLy_spi_subtransaction_begin(oldcontext, oldowner);
+    if (!PLy_spi_subtransaction_begin(oldcontext, oldowner))
+        return NULL;

     PG_TRY();
     {
@@ -333,7 +335,8 @@ PLy_cursor_iternext(PyObject *self)
     oldcontext = CurrentMemoryContext;
     oldowner = CurrentResourceOwner;

-    PLy_spi_subtransaction_begin(oldcontext, oldowner);
+    if (!PLy_spi_subtransaction_begin(oldcontext, oldowner))
+        return NULL;

     PG_TRY();
     {
@@ -403,7 +406,8 @@ PLy_cursor_fetch(PyObject *self, PyObject *args)
     oldcontext = CurrentMemoryContext;
     oldowner = CurrentResourceOwner;

-    PLy_spi_subtransaction_begin(oldcontext, oldowner);
+    if (!PLy_spi_subtransaction_begin(oldcontext, oldowner))
+        return NULL;

     PG_TRY();
     {
diff --git a/src/pl/plpython/plpy_spi.c b/src/pl/plpython/plpy_spi.c
index 64ca47f218..ca97f2eca6 100644
--- a/src/pl/plpython/plpy_spi.c
+++ b/src/pl/plpython/plpy_spi.c
@@ -77,7 +77,8 @@ PLy_spi_prepare(PyObject *self, PyObject *args)
     oldcontext = CurrentMemoryContext;
     oldowner = CurrentResourceOwner;

-    PLy_spi_subtransaction_begin(oldcontext, oldowner);
+    if (!PLy_spi_subtransaction_begin(oldcontext, oldowner))
+        return NULL;

     PG_TRY();
     {
@@ -217,7 +218,8 @@ PLy_spi_execute_plan(PyObject *ob, PyObject *list, long limit)
     oldcontext = CurrentMemoryContext;
     oldowner = CurrentResourceOwner;

-    PLy_spi_subtransaction_begin(oldcontext, oldowner);
+    if (!PLy_spi_subtransaction_begin(oldcontext, oldowner))
+        return NULL;

     PG_TRY();
     {
@@ -313,7 +315,8 @@ PLy_spi_execute_query(char *query, long limit)
     oldcontext = CurrentMemoryContext;
     oldowner = CurrentResourceOwner;

-    PLy_spi_subtransaction_begin(oldcontext, oldowner);
+    if (!PLy_spi_subtransaction_begin(oldcontext, oldowner))
+        return NULL;

     PG_TRY();
     {
@@ -534,7 +537,9 @@ PLy_rollback(PyObject *self, PyObject *args)
  *    MemoryContext oldcontext = CurrentMemoryContext;
  *    ResourceOwner oldowner = CurrentResourceOwner;
  *
- *    PLy_spi_subtransaction_begin(oldcontext, oldowner);
+ *    if (!PLy_spi_subtransaction_begin(oldcontext, oldowner))
+ *        return NULL;
+ *
  *    PG_TRY();
  *    {
  *        <call SPI functions>
@@ -551,12 +556,38 @@ PLy_rollback(PyObject *self, PyObject *args)
  * These utilities take care of restoring connection to the SPI manager and
  * setting a Python exception in case of an abort.
  */
-void
+bool
 PLy_spi_subtransaction_begin(MemoryContext oldcontext, ResourceOwner oldowner)
 {
-    BeginInternalSubTransaction(NULL);
-    /* Want to run inside function's memory context */
-    MemoryContextSwitchTo(oldcontext);
+    PG_TRY();
+    {
+        /* Start subtransaction (could fail) */
+        BeginInternalSubTransaction(NULL);
+        /* Want to run inside function's memory context */
+        MemoryContextSwitchTo(oldcontext);
+    }
+    PG_CATCH();
+    {
+        ErrorData  *edata;
+
+        /* Save error info */
+        MemoryContextSwitchTo(oldcontext);
+        edata = CopyErrorData();
+        FlushErrorState();
+
+        /* Ensure we restore original context and owner */
+        MemoryContextSwitchTo(oldcontext);
+        CurrentResourceOwner = oldowner;
+
+        /* Make Python raise the exception */
+        PLy_spi_exception_set(edata);
+        FreeErrorData(edata);
+
+        return false;
+    }
+    PG_END_TRY();
+
+    return true;
 }

 void
@@ -580,6 +611,8 @@ PLy_spi_subtransaction_abort(MemoryContext oldcontext, ResourceOwner oldowner)

     /* Abort the inner transaction */
     RollbackAndReleaseCurrentSubTransaction();
+
+    /* Ensure we restore original context and owner */
     MemoryContextSwitchTo(oldcontext);
     CurrentResourceOwner = oldowner;

diff --git a/src/pl/plpython/plpy_spi.h b/src/pl/plpython/plpy_spi.h
index 98ccd21093..a43d803928 100644
--- a/src/pl/plpython/plpy_spi.h
+++ b/src/pl/plpython/plpy_spi.h
@@ -22,7 +22,7 @@ typedef struct PLyExceptionEntry
 } PLyExceptionEntry;

 /* handling of SPI operations inside subtransactions */
-extern void PLy_spi_subtransaction_begin(MemoryContext oldcontext, ResourceOwner oldowner);
+extern bool PLy_spi_subtransaction_begin(MemoryContext oldcontext, ResourceOwner oldowner);
 extern void PLy_spi_subtransaction_commit(MemoryContext oldcontext, ResourceOwner oldowner);
 extern void PLy_spi_subtransaction_abort(MemoryContext oldcontext, ResourceOwner oldowner);

diff --git a/src/pl/plpython/sql/plpython_spi.sql b/src/pl/plpython/sql/plpython_spi.sql
index fcd113acaa..69a448cd68 100644
--- a/src/pl/plpython/sql/plpython_spi.sql
+++ b/src/pl/plpython/sql/plpython_spi.sql
@@ -320,3 +320,22 @@ SELECT cursor_fetch_next_empty();
 SELECT cursor_plan();
 SELECT cursor_plan_wrong_args();
 SELECT plan_composite_args();
+
+--
+-- Error cases
+--
+
+-- Check recovery from a subtransaction start failure; to make one
+-- happen repeatably, try to do a SPI operation in an allegedly
+-- parallel-safe function
+CREATE FUNCTION falsely_marked_parallel_safe() RETURNS SETOF int AS
+$$
+plpy.execute("select 2+2")
+for i in range(0, 5):
+    yield (i)
+$$ LANGUAGE plpython3u parallel safe;
+
+SET debug_parallel_query = 1;
+SELECT falsely_marked_parallel_safe();
+SELECT falsely_marked_parallel_safe();
+RESET debug_parallel_query;

Re: [PATCH] plpython function causes server panic

От

Andres Freund

Дата:

02 декабря 2023 г., 04:30:07

Hi,

On 2023-12-01 20:04:15 -0500, Tom Lane wrote:
> Hao Zhang <zhrt1446384557@gmail.com> writes:
> > I found a problem when executing the plpython function:
> > After the plpython function returns an error, in the same session, if we
> > continue to execute
> > plpython function, the server panic will be caused.
> 
> Thanks for the report!  I see the problem is that we're not expecting
> BeginInternalSubTransaction to fail.  However, I'm not sure I like
> this solution, mainly because it's only covering a fraction of the
> problem.  There are similarly unsafe usages in plperl, pltcl, and
> very possibly a lot of third-party PLs.  I wonder if there's a way
> to deal with this issue without changing these API assumptions.

There are plenty other uses, but it's not clear to me that they are similarly
affected by BeginInternalSubTransaction raising an error? It e.g. doesn't
immediately look like plperl's usage would be affected in a similar way?

Greetings,

Andres Freund

Re: [PATCH] plpython function causes server panic

От

Tom Lane

Дата:

02 декабря 2023 г., 04:46:26

Andres Freund <andres@anarazel.de> writes:
> On 2023-12-01 20:04:15 -0500, Tom Lane wrote:
>> Thanks for the report!  I see the problem is that we're not expecting
>> BeginInternalSubTransaction to fail.  However, I'm not sure I like
>> this solution, mainly because it's only covering a fraction of the
>> problem.  There are similarly unsafe usages in plperl, pltcl, and
>> very possibly a lot of third-party PLs.  I wonder if there's a way
>> to deal with this issue without changing these API assumptions.

> There are plenty other uses, but it's not clear to me that they are similarly
> affected by BeginInternalSubTransaction raising an error? It e.g. doesn't
> immediately look like plperl's usage would be affected in a similar way?

Why not?  We'd be longjmp'ing out from inside the Perl interpreter.
Maybe Perl is so robust it doesn't care, but I'd be surprised if this
can't break it.  The same for Tcl.

I think that plpgsql indeed doesn't care because it has no non-PG
interpreter state to worry about.  But it's in the minority I fear.

            regards, tom lane

Re: [PATCH] plpython function causes server panic

От

Tom Lane

Дата:

02 декабря 2023 г., 04:51:11

I wrote:
> The only readily-reachable error case in BeginInternalSubTransaction
> is this specific one about IsInParallelMode, which was added later
> than the original design and evidently with not a lot of thought or
> testing.  The comment for it speculates about whether we could get
> rid of it, so I wonder if our thoughts about this ought to go in that
> direction.

After thinking a bit more I wonder why we need that error check at all.
Why isn't it sufficient to rely on GetNewTransactionId()'s check that
throws an error if a parallelized subtransaction tries to obtain an XID?
I don't see why we'd need to "synchronize transaction state" about
anything that never acquires an XID.

            regards, tom lane

Re: [PATCH] plpython function causes server panic

От

Hao Zhang

Дата:

04 декабря 2023 г., 12:21:29

Thanks for your reply. These patches look good to me!

> The only readily-reachable error case in BeginInternalSubTransaction

> is this specific one about IsInParallelMode, which was added later
> than the original design and evidently with not a lot of thought or
> testing. The comment for it speculates about whether we could get
> rid of it, so I wonder if our thoughts about this ought to go in that
> direction.

IMHO, there are other error reports in the function BeginInternalSubTransaction(), like

```

ereport(ERROR,
(errcode(ERRCODE_OUT_OF_MEMORY),
errmsg("out of memory"),
errdetail("Failed on request of size %zu in memory context \"%s\".",
size, context->name)));

```

we cannot avoid this crash by just getting rid of IsInParallelMode().

And in my test, the server won't crash in the plperl test.

With regards,

Hao Zhang

Tom Lane <tgl@sss.pgh.pa.us> 于2023年12月2日周六 09:51写道：

I wrote:
> The only readily-reachable error case in BeginInternalSubTransaction
> is this specific one about IsInParallelMode, which was added later
> than the original design and evidently with not a lot of thought or
> testing. The comment for it speculates about whether we could get
> rid of it, so I wonder if our thoughts about this ought to go in that
> direction.

After thinking a bit more I wonder why we need that error check at all.
Why isn't it sufficient to rely on GetNewTransactionId()'s check that
throws an error if a parallelized subtransaction tries to obtain an XID?
I don't see why we'd need to "synchronize transaction state" about
anything that never acquires an XID.

regards, tom lane

Re: [PATCH] plpython function causes server panic

От

Tom Lane

Дата:

06 декабря 2023 г., 00:56:33

Hao Zhang <zhrt1446384557@gmail.com> writes:
>> The only readily-reachable error case in BeginInternalSubTransaction
>> is this specific one about IsInParallelMode, which was added later
>> than the original design and evidently with not a lot of thought or
>> testing.  The comment for it speculates about whether we could get
>> rid of it, so I wonder if our thoughts about this ought to go in that
>> direction.

> IMHO, there are other error reports in the function
> BeginInternalSubTransaction(), like

Sure, but all the other ones are extremely hard to hit, which is why
we didn't bother to worry about them to begin with.  If we want to
make this more formally bulletproof, my inclination would be to
(a) get rid of the IsInParallelMode restriction and then (b) turn
the function into a critical section, so that any other error gets
treated as a PANIC.  Maybe at some point we'd be willing to make a
variant of BeginInternalSubTransaction that has a different API and
can manage such cases without a PANIC, but that seems far down the
road to me, and certainly not something to be back-patched.

The main reason for my caution here is that, by catching an error
and allowing Python (or Perl, or something else) code to decide
what to do next, we are very dependent on that code doing the right
thing.  This is already a bit of a leap of faith for run-of-the-mill
errors.  For errors in transaction startup or shutdown, I think it's
a bigger leap than I care to make.  We're pretty well hosed if we
can't make the transaction machinery work, so imagining that we can
clean up after such an error and march merrily onwards seems mighty
optimistic.

            regards, tom lane

Re: [PATCH] plpython function causes server panic

От

Tom Lane

Дата:

29 декабря 2023 г., 20:55:52

I wrote:
> Hao Zhang <zhrt1446384557@gmail.com> writes:
>> IMHO, there are other error reports in the function
>> BeginInternalSubTransaction(), like

> Sure, but all the other ones are extremely hard to hit, which is why
> we didn't bother to worry about them to begin with.  If we want to
> make this more formally bulletproof, my inclination would be to
> (a) get rid of the IsInParallelMode restriction and then (b) turn
> the function into a critical section, so that any other error gets
> treated as a PANIC.

Here's a draft patch along this line.  Basically the idea is that
subtransactions used for error control are now legal in parallel
mode (including in parallel workers) so long as they don't try to
acquire their own XIDs.  I had to clean up some error handling
in xact.c, but really this is a pretty simple patch.

Rather than a true critical section (ie PANIC on failure), it seemed
to me to be enough to force FATAL exit if BeginInternalSubTransaction
fails.  Given the likelihood that our transaction state is messed up
if we get a failure partway through, it's not clear to me that we
could do much better than that even if we were willing to make an API
change for BeginInternalSubTransaction.

I haven't thought hard about what new test cases we might want to
add for this.  It gets through check-world as is, meaning that
nobody has made any test cases exercising the previous restrictions
either.  There might be more documentation work to be done, too.

            regards, tom lane

diff --git a/doc/src/sgml/parallel.sgml b/doc/src/sgml/parallel.sgml
index 5acc9537d6..7237be40bd 100644
--- a/doc/src/sgml/parallel.sgml
+++ b/doc/src/sgml/parallel.sgml
@@ -546,9 +546,8 @@ EXPLAIN SELECT * FROM pgbench_accounts WHERE filler LIKE '%x%';

   <para>
     Functions and aggregates must be marked <literal>PARALLEL UNSAFE</literal> if
-    they write to the database, access sequences, change the transaction state
-    even temporarily (e.g., a PL/pgSQL function that establishes an
-    <literal>EXCEPTION</literal> block to catch errors), or make persistent changes to
+    they write to the database, access sequences, change the transaction state,
+    or make persistent changes to
     settings.  Similarly, functions must be marked <literal>PARALLEL
     RESTRICTED</literal> if they access temporary tables, client connection state,
     cursors, prepared statements, or miscellaneous backend-local state that
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 8442c5e6a7..2aa218638c 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -638,7 +638,9 @@ AssignTransactionId(TransactionState s)
      * operation, so we can't account for new XIDs at this point.
      */
     if (IsInParallelMode() || IsParallelWorker())
-        elog(ERROR, "cannot assign XIDs during a parallel operation");
+        ereport(ERROR,
+                (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+                 errmsg("cannot assign XIDs during a parallel operation")));

     /*
      * Ensure parent(s) have XIDs, so that a child always has an XID later
@@ -826,7 +828,11 @@ GetCurrentCommandId(bool used)
          * could relax this restriction when currentCommandIdUsed was already
          * true at the start of the parallel operation.
          */
-        Assert(!IsParallelWorker());
+        if (IsParallelWorker())
+            ereport(ERROR,
+                    (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+                     errmsg("cannot modify data in a parallel worker")));
+
         currentCommandIdUsed = true;
     }
     return currentCommandId;
@@ -1091,7 +1097,9 @@ CommandCounterIncrement(void)
          * point.
          */
         if (IsInParallelMode() || IsParallelWorker())
-            elog(ERROR, "cannot start commands during a parallel operation");
+            ereport(ERROR,
+                    (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+                     errmsg("cannot start commands during a parallel operation")));

         currentCommandId += 1;
         if (currentCommandId == InvalidCommandId)
@@ -4227,7 +4235,7 @@ DefineSavepoint(const char *name)
      * is TBLOCK_PARALLEL_INPROGRESS, so we can treat that as an invalid case
      * below.)
      */
-    if (IsInParallelMode())
+    if (IsInParallelMode() || IsParallelWorker())
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
                  errmsg("cannot define savepoints during a parallel operation")));
@@ -4314,7 +4322,7 @@ ReleaseSavepoint(const char *name)
      * is TBLOCK_PARALLEL_INPROGRESS, so we can treat that as an invalid case
      * below.)
      */
-    if (IsInParallelMode())
+    if (IsInParallelMode() || IsParallelWorker())
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
                  errmsg("cannot release savepoints during a parallel operation")));
@@ -4423,7 +4431,7 @@ RollbackToSavepoint(const char *name)
      * is TBLOCK_PARALLEL_INPROGRESS, so we can treat that as an invalid case
      * below.)
      */
-    if (IsInParallelMode())
+    if (IsInParallelMode() || IsParallelWorker())
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
                  errmsg("cannot rollback to savepoints during a parallel operation")));
@@ -4529,38 +4537,39 @@ RollbackToSavepoint(const char *name)
 /*
  * BeginInternalSubTransaction
  *        This is the same as DefineSavepoint except it allows TBLOCK_STARTED,
- *        TBLOCK_IMPLICIT_INPROGRESS, TBLOCK_END, and TBLOCK_PREPARE states,
- *        and therefore it can safely be used in functions that might be called
- *        when not inside a BEGIN block or when running deferred triggers at
- *        COMMIT/PREPARE time.  Also, it automatically does
- *        CommitTransactionCommand/StartTransactionCommand instead of expecting
- *        the caller to do it.
+ *        TBLOCK_IMPLICIT_INPROGRESS, TBLOCK_PARALLEL_INPROGRESS, TBLOCK_END,
+ *        and TBLOCK_PREPARE states, and therefore it can safely be used in
+ *        functions that might be called when not inside a BEGIN block or when
+ *        running deferred triggers at COMMIT/PREPARE time.  Also, it
+ *        automatically does CommitTransactionCommand/StartTransactionCommand
+ *        instead of expecting the caller to do it.
  */
 void
 BeginInternalSubTransaction(const char *name)
 {
     TransactionState s = CurrentTransactionState;
+    bool        save_ExitOnAnyError = ExitOnAnyError;

     /*
-     * Workers synchronize transaction state at the beginning of each parallel
-     * operation, so we can't account for new subtransactions after that
-     * point. We might be able to make an exception for the type of
-     * subtransaction established by this function, which is typically used in
-     * contexts where we're going to release or roll back the subtransaction
-     * before proceeding further, so that no enduring change to the
-     * transaction state occurs. For now, however, we prohibit this case along
-     * with all the others.
+     * Errors within this function are improbable, but if one does happen we
+     * force a FATAL exit.  Callers generally aren't prepared to handle losing
+     * control, and moreover our transaction state is probably corrupted if we
+     * fail partway through; so an ordinary ERROR longjmp isn't okay.
+     */
+    ExitOnAnyError = true;
+
+    /*
+     * We allow "internal" subtransactions inside parallel workers as long as
+     * they don't require their own XIDs.  So, no need to check here;
+     * enforcement occurs in AssignTransactionId().
      */
-    if (IsInParallelMode())
-        ereport(ERROR,
-                (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
-                 errmsg("cannot start subtransactions during a parallel operation")));

     switch (s->blockState)
     {
         case TBLOCK_STARTED:
         case TBLOCK_INPROGRESS:
         case TBLOCK_IMPLICIT_INPROGRESS:
+        case TBLOCK_PARALLEL_INPROGRESS:
         case TBLOCK_END:
         case TBLOCK_PREPARE:
         case TBLOCK_SUBINPROGRESS:
@@ -4579,7 +4588,6 @@ BeginInternalSubTransaction(const char *name)
             /* These cases are invalid. */
         case TBLOCK_DEFAULT:
         case TBLOCK_BEGIN:
-        case TBLOCK_PARALLEL_INPROGRESS:
         case TBLOCK_SUBBEGIN:
         case TBLOCK_SUBRELEASE:
         case TBLOCK_SUBCOMMIT:
@@ -4598,6 +4606,8 @@ BeginInternalSubTransaction(const char *name)

     CommitTransactionCommand();
     StartTransactionCommand();
+
+    ExitOnAnyError = save_ExitOnAnyError;
 }

 /*
@@ -4613,16 +4623,10 @@ ReleaseCurrentSubTransaction(void)
     TransactionState s = CurrentTransactionState;

     /*
-     * Workers synchronize transaction state at the beginning of each parallel
-     * operation, so we can't account for commit of subtransactions after that
-     * point.  This should not happen anyway.  Code calling this would
-     * typically have called BeginInternalSubTransaction() first, failing
-     * there.
+     * We do not check for parallel mode here.  If the subtransaction tried to
+     * do anything that's forbidden in parallel mode, we'd have errored out
+     * already.
      */
-    if (IsInParallelMode())
-        ereport(ERROR,
-                (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
-                 errmsg("cannot commit subtransactions during a parallel operation")));

     if (s->blockState != TBLOCK_SUBINPROGRESS)
         elog(ERROR, "ReleaseCurrentSubTransaction: unexpected state %s",
@@ -4647,11 +4651,9 @@ RollbackAndReleaseCurrentSubTransaction(void)
     TransactionState s = CurrentTransactionState;

     /*
-     * Unlike ReleaseCurrentSubTransaction(), this is nominally permitted
-     * during parallel operations.  That's because we may be in the leader,
-     * recovering from an error thrown while we were in parallel mode.  We
-     * won't reach here in a worker, because BeginInternalSubTransaction()
-     * will have failed.
+     * We do not check for parallel mode here.  If the subtransaction tried to
+     * do anything that's forbidden in parallel mode, we'd have errored out
+     * already.
      */

     switch (s->blockState)
@@ -4698,6 +4700,7 @@ RollbackAndReleaseCurrentSubTransaction(void)
     Assert(s->blockState == TBLOCK_SUBINPROGRESS ||
            s->blockState == TBLOCK_INPROGRESS ||
            s->blockState == TBLOCK_IMPLICIT_INPROGRESS ||
+           s->blockState == TBLOCK_PARALLEL_INPROGRESS ||
            s->blockState == TBLOCK_STARTED);
 }

Re: [PATCH] plpython function causes server panic

От

Robert Haas

Дата:

22 марта, 18:39:43

On Fri, Dec 29, 2023 at 12:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Here's a draft patch along this line.  Basically the idea is that
> subtransactions used for error control are now legal in parallel
> mode (including in parallel workers) so long as they don't try to
> acquire their own XIDs.  I had to clean up some error handling
> in xact.c, but really this is a pretty simple patch.

I agree with the general direction. A few comments:

- Isn't it redundant to test if IsInParallelMode() ||
IsParallelWorker()? We can't be in a parallel worker without also
being in parallel mode, except during the worker startup sequence.

- I don't think the documentation changes are entirely accurate. The
whole point of the patch is to allow parallel workers to make changes
to the transaction state, but the documentation says you can't. Maybe
we should just delete "change the transaction state" entirely from the
list of things that you're not allowed to do, since "write to the
database" is already listed separately; or maybe we should replace it
with something like "assign new transaction IDs or command IDs,"
although that's kind of low-level. I don't think we should just delete
the "even temporarily" bit, as you've done.

- While I like the new comments in BeginInternalSubTransaction(), I
think the changes in ReleaseCurrentSubTransaction() and
RollbackAndReleaseCurrentSubTransaction() need more thought. For one
thing, it's got to be wildly optimistic to claim that we would have
caught *anything* that's forbidden in parallel mode; that would
require solving the halting problem. I'd rather have no comment at all
here than one making such an ambitious claim, and I think that might
be a fine way to go. But if we do have a comment, I think it should be
more narrowly focused e.g. "We do not check for parallel mode here.
It's permissible to start and end subtransactions while in parallel
mode, as long as no new XIDs or command IDs are assigned." One
additional thing that might (or might not) be worth mentioning or
checking for here is that the leader shouldn't try to reduce the
height of the transaction state stack to anything less than what it
was when the parallel operation started; if it wants to do that, it
needs to clean up the parallel workers and exit parallel mode first.
Similarly a worker shouldn't ever end the toplevel transaction except
during backend cleanup.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: [PATCH] plpython function causes server panic

От

Tom Lane

Дата:

22 марта, 18:51:00

Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Dec 29, 2023 at 12:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Here's a draft patch along this line.  Basically the idea is that
>> subtransactions used for error control are now legal in parallel
>> mode (including in parallel workers) so long as they don't try to
>> acquire their own XIDs.  I had to clean up some error handling
>> in xact.c, but really this is a pretty simple patch.

> I agree with the general direction. A few comments:

Thanks for looking at this!  I was hoping you'd review it, because
I thought there was a pretty significant chance that I'd missed some
fundamental reason it couldn't work.  I feel better now about it
being worth pursuing.

I consider the patch draft quality at this point: I didn't spend
much effort on docs or comments, and none on test cases.  I'll
work on those issues and come back with a v2.

            regards, tom lane

Re: [PATCH] plpython function causes server panic

От

Tom Lane

Дата:

22 марта, 20:52:53

Robert Haas <robertmhaas@gmail.com> writes:
> I agree with the general direction. A few comments:

> - Isn't it redundant to test if IsInParallelMode() ||
> IsParallelWorker()? We can't be in a parallel worker without also
> being in parallel mode, except during the worker startup sequence.

Hmm.  The existing code in AssignTransactionId and
CommandCounterIncrement tests both, so I figured that the conservative
course was to make DefineSavepoint and friends test both.  Are you
saying AssignTransactionId and CommandCounterIncrement are wrong?
If you're saying you don't believe that these routines are reachable
during parallel worker start, that could be true, but I'm not sure
I want to make that assumption.  In any case, surely the xxxSavepoint
routines are not hot enough to make it an interesting
micro-optimization.  (Perhaps it is worthwhile in AssignTransactionId
and CCI, but changing those seems like a job for another patch.)

            regards, tom lane

Re: [PATCH] plpython function causes server panic

От

Robert Haas

Дата:

22 марта, 21:02:40

On Fri, Mar 22, 2024 at 1:52 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > I agree with the general direction. A few comments:
>
> > - Isn't it redundant to test if IsInParallelMode() ||
> > IsParallelWorker()? We can't be in a parallel worker without also
> > being in parallel mode, except during the worker startup sequence.
>
> Hmm.  The existing code in AssignTransactionId and
> CommandCounterIncrement tests both, so I figured that the conservative
> course was to make DefineSavepoint and friends test both.  Are you
> saying AssignTransactionId and CommandCounterIncrement are wrong?
> If you're saying you don't believe that these routines are reachable
> during parallel worker start, that could be true, but I'm not sure
> I want to make that assumption.  In any case, surely the xxxSavepoint
> routines are not hot enough to make it an interesting
> micro-optimization.  (Perhaps it is worthwhile in AssignTransactionId
> and CCI, but changing those seems like a job for another patch.)

Yeah, that's all fair enough. I went back and looked at the history of
this and found 94b4f7e2a635c3027a23b07086f740615b56aa64.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: [PATCH] plpython function causes server panic

От

Tom Lane

Дата:

22 марта, 23:37:48

Robert Haas <robertmhaas@gmail.com> writes:
> - I don't think the documentation changes are entirely accurate. The
> whole point of the patch is to allow parallel workers to make changes
> to the transaction state, but the documentation says you can't. Maybe
> we should just delete "change the transaction state" entirely from the
> list of things that you're not allowed to do, since "write to the
> database" is already listed separately; or maybe we should replace it
> with something like "assign new transaction IDs or command IDs,"
> although that's kind of low-level. I don't think we should just delete
> the "even temporarily" bit, as you've done.

Fair enough.  In the attached v2, I wrote "change the transaction
state (other than by using a subtransaction for error recovery)";
what do you think of that?

I dug around in the docs and couldn't really find anything about
parallel-query transaction limitations other than this bit in
parallel.sgml and the more or less copy-pasted text in
create_function.sgml; did you have any other spots in mind?
(I did find the commentary in README.parallel, but that's not
exactly user-facing.)

> - While I like the new comments in BeginInternalSubTransaction(), I
> think the changes in ReleaseCurrentSubTransaction() and
> RollbackAndReleaseCurrentSubTransaction() need more thought.

Yah.  After studying the code a bit more, I realized that what
I'd done would cause IsInParallelMode() to start returning false
during a subtransaction within parallel mode, which is surely not
what we want.  That state has to be heritable into subtransactions
in some fashion.  The attached keeps the current semantics of
parallelModeLevel and adds a bool parallelChildXact field that is
true if any outer transaction level has nonzero parallelModeLevel.
That's possibly more general than we need today, but it seems like
a reasonably clean definition.

> One additional thing that might (or might not) be worth mentioning or
> checking for here is that the leader shouldn't try to reduce the
> height of the transaction state stack to anything less than what it
> was when the parallel operation started; if it wants to do that, it
> needs to clean up the parallel workers and exit parallel mode first.
> Similarly a worker shouldn't ever end the toplevel transaction except
> during backend cleanup.

I think these things are already dealt with.  However, one thing
worth questioning is that CommitSubTransaction() will just silently
kill any workers started during the current subxact, and likewise
CommitTransaction() zaps workers without complaint.  Shouldn't these
instead throw an error about how you didn't close parallel mode,
and then the corresponding Abort function does the cleanup?
I did not change that behavior here, but it seems dubious.

v2 attached works a bit harder on the comments and adds a simplistic
test case.  I feel that I don't want to incorporate the plpython
crash that started this thread, as it's weird and dependent on
Python code outside our control (though I have checked that we
don't crash on that anymore).

            regards, tom lane

diff --git a/doc/src/sgml/parallel.sgml b/doc/src/sgml/parallel.sgml
index 5acc9537d6..dae9dd7f2f 100644
--- a/doc/src/sgml/parallel.sgml
+++ b/doc/src/sgml/parallel.sgml
@@ -545,10 +545,10 @@ EXPLAIN SELECT * FROM pgbench_accounts WHERE filler LIKE '%x%';
   </para>

   <para>
-    Functions and aggregates must be marked <literal>PARALLEL UNSAFE</literal> if
-    they write to the database, access sequences, change the transaction state
-    even temporarily (e.g., a PL/pgSQL function that establishes an
-    <literal>EXCEPTION</literal> block to catch errors), or make persistent changes to
+    Functions and aggregates must be marked <literal>PARALLEL UNSAFE</literal>
+    if they write to the database, change the transaction state (other than by
+    using a subtransaction for error recovery), access sequences, or make
+    persistent changes to
     settings.  Similarly, functions must be marked <literal>PARALLEL
     RESTRICTED</literal> if they access temporary tables, client connection state,
     cursors, prepared statements, or miscellaneous backend-local state that
diff --git a/doc/src/sgml/ref/create_function.sgml b/doc/src/sgml/ref/create_function.sgml
index 863d99d1fc..0d240484cd 100644
--- a/doc/src/sgml/ref/create_function.sgml
+++ b/doc/src/sgml/ref/create_function.sgml
@@ -428,22 +428,24 @@ CREATE [ OR REPLACE ] FUNCTION
     <term><literal>PARALLEL</literal></term>

     <listitem>
-     <para><literal>PARALLEL UNSAFE</literal> indicates that the function
-      can't be executed in parallel mode and the presence of such a
+     <para>
+      <literal>PARALLEL UNSAFE</literal> indicates that the function
+      can't be executed in parallel mode; the presence of such a
       function in an SQL statement forces a serial execution plan.  This is
       the default.  <literal>PARALLEL RESTRICTED</literal> indicates that
-      the function can be executed in parallel mode, but the execution is
-      restricted to parallel group leader.  <literal>PARALLEL SAFE</literal>
+      the function can be executed in parallel mode, but only in the parallel
+      group leader process.  <literal>PARALLEL SAFE</literal>
       indicates that the function is safe to run in parallel mode without
-      restriction.
+      restriction, including in parallel worker processes.
      </para>

      <para>
       Functions should be labeled parallel unsafe if they modify any database
-      state, or if they make changes to the transaction such as using
-      sub-transactions, or if they access sequences or attempt to make
-      persistent changes to settings (e.g., <literal>setval</literal>).  They should
-      be labeled as parallel restricted if they access temporary tables,
+      state, change the transaction state (other than by using a
+      subtransaction for error recovery), access sequences (e.g., by
+      calling <literal>currval</literal>) or make persistent changes to
+      settings.  They should
+      be labeled parallel restricted if they access temporary tables,
       client connection state, cursors, prepared statements, or miscellaneous
       backend-local state which the system cannot synchronize in parallel mode
       (e.g.,  <literal>setseed</literal> cannot be executed other than by the group
diff --git a/src/backend/access/transam/README.parallel b/src/backend/access/transam/README.parallel
index e486bffc47..9df3da91b0 100644
--- a/src/backend/access/transam/README.parallel
+++ b/src/backend/access/transam/README.parallel
@@ -137,7 +137,7 @@ Transaction Integration
 =======================

 Regardless of what the TransactionState stack looks like in the parallel
-leader, each parallel worker ends up with a stack of depth 1.  This stack
+leader, each parallel worker begins with a stack of depth 1.  This stack
 entry is marked with the special transaction block state
 TBLOCK_PARALLEL_INPROGRESS so that it's not confused with an ordinary
 toplevel transaction.  The XID of this TransactionState is set to the XID of
@@ -153,18 +153,18 @@ associated with the memory contexts or resource owners of intermediate
 subtransactions.

 No meaningful change to the transaction state can be made while in parallel
-mode.  No XIDs can be assigned, and no subtransactions can start or end,
+mode.  No XIDs can be assigned, and no command counter increments can happen,
 because we have no way of communicating these state changes to cooperating
 backends, or of synchronizing them.  It's clearly unworkable for the initiating
 backend to exit any transaction or subtransaction that was in progress when
 parallelism was started before all parallel workers have exited; and it's even
 more clearly crazy for a parallel worker to try to subcommit or subabort the
 current subtransaction and execute in some other transaction context than was
-present in the initiating backend.  It might be practical to allow internal
-sub-transactions (e.g. to implement a PL/pgSQL EXCEPTION block) to be used in
-parallel mode, provided that they are XID-less, because other backends
-wouldn't really need to know about those transactions or do anything
-differently because of them.  Right now, we don't even allow that.
+present in the initiating backend.  However, we allow internal subtransactions
+(e.g. those used to implement a PL/pgSQL EXCEPTION block) to be used in
+parallel mode, provided that they remain XID-less, because other backends
+don't really need to know about those transactions or do anything differently
+because of them.

 At the end of a parallel operation, which can happen either because it
 completed successfully or because it was interrupted by an error, parallel
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 751c251cf5..8613fc6fb5 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -1226,10 +1226,8 @@ HandleParallelMessage(ParallelContext *pcxt, int i, StringInfo msg)
 /*
  * End-of-subtransaction cleanup for parallel contexts.
  *
- * Currently, it's forbidden to enter or leave a subtransaction while
- * parallel mode is in effect, so we could just blow away everything.  But
- * we may want to relax that restriction in the future, so this code
- * contemplates that there may be multiple subtransaction IDs in pcxt_list.
+ * Here we remove only parallel contexts initiated within the current
+ * subtransaction.
  */
 void
 AtEOSubXact_Parallel(bool isCommit, SubTransactionId mySubId)
@@ -1249,6 +1247,8 @@ AtEOSubXact_Parallel(bool isCommit, SubTransactionId mySubId)

 /*
  * End-of-transaction cleanup for parallel contexts.
+ *
+ * We nuke all remaining parallel contexts.
  */
 void
 AtEOXact_Parallel(bool isCommit)
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 1d930752c5..e03f60534d 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -183,6 +183,10 @@ typedef enum TBlockState

 /*
  *    transaction state structure
+ *
+ * Note: parallelModeLevel counts the number of unmatched EnterParallelMode
+ * calls done at this transaction level.  parallelChildXact is true if any
+ * upper transaction level has nonzero parallelModeLevel.
  */
 typedef struct TransactionStateData
 {
@@ -205,6 +209,7 @@ typedef struct TransactionStateData
     bool        startedInRecovery;    /* did we start in recovery? */
     bool        didLogXid;        /* has xid been included in WAL record? */
     int            parallelModeLevel;    /* Enter/ExitParallelMode counter */
+    bool        parallelChildXact;    /* is any parent transaction parallel? */
     bool        chain;            /* start a new block after this one */
     bool        topXidLogged;    /* for a subxact: is top-level XID logged? */
     struct TransactionStateData *parent;    /* back link to parent */
@@ -639,7 +644,9 @@ AssignTransactionId(TransactionState s)
      * operation, so we can't account for new XIDs at this point.
      */
     if (IsInParallelMode() || IsParallelWorker())
-        elog(ERROR, "cannot assign XIDs during a parallel operation");
+        ereport(ERROR,
+                (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+                 errmsg("cannot assign XIDs during a parallel operation")));

     /*
      * Ensure parent(s) have XIDs, so that a child always has an XID later
@@ -827,7 +834,11 @@ GetCurrentCommandId(bool used)
          * could relax this restriction when currentCommandIdUsed was already
          * true at the start of the parallel operation.
          */
-        Assert(!IsParallelWorker());
+        if (IsParallelWorker())
+            ereport(ERROR,
+                    (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+                     errmsg("cannot modify data in a parallel worker")));
+
         currentCommandIdUsed = true;
     }
     return currentCommandId;
@@ -1052,7 +1063,8 @@ ExitParallelMode(void)
     TransactionState s = CurrentTransactionState;

     Assert(s->parallelModeLevel > 0);
-    Assert(s->parallelModeLevel > 1 || !ParallelContextActive());
+    Assert(s->parallelModeLevel > 1 || s->parallelChildXact ||
+           !ParallelContextActive());

     --s->parallelModeLevel;
 }
@@ -1065,11 +1077,17 @@ ExitParallelMode(void)
  * match across all workers.  Mere caches usually don't require such a
  * restriction.  State modified in a strict push/pop fashion, such as the
  * active snapshot stack, is often fine.
+ *
+ * We say we are in parallel mode if we are in a subxact of a transaction
+ * that's initiated a parallel operation; for most purposes that context
+ * has all the same restrictions.
  */
 bool
 IsInParallelMode(void)
 {
-    return CurrentTransactionState->parallelModeLevel != 0;
+    TransactionState s = CurrentTransactionState;
+
+    return s->parallelModeLevel != 0 || s->parallelChildXact;
 }

 /*
@@ -1092,7 +1110,9 @@ CommandCounterIncrement(void)
          * point.
          */
         if (IsInParallelMode() || IsParallelWorker())
-            elog(ERROR, "cannot start commands during a parallel operation");
+            ereport(ERROR,
+                    (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+                     errmsg("cannot start commands during a parallel operation")));

         currentCommandId += 1;
         if (currentCommandId == InvalidCommandId)
@@ -2263,6 +2283,7 @@ CommitTransaction(void)
      */
     s->state = TRANS_COMMIT;
     s->parallelModeLevel = 0;
+    s->parallelChildXact = false;

     /* Disable transaction timeout */
     if (TransactionTimeout > 0)
@@ -2809,6 +2830,7 @@ AbortTransaction(void)
     {
         AtEOXact_Parallel(false);
         s->parallelModeLevel = 0;
+        s->parallelChildXact = false;
     }

     /*
@@ -2937,6 +2959,7 @@ CleanupTransaction(void)
     s->nChildXids = 0;
     s->maxChildXids = 0;
     s->parallelModeLevel = 0;
+    s->parallelChildXact = false;

     XactTopFullTransactionId = InvalidFullTransactionId;
     nParallelCurrentXids = 0;
@@ -4303,7 +4326,7 @@ DefineSavepoint(const char *name)
      * is TBLOCK_PARALLEL_INPROGRESS, so we can treat that as an invalid case
      * below.)
      */
-    if (IsInParallelMode())
+    if (IsInParallelMode() || IsParallelWorker())
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
                  errmsg("cannot define savepoints during a parallel operation")));
@@ -4390,7 +4413,7 @@ ReleaseSavepoint(const char *name)
      * is TBLOCK_PARALLEL_INPROGRESS, so we can treat that as an invalid case
      * below.)
      */
-    if (IsInParallelMode())
+    if (IsInParallelMode() || IsParallelWorker())
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
                  errmsg("cannot release savepoints during a parallel operation")));
@@ -4499,7 +4522,7 @@ RollbackToSavepoint(const char *name)
      * is TBLOCK_PARALLEL_INPROGRESS, so we can treat that as an invalid case
      * below.)
      */
-    if (IsInParallelMode())
+    if (IsInParallelMode() || IsParallelWorker())
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
                  errmsg("cannot rollback to savepoints during a parallel operation")));
@@ -4605,38 +4628,40 @@ RollbackToSavepoint(const char *name)
 /*
  * BeginInternalSubTransaction
  *        This is the same as DefineSavepoint except it allows TBLOCK_STARTED,
- *        TBLOCK_IMPLICIT_INPROGRESS, TBLOCK_END, and TBLOCK_PREPARE states,
- *        and therefore it can safely be used in functions that might be called
- *        when not inside a BEGIN block or when running deferred triggers at
- *        COMMIT/PREPARE time.  Also, it automatically does
- *        CommitTransactionCommand/StartTransactionCommand instead of expecting
- *        the caller to do it.
+ *        TBLOCK_IMPLICIT_INPROGRESS, TBLOCK_PARALLEL_INPROGRESS, TBLOCK_END,
+ *        and TBLOCK_PREPARE states, and therefore it can safely be used in
+ *        functions that might be called when not inside a BEGIN block or when
+ *        running deferred triggers at COMMIT/PREPARE time.  Also, it
+ *        automatically does CommitTransactionCommand/StartTransactionCommand
+ *        instead of expecting the caller to do it.
  */
 void
 BeginInternalSubTransaction(const char *name)
 {
     TransactionState s = CurrentTransactionState;
+    bool        save_ExitOnAnyError = ExitOnAnyError;

     /*
-     * Workers synchronize transaction state at the beginning of each parallel
-     * operation, so we can't account for new subtransactions after that
-     * point. We might be able to make an exception for the type of
-     * subtransaction established by this function, which is typically used in
-     * contexts where we're going to release or roll back the subtransaction
-     * before proceeding further, so that no enduring change to the
-     * transaction state occurs. For now, however, we prohibit this case along
-     * with all the others.
+     * Errors within this function are improbable, but if one does happen we
+     * force a FATAL exit.  Callers generally aren't prepared to handle losing
+     * control, and moreover our transaction state is probably corrupted if we
+     * fail partway through; so an ordinary ERROR longjmp isn't okay.
+     */
+    ExitOnAnyError = true;
+
+    /*
+     * We do not check for parallel mode here.  It's permissible to start and
+     * end "internal" subtransactions while in parallel mode, so long as no
+     * new XIDs or command IDs are assigned.  Enforcement of that occurs in
+     * AssignTransactionId() and CommandCounterIncrement().
      */
-    if (IsInParallelMode())
-        ereport(ERROR,
-                (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
-                 errmsg("cannot start subtransactions during a parallel operation")));

     switch (s->blockState)
     {
         case TBLOCK_STARTED:
         case TBLOCK_INPROGRESS:
         case TBLOCK_IMPLICIT_INPROGRESS:
+        case TBLOCK_PARALLEL_INPROGRESS:
         case TBLOCK_END:
         case TBLOCK_PREPARE:
         case TBLOCK_SUBINPROGRESS:
@@ -4655,7 +4680,6 @@ BeginInternalSubTransaction(const char *name)
             /* These cases are invalid. */
         case TBLOCK_DEFAULT:
         case TBLOCK_BEGIN:
-        case TBLOCK_PARALLEL_INPROGRESS:
         case TBLOCK_SUBBEGIN:
         case TBLOCK_SUBRELEASE:
         case TBLOCK_SUBCOMMIT:
@@ -4674,6 +4698,8 @@ BeginInternalSubTransaction(const char *name)

     CommitTransactionCommand();
     StartTransactionCommand();
+
+    ExitOnAnyError = save_ExitOnAnyError;
 }

 /*
@@ -4689,16 +4715,10 @@ ReleaseCurrentSubTransaction(void)
     TransactionState s = CurrentTransactionState;

     /*
-     * Workers synchronize transaction state at the beginning of each parallel
-     * operation, so we can't account for commit of subtransactions after that
-     * point.  This should not happen anyway.  Code calling this would
-     * typically have called BeginInternalSubTransaction() first, failing
-     * there.
+     * We do not check for parallel mode here.  It's permissible to start and
+     * end "internal" subtransactions while in parallel mode, so long as no
+     * new XIDs or command IDs are assigned.
      */
-    if (IsInParallelMode())
-        ereport(ERROR,
-                (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
-                 errmsg("cannot commit subtransactions during a parallel operation")));

     if (s->blockState != TBLOCK_SUBINPROGRESS)
         elog(ERROR, "ReleaseCurrentSubTransaction: unexpected state %s",
@@ -4723,11 +4743,9 @@ RollbackAndReleaseCurrentSubTransaction(void)
     TransactionState s = CurrentTransactionState;

     /*
-     * Unlike ReleaseCurrentSubTransaction(), this is nominally permitted
-     * during parallel operations.  That's because we may be in the leader,
-     * recovering from an error thrown while we were in parallel mode.  We
-     * won't reach here in a worker, because BeginInternalSubTransaction()
-     * will have failed.
+     * We do not check for parallel mode here.  It's permissible to start and
+     * end "internal" subtransactions while in parallel mode, so long as no
+     * new XIDs or command IDs are assigned.
      */

     switch (s->blockState)
@@ -4774,6 +4792,7 @@ RollbackAndReleaseCurrentSubTransaction(void)
     Assert(s->blockState == TBLOCK_SUBINPROGRESS ||
            s->blockState == TBLOCK_INPROGRESS ||
            s->blockState == TBLOCK_IMPLICIT_INPROGRESS ||
+           s->blockState == TBLOCK_PARALLEL_INPROGRESS ||
            s->blockState == TBLOCK_STARTED);
 }

@@ -5037,8 +5056,13 @@ CommitSubTransaction(void)
     CallSubXactCallbacks(SUBXACT_EVENT_PRE_COMMIT_SUB, s->subTransactionId,
                          s->parent->subTransactionId);

-    /* If in parallel mode, clean up workers and exit parallel mode. */
-    if (IsInParallelMode())
+    /*
+     * If this subxact has started any unfinished parallel operation, clean up
+     * its workers and exit parallel mode.  We don't use IsInParallelMode()
+     * here since that'd return true if any parent level is parallel, causing
+     * us to uselessly call AtEOSubXact_Parallel.
+     */
+    if (s->parallelModeLevel != 0)
     {
         AtEOSubXact_Parallel(true, s->subTransactionId);
         s->parallelModeLevel = 0;
@@ -5213,8 +5237,13 @@ AbortSubTransaction(void)
      * exports are not supported in subtransactions.
      */

-    /* Exit from parallel mode, if necessary. */
-    if (IsInParallelMode())
+    /*
+     * If this subxact has started any unfinished parallel operation, clean up
+     * its workers and exit parallel mode.  We don't use IsInParallelMode()
+     * here since that'd return true if any parent level is parallel, causing
+     * us to uselessly call AtEOSubXact_Parallel.
+     */
+    if (s->parallelModeLevel != 0)
     {
         AtEOSubXact_Parallel(false, s->subTransactionId);
         s->parallelModeLevel = 0;
@@ -5377,6 +5406,7 @@ PushTransaction(void)
     s->prevXactReadOnly = XactReadOnly;
     s->startedInRecovery = p->startedInRecovery;
     s->parallelModeLevel = 0;
+    s->parallelChildXact = (p->parallelModeLevel != 0 || p->parallelChildXact);
     s->topXidLogged = false;

     CurrentTransactionState = s;
diff --git a/src/test/regress/expected/plpgsql.out b/src/test/regress/expected/plpgsql.out
index 57c1ae092d..0637256676 100644
--- a/src/test/regress/expected/plpgsql.out
+++ b/src/test/regress/expected/plpgsql.out
@@ -4032,7 +4032,7 @@ declare v int := 0;
 begin
   return 10 / v;
 end;
-$$ language plpgsql;
+$$ language plpgsql parallel safe;
 create or replace function raise_test() returns void as $$
 begin
   raise exception 'custom exception'
@@ -4099,8 +4099,37 @@ $$ language plpgsql;
 select stacked_diagnostics_test();
 ERROR:  GET STACKED DIAGNOSTICS cannot be used outside an exception handler
 CONTEXT:  PL/pgSQL function stacked_diagnostics_test() line 6 at GET STACKED DIAGNOSTICS
-drop function zero_divide();
 drop function stacked_diagnostics_test();
+-- Test that an error recovery subtransaction is parallel safe
+create function error_trap_test() returns text as $$
+begin
+  perform zero_divide();
+  return 'no error detected!';
+exception when division_by_zero then
+  return 'division_by_zero detected';
+end;
+$$ language plpgsql parallel safe;
+set debug_parallel_query to on;
+explain (verbose, costs off) select error_trap_test();
+            QUERY PLAN
+-----------------------------------
+ Gather
+   Output: (error_trap_test())
+   Workers Planned: 1
+   Single Copy: true
+   ->  Result
+         Output: error_trap_test()
+(6 rows)
+
+select error_trap_test();
+      error_trap_test
+---------------------------
+ division_by_zero detected
+(1 row)
+
+reset debug_parallel_query;
+drop function error_trap_test();
+drop function zero_divide();
 -- check cases where implicit SQLSTATE variable could be confused with
 -- SQLSTATE as a keyword, cf bug #5524
 create or replace function raise_test() returns void as $$
diff --git a/src/test/regress/sql/plpgsql.sql b/src/test/regress/sql/plpgsql.sql
index 5a9b2f0040..9ca9449a50 100644
--- a/src/test/regress/sql/plpgsql.sql
+++ b/src/test/regress/sql/plpgsql.sql
@@ -3356,7 +3356,7 @@ declare v int := 0;
 begin
   return 10 / v;
 end;
-$$ language plpgsql;
+$$ language plpgsql parallel safe;

 create or replace function raise_test() returns void as $$
 begin
@@ -3417,9 +3417,29 @@ $$ language plpgsql;

 select stacked_diagnostics_test();

-drop function zero_divide();
 drop function stacked_diagnostics_test();

+-- Test that an error recovery subtransaction is parallel safe
+
+create function error_trap_test() returns text as $$
+begin
+  perform zero_divide();
+  return 'no error detected!';
+exception when division_by_zero then
+  return 'division_by_zero detected';
+end;
+$$ language plpgsql parallel safe;
+
+set debug_parallel_query to on;
+
+explain (verbose, costs off) select error_trap_test();
+select error_trap_test();
+
+reset debug_parallel_query;
+
+drop function error_trap_test();
+drop function zero_divide();
+
 -- check cases where implicit SQLSTATE variable could be confused with
 -- SQLSTATE as a keyword, cf bug #5524
 create or replace function raise_test() returns void as $$

Re: [PATCH] plpython function causes server panic

От

Robert Haas

Дата:

23 марта, 15:55:30

On Fri, Mar 22, 2024 at 4:37 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Fair enough.  In the attached v2, I wrote "change the transaction
> state (other than by using a subtransaction for error recovery)";
> what do you think of that?

I think that's pretty good. I wonder if there are some bizarre cases
where the patch would allow slightly more than that ... who is to say
that you must pop the subtransaction you pushed? But that sort of
pedantry is probably not worth worrying about for purposes of the
documentation, especially because such a thing might not be a very
good idea anyway.

> I dug around in the docs and couldn't really find anything about
> parallel-query transaction limitations other than this bit in
> parallel.sgml and the more or less copy-pasted text in
> create_function.sgml; did you have any other spots in mind?
> (I did find the commentary in README.parallel, but that's not
> exactly user-facing.)

I don't have anything else in mind at the moment.

> I think these things are already dealt with.  However, one thing
> worth questioning is that CommitSubTransaction() will just silently
> kill any workers started during the current subxact, and likewise
> CommitTransaction() zaps workers without complaint.  Shouldn't these
> instead throw an error about how you didn't close parallel mode,
> and then the corresponding Abort function does the cleanup?
> I did not change that behavior here, but it seems dubious.

I'm not sure. I definitely knew when I wrote this code that we often
emit warnings about resources that aren't cleaned up at (sub)commit
time rather than just silently releasing them, and I feel like the
fact that I didn't implement that behavior here was probably a
deliberate choice to avoid some problem. But I have no memory of what
that problem was, and it is entirely possible that it was eliminated
at some later phase of development. I think that decision was made
quite early, before much of anything was working.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: [PATCH] plpython function causes server panic

От

Tom Lane

Дата:

23 марта, 19:31:51

Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Mar 22, 2024 at 4:37 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I think these things are already dealt with.  However, one thing
>> worth questioning is that CommitSubTransaction() will just silently
>> kill any workers started during the current subxact, and likewise
>> CommitTransaction() zaps workers without complaint.  Shouldn't these
>> instead throw an error about how you didn't close parallel mode,
>> and then the corresponding Abort function does the cleanup?
>> I did not change that behavior here, but it seems dubious.

> I'm not sure. I definitely knew when I wrote this code that we often
> emit warnings about resources that aren't cleaned up at (sub)commit
> time rather than just silently releasing them, and I feel like the
> fact that I didn't implement that behavior here was probably a
> deliberate choice to avoid some problem.

Ah, right, it's reasonable to consider this an end-of-xact resource
leak, which we generally handle with WARNING not ERROR.  And I see
that AtEOXact_Parallel and AtEOSubXact_Parallel already do

        if (isCommit)
            elog(WARNING, "leaked parallel context");

However, the calling logic seems a bit shy of a load, in that it
trusts IsInParallelMode() completely to decide whether to check for
leaked parallel contexts.  So we'd miss the case where somebody did
ExitParallelMode without having cleaned up workers.  It's not like
AtEOXact_Parallel and AtEOSubXact_Parallel cost a lot when they have
nothing to do, so I think we should call them unconditionally, and
separately from that issue a warning if parallelModeLevel isn't zero
(and we're committing).

            regards, tom lane

diff --git a/doc/src/sgml/parallel.sgml b/doc/src/sgml/parallel.sgml
index 5acc9537d6..dae9dd7f2f 100644
--- a/doc/src/sgml/parallel.sgml
+++ b/doc/src/sgml/parallel.sgml
@@ -545,10 +545,10 @@ EXPLAIN SELECT * FROM pgbench_accounts WHERE filler LIKE '%x%';
   </para>

   <para>
-    Functions and aggregates must be marked <literal>PARALLEL UNSAFE</literal> if
-    they write to the database, access sequences, change the transaction state
-    even temporarily (e.g., a PL/pgSQL function that establishes an
-    <literal>EXCEPTION</literal> block to catch errors), or make persistent changes to
+    Functions and aggregates must be marked <literal>PARALLEL UNSAFE</literal>
+    if they write to the database, change the transaction state (other than by
+    using a subtransaction for error recovery), access sequences, or make
+    persistent changes to
     settings.  Similarly, functions must be marked <literal>PARALLEL
     RESTRICTED</literal> if they access temporary tables, client connection state,
     cursors, prepared statements, or miscellaneous backend-local state that
diff --git a/doc/src/sgml/ref/create_function.sgml b/doc/src/sgml/ref/create_function.sgml
index 863d99d1fc..0d240484cd 100644
--- a/doc/src/sgml/ref/create_function.sgml
+++ b/doc/src/sgml/ref/create_function.sgml
@@ -428,22 +428,24 @@ CREATE [ OR REPLACE ] FUNCTION
     <term><literal>PARALLEL</literal></term>

     <listitem>
-     <para><literal>PARALLEL UNSAFE</literal> indicates that the function
-      can't be executed in parallel mode and the presence of such a
+     <para>
+      <literal>PARALLEL UNSAFE</literal> indicates that the function
+      can't be executed in parallel mode; the presence of such a
       function in an SQL statement forces a serial execution plan.  This is
       the default.  <literal>PARALLEL RESTRICTED</literal> indicates that
-      the function can be executed in parallel mode, but the execution is
-      restricted to parallel group leader.  <literal>PARALLEL SAFE</literal>
+      the function can be executed in parallel mode, but only in the parallel
+      group leader process.  <literal>PARALLEL SAFE</literal>
       indicates that the function is safe to run in parallel mode without
-      restriction.
+      restriction, including in parallel worker processes.
      </para>

      <para>
       Functions should be labeled parallel unsafe if they modify any database
-      state, or if they make changes to the transaction such as using
-      sub-transactions, or if they access sequences or attempt to make
-      persistent changes to settings (e.g., <literal>setval</literal>).  They should
-      be labeled as parallel restricted if they access temporary tables,
+      state, change the transaction state (other than by using a
+      subtransaction for error recovery), access sequences (e.g., by
+      calling <literal>currval</literal>) or make persistent changes to
+      settings.  They should
+      be labeled parallel restricted if they access temporary tables,
       client connection state, cursors, prepared statements, or miscellaneous
       backend-local state which the system cannot synchronize in parallel mode
       (e.g.,  <literal>setseed</literal> cannot be executed other than by the group
diff --git a/src/backend/access/transam/README.parallel b/src/backend/access/transam/README.parallel
index e486bffc47..9df3da91b0 100644
--- a/src/backend/access/transam/README.parallel
+++ b/src/backend/access/transam/README.parallel
@@ -137,7 +137,7 @@ Transaction Integration
 =======================

 Regardless of what the TransactionState stack looks like in the parallel
-leader, each parallel worker ends up with a stack of depth 1.  This stack
+leader, each parallel worker begins with a stack of depth 1.  This stack
 entry is marked with the special transaction block state
 TBLOCK_PARALLEL_INPROGRESS so that it's not confused with an ordinary
 toplevel transaction.  The XID of this TransactionState is set to the XID of
@@ -153,18 +153,18 @@ associated with the memory contexts or resource owners of intermediate
 subtransactions.

 No meaningful change to the transaction state can be made while in parallel
-mode.  No XIDs can be assigned, and no subtransactions can start or end,
+mode.  No XIDs can be assigned, and no command counter increments can happen,
 because we have no way of communicating these state changes to cooperating
 backends, or of synchronizing them.  It's clearly unworkable for the initiating
 backend to exit any transaction or subtransaction that was in progress when
 parallelism was started before all parallel workers have exited; and it's even
 more clearly crazy for a parallel worker to try to subcommit or subabort the
 current subtransaction and execute in some other transaction context than was
-present in the initiating backend.  It might be practical to allow internal
-sub-transactions (e.g. to implement a PL/pgSQL EXCEPTION block) to be used in
-parallel mode, provided that they are XID-less, because other backends
-wouldn't really need to know about those transactions or do anything
-differently because of them.  Right now, we don't even allow that.
+present in the initiating backend.  However, we allow internal subtransactions
+(e.g. those used to implement a PL/pgSQL EXCEPTION block) to be used in
+parallel mode, provided that they remain XID-less, because other backends
+don't really need to know about those transactions or do anything differently
+because of them.

 At the end of a parallel operation, which can happen either because it
 completed successfully or because it was interrupted by an error, parallel
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 751c251cf5..8613fc6fb5 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -1226,10 +1226,8 @@ HandleParallelMessage(ParallelContext *pcxt, int i, StringInfo msg)
 /*
  * End-of-subtransaction cleanup for parallel contexts.
  *
- * Currently, it's forbidden to enter or leave a subtransaction while
- * parallel mode is in effect, so we could just blow away everything.  But
- * we may want to relax that restriction in the future, so this code
- * contemplates that there may be multiple subtransaction IDs in pcxt_list.
+ * Here we remove only parallel contexts initiated within the current
+ * subtransaction.
  */
 void
 AtEOSubXact_Parallel(bool isCommit, SubTransactionId mySubId)
@@ -1249,6 +1247,8 @@ AtEOSubXact_Parallel(bool isCommit, SubTransactionId mySubId)

 /*
  * End-of-transaction cleanup for parallel contexts.
+ *
+ * We nuke all remaining parallel contexts.
  */
 void
 AtEOXact_Parallel(bool isCommit)
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 1d930752c5..df5a67e4c3 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -183,6 +183,10 @@ typedef enum TBlockState

 /*
  *    transaction state structure
+ *
+ * Note: parallelModeLevel counts the number of unmatched EnterParallelMode
+ * calls done at this transaction level.  parallelChildXact is true if any
+ * upper transaction level has nonzero parallelModeLevel.
  */
 typedef struct TransactionStateData
 {
@@ -205,6 +209,7 @@ typedef struct TransactionStateData
     bool        startedInRecovery;    /* did we start in recovery? */
     bool        didLogXid;        /* has xid been included in WAL record? */
     int            parallelModeLevel;    /* Enter/ExitParallelMode counter */
+    bool        parallelChildXact;    /* is any parent transaction parallel? */
     bool        chain;            /* start a new block after this one */
     bool        topXidLogged;    /* for a subxact: is top-level XID logged? */
     struct TransactionStateData *parent;    /* back link to parent */
@@ -639,7 +644,9 @@ AssignTransactionId(TransactionState s)
      * operation, so we can't account for new XIDs at this point.
      */
     if (IsInParallelMode() || IsParallelWorker())
-        elog(ERROR, "cannot assign XIDs during a parallel operation");
+        ereport(ERROR,
+                (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+                 errmsg("cannot assign XIDs during a parallel operation")));

     /*
      * Ensure parent(s) have XIDs, so that a child always has an XID later
@@ -827,7 +834,11 @@ GetCurrentCommandId(bool used)
          * could relax this restriction when currentCommandIdUsed was already
          * true at the start of the parallel operation.
          */
-        Assert(!IsParallelWorker());
+        if (IsParallelWorker())
+            ereport(ERROR,
+                    (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+                     errmsg("cannot modify data in a parallel worker")));
+
         currentCommandIdUsed = true;
     }
     return currentCommandId;
@@ -1052,7 +1063,8 @@ ExitParallelMode(void)
     TransactionState s = CurrentTransactionState;

     Assert(s->parallelModeLevel > 0);
-    Assert(s->parallelModeLevel > 1 || !ParallelContextActive());
+    Assert(s->parallelModeLevel > 1 || s->parallelChildXact ||
+           !ParallelContextActive());

     --s->parallelModeLevel;
 }
@@ -1065,11 +1077,17 @@ ExitParallelMode(void)
  * match across all workers.  Mere caches usually don't require such a
  * restriction.  State modified in a strict push/pop fashion, such as the
  * active snapshot stack, is often fine.
+ *
+ * We say we are in parallel mode if we are in a subxact of a transaction
+ * that's initiated a parallel operation; for most purposes that context
+ * has all the same restrictions.
  */
 bool
 IsInParallelMode(void)
 {
-    return CurrentTransactionState->parallelModeLevel != 0;
+    TransactionState s = CurrentTransactionState;
+
+    return s->parallelModeLevel != 0 || s->parallelChildXact;
 }

 /*
@@ -1092,7 +1110,9 @@ CommandCounterIncrement(void)
          * point.
          */
         if (IsInParallelMode() || IsParallelWorker())
-            elog(ERROR, "cannot start commands during a parallel operation");
+            ereport(ERROR,
+                    (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+                     errmsg("cannot start commands during a parallel operation")));

         currentCommandId += 1;
         if (currentCommandId == InvalidCommandId)
@@ -2210,9 +2230,26 @@ CommitTransaction(void)
     CallXactCallbacks(is_parallel_worker ? XACT_EVENT_PARALLEL_PRE_COMMIT
                       : XACT_EVENT_PRE_COMMIT);

-    /* If we might have parallel workers, clean them up now. */
-    if (IsInParallelMode())
-        AtEOXact_Parallel(true);
+    /*
+     * If this xact has started any unfinished parallel operation, clean up
+     * its workers, warning about leaked resources.  (But we don't actually
+     * reset parallelModeLevel till entering TRANS_COMMIT, a bit below.  This
+     * keeps parallel mode restrictions active as long as possible in a
+     * parallel worker.)
+     */
+    AtEOXact_Parallel(true);
+    if (is_parallel_worker)
+    {
+        if (s->parallelModeLevel != 1)
+            elog(WARNING, "parallelModeLevel is %d not 1 at end of parallel worker transaction",
+                 s->parallelModeLevel);
+    }
+    else
+    {
+        if (s->parallelModeLevel != 0)
+            elog(WARNING, "parallelModeLevel is %d not 0 at end of transaction",
+                 s->parallelModeLevel);
+    }

     /* Shut down the deferred-trigger manager */
     AfterTriggerEndXact(true);
@@ -2263,6 +2300,7 @@ CommitTransaction(void)
      */
     s->state = TRANS_COMMIT;
     s->parallelModeLevel = 0;
+    s->parallelChildXact = false;    /* should be false already */

     /* Disable transaction timeout */
     if (TransactionTimeout > 0)
@@ -2804,12 +2842,13 @@ AbortTransaction(void)
     /* Reset snapshot export state. */
     SnapBuildResetExportedSnapshotState();

-    /* If in parallel mode, clean up workers and exit parallel mode. */
-    if (IsInParallelMode())
-    {
-        AtEOXact_Parallel(false);
-        s->parallelModeLevel = 0;
-    }
+    /*
+     * If this xact has started any unfinished parallel operation, clean up
+     * its workers and exit parallel mode.  Don't warn about leaked resources.
+     */
+    AtEOXact_Parallel(false);
+    s->parallelModeLevel = 0;
+    s->parallelChildXact = false;    /* should be false already */

     /*
      * do abort processing
@@ -2937,6 +2976,7 @@ CleanupTransaction(void)
     s->nChildXids = 0;
     s->maxChildXids = 0;
     s->parallelModeLevel = 0;
+    s->parallelChildXact = false;

     XactTopFullTransactionId = InvalidFullTransactionId;
     nParallelCurrentXids = 0;
@@ -4303,7 +4343,7 @@ DefineSavepoint(const char *name)
      * is TBLOCK_PARALLEL_INPROGRESS, so we can treat that as an invalid case
      * below.)
      */
-    if (IsInParallelMode())
+    if (IsInParallelMode() || IsParallelWorker())
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
                  errmsg("cannot define savepoints during a parallel operation")));
@@ -4390,7 +4430,7 @@ ReleaseSavepoint(const char *name)
      * is TBLOCK_PARALLEL_INPROGRESS, so we can treat that as an invalid case
      * below.)
      */
-    if (IsInParallelMode())
+    if (IsInParallelMode() || IsParallelWorker())
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
                  errmsg("cannot release savepoints during a parallel operation")));
@@ -4499,7 +4539,7 @@ RollbackToSavepoint(const char *name)
      * is TBLOCK_PARALLEL_INPROGRESS, so we can treat that as an invalid case
      * below.)
      */
-    if (IsInParallelMode())
+    if (IsInParallelMode() || IsParallelWorker())
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
                  errmsg("cannot rollback to savepoints during a parallel operation")));
@@ -4605,38 +4645,40 @@ RollbackToSavepoint(const char *name)
 /*
  * BeginInternalSubTransaction
  *        This is the same as DefineSavepoint except it allows TBLOCK_STARTED,
- *        TBLOCK_IMPLICIT_INPROGRESS, TBLOCK_END, and TBLOCK_PREPARE states,
- *        and therefore it can safely be used in functions that might be called
- *        when not inside a BEGIN block or when running deferred triggers at
- *        COMMIT/PREPARE time.  Also, it automatically does
- *        CommitTransactionCommand/StartTransactionCommand instead of expecting
- *        the caller to do it.
+ *        TBLOCK_IMPLICIT_INPROGRESS, TBLOCK_PARALLEL_INPROGRESS, TBLOCK_END,
+ *        and TBLOCK_PREPARE states, and therefore it can safely be used in
+ *        functions that might be called when not inside a BEGIN block or when
+ *        running deferred triggers at COMMIT/PREPARE time.  Also, it
+ *        automatically does CommitTransactionCommand/StartTransactionCommand
+ *        instead of expecting the caller to do it.
  */
 void
 BeginInternalSubTransaction(const char *name)
 {
     TransactionState s = CurrentTransactionState;
+    bool        save_ExitOnAnyError = ExitOnAnyError;

     /*
-     * Workers synchronize transaction state at the beginning of each parallel
-     * operation, so we can't account for new subtransactions after that
-     * point. We might be able to make an exception for the type of
-     * subtransaction established by this function, which is typically used in
-     * contexts where we're going to release or roll back the subtransaction
-     * before proceeding further, so that no enduring change to the
-     * transaction state occurs. For now, however, we prohibit this case along
-     * with all the others.
-     */
-    if (IsInParallelMode())
-        ereport(ERROR,
-                (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
-                 errmsg("cannot start subtransactions during a parallel operation")));
+     * Errors within this function are improbable, but if one does happen we
+     * force a FATAL exit.  Callers generally aren't prepared to handle losing
+     * control, and moreover our transaction state is probably corrupted if we
+     * fail partway through; so an ordinary ERROR longjmp isn't okay.
+     */
+    ExitOnAnyError = true;
+
+    /*
+     * We do not check for parallel mode here.  It's permissible to start and
+     * end "internal" subtransactions while in parallel mode, so long as no
+     * new XIDs or command IDs are assigned.  Enforcement of that occurs in
+     * AssignTransactionId() and CommandCounterIncrement().
+     */

     switch (s->blockState)
     {
         case TBLOCK_STARTED:
         case TBLOCK_INPROGRESS:
         case TBLOCK_IMPLICIT_INPROGRESS:
+        case TBLOCK_PARALLEL_INPROGRESS:
         case TBLOCK_END:
         case TBLOCK_PREPARE:
         case TBLOCK_SUBINPROGRESS:
@@ -4655,7 +4697,6 @@ BeginInternalSubTransaction(const char *name)
             /* These cases are invalid. */
         case TBLOCK_DEFAULT:
         case TBLOCK_BEGIN:
-        case TBLOCK_PARALLEL_INPROGRESS:
         case TBLOCK_SUBBEGIN:
         case TBLOCK_SUBRELEASE:
         case TBLOCK_SUBCOMMIT:
@@ -4674,6 +4715,8 @@ BeginInternalSubTransaction(const char *name)

     CommitTransactionCommand();
     StartTransactionCommand();
+
+    ExitOnAnyError = save_ExitOnAnyError;
 }

 /*
@@ -4689,16 +4732,10 @@ ReleaseCurrentSubTransaction(void)
     TransactionState s = CurrentTransactionState;

     /*
-     * Workers synchronize transaction state at the beginning of each parallel
-     * operation, so we can't account for commit of subtransactions after that
-     * point.  This should not happen anyway.  Code calling this would
-     * typically have called BeginInternalSubTransaction() first, failing
-     * there.
+     * We do not check for parallel mode here.  It's permissible to start and
+     * end "internal" subtransactions while in parallel mode, so long as no
+     * new XIDs or command IDs are assigned.
      */
-    if (IsInParallelMode())
-        ereport(ERROR,
-                (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
-                 errmsg("cannot commit subtransactions during a parallel operation")));

     if (s->blockState != TBLOCK_SUBINPROGRESS)
         elog(ERROR, "ReleaseCurrentSubTransaction: unexpected state %s",
@@ -4723,11 +4760,9 @@ RollbackAndReleaseCurrentSubTransaction(void)
     TransactionState s = CurrentTransactionState;

     /*
-     * Unlike ReleaseCurrentSubTransaction(), this is nominally permitted
-     * during parallel operations.  That's because we may be in the leader,
-     * recovering from an error thrown while we were in parallel mode.  We
-     * won't reach here in a worker, because BeginInternalSubTransaction()
-     * will have failed.
+     * We do not check for parallel mode here.  It's permissible to start and
+     * end "internal" subtransactions while in parallel mode, so long as no
+     * new XIDs or command IDs are assigned.
      */

     switch (s->blockState)
@@ -4774,6 +4809,7 @@ RollbackAndReleaseCurrentSubTransaction(void)
     Assert(s->blockState == TBLOCK_SUBINPROGRESS ||
            s->blockState == TBLOCK_INPROGRESS ||
            s->blockState == TBLOCK_IMPLICIT_INPROGRESS ||
+           s->blockState == TBLOCK_PARALLEL_INPROGRESS ||
            s->blockState == TBLOCK_STARTED);
 }

@@ -5037,10 +5073,15 @@ CommitSubTransaction(void)
     CallSubXactCallbacks(SUBXACT_EVENT_PRE_COMMIT_SUB, s->subTransactionId,
                          s->parent->subTransactionId);

-    /* If in parallel mode, clean up workers and exit parallel mode. */
-    if (IsInParallelMode())
+    /*
+     * If this subxact has started any unfinished parallel operation, clean up
+     * its workers and exit parallel mode.  Warn about leaked resources.
+     */
+    AtEOSubXact_Parallel(true, s->subTransactionId);
+    if (s->parallelModeLevel != 0)
     {
-        AtEOSubXact_Parallel(true, s->subTransactionId);
+        elog(WARNING, "parallelModeLevel is %d not 0 at end of subtransaction",
+             s->parallelModeLevel);
         s->parallelModeLevel = 0;
     }

@@ -5213,12 +5254,12 @@ AbortSubTransaction(void)
      * exports are not supported in subtransactions.
      */

-    /* Exit from parallel mode, if necessary. */
-    if (IsInParallelMode())
-    {
-        AtEOSubXact_Parallel(false, s->subTransactionId);
-        s->parallelModeLevel = 0;
-    }
+    /*
+     * If this subxact has started any unfinished parallel operation, clean up
+     * its workers and exit parallel mode.  Don't warn about leaked resources.
+     */
+    AtEOSubXact_Parallel(false, s->subTransactionId);
+    s->parallelModeLevel = 0;

     /*
      * We can skip all this stuff if the subxact failed before creating a
@@ -5377,6 +5418,7 @@ PushTransaction(void)
     s->prevXactReadOnly = XactReadOnly;
     s->startedInRecovery = p->startedInRecovery;
     s->parallelModeLevel = 0;
+    s->parallelChildXact = (p->parallelModeLevel != 0 || p->parallelChildXact);
     s->topXidLogged = false;

     CurrentTransactionState = s;
diff --git a/src/test/regress/expected/plpgsql.out b/src/test/regress/expected/plpgsql.out
index 57c1ae092d..0637256676 100644
--- a/src/test/regress/expected/plpgsql.out
+++ b/src/test/regress/expected/plpgsql.out
@@ -4032,7 +4032,7 @@ declare v int := 0;
 begin
   return 10 / v;
 end;
-$$ language plpgsql;
+$$ language plpgsql parallel safe;
 create or replace function raise_test() returns void as $$
 begin
   raise exception 'custom exception'
@@ -4099,8 +4099,37 @@ $$ language plpgsql;
 select stacked_diagnostics_test();
 ERROR:  GET STACKED DIAGNOSTICS cannot be used outside an exception handler
 CONTEXT:  PL/pgSQL function stacked_diagnostics_test() line 6 at GET STACKED DIAGNOSTICS
-drop function zero_divide();
 drop function stacked_diagnostics_test();
+-- Test that an error recovery subtransaction is parallel safe
+create function error_trap_test() returns text as $$
+begin
+  perform zero_divide();
+  return 'no error detected!';
+exception when division_by_zero then
+  return 'division_by_zero detected';
+end;
+$$ language plpgsql parallel safe;
+set debug_parallel_query to on;
+explain (verbose, costs off) select error_trap_test();
+            QUERY PLAN
+-----------------------------------
+ Gather
+   Output: (error_trap_test())
+   Workers Planned: 1
+   Single Copy: true
+   ->  Result
+         Output: error_trap_test()
+(6 rows)
+
+select error_trap_test();
+      error_trap_test
+---------------------------
+ division_by_zero detected
+(1 row)
+
+reset debug_parallel_query;
+drop function error_trap_test();
+drop function zero_divide();
 -- check cases where implicit SQLSTATE variable could be confused with
 -- SQLSTATE as a keyword, cf bug #5524
 create or replace function raise_test() returns void as $$
diff --git a/src/test/regress/sql/plpgsql.sql b/src/test/regress/sql/plpgsql.sql
index 5a9b2f0040..9ca9449a50 100644
--- a/src/test/regress/sql/plpgsql.sql
+++ b/src/test/regress/sql/plpgsql.sql
@@ -3356,7 +3356,7 @@ declare v int := 0;
 begin
   return 10 / v;
 end;
-$$ language plpgsql;
+$$ language plpgsql parallel safe;

 create or replace function raise_test() returns void as $$
 begin
@@ -3417,9 +3417,29 @@ $$ language plpgsql;

 select stacked_diagnostics_test();

-drop function zero_divide();
 drop function stacked_diagnostics_test();

+-- Test that an error recovery subtransaction is parallel safe
+
+create function error_trap_test() returns text as $$
+begin
+  perform zero_divide();
+  return 'no error detected!';
+exception when division_by_zero then
+  return 'division_by_zero detected';
+end;
+$$ language plpgsql parallel safe;
+
+set debug_parallel_query to on;
+
+explain (verbose, costs off) select error_trap_test();
+select error_trap_test();
+
+reset debug_parallel_query;
+
+drop function error_trap_test();
+drop function zero_divide();
+
 -- check cases where implicit SQLSTATE variable could be confused with
 -- SQLSTATE as a keyword, cf bug #5524
 create or replace function raise_test() returns void as $$

Re: [PATCH] plpython function causes server panic

От

Robert Haas

Дата:

25 марта, 18:16:35

On Sat, Mar 23, 2024 at 12:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> However, the calling logic seems a bit shy of a load, in that it
> trusts IsInParallelMode() completely to decide whether to check for
> leaked parallel contexts.  So we'd miss the case where somebody did
> ExitParallelMode without having cleaned up workers.  It's not like
> AtEOXact_Parallel and AtEOSubXact_Parallel cost a lot when they have
> nothing to do, so I think we should call them unconditionally, and
> separately from that issue a warning if parallelModeLevel isn't zero
> (and we're committing).

I wasn't worried about this case when I wrote this code. The general
flow that I anticipated was that somebody would run a query, and
ExecMain.c would enter parallel mode, and then maybe eventually reach
some SQL-callable C function that hadn't gotten the memo about
parallel query but had been mistakenly labelled as PARALLEL RESTRICTED
or PARALLEL SAFE when it wasn't really, and so the goal was for core
functions that such a function might reasonably attempt to call to
notice that something bad was happening.

But if the user puts a call to ExitParallelMode() inside such a
function, it's hard to imagine what goal they have other than to
deliberately circumvent the safeguards. And they're always going to be
able to do that somehow, if they're coding in C. So I'm not convinced
that the sanity checks you've added are really going to do anything
other than burn a handful of CPU cycles. If there's some plausible
case in which they protect us against a user who has legitimately made
an error, fine; but if we're just wandering down the slippery slope of
believing we can defend against malicious C code, we absolutely should
not do that, not even a little bit. The first CPU instruction we burn
in the service of a hopeless cause is already one too many.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: [PATCH] plpython function causes server panic

От

Tom Lane

Дата:

25 марта, 18:36:19

Robert Haas <robertmhaas@gmail.com> writes:
> On Sat, Mar 23, 2024 at 12:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> However, the calling logic seems a bit shy of a load, in that it
>> trusts IsInParallelMode() completely to decide whether to check for
>> leaked parallel contexts.  So we'd miss the case where somebody did
>> ExitParallelMode without having cleaned up workers.

> But if the user puts a call to ExitParallelMode() inside such a
> function, it's hard to imagine what goal they have other than to
> deliberately circumvent the safeguards. And they're always going to be
> able to do that somehow, if they're coding in C. So I'm not convinced
> that the sanity checks you've added are really going to do anything
> other than burn a handful of CPU cycles. If there's some plausible
> case in which they protect us against a user who has legitimately made
> an error, fine; but if we're just wandering down the slippery slope of
> believing we can defend against malicious C code, we absolutely should
> not do that, not even a little bit. The first CPU instruction we burn
> in the service of a hopeless cause is already one too many.

By that logic, we should rip out every Assert in the system, as well
as all of the (extensive) resource leak checking that already happens
during CommitTransaction.  We've always felt that those leak checks
were worth the cost to help us find bugs --- which they have done and
still do from time to time.  I don't see why this case is different,
especially when the added cost compared to HEAD is not much more than
one C function call.

Or in other words: the point is not about stopping malicious C code,
it's about recognizing that we make mistakes.

            regards, tom lane

Re: [PATCH] plpython function causes server panic

От

Robert Haas

Дата:

25 марта, 18:50:42

On Mon, Mar 25, 2024 at 11:36 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> By that logic, we should rip out every Assert in the system, as well
> as all of the (extensive) resource leak checking that already happens
> during CommitTransaction.  We've always felt that those leak checks
> were worth the cost to help us find bugs --- which they have done and
> still do from time to time.  I don't see why this case is different,
> especially when the added cost compared to HEAD is not much more than
> one C function call.

Well, I explained why *I* thought it was different, but obviously you
don't agree.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: [PATCH] plpython function causes server panic

От

Tom Lane

Дата:

28 марта, 00:28:05

Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Mar 25, 2024 at 11:36 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> ...  I don't see why this case is different,
>> especially when the added cost compared to HEAD is not much more than
>> one C function call.

> Well, I explained why *I* thought it was different, but obviously you
> don't agree.

After mulling it over for awhile, I still think the extra checking
is appropriate, especially since this patch is enlarging the set of
things that can happen in parallel mode.  How do you want to proceed?

            regards, tom lane

Re: [PATCH] plpython function causes server panic

От

Robert Haas

Дата:

28 марта, 17:22:25

On Wed, Mar 27, 2024 at 5:28 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> After mulling it over for awhile, I still think the extra checking
> is appropriate, especially since this patch is enlarging the set of
> things that can happen in parallel mode.  How do you want to proceed?

I sort of assumed you were going to commit the patch as you had it.
I'm not a huge fan of that, but I don't think that's it's catastrophe,
either. It pains me a bit to add CPU cycles that I consider
unnecessary to a very frequently taken code path, but as you say, it's
not a lot of CPU cycles, so maybe nobody will ever notice. I actually
really wish we could find some way of making subtransactions
significantly lighter-wait, because I think the cost of spinning up
and tearing down a trivial subtransaction is a real performance
problem, but fixing that is probably a pretty hard problem whether
this patch gets committed or not.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: [PATCH] plpython function causes server panic

От

Tom Lane

Дата:

28 марта, 17:59:01

Robert Haas <robertmhaas@gmail.com> writes:
> I sort of assumed you were going to commit the patch as you had it.

OK, I will move ahead on that.

> I actually
> really wish we could find some way of making subtransactions
> significantly lighter-wait, because I think the cost of spinning up
> and tearing down a trivial subtransaction is a real performance
> problem, but fixing that is probably a pretty hard problem whether
> this patch gets committed or not.

Yeah.  The whole ResourceOwner mechanism is not exactly lightweight,
but it's hard to argue that we don't need it.  I wonder whether we
could get anywhere by deeming that a "small enough" subtransaction
doesn't need to have its resources cleaned up instantly, and
instead re-use its ResourceOwner to accumulate resources of the
next subtransaction, and the next, until there's enough to be
worth cleaning up.

Having said that, it's hard to see any regime under which tied-up
parallel workers wouldn't count as a resource worth releasing ASAP.
I started this mail with the idea of suggesting that parallel contexts
ought to become a ResourceOwner-managed resource, but maybe that
wouldn't be an improvement after all.

            regards, tom lane

Re: [PATCH] plpython function causes server panic

От

Robert Haas

Дата:

28 марта, 18:27:10

On Thu, Mar 28, 2024 at 10:59 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yeah.  The whole ResourceOwner mechanism is not exactly lightweight,
> but it's hard to argue that we don't need it.  I wonder whether we
> could get anywhere by deeming that a "small enough" subtransaction
> doesn't need to have its resources cleaned up instantly, and
> instead re-use its ResourceOwner to accumulate resources of the
> next subtransaction, and the next, until there's enough to be
> worth cleaning up.

Hmm, I wonder if that's actually where the cycles are going. There's
an awful lot of separate function calls inside CommitSubTransaction(),
and in the common case, each one of them has to individually decide
that it doesn't need to do anything. Sure, they're all fast, but if
you have enough of them, it's still going to add up, at least a bit.
In that sense, the resource owner mechanism seems like it should, or
at least could, be better. I'm not sure this is quite the way it works
now, but if you had one single list/array/thingamabob that listed all
of the resources that needed releasing, that should in theory be
better when there's a lot of kinds of resources that you COULD hold
but only a small number of kinds of resources that you actually do
hold -- and it also shouldn't be any worse if it turns out that you
hold a whole lot of resources of many different types.

But I haven't done any benchmarking of this area in a long time.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: [PATCH] plpython function causes server panic

От

Tom Lane

Дата:

28 марта, 18:50:30

Robert Haas <robertmhaas@gmail.com> writes:
> Hmm, I wonder if that's actually where the cycles are going. There's
> an awful lot of separate function calls inside CommitSubTransaction(),
> and in the common case, each one of them has to individually decide
> that it doesn't need to do anything. Sure, they're all fast, but if
> you have enough of them, it's still going to add up, at least a bit.
> In that sense, the resource owner mechanism seems like it should, or
> at least could, be better.

Yeah, I was thinking about that too.  The normal case is that you
don't hold any releasable resources except locks when arriving at
CommitSubTransaction --- if you do, it's a bug and we're going to
print leak warnings.  Seems like maybe it'd be worth trying to
have a fast path for that case.  (Also, given that we probably
do need to release locks right away, this point invalidates my
earlier idea of postponing the work.)

> But I haven't done any benchmarking of this area in a long time.

Ditto.

            regards, tom lane

Re: [PATCH] plpython function causes server panic

От

Robert Haas

Дата:

28 марта, 18:59:21

On Thu, Mar 28, 2024 at 11:50 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yeah, I was thinking about that too.  The normal case is that you
> don't hold any releasable resources except locks when arriving at
> CommitSubTransaction --- if you do, it's a bug and we're going to
> print leak warnings.  Seems like maybe it'd be worth trying to
> have a fast path for that case.

Well, there's the abort case, too, which I think is almost equally important.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: [PATCH] plpython function causes server panic

От

Tom Lane

Дата:

28 марта, 19:01:37

Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Mar 28, 2024 at 11:50 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Yeah, I was thinking about that too.  The normal case is that you
>> don't hold any releasable resources except locks when arriving at
>> CommitSubTransaction --- if you do, it's a bug and we're going to
>> print leak warnings.  Seems like maybe it'd be worth trying to
>> have a fast path for that case.

> Well, there's the abort case, too, which I think is almost equally important.

True, but in the abort case there probably *are* resources to be
cleaned up, so I'm not seeing that the fast-path idea helps.
Although maybe the idea of batching multiple cleanups would?

            regards, tom lane

Re: [PATCH] plpython function causes server panic

От

Robert Haas

Дата:

28 марта, 19:27:55

On Thu, Mar 28, 2024 at 12:01 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Well, there's the abort case, too, which I think is almost equally important.
>
> True, but in the abort case there probably *are* resources to be
> cleaned up, so I'm not seeing that the fast-path idea helps.
> Although maybe the idea of batching multiple cleanups would?

Yes, I think we should be trying to optimize for the case where the
(sub)transaction being cleaned up holds a small but non-zero number of
resources. I think if we just optimize the case where it's exactly
zero, there will be enough cases where the optimization doesn't apply
that we'll feel like we haven't really solved the problem. Whether the
specific idea of trying to batch the cleanups could be made to help
enough to matter, I'm not quite sure. Another idea I had at one point
was to have some kind of bitmask where each bit tells you whether or
not one particular resource type might be held, so that
{Commit,Abort}{Sub,}Transaction would end up doing a bunch of stuff
like if (ResourcesNeedingCleanup & MIGHT_HOLD_THINGY)
AtEO(Sub)Xact_Thingy(...). But I wasn't sure that would really move
the needle, either. This seems to be one of those annoying cases where
the problem is much more obvious than the solution.

--
Robert Haas
EDB: http://www.enterprisedb.com

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: [PATCH] plpython function causes server panic

Вложения