Обсуждение: 024_add_drop_pub.pl might fail due to deadlock

Поиск
Список
Период
Сортировка

024_add_drop_pub.pl might fail due to deadlock

От
Alexander Lakhin
Дата:
Hello hackers,

The recent buildfarm failure [1] on REL_15_STABLE with the following
diagnostics:
# Looks like your test exited with 29 just after 1.
t/024_add_drop_pub.pl ..............
Dubious, test returned 29 (wstat 7424, 0x1d00)

pgsql.build/src/test/subscription/tmp_check/log/regress_log_024_add_drop_pub
[21:01:34.406](16.501s) ok 1 - check initial data is copied to subscriber
error running SQL: 'psql:<stdin>:1: ERROR:  deadlock detected
DETAIL:  Process 219632 waits for ExclusiveLock on relation 6000 of database 0; blocked by process 218369.
Process 218369 waits for AccessShareLock on object 16387 of class 6100 of database 0; blocked by process 219632.
HINT:  See server log for query details.'
while running 'psql -XAtq -d port=14957 host=/home/bf/bf-build/petalura/tmp/bGI6HuRtfa dbname='postgres' -f - -v 
ON_ERROR_STOP=1' with sql 'ALTER SUBSCRIPTION tap_sub DROP PUBLICATION tap_pub_1' at 
/home/bf/bf-build/petalura/REL_15_STABLE/pgsql.build/../pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm line 1951.

pgsql.build/src/test/subscription/tmp_check/log/024_add_drop_pub_subscriber.log
2025-07-01 21:01:32.682 CEST [218369][logical replication worker][3/6:0] LOG:  logical replication apply worker for 
subscription "tap_sub" has started
...
2025-07-01 21:01:34.771 CEST [219632][client backend][4/14:0] LOG: statement: ALTER SUBSCRIPTION tap_sub DROP 
PUBLICATION tap_pub_1
2025-07-01 21:01:37.355 CEST [219632][client backend][4/14:731] ERROR:  deadlock detected
2025-07-01 21:01:37.355 CEST [219632][client backend][4/14:731] DETAIL:  Process 219632 waits for ExclusiveLock on 
relation 6000 of database 0; blocked by process 218369.
     Process 218369 waits for AccessShareLock on object 16387 of class 6100 of database 0; blocked by process 219632.
     Process 219632: ALTER SUBSCRIPTION tap_sub DROP PUBLICATION tap_pub_1
     Process 218369: <command string not enabled>
2025-07-01 21:01:37.355 CEST [219632][client backend][4/14:731] HINT:  See server log for query details.
2025-07-01 21:01:37.355 CEST [219632][client backend][4/14:731] STATEMENT:  ALTER SUBSCRIPTION tap_sub DROP PUBLICATION

tap_pub_1

shows that the test can fail due to deadlock on accessing
pg_replication_origin (relation 6000).

This failure can be easily reproduced with:
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -428,6 +428,7 @@ replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait)
          * the specific origin and then re-check if the origin still exists.
          */
         rel = table_open(ReplicationOriginRelationId, ExclusiveLock);
+pg_usleep(300000);

Not reproduced on REL_16_STABLE (since f6c5edb8a), nor in v14- (because
024_add_drop_pub.pl was added in v15).

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=petalura&dt=2025-07-01%2018%3A00%3A58

Best regards,
Alexander



Re: 024_add_drop_pub.pl might fail due to deadlock

От
Ajin Cherian
Дата:
On Sun, Jul 6, 2025 at 2:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
>
> --- a/src/backend/replication/logical/origin.c
> +++ b/src/backend/replication/logical/origin.c
> @@ -428,6 +428,7 @@ replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait)
>           * the specific origin and then re-check if the origin still exists.
>           */
>          rel = table_open(ReplicationOriginRelationId, ExclusiveLock);
> +pg_usleep(300000);
>
> Not reproduced on REL_16_STABLE (since f6c5edb8a), nor in v14- (because
> 024_add_drop_pub.pl was added in v15).
>
> [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=petalura&dt=2025-07-01%2018%3A00%3A58
>
> Best regards,
> Alexander
>

Hi Alexander,

Yes, the problem can be reproduced by the changes you suggested. I
will look into what is happening and how we can fix this.

regards,
Ajin Cherian
Fujitsu Australia



Re: 024_add_drop_pub.pl might fail due to deadlock

От
Ajin Cherian
Дата:
On Mon, Jul 7, 2025 at 8:15 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Sun, Jul 6, 2025 at 2:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> >
> > --- a/src/backend/replication/logical/origin.c
> > +++ b/src/backend/replication/logical/origin.c
> > @@ -428,6 +428,7 @@ replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait)
> >           * the specific origin and then re-check if the origin still exists.
> >           */
> >          rel = table_open(ReplicationOriginRelationId, ExclusiveLock);
> > +pg_usleep(300000);
> >
> > Not reproduced on REL_16_STABLE (since f6c5edb8a), nor in v14- (because
> > 024_add_drop_pub.pl was added in v15).
> >
> > [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=petalura&dt=2025-07-01%2018%3A00%3A58
> >
> > Best regards,
> > Alexander
> >
>
> Hi Alexander,
>
> Yes, the problem can be reproduced by the changes you suggested. I
> will look into what is happening and how we can fix this.

The issue appears to be a deadlock caused by inconsistent lock
acquisition order between two processes:

Process A (executing ALTER SUBSCRIPTION tap_sub DROP PUBLICATION tap_pub_1):
In AlterSubscription_refresh(), it first acquires an
AccessExclusiveLock on SubscriptionRelRelationId (resource 1), then
later tries to acquire an ExclusiveLock on ReplicationOriginRelationId
(resource 2).

Process B (apply worker):
In process_syncing_tables_for_apply(), it first acquires an
ExclusiveLock on ReplicationOriginRelationId (resource 2), then calls
UpdateSubscriptionRelState(), which tries to acquire a AccessShareLock
on SubscriptionRelRelationId (resource 1).

This leads to a deadlock:
Process A holds a lock on resource 1 and waits for resource 2, while
process B holds a lock on resource 2 and waits for resource 1.

Proposed fix:
In process_syncing_tables_for_apply(), acquire an AccessExclusiveLock
on SubscriptionRelRelationId before acquiring the lock on
ReplicationOriginRelationId.

Patch with fix attached.
I'll continue investigating whether this issue also affects HEAD.

regards,
Ajin Cherian
Fujitsu Australia.

Вложения

Re: 024_add_drop_pub.pl might fail due to deadlock

От
Ajin Cherian
Дата:
On Tue, Jul 8, 2025 at 8:41 PM Ajin Cherian <itsajin@gmail.com> wrote:

> Proposed fix:
> In process_syncing_tables_for_apply(), acquire an AccessExclusiveLock
> on SubscriptionRelRelationId before acquiring the lock on
> ReplicationOriginRelationId.
>
> Patch with fix attached.
> I'll continue investigating whether this issue also affects HEAD.

While debugging this further, I found that there is another lock taken
prior to this in AlterSubscription(),
    /* Lock the subscription so nobody else can do anything with it. */
    LockSharedObject(SubscriptionRelationId, subid, 0, AccessExclusiveLock);

Since this is the first lock taken while altering subscription, that
should also be taken first by the tablesync worker to avoid deadlock.
So, attaching a modified patch.

regards,
Ajin Cherian
Fujitsu Australia

Вложения

Re: 024_add_drop_pub.pl might fail due to deadlock

От
Ajin Cherian
Дата:
On Tue, Jul 8, 2025 at 8:41 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> Patch with fix attached.
> I'll continue investigating whether this issue also affects HEAD.
>

While debugging if this problem can occur on HEAD, I found out that on
head, it is mostly the tablesync worker that drops the origin on HEAD
and since the tablesysnc worker does not attempt to update the
SubscriptionRel state in that process, there doesn't seem to be the
possibility of a deadlock. But there is a rare situation where the
tablesync worker could crash or get an error just prior to dropping
the origin, then the origin is dropped in the apply worker (this is
explained in the comments in process_syncing_tables_for_sync()). If
the origin has to be dropped in the apply worker, then the same
deadlock can happen in HEAD code as well. I was able to simulate this
by using an injection point to create an error on the tablesync worker
and then the similar deadlock happens on HEAD as well. Attaching a
patch for fixing this on HEAD as well.

regards,
Ajin Cherian
Fujitsu Australia

Вложения

Re: 024_add_drop_pub.pl might fail due to deadlock

От
vignesh C
Дата:
On Mon, 14 Jul 2025 at 15:46, Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Tue, Jul 8, 2025 at 8:41 PM Ajin Cherian <itsajin@gmail.com> wrote:
> >
> > Patch with fix attached.
> > I'll continue investigating whether this issue also affects HEAD.
> >
>
> While debugging if this problem can occur on HEAD, I found out that on
> head, it is mostly the tablesync worker that drops the origin on HEAD
> and since the tablesysnc worker does not attempt to update the
> SubscriptionRel state in that process, there doesn't seem to be the
> possibility of a deadlock. But there is a rare situation where the
> tablesync worker could crash or get an error just prior to dropping
> the origin, then the origin is dropped in the apply worker (this is
> explained in the comments in process_syncing_tables_for_sync()). If
> the origin has to be dropped in the apply worker, then the same
> deadlock can happen in HEAD code as well. I was able to simulate this
> by using an injection point to create an error on the tablesync worker
> and then the similar deadlock happens on HEAD as well. Attaching a
> patch for fixing this on HEAD as well.

I was able to reproduce the deadlock on HEAD as well using the
attached patch, which introduces a delay in the tablesync worker
before dropping the replication origin by adding a sleep of a few
seconds. During this delay, the apply worker also attempts to drop the
replication origin. If an ALTER SUBSCRIPTION command is executed
concurrently, a deadlock frequently occurs:
2025-07-14 15:59:53.572 IST [141100] DETAIL:  Process 141100 waits for
AccessExclusiveLock on object 2 of class 6000 of database 0; blocked
by process 140974.
Process 140974 waits for AccessShareLock on object 16396 of class 6100
of database 0; blocked by process 141100.
Process 141100: alter subscription sub1 drop publication pub1
Process 140974: <command string not enabled>

After apply the attached patch, create the logical replication setup
for a publication pub1 having table t1 and then run the following
commands in a loop:
alter subscription sub1 drop publication pub1;
alter subscription sub1 add publication pub1;
sleep 4

Regards,
Vignesh

Вложения

Re: 024_add_drop_pub.pl might fail due to deadlock

От
vignesh C
Дата:
On Mon, 14 Jul 2025 at 16:15, vignesh C <vignesh21@gmail.com> wrote:
>
> On Mon, 14 Jul 2025 at 15:46, Ajin Cherian <itsajin@gmail.com> wrote:
> >
> > On Tue, Jul 8, 2025 at 8:41 PM Ajin Cherian <itsajin@gmail.com> wrote:
> > >
> > > Patch with fix attached.
> > > I'll continue investigating whether this issue also affects HEAD.
> > >
> >
> > While debugging if this problem can occur on HEAD, I found out that on
> > head, it is mostly the tablesync worker that drops the origin on HEAD
> > and since the tablesysnc worker does not attempt to update the
> > SubscriptionRel state in that process, there doesn't seem to be the
> > possibility of a deadlock. But there is a rare situation where the
> > tablesync worker could crash or get an error just prior to dropping
> > the origin, then the origin is dropped in the apply worker (this is
> > explained in the comments in process_syncing_tables_for_sync()). If
> > the origin has to be dropped in the apply worker, then the same
> > deadlock can happen in HEAD code as well. I was able to simulate this
> > by using an injection point to create an error on the tablesync worker
> > and then the similar deadlock happens on HEAD as well. Attaching a
> > patch for fixing this on HEAD as well.
>
> I was able to reproduce the deadlock on HEAD as well using the
> attached patch, which introduces a delay in the tablesync worker
> before dropping the replication origin by adding a sleep of a few
> seconds. During this delay, the apply worker also attempts to drop the
> replication origin. If an ALTER SUBSCRIPTION command is executed
> concurrently, a deadlock frequently occurs:
> 2025-07-14 15:59:53.572 IST [141100] DETAIL:  Process 141100 waits for
> AccessExclusiveLock on object 2 of class 6000 of database 0; blocked
> by process 140974.
> Process 140974 waits for AccessShareLock on object 16396 of class 6100
> of database 0; blocked by process 141100.
> Process 141100: alter subscription sub1 drop publication pub1
> Process 140974: <command string not enabled>
>
> After apply the attached patch, create the logical replication setup
> for a publication pub1 having table t1 and then run the following
> commands in a loop:
> alter subscription sub1 drop publication pub1;
> alter subscription sub1 add publication pub1;
> sleep 4

Attached is the script used to reproduce the issue and the deadlock
logs for the same. Your patch fixes the issue.
Couple of comments:
1) This change is not required:
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/usercontext.h"
+#include "utils/injection_point.h"

2) This can not only happen in error case but also in normal cases
where the tablesync worker is slower as shown in the script to
reproduce, we can update the commit message accordingly:
In most situations the tablesync worker will drop the corresponding
origin before it
finishes executing, but if an error causes the tablesync worker to
fail just prior to
dropping the origin, the apply worker will later find the origin and drop it.

Regards,
Vignesh

Вложения

Re: 024_add_drop_pub.pl might fail due to deadlock

От
Ajin Cherian
Дата:
On Tue, Jul 15, 2025 at 2:24 PM vignesh C <vignesh21@gmail.com> wrote:
>
> Couple of comments:
> 1) This change is not required:
>  #include "utils/snapmgr.h"
>  #include "utils/syscache.h"
>  #include "utils/usercontext.h"
> +#include "utils/injection_point.h"
>
> 2) This can not only happen in error case but also in normal cases
> where the tablesync worker is slower as shown in the script to
> reproduce, we can update the commit message accordingly:
> In most situations the tablesync worker will drop the corresponding
> origin before it
> finishes executing, but if an error causes the tablesync worker to
> fail just prior to
> dropping the origin, the apply worker will later find the origin and drop it.

Thanks for the test and confirming the fix. Fixed the comments.

regards,
Ajin Cherian
Fujitsu Australia

Вложения

Re: 024_add_drop_pub.pl might fail due to deadlock

От
Amit Kapila
Дата:
On Wed, Jul 16, 2025 at 8:38 AM Ajin Cherian <itsajin@gmail.com> wrote:
>
> Thanks for the test and confirming the fix. Fixed the comments.
>

         * origin. So passing missing_ok = true.
+         *
+         * Also lock SubscriptionRelationId with AccessShareLock to
+         * prevent any deadlocks with the user concurrently performing
+         * refresh on the subscription.
         */
+        LockSharedObject(SubscriptionRelationId, MyLogicalRepWorker->subid,
+                 0, AccessShareLock);

It seems the patch assumes that the above lock is sufficient because
AlterSubscription will take an AcessExclusiveLock on the same
subscription. So, with this proposal, if both of those become
serialized then the other locking order may not matter. Am I correct
or is there any other theory you have in mind?

If that is the theory then I think we are missing cases where the
apply worker and Alter subscription operates on different
subscriptions.

Consider AlterSubscription_refresh() takes first AccessExclusiveLock
on SubscriptionRelRelationId and then ExclusiveLock on
ReplicationOriginRelationId via replorigin_drop_by_name() . The apply
worker first takes ExclusiveLock on ReplicationOriginRelationId via
replorigin_drop_by_name() and then RowExclusiveLock on
SubscriptionRelRelationId via UpdateSubscriptionRelState(). Won't such
a scenario taking conflicting locks in reverse order can lead to
deadlock at least in PG15?

--
With Regards,
Amit Kapila.



Re: 024_add_drop_pub.pl might fail due to deadlock

От
Ajin Cherian
Дата:
On Thu, Jul 17, 2025 at 4:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>

> It seems the patch assumes that the above lock is sufficient because
> AlterSubscription will take an AcessExclusiveLock on the same
> subscription. So, with this proposal, if both of those become
> serialized then the other locking order may not matter. Am I correct
> or is there any other theory you have in mind?
>
> If that is the theory then I think we are missing cases where the
> apply worker and Alter subscription operates on different
> subscriptions.
>
> Consider AlterSubscription_refresh() takes first AccessExclusiveLock
> on SubscriptionRelRelationId and then ExclusiveLock on
> ReplicationOriginRelationId via replorigin_drop_by_name() . The apply
> worker first takes ExclusiveLock on ReplicationOriginRelationId via
> replorigin_drop_by_name() and then RowExclusiveLock on
> SubscriptionRelRelationId via UpdateSubscriptionRelState(). Won't such
> a scenario taking conflicting locks in reverse order can lead to
> deadlock at least in PG15?

Yes, this is correct. I have also verified this in my testing that
when there is a second subscription, a deadlock can still happen with
my previous patch. I have now modified the patch in tablesync worker
to take locks on both SubscriptionRelationId and
SubscriptionRelRelationId prior to taking lock on
ReplicationOriginRelationId. I have also found that there is a similar
problem in DropSubscription() where lock on SubscriptionRelRelationId
is not taken before dropping the tracking origin. I've also modified
the signature of UpdateSubscriptionRelState to take a bool
"lock_needed" which if false, the SubscriptionRelationId is not
locked. I've made a new version of the patch on PG_15.

regards,
Ajin Cherian
Fujitsu Australia

Вложения

Re: 024_add_drop_pub.pl might fail due to deadlock

От
Ajin Cherian
Дата:
> locked. I've made a new version of the patch on PG_15.

I've made a similar fix on HEAD just so that the code is now
consistent. I don't think the similar problem (deadlock between two
different subscriptions trying to drop tracking origin)  occurs on
HEAD with my previous patch, as the way origins are dropped are
different on HEAD. On HEAD, while dropping origin, a RowExclusiveLock
lock is taken on ReplicationOriginRelationId and then an
AccessExclusiveLock is taken on the particular origin. Since the
particular origin will be different on different subscriptions, the
similar deadlock will not happen. But just to keep code consistent, I
have made a similar fix.

regards,
Ajin Cherian
Fujitsu Australia

Вложения

Re: 024_add_drop_pub.pl might fail due to deadlock

От
Amit Kapila
Дата:
On Mon, Jul 21, 2025 at 6:59 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> Yes, this is correct. I have also verified this in my testing that
> when there is a second subscription, a deadlock can still happen with
> my previous patch. I have now modified the patch in tablesync worker
> to take locks on both SubscriptionRelationId and
> SubscriptionRelRelationId prior to taking lock on
> ReplicationOriginRelationId. I have also found that there is a similar
> problem in DropSubscription() where lock on SubscriptionRelRelationId
> is not taken before dropping the tracking origin. I've also modified
> the signature of UpdateSubscriptionRelState to take a bool
> "lock_needed" which if false, the SubscriptionRelationId is not
> locked. I've made a new version of the patch on PG_15.
>

Review comments:
================
1.
+ if (!rel)
+ rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);

Why did you acquire AccessExclusiveLock here when the current code has
RowExclusiveLock? It should be RowExclusiveLock.

2.
+ *
+ * Lock SubscriptionRelationId with AccessShareLock and
+ * take AccessExclusiveLock on SubscriptionRelRelationId to
+ * prevent any deadlocks with user concurrently performing
+ * refresh on the subscription.
  */

Try to tell in the comments that we want to keep the locking order
same as DDL commands to prevent deadlocks.

3.
+ * Also close any tables prior to the commit.
  */
+ if (rel)
+ {
+ table_close(rel, AccessExclusiveLock);
+ rel = NULL;

We don't need to explicitly release lock on table_close, it will be
done at transaction end, so use NoLock here as we do in current HEAD
code.

4.
DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel)
 {
- Relation rel;
+ Relation rel, sub_rel;
  ObjectAddress myself;
  HeapTuple tup;
  Oid subid;
@@ -1460,7 +1460,13 @@ DropSubscription(DropSubscriptionStmt *stmt,
bool isTopLevel)
  * Note that the state can't change because we have already stopped both
  * the apply and tablesync workers and they can't restart because of
  * exclusive lock on the subscription.
+ *
+ * Lock pg_subscription_rel with AccessExclusiveLock to prevent any
+ * deadlock with apply workers of other subscriptions trying
+ * to drop tracking origin.
  */
+ sub_rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);

I don't think we need AccessExclusiveLock on SubscriptionRelRelationId
in DropSubscription. Kindly test again after fixing the first comment
above.
--
With Regards,
Amit Kapila.



RE: 024_add_drop_pub.pl might fail due to deadlock

От
"Hayato Kuroda (Fujitsu)"
Дата:
Dear Ajin,

Thanks for the patch. Firstly let me confirm my understanding. While altering the
subscription, locks are acquired with below ordering:

Order    target                        level
1        pg_subscription                row exclusive
2        pg_subscription, my tuple    access exclusive
3        pg_subscription_rel            access exclusive
4        pg_replication_origin        excluive

In contrast, apply worker works like:

Order    target                        level
1        pg_replication_origin        excluive
2        pg_subscription, my tuple    access share
3        pg_subscrition_rel            row exclusive

Thus a backend may wait at step 4, and apply worker can stuck at step 2 or 3.

Below are my comments:

```
@@ -340,7 +341,6 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
                 * is dropped. So passing missing_ok = false.
                 */
                ReplicationSlotDropAtPubNode(LogRepWorkerWalRcvConn, syncslotname, false);
-
```
This change is not needed.

```
+                               if (!rel)
+                                       rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);
+
```

I feel it is too strong, isn't it enough to use row exclusive as initially used?

```
 UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
-                                                  XLogRecPtr sublsn)
+                                                  XLogRecPtr sublsn, bool lock_needed)
```

I feel the name `lock_needed` is bit misleading, because the function still opens
the pg_subscription_rel catalog with row exclusive. How about "lock_shared_object"?

```
@@ -1460,7 +1460,13 @@ DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel)
         * Note that the state can't change because we have already stopped both
         * the apply and tablesync workers and they can't restart because of
         * exclusive lock on the subscription.
+        *
+        * Lock pg_subscription_rel with AccessExclusiveLock to prevent any
+        * deadlock with apply workers of other subscriptions trying
+        * to drop tracking origin.
         */
+       sub_rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);
```

Hmm. Per my understanding, DropSubscription acquires below locks till it reaches
replorigin_drop_by_name().

Order    target                        level
1        pg_subscription                access exclusive
2        pg_subscription, my tuple    access exclusive
3        pg_replication_origin        excluive

IIUC we must preserve the ordering, not the target of locks.

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Re: 024_add_drop_pub.pl might fail due to deadlock

От
Amit Kapila
Дата:
On Wed, Jul 23, 2025 at 2:42 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Ajin,
>
> Thanks for the patch. Firstly let me confirm my understanding. While altering the
> subscription, locks are acquired with below ordering:
>
> Order   target                                          level
> 1               pg_subscription                         row exclusive
> 2               pg_subscription, my tuple       access exclusive
> 3               pg_subscription_rel                     access exclusive
> 4               pg_replication_origin           excluive
>
> In contrast, apply worker works like:
>
> Order   target                                          level
> 1               pg_replication_origin           excluive
> 2               pg_subscription, my tuple       access share
> 3               pg_subscrition_rel                      row exclusive
>
> Thus a backend may wait at step 4, and apply worker can stuck at step 2 or 3.
>
> Below are my comments:
>
> ```
> @@ -340,7 +341,6 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
>                  * is dropped. So passing missing_ok = false.
>                  */
>                 ReplicationSlotDropAtPubNode(LogRepWorkerWalRcvConn, syncslotname, false);
> -
> ```
> This change is not needed.
>
> ```
> +                               if (!rel)
> +                                       rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);
> +
> ```
>
> I feel it is too strong, isn't it enough to use row exclusive as initially used?
>
> ```
>  UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
> -                                                  XLogRecPtr sublsn)
> +                                                  XLogRecPtr sublsn, bool lock_needed)
> ```
>
> I feel the name `lock_needed` is bit misleading, because the function still opens
> the pg_subscription_rel catalog with row exclusive. How about "lock_shared_object"?
>

I think if we lock in a caller, we don't need to use any lock during
table_open. We can use the parameter name as already_locked as we do
at some other places in the code.

--
With Regards,
Amit Kapila.



RE: 024_add_drop_pub.pl might fail due to deadlock

От
"Hayato Kuroda (Fujitsu)"
Дата:
> Dear Ajin,
> 
> Thanks for the patch. Firstly let me confirm my understanding. While altering the
> subscription, locks are acquired with below ordering:
>

I forgot to confirm one point. For which branch should be backpatch? Initially
it was reported only on PG15 [1], but I found 021_alter_sub_pub could fail on PG14.
Regarding the PG13, it may not be affected because the replication origin seemed
not to be used for the table sync. It was introduced for ce0fdbfe97.

[1]: https://www.postgresql.org/message-id/bab95e12-6cc5-4ebb-80a8-3e41956aa297%40gmail.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Re: 024_add_drop_pub.pl might fail due to deadlock

От
Ajin Cherian
Дата:
On Wed, Jul 23, 2025 at 4:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jul 21, 2025 at 6:59 PM Ajin Cherian <itsajin@gmail.com> wrote:
> >
> > Yes, this is correct. I have also verified this in my testing that
> > when there is a second subscription, a deadlock can still happen with
> > my previous patch. I have now modified the patch in tablesync worker
> > to take locks on both SubscriptionRelationId and
> > SubscriptionRelRelationId prior to taking lock on
> > ReplicationOriginRelationId. I have also found that there is a similar
> > problem in DropSubscription() where lock on SubscriptionRelRelationId
> > is not taken before dropping the tracking origin. I've also modified
> > the signature of UpdateSubscriptionRelState to take a bool
> > "lock_needed" which if false, the SubscriptionRelationId is not
> > locked. I've made a new version of the patch on PG_15.
> >
>
> Review comments:
> ================
> 1.
> + if (!rel)
> + rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);
>
> Why did you acquire AccessExclusiveLock here when the current code has
> RowExclusiveLock? It should be RowExclusiveLock.
>

Yes, you are correct. I have replaced it with RowExclusiveLock.

> 2.
> + *
> + * Lock SubscriptionRelationId with AccessShareLock and
> + * take AccessExclusiveLock on SubscriptionRelRelationId to
> + * prevent any deadlocks with user concurrently performing
> + * refresh on the subscription.
>   */
>
> Try to tell in the comments that we want to keep the locking order
> same as DDL commands to prevent deadlocks.
>

Modified.

> 3.
> + * Also close any tables prior to the commit.
>   */
> + if (rel)
> + {
> + table_close(rel, AccessExclusiveLock);
> + rel = NULL;
>
> We don't need to explicitly release lock on table_close, it will be
> done at transaction end, so use NoLock here as we do in current HEAD
> code.
>

Done.

> 4.
> DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel)
>  {
> - Relation rel;
> + Relation rel, sub_rel;
>   ObjectAddress myself;
>   HeapTuple tup;
>   Oid subid;
> @@ -1460,7 +1460,13 @@ DropSubscription(DropSubscriptionStmt *stmt,
> bool isTopLevel)
>   * Note that the state can't change because we have already stopped both
>   * the apply and tablesync workers and they can't restart because of
>   * exclusive lock on the subscription.
> + *
> + * Lock pg_subscription_rel with AccessExclusiveLock to prevent any
> + * deadlock with apply workers of other subscriptions trying
> + * to drop tracking origin.
>   */
> + sub_rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);
>
> I don't think we need AccessExclusiveLock on SubscriptionRelRelationId
> in DropSubscription. Kindly test again after fixing the first comment
> above.
> --

Yes, it was failing because I was taking AccessExclusiveLock in the
apply worker, and that was causing the deadlock in my testing. I
tested this worked.



On Wed, Jul 23, 2025 at 7:12 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Ajin,
>
> Thanks for the patch. Firstly let me confirm my understanding. While altering the
> subscription, locks are acquired with below ordering:
>
> Order   target                                          level
> 1               pg_subscription                         row exclusive
> 2               pg_subscription, my tuple       access exclusive
> 3               pg_subscription_rel                     access exclusive
> 4               pg_replication_origin           excluive
>
> In contrast, apply worker works like:
>
> Order   target                                          level
> 1               pg_replication_origin           excluive
> 2               pg_subscription, my tuple       access share
> 3               pg_subscrition_rel                      row exclusive
>
> Thus a backend may wait at step 4, and apply worker can stuck at step 2 or 3.
>

Yes, that is correct.


> Below are my comments:
>
> ```
> @@ -340,7 +341,6 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
>                  * is dropped. So passing missing_ok = false.
>                  */
>                 ReplicationSlotDropAtPubNode(LogRepWorkerWalRcvConn, syncslotname, false);
> -
> ```
> This change is not needed.

Removed.

>
> ```
> +                               if (!rel)
> +                                       rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);
> +
> ```
>
> I feel it is too strong, isn't it enough to use row exclusive as initially used?
>

Yes, agree. Fixed.

> ```
>  UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
> -                                                  XLogRecPtr sublsn)
> +                                                  XLogRecPtr sublsn, bool lock_needed)
> ```
>
> I feel the name `lock_needed` is bit misleading, because the function still opens
> the pg_subscription_rel catalog with row exclusive. How about "lock_shared_object"?
>

I have modified it to not take lock while table_open as well and
changed the name of the variable to already_locked.

> ```
> @@ -1460,7 +1460,13 @@ DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel)
>          * Note that the state can't change because we have already stopped both
>          * the apply and tablesync workers and they can't restart because of
>          * exclusive lock on the subscription.
> +        *
> +        * Lock pg_subscription_rel with AccessExclusiveLock to prevent any
> +        * deadlock with apply workers of other subscriptions trying
> +        * to drop tracking origin.
>          */
> +       sub_rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);
> ```
>
> Hmm. Per my understanding, DropSubscription acquires below locks till it reaches
> replorigin_drop_by_name().
>
> Order   target                                          level
> 1               pg_subscription                         access exclusive
> 2               pg_subscription, my tuple       access exclusive
> 3               pg_replication_origin           excluive
>
> IIUC we must preserve the ordering, not the target of locks.

I have removed this change as this does not now conflict with the apply worker.

Two patches attached. One for HEAD and the other for PG_15.

regards,
Ajin Cherian
Fujitsu Australia

Вложения

Re: 024_add_drop_pub.pl might fail due to deadlock

От
Ajin Cherian
Дата:
On Wed, Jul 23, 2025 at 8:01 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> > Dear Ajin,
> >
> > Thanks for the patch. Firstly let me confirm my understanding. While altering the
> > subscription, locks are acquired with below ordering:
> >
>
> I forgot to confirm one point. For which branch should be backpatch? Initially
> it was reported only on PG15 [1], but I found 021_alter_sub_pub could fail on PG14.

Yes, here's a patch for PG14 as well, based on REL_14_STABLE.

regards,
Ajin Cherian
Fujitsu Australia

Вложения

RE: 024_add_drop_pub.pl might fail due to deadlock

От
"Hayato Kuroda (Fujitsu)"
Дата:
Dear Ajin,

Thanks for patches! While checking, I recalled that the backpatch policy [1].
We must not modify definitions of opened functions but this does. Can you define
another function like UpdateSubscriptionRelStateEx or something on the back
branches?

Another comment:
```
@@ -328,9 +328,13 @@ UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
     Datum        values[Natts_pg_subscription_rel];
     bool        replaces[Natts_pg_subscription_rel];
 
-    LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
-
-    rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+    if (already_locked)
+        rel = table_open(SubscriptionRelRelationId, NoLock);
```

Can we assert that RowExclusiveLock for pg_subscription_rel has already been
acquired, by using CheckRelationOidLockedByMe() family?

Also, I'm now bit confusing for which branches are really affected. Can you
create all patches for branches, attach at the same e-mail and add some summary
what you want to fix?
E.g., you provided a patch for HEAD/PG15/PG14, what about PG18, 17, 16 and 13?
If not needed, why?

[1]: https://www.postgresql.org/docs/devel/xfunc-c.html#XFUNC-API-ABI-STABILITY-GUIDANCE

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Re: 024_add_drop_pub.pl might fail due to deadlock

От
vignesh C
Дата:
On Thu, 24 Jul 2025 at 17:45, Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Wed, Jul 23, 2025 at 8:01 PM Hayato Kuroda (Fujitsu)
> <kuroda.hayato@fujitsu.com> wrote:
> >
> > > Dear Ajin,
> > >
> > > Thanks for the patch. Firstly let me confirm my understanding. While altering the
> > > subscription, locks are acquired with below ordering:
> > >
> >
> > I forgot to confirm one point. For which branch should be backpatch? Initially
> > it was reported only on PG15 [1], but I found 021_alter_sub_pub could fail on PG14.
>
> Yes, here's a patch for PG14 as well, based on REL_14_STABLE.
>

I believe the patch is trying the address the following issues reported:
1) 024_add_drop_pub.pl test failure reported on REL_16_STABLE at [1]
2) Different variation of the above issue on head with the script
attached at [2]
3) Amit reported different variant of it for PG15 with the patch at [3]

I felt these issues are not applicable to the PG13 branch as
Replication origin creation for table sync is not there in the PG13
branch. So the fix is required from master to PG14 branches.

The patch does not apply on the PG16 branch.

In PG15 you have the following code:
+       /* Close table if opened */
+       if (rel)
+       {
+               table_close(rel, NoLock);
+       }

In master branch you have the following code:
+       /* Close table if opened */
+       if (rel)
+               table_close(rel, NoLock);
+
+

We can keep the fix consistent in both cases and additional newlines
not required in the master branch.

[1] - https://www.postgresql.org/message-id/bab95e12-6cc5-4ebb-80a8-3e41956aa297%40gmail.com
[2] - https://www.postgresql.org/message-id/CALDaNm3PrTkVc2uxMyQTkqw0sg7O6i0EXe1jJo9CzOyW2gFS%2BQ%40mail.gmail.com
[3] - https://www.postgresql.org/message-id/CAA4eK1KPa1dJrcd%3DXfOWx-r37eZudKQRqct0tY1R7vnUw0OabQ%40mail.gmail.com

Regards,
Vignesh



Re: 024_add_drop_pub.pl might fail due to deadlock

От
Ajin Cherian
Дата:
On Fri, Jul 25, 2025 at 6:01 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Ajin,
>
> Thanks for patches! While checking, I recalled that the backpatch policy [1].
> We must not modify definitions of opened functions but this does. Can you define
> another function like UpdateSubscriptionRelStateEx or something on the back
> branches?
>
> Another comment:
> ```
> @@ -328,9 +328,13 @@ UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
>         Datum           values[Natts_pg_subscription_rel];
>         bool            replaces[Natts_pg_subscription_rel];
>
> -       LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
> -
> -       rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
> +       if (already_locked)
> +               rel = table_open(SubscriptionRelRelationId, NoLock);
> ```
>
> Can we assert that RowExclusiveLock for pg_subscription_rel has already been
> acquired, by using CheckRelationOidLockedByMe() family?
>

Thanks for your review Kuroda-san, I have changed the logic to not use
already_locked and instead check if the locks are taken inside
UpdateSubscriptionRelState itself. I've tested this and this works
fine. If this logic is acceptable to the reviewers I can update the
other patches also in a similar way.

regards,
Ajin Cherian
Fujitsu Australia

Вложения

Re: 024_add_drop_pub.pl might fail due to deadlock

От
Ajin Cherian
Дата:
On Fri, Jul 25, 2025 at 6:01 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
> Also, I'm now bit confusing for which branches are really affected. Can you
> create all patches for branches, attach at the same e-mail and add some summary
> what you want to fix?
> E.g., you provided a patch for HEAD/PG15/PG14, what about PG18, 17, 16 and 13?
> If not needed, why?

Yes, patches are needed for all  these versions except PG13 because
replicating origin is not there in PG13. If this logic is fine, I will
create patches for all these branches as well and send them tomorrow.

regards,
Ajin Cherian
Fujitsu Australia



RE: 024_add_drop_pub.pl might fail due to deadlock

От
"Hayato Kuroda (Fujitsu)"
Дата:
Dear Ajin,

> Thanks for your review Kuroda-san, I have changed the logic to not use
> already_locked and instead check if the locks are taken inside
> UpdateSubscriptionRelState itself. I've tested this and this works
> fine. If this logic is acceptable to the reviewers I can update the
> other patches also in a similar way.

Thanks for updates.
However, I found that functions like LockHeldByMe(), CheckRelationOidLockedByMe()
and LWLockHeldByMe() have been used only for the debug build. Functions like
ProcArraySetReplicationSlotXmin() and MarkAsPrepared() can remove the flag from
the argument but they are retained till now.
Based on that, I suggest adding new argument (or add new Ex function for bank branches)
and do the assertion check when the assertion is enabled in this build. Thought?

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Re: 024_add_drop_pub.pl might fail due to deadlock

От
Amit Kapila
Дата:
On Tue, Jul 29, 2025 at 7:23 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> > Thanks for your review Kuroda-san, I have changed the logic to not use
> > already_locked and instead check if the locks are taken inside
> > UpdateSubscriptionRelState itself. I've tested this and this works
> > fine. If this logic is acceptable to the reviewers I can update the
> > other patches also in a similar way.
>
> Thanks for updates.
> However, I found that functions like LockHeldByMe(), CheckRelationOidLockedByMe()
> and LWLockHeldByMe() have been used only for the debug build. Functions like
> ProcArraySetReplicationSlotXmin() and MarkAsPrepared() can remove the flag from
> the argument but they are retained till now.
> Based on that, I suggest adding new argument (or add new Ex function for bank branches)
> and do the assertion check when the assertion is enabled in this build. Thought?
>

Yes, that makes sense to me. For HEAD and PG18, we can still add a new
argument to the API. For other bank branches, it is better to use a
new Ex function as suggested by Kuroda-San.

--
With Regards,
Amit Kapila.



Re: 024_add_drop_pub.pl might fail due to deadlock

От
Ajin Cherian
Дата:
On Tue, Jul 29, 2025 at 1:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Yes, that makes sense to me. For HEAD and PG18, we can still add a new
> argument to the API. For other bank branches, it is better to use a
> new Ex function as suggested by Kuroda-San.
>

Here are the updated patches.

I have added "Ex functions" for back branches (PG_17 and earlier) ,
which also have Asserts for making sure locks are taken. For PG_18 and
HEAD, I've used the extra parameter already_locked.
PG_14_15-v6-0001-Fix-a-deadlock-during-ALTER-SUBSCRIPTION.-DROP-PU.patch
is for both PG_14 and PG_15 and
PG_18_HEAD-v6-0001-Fix-a-possible-deadlock-during-ALTER-SUBSCRIPTION
is for both PG_18 and HEAD.

regards,
Ajin Cherian
Fujitsu Australia

Вложения

Re: 024_add_drop_pub.pl might fail due to deadlock

От
vignesh C
Дата:
On Tue, 29 Jul 2025 at 14:46, Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Tue, Jul 29, 2025 at 1:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Yes, that makes sense to me. For HEAD and PG18, we can still add a new
> > argument to the API. For other bank branches, it is better to use a
> > new Ex function as suggested by Kuroda-San.
> >
>
> Here are the updated patches.

I noticed the order of LockSharedObject and table lock is different
here compared to disable subscription:
@@ -492,7 +494,17 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
                                 * worker to remove the origin
tracking as if there is any
                                 * error while dropping we won't
restart it to drop the
                                 * origin. So passing missing_ok = true.
+                                *
+                                * Lock the subscription and origin in
the same order as we
+                                * are doing during DDL commands to
avoid deadlocks. See
+                                * AlterSubscription_refresh.
                                 */
+
LockSharedObject(SubscriptionRelationId, MyLogicalRepWorker->subid,
+                                                                0,
AccessShareLock);
+
+                               if (!rel)
+                                       rel =
table_open(SubscriptionRelRelationId, RowExclusiveLock);

DisableSubscription(Oid subid)
{
....
rel = table_open(SubscriptionRelationId, RowExclusiveLock);
tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(subid));

if (!HeapTupleIsValid(tup))
elog(ERROR, "cache lookup failed for subscription %u", subid);

LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
....

Do we need to enforce consistent lock ordering here, or is it safe to
ignore because we're only using AccessShareLock?

Regards,
Vignesh



Re: 024_add_drop_pub.pl might fail due to deadlock

От
Amit Kapila
Дата:
On Tue, Jul 29, 2025 at 4:26 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Tue, 29 Jul 2025 at 14:46, Ajin Cherian <itsajin@gmail.com> wrote:
> >
> > On Tue, Jul 29, 2025 at 1:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > Yes, that makes sense to me. For HEAD and PG18, we can still add a new
> > > argument to the API. For other bank branches, it is better to use a
> > > new Ex function as suggested by Kuroda-San.
> > >
> >
> > Here are the updated patches.
>
> I noticed the order of LockSharedObject and table lock is different
> here compared to disable subscription:
>

Note that catalog tables are not the same. The DisableSubscription()
takes lock on pg_subscrition catalog and then on a particular
subscription using subscription_id. Here, we are first taking lock on
a particular subscription and then on pg_subscription_rel. So, this
seems to follow exactly the order which we should follow and I don't
see any problem here. Please let us know if you have a specific
concern.

--
With Regards,
Amit Kapila.



RE: 024_add_drop_pub.pl might fail due to deadlock

От
"Hayato Kuroda (Fujitsu)"
Дата:
Dear Ajin,

> I have added "Ex functions" for back branches (PG_17 and earlier) ,
> which also have Asserts for making sure locks are taken. For PG_18 and
> HEAD, I've used the extra parameter already_locked.
> PG_14_15-v6-0001-Fix-a-deadlock-during-ALTER-SUBSCRIPTION.-DROP-PU.p
> atch
> is for both PG_14 and PG_15 and
> PG_18_HEAD-v6-0001-Fix-a-possible-deadlock-during-ALTER-SUBSCRIPTION
> is for both PG_18 and HEAD.
>

Thanks for creating the patch!
I feel the existing function here should be the wrapper of Ex function, so attached
patch did not match the expectation. table_open() and table_close() are called
both in UpdateSubscriptionRelState() and UpdateSubscriptionRelStateEx().
Another issue is that the variable "tag" is used by SET_LOCKTAG_OBJECT() even
without the debug build.

How do you feel the .diff file can be applied atop PG17 patch? It is mainly
same as v4 patch but has some assertion.

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Вложения

RE: 024_add_drop_pub.pl might fail due to deadlock

От
"Hayato Kuroda (Fujitsu)"
Дата:
> How do you feel the .diff file can be applied atop PG17 patch? It is mainly
> same as v4 patch but has some assertion.

Best regards,
Hayato Kuroda
FUJITSU LIMITED


RE: 024_add_drop_pub.pl might fail due to deadlock

От
"Hayato Kuroda (Fujitsu)"
Дата:
> > How do you feel the .diff file can be applied atop PG17 patch? It is mainly
> > same as v4 patch but has some assertion.

Sorry for my interrupted message. I noticed only I attached old version patch.
PSA the correct version.

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Вложения

Re: 024_add_drop_pub.pl might fail due to deadlock

От
vignesh C
Дата:
On Tue, 29 Jul 2025 at 14:46, Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Tue, Jul 29, 2025 at 1:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Yes, that makes sense to me. For HEAD and PG18, we can still add a new
> > argument to the API. For other bank branches, it is better to use a
> > new Ex function as suggested by Kuroda-San.
> >
>
> Here are the updated patches.

The only case where the ordering of lock now is different is in
DropSubscription, but in this case we take an AccessExclusiveLock on
pg_subscription which should prevent this deadlock from occurring:
...
replorigin_drop_by_name(originname, true, false);
}

/* Clean up dependencies */
deleteSharedDependencyRecordsFor(SubscriptionRelationId, subid, 0);

/* Remove any associated relation synchronization states. */
RemoveSubscriptionRel(subid, InvalidOid);
...

Regards,
Vignesh



Re: 024_add_drop_pub.pl might fail due to deadlock

От
Ajin Cherian
Дата:
On Tue, Jul 29, 2025 at 10:45 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> > > How do you feel the .diff file can be applied atop PG17 patch? It is mainly
> > > same as v4 patch but has some assertion.
>
> Sorry for my interrupted message. I noticed only I attached old version patch.
> PSA the correct version.

Attaching the updated patches with the changes you requested. I've
also added the unchanged patches for PG_18 and HEAD (PG_18_HEAD-v6*),
so that everything is together in one mail.

regards,
Ajin Cherian
Fujitsu Australia

Вложения

RE: 024_add_drop_pub.pl might fail due to deadlock

От
"Hayato Kuroda (Fujitsu)"
Дата:
Dear Ajin,

> Attaching the updated patches with the changes you requested. I've
> also added the unchanged patches for PG_18 and HEAD (PG_18_HEAD-v6*),
> so that everything is together in one mail.

Thanks for update, but the patch for PG18/HEAD seemed not to have Assert().
Can you modify like others do?

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Re: 024_add_drop_pub.pl might fail due to deadlock

От
Ajin Cherian
Дата:
On Wed, Jul 30, 2025 at 10:33 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Ajin,
>
> > Attaching the updated patches with the changes you requested. I've
> > also added the unchanged patches for PG_18 and HEAD (PG_18_HEAD-v6*),
> > so that everything is together in one mail.
>
> Thanks for update, but the patch for PG18/HEAD seemed not to have Assert().
> Can you modify like others do?
>

Updated patch for PG18/HEAD.

regards,
Ajin Cherian
Fujitsu Australia

Вложения

RE: 024_add_drop_pub.pl might fail due to deadlock

От
"Hayato Kuroda (Fujitsu)"
Дата:
Dear Ajin,

Thanks for updates. I confirmed that reported issues could be fixed by your patch.
I have no comments anymore.

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Re: 024_add_drop_pub.pl might fail due to deadlock

От
vignesh C
Дата:
On Thu, 31 Jul 2025 at 08:23, Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Wed, Jul 30, 2025 at 10:33 PM Hayato Kuroda (Fujitsu)
> <kuroda.hayato@fujitsu.com> wrote:
> >
> > Dear Ajin,
> >
> > > Attaching the updated patches with the changes you requested. I've
> > > also added the unchanged patches for PG_18 and HEAD (PG_18_HEAD-v6*),
> > > so that everything is together in one mail.
> >
> > Thanks for update, but the patch for PG18/HEAD seemed not to have Assert().
> > Can you modify like others do?
> >
>
> Updated patch for PG18/HEAD.

How about we change the below:
+#ifdef USE_ASSERT_CHECKING
+               LOCKTAG tag;
+#endif
+
+               Assert(CheckRelationOidLockedByMe(SubscriptionRelRelationId,
+
           RowExclusiveLock, true));
+
+               rel = table_open(SubscriptionRelRelationId, NoLock);
+#ifdef USE_ASSERT_CHECKING
+               SET_LOCKTAG_OBJECT(tag, InvalidOid,
SubscriptionRelationId, subid, 0);
+               Assert(LockHeldByMe(&tag, AccessShareLock, true));
+#endif

to:
 #ifdef USE_ASSERT_CHECKING
        LOCKTAG tag;
        Assert(CheckRelationOidLockedByMe(SubscriptionRelRelationId,
                                          RowExclusiveLock, true));
        SET_LOCKTAG_OBJECT(tag, InvalidOid, SubscriptionRelationId, subid, 0);
        Assert(LockHeldByMe(&tag, AccessShareLock, true));
#endif

        rel = table_open(SubscriptionRelRelationId, NoLock);
    }

The rest looks good to me.

Regards,
Vignesh



Re: 024_add_drop_pub.pl might fail due to deadlock

От
Amit Kapila
Дата:
On Thu, Jul 31, 2025 at 2:37 PM vignesh C <vignesh21@gmail.com> wrote:
>
> How about we change the below:
> +#ifdef USE_ASSERT_CHECKING
> +               LOCKTAG tag;
> +#endif
> +
> +               Assert(CheckRelationOidLockedByMe(SubscriptionRelRelationId,
> +
>            RowExclusiveLock, true));
> +
> +               rel = table_open(SubscriptionRelRelationId, NoLock);
> +#ifdef USE_ASSERT_CHECKING
> +               SET_LOCKTAG_OBJECT(tag, InvalidOid,
> SubscriptionRelationId, subid, 0);
> +               Assert(LockHeldByMe(&tag, AccessShareLock, true));
> +#endif
>
> to:
>  #ifdef USE_ASSERT_CHECKING
>         LOCKTAG tag;
>         Assert(CheckRelationOidLockedByMe(SubscriptionRelRelationId,
>                                           RowExclusiveLock, true));
>         SET_LOCKTAG_OBJECT(tag, InvalidOid, SubscriptionRelationId, subid, 0);
>         Assert(LockHeldByMe(&tag, AccessShareLock, true));
> #endif
>
>         rel = table_open(SubscriptionRelRelationId, NoLock);
>     }
>

Your suggested change looks better to me.

--
With Regards,
Amit Kapila.



Re: 024_add_drop_pub.pl might fail due to deadlock

От
Ajin Cherian
Дата:
On Thu, Jul 31, 2025 at 7:07 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Thu, 31 Jul 2025 at 08:23, Ajin Cherian <itsajin@gmail.com> wrote:
> >
> > On Wed, Jul 30, 2025 at 10:33 PM Hayato Kuroda (Fujitsu)
> > <kuroda.hayato@fujitsu.com> wrote:
> > >
> > > Dear Ajin,
> > >
> > > > Attaching the updated patches with the changes you requested. I've
> > > > also added the unchanged patches for PG_18 and HEAD (PG_18_HEAD-v6*),
> > > > so that everything is together in one mail.
> > >
> > > Thanks for update, but the patch for PG18/HEAD seemed not to have Assert().
> > > Can you modify like others do?
> > >
> >
> > Updated patch for PG18/HEAD.
>
> How about we change the below:
> +#ifdef USE_ASSERT_CHECKING
> +               LOCKTAG tag;
> +#endif
> +
> +               Assert(CheckRelationOidLockedByMe(SubscriptionRelRelationId,
> +
>            RowExclusiveLock, true));
> +
> +               rel = table_open(SubscriptionRelRelationId, NoLock);
> +#ifdef USE_ASSERT_CHECKING
> +               SET_LOCKTAG_OBJECT(tag, InvalidOid,
> SubscriptionRelationId, subid, 0);
> +               Assert(LockHeldByMe(&tag, AccessShareLock, true));
> +#endif
>
> to:
>  #ifdef USE_ASSERT_CHECKING
>         LOCKTAG tag;
>         Assert(CheckRelationOidLockedByMe(SubscriptionRelRelationId,
>                                           RowExclusiveLock, true));
>         SET_LOCKTAG_OBJECT(tag, InvalidOid, SubscriptionRelationId, subid, 0);
>         Assert(LockHeldByMe(&tag, AccessShareLock, true));
> #endif
>
>         rel = table_open(SubscriptionRelRelationId, NoLock);
>     }
>
> The rest looks good to me.

I've fixed the patches accordingly for all branches.

regards,
Ajin Cherian
Fujitsu Australia

Вложения

Re: 024_add_drop_pub.pl might fail due to deadlock

От
Amit Kapila
Дата:
On Thu, Jul 31, 2025 at 4:22 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> I've fixed the patches accordingly for all branches.
>

Pushed.

--
With Regards,
Amit Kapila.