Обсуждение: 024_add_drop_pub.pl might fail due to deadlock
Hello hackers, The recent buildfarm failure [1] on REL_15_STABLE with the following diagnostics: # Looks like your test exited with 29 just after 1. t/024_add_drop_pub.pl .............. Dubious, test returned 29 (wstat 7424, 0x1d00) pgsql.build/src/test/subscription/tmp_check/log/regress_log_024_add_drop_pub [21:01:34.406](16.501s) ok 1 - check initial data is copied to subscriber error running SQL: 'psql:<stdin>:1: ERROR: deadlock detected DETAIL: Process 219632 waits for ExclusiveLock on relation 6000 of database 0; blocked by process 218369. Process 218369 waits for AccessShareLock on object 16387 of class 6100 of database 0; blocked by process 219632. HINT: See server log for query details.' while running 'psql -XAtq -d port=14957 host=/home/bf/bf-build/petalura/tmp/bGI6HuRtfa dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'ALTER SUBSCRIPTION tap_sub DROP PUBLICATION tap_pub_1' at /home/bf/bf-build/petalura/REL_15_STABLE/pgsql.build/../pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm line 1951. pgsql.build/src/test/subscription/tmp_check/log/024_add_drop_pub_subscriber.log 2025-07-01 21:01:32.682 CEST [218369][logical replication worker][3/6:0] LOG: logical replication apply worker for subscription "tap_sub" has started ... 2025-07-01 21:01:34.771 CEST [219632][client backend][4/14:0] LOG: statement: ALTER SUBSCRIPTION tap_sub DROP PUBLICATION tap_pub_1 2025-07-01 21:01:37.355 CEST [219632][client backend][4/14:731] ERROR: deadlock detected 2025-07-01 21:01:37.355 CEST [219632][client backend][4/14:731] DETAIL: Process 219632 waits for ExclusiveLock on relation 6000 of database 0; blocked by process 218369. Process 218369 waits for AccessShareLock on object 16387 of class 6100 of database 0; blocked by process 219632. Process 219632: ALTER SUBSCRIPTION tap_sub DROP PUBLICATION tap_pub_1 Process 218369: <command string not enabled> 2025-07-01 21:01:37.355 CEST [219632][client backend][4/14:731] HINT: See server log for query details. 2025-07-01 21:01:37.355 CEST [219632][client backend][4/14:731] STATEMENT: ALTER SUBSCRIPTION tap_sub DROP PUBLICATION tap_pub_1 shows that the test can fail due to deadlock on accessing pg_replication_origin (relation 6000). This failure can be easily reproduced with: --- a/src/backend/replication/logical/origin.c +++ b/src/backend/replication/logical/origin.c @@ -428,6 +428,7 @@ replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait) * the specific origin and then re-check if the origin still exists. */ rel = table_open(ReplicationOriginRelationId, ExclusiveLock); +pg_usleep(300000); Not reproduced on REL_16_STABLE (since f6c5edb8a), nor in v14- (because 024_add_drop_pub.pl was added in v15). [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=petalura&dt=2025-07-01%2018%3A00%3A58 Best regards, Alexander
On Sun, Jul 6, 2025 at 2:00 AM Alexander Lakhin <exclusion@gmail.com> wrote: > > --- a/src/backend/replication/logical/origin.c > +++ b/src/backend/replication/logical/origin.c > @@ -428,6 +428,7 @@ replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait) > * the specific origin and then re-check if the origin still exists. > */ > rel = table_open(ReplicationOriginRelationId, ExclusiveLock); > +pg_usleep(300000); > > Not reproduced on REL_16_STABLE (since f6c5edb8a), nor in v14- (because > 024_add_drop_pub.pl was added in v15). > > [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=petalura&dt=2025-07-01%2018%3A00%3A58 > > Best regards, > Alexander > Hi Alexander, Yes, the problem can be reproduced by the changes you suggested. I will look into what is happening and how we can fix this. regards, Ajin Cherian Fujitsu Australia
On Mon, Jul 7, 2025 at 8:15 PM Ajin Cherian <itsajin@gmail.com> wrote: > > On Sun, Jul 6, 2025 at 2:00 AM Alexander Lakhin <exclusion@gmail.com> wrote: > > > > --- a/src/backend/replication/logical/origin.c > > +++ b/src/backend/replication/logical/origin.c > > @@ -428,6 +428,7 @@ replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait) > > * the specific origin and then re-check if the origin still exists. > > */ > > rel = table_open(ReplicationOriginRelationId, ExclusiveLock); > > +pg_usleep(300000); > > > > Not reproduced on REL_16_STABLE (since f6c5edb8a), nor in v14- (because > > 024_add_drop_pub.pl was added in v15). > > > > [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=petalura&dt=2025-07-01%2018%3A00%3A58 > > > > Best regards, > > Alexander > > > > Hi Alexander, > > Yes, the problem can be reproduced by the changes you suggested. I > will look into what is happening and how we can fix this. The issue appears to be a deadlock caused by inconsistent lock acquisition order between two processes: Process A (executing ALTER SUBSCRIPTION tap_sub DROP PUBLICATION tap_pub_1): In AlterSubscription_refresh(), it first acquires an AccessExclusiveLock on SubscriptionRelRelationId (resource 1), then later tries to acquire an ExclusiveLock on ReplicationOriginRelationId (resource 2). Process B (apply worker): In process_syncing_tables_for_apply(), it first acquires an ExclusiveLock on ReplicationOriginRelationId (resource 2), then calls UpdateSubscriptionRelState(), which tries to acquire a AccessShareLock on SubscriptionRelRelationId (resource 1). This leads to a deadlock: Process A holds a lock on resource 1 and waits for resource 2, while process B holds a lock on resource 2 and waits for resource 1. Proposed fix: In process_syncing_tables_for_apply(), acquire an AccessExclusiveLock on SubscriptionRelRelationId before acquiring the lock on ReplicationOriginRelationId. Patch with fix attached. I'll continue investigating whether this issue also affects HEAD. regards, Ajin Cherian Fujitsu Australia.
Вложения
On Tue, Jul 8, 2025 at 8:41 PM Ajin Cherian <itsajin@gmail.com> wrote: > Proposed fix: > In process_syncing_tables_for_apply(), acquire an AccessExclusiveLock > on SubscriptionRelRelationId before acquiring the lock on > ReplicationOriginRelationId. > > Patch with fix attached. > I'll continue investigating whether this issue also affects HEAD. While debugging this further, I found that there is another lock taken prior to this in AlterSubscription(), /* Lock the subscription so nobody else can do anything with it. */ LockSharedObject(SubscriptionRelationId, subid, 0, AccessExclusiveLock); Since this is the first lock taken while altering subscription, that should also be taken first by the tablesync worker to avoid deadlock. So, attaching a modified patch. regards, Ajin Cherian Fujitsu Australia
Вложения
On Tue, Jul 8, 2025 at 8:41 PM Ajin Cherian <itsajin@gmail.com> wrote: > > Patch with fix attached. > I'll continue investigating whether this issue also affects HEAD. > While debugging if this problem can occur on HEAD, I found out that on head, it is mostly the tablesync worker that drops the origin on HEAD and since the tablesysnc worker does not attempt to update the SubscriptionRel state in that process, there doesn't seem to be the possibility of a deadlock. But there is a rare situation where the tablesync worker could crash or get an error just prior to dropping the origin, then the origin is dropped in the apply worker (this is explained in the comments in process_syncing_tables_for_sync()). If the origin has to be dropped in the apply worker, then the same deadlock can happen in HEAD code as well. I was able to simulate this by using an injection point to create an error on the tablesync worker and then the similar deadlock happens on HEAD as well. Attaching a patch for fixing this on HEAD as well. regards, Ajin Cherian Fujitsu Australia
Вложения
On Mon, 14 Jul 2025 at 15:46, Ajin Cherian <itsajin@gmail.com> wrote: > > On Tue, Jul 8, 2025 at 8:41 PM Ajin Cherian <itsajin@gmail.com> wrote: > > > > Patch with fix attached. > > I'll continue investigating whether this issue also affects HEAD. > > > > While debugging if this problem can occur on HEAD, I found out that on > head, it is mostly the tablesync worker that drops the origin on HEAD > and since the tablesysnc worker does not attempt to update the > SubscriptionRel state in that process, there doesn't seem to be the > possibility of a deadlock. But there is a rare situation where the > tablesync worker could crash or get an error just prior to dropping > the origin, then the origin is dropped in the apply worker (this is > explained in the comments in process_syncing_tables_for_sync()). If > the origin has to be dropped in the apply worker, then the same > deadlock can happen in HEAD code as well. I was able to simulate this > by using an injection point to create an error on the tablesync worker > and then the similar deadlock happens on HEAD as well. Attaching a > patch for fixing this on HEAD as well. I was able to reproduce the deadlock on HEAD as well using the attached patch, which introduces a delay in the tablesync worker before dropping the replication origin by adding a sleep of a few seconds. During this delay, the apply worker also attempts to drop the replication origin. If an ALTER SUBSCRIPTION command is executed concurrently, a deadlock frequently occurs: 2025-07-14 15:59:53.572 IST [141100] DETAIL: Process 141100 waits for AccessExclusiveLock on object 2 of class 6000 of database 0; blocked by process 140974. Process 140974 waits for AccessShareLock on object 16396 of class 6100 of database 0; blocked by process 141100. Process 141100: alter subscription sub1 drop publication pub1 Process 140974: <command string not enabled> After apply the attached patch, create the logical replication setup for a publication pub1 having table t1 and then run the following commands in a loop: alter subscription sub1 drop publication pub1; alter subscription sub1 add publication pub1; sleep 4 Regards, Vignesh
Вложения
On Mon, 14 Jul 2025 at 16:15, vignesh C <vignesh21@gmail.com> wrote: > > On Mon, 14 Jul 2025 at 15:46, Ajin Cherian <itsajin@gmail.com> wrote: > > > > On Tue, Jul 8, 2025 at 8:41 PM Ajin Cherian <itsajin@gmail.com> wrote: > > > > > > Patch with fix attached. > > > I'll continue investigating whether this issue also affects HEAD. > > > > > > > While debugging if this problem can occur on HEAD, I found out that on > > head, it is mostly the tablesync worker that drops the origin on HEAD > > and since the tablesysnc worker does not attempt to update the > > SubscriptionRel state in that process, there doesn't seem to be the > > possibility of a deadlock. But there is a rare situation where the > > tablesync worker could crash or get an error just prior to dropping > > the origin, then the origin is dropped in the apply worker (this is > > explained in the comments in process_syncing_tables_for_sync()). If > > the origin has to be dropped in the apply worker, then the same > > deadlock can happen in HEAD code as well. I was able to simulate this > > by using an injection point to create an error on the tablesync worker > > and then the similar deadlock happens on HEAD as well. Attaching a > > patch for fixing this on HEAD as well. > > I was able to reproduce the deadlock on HEAD as well using the > attached patch, which introduces a delay in the tablesync worker > before dropping the replication origin by adding a sleep of a few > seconds. During this delay, the apply worker also attempts to drop the > replication origin. If an ALTER SUBSCRIPTION command is executed > concurrently, a deadlock frequently occurs: > 2025-07-14 15:59:53.572 IST [141100] DETAIL: Process 141100 waits for > AccessExclusiveLock on object 2 of class 6000 of database 0; blocked > by process 140974. > Process 140974 waits for AccessShareLock on object 16396 of class 6100 > of database 0; blocked by process 141100. > Process 141100: alter subscription sub1 drop publication pub1 > Process 140974: <command string not enabled> > > After apply the attached patch, create the logical replication setup > for a publication pub1 having table t1 and then run the following > commands in a loop: > alter subscription sub1 drop publication pub1; > alter subscription sub1 add publication pub1; > sleep 4 Attached is the script used to reproduce the issue and the deadlock logs for the same. Your patch fixes the issue. Couple of comments: 1) This change is not required: #include "utils/snapmgr.h" #include "utils/syscache.h" #include "utils/usercontext.h" +#include "utils/injection_point.h" 2) This can not only happen in error case but also in normal cases where the tablesync worker is slower as shown in the script to reproduce, we can update the commit message accordingly: In most situations the tablesync worker will drop the corresponding origin before it finishes executing, but if an error causes the tablesync worker to fail just prior to dropping the origin, the apply worker will later find the origin and drop it. Regards, Vignesh
Вложения
On Tue, Jul 15, 2025 at 2:24 PM vignesh C <vignesh21@gmail.com> wrote: > > Couple of comments: > 1) This change is not required: > #include "utils/snapmgr.h" > #include "utils/syscache.h" > #include "utils/usercontext.h" > +#include "utils/injection_point.h" > > 2) This can not only happen in error case but also in normal cases > where the tablesync worker is slower as shown in the script to > reproduce, we can update the commit message accordingly: > In most situations the tablesync worker will drop the corresponding > origin before it > finishes executing, but if an error causes the tablesync worker to > fail just prior to > dropping the origin, the apply worker will later find the origin and drop it. Thanks for the test and confirming the fix. Fixed the comments. regards, Ajin Cherian Fujitsu Australia
Вложения
On Wed, Jul 16, 2025 at 8:38 AM Ajin Cherian <itsajin@gmail.com> wrote: > > Thanks for the test and confirming the fix. Fixed the comments. > * origin. So passing missing_ok = true. + * + * Also lock SubscriptionRelationId with AccessShareLock to + * prevent any deadlocks with the user concurrently performing + * refresh on the subscription. */ + LockSharedObject(SubscriptionRelationId, MyLogicalRepWorker->subid, + 0, AccessShareLock); It seems the patch assumes that the above lock is sufficient because AlterSubscription will take an AcessExclusiveLock on the same subscription. So, with this proposal, if both of those become serialized then the other locking order may not matter. Am I correct or is there any other theory you have in mind? If that is the theory then I think we are missing cases where the apply worker and Alter subscription operates on different subscriptions. Consider AlterSubscription_refresh() takes first AccessExclusiveLock on SubscriptionRelRelationId and then ExclusiveLock on ReplicationOriginRelationId via replorigin_drop_by_name() . The apply worker first takes ExclusiveLock on ReplicationOriginRelationId via replorigin_drop_by_name() and then RowExclusiveLock on SubscriptionRelRelationId via UpdateSubscriptionRelState(). Won't such a scenario taking conflicting locks in reverse order can lead to deadlock at least in PG15? -- With Regards, Amit Kapila.
On Thu, Jul 17, 2025 at 4:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > It seems the patch assumes that the above lock is sufficient because > AlterSubscription will take an AcessExclusiveLock on the same > subscription. So, with this proposal, if both of those become > serialized then the other locking order may not matter. Am I correct > or is there any other theory you have in mind? > > If that is the theory then I think we are missing cases where the > apply worker and Alter subscription operates on different > subscriptions. > > Consider AlterSubscription_refresh() takes first AccessExclusiveLock > on SubscriptionRelRelationId and then ExclusiveLock on > ReplicationOriginRelationId via replorigin_drop_by_name() . The apply > worker first takes ExclusiveLock on ReplicationOriginRelationId via > replorigin_drop_by_name() and then RowExclusiveLock on > SubscriptionRelRelationId via UpdateSubscriptionRelState(). Won't such > a scenario taking conflicting locks in reverse order can lead to > deadlock at least in PG15? Yes, this is correct. I have also verified this in my testing that when there is a second subscription, a deadlock can still happen with my previous patch. I have now modified the patch in tablesync worker to take locks on both SubscriptionRelationId and SubscriptionRelRelationId prior to taking lock on ReplicationOriginRelationId. I have also found that there is a similar problem in DropSubscription() where lock on SubscriptionRelRelationId is not taken before dropping the tracking origin. I've also modified the signature of UpdateSubscriptionRelState to take a bool "lock_needed" which if false, the SubscriptionRelationId is not locked. I've made a new version of the patch on PG_15. regards, Ajin Cherian Fujitsu Australia
Вложения
> locked. I've made a new version of the patch on PG_15. I've made a similar fix on HEAD just so that the code is now consistent. I don't think the similar problem (deadlock between two different subscriptions trying to drop tracking origin) occurs on HEAD with my previous patch, as the way origins are dropped are different on HEAD. On HEAD, while dropping origin, a RowExclusiveLock lock is taken on ReplicationOriginRelationId and then an AccessExclusiveLock is taken on the particular origin. Since the particular origin will be different on different subscriptions, the similar deadlock will not happen. But just to keep code consistent, I have made a similar fix. regards, Ajin Cherian Fujitsu Australia
Вложения
On Mon, Jul 21, 2025 at 6:59 PM Ajin Cherian <itsajin@gmail.com> wrote: > > Yes, this is correct. I have also verified this in my testing that > when there is a second subscription, a deadlock can still happen with > my previous patch. I have now modified the patch in tablesync worker > to take locks on both SubscriptionRelationId and > SubscriptionRelRelationId prior to taking lock on > ReplicationOriginRelationId. I have also found that there is a similar > problem in DropSubscription() where lock on SubscriptionRelRelationId > is not taken before dropping the tracking origin. I've also modified > the signature of UpdateSubscriptionRelState to take a bool > "lock_needed" which if false, the SubscriptionRelationId is not > locked. I've made a new version of the patch on PG_15. > Review comments: ================ 1. + if (!rel) + rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock); Why did you acquire AccessExclusiveLock here when the current code has RowExclusiveLock? It should be RowExclusiveLock. 2. + * + * Lock SubscriptionRelationId with AccessShareLock and + * take AccessExclusiveLock on SubscriptionRelRelationId to + * prevent any deadlocks with user concurrently performing + * refresh on the subscription. */ Try to tell in the comments that we want to keep the locking order same as DDL commands to prevent deadlocks. 3. + * Also close any tables prior to the commit. */ + if (rel) + { + table_close(rel, AccessExclusiveLock); + rel = NULL; We don't need to explicitly release lock on table_close, it will be done at transaction end, so use NoLock here as we do in current HEAD code. 4. DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel) { - Relation rel; + Relation rel, sub_rel; ObjectAddress myself; HeapTuple tup; Oid subid; @@ -1460,7 +1460,13 @@ DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel) * Note that the state can't change because we have already stopped both * the apply and tablesync workers and they can't restart because of * exclusive lock on the subscription. + * + * Lock pg_subscription_rel with AccessExclusiveLock to prevent any + * deadlock with apply workers of other subscriptions trying + * to drop tracking origin. */ + sub_rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock); I don't think we need AccessExclusiveLock on SubscriptionRelRelationId in DropSubscription. Kindly test again after fixing the first comment above. -- With Regards, Amit Kapila.
Dear Ajin, Thanks for the patch. Firstly let me confirm my understanding. While altering the subscription, locks are acquired with below ordering: Order target level 1 pg_subscription row exclusive 2 pg_subscription, my tuple access exclusive 3 pg_subscription_rel access exclusive 4 pg_replication_origin excluive In contrast, apply worker works like: Order target level 1 pg_replication_origin excluive 2 pg_subscription, my tuple access share 3 pg_subscrition_rel row exclusive Thus a backend may wait at step 4, and apply worker can stuck at step 2 or 3. Below are my comments: ``` @@ -340,7 +341,6 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn) * is dropped. So passing missing_ok = false. */ ReplicationSlotDropAtPubNode(LogRepWorkerWalRcvConn, syncslotname, false); - ``` This change is not needed. ``` + if (!rel) + rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock); + ``` I feel it is too strong, isn't it enough to use row exclusive as initially used? ``` UpdateSubscriptionRelState(Oid subid, Oid relid, char state, - XLogRecPtr sublsn) + XLogRecPtr sublsn, bool lock_needed) ``` I feel the name `lock_needed` is bit misleading, because the function still opens the pg_subscription_rel catalog with row exclusive. How about "lock_shared_object"? ``` @@ -1460,7 +1460,13 @@ DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel) * Note that the state can't change because we have already stopped both * the apply and tablesync workers and they can't restart because of * exclusive lock on the subscription. + * + * Lock pg_subscription_rel with AccessExclusiveLock to prevent any + * deadlock with apply workers of other subscriptions trying + * to drop tracking origin. */ + sub_rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock); ``` Hmm. Per my understanding, DropSubscription acquires below locks till it reaches replorigin_drop_by_name(). Order target level 1 pg_subscription access exclusive 2 pg_subscription, my tuple access exclusive 3 pg_replication_origin excluive IIUC we must preserve the ordering, not the target of locks. Best regards, Hayato Kuroda FUJITSU LIMITED
On Wed, Jul 23, 2025 at 2:42 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Ajin, > > Thanks for the patch. Firstly let me confirm my understanding. While altering the > subscription, locks are acquired with below ordering: > > Order target level > 1 pg_subscription row exclusive > 2 pg_subscription, my tuple access exclusive > 3 pg_subscription_rel access exclusive > 4 pg_replication_origin excluive > > In contrast, apply worker works like: > > Order target level > 1 pg_replication_origin excluive > 2 pg_subscription, my tuple access share > 3 pg_subscrition_rel row exclusive > > Thus a backend may wait at step 4, and apply worker can stuck at step 2 or 3. > > Below are my comments: > > ``` > @@ -340,7 +341,6 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn) > * is dropped. So passing missing_ok = false. > */ > ReplicationSlotDropAtPubNode(LogRepWorkerWalRcvConn, syncslotname, false); > - > ``` > This change is not needed. > > ``` > + if (!rel) > + rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock); > + > ``` > > I feel it is too strong, isn't it enough to use row exclusive as initially used? > > ``` > UpdateSubscriptionRelState(Oid subid, Oid relid, char state, > - XLogRecPtr sublsn) > + XLogRecPtr sublsn, bool lock_needed) > ``` > > I feel the name `lock_needed` is bit misleading, because the function still opens > the pg_subscription_rel catalog with row exclusive. How about "lock_shared_object"? > I think if we lock in a caller, we don't need to use any lock during table_open. We can use the parameter name as already_locked as we do at some other places in the code. -- With Regards, Amit Kapila.
> Dear Ajin, > > Thanks for the patch. Firstly let me confirm my understanding. While altering the > subscription, locks are acquired with below ordering: > I forgot to confirm one point. For which branch should be backpatch? Initially it was reported only on PG15 [1], but I found 021_alter_sub_pub could fail on PG14. Regarding the PG13, it may not be affected because the replication origin seemed not to be used for the table sync. It was introduced for ce0fdbfe97. [1]: https://www.postgresql.org/message-id/bab95e12-6cc5-4ebb-80a8-3e41956aa297%40gmail.com Best regards, Hayato Kuroda FUJITSU LIMITED
On Wed, Jul 23, 2025 at 4:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jul 21, 2025 at 6:59 PM Ajin Cherian <itsajin@gmail.com> wrote: > > > > Yes, this is correct. I have also verified this in my testing that > > when there is a second subscription, a deadlock can still happen with > > my previous patch. I have now modified the patch in tablesync worker > > to take locks on both SubscriptionRelationId and > > SubscriptionRelRelationId prior to taking lock on > > ReplicationOriginRelationId. I have also found that there is a similar > > problem in DropSubscription() where lock on SubscriptionRelRelationId > > is not taken before dropping the tracking origin. I've also modified > > the signature of UpdateSubscriptionRelState to take a bool > > "lock_needed" which if false, the SubscriptionRelationId is not > > locked. I've made a new version of the patch on PG_15. > > > > Review comments: > ================ > 1. > + if (!rel) > + rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock); > > Why did you acquire AccessExclusiveLock here when the current code has > RowExclusiveLock? It should be RowExclusiveLock. > Yes, you are correct. I have replaced it with RowExclusiveLock. > 2. > + * > + * Lock SubscriptionRelationId with AccessShareLock and > + * take AccessExclusiveLock on SubscriptionRelRelationId to > + * prevent any deadlocks with user concurrently performing > + * refresh on the subscription. > */ > > Try to tell in the comments that we want to keep the locking order > same as DDL commands to prevent deadlocks. > Modified. > 3. > + * Also close any tables prior to the commit. > */ > + if (rel) > + { > + table_close(rel, AccessExclusiveLock); > + rel = NULL; > > We don't need to explicitly release lock on table_close, it will be > done at transaction end, so use NoLock here as we do in current HEAD > code. > Done. > 4. > DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel) > { > - Relation rel; > + Relation rel, sub_rel; > ObjectAddress myself; > HeapTuple tup; > Oid subid; > @@ -1460,7 +1460,13 @@ DropSubscription(DropSubscriptionStmt *stmt, > bool isTopLevel) > * Note that the state can't change because we have already stopped both > * the apply and tablesync workers and they can't restart because of > * exclusive lock on the subscription. > + * > + * Lock pg_subscription_rel with AccessExclusiveLock to prevent any > + * deadlock with apply workers of other subscriptions trying > + * to drop tracking origin. > */ > + sub_rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock); > > I don't think we need AccessExclusiveLock on SubscriptionRelRelationId > in DropSubscription. Kindly test again after fixing the first comment > above. > -- Yes, it was failing because I was taking AccessExclusiveLock in the apply worker, and that was causing the deadlock in my testing. I tested this worked. On Wed, Jul 23, 2025 at 7:12 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Ajin, > > Thanks for the patch. Firstly let me confirm my understanding. While altering the > subscription, locks are acquired with below ordering: > > Order target level > 1 pg_subscription row exclusive > 2 pg_subscription, my tuple access exclusive > 3 pg_subscription_rel access exclusive > 4 pg_replication_origin excluive > > In contrast, apply worker works like: > > Order target level > 1 pg_replication_origin excluive > 2 pg_subscription, my tuple access share > 3 pg_subscrition_rel row exclusive > > Thus a backend may wait at step 4, and apply worker can stuck at step 2 or 3. > Yes, that is correct. > Below are my comments: > > ``` > @@ -340,7 +341,6 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn) > * is dropped. So passing missing_ok = false. > */ > ReplicationSlotDropAtPubNode(LogRepWorkerWalRcvConn, syncslotname, false); > - > ``` > This change is not needed. Removed. > > ``` > + if (!rel) > + rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock); > + > ``` > > I feel it is too strong, isn't it enough to use row exclusive as initially used? > Yes, agree. Fixed. > ``` > UpdateSubscriptionRelState(Oid subid, Oid relid, char state, > - XLogRecPtr sublsn) > + XLogRecPtr sublsn, bool lock_needed) > ``` > > I feel the name `lock_needed` is bit misleading, because the function still opens > the pg_subscription_rel catalog with row exclusive. How about "lock_shared_object"? > I have modified it to not take lock while table_open as well and changed the name of the variable to already_locked. > ``` > @@ -1460,7 +1460,13 @@ DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel) > * Note that the state can't change because we have already stopped both > * the apply and tablesync workers and they can't restart because of > * exclusive lock on the subscription. > + * > + * Lock pg_subscription_rel with AccessExclusiveLock to prevent any > + * deadlock with apply workers of other subscriptions trying > + * to drop tracking origin. > */ > + sub_rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock); > ``` > > Hmm. Per my understanding, DropSubscription acquires below locks till it reaches > replorigin_drop_by_name(). > > Order target level > 1 pg_subscription access exclusive > 2 pg_subscription, my tuple access exclusive > 3 pg_replication_origin excluive > > IIUC we must preserve the ordering, not the target of locks. I have removed this change as this does not now conflict with the apply worker. Two patches attached. One for HEAD and the other for PG_15. regards, Ajin Cherian Fujitsu Australia
Вложения
On Wed, Jul 23, 2025 at 8:01 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > > Dear Ajin, > > > > Thanks for the patch. Firstly let me confirm my understanding. While altering the > > subscription, locks are acquired with below ordering: > > > > I forgot to confirm one point. For which branch should be backpatch? Initially > it was reported only on PG15 [1], but I found 021_alter_sub_pub could fail on PG14. Yes, here's a patch for PG14 as well, based on REL_14_STABLE. regards, Ajin Cherian Fujitsu Australia
Вложения
Dear Ajin, Thanks for patches! While checking, I recalled that the backpatch policy [1]. We must not modify definitions of opened functions but this does. Can you define another function like UpdateSubscriptionRelStateEx or something on the back branches? Another comment: ``` @@ -328,9 +328,13 @@ UpdateSubscriptionRelState(Oid subid, Oid relid, char state, Datum values[Natts_pg_subscription_rel]; bool replaces[Natts_pg_subscription_rel]; - LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock); - - rel = table_open(SubscriptionRelRelationId, RowExclusiveLock); + if (already_locked) + rel = table_open(SubscriptionRelRelationId, NoLock); ``` Can we assert that RowExclusiveLock for pg_subscription_rel has already been acquired, by using CheckRelationOidLockedByMe() family? Also, I'm now bit confusing for which branches are really affected. Can you create all patches for branches, attach at the same e-mail and add some summary what you want to fix? E.g., you provided a patch for HEAD/PG15/PG14, what about PG18, 17, 16 and 13? If not needed, why? [1]: https://www.postgresql.org/docs/devel/xfunc-c.html#XFUNC-API-ABI-STABILITY-GUIDANCE Best regards, Hayato Kuroda FUJITSU LIMITED
On Thu, 24 Jul 2025 at 17:45, Ajin Cherian <itsajin@gmail.com> wrote: > > On Wed, Jul 23, 2025 at 8:01 PM Hayato Kuroda (Fujitsu) > <kuroda.hayato@fujitsu.com> wrote: > > > > > Dear Ajin, > > > > > > Thanks for the patch. Firstly let me confirm my understanding. While altering the > > > subscription, locks are acquired with below ordering: > > > > > > > I forgot to confirm one point. For which branch should be backpatch? Initially > > it was reported only on PG15 [1], but I found 021_alter_sub_pub could fail on PG14. > > Yes, here's a patch for PG14 as well, based on REL_14_STABLE. > I believe the patch is trying the address the following issues reported: 1) 024_add_drop_pub.pl test failure reported on REL_16_STABLE at [1] 2) Different variation of the above issue on head with the script attached at [2] 3) Amit reported different variant of it for PG15 with the patch at [3] I felt these issues are not applicable to the PG13 branch as Replication origin creation for table sync is not there in the PG13 branch. So the fix is required from master to PG14 branches. The patch does not apply on the PG16 branch. In PG15 you have the following code: + /* Close table if opened */ + if (rel) + { + table_close(rel, NoLock); + } In master branch you have the following code: + /* Close table if opened */ + if (rel) + table_close(rel, NoLock); + + We can keep the fix consistent in both cases and additional newlines not required in the master branch. [1] - https://www.postgresql.org/message-id/bab95e12-6cc5-4ebb-80a8-3e41956aa297%40gmail.com [2] - https://www.postgresql.org/message-id/CALDaNm3PrTkVc2uxMyQTkqw0sg7O6i0EXe1jJo9CzOyW2gFS%2BQ%40mail.gmail.com [3] - https://www.postgresql.org/message-id/CAA4eK1KPa1dJrcd%3DXfOWx-r37eZudKQRqct0tY1R7vnUw0OabQ%40mail.gmail.com Regards, Vignesh
On Fri, Jul 25, 2025 at 6:01 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Ajin, > > Thanks for patches! While checking, I recalled that the backpatch policy [1]. > We must not modify definitions of opened functions but this does. Can you define > another function like UpdateSubscriptionRelStateEx or something on the back > branches? > > Another comment: > ``` > @@ -328,9 +328,13 @@ UpdateSubscriptionRelState(Oid subid, Oid relid, char state, > Datum values[Natts_pg_subscription_rel]; > bool replaces[Natts_pg_subscription_rel]; > > - LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock); > - > - rel = table_open(SubscriptionRelRelationId, RowExclusiveLock); > + if (already_locked) > + rel = table_open(SubscriptionRelRelationId, NoLock); > ``` > > Can we assert that RowExclusiveLock for pg_subscription_rel has already been > acquired, by using CheckRelationOidLockedByMe() family? > Thanks for your review Kuroda-san, I have changed the logic to not use already_locked and instead check if the locks are taken inside UpdateSubscriptionRelState itself. I've tested this and this works fine. If this logic is acceptable to the reviewers I can update the other patches also in a similar way. regards, Ajin Cherian Fujitsu Australia
Вложения
On Fri, Jul 25, 2025 at 6:01 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > Also, I'm now bit confusing for which branches are really affected. Can you > create all patches for branches, attach at the same e-mail and add some summary > what you want to fix? > E.g., you provided a patch for HEAD/PG15/PG14, what about PG18, 17, 16 and 13? > If not needed, why? Yes, patches are needed for all these versions except PG13 because replicating origin is not there in PG13. If this logic is fine, I will create patches for all these branches as well and send them tomorrow. regards, Ajin Cherian Fujitsu Australia
Dear Ajin, > Thanks for your review Kuroda-san, I have changed the logic to not use > already_locked and instead check if the locks are taken inside > UpdateSubscriptionRelState itself. I've tested this and this works > fine. If this logic is acceptable to the reviewers I can update the > other patches also in a similar way. Thanks for updates. However, I found that functions like LockHeldByMe(), CheckRelationOidLockedByMe() and LWLockHeldByMe() have been used only for the debug build. Functions like ProcArraySetReplicationSlotXmin() and MarkAsPrepared() can remove the flag from the argument but they are retained till now. Based on that, I suggest adding new argument (or add new Ex function for bank branches) and do the assertion check when the assertion is enabled in this build. Thought? Best regards, Hayato Kuroda FUJITSU LIMITED
On Tue, Jul 29, 2025 at 7:23 AM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > > Thanks for your review Kuroda-san, I have changed the logic to not use > > already_locked and instead check if the locks are taken inside > > UpdateSubscriptionRelState itself. I've tested this and this works > > fine. If this logic is acceptable to the reviewers I can update the > > other patches also in a similar way. > > Thanks for updates. > However, I found that functions like LockHeldByMe(), CheckRelationOidLockedByMe() > and LWLockHeldByMe() have been used only for the debug build. Functions like > ProcArraySetReplicationSlotXmin() and MarkAsPrepared() can remove the flag from > the argument but they are retained till now. > Based on that, I suggest adding new argument (or add new Ex function for bank branches) > and do the assertion check when the assertion is enabled in this build. Thought? > Yes, that makes sense to me. For HEAD and PG18, we can still add a new argument to the API. For other bank branches, it is better to use a new Ex function as suggested by Kuroda-San. -- With Regards, Amit Kapila.
On Tue, Jul 29, 2025 at 1:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Yes, that makes sense to me. For HEAD and PG18, we can still add a new > argument to the API. For other bank branches, it is better to use a > new Ex function as suggested by Kuroda-San. > Here are the updated patches. I have added "Ex functions" for back branches (PG_17 and earlier) , which also have Asserts for making sure locks are taken. For PG_18 and HEAD, I've used the extra parameter already_locked. PG_14_15-v6-0001-Fix-a-deadlock-during-ALTER-SUBSCRIPTION.-DROP-PU.patch is for both PG_14 and PG_15 and PG_18_HEAD-v6-0001-Fix-a-possible-deadlock-during-ALTER-SUBSCRIPTION is for both PG_18 and HEAD. regards, Ajin Cherian Fujitsu Australia
Вложения
On Tue, 29 Jul 2025 at 14:46, Ajin Cherian <itsajin@gmail.com> wrote: > > On Tue, Jul 29, 2025 at 1:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > Yes, that makes sense to me. For HEAD and PG18, we can still add a new > > argument to the API. For other bank branches, it is better to use a > > new Ex function as suggested by Kuroda-San. > > > > Here are the updated patches. I noticed the order of LockSharedObject and table lock is different here compared to disable subscription: @@ -492,7 +494,17 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn) * worker to remove the origin tracking as if there is any * error while dropping we won't restart it to drop the * origin. So passing missing_ok = true. + * + * Lock the subscription and origin in the same order as we + * are doing during DDL commands to avoid deadlocks. See + * AlterSubscription_refresh. */ + LockSharedObject(SubscriptionRelationId, MyLogicalRepWorker->subid, + 0, AccessShareLock); + + if (!rel) + rel = table_open(SubscriptionRelRelationId, RowExclusiveLock); DisableSubscription(Oid subid) { .... rel = table_open(SubscriptionRelationId, RowExclusiveLock); tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(subid)); if (!HeapTupleIsValid(tup)) elog(ERROR, "cache lookup failed for subscription %u", subid); LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock); .... Do we need to enforce consistent lock ordering here, or is it safe to ignore because we're only using AccessShareLock? Regards, Vignesh
On Tue, Jul 29, 2025 at 4:26 PM vignesh C <vignesh21@gmail.com> wrote: > > On Tue, 29 Jul 2025 at 14:46, Ajin Cherian <itsajin@gmail.com> wrote: > > > > On Tue, Jul 29, 2025 at 1:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > Yes, that makes sense to me. For HEAD and PG18, we can still add a new > > > argument to the API. For other bank branches, it is better to use a > > > new Ex function as suggested by Kuroda-San. > > > > > > > Here are the updated patches. > > I noticed the order of LockSharedObject and table lock is different > here compared to disable subscription: > Note that catalog tables are not the same. The DisableSubscription() takes lock on pg_subscrition catalog and then on a particular subscription using subscription_id. Here, we are first taking lock on a particular subscription and then on pg_subscription_rel. So, this seems to follow exactly the order which we should follow and I don't see any problem here. Please let us know if you have a specific concern. -- With Regards, Amit Kapila.
Dear Ajin, > I have added "Ex functions" for back branches (PG_17 and earlier) , > which also have Asserts for making sure locks are taken. For PG_18 and > HEAD, I've used the extra parameter already_locked. > PG_14_15-v6-0001-Fix-a-deadlock-during-ALTER-SUBSCRIPTION.-DROP-PU.p > atch > is for both PG_14 and PG_15 and > PG_18_HEAD-v6-0001-Fix-a-possible-deadlock-during-ALTER-SUBSCRIPTION > is for both PG_18 and HEAD. > Thanks for creating the patch! I feel the existing function here should be the wrapper of Ex function, so attached patch did not match the expectation. table_open() and table_close() are called both in UpdateSubscriptionRelState() and UpdateSubscriptionRelStateEx(). Another issue is that the variable "tag" is used by SET_LOCKTAG_OBJECT() even without the debug build. How do you feel the .diff file can be applied atop PG17 patch? It is mainly same as v4 patch but has some assertion. Best regards, Hayato Kuroda FUJITSU LIMITED
Вложения
> How do you feel the .diff file can be applied atop PG17 patch? It is mainly > same as v4 patch but has some assertion. Best regards, Hayato Kuroda FUJITSU LIMITED
> > How do you feel the .diff file can be applied atop PG17 patch? It is mainly > > same as v4 patch but has some assertion. Sorry for my interrupted message. I noticed only I attached old version patch. PSA the correct version. Best regards, Hayato Kuroda FUJITSU LIMITED
Вложения
On Tue, 29 Jul 2025 at 14:46, Ajin Cherian <itsajin@gmail.com> wrote: > > On Tue, Jul 29, 2025 at 1:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > Yes, that makes sense to me. For HEAD and PG18, we can still add a new > > argument to the API. For other bank branches, it is better to use a > > new Ex function as suggested by Kuroda-San. > > > > Here are the updated patches. The only case where the ordering of lock now is different is in DropSubscription, but in this case we take an AccessExclusiveLock on pg_subscription which should prevent this deadlock from occurring: ... replorigin_drop_by_name(originname, true, false); } /* Clean up dependencies */ deleteSharedDependencyRecordsFor(SubscriptionRelationId, subid, 0); /* Remove any associated relation synchronization states. */ RemoveSubscriptionRel(subid, InvalidOid); ... Regards, Vignesh
On Tue, Jul 29, 2025 at 10:45 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > > > How do you feel the .diff file can be applied atop PG17 patch? It is mainly > > > same as v4 patch but has some assertion. > > Sorry for my interrupted message. I noticed only I attached old version patch. > PSA the correct version. Attaching the updated patches with the changes you requested. I've also added the unchanged patches for PG_18 and HEAD (PG_18_HEAD-v6*), so that everything is together in one mail. regards, Ajin Cherian Fujitsu Australia
Вложения
Dear Ajin, > Attaching the updated patches with the changes you requested. I've > also added the unchanged patches for PG_18 and HEAD (PG_18_HEAD-v6*), > so that everything is together in one mail. Thanks for update, but the patch for PG18/HEAD seemed not to have Assert(). Can you modify like others do? Best regards, Hayato Kuroda FUJITSU LIMITED
On Wed, Jul 30, 2025 at 10:33 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Ajin, > > > Attaching the updated patches with the changes you requested. I've > > also added the unchanged patches for PG_18 and HEAD (PG_18_HEAD-v6*), > > so that everything is together in one mail. > > Thanks for update, but the patch for PG18/HEAD seemed not to have Assert(). > Can you modify like others do? > Updated patch for PG18/HEAD. regards, Ajin Cherian Fujitsu Australia
Вложения
Dear Ajin, Thanks for updates. I confirmed that reported issues could be fixed by your patch. I have no comments anymore. Best regards, Hayato Kuroda FUJITSU LIMITED
On Thu, 31 Jul 2025 at 08:23, Ajin Cherian <itsajin@gmail.com> wrote: > > On Wed, Jul 30, 2025 at 10:33 PM Hayato Kuroda (Fujitsu) > <kuroda.hayato@fujitsu.com> wrote: > > > > Dear Ajin, > > > > > Attaching the updated patches with the changes you requested. I've > > > also added the unchanged patches for PG_18 and HEAD (PG_18_HEAD-v6*), > > > so that everything is together in one mail. > > > > Thanks for update, but the patch for PG18/HEAD seemed not to have Assert(). > > Can you modify like others do? > > > > Updated patch for PG18/HEAD. How about we change the below: +#ifdef USE_ASSERT_CHECKING + LOCKTAG tag; +#endif + + Assert(CheckRelationOidLockedByMe(SubscriptionRelRelationId, + RowExclusiveLock, true)); + + rel = table_open(SubscriptionRelRelationId, NoLock); +#ifdef USE_ASSERT_CHECKING + SET_LOCKTAG_OBJECT(tag, InvalidOid, SubscriptionRelationId, subid, 0); + Assert(LockHeldByMe(&tag, AccessShareLock, true)); +#endif to: #ifdef USE_ASSERT_CHECKING LOCKTAG tag; Assert(CheckRelationOidLockedByMe(SubscriptionRelRelationId, RowExclusiveLock, true)); SET_LOCKTAG_OBJECT(tag, InvalidOid, SubscriptionRelationId, subid, 0); Assert(LockHeldByMe(&tag, AccessShareLock, true)); #endif rel = table_open(SubscriptionRelRelationId, NoLock); } The rest looks good to me. Regards, Vignesh
On Thu, Jul 31, 2025 at 2:37 PM vignesh C <vignesh21@gmail.com> wrote: > > How about we change the below: > +#ifdef USE_ASSERT_CHECKING > + LOCKTAG tag; > +#endif > + > + Assert(CheckRelationOidLockedByMe(SubscriptionRelRelationId, > + > RowExclusiveLock, true)); > + > + rel = table_open(SubscriptionRelRelationId, NoLock); > +#ifdef USE_ASSERT_CHECKING > + SET_LOCKTAG_OBJECT(tag, InvalidOid, > SubscriptionRelationId, subid, 0); > + Assert(LockHeldByMe(&tag, AccessShareLock, true)); > +#endif > > to: > #ifdef USE_ASSERT_CHECKING > LOCKTAG tag; > Assert(CheckRelationOidLockedByMe(SubscriptionRelRelationId, > RowExclusiveLock, true)); > SET_LOCKTAG_OBJECT(tag, InvalidOid, SubscriptionRelationId, subid, 0); > Assert(LockHeldByMe(&tag, AccessShareLock, true)); > #endif > > rel = table_open(SubscriptionRelRelationId, NoLock); > } > Your suggested change looks better to me. -- With Regards, Amit Kapila.
On Thu, Jul 31, 2025 at 7:07 PM vignesh C <vignesh21@gmail.com> wrote: > > On Thu, 31 Jul 2025 at 08:23, Ajin Cherian <itsajin@gmail.com> wrote: > > > > On Wed, Jul 30, 2025 at 10:33 PM Hayato Kuroda (Fujitsu) > > <kuroda.hayato@fujitsu.com> wrote: > > > > > > Dear Ajin, > > > > > > > Attaching the updated patches with the changes you requested. I've > > > > also added the unchanged patches for PG_18 and HEAD (PG_18_HEAD-v6*), > > > > so that everything is together in one mail. > > > > > > Thanks for update, but the patch for PG18/HEAD seemed not to have Assert(). > > > Can you modify like others do? > > > > > > > Updated patch for PG18/HEAD. > > How about we change the below: > +#ifdef USE_ASSERT_CHECKING > + LOCKTAG tag; > +#endif > + > + Assert(CheckRelationOidLockedByMe(SubscriptionRelRelationId, > + > RowExclusiveLock, true)); > + > + rel = table_open(SubscriptionRelRelationId, NoLock); > +#ifdef USE_ASSERT_CHECKING > + SET_LOCKTAG_OBJECT(tag, InvalidOid, > SubscriptionRelationId, subid, 0); > + Assert(LockHeldByMe(&tag, AccessShareLock, true)); > +#endif > > to: > #ifdef USE_ASSERT_CHECKING > LOCKTAG tag; > Assert(CheckRelationOidLockedByMe(SubscriptionRelRelationId, > RowExclusiveLock, true)); > SET_LOCKTAG_OBJECT(tag, InvalidOid, SubscriptionRelationId, subid, 0); > Assert(LockHeldByMe(&tag, AccessShareLock, true)); > #endif > > rel = table_open(SubscriptionRelRelationId, NoLock); > } > > The rest looks good to me. I've fixed the patches accordingly for all branches. regards, Ajin Cherian Fujitsu Australia
Вложения
On Thu, Jul 31, 2025 at 4:22 PM Ajin Cherian <itsajin@gmail.com> wrote: > > I've fixed the patches accordingly for all branches. > Pushed. -- With Regards, Amit Kapila.