Discussion: BUG #18988: DROP SUBSCRIPTION locks not-yet-accessed database
The following bug has been logged on the website:

Bug reference:      18988
Logged by:          Alexander Lakhin
Email address:      exclusion@gmail.com
PostgreSQL version: 18beta1
Operating system:   Ubuntu 24.04
Description:

The following script:

createdb ndb
echo "
CREATE SUBSCRIPTION testsub CONNECTION 'dbname=ndb' PUBLICATION testpub WITH
(connect = false);
" | psql

echo "
DROP SUBSCRIPTION testsub
" | psql &
sleep 1
timeout 30 psql ndb -c "SELECT 1" || echo "TIMEOUT"

makes DROP SUBSCRIPTION get stuck waiting for a connection in order to drop a
slot, while that connection is waiting for a lock on relation 6100
(pg_subscription), which is held by DROP SUBSCRIPTION:

law      1545967 1545946  0 21:10 ?  00:00:00 postgres: law regression [local] DROP SUBSCRIPTION
law      1545968 1545946  0 21:10 ?  00:00:00 postgres: walsender law ndb [local] startup waiting

With debug_discard_caches = 1, or under some lucky circumstances (which I
encountered), this leads to an inability to connect to any database.

Reproduced on REL_13_STABLE .. master.
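While the report stands on its own, the blocked pair is easy to confirm from a separate session that is already connected to an existing database. The query below is only an illustration of that check (it is not part of the original report) and relies on the standard pg_stat_activity view and pg_blocking_pids():

```
-- From a session connected to an already-initialized database,
-- list the backends stuck behind a lock and the PID(s) blocking them.
SELECT a.pid,
       a.backend_type,
       a.wait_event_type,
       a.wait_event,
       pg_blocking_pids(a.pid) AS blocked_by,
       a.query
FROM pg_stat_activity a
WHERE cardinality(pg_blocking_pids(a.pid)) > 0;
```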
On Thu, 17 Jul 2025 at 00:36, PG Bug reporting form <noreply@postgresql.org> wrote:
>
> The following bug has been logged on the website:
>
> Bug reference:      18988
> Logged by:          Alexander Lakhin
> Email address:      exclusion@gmail.com
> PostgreSQL version: 18beta1
> Operating system:   Ubuntu 24.04
> Description:
>
> The following script:
> createdb ndb
> echo "
> CREATE SUBSCRIPTION testsub CONNECTION 'dbname=ndb' PUBLICATION testpub WITH
> (connect = false);
> " | psql
>
> echo "
> DROP SUBSCRIPTION testsub
> " | psql &
> sleep 1
> timeout 30 psql ndb -c "SELECT 1" || echo "TIMEOUT"
>
> makes DROP SUBSCRIPTION get stuck waiting for a connection in order to drop a
> slot, while that connection is waiting for a lock on relation 6100
> (pg_subscription), which is held by DROP SUBSCRIPTION:
> law 1545967 1545946 0 21:10 ? 00:00:00 postgres: law regression [local] DROP SUBSCRIPTION
> law 1545968 1545946 0 21:10 ? 00:00:00 postgres: walsender law ndb [local] startup waiting
>
> With debug_discard_caches = 1, or under some lucky circumstances (which I
> encountered), this leads to an inability to connect to any database.
>
> Reproduced on REL_13_STABLE .. master.

Thanks, I was able to reproduce the issue using the steps provided.
The problem occurs because, when dropping a subscription, we take an
AccessExclusiveLock on the pg_subscription system table to prevent the
launcher from restarting the worker. During this process, we also
attempt to connect to the publisher in order to drop the replication
slot. As we are connecting to a newly created database, it may not yet
have initialized its catalog caches. As part of backend startup, it
attempts to build the cache hierarchy via:
RelationCacheInitializePhase3 → InitCatalogCachePhase2 → InitCatCachePhase2

This cache initialization requires acquiring a shared lock on
pg_subscription, since it is one of the syscache-backed catalog tables.
But that shared lock is blocked by the AccessExclusiveLock already held
by the dropping process. As a result, the new backend hangs waiting for
the lock, and the original DROP SUBSCRIPTION process cannot proceed,
leading to a self-blocking scenario.

In this specific case, no replication slot was created during
subscription creation, as the connect option was specified as false.
Therefore, I believe the system should skip connecting to the publisher
when dropping the subscription. I've attached a patch that addresses
this behavior. Thoughts?

Regards,
Vignesh
Attachments
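The lock conflict described above can also be seen directly in pg_locks; a minimal illustration follows (relation 6100 is pg_subscription). Run it from a database you have already connected to before, otherwise the diagnosing session itself can block on the same cache initialization:

```
-- DROP SUBSCRIPTION holds AccessExclusiveLock on pg_subscription,
-- while the starting backend waits for AccessShareLock on it.
SELECT l.pid, l.mode, l.granted, a.backend_type, a.query
FROM pg_locks l
LEFT JOIN pg_stat_activity a USING (pid)
WHERE l.locktype = 'relation'
  AND l.relation = 'pg_subscription'::regclass;
```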
On Wed, Jul 30, 2025 at 4:24 PM vignesh C <vignesh21@gmail.com> wrote: > > On Thu, 17 Jul 2025 at 00:36, PG Bug reporting form > <noreply@postgresql.org> wrote: > > > > The following bug has been logged on the website: > > > > Bug reference: 18988 > > Logged by: Alexander Lakhin > > Email address: exclusion@gmail.com > > PostgreSQL version: 18beta1 > > Operating system: Ubuntu 24.04 > > Description: > > > > The following script: > > createdb ndb > > echo " > > CREATE SUBSCRIPTION testsub CONNECTION 'dbname=ndb' PUBLICATION testpub WITH > > (connect = false); > > " | psql > > > > echo " > > DROP SUBSCRIPTION testsub > > " | psql & > > sleep 1 > > timeout 30 psql ndb -c "SELECT 1" || echo "TIMEOUT" > > > > makes DROP SUBSCRIPTION stuck on waiting for a connection to drop a slot, > > while this connection is waiting for a lock for relation 6100 > > (pg_subscription), locked by DROP SUBSCRIPTION: > > law 1545967 1545946 0 21:10 ? 00:00:00 postgres: law regression > > [local] DROP SUBSCRIPTION > > law 1545968 1545946 0 21:10 ? 00:00:00 postgres: walsender law > > ndb [local] startup waiting > > > > With debug_discard_caches = 1 or under some lucky circumstances (I > > encountered these), this leads to inability to connect to any database. > > > > Reproduced on REL_13_STABLE .. master. > > Thanks, I was able to reproduce the issue using the steps provided. > The problem occurs because: When dropping a subscription, it takes an > AccessExclusiveLock on the pg_subscription system tables to prevent > the launcher from restarting the worker. During this process, it also > attempts to connect to the publisher in order to drop the replication > slot. As we are connecting to a newly created database, it may not yet > have initialized its catalog caches. As part of the backend startup, > it attempts to build the cache hierarchy via: > RelationCacheInitializePhase3 → InitCatalogCachePhase2 → > InitCatCachePhase2 This cache initialization requires acquiring a > shared lock on pg_subscription, since it is one of the syscache-backed > catalog tables. But that shared lock is blocked by the > AccessExclusiveLock already held by the dropping process. As a result, > the new backend hangs waiting for the lock, and the original DROP > SUBSCRIPTION process cannot proceed, leading to a self-blocking > scenario. > > In this specific case, no replication slot was created during > subscription creation as the connect option was specified as false. > Therefore, I believe the system should skip connecting to the > publisher when dropping the subscription. I've attached a patch that > addresses this behavior. Thoughts? I think this fix looks correct, while testing I realize that now we need an additional step before we can enable the subscription, so I think we should put that additional step in the error hint, as attached, this top up patch can be applied on top of your patch. -- Regards, Dilip Kumar Google
Attachments
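For context, the extra manual step such a hint would point to corresponds to the recovery sequence already suggested by CREATE SUBSCRIPTION's warning for connect = false. A sketch, assuming the slot is named after the subscription and the pgoutput plugin is used:

```
-- On the publisher: create the replication slot the subscription expects.
SELECT pg_create_logical_replication_slot('testsub', 'pgoutput');

-- On the subscriber: then enable and refresh the subscription.
ALTER SUBSCRIPTION testsub ENABLE;
ALTER SUBSCRIPTION testsub REFRESH PUBLICATION;
```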
RE: BUG #18988: DROP SUBSCRIPTION locks not-yet-accessed database
From: "Hayato Kuroda (Fujitsu)"
Dear Vignesh, Dilip,

I found another corner case:

```
postgres=# CREATE SUBSCRIPTION sub CONNECTION 'dbname=db1 user=postgres port=5431' PUBLICATION pub1 WITH (connect=false,slot_name=sub);
WARNING:  subscription was created, but is not connected
HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
CREATE SUBSCRIPTION
postgres=# DROP SUBSCRIPTION sub ;
... (won't return)
```

This is because one can still explicitly specify the slot_name while creating
the subscription. Another pattern is to run an ALTER SUBSCRIPTION SET (slot_name)
command after CREATE SUBSCRIPTION WITH (connect=false).

Should we fix this case? If so, how?

Best regards,
Hayato Kuroda
FUJITSU LIMITED
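The second pattern mentioned above, spelled out as a sketch (object names follow the example in this message):

```
-- Create the subscription without connecting and without a slot name ...
CREATE SUBSCRIPTION sub CONNECTION 'dbname=db1 user=postgres port=5431'
    PUBLICATION pub1 WITH (connect = false);

-- ... then attach a slot name afterwards.  A later DROP SUBSCRIPTION
-- will again try to connect to the publisher to drop the slot.
ALTER SUBSCRIPTION sub SET (slot_name = 'sub');
DROP SUBSCRIPTION sub;
```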
On Thu, Jul 31, 2025 at 2:34 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
>
> Dear Vignesh, Dilip,
>
> I found another corner case:
>
> ```
> postgres=# CREATE SUBSCRIPTION sub CONNECTION 'dbname=db1 user=postgres port=5431' PUBLICATION pub1 WITH (connect=false,slot_name=sub);
> WARNING:  subscription was created, but is not connected
> HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
> CREATE SUBSCRIPTION
> postgres=# DROP SUBSCRIPTION sub ;
> ... (won't return)
> ```
>
> This is because one can still explicitly specify the slot_name while creating
> the subscription. Another pattern is to run an ALTER SUBSCRIPTION SET (slot_name)
> command after CREATE SUBSCRIPTION WITH (connect=false).
>
> Should we fix this case? If so, how?

IMHO, what we should do is to set a flag in the subscription catalog recording
whether connect was true or not, and DROP SUBSCRIPTION should not try to drop
the slot if connect was false. Thoughts?

--
Regards,
Dilip Kumar
Google
On Thu, Jul 31, 2025 at 5:14 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Jul 31, 2025 at 2:34 PM Hayato Kuroda (Fujitsu) > <kuroda.hayato@fujitsu.com> wrote: > > > > Dear Vignesh, Dilip, > > > > I found another corner case: > > > > ``` > > postgres=# CREATE SUBSCRIPTION sub CONNECTION 'dbname=db1 user=postgres port=5431' PUBLICATION pub1 WITH (connect=false,slot_name=sub); > > WARNING: subscription was created, but is not connected > > HINT: To initiate replication, you must manually create the replication slot, enable the subscription, and refresh thesubscription. > > CREATE SUBSCRIPTION > > postgres=# DROP SUBSCRIPTION sub ; > > ... (won't return) > > ``` > > > > Because still can explicitly specify the slot_name while creating the subscription. > > Another pattern is to run ALTER SUBSCRIPTION SET (slot_name) command after the > > CREATE SUBSCRIPTION WITH (connect=false);. > > > > Should we fix the case? If so, how? > > IMHO, what we should do is to set a flag in subscription that whether > the connect is true or not, and drop subscription should not try to > drop the slot if the connect is false, thoughts? Maybe not, since this exists in previous branches as well and we cannot alter the system catalog in back branches. Let me put more thoughts on this. -- Regards, Dilip Kumar Google
On Thu, 31 Jul 2025 at 14:34, Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Vignesh, Dilip, > > I found another corner case: > > ``` > postgres=# CREATE SUBSCRIPTION sub CONNECTION 'dbname=db1 user=postgres port=5431' PUBLICATION pub1 WITH (connect=false,slot_name=sub); > WARNING: subscription was created, but is not connected > HINT: To initiate replication, you must manually create the replication slot, enable the subscription, and refresh thesubscription. > CREATE SUBSCRIPTION > postgres=# DROP SUBSCRIPTION sub ; > ... (won't return) > ``` > > Because still can explicitly specify the slot_name while creating the subscription. > Another pattern is to run ALTER SUBSCRIPTION SET (slot_name) command after the > CREATE SUBSCRIPTION WITH (connect=false);. > > Should we fix the case? If so, how? An alternative approach would be to acquire an AccessShareLock on pg_subscription, and then explicitly lock the target subscription row using LockSharedObject with AccessExclusiveLock while dropping the subscription. On the launcher side, before starting a worker, it should similarly acquire an AccessExclusiveLock on the corresponding subscription row using LockSharedObject. Once the lock is acquired, it must revalidate that the subscription still exists, to prevent race conditions with concurrent drops, before proceeding to start the worker. Thoughts? Regards, Vignesh
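Since LockSharedObject() takes a shared-object lock keyed by (pg_subscription, subscription OID) rather than a relation lock, the effect of this scheme can be observed from SQL as 'object'-type entries in pg_locks. An illustrative query, not taken from any patch in this thread:

```
-- Per-subscription object locks: classid points at pg_subscription,
-- objid is the OID of the locked subscription.
SELECT l.pid, l.objid, s.subname, l.mode, l.granted
FROM pg_locks l
LEFT JOIN pg_subscription s ON s.oid = l.objid
WHERE l.locktype = 'object'
  AND l.classid = 'pg_subscription'::regclass;
```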
On Mon, Aug 4, 2025 at 9:07 AM vignesh C <vignesh21@gmail.com> wrote: > > On Thu, 31 Jul 2025 at 14:34, Hayato Kuroda (Fujitsu) > <kuroda.hayato@fujitsu.com> wrote: > > > > Dear Vignesh, Dilip, > > > > I found another corner case: > > > > ``` > > postgres=# CREATE SUBSCRIPTION sub CONNECTION 'dbname=db1 user=postgres port=5431' PUBLICATION pub1 WITH (connect=false,slot_name=sub); > > WARNING: subscription was created, but is not connected > > HINT: To initiate replication, you must manually create the replication slot, enable the subscription, and refresh thesubscription. > > CREATE SUBSCRIPTION > > postgres=# DROP SUBSCRIPTION sub ; > > ... (won't return) > > ``` > > > > Because still can explicitly specify the slot_name while creating the subscription. > > Another pattern is to run ALTER SUBSCRIPTION SET (slot_name) command after the > > CREATE SUBSCRIPTION WITH (connect=false);. > > > > Should we fix the case? If so, how? > > An alternative approach would be to acquire an AccessShareLock on > pg_subscription, and then explicitly lock the target subscription row > using LockSharedObject with AccessExclusiveLock while dropping the > subscription. On the launcher side, before starting a worker, it > should similarly acquire an AccessExclusiveLock on the corresponding > subscription row using LockSharedObject. Once the lock is acquired, it > must revalidate that the subscription still exists, to prevent race > conditions with concurrent drops, before proceeding to start the > worker. Thanks for the idea, so currently during drop subscription we are already doing LockSharedObject on the subscription in AccessExclusiveLock. So now we should try to acquire AccessExclusiveLock on the launcher side and then revalidate the object existence, let me try this and post a patch in a while? -- Regards, Dilip Kumar Google
On Mon, Aug 4, 2025 at 9:29 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Aug 4, 2025 at 9:07 AM vignesh C <vignesh21@gmail.com> wrote: > > > > On Thu, 31 Jul 2025 at 14:34, Hayato Kuroda (Fujitsu) > > <kuroda.hayato@fujitsu.com> wrote: > > > > > > Dear Vignesh, Dilip, > > > > > > I found another corner case: > > > > > > ``` > > > postgres=# CREATE SUBSCRIPTION sub CONNECTION 'dbname=db1 user=postgres port=5431' PUBLICATION pub1 WITH (connect=false,slot_name=sub); > > > WARNING: subscription was created, but is not connected > > > HINT: To initiate replication, you must manually create the replication slot, enable the subscription, and refreshthe subscription. > > > CREATE SUBSCRIPTION > > > postgres=# DROP SUBSCRIPTION sub ; > > > ... (won't return) > > > ``` > > > > > > Because still can explicitly specify the slot_name while creating the subscription. > > > Another pattern is to run ALTER SUBSCRIPTION SET (slot_name) command after the > > > CREATE SUBSCRIPTION WITH (connect=false);. > > > > > > Should we fix the case? If so, how? > > > > An alternative approach would be to acquire an AccessShareLock on > > pg_subscription, and then explicitly lock the target subscription row > > using LockSharedObject with AccessExclusiveLock while dropping the > > subscription. On the launcher side, before starting a worker, it > > should similarly acquire an AccessExclusiveLock on the corresponding > > subscription row using LockSharedObject. Once the lock is acquired, it > > must revalidate that the subscription still exists, to prevent race > > conditions with concurrent drops, before proceeding to start the > > worker. > > Thanks for the idea, so currently during drop subscription we are > already doing LockSharedObject on the subscription in > AccessExclusiveLock. So now we should try to acquire > AccessExclusiveLock on the launcher side and then revalidate the > object existence, let me try this and post a patch in a while? > I have worked on this and produced a first version of patch, let's see what others think about this idea. It would have been better if we could use SysCache for rechecking the subscription, but since we are not connected to the database in the launcher we can not use the SysCache, at least that's what I think. -- Regards, Dilip Kumar Google
Attachments
On 2025-Aug-04, Dilip Kumar wrote: > I have worked on this and produced a first version of patch, let's see > what others think about this idea. It would have been better if we > could use SysCache for rechecking the subscription, but since we are > not connected to the database in the launcher we can not use the > SysCache, at least that's what I think. I think it's reasonable to recheck after locking. There's a comment in DropSubscription that says we get AEL, which is no longer true. In is_subscription_exists() you should use the index on OID instead of seqscanning the catalog without a scankey; also I think the name ought to be "does" rather than "is". I think it's really odd that that function opens and closes a transaction; sounds to me that something like that really belongs in the caller (frankly the same is true with the other function that your comment references). Why isn't systable_beginscan being used to scan the catalog? I think with this coding, the resource owner for this new lock is NULL. Is this really a good approach? Maybe there should be a resowner here. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "Always assume the user will do much worse than the stupidest thing you can imagine." (Julien PUYDT)
On Mon, Aug 4, 2025 at 6:20 PM Álvaro Herrera <alvherre@kurilemu.de> wrote:
>
> On 2025-Aug-04, Dilip Kumar wrote:
>
> > I have worked on this and produced a first version of patch, let's see
> > what others think about this idea. It would have been better if we
> > could use SysCache for rechecking the subscription, but since we are
> > not connected to the database in the launcher we can not use the
> > SysCache, at least that's what I think.
>
> I think it's reasonable to recheck after locking.  There's a comment in
> DropSubscription that says we get AEL, which is no longer true.

Right, will remove that.

> In is_subscription_exists() you should use the index on OID instead of
> seqscanning the catalog without a scankey;

I thought that since the launcher is not connected to a database we would
not be able to open the index relation. Otherwise we could just call
SearchSysCache1(SUBSCRIPTIONOID, ObjectIdGetDatum(subid)). Maybe the
launcher is not connected only because it was not required so far, and we
could just connect it to template1?

> also I think the name ought to be "does" rather than "is".

Okay.

> I think it's really odd that that function opens and closes a
> transaction; sounds to me that something like that really belongs in
> the caller (frankly the same is true with the other function that your
> comment references).  Why isn't systable_beginscan being used to scan
> the catalog?

You mean for this function or for get_subscription_list() as well?
Yeah, logically systable_beginscan() sounds better.

> I think with this coding, the resource owner for this new lock is NULL.
> Is this really a good approach?  Maybe there should be a resowner here.

As you suggested, we should move the transaction to the caller and
start it before LockSharedObject(), so that we acquire the lock under
the TopTransactionResourceOwner?

--
Regards,
Dilip Kumar
Google
On Mon, Aug 4, 2025 at 7:08 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Aug 4, 2025 at 6:20 PM Álvaro Herrera <alvherre@kurilemu.de> wrote: > > > > On 2025-Aug-04, Dilip Kumar wrote: > > > > > I have worked on this and produced a first version of patch, let's see > > > what others think about this idea. It would have been better if we > > > could use SysCache for rechecking the subscription, but since we are > > > not connected to the database in the launcher we can not use the > > > SysCache, at least that's what I think. > > > > I think it's reasonable to recheck after locking. There's a comment in > > DropSubscription that says we get AEL, which is no longer true. > > Right, will remove that. > > In > > is_subscription_exists() you should use the index on OID instead of > > seqscanning the catalog without a scankey; > > I thought since launcher is not connected to the database we will not > be able to open the index relation. Otherwise we may just call > SearchSysCache1(SUBSCRIPTIONOID, ObjectIdGetDatum(subid)); Maybe this > is not connected because it was not required so far and we can just > connect it to template1 ? > > > also I think the name ought > > to be "does" rather than "is". > > Okay > > I think it's really odd that that > > function opens and closes a transaction; sounds to me that something > > like that really belongs in the caller (frankly the same is true with > > the other function that your comment references). Why isn't > > systable_beginscan being used to scan the catalog? > > You mean for this function or for get_subscription_list() as well, > yeah logically systable_beginscan() sounds better. > > > I think with this coding, the resource owner for this new lock is NULL. > > Is this really a good approach? Maybe there should be a resowner here. > > As you suggested we should move the transaction to the caller and > start it before LockSharedObject() so that we will acquire the lock > under the TopTransactionResourceOwner ? Here is revised version based on what I proposed here - I have removed the comment in DropSubscription where we acquire the lock, as mentioning the ASL is not interesting anymore, instead I am explaining in launcher why we are acquiring shared object lock. - Connected launcher to "postgres" database so that we can do syscache lookup - got rid of "is_subscription_exists" as we are directly validating using syscache lookup - Didn't do anything with existing function i.e. get_subscription_list -- Regards, Dilip Kumar Google
Attachments
On Tue, Aug 5, 2025 at 11:52 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Aug 4, 2025 at 7:08 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Mon, Aug 4, 2025 at 6:20 PM Álvaro Herrera <alvherre@kurilemu.de> wrote: > > > > > > On 2025-Aug-04, Dilip Kumar wrote: > > > > > > > I have worked on this and produced a first version of patch, let's see > > > > what others think about this idea. It would have been better if we > > > > could use SysCache for rechecking the subscription, but since we are > > > > not connected to the database in the launcher we can not use the > > > > SysCache, at least that's what I think. > > > > > > I think it's reasonable to recheck after locking. There's a comment in > > > DropSubscription that says we get AEL, which is no longer true. > > > > Right, will remove that. > > > > In > > > is_subscription_exists() you should use the index on OID instead of > > > seqscanning the catalog without a scankey; > > > > I thought since launcher is not connected to the database we will not > > be able to open the index relation. Otherwise we may just call > > SearchSysCache1(SUBSCRIPTIONOID, ObjectIdGetDatum(subid)); Maybe this > > is not connected because it was not required so far and we can just > > connect it to template1 ? > > > > > > also I think the name ought > > > to be "does" rather than "is". > > > > Okay > > > > I think it's really odd that that > > > function opens and closes a transaction; sounds to me that something > > > like that really belongs in the caller (frankly the same is true with > > > the other function that your comment references). Why isn't > > > systable_beginscan being used to scan the catalog? > > > > You mean for this function or for get_subscription_list() as well, > > yeah logically systable_beginscan() sounds better. > > > > > I think with this coding, the resource owner for this new lock is NULL. > > > Is this really a good approach? Maybe there should be a resowner here. > > > > As you suggested we should move the transaction to the caller and > > start it before LockSharedObject() so that we will acquire the lock > > under the TopTransactionResourceOwner ? > > Here is revised version based on what I proposed here > > - I have removed the comment in DropSubscription where we acquire the > lock, as mentioning the ASL is not interesting anymore, instead I am > explaining in launcher why we are acquiring shared object lock. > - Connected launcher to "postgres" database so that we can do syscache lookup > Won't that add a new restriction that one can't drop postgres database after this? BTW, isn't it better to acquire the share_lock on subscription during worker initialization (InitializeLogRepWorker()) where we already checking whether the subscription is dropped by that time? I think if we don't acquire the lock on subscription during launcher or during InitializeLogRepWorker(), there is a risk that DropSubscription would have reached the place where it would have tried to stop the workers and launcher will launch the worker after that point. Then there is a possibility of dangling worker because the GetSubscription() can still return valid value as the transaction in which we are dropping a subscription could still be in-progress. -- With Regards, Amit Kapila.
On Wed, Aug 6, 2025 at 10:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Won't that add a new restriction that one can't drop postgres database
> after this?

That's correct, so maybe this is not acceptable IMHO.

> BTW, isn't it better to acquire the share_lock on subscription during
> worker initialization (InitializeLogRepWorker()) where we already
> checking whether the subscription is dropped by that time? I think if
> we don't acquire the lock on subscription during launcher or during
> InitializeLogRepWorker(), there is a risk that DropSubscription would
> have reached the place where it would have tried to stop the workers
> and launcher will launch the worker after that point. Then there is a
> possibility of dangling worker because the GetSubscription() can still
> return valid value as the transaction in which we are dropping a
> subscription could still be in-progress.

Here is my thought on these two approaches. If we lock in the launcher
and check there, we avoid launching the worker altogether, but for that
we have to revalidate the subscription using a sequential scan on
pg_subscription, as we cannot open additional indexes while not connected
to any database. OTOH, if we acquire the lock in InitializeLogRepWorker(),
we don't need any additional scan, but we have to launch the worker first
and, if the subscription recheck shows it has been dropped, exit the worker.

I think either way is fine because this is not a common performance path.
I prefer what you proposed, as it requires less code: we are already
rechecking the subscription there and just need to acquire the shared
object lock in InitializeLogRepWorker(). Let's see what others think;
meanwhile I will modify the code with this approach as well.

--
Regards,
Dilip Kumar
Google
On Wed, Aug 6, 2025 at 11:07 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Aug 6, 2025 at 10:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > Won't that add a new restriction that one can't drop postgres database > > after this? > > That's correct, so maybe this is not acceptable IMHO. > > > BTW, isn't it better to acquire the share_lock on subscription during > > worker initialization (InitializeLogRepWorker()) where we already > > checking whether the subscription is dropped by that time? I think if > > we don't acquire the lock on subscription during launcher or during > > InitializeLogRepWorker(), there is a risk that DropSubscription would > > have reached the place where it would have tried to stop the workers > > and launcher will launch the worker after that point. Then there is a > > possibility of dangling worker because the GetSubscription() can still > > return valid value as the transaction in which we are dropping a > > subscription could still be in-progress. > > Here is my thought on these 2 approaches, so if we lock in launcher > and check there we avoid launching the worker altogether but for that > we will have to revalidate the subscription using sequence scan on > pg_subcription as we can not open additional indexes as we are not > connected to any database OTOH if we acquire lock in > InitializeLogRepWorker() we don't need to do any additional scan but > we will have to launch the worker and after that if subscription > recheck shows its dropped we will exit the worker. > > I think either way it's fine because this is not a very common > performance path, I prefer what you proposed as we will have to add > less code in this case, because we are already checking the > subscription and just need to acquire the shared object lock in > InitializeLogRepWorker(), lets see what other thinks, meanwhile I will > modify the code with this approach as well. Here is a patch with a different approach. IMHO with this approach the code change is much simpler. -- Regards, Dilip Kumar Google
Attachments
On Wed, Aug 6, 2025 at 11:38 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Aug 6, 2025 at 11:07 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Wed, Aug 6, 2025 at 10:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > Won't that add a new restriction that one can't drop postgres database > > > after this? > > > > That's correct, so maybe this is not acceptable IMHO. > > > > > BTW, isn't it better to acquire the share_lock on subscription during > > > worker initialization (InitializeLogRepWorker()) where we already > > > checking whether the subscription is dropped by that time? I think if > > > we don't acquire the lock on subscription during launcher or during > > > InitializeLogRepWorker(), there is a risk that DropSubscription would > > > have reached the place where it would have tried to stop the workers > > > and launcher will launch the worker after that point. Then there is a > > > possibility of dangling worker because the GetSubscription() can still > > > return valid value as the transaction in which we are dropping a > > > subscription could still be in-progress. > > > > Here is my thought on these 2 approaches, so if we lock in launcher > > and check there we avoid launching the worker altogether but for that > > we will have to revalidate the subscription using sequence scan on > > pg_subcription as we can not open additional indexes as we are not > > connected to any database OTOH if we acquire lock in > > InitializeLogRepWorker() we don't need to do any additional scan but > > we will have to launch the worker and after that if subscription > > recheck shows its dropped we will exit the worker. > > > > I think either way it's fine because this is not a very common > > performance path, I prefer what you proposed as we will have to add > > less code in this case, because we are already checking the > > subscription and just need to acquire the shared object lock in > > InitializeLogRepWorker(), lets see what other thinks, meanwhile I will > > modify the code with this approach as well. > > Here is a patch with a different approach. IMHO with this approach > the code change is much simpler. I have slightly modified the patch, basically removing explicitly unlocking the shared object as that will be taken care by transaction commit / proc_exit() -- Regards, Dilip Kumar Google
Attachments
On Thu, 7 Aug 2025 at 19:18, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Aug 6, 2025 at 11:38 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Wed, Aug 6, 2025 at 11:07 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Wed, Aug 6, 2025 at 10:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > Won't that add a new restriction that one can't drop postgres database > > > > after this? > > > > > > That's correct, so maybe this is not acceptable IMHO. > > > > > > > BTW, isn't it better to acquire the share_lock on subscription during > > > > worker initialization (InitializeLogRepWorker()) where we already > > > > checking whether the subscription is dropped by that time? I think if > > > > we don't acquire the lock on subscription during launcher or during > > > > InitializeLogRepWorker(), there is a risk that DropSubscription would > > > > have reached the place where it would have tried to stop the workers > > > > and launcher will launch the worker after that point. Then there is a > > > > possibility of dangling worker because the GetSubscription() can still > > > > return valid value as the transaction in which we are dropping a > > > > subscription could still be in-progress. > > > > > > Here is my thought on these 2 approaches, so if we lock in launcher > > > and check there we avoid launching the worker altogether but for that > > > we will have to revalidate the subscription using sequence scan on > > > pg_subcription as we can not open additional indexes as we are not > > > connected to any database OTOH if we acquire lock in > > > InitializeLogRepWorker() we don't need to do any additional scan but > > > we will have to launch the worker and after that if subscription > > > recheck shows its dropped we will exit the worker. > > > > > > I think either way it's fine because this is not a very common > > > performance path, I prefer what you proposed as we will have to add > > > less code in this case, because we are already checking the > > > subscription and just need to acquire the shared object lock in > > > InitializeLogRepWorker(), lets see what other thinks, meanwhile I will > > > modify the code with this approach as well. > > > > Here is a patch with a different approach. IMHO with this approach > > the code change is much simpler. > > I have slightly modified the patch, basically removing explicitly > unlocking the shared object as that will be taken care by transaction > commit / proc_exit() Thanks for the updated patch. This change is required for the back branches too. Currently it is not applying for PG16 and below branches. git am v3-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-.patch Applying: Fix DROP SUBSCRIPTION deadlock with new database creation error: patch failed: src/backend/commands/subscriptioncmds.c:1571 error: src/backend/commands/subscriptioncmds.c: patch does not apply error: patch failed: src/backend/replication/logical/worker.c:4624 error: src/backend/replication/logical/worker.c: patch does not apply Patch failed at 0001 Fix DROP SUBSCRIPTION deadlock with new database creation Regards, Vignesh
On Wed, Aug 13, 2025 at 11:16 AM vignesh C <vignesh21@gmail.com> wrote: > > On Thu, 7 Aug 2025 at 19:18, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Wed, Aug 6, 2025 at 11:38 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Wed, Aug 6, 2025 at 11:07 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > On Wed, Aug 6, 2025 at 10:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > Won't that add a new restriction that one can't drop postgres database > > > > > after this? > > > > > > > > That's correct, so maybe this is not acceptable IMHO. > > > > > > > > > BTW, isn't it better to acquire the share_lock on subscription during > > > > > worker initialization (InitializeLogRepWorker()) where we already > > > > > checking whether the subscription is dropped by that time? I think if > > > > > we don't acquire the lock on subscription during launcher or during > > > > > InitializeLogRepWorker(), there is a risk that DropSubscription would > > > > > have reached the place where it would have tried to stop the workers > > > > > and launcher will launch the worker after that point. Then there is a > > > > > possibility of dangling worker because the GetSubscription() can still > > > > > return valid value as the transaction in which we are dropping a > > > > > subscription could still be in-progress. > > > > > > > > Here is my thought on these 2 approaches, so if we lock in launcher > > > > and check there we avoid launching the worker altogether but for that > > > > we will have to revalidate the subscription using sequence scan on > > > > pg_subcription as we can not open additional indexes as we are not > > > > connected to any database OTOH if we acquire lock in > > > > InitializeLogRepWorker() we don't need to do any additional scan but > > > > we will have to launch the worker and after that if subscription > > > > recheck shows its dropped we will exit the worker. > > > > > > > > I think either way it's fine because this is not a very common > > > > performance path, I prefer what you proposed as we will have to add > > > > less code in this case, because we are already checking the > > > > subscription and just need to acquire the shared object lock in > > > > InitializeLogRepWorker(), lets see what other thinks, meanwhile I will > > > > modify the code with this approach as well. > > > > > > Here is a patch with a different approach. IMHO with this approach > > > the code change is much simpler. > > > > I have slightly modified the patch, basically removing explicitly > > unlocking the shared object as that will be taken care by transaction > > commit / proc_exit() > > Thanks for the updated patch. This change is required for the back > branches too. Currently it is not applying for PG16 and below > branches. > git am v3-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-.patch > Applying: Fix DROP SUBSCRIPTION deadlock with new database creation > error: patch failed: src/backend/commands/subscriptioncmds.c:1571 > error: src/backend/commands/subscriptioncmds.c: patch does not apply > error: patch failed: src/backend/replication/logical/worker.c:4624 > error: src/backend/replication/logical/worker.c: patch does not apply > Patch failed at 0001 Fix DROP SUBSCRIPTION deadlock with new database creation Right, I was waiting for consensus for the fix, but anyway I will send the patches for back branches, thanks. -- Regards, Dilip Kumar Google
On Wed, Aug 13, 2025 at 11:20 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Aug 13, 2025 at 11:16 AM vignesh C <vignesh21@gmail.com> wrote: > > > > Thanks for the updated patch. This change is required for the back > > branches too. Currently it is not applying for PG16 and below > > branches. > > git am v3-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-.patch > > Applying: Fix DROP SUBSCRIPTION deadlock with new database creation > > error: patch failed: src/backend/commands/subscriptioncmds.c:1571 > > error: src/backend/commands/subscriptioncmds.c: patch does not apply > > error: patch failed: src/backend/replication/logical/worker.c:4624 > > error: src/backend/replication/logical/worker.c: patch does not apply > > Patch failed at 0001 Fix DROP SUBSCRIPTION deadlock with new database creation > > Right, I was waiting for consensus for the fix, but anyway I will send > the patches for back branches, thanks. I have prepared back branch patches. -- Regards, Dilip Kumar Google
Attachments
- v3-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-V15.patch
- v3-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-v14.patch
- v3-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-Master.patch
- v3-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-V13.patch
- v3-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-V16.patch
- v3-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-V17-18.patch
On Wed, Aug 13, 2025 at 3:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > I have prepared back branch patches. > @@ -1802,11 +1802,7 @@ DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel) ... - /* - * Lock pg_subscription with AccessExclusiveLock to ensure that the - * launcher doesn't restart new worker during dropping the subscription - */ - rel = table_open(SubscriptionRelationId, AccessExclusiveLock); + rel = table_open(SubscriptionRelationId, AccessShareLock); We need to acquire RowExclusiveLock here as we are deleting the subscription row during this operation. See docs [1] and AlterSubscription() code for reference. Attached is a top-up patch to fix this for HEAD. Additionally, I have edited a few comments. If you agree with these changes then please prepare backbranch patches accordingly. [1] - https://www.postgresql.org/docs/devel/explicit-locking.html -- With Regards, Amit Kapila.
Attachments
On Thu, Aug 14, 2025 at 11:39 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Aug 13, 2025 at 3:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > I have prepared back branch patches.
>
> @@ -1802,11 +1802,7 @@ DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel)
> ...
> -    /*
> -     * Lock pg_subscription with AccessExclusiveLock to ensure that the
> -     * launcher doesn't restart new worker during dropping the subscription
> -     */
> -    rel = table_open(SubscriptionRelationId, AccessExclusiveLock);
> +    rel = table_open(SubscriptionRelationId, AccessShareLock);
>
> We need to acquire RowExclusiveLock here as we are deleting the
> subscription row during this operation. See docs [1] and
> AlterSubscription() code for reference.

Yeah, that's right.

> Attached is a top-up patch to fix this for HEAD. Additionally, I have
> edited a few comments. If you agree with these changes then please
> prepare backbranch patches accordingly.

Yeah, this looks fine to me. PFA patches for back branches.

--
Regards,
Dilip Kumar
Google
Attachments
- v4-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-V13.patch
- v4-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-master.patch
- v4-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-V16.patch
- v4-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-V14_15.patch
- v4-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-V17_18.patch
On Thu, Aug 14, 2025 at 2:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > Yeah this looks fine to me. PFA patches for back branches. > Can we add a test based on the scenario reported in this email? -- With Regards, Amit Kapila.
On Thu, Aug 14, 2025 at 2:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Aug 14, 2025 at 2:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > Yeah this looks fine to me. PFA patches for back branches.
>
> Can we add a test based on the scenario reported in this email?

At first, I thought we could just add this test in subscription.sql, but
IMHO we should not do that, because after testing that DROP SUBSCRIPTION
gives an error we should eventually drop the subscription for cleanup by
creating the slot and enabling the subscription (as shown below [1]).
But then I realized that during the regression tests wal_level is not
logical, so we are not allowed to create the logical slot. So maybe we
should write it in 100_bugs.pl? Am I missing something?

[1]
-- ok, CREATE SUBSCRIPTION WITH (connect=false)
CREATE DATABASE regression_db;
CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regression_db' PUBLICATION testpub WITH (connect=false);

-- fail, due to non existing slot
-- should not create a deadlock while initializing caches of the new database
DROP SUBSCRIPTION regress_testsub;

-- ok, create the missing slot, enable the subscription and drop again
SELECT 'init' FROM pg_create_logical_replication_slot('regress_testsub', 'pgoutput');
ALTER SUBSCRIPTION regress_testsub ENABLE;
DROP SUBSCRIPTION regress_testsub;

--
Regards,
Dilip Kumar
Google
On Thu, Aug 14, 2025 at 4:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Aug 14, 2025 at 2:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Thu, Aug 14, 2025 at 2:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > Yeah this looks fine to me. PFA patches for back branches. > > > > > > > Can we add a test based on the scenario reported in this email? > > At first, I thought we can just add this test in subscription.sql, but > IMHO we should not do that because after testing DROP SUBSCRIPTION > give error we should eventually DROP the subscription for cleanup by > creating the slot and enabling the subscription (as shown below[1]) > but I that realized during regression wal_level is not logical so we > are not allowed to create the logical slot. So maybe we should write > it in 100_bugs.pl? Am I missing something? > Another alternative is for cleanup instead of creating a missing slot we can just set the slot_name=NONE and then drop. I will move ahead with this approach. -- Regards, Dilip Kumar Google
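That cleanup boils down to a two-statement sequence; a sketch using the subscription name from the draft test above:

```
-- Dissociate the subscription from the (never created) slot, then drop it
-- without attempting any connection to the publisher.
ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
DROP SUBSCRIPTION regress_testsub;
```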
On Thu, Aug 14, 2025 at 5:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Aug 14, 2025 at 4:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Thu, Aug 14, 2025 at 2:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Thu, Aug 14, 2025 at 2:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > Yeah this looks fine to me. PFA patches for back branches. > > > > > > > > > > Can we add a test based on the scenario reported in this email? > > > > At first, I thought we can just add this test in subscription.sql, but > > IMHO we should not do that because after testing DROP SUBSCRIPTION > > give error we should eventually DROP the subscription for cleanup by > > creating the slot and enabling the subscription (as shown below[1]) > > but I that realized during regression wal_level is not logical so we > > are not allowed to create the logical slot. So maybe we should write > > it in 100_bugs.pl? Am I missing something? > > > > Another alternative is for cleanup instead of creating a missing slot > we can just set the slot_name=NONE and then drop. I will move ahead > with this approach. > Yeah, that sounds good to me. -- With Regards, Amit Kapila.
On Fri, Aug 15, 2025 at 11:52 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Aug 14, 2025 at 5:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Thu, Aug 14, 2025 at 4:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Thu, Aug 14, 2025 at 2:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > On Thu, Aug 14, 2025 at 2:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > > > Yeah this looks fine to me. PFA patches for back branches. > > > > > > > > > > > > > Can we add a test based on the scenario reported in this email? > > > > > > At first, I thought we can just add this test in subscription.sql, but > > > IMHO we should not do that because after testing DROP SUBSCRIPTION > > > give error we should eventually DROP the subscription for cleanup by > > > creating the slot and enabling the subscription (as shown below[1]) > > > but I that realized during regression wal_level is not logical so we > > > are not allowed to create the logical slot. So maybe we should write > > > it in 100_bugs.pl? Am I missing something? > > > > > > > Another alternative is for cleanup instead of creating a missing slot > > we can just set the slot_name=NONE and then drop. I will move ahead > > with this approach. > > > > Yeah, that sounds good to me. > PFA, updated patches for v13 to head, make check-world is passing for all the branches for v15+ we had upgrade test which also run regression test so we additionally have to set max_wal_senders=1 for 002_pg_upgrade.pl -- Regards, Dilip Kumar Google
Attachments
- v5-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-head.patch
- v5-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-v15.patch
- v5-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-v13.patch
- v5-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-v14.patch
- v5-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-v16.patch
- v5-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-v17.patch
- v5-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-v18.patch
On Fri, 15 Aug 2025 at 11:58, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Fri, Aug 15, 2025 at 11:52 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Thu, Aug 14, 2025 at 5:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Thu, Aug 14, 2025 at 4:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > On Thu, Aug 14, 2025 at 2:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > On Thu, Aug 14, 2025 at 2:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > > > > > Yeah this looks fine to me. PFA patches for back branches. > > > > > > > > > > > > > > > > Can we add a test based on the scenario reported in this email? > > > > > > > > At first, I thought we can just add this test in subscription.sql, but > > > > IMHO we should not do that because after testing DROP SUBSCRIPTION > > > > give error we should eventually DROP the subscription for cleanup by > > > > creating the slot and enabling the subscription (as shown below[1]) > > > > but I that realized during regression wal_level is not logical so we > > > > are not allowed to create the logical slot. So maybe we should write > > > > it in 100_bugs.pl? Am I missing something? > > > > > > > > > > Another alternative is for cleanup instead of creating a missing slot > > > we can just set the slot_name=NONE and then drop. I will move ahead > > > with this approach. > > > > > > > Yeah, that sounds good to me. > > > > PFA, updated patches for v13 to head, make check-world is passing for > all the branches for v15+ we had upgrade test which also run > regression test so we additionally have to set max_wal_senders=1 for > 002_pg_upgrade.pl Do you feel it will be better to append the configurations using single call like below: -$oldnode->append_conf('postgresql.conf', 'log_statement = none'); -$oldnode->append_conf('postgresql.conf', 'wal_level = replica'); -$oldnode->append_conf('postgresql.conf', 'max_wal_senders = 1'); +$oldnode->append_conf( + 'postgresql.conf', + qq{log_statement = 'none' + wal_level = 'replica' + max_wal_senders = 1}); Regards, Vignesh
On Sun, Aug 17, 2025 at 7:23 PM vignesh C <vignesh21@gmail.com> wrote: > > > PFA, updated patches for v13 to head, make check-world is passing for > > all the branches for v15+ we had upgrade test which also run > > regression test so we additionally have to set max_wal_senders=1 for > > 002_pg_upgrade.pl > > Do you feel it will be better to append the configurations using > single call like below: > -$oldnode->append_conf('postgresql.conf', 'log_statement = none'); > -$oldnode->append_conf('postgresql.conf', 'wal_level = replica'); > -$oldnode->append_conf('postgresql.conf', 'max_wal_senders = 1'); > +$oldnode->append_conf( > + 'postgresql.conf', > + qq{log_statement = 'none' > + wal_level = 'replica' > + max_wal_senders = 1}); Not sure if one way is superior over another or preferred. I see both sorts of usage patterns throughout our test suit. -- Regards, Dilip Kumar Google
On Fri, 15 Aug 2025 at 11:58, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Fri, Aug 15, 2025 at 11:52 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Thu, Aug 14, 2025 at 5:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Thu, Aug 14, 2025 at 4:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > On Thu, Aug 14, 2025 at 2:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > On Thu, Aug 14, 2025 at 2:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > > > > > Yeah this looks fine to me. PFA patches for back branches. > > > > > > > > > > > > > > > > Can we add a test based on the scenario reported in this email? > > > > > > > > At first, I thought we can just add this test in subscription.sql, but > > > > IMHO we should not do that because after testing DROP SUBSCRIPTION > > > > give error we should eventually DROP the subscription for cleanup by > > > > creating the slot and enabling the subscription (as shown below[1]) > > > > but I that realized during regression wal_level is not logical so we > > > > are not allowed to create the logical slot. So maybe we should write > > > > it in 100_bugs.pl? Am I missing something? > > > > > > > > > > Another alternative is for cleanup instead of creating a missing slot > > > we can just set the slot_name=NONE and then drop. I will move ahead > > > with this approach. > > > > > > > Yeah, that sounds good to me. > > > > PFA, updated patches for v13 to head, make check-world is passing for > all the branches for v15+ we had upgrade test which also run > regression test so we additionally have to set max_wal_senders=1 for > 002_pg_upgrade.pl This change will run not just the newly added test, but the entire regression suite under the new configuration. Should we consider moving this test elsewhere such as into the 100_bugs: diff --git a/src/bin/pg_upgrade/t/002_pg_upgrade.pl b/src/bin/pg_upgrade/t/002_pg_upgrade.pl index 4b5e895809b..284b2ef07e0 100644 --- a/src/bin/pg_upgrade/t/002_pg_upgrade.pl +++ b/src/bin/pg_upgrade/t/002_pg_upgrade.pl @@ -225,6 +225,8 @@ $oldnode->init(%old_node_params); # Override log_statement=all set by Cluster.pm. This avoids large amounts # of log traffic that slow this test down even more when run under valgrind. $oldnode->append_conf('postgresql.conf', 'log_statement = none'); +$oldnode->append_conf('postgresql.conf', 'wal_level = replica'); +$oldnode->append_conf('postgresql.conf', 'max_wal_senders = 1'); Regards, Vignesh
On Mon, Aug 18, 2025 at 9:04 AM vignesh C <vignesh21@gmail.com> wrote: > This change will run not just the newly added test, but the entire > regression suite under the new configuration. Should we consider > moving this test elsewhere such as into the 100_bugs: > diff --git a/src/bin/pg_upgrade/t/002_pg_upgrade.pl > b/src/bin/pg_upgrade/t/002_pg_upgrade.pl > index 4b5e895809b..284b2ef07e0 100644 > --- a/src/bin/pg_upgrade/t/002_pg_upgrade.pl > +++ b/src/bin/pg_upgrade/t/002_pg_upgrade.pl > @@ -225,6 +225,8 @@ $oldnode->init(%old_node_params); > # Override log_statement=all set by Cluster.pm. This avoids large amounts > # of log traffic that slow this test down even more when run under valgrind. > $oldnode->append_conf('postgresql.conf', 'log_statement = none'); > +$oldnode->append_conf('postgresql.conf', 'wal_level = replica'); > +$oldnode->append_conf('postgresql.conf', 'max_wal_senders = 1'); Yeah even I didn't like it to set wal_level to replica only for this test, it makes sense to write it in 100_bugs.pl, here is a updated patch for HEAD only, I have also updated the commit message, once we get agreement on this I will prepare a backbranch patches. -- Regards, Dilip Kumar Google
Attachments
On Mon, Aug 18, 2025 at 11:36 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Aug 18, 2025 at 9:04 AM vignesh C <vignesh21@gmail.com> wrote: > > This change will run not just the newly added test, but the entire > > regression suite under the new configuration. Should we consider > > moving this test elsewhere such as into the 100_bugs: > > diff --git a/src/bin/pg_upgrade/t/002_pg_upgrade.pl > > b/src/bin/pg_upgrade/t/002_pg_upgrade.pl > > index 4b5e895809b..284b2ef07e0 100644 > > --- a/src/bin/pg_upgrade/t/002_pg_upgrade.pl > > +++ b/src/bin/pg_upgrade/t/002_pg_upgrade.pl > > @@ -225,6 +225,8 @@ $oldnode->init(%old_node_params); > > # Override log_statement=all set by Cluster.pm. This avoids large amounts > > # of log traffic that slow this test down even more when run under valgrind. > > $oldnode->append_conf('postgresql.conf', 'log_statement = none'); > > +$oldnode->append_conf('postgresql.conf', 'wal_level = replica'); > > +$oldnode->append_conf('postgresql.conf', 'max_wal_senders = 1'); > > Yeah even I didn't like it to set wal_level to replica only for this > test, it makes sense to write it in 100_bugs.pl, here is a updated > patch for HEAD only, I have also updated the commit message, once we > get agreement on this I will prepare a backbranch patches. > Updated test case, cleanup reported by 1) pgperltidy 2) dropped the database post test and 3) changed subscription name -- Regards, Dilip Kumar Google
Attachments
On Mon, Aug 18, 2025 at 12:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Aug 18, 2025 at 11:36 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Mon, Aug 18, 2025 at 9:04 AM vignesh C <vignesh21@gmail.com> wrote: > > > This change will run not just the newly added test, but the entire > > > regression suite under the new configuration. Should we consider > > > moving this test elsewhere such as into the 100_bugs: > > > diff --git a/src/bin/pg_upgrade/t/002_pg_upgrade.pl > > > b/src/bin/pg_upgrade/t/002_pg_upgrade.pl > > > index 4b5e895809b..284b2ef07e0 100644 > > > --- a/src/bin/pg_upgrade/t/002_pg_upgrade.pl > > > +++ b/src/bin/pg_upgrade/t/002_pg_upgrade.pl > > > @@ -225,6 +225,8 @@ $oldnode->init(%old_node_params); > > > # Override log_statement=all set by Cluster.pm. This avoids large amounts > > > # of log traffic that slow this test down even more when run under valgrind. > > > $oldnode->append_conf('postgresql.conf', 'log_statement = none'); > > > +$oldnode->append_conf('postgresql.conf', 'wal_level = replica'); > > > +$oldnode->append_conf('postgresql.conf', 'max_wal_senders = 1'); > > > > Yeah even I didn't like it to set wal_level to replica only for this > > test, it makes sense to write it in 100_bugs.pl, here is a updated > > patch for HEAD only, I have also updated the commit message, once we > > get agreement on this I will prepare a backbranch patches. > > > > Updated test case, cleanup reported by 1) pgperltidy 2) dropped the > database post test and 3) changed subscription name > Looks mostly good. How about slightly changing the comment as in attached? -- With Regards, Amit Kapila.
Attachments
On Mon, Aug 18, 2025 at 2:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Looks mostly good. How about slightly changing the comment as in attached? Your suggestion LGTM, I am making back branch patches. I observed another behavior change in the test in v13. The $sdterr is matching with error details not with the error code, so for v13 i have to modify the error comparison string as [1] whereas with other versions it's like [2]. But if we execute this test from psql then I can see the same error in v13 as well[3]. Not sure why perl is behaving differently. Maybe there is a way to change this but I could not figure it out. And another difference I have noticed is that in v13 I had to give complete publisher_connstr that's maybe it's not the starting server on the default port. From this I realized that maybe for other branches where it is working fine with just dbname, we should also include publisher_connstr as shown [4] just to ensure we are connecting to the right node, thoughts? [1] +like( + $stderr, + qr/ERROR: replication slot "regress_sub1" does not exist/, + "could not drop replication slot: error message"); [2] +like( + $stderr, + qr/ERROR: could not drop replication slot "regress_sub1" on publisher/, + "could not drop replication slot: error message"); [3] postgres[2045257]=# DROP SUBSCRIPTION regress_sub1; ERROR: XX000: could not drop the replication slot "regress_sub1" on publisher DETAIL: The error was: ERROR: replication slot "regress_sub1" does not exist LOCATION: DropSubscription, subscriptioncmds.c:1008 [4] +$publisher_connstr = $node_publisher->connstr . ' dbname=regress_db'; +$node_publisher->safe_psql( + 'postgres', qq( + CREATE DATABASE regress_db; + CREATE SUBSCRIPTION regress_sub1 CONNECTION '$publisher_connstr' PUBLICATION regress_pub WITH (connect=false); +)); Attaching only v13 and head patches, once we have a way forward I will send other patches as well. -- Regards, Dilip Kumar Google
Attachments
On Mon, 18 Aug 2025 at 19:52, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Aug 18, 2025 at 2:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > Looks mostly good. How about slightly changing the comment as in attached? > > Your suggestion LGTM, I am making back branch patches. I observed > another behavior change in the test in v13. The $sdterr is matching > with error details not with the error code, so for v13 i have to > modify the error comparison string as [1] whereas with other versions > it's like [2]. But if we execute this test from psql then I can see > the same error in v13 as well[3]. Not sure why perl is behaving > differently. The error message in PG13 and master branch is different, that is the reason the test was failing. The error in master is: could not drop replication slot whereas the error message in PG13 is: could not drop the replication slot. The test passes with the below change in PG13: --- a/src/test/subscription/t/100_bugs.pl +++ b/src/test/subscription/t/100_bugs.pl @@ -315,7 +315,7 @@ my ($ret, $stdout, $stderr) = isnt($ret, 0, "replication slot does not exist: exit code not 0"); like( $stderr, - qr/ERROR: replication slot "regress_sub1" does not exist/, + qr/ERROR: could not drop the replication slot "regress_sub1" on publisher/, "could not drop replication slot: error message"); $node_publisher->safe_psql('postgres', "DROP DATABASE regress_db"); Regards, Vignesh
On Mon, Aug 18, 2025 at 10:11 PM vignesh C <vignesh21@gmail.com> wrote: > > On Mon, 18 Aug 2025 at 19:52, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Mon, Aug 18, 2025 at 2:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > Looks mostly good. How about slightly changing the comment as in attached? > > > > Your suggestion LGTM, I am making back branch patches. I observed > > another behavior change in the test in v13. The $sdterr is matching > > with error details not with the error code, so for v13 i have to > > modify the error comparison string as [1] whereas with other versions > > it's like [2]. But if we execute this test from psql then I can see > > the same error in v13 as well[3]. Not sure why perl is behaving > > differently. > > The error message in PG13 and master branch is different, that is the > reason the test was failing. > The error in master is: could not drop replication slot whereas the > error message in PG13 is: could not drop the replication slot. > The test passes with the below change in PG13: > --- a/src/test/subscription/t/100_bugs.pl > +++ b/src/test/subscription/t/100_bugs.pl > @@ -315,7 +315,7 @@ my ($ret, $stdout, $stderr) = > isnt($ret, 0, "replication slot does not exist: exit code not 0"); > like( > $stderr, > - qr/ERROR: replication slot "regress_sub1" does not exist/, > + qr/ERROR: could not drop the replication slot "regress_sub1" > on publisher/, > "could not drop replication slot: error message"); > > $node_publisher->safe_psql('postgres', "DROP DATABASE regress_db"); Oh my bad, the error codes were so similar I didn't notice the difference, I probably would have just done string search instead of comparing by reading. It's a classic case where a machine's precision beats a human's eyes :) PFA, patches for head to v13. I have also changed the connection string to $node_publisher->connstr . ' dbname=regress_db'; instead of assuming it will start on the default port. make check-world is passing on all the branches. -- Regards, Dilip Kumar Google
Attachments
- v8-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-walsender-pro_head.patch
- v8-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-walsender-pro_v14.patch
- v8-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-walsender-pro_v16_17_18.patch
- v8-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-walsender-pro_v13.patch
- v8-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-walsender-pro_v15.patch
On Tue, Aug 19, 2025 at 8:24 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > PFA, patches for head to v13. > Pushed. -- With Regards, Amit Kapila.
On Tue, Aug 19, 2025 at 3:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Aug 19, 2025 at 8:24 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > PFA, patches for head to v13. > > > > Pushed. Thanks. -- Regards, Dilip Kumar Google