Discussion: BUG #18988: DROP SUBSCRIPTION locks not-yet-accessed database
The following bug has been logged on the website:

Bug reference:      18988
Logged by:          Alexander Lakhin
Email address:      exclusion@gmail.com
PostgreSQL version: 18beta1
Operating system:   Ubuntu 24.04
Description:

The following script:

createdb ndb
echo "
CREATE SUBSCRIPTION testsub CONNECTION 'dbname=ndb' PUBLICATION testpub WITH
(connect = false);
" | psql

echo "
DROP SUBSCRIPTION testsub
" | psql &
sleep 1
timeout 30 psql ndb -c "SELECT 1" || echo "TIMEOUT"

makes DROP SUBSCRIPTION get stuck waiting for a connection in order to drop a
slot, while that connection is waiting for a lock on relation 6100
(pg_subscription), which is held by DROP SUBSCRIPTION:

law      1545967 1545946  0 21:10 ?  00:00:00 postgres: law regression [local] DROP SUBSCRIPTION
law      1545968 1545946  0 21:10 ?  00:00:00 postgres: walsender law ndb [local] startup waiting

With debug_discard_caches = 1, or under some lucky circumstances (which I
encountered), this leads to an inability to connect to any database.

Reproduced on REL_13_STABLE .. master.
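While the report stands on its own, the blocked pair is easy to confirm from a separate session that is already connected to an existing database. The query below is only an illustration of that check (it is not part of the original report) and relies on the standard pg_stat_activity view and pg_blocking_pids():

```
-- From a session connected to an already-initialized database,
-- list the backends stuck behind a lock and the PID(s) blocking them.
SELECT a.pid,
       a.backend_type,
       a.wait_event_type,
       a.wait_event,
       pg_blocking_pids(a.pid) AS blocked_by,
       a.query
FROM pg_stat_activity a
WHERE cardinality(pg_blocking_pids(a.pid)) > 0;
```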
On Thu, 17 Jul 2025 at 00:36, PG Bug reporting form <noreply@postgresql.org> wrote:
>
> The following bug has been logged on the website:
>
> Bug reference:      18988
> Logged by:          Alexander Lakhin
> Email address:      exclusion@gmail.com
> PostgreSQL version: 18beta1
> Operating system:   Ubuntu 24.04
> Description:
>
> The following script:
> createdb ndb
> echo "
> CREATE SUBSCRIPTION testsub CONNECTION 'dbname=ndb' PUBLICATION testpub WITH
> (connect = false);
> " | psql
>
> echo "
> DROP SUBSCRIPTION testsub
> " | psql &
> sleep 1
> timeout 30 psql ndb -c "SELECT 1" || echo "TIMEOUT"
>
> makes DROP SUBSCRIPTION get stuck waiting for a connection in order to drop a
> slot, while that connection is waiting for a lock on relation 6100
> (pg_subscription), which is held by DROP SUBSCRIPTION:
> law 1545967 1545946 0 21:10 ? 00:00:00 postgres: law regression [local] DROP SUBSCRIPTION
> law 1545968 1545946 0 21:10 ? 00:00:00 postgres: walsender law ndb [local] startup waiting
>
> With debug_discard_caches = 1, or under some lucky circumstances (which I
> encountered), this leads to an inability to connect to any database.
>
> Reproduced on REL_13_STABLE .. master.

Thanks, I was able to reproduce the issue using the steps provided.
The problem occurs because, when dropping a subscription, we take an
AccessExclusiveLock on the pg_subscription system table to prevent the
launcher from restarting the worker. During this process, we also
attempt to connect to the publisher in order to drop the replication
slot. As we are connecting to a newly created database, it may not yet
have initialized its catalog caches. As part of backend startup, it
attempts to build the cache hierarchy via:
RelationCacheInitializePhase3 → InitCatalogCachePhase2 → InitCatCachePhase2

This cache initialization requires acquiring a shared lock on
pg_subscription, since it is one of the syscache-backed catalog tables.
But that shared lock is blocked by the AccessExclusiveLock already held
by the dropping process. As a result, the new backend hangs waiting for
the lock, and the original DROP SUBSCRIPTION process cannot proceed,
leading to a self-blocking scenario.

In this specific case, no replication slot was created during
subscription creation, as the connect option was specified as false.
Therefore, I believe the system should skip connecting to the publisher
when dropping the subscription. I've attached a patch that addresses
this behavior. Thoughts?

Regards,
Vignesh
Attachments
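The lock conflict described above can also be seen directly in pg_locks; a minimal illustration follows (relation 6100 is pg_subscription). Run it from a database you have already connected to before, otherwise the diagnosing session itself can block on the same cache initialization:

```
-- DROP SUBSCRIPTION holds AccessExclusiveLock on pg_subscription,
-- while the starting backend waits for AccessShareLock on it.
SELECT l.pid, l.mode, l.granted, a.backend_type, a.query
FROM pg_locks l
LEFT JOIN pg_stat_activity a USING (pid)
WHERE l.locktype = 'relation'
  AND l.relation = 'pg_subscription'::regclass;
```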
On Wed, Jul 30, 2025 at 4:24 PM vignesh C <vignesh21@gmail.com> wrote: > > On Thu, 17 Jul 2025 at 00:36, PG Bug reporting form > <noreply@postgresql.org> wrote: > > > > The following bug has been logged on the website: > > > > Bug reference: 18988 > > Logged by: Alexander Lakhin > > Email address: exclusion@gmail.com > > PostgreSQL version: 18beta1 > > Operating system: Ubuntu 24.04 > > Description: > > > > The following script: > > createdb ndb > > echo " > > CREATE SUBSCRIPTION testsub CONNECTION 'dbname=ndb' PUBLICATION testpub WITH > > (connect = false); > > " | psql > > > > echo " > > DROP SUBSCRIPTION testsub > > " | psql & > > sleep 1 > > timeout 30 psql ndb -c "SELECT 1" || echo "TIMEOUT" > > > > makes DROP SUBSCRIPTION stuck on waiting for a connection to drop a slot, > > while this connection is waiting for a lock for relation 6100 > > (pg_subscription), locked by DROP SUBSCRIPTION: > > law 1545967 1545946 0 21:10 ? 00:00:00 postgres: law regression > > [local] DROP SUBSCRIPTION > > law 1545968 1545946 0 21:10 ? 00:00:00 postgres: walsender law > > ndb [local] startup waiting > > > > With debug_discard_caches = 1 or under some lucky circumstances (I > > encountered these), this leads to inability to connect to any database. > > > > Reproduced on REL_13_STABLE .. master. > > Thanks, I was able to reproduce the issue using the steps provided. > The problem occurs because: When dropping a subscription, it takes an > AccessExclusiveLock on the pg_subscription system tables to prevent > the launcher from restarting the worker. During this process, it also > attempts to connect to the publisher in order to drop the replication > slot. As we are connecting to a newly created database, it may not yet > have initialized its catalog caches. As part of the backend startup, > it attempts to build the cache hierarchy via: > RelationCacheInitializePhase3 → InitCatalogCachePhase2 → > InitCatCachePhase2 This cache initialization requires acquiring a > shared lock on pg_subscription, since it is one of the syscache-backed > catalog tables. But that shared lock is blocked by the > AccessExclusiveLock already held by the dropping process. As a result, > the new backend hangs waiting for the lock, and the original DROP > SUBSCRIPTION process cannot proceed, leading to a self-blocking > scenario. > > In this specific case, no replication slot was created during > subscription creation as the connect option was specified as false. > Therefore, I believe the system should skip connecting to the > publisher when dropping the subscription. I've attached a patch that > addresses this behavior. Thoughts? I think this fix looks correct, while testing I realize that now we need an additional step before we can enable the subscription, so I think we should put that additional step in the error hint, as attached, this top up patch can be applied on top of your patch. -- Regards, Dilip Kumar Google
Attachments
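For context, the extra manual step such a hint would point to corresponds to the recovery sequence already suggested by CREATE SUBSCRIPTION's warning for connect = false. A sketch, assuming the slot is named after the subscription and the pgoutput plugin is used:

```
-- On the publisher: create the replication slot the subscription expects.
SELECT pg_create_logical_replication_slot('testsub', 'pgoutput');

-- On the subscriber: then enable and refresh the subscription.
ALTER SUBSCRIPTION testsub ENABLE;
ALTER SUBSCRIPTION testsub REFRESH PUBLICATION;
```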
RE: BUG #18988: DROP SUBSCRIPTION locks not-yet-accessed database
From: "Hayato Kuroda (Fujitsu)"
Dear Vignesh, Dilip,

I found another corner case:

```
postgres=# CREATE SUBSCRIPTION sub CONNECTION 'dbname=db1 user=postgres port=5431' PUBLICATION pub1 WITH (connect=false,slot_name=sub);
WARNING:  subscription was created, but is not connected
HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
CREATE SUBSCRIPTION
postgres=# DROP SUBSCRIPTION sub ;
... (won't return)
```

This is because one can still explicitly specify the slot_name while creating
the subscription. Another pattern is to run an ALTER SUBSCRIPTION SET (slot_name)
command after CREATE SUBSCRIPTION WITH (connect=false).

Should we fix this case? If so, how?

Best regards,
Hayato Kuroda
FUJITSU LIMITED
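The second pattern mentioned above, spelled out as a sketch (object names follow the example in this message):

```
-- Create the subscription without connecting and without a slot name ...
CREATE SUBSCRIPTION sub CONNECTION 'dbname=db1 user=postgres port=5431'
    PUBLICATION pub1 WITH (connect = false);

-- ... then attach a slot name afterwards.  A later DROP SUBSCRIPTION
-- will again try to connect to the publisher to drop the slot.
ALTER SUBSCRIPTION sub SET (slot_name = 'sub');
DROP SUBSCRIPTION sub;
```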
On Thu, Jul 31, 2025 at 2:34 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
>
> Dear Vignesh, Dilip,
>
> I found another corner case:
>
> ```
> postgres=# CREATE SUBSCRIPTION sub CONNECTION 'dbname=db1 user=postgres port=5431' PUBLICATION pub1 WITH (connect=false,slot_name=sub);
> WARNING:  subscription was created, but is not connected
> HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
> CREATE SUBSCRIPTION
> postgres=# DROP SUBSCRIPTION sub ;
> ... (won't return)
> ```
>
> This is because one can still explicitly specify the slot_name while creating
> the subscription. Another pattern is to run an ALTER SUBSCRIPTION SET (slot_name)
> command after CREATE SUBSCRIPTION WITH (connect=false).
>
> Should we fix this case? If so, how?

IMHO, what we should do is to set a flag in the subscription catalog recording
whether connect was true or not, and DROP SUBSCRIPTION should not try to drop
the slot if connect was false. Thoughts?

--
Regards,
Dilip Kumar
Google
On Thu, Jul 31, 2025 at 5:14 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Jul 31, 2025 at 2:34 PM Hayato Kuroda (Fujitsu) > <kuroda.hayato@fujitsu.com> wrote: > > > > Dear Vignesh, Dilip, > > > > I found another corner case: > > > > ``` > > postgres=# CREATE SUBSCRIPTION sub CONNECTION 'dbname=db1 user=postgres port=5431' PUBLICATION pub1 WITH (connect=false,slot_name=sub); > > WARNING: subscription was created, but is not connected > > HINT: To initiate replication, you must manually create the replication slot, enable the subscription, and refresh thesubscription. > > CREATE SUBSCRIPTION > > postgres=# DROP SUBSCRIPTION sub ; > > ... (won't return) > > ``` > > > > Because still can explicitly specify the slot_name while creating the subscription. > > Another pattern is to run ALTER SUBSCRIPTION SET (slot_name) command after the > > CREATE SUBSCRIPTION WITH (connect=false);. > > > > Should we fix the case? If so, how? > > IMHO, what we should do is to set a flag in subscription that whether > the connect is true or not, and drop subscription should not try to > drop the slot if the connect is false, thoughts? Maybe not, since this exists in previous branches as well and we cannot alter the system catalog in back branches. Let me put more thoughts on this. -- Regards, Dilip Kumar Google
On Thu, 31 Jul 2025 at 14:34, Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Vignesh, Dilip, > > I found another corner case: > > ``` > postgres=# CREATE SUBSCRIPTION sub CONNECTION 'dbname=db1 user=postgres port=5431' PUBLICATION pub1 WITH (connect=false,slot_name=sub); > WARNING: subscription was created, but is not connected > HINT: To initiate replication, you must manually create the replication slot, enable the subscription, and refresh thesubscription. > CREATE SUBSCRIPTION > postgres=# DROP SUBSCRIPTION sub ; > ... (won't return) > ``` > > Because still can explicitly specify the slot_name while creating the subscription. > Another pattern is to run ALTER SUBSCRIPTION SET (slot_name) command after the > CREATE SUBSCRIPTION WITH (connect=false);. > > Should we fix the case? If so, how? An alternative approach would be to acquire an AccessShareLock on pg_subscription, and then explicitly lock the target subscription row using LockSharedObject with AccessExclusiveLock while dropping the subscription. On the launcher side, before starting a worker, it should similarly acquire an AccessExclusiveLock on the corresponding subscription row using LockSharedObject. Once the lock is acquired, it must revalidate that the subscription still exists, to prevent race conditions with concurrent drops, before proceeding to start the worker. Thoughts? Regards, Vignesh
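Since LockSharedObject() takes a shared-object lock keyed by (pg_subscription, subscription OID) rather than a relation lock, the effect of this scheme can be observed from SQL as 'object'-type entries in pg_locks. An illustrative query, not taken from any patch in this thread:

```
-- Per-subscription object locks: classid points at pg_subscription,
-- objid is the OID of the locked subscription.
SELECT l.pid, l.objid, s.subname, l.mode, l.granted
FROM pg_locks l
LEFT JOIN pg_subscription s ON s.oid = l.objid
WHERE l.locktype = 'object'
  AND l.classid = 'pg_subscription'::regclass;
```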
On Mon, Aug 4, 2025 at 9:07 AM vignesh C <vignesh21@gmail.com> wrote: > > On Thu, 31 Jul 2025 at 14:34, Hayato Kuroda (Fujitsu) > <kuroda.hayato@fujitsu.com> wrote: > > > > Dear Vignesh, Dilip, > > > > I found another corner case: > > > > ``` > > postgres=# CREATE SUBSCRIPTION sub CONNECTION 'dbname=db1 user=postgres port=5431' PUBLICATION pub1 WITH (connect=false,slot_name=sub); > > WARNING: subscription was created, but is not connected > > HINT: To initiate replication, you must manually create the replication slot, enable the subscription, and refresh thesubscription. > > CREATE SUBSCRIPTION > > postgres=# DROP SUBSCRIPTION sub ; > > ... (won't return) > > ``` > > > > Because still can explicitly specify the slot_name while creating the subscription. > > Another pattern is to run ALTER SUBSCRIPTION SET (slot_name) command after the > > CREATE SUBSCRIPTION WITH (connect=false);. > > > > Should we fix the case? If so, how? > > An alternative approach would be to acquire an AccessShareLock on > pg_subscription, and then explicitly lock the target subscription row > using LockSharedObject with AccessExclusiveLock while dropping the > subscription. On the launcher side, before starting a worker, it > should similarly acquire an AccessExclusiveLock on the corresponding > subscription row using LockSharedObject. Once the lock is acquired, it > must revalidate that the subscription still exists, to prevent race > conditions with concurrent drops, before proceeding to start the > worker. Thanks for the idea, so currently during drop subscription we are already doing LockSharedObject on the subscription in AccessExclusiveLock. So now we should try to acquire AccessExclusiveLock on the launcher side and then revalidate the object existence, let me try this and post a patch in a while? -- Regards, Dilip Kumar Google
On Mon, Aug 4, 2025 at 9:29 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Aug 4, 2025 at 9:07 AM vignesh C <vignesh21@gmail.com> wrote: > > > > On Thu, 31 Jul 2025 at 14:34, Hayato Kuroda (Fujitsu) > > <kuroda.hayato@fujitsu.com> wrote: > > > > > > Dear Vignesh, Dilip, > > > > > > I found another corner case: > > > > > > ``` > > > postgres=# CREATE SUBSCRIPTION sub CONNECTION 'dbname=db1 user=postgres port=5431' PUBLICATION pub1 WITH (connect=false,slot_name=sub); > > > WARNING: subscription was created, but is not connected > > > HINT: To initiate replication, you must manually create the replication slot, enable the subscription, and refreshthe subscription. > > > CREATE SUBSCRIPTION > > > postgres=# DROP SUBSCRIPTION sub ; > > > ... (won't return) > > > ``` > > > > > > Because still can explicitly specify the slot_name while creating the subscription. > > > Another pattern is to run ALTER SUBSCRIPTION SET (slot_name) command after the > > > CREATE SUBSCRIPTION WITH (connect=false);. > > > > > > Should we fix the case? If so, how? > > > > An alternative approach would be to acquire an AccessShareLock on > > pg_subscription, and then explicitly lock the target subscription row > > using LockSharedObject with AccessExclusiveLock while dropping the > > subscription. On the launcher side, before starting a worker, it > > should similarly acquire an AccessExclusiveLock on the corresponding > > subscription row using LockSharedObject. Once the lock is acquired, it > > must revalidate that the subscription still exists, to prevent race > > conditions with concurrent drops, before proceeding to start the > > worker. > > Thanks for the idea, so currently during drop subscription we are > already doing LockSharedObject on the subscription in > AccessExclusiveLock. So now we should try to acquire > AccessExclusiveLock on the launcher side and then revalidate the > object existence, let me try this and post a patch in a while? > I have worked on this and produced a first version of patch, let's see what others think about this idea. It would have been better if we could use SysCache for rechecking the subscription, but since we are not connected to the database in the launcher we can not use the SysCache, at least that's what I think. -- Regards, Dilip Kumar Google
Attachments
On 2025-Aug-04, Dilip Kumar wrote: > I have worked on this and produced a first version of patch, let's see > what others think about this idea. It would have been better if we > could use SysCache for rechecking the subscription, but since we are > not connected to the database in the launcher we can not use the > SysCache, at least that's what I think. I think it's reasonable to recheck after locking. There's a comment in DropSubscription that says we get AEL, which is no longer true. In is_subscription_exists() you should use the index on OID instead of seqscanning the catalog without a scankey; also I think the name ought to be "does" rather than "is". I think it's really odd that that function opens and closes a transaction; sounds to me that something like that really belongs in the caller (frankly the same is true with the other function that your comment references). Why isn't systable_beginscan being used to scan the catalog? I think with this coding, the resource owner for this new lock is NULL. Is this really a good approach? Maybe there should be a resowner here. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "Always assume the user will do much worse than the stupidest thing you can imagine." (Julien PUYDT)
On Mon, Aug 4, 2025 at 6:20 PM Álvaro Herrera <alvherre@kurilemu.de> wrote:
>
> On 2025-Aug-04, Dilip Kumar wrote:
>
> > I have worked on this and produced a first version of patch, let's see
> > what others think about this idea. It would have been better if we
> > could use SysCache for rechecking the subscription, but since we are
> > not connected to the database in the launcher we can not use the
> > SysCache, at least that's what I think.
>
> I think it's reasonable to recheck after locking.  There's a comment in
> DropSubscription that says we get AEL, which is no longer true.

Right, will remove that.

> In is_subscription_exists() you should use the index on OID instead of
> seqscanning the catalog without a scankey;

I thought that since the launcher is not connected to a database we would
not be able to open the index relation. Otherwise we could just call
SearchSysCache1(SUBSCRIPTIONOID, ObjectIdGetDatum(subid)). Maybe the
launcher is not connected only because it was not required so far, and we
could just connect it to template1?

> also I think the name ought to be "does" rather than "is".

Okay.

> I think it's really odd that that function opens and closes a
> transaction; sounds to me that something like that really belongs in
> the caller (frankly the same is true with the other function that your
> comment references).  Why isn't systable_beginscan being used to scan
> the catalog?

You mean for this function or for get_subscription_list() as well?
Yeah, logically systable_beginscan() sounds better.

> I think with this coding, the resource owner for this new lock is NULL.
> Is this really a good approach?  Maybe there should be a resowner here.

As you suggested, we should move the transaction to the caller and
start it before LockSharedObject(), so that we acquire the lock under
the TopTransactionResourceOwner?

--
Regards,
Dilip Kumar
Google
On Mon, Aug 4, 2025 at 7:08 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Aug 4, 2025 at 6:20 PM Álvaro Herrera <alvherre@kurilemu.de> wrote: > > > > On 2025-Aug-04, Dilip Kumar wrote: > > > > > I have worked on this and produced a first version of patch, let's see > > > what others think about this idea. It would have been better if we > > > could use SysCache for rechecking the subscription, but since we are > > > not connected to the database in the launcher we can not use the > > > SysCache, at least that's what I think. > > > > I think it's reasonable to recheck after locking. There's a comment in > > DropSubscription that says we get AEL, which is no longer true. > > Right, will remove that. > > In > > is_subscription_exists() you should use the index on OID instead of > > seqscanning the catalog without a scankey; > > I thought since launcher is not connected to the database we will not > be able to open the index relation. Otherwise we may just call > SearchSysCache1(SUBSCRIPTIONOID, ObjectIdGetDatum(subid)); Maybe this > is not connected because it was not required so far and we can just > connect it to template1 ? > > > also I think the name ought > > to be "does" rather than "is". > > Okay > > I think it's really odd that that > > function opens and closes a transaction; sounds to me that something > > like that really belongs in the caller (frankly the same is true with > > the other function that your comment references). Why isn't > > systable_beginscan being used to scan the catalog? > > You mean for this function or for get_subscription_list() as well, > yeah logically systable_beginscan() sounds better. > > > I think with this coding, the resource owner for this new lock is NULL. > > Is this really a good approach? Maybe there should be a resowner here. > > As you suggested we should move the transaction to the caller and > start it before LockSharedObject() so that we will acquire the lock > under the TopTransactionResourceOwner ? Here is revised version based on what I proposed here - I have removed the comment in DropSubscription where we acquire the lock, as mentioning the ASL is not interesting anymore, instead I am explaining in launcher why we are acquiring shared object lock. - Connected launcher to "postgres" database so that we can do syscache lookup - got rid of "is_subscription_exists" as we are directly validating using syscache lookup - Didn't do anything with existing function i.e. get_subscription_list -- Regards, Dilip Kumar Google
Attachments
On Tue, Aug 5, 2025 at 11:52 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Aug 4, 2025 at 7:08 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Mon, Aug 4, 2025 at 6:20 PM Álvaro Herrera <alvherre@kurilemu.de> wrote: > > > > > > On 2025-Aug-04, Dilip Kumar wrote: > > > > > > > I have worked on this and produced a first version of patch, let's see > > > > what others think about this idea. It would have been better if we > > > > could use SysCache for rechecking the subscription, but since we are > > > > not connected to the database in the launcher we can not use the > > > > SysCache, at least that's what I think. > > > > > > I think it's reasonable to recheck after locking. There's a comment in > > > DropSubscription that says we get AEL, which is no longer true. > > > > Right, will remove that. > > > > In > > > is_subscription_exists() you should use the index on OID instead of > > > seqscanning the catalog without a scankey; > > > > I thought since launcher is not connected to the database we will not > > be able to open the index relation. Otherwise we may just call > > SearchSysCache1(SUBSCRIPTIONOID, ObjectIdGetDatum(subid)); Maybe this > > is not connected because it was not required so far and we can just > > connect it to template1 ? > > > > > > also I think the name ought > > > to be "does" rather than "is". > > > > Okay > > > > I think it's really odd that that > > > function opens and closes a transaction; sounds to me that something > > > like that really belongs in the caller (frankly the same is true with > > > the other function that your comment references). Why isn't > > > systable_beginscan being used to scan the catalog? > > > > You mean for this function or for get_subscription_list() as well, > > yeah logically systable_beginscan() sounds better. > > > > > I think with this coding, the resource owner for this new lock is NULL. > > > Is this really a good approach? Maybe there should be a resowner here. > > > > As you suggested we should move the transaction to the caller and > > start it before LockSharedObject() so that we will acquire the lock > > under the TopTransactionResourceOwner ? > > Here is revised version based on what I proposed here > > - I have removed the comment in DropSubscription where we acquire the > lock, as mentioning the ASL is not interesting anymore, instead I am > explaining in launcher why we are acquiring shared object lock. > - Connected launcher to "postgres" database so that we can do syscache lookup > Won't that add a new restriction that one can't drop postgres database after this? BTW, isn't it better to acquire the share_lock on subscription during worker initialization (InitializeLogRepWorker()) where we already checking whether the subscription is dropped by that time? I think if we don't acquire the lock on subscription during launcher or during InitializeLogRepWorker(), there is a risk that DropSubscription would have reached the place where it would have tried to stop the workers and launcher will launch the worker after that point. Then there is a possibility of dangling worker because the GetSubscription() can still return valid value as the transaction in which we are dropping a subscription could still be in-progress. -- With Regards, Amit Kapila.
On Wed, Aug 6, 2025 at 10:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Won't that add a new restriction that one can't drop postgres database
> after this?

That's correct, so maybe this is not acceptable IMHO.

> BTW, isn't it better to acquire the share_lock on subscription during
> worker initialization (InitializeLogRepWorker()) where we already
> checking whether the subscription is dropped by that time? I think if
> we don't acquire the lock on subscription during launcher or during
> InitializeLogRepWorker(), there is a risk that DropSubscription would
> have reached the place where it would have tried to stop the workers
> and launcher will launch the worker after that point. Then there is a
> possibility of dangling worker because the GetSubscription() can still
> return valid value as the transaction in which we are dropping a
> subscription could still be in-progress.

Here is my thought on these two approaches. If we lock in the launcher
and check there, we avoid launching the worker altogether, but for that
we have to revalidate the subscription using a sequential scan on
pg_subscription, as we cannot open additional indexes while not connected
to any database. OTOH, if we acquire the lock in InitializeLogRepWorker(),
we don't need any additional scan, but we have to launch the worker first
and, if the subscription recheck shows it has been dropped, exit the worker.

I think either way is fine because this is not a common performance path.
I prefer what you proposed, as it requires less code: we are already
rechecking the subscription there and just need to acquire the shared
object lock in InitializeLogRepWorker(). Let's see what others think;
meanwhile I will modify the code with this approach as well.

--
Regards,
Dilip Kumar
Google
On Wed, Aug 6, 2025 at 11:07 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Aug 6, 2025 at 10:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > Won't that add a new restriction that one can't drop postgres database > > after this? > > That's correct, so maybe this is not acceptable IMHO. > > > BTW, isn't it better to acquire the share_lock on subscription during > > worker initialization (InitializeLogRepWorker()) where we already > > checking whether the subscription is dropped by that time? I think if > > we don't acquire the lock on subscription during launcher or during > > InitializeLogRepWorker(), there is a risk that DropSubscription would > > have reached the place where it would have tried to stop the workers > > and launcher will launch the worker after that point. Then there is a > > possibility of dangling worker because the GetSubscription() can still > > return valid value as the transaction in which we are dropping a > > subscription could still be in-progress. > > Here is my thought on these 2 approaches, so if we lock in launcher > and check there we avoid launching the worker altogether but for that > we will have to revalidate the subscription using sequence scan on > pg_subcription as we can not open additional indexes as we are not > connected to any database OTOH if we acquire lock in > InitializeLogRepWorker() we don't need to do any additional scan but > we will have to launch the worker and after that if subscription > recheck shows its dropped we will exit the worker. > > I think either way it's fine because this is not a very common > performance path, I prefer what you proposed as we will have to add > less code in this case, because we are already checking the > subscription and just need to acquire the shared object lock in > InitializeLogRepWorker(), lets see what other thinks, meanwhile I will > modify the code with this approach as well. Here is a patch with a different approach. IMHO with this approach the code change is much simpler. -- Regards, Dilip Kumar Google
Attachments
On Wed, Aug 6, 2025 at 11:38 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Aug 6, 2025 at 11:07 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Wed, Aug 6, 2025 at 10:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > Won't that add a new restriction that one can't drop postgres database > > > after this? > > > > That's correct, so maybe this is not acceptable IMHO. > > > > > BTW, isn't it better to acquire the share_lock on subscription during > > > worker initialization (InitializeLogRepWorker()) where we already > > > checking whether the subscription is dropped by that time? I think if > > > we don't acquire the lock on subscription during launcher or during > > > InitializeLogRepWorker(), there is a risk that DropSubscription would > > > have reached the place where it would have tried to stop the workers > > > and launcher will launch the worker after that point. Then there is a > > > possibility of dangling worker because the GetSubscription() can still > > > return valid value as the transaction in which we are dropping a > > > subscription could still be in-progress. > > > > Here is my thought on these 2 approaches, so if we lock in launcher > > and check there we avoid launching the worker altogether but for that > > we will have to revalidate the subscription using sequence scan on > > pg_subcription as we can not open additional indexes as we are not > > connected to any database OTOH if we acquire lock in > > InitializeLogRepWorker() we don't need to do any additional scan but > > we will have to launch the worker and after that if subscription > > recheck shows its dropped we will exit the worker. > > > > I think either way it's fine because this is not a very common > > performance path, I prefer what you proposed as we will have to add > > less code in this case, because we are already checking the > > subscription and just need to acquire the shared object lock in > > InitializeLogRepWorker(), lets see what other thinks, meanwhile I will > > modify the code with this approach as well. > > Here is a patch with a different approach. IMHO with this approach > the code change is much simpler. I have slightly modified the patch, basically removing explicitly unlocking the shared object as that will be taken care by transaction commit / proc_exit() -- Regards, Dilip Kumar Google
Attachments
On Thu, 7 Aug 2025 at 19:18, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Aug 6, 2025 at 11:38 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Wed, Aug 6, 2025 at 11:07 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Wed, Aug 6, 2025 at 10:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > Won't that add a new restriction that one can't drop postgres database > > > > after this? > > > > > > That's correct, so maybe this is not acceptable IMHO. > > > > > > > BTW, isn't it better to acquire the share_lock on subscription during > > > > worker initialization (InitializeLogRepWorker()) where we already > > > > checking whether the subscription is dropped by that time? I think if > > > > we don't acquire the lock on subscription during launcher or during > > > > InitializeLogRepWorker(), there is a risk that DropSubscription would > > > > have reached the place where it would have tried to stop the workers > > > > and launcher will launch the worker after that point. Then there is a > > > > possibility of dangling worker because the GetSubscription() can still > > > > return valid value as the transaction in which we are dropping a > > > > subscription could still be in-progress. > > > > > > Here is my thought on these 2 approaches, so if we lock in launcher > > > and check there we avoid launching the worker altogether but for that > > > we will have to revalidate the subscription using sequence scan on > > > pg_subcription as we can not open additional indexes as we are not > > > connected to any database OTOH if we acquire lock in > > > InitializeLogRepWorker() we don't need to do any additional scan but > > > we will have to launch the worker and after that if subscription > > > recheck shows its dropped we will exit the worker. > > > > > > I think either way it's fine because this is not a very common > > > performance path, I prefer what you proposed as we will have to add > > > less code in this case, because we are already checking the > > > subscription and just need to acquire the shared object lock in > > > InitializeLogRepWorker(), lets see what other thinks, meanwhile I will > > > modify the code with this approach as well. > > > > Here is a patch with a different approach. IMHO with this approach > > the code change is much simpler. > > I have slightly modified the patch, basically removing explicitly > unlocking the shared object as that will be taken care by transaction > commit / proc_exit() Thanks for the updated patch. This change is required for the back branches too. Currently it is not applying for PG16 and below branches. git am v3-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-.patch Applying: Fix DROP SUBSCRIPTION deadlock with new database creation error: patch failed: src/backend/commands/subscriptioncmds.c:1571 error: src/backend/commands/subscriptioncmds.c: patch does not apply error: patch failed: src/backend/replication/logical/worker.c:4624 error: src/backend/replication/logical/worker.c: patch does not apply Patch failed at 0001 Fix DROP SUBSCRIPTION deadlock with new database creation Regards, Vignesh
On Wed, Aug 13, 2025 at 11:16 AM vignesh C <vignesh21@gmail.com> wrote: > > On Thu, 7 Aug 2025 at 19:18, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Wed, Aug 6, 2025 at 11:38 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Wed, Aug 6, 2025 at 11:07 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > On Wed, Aug 6, 2025 at 10:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > Won't that add a new restriction that one can't drop postgres database > > > > > after this? > > > > > > > > That's correct, so maybe this is not acceptable IMHO. > > > > > > > > > BTW, isn't it better to acquire the share_lock on subscription during > > > > > worker initialization (InitializeLogRepWorker()) where we already > > > > > checking whether the subscription is dropped by that time? I think if > > > > > we don't acquire the lock on subscription during launcher or during > > > > > InitializeLogRepWorker(), there is a risk that DropSubscription would > > > > > have reached the place where it would have tried to stop the workers > > > > > and launcher will launch the worker after that point. Then there is a > > > > > possibility of dangling worker because the GetSubscription() can still > > > > > return valid value as the transaction in which we are dropping a > > > > > subscription could still be in-progress. > > > > > > > > Here is my thought on these 2 approaches, so if we lock in launcher > > > > and check there we avoid launching the worker altogether but for that > > > > we will have to revalidate the subscription using sequence scan on > > > > pg_subcription as we can not open additional indexes as we are not > > > > connected to any database OTOH if we acquire lock in > > > > InitializeLogRepWorker() we don't need to do any additional scan but > > > > we will have to launch the worker and after that if subscription > > > > recheck shows its dropped we will exit the worker. > > > > > > > > I think either way it's fine because this is not a very common > > > > performance path, I prefer what you proposed as we will have to add > > > > less code in this case, because we are already checking the > > > > subscription and just need to acquire the shared object lock in > > > > InitializeLogRepWorker(), lets see what other thinks, meanwhile I will > > > > modify the code with this approach as well. > > > > > > Here is a patch with a different approach. IMHO with this approach > > > the code change is much simpler. > > > > I have slightly modified the patch, basically removing explicitly > > unlocking the shared object as that will be taken care by transaction > > commit / proc_exit() > > Thanks for the updated patch. This change is required for the back > branches too. Currently it is not applying for PG16 and below > branches. > git am v3-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-.patch > Applying: Fix DROP SUBSCRIPTION deadlock with new database creation > error: patch failed: src/backend/commands/subscriptioncmds.c:1571 > error: src/backend/commands/subscriptioncmds.c: patch does not apply > error: patch failed: src/backend/replication/logical/worker.c:4624 > error: src/backend/replication/logical/worker.c: patch does not apply > Patch failed at 0001 Fix DROP SUBSCRIPTION deadlock with new database creation Right, I was waiting for consensus for the fix, but anyway I will send the patches for back branches, thanks. -- Regards, Dilip Kumar Google
On Wed, Aug 13, 2025 at 11:20 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Aug 13, 2025 at 11:16 AM vignesh C <vignesh21@gmail.com> wrote: > > > > Thanks for the updated patch. This change is required for the back > > branches too. Currently it is not applying for PG16 and below > > branches. > > git am v3-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-.patch > > Applying: Fix DROP SUBSCRIPTION deadlock with new database creation > > error: patch failed: src/backend/commands/subscriptioncmds.c:1571 > > error: src/backend/commands/subscriptioncmds.c: patch does not apply > > error: patch failed: src/backend/replication/logical/worker.c:4624 > > error: src/backend/replication/logical/worker.c: patch does not apply > > Patch failed at 0001 Fix DROP SUBSCRIPTION deadlock with new database creation > > Right, I was waiting for consensus for the fix, but anyway I will send > the patches for back branches, thanks. I have prepared back branch patches. -- Regards, Dilip Kumar Google
Attachments
- v3-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-V15.patch
- v3-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-v14.patch
- v3-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-Master.patch
- v3-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-V13.patch
- v3-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-V16.patch
- v3-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-V17-18.patch
On Wed, Aug 13, 2025 at 3:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > I have prepared back branch patches. > @@ -1802,11 +1802,7 @@ DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel) ... - /* - * Lock pg_subscription with AccessExclusiveLock to ensure that the - * launcher doesn't restart new worker during dropping the subscription - */ - rel = table_open(SubscriptionRelationId, AccessExclusiveLock); + rel = table_open(SubscriptionRelationId, AccessShareLock); We need to acquire RowExclusiveLock here as we are deleting the subscription row during this operation. See docs [1] and AlterSubscription() code for reference. Attached is a top-up patch to fix this for HEAD. Additionally, I have edited a few comments. If you agree with these changes then please prepare backbranch patches accordingly. [1] - https://www.postgresql.org/docs/devel/explicit-locking.html -- With Regards, Amit Kapila.
Attachments
On Thu, Aug 14, 2025 at 11:39 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Aug 13, 2025 at 3:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > I have prepared back branch patches.
>
> @@ -1802,11 +1802,7 @@ DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel)
> ...
> -    /*
> -     * Lock pg_subscription with AccessExclusiveLock to ensure that the
> -     * launcher doesn't restart new worker during dropping the subscription
> -     */
> -    rel = table_open(SubscriptionRelationId, AccessExclusiveLock);
> +    rel = table_open(SubscriptionRelationId, AccessShareLock);
>
> We need to acquire RowExclusiveLock here as we are deleting the
> subscription row during this operation. See docs [1] and
> AlterSubscription() code for reference.

Yeah, that's right.

> Attached is a top-up patch to fix this for HEAD. Additionally, I have
> edited a few comments. If you agree with these changes then please
> prepare backbranch patches accordingly.

Yeah, this looks fine to me. PFA patches for back branches.

--
Regards,
Dilip Kumar
Google
Attachments
- v4-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-V13.patch
- v4-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-master.patch
- v4-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-V16.patch
- v4-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-V14_15.patch
- v4-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-V17_18.patch
On Thu, Aug 14, 2025 at 2:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > Yeah this looks fine to me. PFA patches for back branches. > Can we add a test based on the scenario reported in this email? -- With Regards, Amit Kapila.
On Thu, Aug 14, 2025 at 2:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Aug 14, 2025 at 2:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > Yeah this looks fine to me. PFA patches for back branches.
>
> Can we add a test based on the scenario reported in this email?

At first, I thought we could just add this test in subscription.sql, but
IMHO we should not do that, because after testing that DROP SUBSCRIPTION
gives an error we should eventually drop the subscription for cleanup by
creating the slot and enabling the subscription (as shown below [1]).
But then I realized that during the regression tests wal_level is not
logical, so we are not allowed to create the logical slot. So maybe we
should write it in 100_bugs.pl? Am I missing something?

[1]
-- ok, CREATE SUBSCRIPTION WITH (connect=false)
CREATE DATABASE regression_db;
CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regression_db' PUBLICATION testpub WITH (connect=false);

-- fail, due to non existing slot
-- should not create a deadlock while initializing caches of the new database
DROP SUBSCRIPTION regress_testsub;

-- ok, create the missing slot, enable the subscription and drop again
SELECT 'init' FROM pg_create_logical_replication_slot('regress_testsub', 'pgoutput');
ALTER SUBSCRIPTION regress_testsub ENABLE;
DROP SUBSCRIPTION regress_testsub;

--
Regards,
Dilip Kumar
Google
On Thu, Aug 14, 2025 at 4:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Aug 14, 2025 at 2:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Thu, Aug 14, 2025 at 2:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > Yeah this looks fine to me. PFA patches for back branches. > > > > > > > Can we add a test based on the scenario reported in this email? > > At first, I thought we can just add this test in subscription.sql, but > IMHO we should not do that because after testing DROP SUBSCRIPTION > give error we should eventually DROP the subscription for cleanup by > creating the slot and enabling the subscription (as shown below[1]) > but I that realized during regression wal_level is not logical so we > are not allowed to create the logical slot. So maybe we should write > it in 100_bugs.pl? Am I missing something? > Another alternative is for cleanup instead of creating a missing slot we can just set the slot_name=NONE and then drop. I will move ahead with this approach. -- Regards, Dilip Kumar Google
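That cleanup boils down to a two-statement sequence; a sketch using the subscription name from the draft test above:

```
-- Dissociate the subscription from the (never created) slot, then drop it
-- without attempting any connection to the publisher.
ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
DROP SUBSCRIPTION regress_testsub;
```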
On Thu, Aug 14, 2025 at 5:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Aug 14, 2025 at 4:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Thu, Aug 14, 2025 at 2:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Thu, Aug 14, 2025 at 2:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > Yeah this looks fine to me. PFA patches for back branches. > > > > > > > > > > Can we add a test based on the scenario reported in this email? > > > > At first, I thought we can just add this test in subscription.sql, but > > IMHO we should not do that because after testing DROP SUBSCRIPTION > > give error we should eventually DROP the subscription for cleanup by > > creating the slot and enabling the subscription (as shown below[1]) > > but I that realized during regression wal_level is not logical so we > > are not allowed to create the logical slot. So maybe we should write > > it in 100_bugs.pl? Am I missing something? > > > > Another alternative is for cleanup instead of creating a missing slot > we can just set the slot_name=NONE and then drop. I will move ahead > with this approach. > Yeah, that sounds good to me. -- With Regards, Amit Kapila.
On Fri, Aug 15, 2025 at 11:52 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Aug 14, 2025 at 5:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Thu, Aug 14, 2025 at 4:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Thu, Aug 14, 2025 at 2:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > On Thu, Aug 14, 2025 at 2:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > > > Yeah this looks fine to me. PFA patches for back branches. > > > > > > > > > > > > > Can we add a test based on the scenario reported in this email? > > > > > > At first, I thought we can just add this test in subscription.sql, but > > > IMHO we should not do that because after testing DROP SUBSCRIPTION > > > give error we should eventually DROP the subscription for cleanup by > > > creating the slot and enabling the subscription (as shown below[1]) > > > but I that realized during regression wal_level is not logical so we > > > are not allowed to create the logical slot. So maybe we should write > > > it in 100_bugs.pl? Am I missing something? > > > > > > > Another alternative is for cleanup instead of creating a missing slot > > we can just set the slot_name=NONE and then drop. I will move ahead > > with this approach. > > > > Yeah, that sounds good to me. > PFA, updated patches for v13 to head, make check-world is passing for all the branches for v15+ we had upgrade test which also run regression test so we additionally have to set max_wal_senders=1 for 002_pg_upgrade.pl -- Regards, Dilip Kumar Google
Attachments
- v5-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-head.patch
- v5-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-v15.patch
- v5-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-v13.patch
- v5-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-v14.patch
- v5-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-v16.patch
- v5-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-v17.patch
- v5-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-v18.patch
On Fri, 15 Aug 2025 at 11:58, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Fri, Aug 15, 2025 at 11:52 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Thu, Aug 14, 2025 at 5:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Thu, Aug 14, 2025 at 4:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > On Thu, Aug 14, 2025 at 2:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > On Thu, Aug 14, 2025 at 2:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > > > > > Yeah this looks fine to me. PFA patches for back branches. > > > > > > > > > > > > > > > > Can we add a test based on the scenario reported in this email? > > > > > > > > At first, I thought we can just add this test in subscription.sql, but > > > > IMHO we should not do that because after testing DROP SUBSCRIPTION > > > > give error we should eventually DROP the subscription for cleanup by > > > > creating the slot and enabling the subscription (as shown below[1]) > > > > but I that realized during regression wal_level is not logical so we > > > > are not allowed to create the logical slot. So maybe we should write > > > > it in 100_bugs.pl? Am I missing something? > > > > > > > > > > Another alternative is for cleanup instead of creating a missing slot > > > we can just set the slot_name=NONE and then drop. I will move ahead > > > with this approach. > > > > > > > Yeah, that sounds good to me. > > > > PFA, updated patches for v13 to head, make check-world is passing for > all the branches for v15+ we had upgrade test which also run > regression test so we additionally have to set max_wal_senders=1 for > 002_pg_upgrade.pl Do you feel it will be better to append the configurations using single call like below: -$oldnode->append_conf('postgresql.conf', 'log_statement = none'); -$oldnode->append_conf('postgresql.conf', 'wal_level = replica'); -$oldnode->append_conf('postgresql.conf', 'max_wal_senders = 1'); +$oldnode->append_conf( + 'postgresql.conf', + qq{log_statement = 'none' + wal_level = 'replica' + max_wal_senders = 1}); Regards, Vignesh
On Sun, Aug 17, 2025 at 7:23 PM vignesh C <vignesh21@gmail.com> wrote: > > > PFA, updated patches for v13 to head, make check-world is passing for > > all the branches for v15+ we had upgrade test which also run > > regression test so we additionally have to set max_wal_senders=1 for > > 002_pg_upgrade.pl > > Do you feel it will be better to append the configurations using > single call like below: > -$oldnode->append_conf('postgresql.conf', 'log_statement = none'); > -$oldnode->append_conf('postgresql.conf', 'wal_level = replica'); > -$oldnode->append_conf('postgresql.conf', 'max_wal_senders = 1'); > +$oldnode->append_conf( > + 'postgresql.conf', > + qq{log_statement = 'none' > + wal_level = 'replica' > + max_wal_senders = 1}); Not sure if one way is superior over another or preferred. I see both sorts of usage patterns throughout our test suit. -- Regards, Dilip Kumar Google
On Fri, 15 Aug 2025 at 11:58, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Fri, Aug 15, 2025 at 11:52 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Thu, Aug 14, 2025 at 5:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Thu, Aug 14, 2025 at 4:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > On Thu, Aug 14, 2025 at 2:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > On Thu, Aug 14, 2025 at 2:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > > > > > Yeah this looks fine to me. PFA patches for back branches. > > > > > > > > > > > > > > > > Can we add a test based on the scenario reported in this email? > > > > > > > > At first, I thought we can just add this test in subscription.sql, but > > > > IMHO we should not do that because after testing DROP SUBSCRIPTION > > > > give error we should eventually DROP the subscription for cleanup by > > > > creating the slot and enabling the subscription (as shown below[1]) > > > > but I that realized during regression wal_level is not logical so we > > > > are not allowed to create the logical slot. So maybe we should write > > > > it in 100_bugs.pl? Am I missing something? > > > > > > > > > > Another alternative is for cleanup instead of creating a missing slot > > > we can just set the slot_name=NONE and then drop. I will move ahead > > > with this approach. > > > > > > > Yeah, that sounds good to me. > > > > PFA, updated patches for v13 to head, make check-world is passing for > all the branches for v15+ we had upgrade test which also run > regression test so we additionally have to set max_wal_senders=1 for > 002_pg_upgrade.pl This change will run not just the newly added test, but the entire regression suite under the new configuration. Should we consider moving this test elsewhere such as into the 100_bugs: diff --git a/src/bin/pg_upgrade/t/002_pg_upgrade.pl b/src/bin/pg_upgrade/t/002_pg_upgrade.pl index 4b5e895809b..284b2ef07e0 100644 --- a/src/bin/pg_upgrade/t/002_pg_upgrade.pl +++ b/src/bin/pg_upgrade/t/002_pg_upgrade.pl @@ -225,6 +225,8 @@ $oldnode->init(%old_node_params); # Override log_statement=all set by Cluster.pm. This avoids large amounts # of log traffic that slow this test down even more when run under valgrind. $oldnode->append_conf('postgresql.conf', 'log_statement = none'); +$oldnode->append_conf('postgresql.conf', 'wal_level = replica'); +$oldnode->append_conf('postgresql.conf', 'max_wal_senders = 1'); Regards, Vignesh
On Mon, Aug 18, 2025 at 9:04 AM vignesh C <vignesh21@gmail.com> wrote: > This change will run not just the newly added test, but the entire > regression suite under the new configuration. Should we consider > moving this test elsewhere such as into the 100_bugs: > diff --git a/src/bin/pg_upgrade/t/002_pg_upgrade.pl > b/src/bin/pg_upgrade/t/002_pg_upgrade.pl > index 4b5e895809b..284b2ef07e0 100644 > --- a/src/bin/pg_upgrade/t/002_pg_upgrade.pl > +++ b/src/bin/pg_upgrade/t/002_pg_upgrade.pl > @@ -225,6 +225,8 @@ $oldnode->init(%old_node_params); > # Override log_statement=all set by Cluster.pm. This avoids large amounts > # of log traffic that slow this test down even more when run under valgrind. > $oldnode->append_conf('postgresql.conf', 'log_statement = none'); > +$oldnode->append_conf('postgresql.conf', 'wal_level = replica'); > +$oldnode->append_conf('postgresql.conf', 'max_wal_senders = 1'); Yeah even I didn't like it to set wal_level to replica only for this test, it makes sense to write it in 100_bugs.pl, here is a updated patch for HEAD only, I have also updated the commit message, once we get agreement on this I will prepare a backbranch patches. -- Regards, Dilip Kumar Google
Attachments
On Mon, Aug 18, 2025 at 11:36 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Aug 18, 2025 at 9:04 AM vignesh C <vignesh21@gmail.com> wrote: > > This change will run not just the newly added test, but the entire > > regression suite under the new configuration. Should we consider > > moving this test elsewhere such as into the 100_bugs: > > diff --git a/src/bin/pg_upgrade/t/002_pg_upgrade.pl > > b/src/bin/pg_upgrade/t/002_pg_upgrade.pl > > index 4b5e895809b..284b2ef07e0 100644 > > --- a/src/bin/pg_upgrade/t/002_pg_upgrade.pl > > +++ b/src/bin/pg_upgrade/t/002_pg_upgrade.pl > > @@ -225,6 +225,8 @@ $oldnode->init(%old_node_params); > > # Override log_statement=all set by Cluster.pm. This avoids large amounts > > # of log traffic that slow this test down even more when run under valgrind. > > $oldnode->append_conf('postgresql.conf', 'log_statement = none'); > > +$oldnode->append_conf('postgresql.conf', 'wal_level = replica'); > > +$oldnode->append_conf('postgresql.conf', 'max_wal_senders = 1'); > > Yeah even I didn't like it to set wal_level to replica only for this > test, it makes sense to write it in 100_bugs.pl, here is a updated > patch for HEAD only, I have also updated the commit message, once we > get agreement on this I will prepare a backbranch patches. > Updated test case, cleanup reported by 1) pgperltidy 2) dropped the database post test and 3) changed subscription name -- Regards, Dilip Kumar Google
Attachments
On Mon, Aug 18, 2025 at 12:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Aug 18, 2025 at 11:36 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Mon, Aug 18, 2025 at 9:04 AM vignesh C <vignesh21@gmail.com> wrote: > > > This change will run not just the newly added test, but the entire > > > regression suite under the new configuration. Should we consider > > > moving this test elsewhere such as into the 100_bugs: > > > diff --git a/src/bin/pg_upgrade/t/002_pg_upgrade.pl > > > b/src/bin/pg_upgrade/t/002_pg_upgrade.pl > > > index 4b5e895809b..284b2ef07e0 100644 > > > --- a/src/bin/pg_upgrade/t/002_pg_upgrade.pl > > > +++ b/src/bin/pg_upgrade/t/002_pg_upgrade.pl > > > @@ -225,6 +225,8 @@ $oldnode->init(%old_node_params); > > > # Override log_statement=all set by Cluster.pm. This avoids large amounts > > > # of log traffic that slow this test down even more when run under valgrind. > > > $oldnode->append_conf('postgresql.conf', 'log_statement = none'); > > > +$oldnode->append_conf('postgresql.conf', 'wal_level = replica'); > > > +$oldnode->append_conf('postgresql.conf', 'max_wal_senders = 1'); > > > > Yeah even I didn't like it to set wal_level to replica only for this > > test, it makes sense to write it in 100_bugs.pl, here is a updated > > patch for HEAD only, I have also updated the commit message, once we > > get agreement on this I will prepare a backbranch patches. > > > > Updated test case, cleanup reported by 1) pgperltidy 2) dropped the > database post test and 3) changed subscription name > Looks mostly good. How about slightly changing the comment as in attached? -- With Regards, Amit Kapila.
Attachments
On Mon, Aug 18, 2025 at 2:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Looks mostly good. How about slightly changing the comment as in attached? Your suggestion LGTM, I am making back branch patches. I observed another behavior change in the test in v13. The $sdterr is matching with error details not with the error code, so for v13 i have to modify the error comparison string as [1] whereas with other versions it's like [2]. But if we execute this test from psql then I can see the same error in v13 as well[3]. Not sure why perl is behaving differently. Maybe there is a way to change this but I could not figure it out. And another difference I have noticed is that in v13 I had to give complete publisher_connstr that's maybe it's not the starting server on the default port. From this I realized that maybe for other branches where it is working fine with just dbname, we should also include publisher_connstr as shown [4] just to ensure we are connecting to the right node, thoughts? [1] +like( + $stderr, + qr/ERROR: replication slot "regress_sub1" does not exist/, + "could not drop replication slot: error message"); [2] +like( + $stderr, + qr/ERROR: could not drop replication slot "regress_sub1" on publisher/, + "could not drop replication slot: error message"); [3] postgres[2045257]=# DROP SUBSCRIPTION regress_sub1; ERROR: XX000: could not drop the replication slot "regress_sub1" on publisher DETAIL: The error was: ERROR: replication slot "regress_sub1" does not exist LOCATION: DropSubscription, subscriptioncmds.c:1008 [4] +$publisher_connstr = $node_publisher->connstr . ' dbname=regress_db'; +$node_publisher->safe_psql( + 'postgres', qq( + CREATE DATABASE regress_db; + CREATE SUBSCRIPTION regress_sub1 CONNECTION '$publisher_connstr' PUBLICATION regress_pub WITH (connect=false); +)); Attaching only v13 and head patches, once we have a way forward I will send other patches as well. -- Regards, Dilip Kumar Google
Attachments
On Mon, 18 Aug 2025 at 19:52, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Aug 18, 2025 at 2:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > Looks mostly good. How about slightly changing the comment as in attached? > > Your suggestion LGTM, I am making back branch patches. I observed > another behavior change in the test in v13. The $sdterr is matching > with error details not with the error code, so for v13 i have to > modify the error comparison string as [1] whereas with other versions > it's like [2]. But if we execute this test from psql then I can see > the same error in v13 as well[3]. Not sure why perl is behaving > differently. The error message in PG13 and master branch is different, that is the reason the test was failing. The error in master is: could not drop replication slot whereas the error message in PG13 is: could not drop the replication slot. The test passes with the below change in PG13: --- a/src/test/subscription/t/100_bugs.pl +++ b/src/test/subscription/t/100_bugs.pl @@ -315,7 +315,7 @@ my ($ret, $stdout, $stderr) = isnt($ret, 0, "replication slot does not exist: exit code not 0"); like( $stderr, - qr/ERROR: replication slot "regress_sub1" does not exist/, + qr/ERROR: could not drop the replication slot "regress_sub1" on publisher/, "could not drop replication slot: error message"); $node_publisher->safe_psql('postgres', "DROP DATABASE regress_db"); Regards, Vignesh
On Mon, Aug 18, 2025 at 10:11 PM vignesh C <vignesh21@gmail.com> wrote: > > On Mon, 18 Aug 2025 at 19:52, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Mon, Aug 18, 2025 at 2:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > Looks mostly good. How about slightly changing the comment as in attached? > > > > Your suggestion LGTM, I am making back branch patches. I observed > > another behavior change in the test in v13. The $sdterr is matching > > with error details not with the error code, so for v13 i have to > > modify the error comparison string as [1] whereas with other versions > > it's like [2]. But if we execute this test from psql then I can see > > the same error in v13 as well[3]. Not sure why perl is behaving > > differently. > > The error message in PG13 and master branch is different, that is the > reason the test was failing. > The error in master is: could not drop replication slot whereas the > error message in PG13 is: could not drop the replication slot. > The test passes with the below change in PG13: > --- a/src/test/subscription/t/100_bugs.pl > +++ b/src/test/subscription/t/100_bugs.pl > @@ -315,7 +315,7 @@ my ($ret, $stdout, $stderr) = > isnt($ret, 0, "replication slot does not exist: exit code not 0"); > like( > $stderr, > - qr/ERROR: replication slot "regress_sub1" does not exist/, > + qr/ERROR: could not drop the replication slot "regress_sub1" > on publisher/, > "could not drop replication slot: error message"); > > $node_publisher->safe_psql('postgres', "DROP DATABASE regress_db"); Oh my bad, the error codes were so similar I didn't notice the difference, I probably would have just done string search instead of comparing by reading. It's a classic case where a machine's precision beats a human's eyes :) PFA, patches for head to v13. I have also changed the connection string to $node_publisher->connstr . ' dbname=regress_db'; instead of assuming it will start on the default port. make check-world is passing on all the branches. -- Regards, Dilip Kumar Google
Attachments
- v8-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-walsender-pro_head.patch
- v8-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-walsender-pro_v14.patch
- v8-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-walsender-pro_v16_17_18.patch
- v8-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-walsender-pro_v13.patch
- v8-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-walsender-pro_v15.patch
On Tue, Aug 19, 2025 at 8:24 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > PFA, patches for head to v13. > Pushed. -- With Regards, Amit Kapila.
On Tue, Aug 19, 2025 at 3:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Aug 19, 2025 at 8:24 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > PFA, patches for head to v13. > > > > Pushed. Thanks. -- Regards, Dilip Kumar Google