Обсуждение: [MASSMAIL]Add missing ConditionVariableCancelSleep() in slot.c

Поиск
Список
Период
Сортировка

[MASSMAIL]Add missing ConditionVariableCancelSleep() in slot.c

От
Bharath Rupireddy
Дата:
Hi,

It looks like there's missing ConditionVariableCancelSleep() in
InvalidatePossiblyObsoleteSlot() after waiting for the replication
slot to be released by another process. Although prepare to sleep
cancels the sleep if the previous one wasn't canceled, the calling
process still remains in the CV's wait queue for the last cycle after
the replication slot holding process releases the slot. I'm wondering
why there's no issue detected if the calling process isn't removed
from the CV's wait queue. It may be due to the cancel sleep call in
the sigsetjmp() path for almost every process. IMO, it's a clean
practice to cancel the sleep immediately after the sleep ends like
elsewhere.

Please find the attached patch to fix the issue. Alternatively, we can
just add ConditionVariableCancelSleep() outside of the for loop to
cancel the sleep of the last cycle if any. This also looks correct
because every prepare to sleep does ensure the previous sleep is
canceled, and if no sleep at all, the cacnel sleep call exits.

PS: I've encountered an assertion failure [1] some time back
sporadically in t/001_rep_changes.pl which I've reported in
https://www.postgresql.org/message-id/CALj2ACXO6TJ4rhc3Uz3MWJGob9e4H1C71qufH-DGKJh8k4QGZA%40mail.gmail.com.
I'm not so sure if this patch fixes the assertion failure. I ran the
tests for 100 times [2] on my dev system, I didn't see any issue with
the patch.

[1]
t/001_rep_changes.pl

2024-01-31 12:24:38.474 UTC [840166]
pg_16435_sync_16393_7330237333761601891 STATEMENT:
DROP_REPLICATION_SLOT pg_16435_sync_16393_7330237333761601891 WAIT
TRAP: failed Assert("list->head != INVALID_PGPROCNO"), File:
"../../../../src/include/storage/proclist.h", Line: 101, PID: 840166
postgres: publisher: walsender ubuntu postgres [local]
DROP_REPLICATION_SLOT(ExceptionalCondition+0xbb)[0x55c8edf6b8f9]
postgres: publisher: walsender ubuntu postgres [local]
DROP_REPLICATION_SLOT(+0x6637de)[0x55c8edd517de]
postgres: publisher: walsender ubuntu postgres [local]
DROP_REPLICATION_SLOT(ConditionVariablePrepareToSleep+0x85)[0x55c8edd51b91]
postgres: publisher: walsender ubuntu postgres [local]
DROP_REPLICATION_SLOT(ReplicationSlotAcquire+0x142)[0x55c8edcead6b]
postgres: publisher: walsender ubuntu postgres [local]
DROP_REPLICATION_SLOT(ReplicationSlotDrop+0x51)[0x55c8edceb47f]
postgres: publisher: walsender ubuntu postgres [local]
DROP_REPLICATION_SLOT(+0x60da71)[0x55c8edcfba71]
postgres: publisher: walsender ubuntu postgres [local]
DROP_REPLICATION_SLOT(exec_replication_command+0x47e)[0x55c8edcfc96a]
postgres: publisher: walsender ubuntu postgres [local]
DROP_REPLICATION_SLOT(PostgresMain+0x7df)[0x55c8edd7d644]
postgres: publisher: walsender ubuntu postgres [local]
DROP_REPLICATION_SLOT(+0x5ab50c)[0x55c8edc9950c]
postgres: publisher: walsender ubuntu postgres [local]
DROP_REPLICATION_SLOT(+0x5aab21)[0x55c8edc98b21]
postgres: publisher: walsender ubuntu postgres [local]
DROP_REPLICATION_SLOT(+0x5a70de)[0x55c8edc950de]
postgres: publisher: walsender ubuntu postgres [local]
DROP_REPLICATION_SLOT(PostmasterMain+0x1534)[0x55c8edc949db]
postgres: publisher: walsender ubuntu postgres [local]
DROP_REPLICATION_SLOT(+0x459c47)[0x55c8edb47c47]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f19fe629d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f19fe629e40]
postgres: publisher: walsender ubuntu postgres [local]
DROP_REPLICATION_SLOT(_start+0x25)[0x55c8ed7c4565]
2024-01-31 12:24:38.476 UTC [840168]
pg_16435_sync_16390_7330237333761601891 LOG:  statement: SELECT
a.attnum,       a.attname,       a.atttypid,       a.attnum =
ANY(i.indkey)  FROM pg_catalog.pg_attribute a  LEFT JOIN
pg_catalog.pg_index i       ON (i.indexrelid =
pg_get_replica_identity_index(16391)) WHERE a.attnum >
0::pg_catalog.int2   AND NOT a.attisdropped AND a.attgenerated = ''
AND a.attrelid = 16391 ORDER BY a.attnum

[2] for i in {1..100}; do make check
PROVE_TESTS="t/001_rep_changes.pl"; if [ $? -ne 0 ]; then echo "The
command failed on iteration $i"; break; fi; done

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Вложения

Re: Add missing ConditionVariableCancelSleep() in slot.c

От
Alvaro Herrera
Дата:
On 2024-Apr-13, Bharath Rupireddy wrote:

> It looks like there's missing ConditionVariableCancelSleep() in
> InvalidatePossiblyObsoleteSlot() after waiting for the replication
> slot to be released by another process. Although prepare to sleep
> cancels the sleep if the previous one wasn't canceled, the calling
> process still remains in the CV's wait queue for the last cycle after
> the replication slot holding process releases the slot. I'm wondering
> why there's no issue detected if the calling process isn't removed
> from the CV's wait queue. It may be due to the cancel sleep call in
> the sigsetjmp() path for almost every process. IMO, it's a clean
> practice to cancel the sleep immediately after the sleep ends like
> elsewhere.

Hmm, but shouldn't we cancel the sleep after we have completed sleeping
altogether, that is, until we've determined that we're no longer to
sleep waiting for this slot?  That would suggest to put the call to
cancel sleep after the for(;;) loop is complete, rather than immediately
after sleeping.  No?

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index cebf44bb0f..59e9bef642 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1756,6 +1756,8 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
         }
     }
 
+    ConditionVariableCancelSleep();
+
     Assert(released_lock == !LWLockHeldByMe(ReplicationSlotControlLock));
 
     return released_lock;


However, I noticed that ConditionVariablePrepareToSleep() does a
CancelSleep upon being called ... so what I suggest would not have any
effect whatsoever, because the sleep would be cancelled next time
through the loop anyway.  But shouldn't we also modify PrepareToSleep to
exit without doing anything if our current sleep CV is the same one
being newly installed?

diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index 112a518bae..65811ff989 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -57,6 +57,14 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
 {
     int            pgprocno = MyProcNumber;
 
+    /*
+     * If we're preparing to sleep on the same CV we were already going to
+     * sleep on, we don't actually need to do anything.  This may seem like
+     * supporting sloppy coding, which is what it actually does, so ¯\_(ツ)_/¯
+     */
+    if (cv_sleep_target == cv)
+        return;
+
     /*
      * If some other sleep is already prepared, cancel it; this is necessary
      * because we have just one static variable tracking the prepared sleep,

Alternatively, maybe we need to not prepare-to-sleep in
InvalidatePossiblyObsoleteSlot() if we have already prepared to sleep in
a previous iteration through the loop (and of course, don't cancel the
sleep until we're out of the loop).

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
Tom: There seems to be something broken here.
Teodor: I'm in sackcloth and ashes...  Fixed.
                               http://postgr.es/m/482D1632.8010507@sigaev.ru



Re: Add missing ConditionVariableCancelSleep() in slot.c

От
Bharath Rupireddy
Дата:
On Sat, Apr 13, 2024 at 4:37 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> Hmm, but shouldn't we cancel the sleep after we have completed sleeping
> altogether, that is, until we've determined that we're no longer to
> sleep waiting for this slot?  That would suggest to put the call to
> cancel sleep after the for(;;) loop is complete, rather than immediately
> after sleeping.  No?
>
> diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
> index cebf44bb0f..59e9bef642 100644
> --- a/src/backend/replication/slot.c
> +++ b/src/backend/replication/slot.c
> @@ -1756,6 +1756,8 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
>                 }
>         }
>
> +       ConditionVariableCancelSleep();
> +
>         Assert(released_lock == !LWLockHeldByMe(ReplicationSlotControlLock));
>
>         return released_lock;

We can do that and +1 for it since the prepare to sleep cancel the
previous one anyway. I mentioned that approach in the original email:

>> Alternatively, we can
>> just add ConditionVariableCancelSleep() outside of the for loop to
>> cancel the sleep of the last cycle if any. This also looks correct
>> because every prepare to sleep does ensure the previous sleep is
>> canceled, and if no sleep at all, the cacnel sleep call exits.

> However, I noticed that ConditionVariablePrepareToSleep() does a
> CancelSleep upon being called ... so what I suggest would not have any
> effect whatsoever, because the sleep would be cancelled next time
> through the loop anyway.

But what if we break from the loop and never come back? We have to
wait until the sigsetjmp/exit path of the backend to hit and cancel
the sleep.

> But shouldn't we also modify PrepareToSleep to
> exit without doing anything if our current sleep CV is the same one
> being newly installed?
>
> diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
> index 112a518bae..65811ff989 100644
> --- a/src/backend/storage/lmgr/condition_variable.c
> +++ b/src/backend/storage/lmgr/condition_variable.c
> @@ -57,6 +57,14 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
>  {
>         int                     pgprocno = MyProcNumber;
>
> +       /*
> +        * If we're preparing to sleep on the same CV we were already going to
> +        * sleep on, we don't actually need to do anything.  This may seem like
> +        * supporting sloppy coding, which is what it actually does, so ¯\_(ツ)_/¯
> +        */
> +       if (cv_sleep_target == cv)
> +               return;
> +
>         /*
>          * If some other sleep is already prepared, cancel it; this is necessary
>          * because we have just one static variable tracking the prepared sleep,

That seems to work as a quick exit path avoiding spin lock acquisition
and release if the CV is already prepared to sleep. Specifically in
the InvalidatePossiblyObsoleteSlot for loop, it can avoid a bunch of
spin lock acquisitions and releases if we ever sleep on the same
slot's CV. However, I'm not sure if it will have any other issues.

BTW, I like the emoji "¯\_(ツ)_/¯" in the code comments :).

> Alternatively, maybe we need to not prepare-to-sleep in
> InvalidatePossiblyObsoleteSlot() if we have already prepared to sleep in
> a previous iteration through the loop (and of course, don't cancel the
> sleep until we're out of the loop).

I think this looks complicated. To keep things simple, I prefer to add
the ConditionVariableCancelSleep() out of the for loop in
InvalidatePossiblyObsoleteSlot().

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com