Thread: Deadlock between logrep apply worker and tablesync worker

Deadlock between logrep apply worker and tablesync worker

From: Tom Lane
On my machine, the src/test/subscription/t/002_types.pl test
usually takes right about 1.5 seconds:

$ time make check PROVE_FLAGS=--timer PROVE_TESTS=t/002_types.pl
...
[14:22:12] t/002_types.pl .. ok     1550 ms ( 0.00 usr  0.00 sys +  0.70 cusr  0.25 csys =  0.95 CPU)
[14:22:13]

I noticed however that sometimes (at least one try in ten, for me)
it takes about 2.5 seconds:

[14:22:16] t/002_types.pl .. ok     2591 ms ( 0.00 usr  0.00 sys +  0.69 cusr  0.28 csys =  0.97 CPU)
[14:22:18]

and I've even seen 3.5 seconds.  I dug into this and eventually
identified the cause: it's a deadlock between a subscription's apply
worker and a tablesync worker that it's spawned.  Sometimes the apply
worker calls wait_for_relation_state_change (to wait for the tablesync
worker to finish) while it's holding a lock on pg_replication_origin.
If that's the case, then when the tablesync worker reaches
process_syncing_tables_for_sync it is able to perform
UpdateSubscriptionRelState and reach the transaction commit below
that; but when it tries to do replorigin_drop_by_name a little further
down, it blocks on acquiring ExclusiveLock on pg_replication_origin.
So we have an undetected deadlock.  We escape that because
wait_for_relation_state_change has a 1-second timeout, after which
it rechecks GetSubscriptionRelState and is able to see the committed
relation state change; so it continues, and eventually releases its
transaction and the lock, permitting the tablesync worker to finish.

I've not tracked down the exact circumstances in which the apply
worker ends up holding a problematic lock, but it seems likely
that it corresponds to cases where its main loop has itself called
replorigin_drop_by_name, a bit further up, for some other concurrent
tablesync operation.  (In all the cases I've traced through, the apply
worker is herding multiple tablesync workers when this happens.)
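
For reference, a rough sketch of the pre-patch shape of replorigin_drop_by_name(), which both workers go through (abbreviated, with error handling and most comments trimmed).  The table-level ExclusiveLock it takes is held until transaction commit, which is what the blocked tablesync worker ends up waiting on:

void
replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait)
{
    RepOriginId roident;
    Relation    rel;

    /* Interlock against concurrent drops: lock the whole catalog ... */
    rel = table_open(ReplicationOriginRelationId, ExclusiveLock);

    roident = replorigin_by_name(name, missing_ok);

    if (OidIsValid(roident))
        replorigin_drop_guts(rel, roident, nowait);

    /* ... and keep the ExclusiveLock until the transaction commits */
    table_close(rel, NoLock);
}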

I experimented with having the apply worker release its locks
before waiting for the tablesync worker, as attached.  This passes
check-world and it seems to eliminate the test runtime instability,
but I wonder whether it's semantically correct.  This whole business
of taking table-wide ExclusiveLock on pg_replication_origin looks
like a horrid kluge that we should try to get rid of, not least
because I don't see any clear documentation of what hazard it's
trying to prevent.

Another thing that has a bad smell about it is the fact that
process_syncing_tables_for_sync uses two transactions in the first
place.  There's a comment there claiming that it's for crash safety,
but I can't help suspecting it's really because this case becomes a
hard deadlock without that mid-function commit.

It's not great in any case that the apply worker can move on in
the belief that the tablesync worker is done when in fact the latter
still has catalog state updates to make.  And I wonder what we're
doing with having both of them calling replorigin_drop_by_name
... shouldn't that responsibility belong to just one of them?

So I think this whole area deserves a hard look, and I'm not at
all sure that what's attached is a good solution.

            regards, tom lane

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 4647837b82..bb45c2107f 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -539,15 +539,27 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
                     LWLockRelease(LogicalRepWorkerLock);

                     /*
-                     * Enter busy loop and wait for synchronization worker to
-                     * reach expected state (or die trying).
+                     * If we have a transaction, we must commit it to release
+                     * any locks we have (it's quite likely we hold lock on
+                     * pg_replication_origin, which the sync worker will need
+                     * to update).  Then start a new transaction so we can
+                     * examine catalog state.
                      */
-                    if (!started_tx)
+                    if (started_tx)
+                    {
+                        CommitTransactionCommand();
+                        StartTransactionCommand();
+                    }
+                    else
                     {
                         StartTransactionCommand();
                         started_tx = true;
                     }

+                    /*
+                     * Enter busy loop and wait for synchronization worker to
+                     * reach expected state (or die trying).
+                     */
                     wait_for_relation_state_change(rstate->relid,
                                                    SUBREL_STATE_SYNCDONE);
                 }

Re: Deadlock between logrep apply worker and tablesync worker

From: Amit Kapila
On Mon, Jan 23, 2023 at 1:29 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> On my machine, the src/test/subscription/t/002_types.pl test
> usually takes right about 1.5 seconds:
>
> $ time make check PROVE_FLAGS=--timer PROVE_TESTS=t/002_types.pl
> ...
> [14:22:12] t/002_types.pl .. ok     1550 ms ( 0.00 usr  0.00 sys +  0.70 cusr  0.25 csys =  0.95 CPU)
> [14:22:13]
>
> I noticed however that sometimes (at least one try in ten, for me)
> it takes about 2.5 seconds:
>
> [14:22:16] t/002_types.pl .. ok     2591 ms ( 0.00 usr  0.00 sys +  0.69 cusr  0.28 csys =  0.97 CPU)
> [14:22:18]
>
> and I've even seen 3.5 seconds.  I dug into this and eventually
> identified the cause: it's a deadlock between a subscription's apply
> worker and a tablesync worker that it's spawned.  Sometimes the apply
> worker calls wait_for_relation_state_change (to wait for the tablesync
> worker to finish) while it's holding a lock on pg_replication_origin.
> If that's the case, then when the tablesync worker reaches
> process_syncing_tables_for_sync it is able to perform
> UpdateSubscriptionRelState and reach the transaction commit below
> that; but when it tries to do replorigin_drop_by_name a little further
> down, it blocks on acquiring ExclusiveLock on pg_replication_origin.
> So we have an undetected deadlock.  We escape that because
> wait_for_relation_state_change has a 1-second timeout, after which
> it rechecks GetSubscriptionRelState and is able to see the committed
> relation state change; so it continues, and eventually releases its
> transaction and the lock, permitting the tablesync worker to finish.
>
> I've not tracked down the exact circumstances in which the apply
> worker ends up holding a problematic lock, but it seems likely
> that it corresponds to cases where its main loop has itself called
> replorigin_drop_by_name, a bit further up, for some other concurrent
> tablesync operation.  (In all the cases I've traced through, the apply
> worker is herding multiple tablesync workers when this happens.)
>
> I experimented with having the apply worker release its locks
> before waiting for the tablesync worker, as attached.
>

I don't see any problem with your proposed change but I was wondering
if it would be better to commit the transaction and release locks
immediately after performing the replication origin drop? By doing
that, we will minimize the amount of time the transaction holds the
lock.
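
As a sketch of that idea (not a tested patch; this assumes the SYNCDONE branch of process_syncing_tables_for_apply() with its existing local variables), it would look something like:

/*
 * Drop the tablesync worker's origin, then commit right away so the
 * lock taken on pg_replication_origin is released immediately rather
 * than being carried into later waits in this loop.
 */
replorigin_drop_by_name(originname, true, false);

UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
                           rstate->relid, rstate->state,
                           rstate->lsn);

CommitTransactionCommand();
StartTransactionCommand();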

>  This passes
> check-world and it seems to eliminate the test runtime instability,
> but I wonder whether it's semantically correct.  This whole business
> of taking table-wide ExclusiveLock on pg_replication_origin looks
> like a horrid kluge that we should try to get rid of, not least
> because I don't see any clear documentation of what hazard it's
> trying to prevent.
>

IIRC, this is done to prevent concurrent drops of the origin, say via the
exposed API pg_replication_origin_drop(). See the discussion in [1]
related to it. If we want, we can optimize it so that we acquire the
lock on the specific origin, as mentioned in the comments in
replorigin_drop_by_name(), but it was not clear that this operation
would be frequent enough to matter.

[1] - https://www.postgresql.org/message-id/CAHut%2BPuW8DWV5fskkMWWMqzt-x7RPcNQOtJQBp6SdwyRghCk7A%40mail.gmail.com

-- 
With Regards,
Amit Kapila.



Re: Deadlock between logrep apply worker and tablesync worker

From: Amit Kapila
On Mon, Jan 23, 2023 at 1:29 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Another thing that has a bad smell about it is the fact that
> process_syncing_tables_for_sync uses two transactions in the first
> place.  There's a comment there claiming that it's for crash safety,
> but I can't help suspecting it's really because this case becomes a
> hard deadlock without that mid-function commit.
>
> It's not great in any case that the apply worker can move on in
> the belief that the tablesync worker is done when in fact the latter
> still has catalog state updates to make.  And I wonder what we're
> doing with having both of them calling replorigin_drop_by_name
> ... shouldn't that responsibility belong to just one of them?
>

Originally, the origin was dropped in only one place (by the tablesync
worker), but we found a race condition, as mentioned in the comments in
process_syncing_tables_for_sync() before the start of the second
transaction, which led to this change. See the report and discussion of
that race condition in the email [1].

[1] - https://www.postgresql.org/message-id/CAD21AoAw0Oofi4kiDpJBOwpYyBBBkJj=sLUOn4Gd2GjUAKG-fw@mail.gmail.com

-- 
With Regards,
Amit Kapila.



Re: Deadlock between logrep apply worker and tablesync worker

From: vignesh C
On Mon, 23 Jan 2023 at 10:52, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> IIRC, this is done to prevent concurrent drops of origin drop say by
> exposed API pg_replication_origin_drop(). See the discussion in [1]
> related to it. If we want we can optimize it so that we can acquire
> the lock on the specific origin as mentioned in comments
> replorigin_drop_by_name() but it was not clear that this operation
> would be frequent enough.

Attached is a patch that locks the replication origin record using
LockSharedObject instead of locking the pg_replication_origin relation
in ExclusiveLock mode. Now a tablesync worker will wait only if it is
trying to drop the same replication origin that has already been
dropped by the apply worker; the other tablesync workers will be able
to successfully drop their replication origins without any wait.
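
The shape of the change is roughly the following (a sketch of the idea rather than the exact patch): keep only a weak lock on the catalog itself and take an object lock on the specific origin, so that only drops of the same origin conflict.

rel = table_open(ReplicationOriginRelationId, RowExclusiveLock);

roident = replorigin_by_name(name, missing_ok);

if (OidIsValid(roident))
{
    /* Block only concurrent drops of this particular origin */
    LockSharedObject(ReplicationOriginRelationId, roident, 0,
                     AccessExclusiveLock);

    replorigin_drop_guts(rel, roident, nowait);
}

table_close(rel, RowExclusiveLock);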

Regards,
Vignesh

Attachments

Re: Deadlock between logrep apply worker and tablesync worker

From: Amit Kapila
On Fri, Jan 27, 2023 at 3:45 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Mon, 23 Jan 2023 at 10:52, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > IIRC, this is done to prevent concurrent drops of origin drop say by
> > exposed API pg_replication_origin_drop(). See the discussion in [1]
> > related to it. If we want we can optimize it so that we can acquire
> > the lock on the specific origin as mentioned in comments
> > replorigin_drop_by_name() but it was not clear that this operation
> > would be frequent enough.
>
> Here is an attached patch to lock the replication origin record using
> LockSharedObject instead of locking pg_replication_origin relation in
> ExclusiveLock mode. Now tablesync worker will wait only if the
> tablesync worker is trying to drop the same replication origin which
> has already been dropped by the apply worker, the other tablesync
> workers will be able to successfully drop the replication origin
> without any wait.
>

There is code in the function replorigin_drop_guts() that has the same
functionality as the newly introduced replorigin_exists(). Can we reuse
that function for this purpose?

Also, it would be good if you could share the numbers for different runs
of "src/test/subscription/t/002_types.pl" before and after the patch.

-- 
With Regards,
Amit Kapila.



RE: Deadlock between logrep apply worker and tablesync worker

From: "houzj.fnst@fujitsu.com"
On Friday, January 27, 2023 8:16 PM Amit Kapila <amit.kapila16@gmail.com>
> 
> On Fri, Jan 27, 2023 at 3:45 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Mon, 23 Jan 2023 at 10:52, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > >
> > > IIRC, this is done to prevent concurrent drops of origin drop say by
> > > exposed API pg_replication_origin_drop(). See the discussion in [1]
> > > related to it. If we want we can optimize it so that we can acquire
> > > the lock on the specific origin as mentioned in comments
> > > replorigin_drop_by_name() but it was not clear that this operation
> > > would be frequent enough.
> >
> > Here is an attached patch to lock the replication origin record using
> > LockSharedObject instead of locking pg_replication_origin relation in
> > ExclusiveLock mode. Now tablesync worker will wait only if the
> > tablesync worker is trying to drop the same replication origin which
> > has already been dropped by the apply worker, the other tablesync
> > workers will be able to successfully drop the replication origin
> > without any wait.
> >
> 
> There is a code in the function replorigin_drop_guts() that uses the
> functionality introduced by replorigin_exists(). Can we reuse this function for
> the same?

Maybe we can use SearchSysCacheExists1 to check the existence instead of
adding a new function.
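
A minimal sketch of that, assuming it replaces the replorigin_exists() call in replorigin_drop_by_name() after the lock is taken:

/* Drop the replication origin if it has not been dropped already */
if (SearchSysCacheExists1(REPLORIGIDENT, ObjectIdGetDatum(roident)))
    replorigin_drop_guts(rel, roident, nowait);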

One comment about the patch.

@@ -430,23 +445,21 @@ replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait)
...
+    /* Drop the replication origin if it has not been dropped already */
+    if (replorigin_exists(roident))
         replorigin_drop_guts(rel, roident, nowait);

If the caller passes missing_ok as false, should we report an ERROR here
instead of silently returning?

Best Regards,
Hou zj

Re: Deadlock between logrep apply worker and tablesync worker

From: Amit Kapila
On Sat, Jan 28, 2023 at 9:36 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Friday, January 27, 2023 8:16 PM Amit Kapila <amit.kapila16@gmail.com>
> >
> > On Fri, Jan 27, 2023 at 3:45 PM vignesh C <vignesh21@gmail.com> wrote:
> > >
> > > On Mon, 23 Jan 2023 at 10:52, Amit Kapila <amit.kapila16@gmail.com>
> > wrote:
> > > >
> > > > IIRC, this is done to prevent concurrent drops of origin drop say by
> > > > exposed API pg_replication_origin_drop(). See the discussion in [1]
> > > > related to it. If we want we can optimize it so that we can acquire
> > > > the lock on the specific origin as mentioned in comments
> > > > replorigin_drop_by_name() but it was not clear that this operation
> > > > would be frequent enough.
> > >
> > > Here is an attached patch to lock the replication origin record using
> > > LockSharedObject instead of locking pg_replication_origin relation in
> > > ExclusiveLock mode. Now tablesync worker will wait only if the
> > > tablesync worker is trying to drop the same replication origin which
> > > has already been dropped by the apply worker, the other tablesync
> > > workers will be able to successfully drop the replication origin
> > > without any wait.
> > >
> >
> > There is a code in the function replorigin_drop_guts() that uses the
> > functionality introduced by replorigin_exists(). Can we reuse this function for
> > the same?
>
> Maybe we can use SearchSysCacheExists1 to check the existence instead of
> adding a new function.
>

Yeah, I think that would be better.

> One comment about the patch.
>
> @@ -430,23 +445,21 @@ replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait)
> ...
> +       /* Drop the replication origin if it has not been dropped already */
> +       if (replorigin_exists(roident))
>                 replorigin_drop_guts(rel, roident, nowait);
>
> If developer pass missing_ok as false, should we report an ERROR here
> instead of silently return ?
>

One thing that looks a bit odd is that we will anyway have a similar
check in replorigin_drop_guts(), which is a static function called from
only one place; so will the check be required in both places?

-- 
With Regards,
Amit Kapila.



Re: Deadlock between logrep apply worker and tablesync worker

From: vignesh C
On Sat, 28 Jan 2023 at 11:26, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sat, Jan 28, 2023 at 9:36 AM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Friday, January 27, 2023 8:16 PM Amit Kapila <amit.kapila16@gmail.com>
> > >
> > > On Fri, Jan 27, 2023 at 3:45 PM vignesh C <vignesh21@gmail.com> wrote:
> > > >
> > > > On Mon, 23 Jan 2023 at 10:52, Amit Kapila <amit.kapila16@gmail.com>
> > > wrote:
> > > > >
> > > > > IIRC, this is done to prevent concurrent drops of origin drop say by
> > > > > exposed API pg_replication_origin_drop(). See the discussion in [1]
> > > > > related to it. If we want we can optimize it so that we can acquire
> > > > > the lock on the specific origin as mentioned in comments
> > > > > replorigin_drop_by_name() but it was not clear that this operation
> > > > > would be frequent enough.
> > > >
> > > > Here is an attached patch to lock the replication origin record using
> > > > LockSharedObject instead of locking pg_replication_origin relation in
> > > > ExclusiveLock mode. Now tablesync worker will wait only if the
> > > > tablesync worker is trying to drop the same replication origin which
> > > > has already been dropped by the apply worker, the other tablesync
> > > > workers will be able to successfully drop the replication origin
> > > > without any wait.
> > > >
> > >
> > > There is a code in the function replorigin_drop_guts() that uses the
> > > functionality introduced by replorigin_exists(). Can we reuse this function for
> > > the same?
> >
> > Maybe we can use SearchSysCacheExists1 to check the existence instead of
> > adding a new function.
> >
>
> Yeah, I think that would be better.
>
> > One comment about the patch.
> >
> > @@ -430,23 +445,21 @@ replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait)
> > ...
> > +       /* Drop the replication origin if it has not been dropped already */
> > +       if (replorigin_exists(roident))
> >                 replorigin_drop_guts(rel, roident, nowait);
> >
> > If developer pass missing_ok as false, should we report an ERROR here
> > instead of silently return ?
> >
>
> One thing that looks a bit odd is that we will anyway have a similar
> check in replorigin_drop_guts() which is a static function and called
> from only one place, so, will it be required to check at both places?

There is a possibility that the initial check in replorigin_drop_by_name,
verifying that the replication origin exists, was successful, but later
either the tablesync worker or the apply worker dropped the replication
origin. So it is better to check again before calling
replorigin_drop_guts. Ideally the tuple should still be valid in
replorigin_drop_guts, but can we keep that check as it is, for
consistency, before calling CatalogTupleDelete?

Regards,
Vignesh



Re: Deadlock between logrep apply worker and tablesync worker

From: vignesh C
On Fri, 27 Jan 2023 at 17:46, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jan 27, 2023 at 3:45 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Mon, 23 Jan 2023 at 10:52, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > IIRC, this is done to prevent concurrent drops of origin drop say by
> > > exposed API pg_replication_origin_drop(). See the discussion in [1]
> > > related to it. If we want we can optimize it so that we can acquire
> > > the lock on the specific origin as mentioned in comments
> > > replorigin_drop_by_name() but it was not clear that this operation
> > > would be frequent enough.
> >
> > Here is an attached patch to lock the replication origin record using
> > LockSharedObject instead of locking pg_replication_origin relation in
> > ExclusiveLock mode. Now tablesync worker will wait only if the
> > tablesync worker is trying to drop the same replication origin which
> > has already been dropped by the apply worker, the other tablesync
> > workers will be able to successfully drop the replication origin
> > without any wait.
> >
>
> There is a code in the function replorigin_drop_guts() that uses the
> functionality introduced by replorigin_exists(). Can we reuse this
> function for the same?
>
> Also, it would be good if you can share the numbers for different runs
> of "src/test/subscription/t/002_types.pl" before and after the patch.

With only Tom Lane's fix, I noticed that the execution time varies
between 3.6 and 4.4 seconds.
With only the "Changing to LockSharedObject" fix, it varies between
3.8 and 4.6 seconds.
With the combined fix (Tom Lane's fix plus changing to
LockSharedObject), it varies between 3.5 and 3.8 seconds.
I feel both changes are required, as the combination also handles the
scenario where both the apply worker and the tablesync worker try to
drop the same replication origin.

The execution results for the same:
With only Tom Lane's fix:
[12:25:32] t/002_types.pl .. ok     3604 ms ( 0.00 usr  0.00 sys +  2.26 cusr  0.37 csys =  2.63 CPU)
[12:25:48] t/002_types.pl .. ok     3788 ms ( 0.00 usr  0.00 sys +  2.24 cusr  0.39 csys =  2.63 CPU)
[12:26:01] t/002_types.pl .. ok     3783 ms ( 0.00 usr  0.00 sys +  2.42 cusr  0.37 csys =  2.79 CPU)
[12:26:14] t/002_types.pl .. ok     3845 ms ( 0.00 usr  0.00 sys +  2.38 cusr  0.44 csys =  2.82 CPU)
[12:26:29] t/002_types.pl .. ok     3923 ms ( 0.00 usr  0.00 sys +  2.54 cusr  0.39 csys =  2.93 CPU)
[12:26:42] t/002_types.pl .. ok     4416 ms ( 0.00 usr  0.00 sys +  2.73 cusr  0.48 csys =  3.21 CPU)
[12:26:55] t/002_types.pl .. ok     4310 ms ( 0.00 usr  0.00 sys +  2.62 cusr  0.39 csys =  3.01 CPU)
[12:27:09] t/002_types.pl .. ok     4168 ms ( 0.00 usr  0.00 sys +  2.67 cusr  0.46 csys =  3.13 CPU)
[12:27:21] t/002_types.pl .. ok     4167 ms ( 0.00 usr  0.00 sys +  2.46 cusr  0.53 csys =  2.99 CPU)
[12:27:34] t/002_types.pl .. ok     4144 ms ( 0.00 usr  0.00 sys +  2.59 cusr  0.41 csys =  3.00 CPU)
[12:27:46] t/002_types.pl .. ok     3982 ms ( 0.00 usr  0.00 sys +  2.52 cusr  0.41 csys =  2.93 CPU)
[12:28:03] t/002_types.pl .. ok     4190 ms ( 0.01 usr  0.00 sys +  2.67 cusr  0.46 csys =  3.14 CPU)

With only "Changing to LockSharedObject" fix:
[12:33:02] t/002_types.pl .. ok     3815 ms ( 0.00 usr  0.00 sys +  2.30 cusr  0.38 csys =  2.68 CPU)
[12:33:16] t/002_types.pl .. ok     4295 ms ( 0.00 usr  0.00 sys +  2.66 cusr  0.42 csys =  3.08 CPU)
[12:33:31] t/002_types.pl .. ok     4270 ms ( 0.00 usr  0.00 sys +  2.72 cusr  0.44 csys =  3.16 CPU)
[12:33:44] t/002_types.pl .. ok     4460 ms ( 0.00 usr  0.00 sys +  2.78 cusr  0.45 csys =  3.23 CPU)
[12:33:58] t/002_types.pl .. ok     4340 ms ( 0.01 usr  0.00 sys +  2.67 cusr  0.45 csys =  3.13 CPU)
[12:34:11] t/002_types.pl .. ok     4142 ms ( 0.00 usr  0.00 sys +  2.58 cusr  0.42 csys =  3.00 CPU)
[12:34:24] t/002_types.pl .. ok     4459 ms ( 0.00 usr  0.00 sys +  2.76 cusr  0.49 csys =  3.25 CPU)
[12:34:38] t/002_types.pl .. ok     4427 ms ( 0.00 usr  0.00 sys +  2.68 cusr  0.48 csys =  3.16 CPU)
[12:35:10] t/002_types.pl .. ok     4642 ms ( 0.00 usr  0.00 sys +  2.84 cusr  0.55 csys =  3.39 CPU)
[12:35:22] t/002_types.pl .. ok     4047 ms ( 0.01 usr  0.00 sys +  2.49 cusr  0.46 csys =  2.96 CPU)
[12:35:32] t/002_types.pl .. ok     4505 ms ( 0.01 usr  0.00 sys +  2.90 cusr  0.45 csys =  3.36 CPU)
[12:36:03] t/002_types.pl .. ok     4088 ms ( 0.00 usr  0.00 sys +  2.51 cusr  0.42 csys =  2.93 CPU)

002_types with combination of Tom Lane's and "Changing to LockSharedObject" fix:
[10:22:04] t/002_types.pl .. ok     3730 ms ( 0.00 usr  0.00 sys +  2.30 cusr  0.41 csys =  2.71 CPU)
[10:23:40] t/002_types.pl .. ok     3666 ms ( 0.00 usr  0.00 sys +  2.16 cusr  0.42 csys =  2.58 CPU)
[10:23:31] t/002_types.pl .. ok     3665 ms ( 0.00 usr  0.00 sys +  2.31 cusr  0.40 csys =  2.71 CPU)
[10:23:23] t/002_types.pl .. ok     3500 ms ( 0.00 usr  0.00 sys +  2.20 cusr  0.36 csys =  2.56 CPU)
[10:23:14] t/002_types.pl .. ok     3704 ms ( 0.00 usr  0.00 sys +  2.36 cusr  0.35 csys =  2.71 CPU)
[10:23:05] t/002_types.pl .. ok     3594 ms ( 0.00 usr  0.00 sys +  2.32 cusr  0.31 csys =  2.63 CPU)
[10:24:10] t/002_types.pl .. ok     3702 ms ( 0.00 usr  0.00 sys +  2.27 cusr  0.42 csys =  2.69 CPU)
[10:24:22] t/002_types.pl .. ok     3741 ms ( 0.00 usr  0.00 sys +  2.39 cusr  0.36 csys =  2.75 CPU)
[10:24:38] t/002_types.pl .. ok     3676 ms ( 0.00 usr  0.00 sys +  2.28 cusr  0.43 csys =  2.71 CPU)
[10:24:50] t/002_types.pl .. ok     3843 ms ( 0.00 usr  0.00 sys +  2.36 cusr  0.43 csys =  2.79 CPU)
[10:25:03] t/002_types.pl .. ok     3710 ms ( 0.00 usr  0.00 sys +  2.30 cusr  0.36 csys =  2.66 CPU)
[10:25:12] t/002_types.pl .. ok     3695 ms ( 0.00 usr  0.00 sys +  2.34 cusr  0.35 csys =  2.69 CPU)

002_types on HEAD:
[10:31:05] t/002_types.pl .. ok     5687 ms ( 0.00 usr  0.00 sys +  2.35 cusr  0.45 csys =  2.80 CPU)
[10:31:31] t/002_types.pl .. ok     6815 ms ( 0.00 usr  0.00 sys +  2.61 cusr  0.43 csys =  3.04 CPU)
[10:31:47] t/002_types.pl .. ok     5561 ms ( 0.00 usr  0.00 sys +  2.24 cusr  0.47 csys =  2.71 CPU)
[10:32:05] t/002_types.pl .. ok     4542 ms ( 0.00 usr  0.00 sys +  2.27 cusr  0.39 csys =  2.66 CPU)
[10:32:20] t/002_types.pl .. ok     3663 ms ( 0.00 usr  0.00 sys +  2.30 cusr  0.38 csys =  2.68 CPU)
[10:32:33] t/002_types.pl .. ok     3627 ms ( 0.00 usr  0.00 sys +  2.27 cusr  0.32 csys =  2.59 CPU)
[10:32:45] t/002_types.pl .. ok     3808 ms ( 0.00 usr  0.00 sys +  2.41 cusr  0.39 csys =  2.80 CPU)
[10:32:59] t/002_types.pl .. ok     4536 ms ( 0.00 usr  0.00 sys +  2.24 cusr  0.38 csys =  2.62 CPU)
[10:33:13] t/002_types.pl .. ok     3638 ms ( 0.00 usr  0.00 sys +  2.25 cusr  0.41 csys =  2.66 CPU)
[10:33:35] t/002_types.pl .. ok     4796 ms ( 0.00 usr  0.00 sys +  2.38 cusr  0.38 csys =  2.76 CPU)
[10:33:51] t/002_types.pl .. ok     4695 ms ( 0.00 usr  0.00 sys +  2.40 cusr  0.37 csys =  2.77 CPU)
[10:34:06] t/002_types.pl .. ok     5738 ms ( 0.00 usr  0.00 sys +  2.44 cusr  0.43 csys =  2.87 CPU)

Regards,
Vignesh



Re: Deadlock between logrep apply worker and tablesync worker

From: Amit Kapila
On Mon, Jan 30, 2023 at 9:20 AM vignesh C <vignesh21@gmail.com> wrote:
>
> On Sat, 28 Jan 2023 at 11:26, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > One thing that looks a bit odd is that we will anyway have a similar
> > check in replorigin_drop_guts() which is a static function and called
> > from only one place, so, will it be required to check at both places?
>
> There is a possibility that the initial check to verify if replication
> origin exists in replorigin_drop_by_name was successful but later one
> of either table sync worker or apply worker process might have dropped
> the replication origin,
>

Won't locking on the particular origin prevent concurrent drops? IIUC,
the drop happens after the patch acquires the lock on the origin.

-- 
With Regards,
Amit Kapila.



RE: Deadlock between logrep apply worker and tablesync worker

From: "houzj.fnst@fujitsu.com"
On Monday, January 30, 2023 2:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Mon, Jan 30, 2023 at 9:20 AM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Sat, 28 Jan 2023 at 11:26, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > One thing that looks a bit odd is that we will anyway have a similar
> > > check in replorigin_drop_guts() which is a static function and
> > > called from only one place, so, will it be required to check at both places?
> >
> > There is a possibility that the initial check to verify if replication
> > origin exists in replorigin_drop_by_name was successful but later one
> > of either table sync worker or apply worker process might have dropped
> > the replication origin,
> >
> 
> Won't locking on the particular origin prevent concurrent drops? IIUC, the
> drop happens after the patch acquires the lock on the origin.

Yes, I think the existence check in replorigin_drop_guts is unnecessary,
as we already lock the origin before that. The check in
replorigin_drop_guts is a custom check after calling SearchSysCache1 to
get the tuple, but that error should not happen, since no concurrent
drop can be performed.

To make it simpler, one idea is to move the code that gets the tuple
from the system cache into replorigin_drop_by_name(). After locking the
origin, we can try to get the tuple and do the existence check, and we
can reuse the tuple to perform the origin delete. With this approach we
only need to check origin existence once, after locking. BTW, if we do
this, then we'd better rename replorigin_drop_guts() to something like
replorigin_state_clear(), as the function would then only clear the
in-memory information.

The code could be like:

-------
replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait)
...
    /*
     * Lock the origin to prevent concurrent drops. We keep the lock until the
     * end of transaction.
     */
    LockSharedObject(ReplicationOriginRelationId, roident, 0,
                     AccessExclusiveLock);

    tuple = SearchSysCache1(REPLORIGIDENT, ObjectIdGetDatum(roident));
    if (!HeapTupleIsValid(tuple))
    {
        if (!missing_ok)
            elog(ERROR, "cache lookup failed for replication origin with ID %d",
                 roident);
        
        return;
    }

    replorigin_state_clear(rel, roident, nowait);

    /*
     * Now, we can delete the catalog entry.
     */
    CatalogTupleDelete(rel, &tuple->t_self);
    ReleaseSysCache(tuple);

    CommandCounterIncrement();
...
-------

Best Regards,
Hou zj


Re: Deadlock between logrep apply worker and tablesync worker

From: vignesh C
On Mon, 30 Jan 2023 at 13:00, houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Monday, January 30, 2023 2:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Jan 30, 2023 at 9:20 AM vignesh C <vignesh21@gmail.com> wrote:
> > >
> > > On Sat, 28 Jan 2023 at 11:26, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > One thing that looks a bit odd is that we will anyway have a similar
> > > > check in replorigin_drop_guts() which is a static function and
> > > > called from only one place, so, will it be required to check at both places?
> > >
> > > There is a possibility that the initial check to verify if replication
> > > origin exists in replorigin_drop_by_name was successful but later one
> > > of either table sync worker or apply worker process might have dropped
> > > the replication origin,
> > >
> >
> > Won't locking on the particular origin prevent concurrent drops? IIUC, the
> > drop happens after the patch acquires the lock on the origin.
>
> Yes, I think the existence check in replorigin_drop_guts is unnecessary as we
> already lock the origin before that. I think the check in replorigin_drop_guts
> is a custom check after calling SearchSysCache1 to get the tuple, but the error
> should not happen as no concurrent drop can be performed.

This scenario is possible while creating a subscription: the apply
worker will try to drop the replication origin if the state is
SUBREL_STATE_SYNCDONE. The tablesync worker sets the state to
SUBREL_STATE_SYNCDONE and updates the relation state before calling
replorigin_drop_by_name. Since that transaction is committed by the
tablesync worker, the state is visible to the apply worker, which will
then try to drop the replication origin in parallel. There is a race
condition here: one of the two processes (tablesync worker or apply
worker) will acquire the lock and drop the replication origin; the other
process will get the lock only after the first one has dropped the
origin and committed its transaction. The second process will then try
to drop the replication origin once it acquires the lock and will get
the error (from replorigin_drop_guts): cache lookup failed for
replication origin with ID.
Concurrent drop is possible in this case.
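
For context, the error in that case would come from the existence check inside replorigin_drop_guts(), which looks roughly like this (abbreviated):

tuple = SearchSysCache1(REPLORIGIDENT, ObjectIdGetDatum(roident));
if (!HeapTupleIsValid(tuple))
    elog(ERROR, "cache lookup failed for replication origin with ID %d",
         roident);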

Regards,
Vignesh



Re: Deadlock between logrep apply worker and tablesync worker

From: vignesh C
On Mon, 30 Jan 2023 at 13:00, houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Monday, January 30, 2023 2:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Jan 30, 2023 at 9:20 AM vignesh C <vignesh21@gmail.com> wrote:
> > >
> > > On Sat, 28 Jan 2023 at 11:26, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > One thing that looks a bit odd is that we will anyway have a similar
> > > > check in replorigin_drop_guts() which is a static function and
> > > > called from only one place, so, will it be required to check at both places?
> > >
> > > There is a possibility that the initial check to verify if replication
> > > origin exists in replorigin_drop_by_name was successful but later one
> > > of either table sync worker or apply worker process might have dropped
> > > the replication origin,
> > >
> >
> > Won't locking on the particular origin prevent concurrent drops? IIUC, the
> > drop happens after the patch acquires the lock on the origin.
>
> Yes, I think the existence check in replorigin_drop_guts is unnecessary as we
> already lock the origin before that. I think the check in replorigin_drop_guts
> is a custom check after calling SearchSysCache1 to get the tuple, but the error
> should not happen as no concurrent drop can be performed.
>
> To make it simpler, one idea is to move the code that getting the tuple from
> system cache to the replorigin_drop_by_name(). After locking the origin, we
> can try to get the tuple and do the existence check, and we can reuse
> this tuple to perform origin delete. In this approach we only need to check
> origin existence once after locking. BTW, if we do this, then we'd better rename the
> replorigin_drop_guts() to something like replorigin_state_clear() as the function
> only clear the in-memory information after that.
>
> The code could be like:
>
> -------
> replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait)
> ...
>         /*
>          * Lock the origin to prevent concurrent drops. We keep the lock until the
>          * end of transaction.
>          */
>         LockSharedObject(ReplicationOriginRelationId, roident, 0,
>                                          AccessExclusiveLock);
>
>         tuple = SearchSysCache1(REPLORIGIDENT, ObjectIdGetDatum(roident));
>         if (!HeapTupleIsValid(tuple))
>         {
>                 if (!missing_ok)
>                         elog(ERROR, "cache lookup failed for replication origin with ID %d",
>                                  roident);
>
>                 return;
>         }
>
>         replorigin_state_clear(rel, roident, nowait);
>
>         /*
>          * Now, we can delete the catalog entry.
>          */
>         CatalogTupleDelete(rel, &tuple->t_self);
>         ReleaseSysCache(tuple);
>
>         CommandCounterIncrement();
> ...

+1 for this change, as it removes the redundant check. I will post an
updated version with this change.

Regards,
Vignesh



Re: Deadlock between logrep apply worker and tablesync worker

From: vignesh C
On Mon, 30 Jan 2023 at 13:00, houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Monday, January 30, 2023 2:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Jan 30, 2023 at 9:20 AM vignesh C <vignesh21@gmail.com> wrote:
> > >
> > > On Sat, 28 Jan 2023 at 11:26, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > One thing that looks a bit odd is that we will anyway have a similar
> > > > check in replorigin_drop_guts() which is a static function and
> > > > called from only one place, so, will it be required to check at both places?
> > >
> > > There is a possibility that the initial check to verify if replication
> > > origin exists in replorigin_drop_by_name was successful but later one
> > > of either table sync worker or apply worker process might have dropped
> > > the replication origin,
> > >
> >
> > Won't locking on the particular origin prevent concurrent drops? IIUC, the
> > drop happens after the patch acquires the lock on the origin.
>
> Yes, I think the existence check in replorigin_drop_guts is unnecessary as we
> already lock the origin before that. I think the check in replorigin_drop_guts
> is a custom check after calling SearchSysCache1 to get the tuple, but the error
> should not happen as no concurrent drop can be performed.
>
> To make it simpler, one idea is to move the code that getting the tuple from
> system cache to the replorigin_drop_by_name(). After locking the origin, we
> can try to get the tuple and do the existence check, and we can reuse
> this tuple to perform origin delete. In this approach we only need to check
> origin existence once after locking. BTW, if we do this, then we'd better rename the
> replorigin_drop_guts() to something like replorigin_state_clear() as the function
> only clear the in-memory information after that.
>
> The code could be like:
>
> -------
> replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait)
> ...
>         /*
>          * Lock the origin to prevent concurrent drops. We keep the lock until the
>          * end of transaction.
>          */
>         LockSharedObject(ReplicationOriginRelationId, roident, 0,
>                                          AccessExclusiveLock);
>
>         tuple = SearchSysCache1(REPLORIGIDENT, ObjectIdGetDatum(roident));
>         if (!HeapTupleIsValid(tuple))
>         {
>                 if (!missing_ok)
>                         elog(ERROR, "cache lookup failed for replication origin with ID %d",
>                                  roident);
>
>                 return;
>         }
>
>         replorigin_state_clear(rel, roident, nowait);
>
>         /*
>          * Now, we can delete the catalog entry.
>          */
>         CatalogTupleDelete(rel, &tuple->t_self);
>         ReleaseSysCache(tuple);
>
>         CommandCounterIncrement();

The attached updated patch has the changes to handle the same.

Regards,
Vignesh

Attachments

Re: Deadlock between logrep apply worker and tablesync worker

From: vignesh C
On Mon, 30 Jan 2023 at 17:30, vignesh C <vignesh21@gmail.com> wrote:
>
> On Mon, 30 Jan 2023 at 13:00, houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Monday, January 30, 2023 2:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Mon, Jan 30, 2023 at 9:20 AM vignesh C <vignesh21@gmail.com> wrote:
> > > >
> > > > On Sat, 28 Jan 2023 at 11:26, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > > One thing that looks a bit odd is that we will anyway have a similar
> > > > > check in replorigin_drop_guts() which is a static function and
> > > > > called from only one place, so, will it be required to check at both places?
> > > >
> > > > There is a possibility that the initial check to verify if replication
> > > > origin exists in replorigin_drop_by_name was successful but later one
> > > > of either table sync worker or apply worker process might have dropped
> > > > the replication origin,
> > > >
> > >
> > > Won't locking on the particular origin prevent concurrent drops? IIUC, the
> > > drop happens after the patch acquires the lock on the origin.
> >
> > Yes, I think the existence check in replorigin_drop_guts is unnecessary as we
> > already lock the origin before that. I think the check in replorigin_drop_guts
> > is a custom check after calling SearchSysCache1 to get the tuple, but the error
> > should not happen as no concurrent drop can be performed.
> >
> > To make it simpler, one idea is to move the code that getting the tuple from
> > system cache to the replorigin_drop_by_name(). After locking the origin, we
> > can try to get the tuple and do the existence check, and we can reuse
> > this tuple to perform origin delete. In this approach we only need to check
> > origin existence once after locking. BTW, if we do this, then we'd better rename the
> > replorigin_drop_guts() to something like replorigin_state_clear() as the function
> > only clear the in-memory information after that.
> >
> > The code could be like:
> >
> > -------
> > replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait)
> > ...
> >         /*
> >          * Lock the origin to prevent concurrent drops. We keep the lock until the
> >          * end of transaction.
> >          */
> >         LockSharedObject(ReplicationOriginRelationId, roident, 0,
> >                                          AccessExclusiveLock);
> >
> >         tuple = SearchSysCache1(REPLORIGIDENT, ObjectIdGetDatum(roident));
> >         if (!HeapTupleIsValid(tuple))
> >         {
> >                 if (!missing_ok)
> >                         elog(ERROR, "cache lookup failed for replication origin with ID %d",
> >                                  roident);
> >
> >                 return;
> >         }
> >
> >         replorigin_state_clear(rel, roident, nowait);
> >
> >         /*
> >          * Now, we can delete the catalog entry.
> >          */
> >         CatalogTupleDelete(rel, &tuple->t_self);
> >         ReleaseSysCache(tuple);
> >
> >         CommandCounterIncrement();
>
> The attached updated patch has the changes to handle the same.

I had not merged one of my local changes; please find the updated patch
including that change. Sorry for missing it.

Regards,
Vignesh

Attachments

RE: Deadlock between logrep apply worker and tablesync worker

From: "houzj.fnst@fujitsu.com"
On Tuesday, January 31, 2023 1:07 AM vignesh C <vignesh21@gmail.com> wrote:
> On Mon, 30 Jan 2023 at 17:30, vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Mon, 30 Jan 2023 at 13:00, houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Monday, January 30, 2023 2:32 PM Amit Kapila
> <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Mon, Jan 30, 2023 at 9:20 AM vignesh C <vignesh21@gmail.com>
> wrote:
> > > > >
> > > > > On Sat, 28 Jan 2023 at 11:26, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > > > > >
> > > > > > One thing that looks a bit odd is that we will anyway have a
> > > > > > similar check in replorigin_drop_guts() which is a static
> > > > > > function and called from only one place, so, will it be required to
> check at both places?
> > > > >
> > > > > There is a possibility that the initial check to verify if
> > > > > replication origin exists in replorigin_drop_by_name was
> > > > > successful but later one of either table sync worker or apply
> > > > > worker process might have dropped the replication origin,
> > > > >
> > > >
> > > > Won't locking on the particular origin prevent concurrent drops?
> > > > IIUC, the drop happens after the patch acquires the lock on the origin.
> > >
> > > Yes, I think the existence check in replorigin_drop_guts is
> > > unnecessary as we already lock the origin before that. I think the
> > > check in replorigin_drop_guts is a custom check after calling
> > > SearchSysCache1 to get the tuple, but the error should not happen as no
> concurrent drop can be performed.
> > >
> > > To make it simpler, one idea is to move the code that getting the
> > > tuple from system cache to the replorigin_drop_by_name(). After
> > > locking the origin, we can try to get the tuple and do the existence
> > > check, and we can reuse this tuple to perform origin delete. In this
> > > approach we only need to check origin existence once after locking.
> > > BTW, if we do this, then we'd better rename the
> > > replorigin_drop_guts() to something like replorigin_state_clear() as
> > > the function only clear the in-memory information after that.
> > >
> >
> > The attached updated patch has the changes to handle the same.
> 
> I had not merged one of the local changes that was present, please find the
> updated patch including that change. Sorry for missing that change.
> 

I also measured the time of "src/test/subscription/t/002_types.pl"
before and after the patch (changing the lock level) and Tom's patch
(splitting the transaction), similar to what Vignesh shared on -hackers.

I ran each case about 100 times. Tom's patch and the lock-level patch
behave similarly on my machine [1].

HEAD: 3426 ~ 6425 ms
HEAD + Tom: 3404 ~ 3462 ms
HEAD + Vignesh: 3419 ~ 3474 ms
HEAD + Tom + Vignesh: 3408 ~ 3454 ms

Even apart from the reduction in test time, reducing the lock level and
locking the specific object also reduces lock contention, from which
users of the exposed function, the tablesync worker, and the apply
worker can all benefit. So I think pushing the patch to change the lock
level makes sense.

And the patch looks good to me.

While on it, after pushing the patch, I think there is another case that
might also be worth improving: the tablesync worker and the apply worker
can both try to drop the same origin, which might cause some delay. This
is a different case (not the deadlock), so I feel we can try to improve
it in another patch.

[1] CentOS 8.2, 128G RAM, 40 processors Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz

Best Regards,
Hou zj

Re: Deadlock between logrep apply worker and tablesync worker

From: Amit Kapila
On Thu, Feb 2, 2023 at 12:05 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Tuesday, January 31, 2023 1:07 AM vignesh C <vignesh21@gmail.com> wrote:
> > On Mon, 30 Jan 2023 at 17:30, vignesh C <vignesh21@gmail.com> wrote:
> >
>
> I also tried to test the time of "src/test/subscription/t/002_types.pl"
> before and after the patch(change the lock level) and Tom's patch(split
> transaction) like what Vignesh has shared on -hackers.
>
> I run about 100 times for each case. Tom's and the lock level patch
> behave similarly on my machines[1].
>
> HEAD: 3426 ~ 6425 ms
> HEAD + Tom: 3404 ~ 3462 ms
> HEAD + Vignesh: 3419 ~ 3474 ms
> HEAD + Tom + Vignesh: 3408 ~ 3454 ms
>
> Even apart from the testing time reduction, reducing the lock level and lock
> the specific object can also help improve the lock contention which user(that
> use the exposed function) , table sync worker and apply worker can also benefit
> from it. So, I think pushing the patch to change the lock level makes sense.
>
> And the patch looks good to me.
>

Thanks for the tests. I also see a reduction in test time variability
with Vignesh's patch. I think we can release the locks in case the
origin is concurrently dropped as in the attached patch. I am planning
to commit this patch tomorrow unless there are more comments or
objections.

> While on it, after pushing the patch, I think there is another case might also
> worth to be improved, that is the table sync and apply worker try to drop the
> same origin which might cause some delay. This is another case(different from
> the deadlock), so I feel we can try to improve this in another patch.
>

Right, I think that case could be addressed by Tom's patch to some
extent but I am thinking we should also try to analyze if we can
completely avoid the need to remove origins from both processes. One
idea could be to introduce another relstate something like
PRE_SYNCDONE and set it in a separate transaction before we set the
state as SYNCDONE and remove the slot and origin in tablesync worker.
Now, if the tablesync worker errors out due to some reason during the
second transaction, it can remove the slot and origin after restart by
checking the state. However, it would add another relstate which may
not be the best way to address this problem. Anyway, that can be
accomplished as a separate patch.
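
A rough sketch of that idea in the tablesync worker (SUBREL_STATE_PRE_SYNCDONE is only a proposed name at this point, and the surrounding details are assumptions):

/*
 * First transaction: record that the sync itself is complete but the
 * final cleanup is still pending, so a restarted tablesync worker
 * knows it must redo the cleanup.
 */
StartTransactionCommand();
UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
                           MyLogicalRepWorker->relid,
                           SUBREL_STATE_PRE_SYNCDONE,   /* proposed */
                           current_lsn);
CommitTransactionCommand();

/*
 * Second transaction: drop the tablesync slot and origin, then mark
 * the relation SYNCDONE; an error here leaves PRE_SYNCDONE behind.
 */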

-- 
With Regards,
Amit Kapila.

Attachments

Re: Deadlock between logrep apply worker and tablesync worker

From: Amit Kapila
On Thu, Feb 2, 2023 at 4:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Thanks for the tests. I also see a reduction in test time variability
> with Vignesh's patch. I think we can release the locks in case the
> origin is concurrently dropped as in the attached patch. I am planning
> to commit this patch tomorrow unless there are more comments or
> objections.
>

Pushed.

-- 
With Regards,
Amit Kapila.



RE: Deadlock between logrep apply worker and tablesync worker

From: "houzj.fnst@fujitsu.com"
On Thursday, February 2, 2023 7:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Thu, Feb 2, 2023 at 12:05 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Tuesday, January 31, 2023 1:07 AM vignesh C <vignesh21@gmail.com>
> wrote:
> > > On Mon, 30 Jan 2023 at 17:30, vignesh C <vignesh21@gmail.com> wrote:
> > >
> >
> > I also tried to test the time of "src/test/subscription/t/002_types.pl"
> > before and after the patch(change the lock level) and Tom's
> > patch(split
> > transaction) like what Vignesh has shared on -hackers.
> >
> > I run about 100 times for each case. Tom's and the lock level patch
> > behave similarly on my machines[1].
> >
> > HEAD: 3426 ~ 6425 ms
> > HEAD + Tom: 3404 ~ 3462 ms
> > HEAD + Vignesh: 3419 ~ 3474 ms
> > HEAD + Tom + Vignesh: 3408 ~ 3454 ms
> >
> > Even apart from the testing time reduction, reducing the lock level
> > and lock the specific object can also help improve the lock contention
> > which user(that use the exposed function) , table sync worker and
> > apply worker can also benefit from it. So, I think pushing the patch to change
> the lock level makes sense.
> >
> > And the patch looks good to me.
> >
> 
> Thanks for the tests. I also see a reduction in test time variability with Vignesh's
> patch. I think we can release the locks in case the origin is concurrently
> dropped as in the attached patch. I am planning to commit this patch
> tomorrow unless there are more comments or objections.
> 
> > While on it, after pushing the patch, I think there is another case
> > might also worth to be improved, that is the table sync and apply
> > worker try to drop the same origin which might cause some delay. This
> > is another case(different from the deadlock), so I feel we can try to improve
> this in another patch.
> >
> 
> Right, I think that case could be addressed by Tom's patch to some extent but
> I am thinking we should also try to analyze if we can completely avoid the need
> to remove origins from both processes. One idea could be to introduce
> another relstate something like PRE_SYNCDONE and set it in a separate
> transaction before we set the state as SYNCDONE and remove the slot and
> origin in tablesync worker.
> Now, if the tablesync worker errors out due to some reason during the second
> transaction, it can remove the slot and origin after restart by checking the state.
> However, it would add another relstate which may not be the best way to
> address this problem. Anyway, that can be accomplished as a separate patch.

Here is an attempt to achieve that.
Basically, the patch removes the code that drops the origin in the apply
worker, and adds a new state PRE_SYNCDONE that is set after
synchronization has finished in front of apply (sublsn set), but before
dropping the origin and the other final cleanups. The tablesync worker
will restart and redo the cleanup if it failed after reaching the new
state. Besides, since changes can already be applied to the table in the
PRE_SYNCDONE state, I also modified the check in
should_apply_changes_for_rel(). Some other conditions for the origin
drop in subscription commands were adjusted in this patch as well.

Best Regards,
Hou zj

Attachments

Re: Deadlock between logrep apply worker and tablesync worker

From: Peter Smith
On Fri, Feb 3, 2023 at 6:58 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
...
> > Right, I think that case could be addressed by Tom's patch to some extent but
> > I am thinking we should also try to analyze if we can completely avoid the need
> > to remove origins from both processes. One idea could be to introduce
> > another relstate something like PRE_SYNCDONE and set it in a separate
> > transaction before we set the state as SYNCDONE and remove the slot and
> > origin in tablesync worker.
> > Now, if the tablesync worker errors out due to some reason during the second
> > transaction, it can remove the slot and origin after restart by checking the state.
> > However, it would add another relstate which may not be the best way to
> > address this problem. Anyway, that can be accomplished as a separate patch.
>
> Here is an attempt to achieve the same.
> Basically, the patch removes the code that drop the origin in apply worker. And
> add a new state PRE_SYNCDONE after synchronization finished in front of apply
> (sublsn set), but before dropping the origin and other final cleanups. The
> tablesync will restart and redo the cleanup if it failed after reaching the new
> state. Besides, since the changes can already be applied on the table in
> PRE_SYNCDONE state, so I also modified the check in
> should_apply_changes_for_rel(). And some other conditions for the origin drop
> in subscription commands are were adjusted in this patch.
>

Here are some review comments for the 0001 patch

======
General Comment

0.
The idea of using the extra relstate for clean-up seems OK, but the
implementation of the new state in this patch appears misordered and
misnamed to me.

The state name should indicate what it is doing (PRE_SYNCDONE is
meaningless). The patch describes in several places that this state
means "synchronized, but not yet cleaned up" therefore IMO it means
the SYNCDONE state should be *before* this new state. And since this
new state is for "cleanup" then let's call it something like that.

To summarize, I don’t think the meaning of SYNCDONE should be touched.
SYNCDONE means the synchronization is done, same as before. And your
new "cleanup" state belongs directly *after* that. IMO it should be
like this:

1. STATE_INIT
2. STATE_DATASYNC
3. STATE_FINISHEDCOPY
4. STATE_SYNCDONE
5. STATE_CLEANUP <-- new relstate
6. STATE_READY

Of course, this is going to impact almost every aspect of the patch,
but I think everything will be basically the same as you have it now
-- only all the state names and comments need to be adjusted according
to the above.
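
In terms of the defines in pg_subscription_rel.h, the proposed ordering would look something like this (SUBREL_STATE_CLEANUP is a suggested new name, not an existing constant):

#define SUBREL_STATE_INIT         'i'   /* initialize */
#define SUBREL_STATE_DATASYNC     'd'   /* data is being copied */
#define SUBREL_STATE_FINISHEDCOPY 'f'   /* tablesync copy is completed */
#define SUBREL_STATE_SYNCDONE     's'   /* synchronization done */
#define SUBREL_STATE_CLEANUP      'c'   /* proposed: cleanup done */
#define SUBREL_STATE_READY        'r'   /* ready (normal replication) */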

======
Commit Message

1.
Previously, we allowed the apply worker to drop the origin to avoid the case
that the tablesync worker fails to the origin(due to crash). In this case we
don't restart the tablesync worker, and the apply worker can clean the origin.

~

There seem to be some words missing in this paragraph.

SUGGESTION
Previously, we allowed the apply worker to drop the origin as a way to
recover from the scenario where the tablesync worker failed to drop it
(due to crash).

~~~

2.
To improve this, we introduce a new relstate SUBREL_STATE_PRE_SYNCDONE which
will be set after synchronization finished in front of apply (sublsn set), but
before dropping the origin and other final cleanups. The apply worker will
restart tablesync worker if the relstate is SUBREL_STATE_PRE_SYNCDONE. This
way, even if the tablesync worker error out in the transaction that tries to
drop the origin, the apply worker will restart the tablesync worker to redo the
cleanup(for origin and other stuff) and then directly exit.

~

2a.
This is going to be impacted by my "General Comment". Notice how you
again describe it as "will be set after synchronization finished". That
is further evidence that the new CLEANUP state should directly follow
the SYNCDONE state.

2b.
"error out” --> "encounters an error"

2c.
"cleanup(for origin" --> space before the "("

======
doc/src/sgml/catalogs.sgml

3.
@@ -8071,7 +8071,8 @@ SCRAM-SHA-256$<replaceable><iteration
count></replaceable>:<replaceable>&l
        <literal>i</literal> = initialize,
        <literal>d</literal> = data is being copied,
        <literal>f</literal> = finished table copy,
-       <literal>s</literal> = synchronized,
+       <literal>p</literal> = synchronized but not yet cleaned up,
+       <literal>s</literal> = synchronization done,
        <literal>r</literal> = ready (normal replication)
       </para></entry>
      </row>
@@ -8082,8 +8083,8 @@ SCRAM-SHA-256$<replaceable><iteration
count></replaceable>:<replaceable>&l
       </para>
       <para>
        Remote LSN of the state change used for synchronization coordination
-       when in <literal>s</literal> or <literal>r</literal> states,
-       otherwise null
+       when in <literal>p</literal>, <literal>s</literal> or
+       <literal>r</literal> states, otherwise null
       </para></entry>
      </row>
     </tbody>

This state order and choice of the letter are impacted by my "General Comment".

IMO it should be more like this:

State code: i = initialize, d = data is being copied, f = finished
table copy,  s = synchronization done, c = clean-up done, r = ready
(normal replication)

======
src/backend/commands/subscriptioncmds.c

4. AlterSubscription_refresh

Some adjustments are needed according to my "General Comment".

~~~

5. DropSubscription

Some adjustments are needed according to my "General Comment".

======
src/backend/replication/logical/tablesync.c

6.

+ * Update the state of the table to SUBREL_STATE_SYNCDONE and cleanup the
+ * tablesync slot and drop the tablesync's origin tracking.
+ */
+static void
+finish_synchronization(bool restart_after_crash)

6a.
Suggest calling this function something like 'cleanup_after_synchronization'

~

6b.
Some adjustments to states and comments are needed according to my
"General Comment".

~~~

7. process_syncing_tables_for_sync

- MyLogicalRepWorker->relstate = SUBREL_STATE_SYNCDONE;
+ MyLogicalRepWorker->relstate = SUBREL_STATE_PRE_SYNCDONE;
  MyLogicalRepWorker->relstate_lsn = current_lsn;

This should just be setting SUBREL_STATE_SYNCDONE how it previously did.

Other states/comments in this function need to be changed according to my
"General Comment".

~

8.
if (rstate->state == SUBREL_STATE_SYNCDONE)
{
/*
* Apply has caught up to the position where the table sync has
* finished.  Mark the table as ready so that the apply will just
* continue to replicate it normally.
*/

That should now be checking for SUBREL_STATE_CLEANUPDONE according to
my "General Comment", e.g.:

~~~

9. process_syncing_tables_for_apply

Some adjustments to states and comments are needed according to my
"General Comment".

~~~

10. LogicalRepSyncTableStart

Some adjustments to states and comments are needed according to my
"General Comment".

======
src/backend/replication/logical/worker.c

11. should_apply_changes_for_rel

Some adjustments to states according to my "General Comment".

======
src/include/catalog/pg_subscription_rel.h

12.
@@ -62,8 +62,10 @@
DECLARE_UNIQUE_INDEX_PKEY(pg_subscription_rel_srrelid_srsubid_index,
6117, Subsc
  * NULL) */
 #define SUBREL_STATE_FINISHEDCOPY 'f' /* tablesync copy phase is completed
  * (sublsn NULL) */
-#define SUBREL_STATE_SYNCDONE 's' /* synchronization finished in front of
- * apply (sublsn set) */
+#define SUBREL_STATE_PRE_SYNCDONE 'p' /* synchronization finished in front of
+ * apply (sublsn set), but the final
+ * cleanup has not yet been performed */
+#define SUBREL_STATE_SYNCDONE 's' /* synchronization complete */
 #define SUBREL_STATE_READY 'r' /* ready (sublsn set) */

Some adjustments to states and comments are needed according to my
"General Comment".

------
Kind Regards,
Peter Smith.
Fujitsu Australia



RE: Deadlock between logrep apply worker and tablesync worker

From
"houzj.fnst@fujitsu.com"
Date:
On Tuesday, February 7, 2023 12:12 PM Peter Smith <smithpb2250@gmail.com> wrote:
> On Fri, Feb 3, 2023 at 6:58 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com>
> wrote:
> >
> ...
> > > Right, I think that case could be addressed by Tom's patch to some
> > > extent but I am thinking we should also try to analyze if we can
> > > completely avoid the need to remove origins from both processes. One
> > > idea could be to introduce another relstate something like
> > > PRE_SYNCDONE and set it in a separate transaction before we set the
> > > state as SYNCDONE and remove the slot and origin in tablesync worker.
> > > Now, if the tablesync worker errors out due to some reason during
> > > the second transaction, it can remove the slot and origin after restart by
> checking the state.
> > > However, it would add another relstate which may not be the best way
> > > to address this problem. Anyway, that can be accomplished as a separate
> patch.
> >
> > Here is an attempt to achieve the same.
> > Basically, the patch removes the code that drop the origin in apply
> > worker. And add a new state PRE_SYNCDONE after synchronization
> > finished in front of apply (sublsn set), but before dropping the
> > origin and other final cleanups. The tablesync will restart and redo
> > the cleanup if it failed after reaching the new state. Besides, since
> > the changes can already be applied on the table in PRE_SYNCDONE state,
> > so I also modified the check in should_apply_changes_for_rel(). And
> > some other conditions for the origin drop in subscription commands are
> were adjusted in this patch.
> >
> 
> Here are some review comments for the 0001 patch
> 
> ======
> General Comment
> 
> 0.
> The idea of using the extra relstate for clean-up seems OK, but the
> implementation of the new state in this patch appears misordered and
> misnamed to me.
> 
> The state name should indicate what it is doing (PRE_SYNCDONE is
> meaningless). The patch describes in several places that this state means
> "synchronized, but not yet cleaned up" therefore IMO it means the SYNCDONE
> state should be *before* this new state. And since this new state is for
> "cleanup" then let's call it something like that.
> 
> To summarize, I don’t think the meaning of SYNCDONE should be touched.
> SYNCDONE means the synchronization is done, same as before. And your new
> "cleanup" state belongs directly *after* that. IMO it should be like this:
> 
> 1. STATE_INIT
> 2. STATE_DATASYNC
> 3. STATE_FINISHEDCOPY
> 4. STATE_SYNCDONE
> 5. STATE_CLEANUP <-- new relstate
> 6. STATE_READY
> 
> Of course, this is going to impact almost every aspect of the patch, but I think
> everything will be basically the same as you have it now
> -- only all the state names and comments need to be adjusted according to the
> above.

Although I agree that CLEANUP is easier to understand, I am a bit concerned
that the changes would be rather invasive.

If we add a CLEANUP state at the end as suggested, it will change the meaning
of the existing SYNCDONE state: before the change it means both the data sync
and the cleanup have been done, but after the change it only means the data
sync is over. This also means all the current C code that treats SYNCDONE as
the final state of table sync will need to be changed. Moreover, it's common
for users to query the relation state from pg_subscription_rel to check
whether the table sync of a table is finished (e.g. relstate IN ('r', 's')),
but if we add a new state (CLEANUP) as the final state, then all such queries
would need to be changed to check something like relstate IN ('r', 'x')
(where 'x' is the new cleanup state).

Best Regards,
Hou zj

Re: Deadlock between logrep apply worker and tablesync worker

From
Peter Smith
Date:
On Tue, Feb 7, 2023 at 6:46 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Tuesday, February 7, 2023 12:12 PM Peter Smith <smithpb2250@gmail.com> wrote:
> > On Fri, Feb 3, 2023 at 6:58 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com>
> > wrote:
> > >
> > ...
> > > > Right, I think that case could be addressed by Tom's patch to some
> > > > extent but I am thinking we should also try to analyze if we can
> > > > completely avoid the need to remove origins from both processes. One
> > > > idea could be to introduce another relstate something like
> > > > PRE_SYNCDONE and set it in a separate transaction before we set the
> > > > state as SYNCDONE and remove the slot and origin in tablesync worker.
> > > > Now, if the tablesync worker errors out due to some reason during
> > > > the second transaction, it can remove the slot and origin after restart by
> > checking the state.
> > > > However, it would add another relstate which may not be the best way
> > > > to address this problem. Anyway, that can be accomplished as a separate
> > patch.
> > >
> > > Here is an attempt to achieve the same.
> > > Basically, the patch removes the code that drop the origin in apply
> > > worker. And add a new state PRE_SYNCDONE after synchronization
> > > finished in front of apply (sublsn set), but before dropping the
> > > origin and other final cleanups. The tablesync will restart and redo
> > > the cleanup if it failed after reaching the new state. Besides, since
> > > the changes can already be applied on the table in PRE_SYNCDONE state,
> > > so I also modified the check in should_apply_changes_for_rel(). And
> > > some other conditions for the origin drop in subscription commands are
> > were adjusted in this patch.
> > >
> >
> > Here are some review comments for the 0001 patch
> >
> > ======
> > General Comment
> >
> > 0.
> > The idea of using the extra relstate for clean-up seems OK, but the
> > implementation of the new state in this patch appears misordered and
> > misnamed to me.
> >
> > The state name should indicate what it is doing (PRE_SYNCDONE is
> > meaningless). The patch describes in several places that this state means
> > "synchronized, but not yet cleaned up" therefore IMO it means the SYNCDONE
> > state should be *before* this new state. And since this new state is for
> > "cleanup" then let's call it something like that.
> >
> > To summarize, I don’t think the meaning of SYNCDONE should be touched.
> > SYNCDONE means the synchronization is done, same as before. And your new
> > "cleanup" state belongs directly *after* that. IMO it should be like this:
> >
> > 1. STATE_INIT
> > 2. STATE_DATASYNC
> > 3. STATE_FINISHEDCOPY
> > 4. STATE_SYNCDONE
> > 5. STATE_CLEANUP <-- new relstate
> > 6. STATE_READY
> >
> > Of course, this is going to impact almost every aspect of the patch, but I think
> > everything will be basically the same as you have it now
> > -- only all the state names and comments need to be adjusted according to the
> > above.
>
> Although I agree the CLEANUP is easier to understand, but I am a bit concerned
> that the changes would be a bit invasive.
>
> If we add a CLEANUP state at the end as suggested, it will change the meaning
> of the existing SYNCDONE state, before the change it means both data sync and
> cleanup have been done, but after the change it only mean the data sync is
> over. This also means all the current C codes that considered the SYNCDONE as
> the final state of table sync will need to be changed. Moreover, it's common
> for user to query the relation state from pg_subscription_rel to identify if
> the table sync of a table is finished(e.g. check relstate IN ('r', 's')), but
> if we add a new state(CLEANUP) as the final state, then all these SQLs would
> need to be changed as they need to check like relstate IN ('r', 'x'(new cleanup
> state)).
>

IIUC, you are saying that we still want to keep the SYNCDONE state as
the last state before READY mainly because otherwise there is too much
impact on user/test SQL that is currently checking those ('s','r')
states.

OTOH, in the current 0001 patch you had the SUBREL_STATE_PRE_SYNCDONE
meaning "synchronized but not yet cleaned up" (that's verbatim from
your PGDOCS). And there is C code where you are checking
SUBREL_STATE_PRE_SYNCDONE and essentially giving the state before
SYNCDONE equal status to SYNCDONE itself (e.g.
should_apply_changes_for_rel seems to be doing this).

It seems to be trying to have an each-way bet...
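
For reference, on HEAD the apply worker's branch of
should_apply_changes_for_rel accepts READY, or SYNCDONE once the apply
position has passed the sync LSN. My reading of 0001 is that it ends up
roughly like this (a simplified sketch, ignoring the parallel apply case;
not the exact hunk):

static bool
should_apply_changes_for_rel(LogicalRepRelMapEntry *rel)
{
    if (am_tablesync_worker())
        return MyLogicalRepWorker->relid == rel->localreloid;

    /* 0001 presumably treats PRE_SYNCDONE the same as SYNCDONE here */
    return (rel->state == SUBREL_STATE_READY ||
            ((rel->state == SUBREL_STATE_SYNCDONE ||
              rel->state == SUBREL_STATE_PRE_SYNCDONE) &&
             rel->statelsn <= remote_final_lsn));
}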

~~~

But I think there may be an easy way out of this problem:

Current HEAD
1. STATE_INIT 'i'
2. STATE_DATASYNC 'd'
3. STATE_FINISHEDCOPY 'f'
4. STATE_SYNCDONE 's'
5. STATE_READY 'r'

The patch 0001
1. STATE_INIT 'i'
2. STATE_DATASYNC 'd'
3. STATE_FINISHEDCOPY 'f'
4. STATE_PRESYNCDONE 'p' <-- new relstate
5. STATE_SYNCDONE 's'
6. STATE_READY 'r'

My previous suggestion (which you acknowledge is easier to understand,
but might cause hassles for existing SQL)
1. STATE_INIT 'i'
2. STATE_DATASYNC 'd'
3. STATE_FINISHEDCOPY 'f'
4. STATE_SYNCDONE 's'
5. STATE_CLEANUP 'x' <-- new relstate
6. STATE_READY 'r'

SUGGESTED (hack to solve everything?)
1. STATE_INIT 'i'
2. STATE_DATASYNC 'd'
3. STATE_FINISHEDCOPY 'f'
4. STATE_SYNCDONE_PRE_CLEANUP 'x' <-- change the char code for this
existing relstate (was SYNCDONE 's')
5. STATE_SYNCDONE_WITH_CLEANUP 's' <-- new relstate using 's'
6. STATE_READY 'r'

By commandeering the 's' flag for the new CLEANUP state, no existing
user code or test code needs to change -- IIUC everything will work the
same as before.
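
In pg_subscription_rel.h terms, the SUGGESTED scheme would be something
like this (a sketch only; the 'x' letter and the long names are just
placeholders):

#define SUBREL_STATE_SYNCDONE_PRE_CLEANUP   'x'  /* sync finished in front of apply
                                                  * (sublsn set), but slot/origin
                                                  * cleanup not yet done */
#define SUBREL_STATE_SYNCDONE_WITH_CLEANUP  's'  /* sync finished and cleanup done
                                                  * (sublsn set) */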

~

Hmmm -- In hindsight, perhaps I have gone around in a big circle here,
and the solution I am describing is almost exactly the same as your
0001 patch, only with better names for the relstates.

Thoughts?

------
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Deadlock between logrep apply worker and tablesync worker

From
Peter Smith
Date:
On Fri, Feb 3, 2023 at 6:58 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Thursday, February 2, 2023 7:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Feb 2, 2023 at 12:05 PM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Tuesday, January 31, 2023 1:07 AM vignesh C <vignesh21@gmail.com>
> > wrote:
> > > > On Mon, 30 Jan 2023 at 17:30, vignesh C <vignesh21@gmail.com> wrote:
> > > >
> > >
> > > I also tried to test the time of "src/test/subscription/t/002_types.pl"
> > > before and after the patch(change the lock level) and Tom's
> > > patch(split
> > > transaction) like what Vignesh has shared on -hackers.
> > >
> > > I run about 100 times for each case. Tom's and the lock level patch
> > > behave similarly on my machines[1].
> > >
> > > HEAD: 3426 ~ 6425 ms
> > > HEAD + Tom: 3404 ~ 3462 ms
> > > HEAD + Vignesh: 3419 ~ 3474 ms
> > > HEAD + Tom + Vignesh: 3408 ~ 3454 ms
> > >
> > > Even apart from the testing time reduction, reducing the lock level
> > > and lock the specific object can also help improve the lock contention
> > > which user(that use the exposed function) , table sync worker and
> > > apply worker can also benefit from it. So, I think pushing the patch to change
> > the lock level makes sense.
> > >
> > > And the patch looks good to me.
> > >
> >
> > Thanks for the tests. I also see a reduction in test time variability with Vignesh's
> > patch. I think we can release the locks in case the origin is concurrently
> > dropped as in the attached patch. I am planning to commit this patch
> > tomorrow unless there are more comments or objections.
> >
> > > While on it, after pushing the patch, I think there is another case
> > > might also worth to be improved, that is the table sync and apply
> > > worker try to drop the same origin which might cause some delay. This
> > > is another case(different from the deadlock), so I feel we can try to improve
> > this in another patch.
> > >
> >
> > Right, I think that case could be addressed by Tom's patch to some extent but
> > I am thinking we should also try to analyze if we can completely avoid the need
> > to remove origins from both processes. One idea could be to introduce
> > another relstate something like PRE_SYNCDONE and set it in a separate
> > transaction before we set the state as SYNCDONE and remove the slot and
> > origin in tablesync worker.
> > Now, if the tablesync worker errors out due to some reason during the second
> > transaction, it can remove the slot and origin after restart by checking the state.
> > However, it would add another relstate which may not be the best way to
> > address this problem. Anyway, that can be accomplished as a separate patch.
>
> Here is an attempt to achieve the same.
> Basically, the patch removes the code that drop the origin in apply worker. And
> add a new state PRE_SYNCDONE after synchronization finished in front of apply
> (sublsn set), but before dropping the origin and other final cleanups. The
> tablesync will restart and redo the cleanup if it failed after reaching the new
> state. Besides, since the changes can already be applied on the table in
> PRE_SYNCDONE state, so I also modified the check in
> should_apply_changes_for_rel(). And some other conditions for the origin drop
> in subscription commands are were adjusted in this patch.
>


BTW, tablesync.c has a large file header comment which describes all of
the relstates, including some examples. So this patch also needs to
update that comment.

------
Kind Regards,
Peter Smith.
Fujitsu Australia.