Discussion: "ERROR: deadlock detected" when replicating TRUNCATE


"ERROR: deadlock detected" when replicating TRUNCATE

From
Peter Smith
Date:
While doing logical replication testing, we encountered a problem that
causes a deadlock error to be logged when replicating a TRUNCATE
simultaneously to two subscriptions.
e.g.
----------
2021-05-12 11:30:19.457 AEST [11393] ERROR:  deadlock detected
2021-05-12 11:30:19.457 AEST [11393] DETAIL:  Process 11393 waits for
ShareLock on transaction 597; blocked by process 11582.
Process 11582 waits for ShareLock on relation 16384 of database 14896;
blocked by process 11393.
----------

At this time, both subscriber (apply worker) processes are executing
inside the ExecuteTruncateGuts function simultaneously, and they end up
waiting on each other.

e.g.
----------
Process 11582
(gdb) bt
#0  0x00007fa1979515e3 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1  0x000000000093e5d0 in WaitEventSetWaitBlock (set=0x2facac8,
cur_timeout=-1, occurred_events=0x7ffed5fdff00, nevents=1) at
latch.c:1450
#2  0x000000000093e468 in WaitEventSetWait (set=0x2facac8, timeout=-1,
occurred_events=0x7ffed5fdff00, nevents=1, wait_event_info=50331648)
at latch.c:1396
#3  0x000000000093d8cd in WaitLatch (latch=0x7fa191042654,
wakeEvents=33, timeout=0, wait_event_info=50331648) at latch.c:473
#4  0x00000000009660f0 in ProcSleep (locallock=0x2fd06d8,
lockMethodTable=0xcd90a0 <default_lockmethod>) at proc.c:1361
#5  0x0000000000954fd5 in WaitOnLock (locallock=0x2fd06d8,
owner=0x2fd9a48) at lock.c:1858
#6  0x0000000000953c14 in LockAcquireExtended (locktag=0x7ffed5fe0370,
lockmode=5, sessionLock=false, dontWait=false, reportMemoryError=true,
locallockp=0x7ffed5fe0368) at lock.c:1100
#7  0x00000000009511f1 in LockRelationOid (relid=16384, lockmode=5) at
lmgr.c:117
#8  0x000000000049e779 in relation_open (relationId=16384, lockmode=5)
at relation.c:56
#9  0x00000000005652ff in table_open (relationId=16384, lockmode=5) at
table.c:43
#10 0x00000000005c8b5a in reindex_relation (relid=16384, flags=1,
params=0x7ffed5fe05f0) at index.c:3990
#11 0x00000000006d2c85 in ExecuteTruncateGuts
(explicit_rels=0x3068aa8, relids=0x3068b00, relids_extra=0x3068b58,
relids_logged=0x3068bb0, behavior=DROP_RESTRICT, restart_seqs=false)
at tablecmds.c:2036
#12 0x00000000008ebc50 in apply_handle_truncate (s=0x7ffed5fe08d0) at
worker.c:2252
#13 0x00000000008ebe94 in apply_dispatch (s=0x7ffed5fe08d0) at worker.c:2308
#14 0x00000000008ec424 in LogicalRepApplyLoop (last_received=24192624)
at worker.c:2564
----------
Process 11393
(gdb) bt
#0  0x00007fa197917f90 in __nanosleep_nocancel () from /lib64/libc.so.6
#1  0x00007fa197917e44 in sleep () from /lib64/libc.so.6
#2  0x0000000000950f84 in DeadLockReport () at deadlock.c:1151
#3  0x0000000000955013 in WaitOnLock (locallock=0x2fd05d0,
owner=0x2fd9a48) at lock.c:1873
#4  0x0000000000953c14 in LockAcquireExtended (locktag=0x7ffed5fe01d0,
lockmode=5, sessionLock=false, dontWait=false, reportMemoryError=true,
locallockp=0x0) at lock.c:1100
#5  0x00000000009531bc in LockAcquire (locktag=0x7ffed5fe01d0,
lockmode=5, sessionLock=false, dontWait=false) at lock.c:751
#6  0x0000000000951d86 in XactLockTableWait (xid=597,
rel=0x7fa1986e9e08, ctid=0x7ffed5fe0284, oper=XLTW_Update) at
lmgr.c:674
#7  0x00000000004f84d8 in heap_update (relation=0x7fa1986e9e08,
otid=0x3067dc4, newtup=0x3067dc0, cid=0, crosscheck=0x0, wait=true,
tmfd=0x7ffed5fe03b0, lockmode=0x7ffed5fe03ac) at heapam.c:3549
#8  0x00000000004fa5dd in simple_heap_update (relation=0x7fa1986e9e08,
otid=0x3067dc4, tup=0x3067dc0) at heapam.c:4222
#9  0x00000000005c9932 in CatalogTupleUpdate (heapRel=0x7fa1986e9e08,
otid=0x3067dc4, tup=0x3067dc0) at indexing.c:312
#10 0x0000000000af5774 in RelationSetNewRelfilenode
(relation=0x7fa1986fdc80, persistence=112 'p') at relcache.c:3707
#11 0x00000000006d2b4d in ExecuteTruncateGuts
(explicit_rels=0x30688b8, relids=0x3068910, relids_extra=0x3068968,
relids_logged=0x30689c0, behavior=DROP_RESTRICT, restart_seqs=false)
at tablecmds.c:2012
#12 0x00000000008ebc50 in apply_handle_truncate (s=0x7ffed5fe08d0) at
worker.c:2252
#13 0x00000000008ebe94 in apply_dispatch (s=0x7ffed5fe08d0) at worker.c:2308
----------

The essence of the trouble seems to be that the apply_handle_truncate
function never anticipated that it might end up truncating the same table
from two separate workers (subscriptions), as this test case does.
Probably this is quite an old problem, because the apply_handle_truncate
code has not changed much for a long time. The apply_handle_truncate
function (worker.c) follows a very similar pattern to the ExecuteTruncate
function (tablecmds.c), but ExecuteTruncate uses the stronger
AccessExclusiveLock, whereas apply_handle_truncate was using a weaker
lock mode.

~~

PSA a patch to make apply_handle_truncate use AccessExclusiveLock, the
same as the ExecuteTruncate function does.

PSA a patch adding a test for this scenario.
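
For context, here is a rough sketch (not the attached patch) of the kind of
TAP scenario involved, with all names ($node_publisher, $node_subscriber,
$publisher_connstr, tab, pub1/pub2, sub1/sub2) being illustrative: two
subscriptions on the same subscriber feed the same table, so two apply
workers replay the publisher's TRUNCATE at about the same time.
----------
# Hypothetical sketch only; node setup and the connection string are assumed
# to follow the other subscription TAP tests.
$node_publisher->safe_psql('postgres', "CREATE TABLE tab (a int)");
$node_subscriber->safe_psql('postgres', "CREATE TABLE tab (a int)");

# Two publications for the same table ...
$node_publisher->safe_psql('postgres', "CREATE PUBLICATION pub1 FOR TABLE tab");
$node_publisher->safe_psql('postgres', "CREATE PUBLICATION pub2 FOR TABLE tab");

# ... and two subscriptions on the same subscriber, i.e. two apply workers.
$node_subscriber->safe_psql('postgres',
	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1");
$node_subscriber->safe_psql('postgres',
	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2");

# Both apply workers receive this TRUNCATE and try to truncate the same
# subscriber table concurrently, which is where the deadlock was seen.
$node_publisher->safe_psql('postgres', "TRUNCATE tab");
$node_publisher->wait_for_catchup('sub1');
$node_publisher->wait_for_catchup('sub2');
----------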

--------
Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments

Re: "ERROR: deadlock detected" when replicating TRUNCATE

From
Amit Kapila
Date:
On Mon, May 17, 2021 at 12:30 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> While doing logical replication testing we encountered a problem which
> causes a deadlock error to be logged when replicating a TRUNCATE
> simultaneously to 2 Subscriptions.
> e.g.
> ----------
> 2021-05-12 11:30:19.457 AEST [11393] ERROR:  deadlock detected
> 2021-05-12 11:30:19.457 AEST [11393] DETAIL:  Process 11393 waits for
> ShareLock on transaction 597; blocked by process 11582.
> Process 11582 waits for ShareLock on relation 16384 of database 14896;
> blocked by process 11393.
> ----------
>
> At this time, both the subscriber (apply worker) processes are
> executing inside the ExecuteTruncateGuts function simultaneously and
> they become co-dependent.
>
> e.g.
> ----------
> Process 11582
> (gdb) bt
> #0  0x00007fa1979515e3 in __epoll_wait_nocancel () from /lib64/libc.so.6
> #1  0x000000000093e5d0 in WaitEventSetWaitBlock (set=0x2facac8,
> cur_timeout=-1, occurred_events=0x7ffed5fdff00, nevents=1) at
> latch.c:1450
> #2  0x000000000093e468 in WaitEventSetWait (set=0x2facac8, timeout=-1,
> occurred_events=0x7ffed5fdff00, nevents=1, wait_event_info=50331648)
> at latch.c:1396
> #3  0x000000000093d8cd in WaitLatch (latch=0x7fa191042654,
> wakeEvents=33, timeout=0, wait_event_info=50331648) at latch.c:473
> #4  0x00000000009660f0 in ProcSleep (locallock=0x2fd06d8,
> lockMethodTable=0xcd90a0 <default_lockmethod>) at proc.c:1361
..
> ----------
> Process 11393
> (gdb) bt
> #0  0x00007fa197917f90 in __nanosleep_nocancel () from /lib64/libc.so.6
> #1  0x00007fa197917e44 in sleep () from /lib64/libc.so.6
> #2  0x0000000000950f84 in DeadLockReport () at deadlock.c:1151
> #3  0x0000000000955013 in WaitOnLock (locallock=0x2fd05d0,
> owner=0x2fd9a48) at lock.c:1873
>
..
> ----------
>
> The essence of the trouble seems to be that the apply_handle_truncate
> function never anticipated it may end up truncating the same table
> from 2 separate workers (subscriptions) like this test case is doing.
> Probably this is quite an old problem because the
> apply_handle_truncate code has not changed much for a long time.
>

Yeah, have you checked it in the back branches?

I am also able to reproduce this, and have analyzed the cause of the above
error. In the above, process 11393 waits while updating a pg_class tuple
via RelationSetNewRelfilenode(); that tuple has already been updated by
process 11582 (in transaction 597, which is not yet committed). Meanwhile,
process 11582 waits to acquire a ShareLock on relation 16384, which
process 11393 already holds in RowExclusive mode from
apply_handle_truncate. So both processes end up waiting on each other,
which causes the deadlock.
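
If it helps to see the cycle while the workers are stuck, a hypothetical way
to inspect it from a TAP test (illustrative only, not part of any attached
patch) is to dump the ungranted locks on the subscriber:
----------
# Illustrative only: show which locks the two apply workers are waiting for.
my $waiting = $node_subscriber->safe_psql('postgres',
	"SELECT pid, locktype, relation, transactionid, mode
	   FROM pg_locks WHERE NOT granted");
note("ungranted locks during the hang:\n$waiting");
----------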

>
> PSA a patch adding a test for this scenario.
>

+
+$node_publisher->safe_psql('postgres',
+ "ALTER SYSTEM SET synchronous_standby_names TO 'any 2(sub5_1, sub5_2)'");
+$node_publisher->safe_psql('postgres', "SELECT pg_reload_conf()");

Do you really need these steps to reproduce the problem? IIUC, this
has nothing to do with synchronous replication.

-- 
With Regards,
Amit Kapila.



Re: "ERROR: deadlock detected" when replicating TRUNCATE

From
Dilip Kumar
Date:
On Mon, May 17, 2021 at 12:30 PM Peter Smith <smithpb2250@gmail.com> wrote:

> The essence of the trouble seems to be that the apply_handle_truncate
> function never anticipated it may end up truncating the same table
> from 2 separate workers (subscriptions) like this test case is doing.
> Probably this is quite an old problem because the
> apply_handle_truncate code has not changed much for a long time. The
> code of apply_handle_truncate function (worker.c) has a very similar
> pattern to the ExecuteTruncate function (tablecmds.c) but the
> ExecuteTruncate is using a more powerful AccessExclusiveLock than the
> apply_handle_truncate was using.

Right, that's a problem.

>
> PSA a patch to make the apply_handle_truncate use AccessExclusiveLock
> same as the ExecuteTruncate function does.

The fix makes sense to me.

> PSA a patch adding a test for this scenario.

I am not sure this test case exactly targets the problematic behavior,
because that will depend upon the order of execution of the apply workers,
right?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



RE: "ERROR: deadlock detected" when replicating TRUNCATE

From
"tanghy.fnst@fujitsu.com"
Date:
On Monday, May 17, 2021 5:47 PM, Amit Kapila <amit.kapila16@gmail.com> wrote
> +$node_publisher->safe_psql('postgres',
> + "ALTER SYSTEM SET synchronous_standby_names TO 'any 2(sub5_1,
> sub5_2)'");
> +$node_publisher->safe_psql('postgres', "SELECT pg_reload_conf()");
> 
> Do you really need these steps to reproduce the problem? IIUC, this
> has nothing to do with synchronous replication.

Agreed.
I tested in asynchronous mode and could reproduce this problem too.

The attached patch removes the steps for setting up synchronous replication,
and the test passed after applying Peter's patch.
Please take it as a reference.

Regards
Tang


Attachments

Re: "ERROR: deadlock detected" when replicating TRUNCATE

From
Amit Kapila
Date:
On Mon, May 17, 2021 at 3:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, May 17, 2021 at 12:30 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> > PSA a patch adding a test for this scenario.
>
> I am not sure this test case is exactly targeting the problematic
> behavior because that will depend upon the order of execution of the
> apply workers right?
>

Yeah, so we can't guarantee that this test will always reproduce the
problem, but OTOH I have tried it twice and it reproduced both times.
I guess we don't have a similar test where a TRUNCATE is replicated to
two subscriptions; otherwise we would have caught such a problem earlier.
Having said that, I am fine with leaving this test out if others feel so,
on the grounds that it won't always reproduce the reported problem.

-- 
With Regards,
Amit Kapila.



Re: "ERROR: deadlock detected" when replicating TRUNCATE

From
Dilip Kumar
Date:
On Mon, May 17, 2021 at 3:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, May 17, 2021 at 3:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, May 17, 2021 at 12:30 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > > PSA a patch adding a test for this scenario.
> >
> > I am not sure this test case is exactly targeting the problematic
> > behavior because that will depend upon the order of execution of the
> > apply workers right?
> >
>
> Yeah, so we can't guarantee that this test will always reproduce the
> problem but OTOH, I have tried two times and it reproduced both times.
> I guess we don't have a similar test where Truncate will replicate to
> two subscriptions, otherwise, we would have caught such a problem.
> Having said that, I am fine with leaving this test if others feel so
> on the grounds that it won't always lead to the problem reported.

Although it is not guaranteed to reproduce the scenario every time, it
is testing a new scenario, so +1 for the test.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: "ERROR: deadlock detected" when replicating TRUNCATE

From
Peter Smith
Date:
On Mon, May 17, 2021 at 8:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, May 17, 2021 at 3:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, May 17, 2021 at 12:30 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > > PSA a patch adding a test for this scenario.
> >
> > I am not sure this test case is exactly targeting the problematic
> > > behavior because that will depend upon the order of execution of the
> > apply workers right?
> >
>
> Yeah, so we can't guarantee that this test will always reproduce the
> problem but OTOH, I have tried two times and it reproduced both times.
> I guess we don't have a similar test where Truncate will replicate to
> two subscriptions, otherwise, we would have caught such a problem.
> Having said that, I am fine with leaving this test if others feel so
> on the grounds that it won't always lead to the problem reported.
>

If there is any concern that the problem won't always happen, then I
think we should just increase the number of subscriptions.

Having more simultaneous subscriptions (e.g. I have tried 4) will
make it much more likely for the test to encounter the deadlock, and
it would probably also be quite a useful worker stress test in its
own right.
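
For example (illustrative names only, not from the attached test), the
subscription setup could simply be done in a loop:
----------
# Hypothetical sketch: N publications/subscriptions for the same table, so N
# apply workers race to apply the replicated TRUNCATE.
for my $i (1 .. 4)
{
	$node_publisher->safe_psql('postgres',
		"CREATE PUBLICATION pub$i FOR TABLE tab");
	$node_subscriber->safe_psql('postgres',
		"CREATE SUBSCRIPTION sub$i CONNECTION '$publisher_connstr' PUBLICATION pub$i");
}
----------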

~~

Also, should this test be in 010_truncate.pl, or does it belong in
100_bugs.pl?  (I don't know what the rules are for when a test gets put
into 100_bugs.pl.)

----------
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: "ERROR: deadlock detected" when replicating TRUNCATE

From
Peter Smith
Date:
On Mon, May 17, 2021 at 6:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, May 17, 2021 at 12:30 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
[...]
> > The essence of the trouble seems to be that the apply_handle_truncate
> > function never anticipated it may end up truncating the same table
> > from 2 separate workers (subscriptions) like this test case is doing.
> > Probably this is quite an old problem because the
> > apply_handle_truncate code has not changed much for a long time.
> >
>
> Yeah, have you checked it in the back branches?
>

Yes, the apply_handle_truncate function was introduced in April/2018 [1].

REL_11_0 was tagged in Oct/2018.

The "ERROR:  deadlock detected" log is reproducible in PG 11.0.

----------
[1] https://github.com/postgres/postgres/commit/039eb6e92f20499ac36cc74f8a5cef7430b706f6

Kind Regards,
Peter Smith.
Fujitsu Australia.



Re: "ERROR: deadlock detected" when replicating TRUNCATE

From
Amit Kapila
Date:
On Tue, May 18, 2021 at 6:19 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Mon, May 17, 2021 at 8:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, May 17, 2021 at 3:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Mon, May 17, 2021 at 12:30 PM Peter Smith <smithpb2250@gmail.com> wrote:
> > >
> > > > PSA a patch adding a test for this scenario.
> > >
> > > I am not sure this test case is exactly targeting the problematic
> > > behavior because that will depend upon the order of execution of the
> > > apply workers right?
> > >
> >
> > Yeah, so we can't guarantee that this test will always reproduce the
> > problem but OTOH, I have tried two times and it reproduced both times.
> > I guess we don't have a similar test where Truncate will replicate to
> > two subscriptions, otherwise, we would have caught such a problem.
> > Having said that, I am fine with leaving this test if others feel so
> > on the grounds that it won't always lead to the problem reported.
> >
>
> If there is any concern that the problem won't always happen then I
> think we should just increase the number of subscriptions.
>
> Having more simultaneous subscriptions (e.g. I have tried 4) will
> make it much more likely for the test to encounter the deadlock, and
> it would probably also be quite a useful worker stress test in its
> own right.
>

I don't think we need to go that far.

> ~~
>
> Also, should this test be in 010_truncate.pl,
>

+1 for keeping it in 010_truncate.pl, but remove the synchronous part
from the test case and comments, as that is not required.

-- 
With Regards,
Amit Kapila.



Re: "ERROR: deadlock detected" when replicating TRUNCATE

From
Amit Kapila
Date:
On Tue, May 18, 2021 at 6:52 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> >
> > Yeah, have you checked it in the back branches?
> >
>
> Yes, the apply_handle_truncate function was introduced in April/2018 [1].
>
> REL_11_0 was tagged in Oct/2018.
>
> The "ERROR:  deadlock detected" log is reproducible in PG 11.0.
>

Okay, I have prepared the patches for all branches (11...HEAD). Each
version needs minor changes in the test; the code doesn't need much
change. Some notable changes in the tests (a rough sketch follows below):
1. I have removed the conf change for max_logical_replication_workers
on the publisher node. We only need this for the subscriber node.
2. After creating the new subscriptions, wait for the initial
synchronization as we do for other tests.
3. synchronous_standby_names needs to be reset from the previous test.
This is only required for HEAD.
4. In PG 11, we need to specify the application_name in the connection
string; otherwise it takes the test case file name as the
application_name. This is the same as other tests do in PG 11.
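
Roughly, items 3 and 4 amount to something like the following hedged
sketch (illustrative names only, not the attached patches):
----------
# 3. Reset synchronous_standby_names left set by the earlier test (HEAD only).
$node_publisher->safe_psql('postgres',
	"ALTER SYSTEM RESET synchronous_standby_names");
$node_publisher->safe_psql('postgres', "SELECT pg_reload_conf()");

# 4. In PG 11, pass application_name explicitly in the connection string so
#    it does not default to the test file name.
my $appname = 'tap_sub1';
$node_subscriber->safe_psql('postgres',
	"CREATE SUBSCRIPTION tap_sub1 "
	  . "CONNECTION '$publisher_connstr application_name=$appname' "
	  . "PUBLICATION tap_pub1");
----------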

Can you please verify the attached patches once?

-- 
With Regards,
Amit Kapila.

Attachments

RE: "ERROR: deadlock detected" when replicating TRUNCATE

From
"tanghy.fnst@fujitsu.com"
Date:
On Thursday, May 20, 2021 3:05 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Okay, I have prepared the patches for all branches (11...HEAD). Each
> version needs minor changes in the test, the code doesn't need much
> change. Some notable changes in the tests:
> 1. I have removed the conf change for max_logical_replication_workers
> on the publisher node. We only need this for the subscriber node.
> 2. After creating the new subscriptions wait for initial
> synchronization as we do for other tests.
> 3. synchronous_standby_names need to be reset for the previous test.
> This is only required for HEAD.
> 4. In PG-11, we need to specify the application_name in the connection
> string, otherwise, it took the testcase file name as application_name.
> This is the same as other tests are doing in PG11.
> 
> Can you please once verify the attached patches?

I have tested your patches for all branches (11...HEAD). All of them passed. BTW, I also confirmed that the bug exists
in these branches without your fix.

The changes in the tests LGTM.
But I saw whitespace warnings when applying the patches for PG11 and PG12; please take a look at this.

Regards
Tang

Re: "ERROR: deadlock detected" when replicating TRUNCATE

From
Amit Kapila
Date:
On Thu, May 20, 2021 at 12:46 PM tanghy.fnst@fujitsu.com
<tanghy.fnst@fujitsu.com> wrote:
>
> On Thursday, May 20, 2021 3:05 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Okay, I have prepared the patches for all branches (11...HEAD). Each
> > version needs minor changes in the test, the code doesn't need much
> > change. Some notable changes in the tests:
> > 1. I have removed the conf change for max_logical_replication_workers
> > on the publisher node. We only need this for the subscriber node.
> > 2. After creating the new subscriptions wait for initial
> > synchronization as we do for other tests.
> > 3. synchronous_standby_names need to be reset for the previous test.
> > This is only required for HEAD.
> > 4. In PG-11, we need to specify the application_name in the connection
> > string, otherwise, it took the testcase file name as application_name.
> > This is the same as other tests are doing in PG11.
> >
> > Can you please once verify the attached patches?
>
> I have tested your patches for all branches (11...HEAD). All of them passed. BTW, I also confirmed that the bug
> exists in these branches without your fix.
>
> The changes in tests LGTM.
> But I saw whitespace warnings when applying the patches for PG11 and PG12, please take a look at this.
>

Thanks, I have pushed after fixing the whitespace warning.

-- 
With Regards,
Amit Kapila.