Re: [HACKERS] Deadlock in ALTER SUBSCRIPTION REFRESH PUBLICATION
От | Masahiko Sawada |
---|---|
Тема | Re: [HACKERS] Deadlock in ALTER SUBSCRIPTION REFRESH PUBLICATION |
Дата | |
Msg-id | CAD21AoC=ETDqenL4A0y6XrqjckWCq145EFDUi2UbBVL_RarBjA@mail.gmail.com обсуждение исходный текст |
Ответ на | [HACKERS] Deadlock in ALTER SUBSCRIPTION REFRESH PUBLICATION (Konstantin Knizhnik <k.knizhnik@postgrespro.ru>) |
Список | pgsql-hackers |
On Tue, Oct 24, 2017 at 7:13 PM, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote: > Parallel execution of ALTER SUBSCRIPTION REFRESH PUBLICATION at several > nodes may cause deadlock: > > knizhnik 1480 0.0 0.1 1417532 16496 ? Ss 20:01 0:00 postgres: > bgworker: logical replication worker for subscription 16589 sync 16720 > waiting > knizhnik 1481 0.0 0.1 1417668 17668 ? Ss 20:01 0:00 postgres: > knizhnik postgres 127.0.0.1(36378) idle > knizhnik 1484 0.0 0.1 1406952 16676 ? Ss 20:01 0:00 postgres: > knizhnik postgres 127.0.0.1(56194) ALTER SUBSCRIPTION waiting > knizhnik 1486 0.0 0.1 1410040 21268 ? Ss 20:01 0:00 postgres: > wal sender process knizhnik 127.0.0.1(36386) idle > knizhnik 1487 0.0 0.1 1417532 16736 ? Ss 20:01 0:00 postgres: > bgworker: logical replication worker for subscription 16589 sync 16774 > knizhnik 1489 0.0 0.1 1410040 21336 ? Ss 20:01 0:00 postgres: > wal sender process knizhnik 127.0.0.1(36388) idle > knizhnik 1510 0.0 0.1 1406952 16820 ? Ss 20:01 0:00 postgres: > knizhnik postgres 127.0.0.1(56228) ALTER SUBSCRIPTION waiting > knizhnik 1530 0.0 0.1 1406952 16736 ? Ss 20:01 0:00 postgres: > knizhnik postgres 127.0.0.1(56260) ALTER SUBSCRIPTION waiting > knizhnik 1534 0.0 0.0 1417504 14360 ? Ss 20:01 0:00 postgres: > bgworker: logical replication worker for subscription 16590 > knizhnik 1537 0.0 0.1 1409164 16920 ? Ss 20:01 0:00 postgres: > wal sender process knizhnik 127.0.0.1(45054) idle > knizhnik 1545 0.0 0.0 1417344 14576 ? Ss 20:01 0:00 postgres: > bgworker: logical replication worker for subscription 16592 > knizhnik 1546 0.0 0.1 1409168 16588 ? Ss 20:01 0:00 postgres: > wal sender process knizhnik 127.0.0.1(56274) idle > knizhnik 1547 0.0 0.0 1417348 14332 ? Ss 20:01 0:00 postgres: > bgworker: logical replication worker for subscription 16592 sync 16606 > knizhnik 1549 0.0 0.0 1409004 14512 ? Ss 20:01 0:00 postgres: > wal sender process knizhnik 127.0.0.1(56276) idle in transaction waiting > knizhnik 1553 0.0 0.0 1417348 14540 ? Ss 20:01 0:00 postgres: > bgworker: logical replication worker for subscription 16592 > knizhnik 1554 0.0 0.1 1409168 16596 ? Ss 20:01 0:00 postgres: > wal sender process knizhnik 127.0.0.1(56280) idle > knizhnik 1555 0.0 0.0 1417348 14332 ? Ss 20:01 0:00 postgres: > bgworker: logical replication worker for subscription 16592 sync 16660 > knizhnik 1556 0.0 0.0 1409004 14520 ? Ss 20:01 0:00 postgres: > wal sender process knizhnik 127.0.0.1(56282) idle in transaction waiting > knizhnik 1558 0.0 0.0 1417348 14460 ? Ss 20:01 0:00 postgres: > bgworker: logical replication worker for subscription 16592 sync 16696 > knizhnik 1560 0.0 0.0 1409004 14516 ? Ss 20:01 0:00 postgres: > wal sender process knizhnik 127.0.0.1(56286) idle in transaction waiting > knizhnik 1562 0.0 0.0 1417348 14084 ? Ss 20:01 0:00 postgres: > bgworker: logical replication worker for subscription 16592 sync 16732 > knizhnik 1567 0.0 0.0 1409004 14516 ? Ss 20:01 0:00 postgres: > wal sender process knizhnik 127.0.0.1(56288) idle in transaction waiting > > knizhnik@knizhnik:~$ gdb -p 1556 > (gdb) bt > #0 0x00007f7b991569b3 in __epoll_wait_nocancel () at > ../sysdeps/unix/syscall-template.S:84 > #1 0x000000000087185a in WaitEventSetWaitBlock (set=0x246d1e8, > cur_timeout=-1, occurred_events=0x7ffff3a43580, nevents=1) at latch.c:1048 > #2 0x0000000000871716 in WaitEventSetWait (set=0x246d1e8, timeout=-1, > occurred_events=0x7ffff3a43580, nevents=1, wait_event_info=50331652) at > latch.c:1000 > #3 0x0000000000870e6c in WaitLatchOrSocket (latch=0x7f7b975daba4, > wakeEvents=1, sock=-1, timeout=-1, wait_event_info=50331652) at latch.c:385 > #4 0x0000000000870d3e in WaitLatch (latch=0x7f7b975daba4, wakeEvents=1, > timeout=0, wait_event_info=50331652) at latch.c:339 > #5 0x000000000088995c in ProcSleep (locallock=0x24764d0, > lockMethodTable=0xc17420 <default_lockmethod>) at proc.c:1255 > #6 0x00000000008828cb in WaitOnLock (locallock=0x24764d0, owner=0x2517c78) > at lock.c:1702 > #7 0x000000000088157e in LockAcquireExtended (locktag=0x7ffff3a439a0, > lockmode=5, sessionLock=0 '\000', dontWait=0 '\000', reportMemoryError=1 > '\001') at lock.c:998 > #8 0x0000000000880ba8 in LockAcquire (locktag=0x7ffff3a439a0, lockmode=5, > sessionLock=0 '\000', dontWait=0 '\000') at lock.c:688 > #9 0x000000000087f8dc in XactLockTableWait (xid=624, rel=0x0, ctid=0x0, > oper=XLTW_None) at lmgr.c:587 > #10 0x000000000082f524 in SnapBuildWaitSnapshot (running=0x246d1b0, > cutoff=632) at snapbuild.c:1400 > #11 0x000000000082f2f6 in SnapBuildFindSnapshot (builder=0x254b550, > lsn=24410008, running=0x246d1b0) at snapbuild.c:1311 > #12 0x000000000082ee67 in SnapBuildProcessRunningXacts (builder=0x254b550, > lsn=24410008, running=0x246d1b0) at snapbuild.c:1106 > #13 0x000000000081c458 in DecodeStandbyOp (ctx=0x25334b0, > buf=0x7ffff3a43b10) at decode.c:314 > #14 0x000000000081bf39 in LogicalDecodingProcessRecord (ctx=0x25334b0, > record=0x2533770) at decode.c:117 > #15 0x0000000000820219 in DecodingContextFindStartpoint (ctx=0x25334b0) at > logical.c:458 > #16 0x000000000083674a in CreateReplicationSlot (cmd=0x246c418) at > walsender.c:934 > #17 0x00000000008374ef in exec_replication_command (cmd_string=0x24dd140 > "CREATE_REPLICATION_SLOT \"node_1_16592_sync_16660\" TEMPORARY LOGICAL > pgoutput USE_SNAPSHOT") at walsender.c:1511 > #18 0x00000000008a47ff in PostgresMain (argc=1, argv=0x247c468, > dbname=0x247c2a8 "postgres", username=0x244f870 "knizhnik") at > postgres.c:4086 > #19 0x0000000000801753 in BackendRun (port=0x2477db0) at postmaster.c:4357 > #20 0x0000000000800e5f in BackendStartup (port=0x2477db0) at > postmaster.c:4029 > #21 0x00000000007fd398 in ServerLoop () at postmaster.c:1753 > #22 0x00000000007fc92f in PostmasterMain (argc=3, argv=0x244d6e0) at > postmaster.c:1361 > #23 0x0000000000734f08 in main (argc=3, argv=0x244d6e0) at main.c:228 > > knizhnik@knizhnik:~$ gdb -p 1558 > #0 0x00007f2c6edd19b3 in __epoll_wait_nocancel () at > ../sysdeps/unix/syscall-template.S:84 > #1 0x000000000087185a in WaitEventSetWaitBlock (set=0x21d8ec8, > cur_timeout=-1, occurred_events=0x7ffe17ed0c20, nevents=1) at latch.c:1048 > #2 0x0000000000871716 in WaitEventSetWait (set=0x21d8ec8, timeout=-1, > occurred_events=0x7ffe17ed0c20, nevents=1, wait_event_info=100663299) at > latch.c:1000 > #3 0x0000000000870e6c in WaitLatchOrSocket (latch=0x7f2c6d2825a4, > wakeEvents=19, sock=4, timeout=-1, wait_event_info=100663299) at latch.c:385 > #4 0x00007f2c1a5bc304 in libpqrcv_PQexec (streamConn=0x21e0ca0, > query=0x21d9a28 "CREATE_REPLICATION_SLOT \"node_2_16592_sync_16696\" > TEMPORARY LOGICAL pgoutput USE_SNAPSHOT") > at libpqwalreceiver.c:585 > #5 0x00007f2c1a5bc921 in libpqrcv_create_slot (conn=0x217f770, > slotname=0x2182e38 "node_2_16592_sync_16696", temporary=1 '\001', > snapshot_action=CRS_USE_SNAPSHOT, lsn=0x7ffe17ed0d80) > at libpqwalreceiver.c:796 > #6 0x0000000000832088 in LogicalRepSyncTableStart > (origin_startpos=0x7ffe17ed0d80) at tablesync.c:917 > #7 0x0000000000834d18 in ApplyWorkerMain (main_arg=2) at worker.c:1582 > #8 0x00000000007eeab1 in StartBackgroundWorker () at bgworker.c:835 > #9 0x000000000080267a in do_start_bgworker (rw=0x217b610) at > postmaster.c:5680 > #10 0x00000000008029c2 in maybe_start_bgworkers () at postmaster.c:5884 > #11 0x00000000008019cc in sigusr1_handler (postgres_signal_arg=10) at > postmaster.c:5073 > #12 <signal handler called> > #13 0x00007f2c6edc7573 in __select_nocancel () at > ../sysdeps/unix/syscall-template.S:84 > #14 0x00000000007fd23c in ServerLoop () at postmaster.c:1717 > #15 0x00000000007fc92f in PostmasterMain (argc=3, argv=0x21516e0) at > postmaster.c:1361 > #16 0x0000000000734f08 in main (argc=3, argv=0x21516e0) at main.c:228 > > knizhnik@knizhnik:~$ gdb -p 1510 > #0 0x00007f7b991569b3 in __epoll_wait_nocancel () at > ../sysdeps/unix/syscall-template.S:84 > #1 0x000000000087185a in WaitEventSetWaitBlock (set=0x2479f40, > cur_timeout=-1, occurred_events=0x7ffff3a42fb0, nevents=1) at latch.c:1048 > #2 0x0000000000871716 in WaitEventSetWait (set=0x2479f40, timeout=-1, > occurred_events=0x7ffff3a42fb0, nevents=1, wait_event_info=50331655) at > latch.c:1000 > #3 0x0000000000870e6c in WaitLatchOrSocket (latch=0x7f7b975dd844, > wakeEvents=1, sock=-1, timeout=-1, wait_event_info=50331655) at latch.c:385 > #4 0x0000000000870d3e in WaitLatch (latch=0x7f7b975dd844, wakeEvents=1, > timeout=0, wait_event_info=50331655) at latch.c:339 > #5 0x000000000088995c in ProcSleep (locallock=0x2476470, > lockMethodTable=0xc17420 <default_lockmethod>) at proc.c:1255 > #6 0x00000000008828cb in WaitOnLock (locallock=0x2476470, owner=0x2517a48) > at lock.c:1702 > #7 0x000000000088157e in LockAcquireExtended (locktag=0x7ffff3a433c0, > lockmode=8, sessionLock=0 '\000', dontWait=0 '\000', reportMemoryError=1 > '\001') at lock.c:998 > #8 0x0000000000880ba8 in LockAcquire (locktag=0x7ffff3a433c0, lockmode=8, > sessionLock=0 '\000', dontWait=0 '\000') at lock.c:688 > #9 0x0000000000880087 in LockSharedObject (classid=6100, objid=16589, > objsubid=0, lockmode=8) at lmgr.c:882 > #10 0x000000000067c361 in AlterSubscription (stmt=0x24dd558) at > subscriptioncmds.c:645 > #11 0x00000000008aa7ab in ProcessUtilitySlow (pstate=0x2479dd0, > pstmt=0x24dd998, queryString=0x24dc920 "SET SESSION synchronous_commit TO > local; ALTER SUBSCRIPTION sub_3_1 REFRESH PUBLICATION", > context=PROCESS_UTILITY_QUERY, params=0x0, queryEnv=0x0, dest=0x24ddcb8, > completionTag=0x7ffff3a43c90 "") at utility.c:1629 > #12 0x00000000008a90d0 in standard_ProcessUtility (pstmt=0x24dd998, > queryString=0x24dc920 "SET SESSION synchronous_commit TO local; ALTER > SUBSCRIPTION sub_3_1 REFRESH PUBLICATION", > context=PROCESS_UTILITY_QUERY, params=0x0, queryEnv=0x0, dest=0x24ddcb8, > completionTag=0x7ffff3a43c90 "") at utility.c:928 > #13 0x00007f7b98528884 in pathman_process_utility_hook (first_arg=0x24dd998, > queryString=0x24dc920 "SET SESSION synchronous_commit TO local; ALTER > SUBSCRIPTION sub_3_1 REFRESH PUBLICATION", > context=PROCESS_UTILITY_QUERY, params=0x0, queryEnv=0x0, dest=0x24ddcb8, > completionTag=0x7ffff3a43c90 "") at src/hooks.c:913 > #14 0x00000000008a80a2 in ProcessUtility (pstmt=0x24dd998, > queryString=0x24dc920 "SET SESSION synchronous_commit TO local; ALTER > SUBSCRIPTION sub_3_1 REFRESH PUBLICATION", > context=PROCESS_UTILITY_QUERY, params=0x0, queryEnv=0x0, dest=0x24ddcb8, > completionTag=0x7ffff3a43c90 "") at utility.c:353 > #15 0x00000000008a7075 in PortalRunUtility (portal=0x246be60, > pstmt=0x24dd998, isTopLevel=0 '\000', setHoldSnapshot=0 '\000', > dest=0x24ddcb8, completionTag=0x7ffff3a43c90 "") at pquery.c:1178 > #16 0x00000000008a728d in PortalRunMulti (portal=0x246be60, isTopLevel=0 > '\000', setHoldSnapshot=0 '\000', dest=0x24ddcb8, altdest=0x24ddcb8, > completionTag=0x7ffff3a43c90 "") at pquery.c:1324 > #17 0x00000000008a6757 in PortalRun (portal=0x246be60, > count=9223372036854775807, isTopLevel=0 '\000', run_once=1 '\001', > dest=0x24ddcb8, altdest=0x24ddcb8, completionTag=0x7ffff3a43c90 "") > at pquery.c:799 > #18 0x00000000008a0288 in exec_simple_query (query_string=0x24dc920 "SET > SESSION synchronous_commit TO local; ALTER SUBSCRIPTION sub_3_1 REFRESH > PUBLICATION") at postgres.c:1099 > #19 0x00000000008a4823 in PostgresMain (argc=1, argv=0x247c2d0, > dbname=0x247c2a8 "postgres", username=0x244f870 "knizhnik") at > postgres.c:4090 > #20 0x0000000000801753 in BackendRun (port=0x2477770) at postmaster.c:4357 > #21 0x0000000000800e5f in BackendStartup (port=0x2477770) at > postmaster.c:4029 > #22 0x00000000007fd398 in ServerLoop () at postmaster.c:1753 > #23 0x00000000007fc92f in PostmasterMain (argc=3, argv=0x244d6e0) at > postmaster.c:1361 > #24 0x0000000000734f08 in main (argc=3, argv=0x244d6e0) at main.c:228 > > > > The reason of this deadlock seems to be clear: ALTER SUBSCRIPTION starts > transaction at one node and tries to create slot at other node, which > waiting for completion of all active transaction while building scnapshpot. > Is there any way to avoid this deadlock? I think that there is no way to avoid this deadlock except for refreshing subscriptions sequentially. One solution I came up with is to change ALTER TABLE REFRESH so that we can specify replication slot names used for table sync, or can accord the ERRCODE_DUPLICATE_OBJECT error when creating a replication slot. These are obviously poor solutions, though. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
В списке pgsql-hackers по дате отправления:
Предыдущее
От: Vitaly BurovoyДата:
Сообщение: [HACKERS] Fix dumping pre-10 DBs by pg_dump10 if table "name" exists