Discussion: IPC/MultixactCreation on the Standby server
Hi, hackers!

The problem is as follows. A replication cluster includes a primary server and one hot-standby replica. The workload on the primary server consists of multiple requests generating multixact IDs, while the hot-standby replica serves read-only queries. After some time, all requests on the hot standby get stuck and never finish. The `pg_stat_activity` view on the replica reports that processes are stuck waiting for IPC/MultixactCreation; pg_cancel_backend and pg_terminate_backend cannot cancel the request, and SIGQUIT is the only way to stop it.

We tried:
- changing the `autovacuum_multixact_freeze_max_age` parameters,
- increasing `multixact_member_buffers` and `multixact_offset_buffers`,
- disabling `hot_standby_feedback`,
- switching the replica between synchronous and asynchronous mode,
- and much more.

But nothing helped. We ran the replica in recovery mode from a WAL archive, i.e. as a warm standby, with the same result. We also tried building from sources based on the REL_17_5 branch with the default configure settings

    ./configure
    make
    make install

but had no luck either. Here is an example with a synthetic workload reproducing the problem.

Test system
===========
- Architecture: x86_64
- OS: Ubuntu 24.04.2 LTS (Noble Numbat)
- Tested postgres version(s):
  - latest 17 (17.5)
  - latest 18 (18-beta1)

The problem is not reproducible on PostgreSQL 16.9.

Steps to reproduce
==================

postgres=# create table tbl ( id int primary key, val int );
postgres=# insert into tbl select i, 0 from generate_series(1,5) i;

The first and second scripts execute queries on the master server
-----------------------------------------------------------------

pgbench --no-vacuum --report-per-command -M prepared -c 200 -j 200 -T 300 -P 1 --file=/dev/stdin <<'EOF'
\set id random(1, 5)
begin;
select * from tbl where id = :id for key share;
commit;
EOF

pgbench --no-vacuum --report-per-command -M prepared -c 100 -j 100 -T 300 -P 1 --file=/dev/stdin <<'EOF'
\set id random(1, 5)
begin;
update tbl set val = val+1 where id = :id;
\sleep 10 ms
commit;
EOF

The following script is executed on the replica
-----------------------------------------------

pgbench --no-vacuum --report-per-command -M prepared -c 100 -j 100 -T 300 -P 1 --file=/dev/stdin <<'EOF'
begin;
select sum(val) from tbl;
\sleep 10 ms
select sum(val) from tbl;
\sleep 10 ms
commit;
EOF

pgbench (17.5 (Ubuntu 17.5-1.pgdg24.04+1))
progress: 1.0 s, 2606.8 tps, lat 33.588 ms stddev 13.316, 0 failed
progress: 2.0 s, 3315.0 tps, lat 30.174 ms stddev 5.933, 0 failed
progress: 3.0 s, 3357.0 tps, lat 29.699 ms stddev 5.541, 0 failed
progress: 4.0 s, 3350.0 tps, lat 29.911 ms stddev 5.311, 0 failed
progress: 5.0 s, 3206.0 tps, lat 30.999 ms stddev 6.343, 0 failed
progress: 6.0 s, 3264.0 tps, lat 30.828 ms stddev 6.389, 0 failed
progress: 7.0 s, 3224.0 tps, lat 31.099 ms stddev 6.197, 0 failed
progress: 8.0 s, 3168.0 tps, lat 31.486 ms stddev 6.940, 0 failed
progress: 9.0 s, 3118.0 tps, lat 32.004 ms stddev 6.546, 0 failed
progress: 10.0 s, 3017.0 tps, lat 33.183 ms stddev 7.971, 0 failed
progress: 11.0 s, 3157.0 tps, lat 31.697 ms stddev 6.624, 0 failed
progress: 12.0 s, 3180.0 tps, lat 31.415 ms stddev 6.310, 0 failed
progress: 13.0 s, 3150.9 tps, lat 31.591 ms stddev 6.280, 0 failed
progress: 14.0 s, 3329.0 tps, lat 30.189 ms stddev 5.792, 0 failed
progress: 15.0 s, 3233.6 tps, lat 30.852 ms stddev 5.723, 0 failed
progress: 16.0 s, 3185.4 tps, lat 31.378 ms stddev 6.383, 0 failed
progress: 17.0 s, 3035.0 tps, lat 32.920 ms stddev 7.390, 0 failed
progress: 18.0 s, 3173.0 tps, lat 31.547 ms stddev 6.390, 0 failed
progress: 19.0 s, 3077.0 tps, lat 32.427 ms stddev 6.634, 0 failed
progress: 20.0 s, 3266.1 tps, lat 30.740 ms stddev 5.842, 0 failed
progress: 21.0 s, 2990.9 tps, lat 33.353 ms stddev 7.019, 0 failed
progress: 22.0 s, 3048.1 tps, lat 32.933 ms stddev 6.951, 0 failed
progress: 23.0 s, 3148.0 tps, lat 31.769 ms stddev 6.077, 0 failed
progress: 24.0 s, 1523.2 tps, lat 30.029 ms stddev 5.093, 0 failed
progress: 25.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
progress: 26.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
progress: 27.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
progress: 28.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
progress: 29.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
progress: 30.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
progress: 31.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
progress: 32.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
progress: 33.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
progress: 34.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
progress: 35.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed

After some time, all requests on the replica hang waiting for IPC/MultixactCreation.

Output from `pg_stat_activity`
------------------------------

 backend_type               | state  | wait_event_type | wait_event        | query
----------------------------+--------+-----------------+-------------------+------------------------------------------
 client backend             | active | IPC             | MultixactCreation | select sum(val) from tbl;
 client backend             | active | IPC             | MultixactCreation | select sum(val) from tbl;
 client backend             | active | IPC             | MultixactCreation | select sum(val) from tbl;
 client backend             | active | IPC             | MultixactCreation | select sum(val) from tbl;
 ...
 client backend             | active | IPC             | MultixactCreation | select sum(val) from tbl;
 client backend             | active | IPC             | MultixactCreation | select sum(val) from tbl;
 client backend             | active | IPC             | MultixactCreation | select sum(val) from tbl;
 startup                    |        | LWLock          | BufferContent     |
 checkpointer               |        | Activity        | CheckpointerMain  |
 background writer          |        | Activity        | BgwriterHibernate |
 walreceiver                |        | Activity        | WalReceiverMain   |

gdb session for `client backend` process
----------------------------------------
(gdb) bt
#0  0x00007f0e9872a007 in epoll_wait (epfd=5, events=0x57c4747fc458, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x000057c440685033 in WaitEventSetWaitBlock (nevents=<optimized out>, occurred_events=0x7ffdaedc8360, cur_timeout=-1, set=0x57c4747fc3f0) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/ipc/latch.c:1577
#2  WaitEventSetWait (set=0x57c4747fc3f0, timeout=timeout@entry=-1, occurred_events=occurred_events@entry=0x7ffdaedc8360, nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=134217765) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/ipc/latch.c:1525
#3  0x000057c44068541c in WaitLatch (latch=<optimized out>, wakeEvents=<optimized out>, timeout=<optimized out>, wait_event_info=134217765) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/ipc/latch.c:538
#4  0x000057c44068d8c0 in ConditionVariableTimedSleep (cv=0x7f0cefc50ab0, timeout=-1, wait_event_info=134217765) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/lmgr/condition_variable.c:163
#5  0x000057c440365a0c in ConditionVariableSleep (wait_event_info=134217765, cv=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/lmgr/condition_variable.c:98
#6  GetMultiXactIdMembers (multi=45559845, members=0x7ffdaedc84b0, from_pgupgrade=<optimized out>, isLockOnly=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/multixact.c:1483
#7  0x000057c4408adc6b in MultiXactIdGetUpdateXid.isra.0 (xmax=xmax@entry=45559845, t_infomask=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:7478
#8  0x000057c44031ecfa in HeapTupleGetUpdateXid (tuple=<error reading variable: Cannot access memory at address 0x0>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:7519
#9  HeapTupleSatisfiesMVCC (htup=<optimized out>, buffer=404, snapshot=0x57c474892ff0) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam_visibility.c:1090
#10 HeapTupleSatisfiesVisibility (htup=<optimized out>, snapshot=0x57c474892ff0, buffer=404) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam_visibility.c:1772
#11 0x000057c44030c1cb in page_collect_tuples (check_serializable=<optimized out>, all_visible=<optimized out>, lines=<optimized out>, block=<optimized out>, buffer=<optimized out>, page=<optimized out>, snapshot=<optimized out>, scan=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:480
#12 heap_prepare_pagescan (sscan=0x57c47495b970) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:579
#13 0x000057c44030cb59 in heapgettup_pagemode (scan=scan@entry=0x57c47495b970, dir=<optimized out>, nkeys=<optimized out>, key=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:999
#14 0x000057c44030d1bd in heap_getnextslot (sscan=0x57c47495b970, direction=<optimized out>, slot=0x57c47494b278) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:1319
#15 0x000057c4404f090a in table_scan_getnextslot (slot=0x57c47494b278, direction=ForwardScanDirection, sscan=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/include/access/tableam.h:1072
#16 SeqNext (node=0x57c47494b0e8) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/executor/nodeSeqscan.c:80
#17 0x000057c4404d5cfc in ExecProcNode (node=0x57c47494b0e8) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/include/executor/executor.h:274
#18 fetch_input_tuple (aggstate=aggstate@entry=0x57c47494aaf0) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/executor/nodeAgg.c:561
#19 0x000057c4404d848a in agg_retrieve_direct (aggstate=0x57c47494aaf0) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/executor/nodeAgg.c:2459
#20 ExecAgg (pstate=0x57c47494aaf0) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/executor/nodeAgg.c:2179
#21 0x000057c4404c2003 in ExecProcNode (node=0x57c47494aaf0) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/include/executor/executor.h:274
#22 ExecutePlan (dest=0x57c47483d548, direction=<optimized out>, numberTuples=0, sendTuples=true, operation=CMD_SELECT, queryDesc=0x57c474895010) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/executor/execMain.c:1649
#23 standard_ExecutorRun (queryDesc=0x57c474895010, direction=<optimized out>, count=0, execute_once=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/executor/execMain.c:361
...

gdb session for `startup` process
---------------------------------
(gdb) bt
#0  0x00007f0e98698ce3 in __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=265, expected=0, futex_word=0x7f0ceb34e6b8) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=<optimized out>, abstime=0x0, clockid=0, expected=0, futex_word=0x7f0ceb34e6b8) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7f0ceb34e6b8, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>) at ./nptl/futex-internal.c:139
#3  0x00007f0e986a4f1f in do_futex_wait (sem=sem@entry=0x7f0ceb34e6b8, abstime=0x0, clockid=0) at ./nptl/sem_waitcommon.c:111
#4  0x00007f0e986a4fb8 in __new_sem_wait_slow64 (sem=sem@entry=0x7f0ceb34e6b8, abstime=0x0, clockid=0) at ./nptl/sem_waitcommon.c:183
#5  0x00007f0e986a503d in __new_sem_wait (sem=sem@entry=0x7f0ceb34e6b8) at ./nptl/sem_wait.c:42
#6  0x000057c440696166 in PGSemaphoreLock (sema=0x7f0ceb34e6b8) at port/pg_sema.c:327
#7  LWLockAcquire (lock=0x7f0cefc58064, mode=LW_EXCLUSIVE) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/lmgr/lwlock.c:1289
#8  0x000057c44038f96a in LockBuffer (mode=2, buffer=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/buffer/bufmgr.c:5147
#9  XLogReadBufferForRedoExtended (record=<optimized out>, block_id=<optimized out>, mode=RBM_NORMAL, get_cleanup_lock=false, buf=0x7ffdaedc8b4c) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/xlogutils.c:429
#10 0x000057c440319969 in XLogReadBufferForRedo (buf=0x7ffdaedc8b4c, block_id=0 '\000', record=0x57c4748994d8) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/xlogutils.c:317
#11 heap_xlog_lock_updated (record=0x57c4748994d8) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:10230
#12 heap2_redo (record=0x57c4748994d8) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:10362
#13 0x000057c44038e1d2 in ApplyWalRecord (replayTLI=<synthetic pointer>, record=0x7f0e983908e0, xlogreader=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/include/access/xlog_internal.h:380
#14 PerformWalRecovery () at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/xlogrecovery.c:1822
#15 0x000057c44037bbf6 in StartupXLOG () at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/xlog.c:5821
#16 0x000057c4406155ed in StartupProcessMain (startup_data=<optimized out>, startup_data_len=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/startup.c:258
#17 0x000057c44060b376 in postmaster_child_launch (child_type=B_STARTUP, startup_data=0x0, startup_data_len=0, client_sock=0x0) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/launch_backend.c:277
#18 0x000057c440614509 in postmaster_child_launch (client_sock=0x0, startup_data_len=0, startup_data=0x0, child_type=B_STARTUP) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/postmaster.c:3934
#19 StartChildProcess (type=type@entry=B_STARTUP) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/postmaster.c:3930
#20 0x000057c44061480d in PostmasterStateMachine () at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/postmaster.c:3392
#21 0x000057c4408a3455 in process_pm_child_exit () at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/postmaster.c:2683
#22 ServerLoop.isra.0 () at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/postmaster.c:1667
#23 0x000057c440616965 in PostmasterMain (argc=<optimized out>, argv=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/postmaster.c:1374
#24 0x000057c4402bcd2d in main (argc=17, argv=0x57c4747fb140) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/main/main.c:199

Could you please help me to fix the problem of the stuck 'client backend' processes? I kindly ask you for any ideas and recommendations!

Best regards, Dmitry
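[A `pg_stat_activity` snapshot like the one above can be collected with a query along these lines. This is a minimal sketch; the exact query used for the report is not shown in the thread, and the filtering is an assumption:]

    -- show sessions stuck on multixact creation, plus recovery-related processes
    select pid, backend_type, state, wait_event_type, wait_event, query
    from pg_stat_activity
    where wait_event = 'MultixactCreation'
       or backend_type in ('startup', 'checkpointer', 'background writer', 'walreceiver');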
On 25.06.2025 12:34, Andrey Borodin wrote:
>
>> On 25 Jun 2025, at 11:11, Dmitry <dsy.075@yandex.ru> wrote:
>>
>> #6 GetMultiXactIdMembers (multi=45559845, members=0x7ffdaedc84b0, from_pgupgrade=<optimized out>, isLockOnly=<optimized out>)
>>    at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/multixact.c:1483
>
> Hi Dmitry!
>
> This looks to be related to work in my thread about multixacts [0]. Seems like the CV sleep in /* Corner case 2: next multixact is still being filled in */ is not woken up by ConditionVariableBroadcast(&MultiXactState->nextoff_cv) from WAL redo.
>
> If so, any subsequent multixact redo from WAL should unstuck reading the last MultiXact.
>
> Either way, the redo path might not be going through ConditionVariableBroadcast(). I will investigate this further.
>
> Can you please check your reproduction with the patch attached to this message? This patch simply adds a timeout on the CV sleep, so in the worst case we fall back to the behavior of PG 16.
>
> Best regards, Andrey Borodin.

Hi Andrey!

Thanks so much for your response.

A small comment on /* Corner case 2: ... */: at this point in the code, I tried to set trace points by outputting messages through `elog()`, and I can say that the process does not always get stuck in this part of the code; it happens from time to time and in an unpredictable way. Maybe this will help you a little.

To be honest, PostgreSQL performance is much better with this feature; it would be a shame if we had to roll back to the behavior of version 16.

I will definitely try to reproduce the problem with your patch.

Best regards, Dmitry.
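[For readers following along: the code in question, around multixact.c:1483 in REL_17 (frame #6 of the client backtrace above), behaves roughly like the paraphrased sketch below. It is not a verbatim excerpt; read_next_offset() is an invented stand-in for the SLRU lookup:]

    /*
     * Paraphrased sketch of "corner case 2" in GetMultiXactIdMembers().
     * To size multixact N, the reader needs the starting offset of
     * multixact N+1.  If N+1 is already assigned but its offset is not
     * yet recorded, the reader sleeps on MultiXactState->nextoff_cv
     * until whoever creates N+1 (or, on a standby, WAL redo) fills the
     * offset in and broadcasts on the condition variable.
     */
    for (;;)
    {
        nextOffset = read_next_offset(multi + 1);   /* hypothetical helper */
        if (nextOffset != 0)
            break;              /* offset recorded; member count is known */

        ConditionVariableSleep(&MultiXactState->nextoff_cv,
                               WAIT_EVENT_MULTIXACT_CREATION);
    }
    ConditionVariableCancelSleep();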
On 25.06.2025 16:44, Dmitry wrote:
> I will definitely try to reproduce the problem with your patch.

Hi Andrey!

I checked with the patch; unfortunately, the problem is still reproducible.
Client processes wake up after a second and try to get information about the members of the multixact again, in an endless loop.
At the same time, WAL is not replayed: the 'startup' process still hangs on LWLock/BufferContent.

Best regards, Dmitry.
> On 26 Jun 2025, at 14:33, Dmitry <dsy.075@yandex.ru> wrote:
>
> On 25.06.2025 16:44, Dmitry wrote:
>> I will definitely try to reproduce the problem with your patch.
>
> Hi Andrey!
>
> I checked with the patch; unfortunately, the problem is still reproducible.
> Client processes wake up after a second and try to get information about the members of the multixact again, in an endless loop.
> At the same time, WAL is not replayed: the 'startup' process still hangs on LWLock/BufferContent.

My hypothesis is that MultiXactState->nextMXact is not filled often enough from redo paths. So if you are unlucky enough, corner case 2 reading can deadlock with startup.

I need to verify it further, but if so, it's an ancient bug that just happens to be a few orders of magnitude more reproducible on 17 due to performance improvements. Still a hypothetical, though.

Best regards, Andrey Borodin.
On 26.06.2025 19:24, Andrey Borodin wrote:
> If my hypothesis is correct nextMXact will precede tmpMXact.

It seems that the hypothesis has not been confirmed.

Attempt #1

2025-06-26 23:47:24.821 MSK [220458] WARNING: Timed out: nextMXact 24138381 tmpMXact 24138379
2025-06-26 23:47:24.822 MSK [220540] WARNING: Timed out: nextMXact 24138382 tmpMXact 24138379
2025-06-26 23:47:24.823 MSK [220548] WARNING: Timed out: nextMXact 24138382 tmpMXact 24138379
...

pgbench (17.5)
progress: 2.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
progress: 3.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
progress: 4.0 s, 482.2 tps, lat 820.293 ms stddev 1370.729, 0 failed
progress: 5.0 s, 886.0 tps, lat 112.463 ms stddev 8.506, 0 failed
progress: 6.0 s, 348.9 tps, lat 111.324 ms stddev 5.871, 0 failed
WARNING: Timed out: nextMXact 24138381 tmpMXact 24138379
WARNING: Timed out: nextMXact 24138382 tmpMXact 24138379
WARNING: Timed out: nextMXact 24138382 tmpMXact 24138379
...
progress: 7.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
WARNING: Timed out: nextMXact 24138382 tmpMXact 24138379

Attempt #2

2025-06-27 09:18:01.312 MSK [236187] WARNING: Timed out: nextMXact 24497746 tmpMXact 24497744
2025-06-27 09:18:01.312 MSK [236225] WARNING: Timed out: nextMXact 24497746 tmpMXact 24497744
2025-06-27 09:18:01.312 MSK [236178] WARNING: Timed out: nextMXact 24497746 tmpMXact 24497744
...

pgbench (17.5)
progress: 1.0 s, 830.9 tps, lat 108.556 ms stddev 10.078, 0 failed
progress: 2.0 s, 839.0 tps, lat 118.358 ms stddev 19.708, 0 failed
progress: 3.0 s, 623.4 tps, lat 134.186 ms stddev 15.565, 0 failed
WARNING: Timed out: nextMXact 24497746 tmpMXact 24497744
WARNING: Timed out: nextMXact 24497746 tmpMXact 24497744
WARNING: Timed out: nextMXact 24497746 tmpMXact 24497744
WARNING: Timed out: nextMXact 24497746 tmpMXact 24497744
WARNING: Timed out: nextMXact 24497746 tmpMXact 24497744
WARNING: Timed out: nextMXact 24497747 tmpMXact 24497744
WARNING: Timed out: nextMXact 24497747 tmpMXact 24497744
...
progress: 4.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
WARNING: Timed out: nextMXact 24497746 tmpMXact 24497744

Best regards, Dmitry.
> On 30 Jun 2025, at 15:58, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> page_collect_tuples() holds a lock on the buffer while examining tuple visibility, having InterruptHoldoffCount > 0. The tuple visibility check might need WAL to go on: we have to wait until some next MX is filled in.
> Which might need a buffer lock or have a snapshot conflict with the caller of page_collect_tuples().

Thinking more about the problem I see 3 ways to deal with this deadlock:
1. We check for recovery conflict even in presence of InterruptHoldoffCount. That's what patch v4 does.
2. Teach page_collect_tuples() to do HeapTupleSatisfiesVisibility() without holding the buffer lock.
3. Why do we even HOLD_INTERRUPTS() when acquiring a shared lock??

Personally, I see point 2 as very invasive in code that I'm not too familiar with. Option 1 is clumsy. But option 3 is a giant system-wide change. Yet, I see 3 as the correct solution. Can't we just abstain from HOLD_INTERRUPTS() if the taken LWLock is not exclusive?

Best regards, Andrey Borodin.
On 2025-Jul-17, Andrey Borodin wrote:

> Thinking more about the problem I see 3 ways to deal with this deadlock:
> 1. We check for recovery conflict even in presence of
> InterruptHoldoffCount. That's what patch v4 does.
> 2. Teach page_collect_tuples() to do HeapTupleSatisfiesVisibility()
> without holding the buffer lock.
> 3. Why do we even HOLD_INTERRUPTS() when acquiring a shared lock??

Hmm, as you say, doing (3) is a very invasive system-wide change, but can we do it more localized? I mean, what if we do RESUME_INTERRUPTS() just before going to sleep on the CV, and restore with HOLD_INTERRUPTS() once the sleep is done? That would only affect this one place rather than the whole system, and should also (AFAICS) solve the issue.

> Yet, I see 3 as the correct solution. Can't we just abstain from
> HOLD_INTERRUPTS() if the taken LWLock is not exclusive?

Hmm, the code in LWLockAcquire says

	/*
	 * Lock out cancel/die interrupts until we exit the code section protected
	 * by the LWLock.  This ensures that interrupts will not interfere with
	 * manipulations of data structures in shared memory.
	 */
	HOLD_INTERRUPTS();

which means if we want to change this, we would have to inspect every single use of LWLocks in shared mode in order to be certain that such a change isn't problematic. This is a discussion I'm not prepared for.

-- 
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"If you want to be creative, learn the art of wasting time"
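[If one reads the suggestion literally, the localized variant would wrap only the sleep shown in frames #4/#5 of the client backtrace, roughly like this. It is an untested illustration of the idea, not a patch; in particular, whether a single RESUME_INTERRUPTS() is enough while a buffer content lock is also held is exactly what would need auditing:]

    /*
     * Sketch of the localized idea: drop the interrupt holdoff only for
     * the duration of the CV sleep, so a recovery-conflict interrupt can
     * cancel the waiter instead of deadlocking with the startup process.
     */
    RESUME_INTERRUPTS();
    ConditionVariableSleep(&MultiXactState->nextoff_cv,
                           WAIT_EVENT_MULTIXACT_CREATION);
    HOLD_INTERRUPTS();          /* restore the caller's expectations */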
Hello,

Andrey and I discussed this on IM, and after some back and forth, he came up with a brilliant idea: modify the WAL record for multixact creation, so that the offset of the next multixact is transmitted and can be replayed. (We know it when we create each multixact, because the number of members is known.) So the replica can store the offset of the next multixact right away, even though it doesn't know the members for that multixact. On replay of the next multixact we can cross-check that the offset matches what we had written previously. This allows reading the first multixact, without having to wait for the replay of creation of the second multixact.

One concern is: if we write the offset for the second mxact, but haven't written its members, what happens if another process looks up the members for that multixact? We'll have to make it wait (retry) somehow. Given what was described upthread, it's possible for the multixact beyond that one to be written already, so we won't have the zero offset that would make us wait.

Anyway, he's going to try and implement this. Andrey, please let me know if I misunderstood the idea.

-- 
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
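[In pseudo-C the replay side of the idea would look roughly like the sketch below. record_offset_for() is a hypothetical helper; the real patch manipulates the multixact offsets SLRU pages directly, cf. the hunks quoted later in the thread:]

    /*
     * Sketch of redo for a multixact-create record carrying
     * (multi, offset, nmembers).
     */
    /* Store the multixact's own starting offset, as today. */
    record_offset_for(multi, offset);

    /*
     * New part: we already know where multi+1's members will start, so
     * store that too.  A reader of 'multi' then no longer has to wait
     * for multi+1 to be WAL-logged; when multi+1 is eventually replayed,
     * its offset is cross-checked against offset + nmembers.
     */
    record_offset_for(multi + 1, offset + nmembers);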
> On 18 Jul 2025, at 16:53, Álvaro Herrera <alvherre@kurilemu.de> wrote:
>
> Hello,
>
> Andrey and I discussed this on IM, and after some back and forth, he
> came up with a brilliant idea: modify the WAL record for multixact
> creation, so that the offset of the next multixact is transmitted and
> can be replayed. (We know it when we create each multixact, because the
> number of members is known.) So the replica can store the offset of the
> next multixact right away, even though it doesn't know the members for
> that multixact. On replay of the next multixact we can cross-check that
> the offset matches what we had written previously. This allows reading
> the first multixact, without having to wait for the replay of creation
> of the second multixact.
>
> One concern is: if we write the offset for the second mxact, but haven't
> written its members, what happens if another process looks up the
> members for that multixact? We'll have to make it wait (retry) somehow.
> Given what was described upthread, it's possible for the multixact
> beyond that one to be written already, so we won't have the zero offset
> that would make us wait.

We always redo multixact creation before it becomes visible anywhere on the heap. The problem was that to read a multi we might need another multi's offset, and that multi might not happen to be WAL-logged yet. However, I think we do not need to read a multi before it is redone.

> Anyway, he's going to try and implement this.
>
> Andrey, please let me know if I misunderstood the idea.

Please find attached a dirty test and a sketch of the fix. It is done against PG 16; I wanted to ensure that the problem is reproducible before 17.

Best regards, Andrey Borodin.
> On 18 Jul 2025, at 18:53, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> Please find attached a dirty test and a sketch of the fix. It is done against PG 16; I wanted to ensure that the problem is reproducible before 17.

Here's v7 with improved comments and a cross-check for correctness. Also, MultiXact wraparound is handled.

I'm planning to prepare tests and fixes for all supported branches, if there are no objections to this approach.

Best regards, Andrey Borodin.
> On 21 Jul 2025, at 19:58, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> I'm planning to prepare tests and fixes for all supported branches

This is a status update message. I've reproduced the problem on REL_13_STABLE and verified that the proposed fix works there.

Also, I've discovered one more serious problem. If a backend crashes just before WAL-logging a multi, any heap tuple that uses this multi will become unreadable. Any attempt to read it will hang forever.

I've reproduced the problem and now I'm working on scripting this scenario. Basically, I modify the code to hang forever after assigning multi number 2. Then execute in the first psql:

create table x as select i,0 v from generate_series(1,10) i;
create unique index on x(i);
\set id 1
begin;
select * from x where i = :id for no key update;
savepoint s1;
update x set v = v+1 where i = :id; -- multi 1
commit;
\set id 2
begin;
select * from x where i = :id for no key update;
savepoint s1;
update x set v = v+1 where i = :id; -- multi 2 -- will hang
commit;

Then in the second psql:

create table y as select i,0 v from generate_series(1,10) i;
create unique index on y(i);
\set id 1
begin;
select * from y where i = :id for no key update;
savepoint s1;
update y set v = v+1 where i = :id;
commit;

After this I pkill -9 postgres. The recovered installation cannot execute select * from x; because multi 1 cannot be read without recovery of multi 2, which was never logged.

Luckily, the fix is the same: just restore the offset of multi 2 when multi 1 is recovered.

Best regards, Andrey Borodin.
On 2025-Jul-25, Andrey Borodin wrote:

> Also, I've discovered one more serious problem.
> If a backend crashes just before WAL-logging a multi, any heap tuple
> that uses this multi will become unreadable. Any attempt to read it
> will hang forever.
>
> I've reproduced the problem and now I'm working on scripting this
> scenario. Basically, I modify the code to hang forever after assigning
> multi number 2.

It took me a minute to understand this, and I think your description is slightly incorrect: you mean that the heap tuple that uses the PREVIOUS multixact cannot be read (at least, that's what I understand from your reproducer script). I agree it's a pretty ugly bug! I think it's essentially the same bug as the other problem, so the proposed fix should solve both. Thanks for working on this!

Looking at this,

	/*
	 * We want to avoid edge case 2 in redo, because we cannot wait for
	 * startup process in GetMultiXactIdMembers() without risk of a
	 * deadlock.
	 */
	MultiXactId next = multi + 1;
	int			next_pageno;

	/* Handle wraparound as GetMultiXactIdMembers() does it. */
	if (multi < FirstMultiXactId)
		multi = FirstMultiXactId;

Don't you mean to test and change the value 'next' rather than 'multi' here?

In this bit,

	 * We do not need to handle race conditions, because this code
	 * is only executed in redo and we hold
	 * MultiXactOffsetSLRULock.

I think it'd be good to have an
Assert(LWLockHeldByMeInMode(MultiXactOffsetSLRULock, LW_EXCLUSIVE));
just for peace of mind.

Also, commit c61678551699 removed ZeroMultiXactOffsetPage(), but since you have 'false' as the second argument, then SimpleLruZeroPage() is enough. (I wondered why isn't WAL-logging necessary ... until I remembered that we're in a standby. I think a simple comment here like "no WAL-logging because we're a standby" should suffice.)

-- 
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
> On 26 Jul 2025, at 22:44, Álvaro Herrera <alvherre@kurilemu.de> wrote:
>
> It took me a minute to understand this, and I think your description is
> slightly incorrect: you mean that the heap tuple that uses the PREVIOUS
> multixact cannot be read (at least, that's what I understand from your
> reproducer script).

Yes, I explained it a bit incorrectly, but you got the problem right.

> Looking at this,
>
> 	/*
> 	 * We want to avoid edge case 2 in redo, because we cannot wait for
> 	 * startup process in GetMultiXactIdMembers() without risk of a
> 	 * deadlock.
> 	 */
> 	MultiXactId next = multi + 1;
> 	int			next_pageno;
>
> 	/* Handle wraparound as GetMultiXactIdMembers() does it. */
> 	if (multi < FirstMultiXactId)
> 		multi = FirstMultiXactId;
>
> Don't you mean to test and change the value 'next' rather than 'multi'
> here?

Yup, that was a typo.

> I think it'd be good to have an
> Assert(LWLockHeldByMeInMode(MultiXactOffsetSLRULock, LW_EXCLUSIVE));
> just for peace of mind.

Ugh, that uncovered a 17+ problem: now we hold a couple of locks simultaneously. I'll post a version with this a bit later.

> Also, commit c61678551699 removed
> ZeroMultiXactOffsetPage(), but since you have 'false' as the second
> argument, then SimpleLruZeroPage() is enough. (I wondered why isn't
> WAL-logging necessary ... until I remembered that we're in a standby. I
> think a simple comment here like "no WAL-logging because we're a
> standby" should suffice.)

Agree.

I've made a test [0] and discovered another problem. Adding this checkpoint breaks the test [1] even after the fix [2]. I suspect that excluding "edge case 2" on standby is simply not enough... we have to do this "next offset" dance on the primary too. I'll think more about other options.

Best regards, Andrey Borodin.

[0] https://github.com/x4m/postgres_g/commit/eafcaec7aafde064b0da5d2ba4041ed2fb134f07
[1] https://github.com/x4m/postgres_g/commit/da762c7cac56eff1988ea9126171ca0a6d2665e9
[2] https://github.com/x4m/postgres_g/commit/d64c17d697d082856e5fe8bd52abafc0585973af

The timeline of these commits can be seen here: https://github.com/x4m/postgres_g/commits/mx19/
> On 27 Jul 2025, at 16:53, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> we have to do this "next offset" dance on the primary too.

PFA a draft of this.

I also attach a version for PG17; maybe Dmitry could try to reproduce the problem with this patch. I think the problem should be fixed by the patch. Thanks!

Best regards, Andrey Borodin.
On 28.07.2025 15:49, Andrey Borodin wrote:
> I also attach a version for PG17; maybe Dmitry could try to reproduce the problem with this patch.

Andrey, thank you very much for your work, and also thanks to Álvaro for joining the discussion of the problem.

I ran tests on PG17 with patch v8; there are no more sessions hanging on the replica, great! Replica requests are canceled with recovery conflicts:

ERROR: canceling statement due to conflict with recovery
DETAIL: User was holding shared buffer pin for too long.
STATEMENT: select sum(val) from tbl2;

or

ERROR: canceling statement due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be removed.
STATEMENT: select sum(val) from tbl2;

But on the master, some of the requests then fail with an error; apparently invalid multixacts remain in the pages.

ERROR: MultiXact 81926 has invalid next offset
STATEMENT: select * from tbl2 where id = $1 for no key update;

ERROR: MultiXact 81941 has invalid next offset
CONTEXT: while scanning block 3 offset 244 of relation "public.tbl2"
	automatic vacuum of table "postgres.public.tbl2"

Best regards, Dmitry.
On 17.07.2025 21:34, Andrey Borodin wrote:
>> On 30 Jun 2025, at 15:58, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>> page_collect_tuples() holds a lock on the buffer while examining tuple visibility, having InterruptHoldoffCount > 0. The tuple visibility check might need WAL to go on: we have to wait until some next MX is filled in.
>> Which might need a buffer lock or have a snapshot conflict with the caller of page_collect_tuples().
>
> Thinking more about the problem I see 3 ways to deal with this deadlock:
> 2. Teach page_collect_tuples() to do HeapTupleSatisfiesVisibility() without holding the buffer lock.
>
> Personally, I see point 2 as very invasive in code that I'm not too familiar with.

If there were no SetHintBits() inside of HeapTupleSatisfies*, then it could be just "copy line pointers and tuple headers under the lock, release the lock, check tuple visibility using the copied arrays". But hint bits make it much more difficult.

Probably, tuple headers could be copied twice and compared afterwards. If there is a change in hint bits, the page should be relocked. And the call to MarkBufferDirtyHint should be delayed.

A very dirty variant is attached. I've made it just for fun. It passes 'regress', 'isolation' and 'recovery'. But I didn't benchmark it.

-- 
regards
Yura Sokolov aka funny-falcon
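[In outline, the approach Yura describes does something like the sketch below. All helper names are hypothetical and the structures are simplified; the real attachment works on the actual heap page internals:]

    /* Pass 1: under the shared lock, copy everything visibility checks read. */
    LockBuffer(buffer, BUFFER_LOCK_SHARE);
    copy_line_pointers_and_headers(page, &before);   /* hypothetical helper */
    LockBuffer(buffer, BUFFER_LOCK_UNLOCK);

    /*
     * Pass 2: run the visibility checks against the copies.  SetHintBits()
     * is the complication: instead of writing into the page, record the
     * hint bits that would have been set, to apply later.
     */
    check_visibility_on_copies(&before, snapshot, &deferred_hints);

    /*
     * Pass 3: relock, re-copy the headers and compare.  If hint bits
     * changed underneath us, redo the check; only then apply the deferred
     * hints and mark the buffer dirty.
     */
    LockBuffer(buffer, BUFFER_LOCK_SHARE);
    copy_line_pointers_and_headers(page, &after);
    if (!headers_equal(&before, &after))
        recheck_visibility(&after, snapshot, &deferred_hints);
    apply_deferred_hint_bits(page, &deferred_hints);
    MarkBufferDirtyHint(buffer, true);
    LockBuffer(buffer, BUFFER_LOCK_UNLOCK);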
> On 29 Jul 2025, at 12:17, Dmitry <dsy.075@yandex.ru> wrote:
>
> But on the master, some of the requests then fail with an error; apparently invalid multixacts remain in the pages.

Thanks! That's a bug in my patch. I do not understand it yet. I've reproduced it with your original workload.

Most of the errors I see are shallow (offset == 0 or nextOffset == 0), but this one is interesting:

TRAP: failed Assert("shared->page_number[slotno] == pageno && shared->page_status[slotno] == SLRU_PAGE_WRITE_IN_PROGRESS"), File: "slru.c", Line: 729, PID: 91085
0   postgres    0x00000001032ea5ac ExceptionalCondition + 216
1   postgres    0x0000000102af2784 SlruInternalWritePage + 700
2   postgres    0x0000000102af14dc SimpleLruWritePage + 96
3   postgres    0x0000000102ae89d4 RecordNewMultiXact + 576

So it makes me think that it's some kind of IO concurrency issue. As expected, the error only persists if the "extend SLRU" branch is active in RecordNewMultiXact().

Thanks for testing!

Best regards, Andrey Borodin.
> On 29 Jul 2025, at 23:15, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> I do not understand it yet.

OK, I figured it out. SimpleLruDoesPhysicalPageExist() was reading a physical file and could race with real extension by ExtendMultiXactOffset(). So I used ExtendMultiXactOffset(actual + 1). I hope this does not open a loop for wraparound...

Here are two updated patches, one for Postgres 17 and one for master (with a test).

Best regards, Andrey Borodin.
On 31.07.2025 09:29, Andrey Borodin wrote:
> Here are two updated patches, one for Postgres 17 and one for master (with a test).

I ran tests on PG17 with patch v9. I tried to reproduce the problem for three cases: the first where we explicitly use FOR KEY SHARE, the second through subtransactions, and the third through implicit use of FOR KEY SHARE via a foreign key. There are no more sessions hanging on the replica, that's great! Thank you all very much!

Best regards, Dmitry.
On Thu, 31 Jul 2025 at 11:29, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
>> On 29 Jul 2025, at 23:15, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>>
>> I do not understand it yet.
>
> OK, I figured it out. SimpleLruDoesPhysicalPageExist() was reading a physical file and could race with real extension by ExtendMultiXactOffset().
> So I used ExtendMultiXactOffset(actual + 1). I hope this does not open a loop for wraparound...
>
> Here are two updated patches, one for Postgres 17 and one for master (with a test).

Hi!

+	/*
+	 * We might have filled this offset previosuly.
+	 * Cross-check for correctness.
+	 */
+	Assert((*offptr == 0) || (*offptr == offset));

Should we exit here with errcode(ERRCODE_DATA_CORRUPTED) if *offptr != 0 and *offptr != offset?

+	/* Read and adjust next page */
+	next_slotno = SimpleLruReadPage(MultiXactOffsetCtl, next_pageno, true, next);
+	next_offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[next_slotno];
+	next_offptr[next_entryno] = offset + nmembers;

Should we check the value of next_offptr[next_entryno] to be equal to zero or offset + nmembers? Assert or errcode(ERRCODE_DATA_CORRUPTED) also.

-- 
Best regards,
Kirill Reshke
Hi Kirill, thanks for looking into this!

> On 20 Aug 2025, at 12:19, Kirill Reshke <reshkekirill@gmail.com> wrote:
>
> +	/*
> +	 * We might have filled this offset previosuly.
> +	 * Cross-check for correctness.
> +	 */
> +	Assert((*offptr == 0) || (*offptr == offset));
>
> Should we exit here with errcode(ERRCODE_DATA_CORRUPTED) if *offptr !=
> 0 and *offptr != offset?

No, we should not exit. We encountered inconsistencies that we are fully prepared to fix. But you are right: we should rather emit a WARNING with XX001.

> +	/* Read and adjust next page */
> +	next_slotno = SimpleLruReadPage(MultiXactOffsetCtl, next_pageno, true, next);
> +	next_offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[next_slotno];
> +	next_offptr[next_entryno] = offset + nmembers;
>
> Should we check the value of next_offptr[next_entryno] to be equal to
> zero or offset + nmembers? Assert or
> errcode(ERRCODE_DATA_CORRUPTED) also.

Yes, we'd better warn the user here too.

Thanks for your valuable suggestions. I'm not sending a new version of the patch, because I'm waiting for input on the overall design from Alvaro or any committer willing to fix this. We need to figure out whether this radical approach is acceptable to backpatch. I do not see other options, but someone might have more clever ideas.

Best regards, Andrey Borodin.
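[For concreteness, the cross-check-as-warning discussed above could take roughly this shape. This is a sketch only; the variable names come from the quoted hunks, and XX001 is ERRCODE_DATA_CORRUPTED:]

    /* Sketch: demote the Assert to a data-corruption warning, as discussed. */
    if (*offptr != 0 && *offptr != offset)
        ereport(WARNING,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg("offset %u of multixact %u does not match previously recorded offset %u",
                        offset, multi, *offptr)));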