Discussion: IPC/MultixactCreation on the Standby server


IPC/MultixactCreation on the Standby server

From: Dmitry
Hi, hackers

The problem is as follows.
A replication cluster includes a primary server and one hot-standby replica.
The workload on the primary server consists of many queries that 
generate multixact IDs, while the hot-standby replica runs read-only 
queries.

After some time, all queries on the hot standby get stuck and never 
finish.

The `pg_stat_activity` view on the replica shows these processes 
waiting on IPC/MultixactCreation. `pg_cancel_backend` and 
`pg_terminate_backend` cannot cancel such a query; SIGQUIT is the only 
way to stop it.

We tried:
- changing the `autovacuum_multixact_freeze_max_age` parameter,
- increasing `multixact_member_buffers` and `multixact_offset_buffers`,
- disabling `hot_standby_feedback`,
- switching the replica between synchronous and asynchronous mode,
- and much more.
But nothing helped.

We also ran the replica in recovery mode from a WAL archive, i.e. as a 
warm standby; the result was the same.

We also tried building from source from the REL_17_5 branch with the 
default configure settings:
     ./configure
     make
     make install
but that did not help either.

Here is an example with a synthetic workload reproducing the problem.

Test system
===========

- Architecture: x86_64
- OS: Ubuntu 24.04.2 LTS (Noble Numbat)
- Tested postgres version(s):
     - latest 17 (17.5)
     - latest 18 (18-beta1)

The problem is not reproducible on PostgreSQL 16.9.

Steps to reproduce
==================

     postgres=# create table tbl (
         id int primary key,
         val int
     );
     postgres=# insert into tbl select i, 0 from generate_series(1,5) i;


The first and second scripts execute queries on the master server
-----------------------------------------------------------------

     pgbench --no-vacuum --report-per-command -M prepared -c 200 -j 200 -T 300 -P 1 --file=/dev/stdin <<'EOF'
     \set id random(1, 5)
     begin;
     select * from tbl where id = :id for key share;
     commit;
     EOF

     pgbench --no-vacuum --report-per-command -M prepared -c 100 -j 100 -T 300 -P 1 --file=/dev/stdin <<'EOF'
     \set id random(1, 5)
     begin;
     update tbl set val = val+1 where id = :id;
     \sleep 10 ms
     commit;
     EOF


The following script is executed on the replica
-----------------------------------------------

     pgbench --no-vacuum --report-per-command -M prepared -c 100 -j 100 -T 300 -P 1 --file=/dev/stdin <<'EOF'
     begin;
     select sum(val) from tbl;
     \sleep 10 ms
     select sum(val) from tbl;
     \sleep 10 ms
     commit;
     EOF

     pgbench (17.5 (Ubuntu 17.5-1.pgdg24.04+1))
     progress: 1.0 s, 2606.8 tps, lat 33.588 ms stddev 13.316, 0 failed
     progress: 2.0 s, 3315.0 tps, lat 30.174 ms stddev 5.933, 0 failed
     progress: 3.0 s, 3357.0 tps, lat 29.699 ms stddev 5.541, 0 failed
     progress: 4.0 s, 3350.0 tps, lat 29.911 ms stddev 5.311, 0 failed
     progress: 5.0 s, 3206.0 tps, lat 30.999 ms stddev 6.343, 0 failed
     progress: 6.0 s, 3264.0 tps, lat 30.828 ms stddev 6.389, 0 failed
     progress: 7.0 s, 3224.0 tps, lat 31.099 ms stddev 6.197, 0 failed
     progress: 8.0 s, 3168.0 tps, lat 31.486 ms stddev 6.940, 0 failed
     progress: 9.0 s, 3118.0 tps, lat 32.004 ms stddev 6.546, 0 failed
     progress: 10.0 s, 3017.0 tps, lat 33.183 ms stddev 7.971, 0 failed
     progress: 11.0 s, 3157.0 tps, lat 31.697 ms stddev 6.624, 0 failed
     progress: 12.0 s, 3180.0 tps, lat 31.415 ms stddev 6.310, 0 failed
     progress: 13.0 s, 3150.9 tps, lat 31.591 ms stddev 6.280, 0 failed
     progress: 14.0 s, 3329.0 tps, lat 30.189 ms stddev 5.792, 0 failed
     progress: 15.0 s, 3233.6 tps, lat 30.852 ms stddev 5.723, 0 failed
     progress: 16.0 s, 3185.4 tps, lat 31.378 ms stddev 6.383, 0 failed
     progress: 17.0 s, 3035.0 tps, lat 32.920 ms stddev 7.390, 0 failed
     progress: 18.0 s, 3173.0 tps, lat 31.547 ms stddev 6.390, 0 failed
     progress: 19.0 s, 3077.0 tps, lat 32.427 ms stddev 6.634, 0 failed
     progress: 20.0 s, 3266.1 tps, lat 30.740 ms stddev 5.842, 0 failed
     progress: 21.0 s, 2990.9 tps, lat 33.353 ms stddev 7.019, 0 failed
     progress: 22.0 s, 3048.1 tps, lat 32.933 ms stddev 6.951, 0 failed
     progress: 23.0 s, 3148.0 tps, lat 31.769 ms stddev 6.077, 0 failed
     progress: 24.0 s, 1523.2 tps, lat 30.029 ms stddev 5.093, 0 failed
     progress: 25.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
     progress: 26.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
     progress: 27.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
     progress: 28.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
     progress: 29.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
     progress: 30.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
     progress: 31.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
     progress: 32.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
     progress: 33.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
     progress: 34.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
     progress: 35.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed

After some time, all requests on the replica hang waiting for 
IPC/MultixactCreation.

Output from `pg_stat_activity`
------------------------------

          backend_type     | state  | wait_event_type |    wait_event     |           query
     --------------------+--------+-----------------+-------------------+---------------------------
      client backend     | active | IPC             | MultixactCreation | select sum(val) from tbl;
      client backend     | active | IPC             | MultixactCreation | select sum(val) from tbl;
      client backend     | active | IPC             | MultixactCreation | select sum(val) from tbl;
      client backend     | active | IPC             | MultixactCreation | select sum(val) from tbl;
     ...
      client backend     | active | IPC             | MultixactCreation | select sum(val) from tbl;
      client backend     | active | IPC             | MultixactCreation | select sum(val) from tbl;
      client backend     | active | IPC             | MultixactCreation | select sum(val) from tbl;
      startup            |        | LWLock          | BufferContent     |
      checkpointer       |        | Activity        | CheckpointerMain  |
      background writer  |        | Activity        | BgwriterHibernate |
      walreceiver        |        | Activity        | WalReceiverMain   |


gdb session for `client backend` process
----------------------------------------

     (gdb) bt
     #0  0x00007f0e9872a007 in epoll_wait (epfd=5, 
events=0x57c4747fc458, maxevents=1, timeout=-1) at 
../sysdeps/unix/sysv/linux/epoll_wait.c:30
     #1  0x000057c440685033 in WaitEventSetWaitBlock (nevents=<optimized 
out>, occurred_events=0x7ffdaedc8360, cur_timeout=-1, set=0x57c4747fc3f0)
         at 
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/ipc/latch.c:1577
     #2  WaitEventSetWait (set=0x57c4747fc3f0, timeout=timeout@entry=-1, 
occurred_events=occurred_events@entry=0x7ffdaedc8360, 
nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=134217765)
         at 
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/ipc/latch.c:1525
     #3  0x000057c44068541c in WaitLatch (latch=<optimized out>,
wakeEvents=<optimized out>, timeout=<optimized out>, 
wait_event_info=134217765)
         at 
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/ipc/latch.c:538
     #4  0x000057c44068d8c0 in ConditionVariableTimedSleep 
(cv=0x7f0cefc50ab0, timeout=-1, wait_event_info=134217765)
         at 
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/lmgr/condition_variable.c:163
     #5  0x000057c440365a0c in ConditionVariableSleep 
(wait_event_info=134217765, cv=<optimized out>) at 
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/lmgr/condition_variable.c:98
     #6  GetMultiXactIdMembers (multi=45559845, members=0x7ffdaedc84b0, 
from_pgupgrade=<optimized out>, isLockOnly=<optimized out>)
         at 
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/multixact.c:1483
     #7  0x000057c4408adc6b in MultiXactIdGetUpdateXid.isra.0 
(xmax=xmax@entry=45559845, t_infomask=<optimized out>) at 
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:7478
     #8  0x000057c44031ecfa in HeapTupleGetUpdateXid (tuple=<error 
reading variable: Cannot access memory at address 0x0>) at 
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:7519
     #9  HeapTupleSatisfiesMVCC (htup=<optimized out>, buffer=404, 
snapshot=0x57c474892ff0) at 
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam_visibility.c:1090
     #10 HeapTupleSatisfiesVisibility (htup=<optimized out>, 
snapshot=0x57c474892ff0, buffer=404) at 
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam_visibility.c:1772
     #11 0x000057c44030c1cb in page_collect_tuples 
(check_serializable=<optimized out>, all_visible=<optimized out>, 
lines=<optimized out>, block=<optimized out>, buffer=<optimized out>, 
page=<optimized out>,
         snapshot=<optimized out>, scan=<optimized out>) at 
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:480
     #12 heap_prepare_pagescan (sscan=0x57c47495b970) at 
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:579
     #13 0x000057c44030cb59 in heapgettup_pagemode 
(scan=scan@entry=0x57c47495b970, dir=<optimized out>, nkeys=<optimized 
out>, key=<optimized out>)
         at 
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:999
     #14 0x000057c44030d1bd in heap_getnextslot (sscan=0x57c47495b970, 
direction=<optimized out>, slot=0x57c47494b278) at 
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:1319
     #15 0x000057c4404f090a in table_scan_getnextslot 
(slot=0x57c47494b278, direction=ForwardScanDirection, sscan=<optimized out>)
         at 
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/include/access/tableam.h:1072
     #16 SeqNext (node=0x57c47494b0e8) at 
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/executor/nodeSeqscan.c:80
     #17 0x000057c4404d5cfc in ExecProcNode (node=0x57c47494b0e8) at 
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/include/executor/executor.h:274
     #18 fetch_input_tuple (aggstate=aggstate@entry=0x57c47494aaf0) at 
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/executor/nodeAgg.c:561
     #19 0x000057c4404d848a in agg_retrieve_direct 
(aggstate=0x57c47494aaf0) at 
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/executor/nodeAgg.c:2459
     #20 ExecAgg (pstate=0x57c47494aaf0) at 
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/executor/nodeAgg.c:2179
     #21 0x000057c4404c2003 in ExecProcNode (node=0x57c47494aaf0) at 
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/include/executor/executor.h:274
     #22 ExecutePlan (dest=0x57c47483d548, direction=<optimized out>, 
numberTuples=0, sendTuples=true, operation=CMD_SELECT, 
queryDesc=0x57c474895010)
         at 
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/executor/execMain.c:1649
     #23 standard_ExecutorRun (queryDesc=0x57c474895010, 
direction=<optimized out>, count=0, execute_once=<optimized out>) at 
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/executor/execMain.c:361
     ...

gdb session for `startup` process
---------------------------------

     (gdb) bt
     #0  0x00007f0e98698ce3 in __futex_abstimed_wait_common64 
(private=<optimized out>, cancel=true, abstime=0x0, op=265, expected=0, 
futex_word=0x7f0ceb34e6b8) at ./nptl/futex-internal.c:57
     #1  __futex_abstimed_wait_common (cancel=true, private=<optimized 
out>, abstime=0x0, clockid=0, expected=0, futex_word=0x7f0ceb34e6b8) at 
./nptl/futex-internal.c:87
     #2  __GI___futex_abstimed_wait_cancelable64 
(futex_word=futex_word@entry=0x7f0ceb34e6b8, expected=expected@entry=0, 
clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>)
         at ./nptl/futex-internal.c:139
     #3  0x00007f0e986a4f1f in do_futex_wait 
(sem=sem@entry=0x7f0ceb34e6b8, abstime=0x0, clockid=0) at 
./nptl/sem_waitcommon.c:111
     #4  0x00007f0e986a4fb8 in __new_sem_wait_slow64 
(sem=sem@entry=0x7f0ceb34e6b8, abstime=0x0, clockid=0) at 
./nptl/sem_waitcommon.c:183
     #5  0x00007f0e986a503d in __new_sem_wait 
(sem=sem@entry=0x7f0ceb34e6b8) at ./nptl/sem_wait.c:42
     #6  0x000057c440696166 in PGSemaphoreLock (sema=0x7f0ceb34e6b8) at 
port/pg_sema.c:327
     #7  LWLockAcquire (lock=0x7f0cefc58064, mode=LW_EXCLUSIVE) at 
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/lmgr/lwlock.c:1289
     #8  0x000057c44038f96a in LockBuffer (mode=2, buffer=<optimized 
out>) at 
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/buffer/bufmgr.c:5147
     #9  XLogReadBufferForRedoExtended (record=<optimized out>, 
block_id=<optimized out>, mode=RBM_NORMAL, get_cleanup_lock=false, 
buf=0x7ffdaedc8b4c)
         at 
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/xlogutils.c:429
     #10 0x000057c440319969 in XLogReadBufferForRedo 
(buf=0x7ffdaedc8b4c, block_id=0 '\000', record=0x57c4748994d8) at 
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/xlogutils.c:317
     #11 heap_xlog_lock_updated (record=0x57c4748994d8) at 
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:10230
     #12 heap2_redo (record=0x57c4748994d8) at 
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:10362
     #13 0x000057c44038e1d2 in ApplyWalRecord (replayTLI=<synthetic 
pointer>, record=0x7f0e983908e0, xlogreader=<optimized out>)
         at 
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/include/access/xlog_internal.h:380
     #14 PerformWalRecovery () at 
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/xlogrecovery.c:1822
     #15 0x000057c44037bbf6 in StartupXLOG () at 
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/xlog.c:5821
     #16 0x000057c4406155ed in StartupProcessMain 
(startup_data=<optimized out>, startup_data_len=<optimized out>) at 
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/startup.c:258
     #17 0x000057c44060b376 in postmaster_child_launch 
(child_type=B_STARTUP, startup_data=0x0, startup_data_len=0, 
client_sock=0x0)
         at 
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/launch_backend.c:277
     #18 0x000057c440614509 in postmaster_child_launch (client_sock=0x0, 
startup_data_len=0, startup_data=0x0, child_type=B_STARTUP)
         at 
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/postmaster.c:3934
     #19 StartChildProcess (type=type@entry=B_STARTUP) at 
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/postmaster.c:3930
     #20 0x000057c44061480d in PostmasterStateMachine () at 
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/postmaster.c:3392
     #21 0x000057c4408a3455 in process_pm_child_exit () at 
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/postmaster.c:2683
     #22 ServerLoop.isra.0 () at 
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/postmaster.c:1667
     #23 0x000057c440616965 in PostmasterMain (argc=<optimized out>, 
argv=<optimized out>) at 
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/postmaster.c:1374
     #24 0x000057c4402bcd2d in main (argc=17, argv=0x57c4747fb140) at 
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/main/main.c:199

Could you please help me fix the problem of the stuck 'client backend' 
processes?

Any ideas and recommendations would be greatly appreciated!

Best regards,
Dmitry





Re: IPC/MultixactCreation on the Standby server

From: Dmitry
On 25.06.2025 12:34, Andrey Borodin wrote:
>
>> On 25 Jun 2025, at 11:11, Dmitry <dsy.075@yandex.ru> wrote:
>>
>>      #6  GetMultiXactIdMembers (multi=45559845, members=0x7ffdaedc84b0, from_pgupgrade=<optimized out>, isLockOnly=<optimized out>)
>>          at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/multixact.c:1483
> Hi Dmitry!
>
> This looks to be related to work in my thread about multixacts [0]. Seems like the CV sleep in /* Corner case 2: next multixact is still being filled in */ is not woken up by ConditionVariableBroadcast(&MultiXactState->nextoff_cv) from WAL redo.
>
> If so - any subsequent multixact redo from WAL should unstuck reading the last MultiXact.
>
> Either way the redo path might not be going through ConditionVariableBroadcast(). I will investigate this further.
>
>
> Can you please check your reproduction with the patch attached to this message? This patch simply adds a timeout on the CV sleep, so in the worst case we fall back to the behavior of PG 16.
>
>
> Best regards, Andrey Borodin.

Hi Andrey!

Thanks so much for your response.

A small comment on /* Corner case 2: ... */
At this point in the code I tried to set trace points by printing 
messages with `elog()`, and I can say that the process does not always 
get stuck in this part of the code; it happens from time to time and in 
an unpredictable way.

Maybe this will help you a little.
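For reference, a trace point of the kind described here might look roughly 
like this (a sketch only: the surrounding corner-case-2 code is paraphrased 
from multixact.c in PG 17, and the variable names and message text are 
illustrative):

     /*
      * Sketch of a trace point around "Corner case 2" in GetMultiXactIdMembers().
      * Only the elog() line is the added trace; the rest is paraphrased.
      */
     if (nextOffset == 0)
     {
         /* Corner case 2: next multixact is still being filled in */
         elog(WARNING, "multi %u: next offset not yet written, nextMXact %u",
              multi, MultiXactState->nextMXact);    /* added trace point */
         ConditionVariableSleep(&MultiXactState->nextoff_cv,
                                WAIT_EVENT_MULTIXACT_CREATION);
     }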

To be honest, PostgreSQL performance is much better with this feature; 
it would be a shame if we had to roll back to the behavior of version 16.

I will definitely try to reproduce the problem with your patch.

Best regards,
Dmitry.




Re: IPC/MultixactCreation on the Standby server

From: Dmitry
On 25.06.2025 16:44, Dmitry wrote:
> I will definitely try to reproduce the problem with your patch.
Hi Andrey!

I checked with the patch; unfortunately the problem is also reproducible.
Client processes wake up after a second and try to get information about the members of the multixact again, in an endless loop.
At the same time, WAL is not being replayed; the 'startup' process also hangs on LWLock/BufferContent.

Best regards,
Dmitry.

Re: IPC/MultixactCreation on the Standby server

From: Andrey Borodin

> On 26 Jun 2025, at 14:33, Dmitry <dsy.075@yandex.ru> wrote:
>
> On 25.06.2025 16:44, Dmitry wrote:
>> I will definitely try to reproduce the problem with your patch.
> Hi Andrey!
>
> I checked with the patch; unfortunately the problem is also reproducible.
> Client processes wake up after a second and try to get information about the members of the multixact again, in an endless loop.
> At the same time, WAL is not being replayed; the 'startup' process also hangs on LWLock/BufferContent.

My hypothesis is that MultiXactState->nextMXact is not filled often enough from redo paths. So if you are unlucky enough, corner case 2 reading can deadlock with startup.
I need to verify it further, but if so - it's an ancient bug that just happens to be a few orders of magnitude more reproducible on 17 due to performance improvements. Still a hypothesis, though.


Best regards, Andrey Borodin.


Re: IPC/MultixactCreation on the Standby server

From: Dmitry
On 26.06.2025 19:24, Andrey Borodin wrote:
> If my hypothesis is correct nextMXact will precede tmpMXact.

It seems that the hypothesis has not been confirmed.

Attempt #1
2025-06-26 23:47:24.821 MSK [220458] WARNING:  Timed out: nextMXact 24138381 tmpMXact 24138379
2025-06-26 23:47:24.822 MSK [220540] WARNING:  Timed out: nextMXact 24138382 tmpMXact 24138379
2025-06-26 23:47:24.823 MSK [220548] WARNING:  Timed out: nextMXact 24138382 tmpMXact 24138379
...

pgbench (17.5)
progress: 2.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
progress: 3.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
progress: 4.0 s, 482.2 tps, lat 820.293 ms stddev 1370.729, 0 failed
progress: 5.0 s, 886.0 tps, lat 112.463 ms stddev 8.506, 0 failed
progress: 6.0 s, 348.9 tps, lat 111.324 ms stddev 5.871, 0 failed
WARNING:  Timed out: nextMXact 24138381 tmpMXact 24138379
WARNING:  Timed out: nextMXact 24138382 tmpMXact 24138379
WARNING:  Timed out: nextMXact 24138382 tmpMXact 24138379
...
progress: 7.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
WARNING:  Timed out: nextMXact 24138382 tmpMXact 24138379


Attempt #2
2025-06-27 09:18:01.312 MSK [236187] WARNING:  Timed out: nextMXact 24497746 tmpMXact 24497744
2025-06-27 09:18:01.312 MSK [236225] WARNING:  Timed out: nextMXact 24497746 tmpMXact 24497744
2025-06-27 09:18:01.312 MSK [236178] WARNING:  Timed out: nextMXact 24497746 tmpMXact 24497744
...

pgbench (17.5)
progress: 1.0 s, 830.9 tps, lat 108.556 ms stddev 10.078, 0 failed
progress: 2.0 s, 839.0 tps, lat 118.358 ms stddev 19.708, 0 failed
progress: 3.0 s, 623.4 tps, lat 134.186 ms stddev 15.565, 0 failed
WARNING:  Timed out: nextMXact 24497746 tmpMXact 24497744
WARNING:  Timed out: nextMXact 24497746 tmpMXact 24497744
WARNING:  Timed out: nextMXact 24497746 tmpMXact 24497744
WARNING:  Timed out: nextMXact 24497746 tmpMXact 24497744
WARNING:  Timed out: nextMXact 24497746 tmpMXact 24497744
WARNING:  Timed out: nextMXact 24497747 tmpMXact 24497744
WARNING:  Timed out: nextMXact 24497747 tmpMXact 24497744
...
progress: 4.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
WARNING:  Timed out: nextMXact 24497746 tmpMXact 24497744


Best regards,
Dmitry.




Re: IPC/MultixactCreation on the Standby server

From: Andrey Borodin

> On 30 Jun 2025, at 15:58, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> page_collect_tuples() holds a lock on the buffer while examining tuple visibility, having InterruptHoldoffCount > 0. The tuple visibility check might need WAL to go on; we have to wait until some next MX is filled in.
> Which might need a buffer lock or have a snapshot conflict with the caller of page_collect_tuples().

Thinking more about the problem I see 3 ways to deal with this deadlock:
1. We check for recovery conflicts even in the presence of InterruptHoldoffCount. That's what patch v4 does.
2. Teach page_collect_tuples() to do HeapTupleSatisfiesVisibility() without holding the buffer lock.
3. Why do we even HOLD_INTERRUPTS() when acquiring a shared lock??

Personally, I see point 2 as very invasive in code that I'm not too familiar with. Option 1 is clumsy. But option 3 is a giant system-wide change.
Yet, I see 3 as a correct solution. Can't we just abstain from HOLD_INTERRUPTS() if the taken LWLock is not exclusive?


Best regards, Andrey Borodin.


Re: IPC/MultixactCreation on the Standby server

From: Álvaro Herrera
On 2025-Jul-17, Andrey Borodin wrote:

> Thinking more about the problem I see 3 ways to deal with this deadlock:
> 1. We check for recovery conflicts even in the presence of
> InterruptHoldoffCount. That's what patch v4 does.
> 2. Teach page_collect_tuples() to do HeapTupleSatisfiesVisibility()
> without holding the buffer lock.
> 3. Why do we even HOLD_INTERRUPTS() when acquiring a shared lock??

Hmm, as you say, doing (3) is a very invasive system-wide change, but
can we do it in a more localized way?  I mean, what if we do RESUME_INTERRUPTS()
just before going to sleep on the CV, and restore with HOLD_INTERRUPTS()
once the sleep is done?  That would only affect this one place rather
than the whole system, and should also (AFAICS) solve the issue.
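A minimal sketch of that localized variant, assuming the sleep in question 
is the nextoff_cv wait in GetMultiXactIdMembers() (placement and surrounding 
code are indicative only, not an actual patch):

     /*
      * Re-enable cancel/die interrupts only for the duration of the CV sleep,
      * then take them away again before re-entering the LWLock-protected code.
      */
     RESUME_INTERRUPTS();
     ConditionVariableSleep(&MultiXactState->nextoff_cv,
                            WAIT_EVENT_MULTIXACT_CREATION);
     HOLD_INTERRUPTS();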


> Yet, I see 3 as a correct solution. Can't we just abstain from
> HOLD_INTERRUPTS() if taken LWLock is not exclusive?

Hmm, the code in LWLockAcquire says

    /*
     * Lock out cancel/die interrupts until we exit the code section protected
     * by the LWLock.  This ensures that interrupts will not interfere with
     * manipulations of data structures in shared memory.
     */
    HOLD_INTERRUPTS();

which means if we want to change this, we would have to inspect every
single use of LWLocks in shared mode in order to be certain that such a
change isn't problematic.  This is a discussion I'm not prepared for.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Si quieres ser creativo, aprende el arte de perder el tiempo"



Re: IPC/MultixactCreation on the Standby server

From: Álvaro Herrera
Hello,

Andrey and I discussed this on IM, and after some back and forth, he
came up with a brilliant idea: modify the WAL record for multixact
creation, so that the offset of the next multixact is transmitted and
can be replayed.  (We know it when we create each multixact, because the
number of members is known).  So the replica can store the offset of the
next multixact right away, even though it doesn't know the members for
that multixact.  On replay of the next multixact we can cross-check that
the offset matches what we had written previously.  This allows reading
the first multixact, without having to wait for the replay of creation
of the second multixact.

One concern is: if we write the offset for the second mxact, but haven't
written its members, what happens if another process looks up the
members for that multixact?  We'll have to make it wait (retry) somehow.
Given what was described upthread, it's possible for the multixact
beyond that one to be written already, so we won't have the zero offset
that would make us wait.

Anyway, he's going to try and implement this.

Andrey, please let me know if I misunderstood the idea.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/



Re: IPC/MultixactCreation on the Standby server

From: Andrey Borodin

> On 18 Jul 2025, at 16:53, Álvaro Herrera <alvherre@kurilemu.de> wrote:
>
> Hello,
>
> Andrey and I discussed this on IM, and after some back and forth, he
> came up with a brilliant idea: modify the WAL record for multixact
> creation, so that the offset of the next multixact is transmitted and
> can be replayed.  (We know it when we create each multixact, because the
> number of members is known).  So the replica can store the offset of the
> next multixact right away, even though it doesn't know the members for
> that multixact.  On replay of the next multixact we can cross-check that
> the offset matches what we had written previously.  This allows reading
> the first multixact, without having to wait for the replay of creation
> of the second multixact.
>
> One concern is: if we write the offset for the second mxact, but haven't
> written its members, what happens if another process looks up the
> members for that multixact?  We'll have to make it wait (retry) somehow.
> Given what was described upthread, it's possible for the multixact
> beyond that one to be written already, so we won't have the zero offset
> that would make us wait.

We always redo Multixact creation before it is visible anywhere in the heap.
The problem was that to read a Multi we might need another Multi's offset, and that multi did not happen to be WAL-logged yet.
However, I think we do not need to read a multi before it is redone.

>
> Anyway, he's going to try and implement this.
>
> Andrey, please let me know if I misunderstood the idea.


Please find attached a dirty test and a sketch of the fix. It is done against PG 16; I wanted to ensure that the problem is reproducible before 17.


Best regards, Andrey Borodin.

Attachments

Re: IPC/MultixactCreation on the Standby server

From: Andrey Borodin

> On 18 Jul 2025, at 18:53, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> Please find attached a dirty test and a sketch of the fix. It is done against PG 16; I wanted to ensure that the problem is reproducible before 17.

Here's v7 with improved comments and a cross-check for correctness.
Also, MultiXact wraparound is handled.
I'm planning to prepare tests and fixes for all supported branches, if there are no objections to this approach.


Best regards, Andrey Borodin.


Attachments

Re: IPC/MultixactCreation on the Standby server

From: Andrey Borodin

> On 21 Jul 2025, at 19:58, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> I'm planning to prepare tests and fixes for all supported branches

This is a status update message. I've reproduced the problem on REL_13_STABLE and verified that the proposed fix works there.

Also I've discovered one more serious problem.
If a backend crashes just before WAL-logging a multi, any heap tuple that uses this multi will become unreadable. Any attempt to read it will hang forever.

I've reproduced the problem and now I'm working on scripting this scenario. Basically, I modify the code to hang forever after assigning multi number 2. Then execute in the first psql session:

create table x as select i,0 v from generate_series(1,10) i;
create unique index on x(i);

\set id 1
begin;
select * from x where i = :id for no key update;
savepoint s1;
update x set v = v+1 where i = :id; -- multi 1
commit;

\set id 2
begin;
select * from x where i = :id for no key update;
savepoint s1;
update x set v = v+1 where i = :id; -- multi 2 -- will hang
commit;

Then in the second psql session:

create table y as select i,0 v from generate_series(1,10) i;
create unique index on y(i);

\set id 1
begin;
select * from y where i = :id for no key update;
savepoint s1;
update y set v = v+1 where i = :id;
commit;

After this I pkill -9 postgres. The recovered installation cannot execute select * from x; because multi 1 cannot be read without recovery of multi 2, which was never logged.


Luckily the fix is the same: just restore the offset of multi 2 when multi 1 is recovered.
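For readers following the thread, the shape of that fix is roughly the 
following (a sketch only; variable names are indicative, and the real patch 
also has to handle SLRU page boundaries and wraparound):

     /*
      * When recording multixact "multi" (starting offset "offset", "nmembers"
      * members), also fill in the offsets-SLRU entry for multi + 1, which is
      * already known to be offset + nmembers.  Then multi can be read without
      * waiting for multi + 1 to be created and WAL-logged.
      */
     offptr[entryno] = offset;                         /* existing behaviour */
     next_offptr[next_entryno] = offset + nmembers;    /* the added step */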


Best regards, Andrey Borodin.


Re: IPC/MultixactCreation on the Standby server

From: Álvaro Herrera
On 2025-Jul-25, Andrey Borodin wrote:

> Also I've discovered one more serious problem.
> If a backend crashes just before WAL-logging multi, any heap tuple
> that uses this multi will become unreadable. Any attempt to read it
> will hang forever.
> 
> I've reproduced the problem and now I'm working on scripting this
> scenario. Basically, I modify code to hang forever after assigning
> multi number 2.

It took me a minute to understand this, and I think your description is
slightly incorrect: you mean that the heap tuple that uses the PREVIOUS
multixact cannot be read (at least, that's what I understand from your
reproducer script).  I agree it's a pretty ugly bug!  I think it's
essentially the same bug as the other problem, so the proposed fix
should solve both.

Thanks for working on this!

Looking at this,

        /*
         * We want to avoid edge case 2 in redo, because we cannot wait for
         * startup process in GetMultiXactIdMembers() without risk of a
         * deadlock.
         */
        MultiXactId next = multi + 1;
        int         next_pageno;

        /* Handle wraparound as GetMultiXactIdMembers() does it. */
        if (multi < FirstMultiXactId)
            multi = FirstMultiXactId;

Don't you mean to test and change the value 'next' rather than 'multi'
here?

In this bit,

                 * We do not need to handle race conditions, because this code
                 * is only executed in redo and we hold
                 * MultiXactOffsetSLRULock.

I think it'd be good to have an
Assert(LWLockHeldByMeInMode(MultiXactOffsetSLRULock, LW_EXCLUSIVE));
just for peace of mind.  Also, commit c61678551699 removed
ZeroMultiXactOffsetPage(), but since you have 'false' as the second
argument, SimpleLruZeroPage() is enough.  (I wondered why WAL-logging
isn't necessary ... until I remembered that we're on a standby.  I
think a simple comment here like "no WAL-logging because we're a
standby" should suffice.)


-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/



Re: IPC/MultixactCreation on the Standby server

From: Andrey Borodin

> On 26 Jul 2025, at 22:44, Álvaro Herrera <alvherre@kurilemu.de> wrote:
>
> On 2025-Jul-25, Andrey Borodin wrote:
>
>> Also I've discovered one more serious problem.
>> If a backend crashes just before WAL-logging multi, any heap tuple
>> that uses this multi will become unreadable. Any attempt to read it
>> will hang forever.
>>
>> I've reproduced the problem and now I'm working on scripting this
>> scenario. Basically, I modify code to hang forever after assigning
>> multi number 2.
>
> It took me a minute to understand this, and I think your description is
> slightly incorrect: you mean that the heap tuple that uses the PREVIOUS
> multixact cannot be read (at least, that's what I understand from your
> reproducer script).
Yes, I explained it a bit incorrectly, but you understood the problem correctly.

>
> Looking at this,
>
>        /*
>         * We want to avoid edge case 2 in redo, because we cannot wait for
>         * startup process in GetMultiXactIdMembers() without risk of a
>         * deadlock.
>         */
>        MultiXactId next = multi + 1;
>        int         next_pageno;
>
>        /* Handle wraparound as GetMultiXactIdMembers() does it. */
>        if (multi < FirstMultiXactId)
>            multi = FirstMultiXactId;
>
> Don't you mean to test and change the value 'next' rather than 'multi'
> here?
Yup, that was a typo.

>
> In this bit,
>
>                 * We do not need to handle race conditions, because this code
>                 * is only executed in redo and we hold
>                 * MultiXactOffsetSLRULock.
>
> I think it'd be good to have an
> Assert(LWLockHeldByMeInMode(MultiXactOffsetSLRULock, LW_EXCLUSIVE));
> just for peace of mind.
Ugh, that uncovered a 17+ problem: now we hold a couple of locks simultaneously. I'll post a version with this a bit later.

> Also, commit c61678551699 removed
> ZeroMultiXactOffsetPage(), but since you have 'false' as the second
> argument, then SimpleLruZeroPage() is enough.  (I wondered why isn't
> WAL-logging necessary ... until I remember that we're in a standby.  I
> think a simple comment here like "no WAL-logging because we're a
> standby" should suffice.)

Agree.


I've made a test [0] and discovered another problem. Adding this checkpoint breaks the test [1] even after a fix [2].
I suspect that excluding "edge case 2" on the standby is simply not enough... we have to do this "next offset" dance on the primary too. I'll think more about other options.


Best regards, Andrey Borodin.
[0] https://github.com/x4m/postgres_g/commit/eafcaec7aafde064b0da5d2ba4041ed2fb134f07
[1] https://github.com/x4m/postgres_g/commit/da762c7cac56eff1988ea9126171ca0a6d2665e9
[2] https://github.com/x4m/postgres_g/commit/d64c17d697d082856e5fe8bd52abafc0585973af

The timeline of these commits can be seen here https://github.com/x4m/postgres_g/commits/mx19/


Re: IPC/MultixactCreation on the Standby server

From: Andrey Borodin

> On 27 Jul 2025, at 16:53, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> we have to do this "next offset" dance on Primary too.

PFA a draft of this.
I also attach a version for PG17; maybe Dmitry could try to reproduce the problem with this patch. I think the problem should be fixed by the patch.

Thanks!


Best regards, Andrey Borodin.

Attachments

Re: IPC/MultixactCreation on the Standby server

From: Dmitry
I'm resending this message; the previous copy turned out to have poor 
formatting, sorry.

On 28.07.2025 15:49, Andrey Borodin wrote:
> I also attach a version for PG17, maybe Dmitry could try to reproduce 
> the problem with this patch.
Andrey, thank you very much for your work, and also thanks to Álvaro for 
joining the discussion on the problem.
I ran tests on PG17 with patch v8; there are no more sessions hanging on 
the replica, great!

Replica requests are canceled with recovery conflicts.

ERROR: canceling statement due to conflict with recovery
DETAIL: User was holding shared buffer pin for too long.
STATEMENT: select sum(val) from tbl2;

or

ERROR: canceling statement due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be 
removed.
STATEMENT: select sum(val) from tbl2;

But on the master, some of the requests then fail with an error; 
apparently invalid multixacts remain in the pages.

ERROR: MultiXact 81926 has invalid next offset
STATEMENT: select * from tbl2 where id = $1 for no key update;

ERROR: MultiXact 81941 has invalid next offset
CONTEXT: while scanning block 3 offset 244 of relation "public.tbl2" 
automatic vacuum of table "postgres.public.tbl2"

Best regards,
Dmitry.





Re: IPC/MultixactCreation on the Standby server

From: Yura Sokolov

On 17.07.2025 21:34, Andrey Borodin wrote:
>> On 30 Jun 2025, at 15:58, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>> page_collect_tuples() holds a lock on the buffer while examining tuple visibility, having InterruptHoldoffCount > 0. The tuple visibility check might need WAL to go on; we have to wait until some next MX is filled in.
>> Which might need a buffer lock or have a snapshot conflict with the caller of page_collect_tuples().
> 
> Thinking more about the problem I see 3 ways to deal with this deadlock:
> 2. Teach page_collect_tuples() to do HeapTupleSatisfiesVisibility() without holding the buffer lock.
> 
> Personally, I see point 2 as very invasive in code that I'm not too familiar with.

If there were no SetHintBits inside of HeapTupleSatisfies*, then it could
be just "copy line pointers and tuple headers under lock, release the lock,
check tuple visibility using the copied arrays".

But hint bits make it much more difficult.

Probably, the tuple headers could be copied twice and compared afterwards. If
there is a change in hint bits, the page should be relocked.

And the call to MarkBufferDirtyHint should be delayed.

A very dirty variant is attached. I've made it just for fun. It passes
'regress', 'isolation' and 'recovery'. But I didn't benchmark it.

-- 
regards
Yura Sokolov aka funny-falcon
Attachments

Re: IPC/MultixactCreation on the Standby server

From: Andrey Borodin

> On 29 Jul 2025, at 12:17, Dmitry <dsy.075@yandex.ru> wrote:
>
> But on the master, some of the requests then fail with an error; apparently invalid multixacts remain in the pages.

Thanks!

That's a bug in my patch. I do not understand it yet. I've reproduced it with your original workload.
Most of the errors I see are shallow (offset == 0 or nextOffset == 0), but this one is interesting:

TRAP: failed Assert("shared->page_number[slotno] == pageno && shared->page_status[slotno] == SLRU_PAGE_WRITE_IN_PROGRESS"), File: "slru.c", Line: 729, PID: 91085
0   postgres                            0x00000001032ea5ac ExceptionalCondition + 216
1   postgres                            0x0000000102af2784 SlruInternalWritePage + 700
2   postgres                            0x0000000102af14dc SimpleLruWritePage + 96
3   postgres                            0x0000000102ae89d4 RecordNewMultiXact + 576

So it makes me think that it's some kind of IO concurrency issue.
As expected, the error only persists if the "extend SLRU" branch is active in RecordNewMultiXact().

Thanks for testing!


Best regards, Andrey Borodin.


Re: IPC/MultixactCreation on the Standby server

От
Andrey Borodin
Дата:

> On 29 Jul 2025, at 23:15, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> I do not understand it yet.

OK, I figured it out. SimpleLruDoesPhysicalPageExist() was reading a physical file and could race with real extension by ExtendMultiXactOffset().
So I used ExtendMultiXactOffset(actual + 1). I hope this does not open a loop for wraparound...

Here are two updated patches, one for Postgres 17 and one for master (with a test).


Best regards, Andrey Borodin.



Attachments

Re: IPC/MultixactCreation on the Standby server

From: Dmitry
On 31.07.2025 09:29, Andrey Borodin wrote:
> Here are two updated patches, one for Postgres 17 and one for master (with a test).
I ran tests on PG17 with patch v9.

I tried to reproduce it for three cases: the first explicitly using 
for key share, the second through subtransactions, 
and the third through implicit use of for key share via a foreign key.

There are no more sessions hanging on the replica, that's great!

Thank you all very much!

Best regards,
Dmitry.




Re: IPC/MultixactCreation on the Standby server

From: Kirill Reshke
On Thu, 31 Jul 2025 at 11:29, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> > On 29 Jul 2025, at 23:15, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> >
> > I do not understand it yet.
>
> OK, I figured it out. SimpleLruDoesPhysicalPageExist() was reading a physical file and could race with real extension by ExtendMultiXactOffset().
> So I used ExtendMultiXactOffset(actual + 1). I hope this does not open a loop for wraparound...
>
> Here are two updated patches, one for Postgres 17 and one for master (with a test).

Hi!

+ /*
+ * We might have filled this offset previosuly.
+ * Cross-check for correctness.
+ */
+ Assert((*offptr == 0) || (*offptr == offset));

Should we exit here with errcode(ERRCODE_DATA_CORRUPTED) if *offptr !=
0 and *offptr != offset?


+ /* Read and adjust next page */
+ next_slotno = SimpleLruReadPage(MultiXactOffsetCtl, next_pageno, true, next);
+ next_offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[next_slotno];
+ next_offptr[next_entryno] = offset + nmembers;

Should we check that the value of next_offptr[next_entryno] is equal to
zero or offset + nmembers? An Assert or
errcode(ERRCODE_DATA_CORRUPTED) as well.



-- 
Best regards,
Kirill Reshke



Re: IPC/MultixactCreation on the Standby server

From: Andrey Borodin
Hi Kirill, thanks for looking into this!

> On 20 Aug 2025, at 12:19, Kirill Reshke <reshkekirill@gmail.com> wrote:
>
> + /*
> + * We might have filled this offset previosuly.
> + * Cross-check for correctness.
> + */
> + Assert((*offptr == 0) || (*offptr == offset));
>
> Should we exit here with errcode(ERRCODE_DATA_CORRUPTED) if *offptr !=
> 0 and *offptr != offset?

No, we should not exit. We encountered inconsistencies that we are fully prepared to fix. But you are right - we should better emit a WARNING with XX001.
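A sketch of what that could look like (the message wording and exact 
placement are illustrative, not the actual patch):

     /*
      * Cross-check the previously written offset; warn with SQLSTATE XX001
      * instead of Assert'ing, as suggested above.
      */
     if (*offptr != 0 && *offptr != offset)
         ereport(WARNING,
                 (errcode(ERRCODE_DATA_CORRUPTED),
                  errmsg("offset %u of multixact %u does not match previously written offset %u",
                         offset, multi, *offptr)));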

> + /* Read and adjust next page */
> + next_slotno = SimpleLruReadPage(MultiXactOffsetCtl, next_pageno, true, next);
> + next_offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[next_slotno];
> + next_offptr[next_entryno] = offset + nmembers;
>
> should we check the value of next_offptr[next_entryno] to be equal to
> zero or offset + nmembers? Assert or
> errcode(ERRCODE_DATA_CORRUPTED) also.

Yes, we'd better WARN the user here.

Thanks for your valuable suggestions. I'm not sending a new version of the patch because I'm waiting for input on the overall design from Alvaro or any committer willing to fix this. We need to figure out whether this radical approach is acceptable to backpatch. I do not see other options, but someone might have more clever ideas.


Best regards, Andrey Borodin.