Обсуждение: suspicious lockup on widowbird in AdvanceXLInsertBuffer (could it be due to 6a2275b8953?)
Hi,
I have noticed one of my buildfarm machines - widowbird - did not report
any results since February 17. And it seems to be stuck somewhere in
amcheck:
$ ps ax | grep postgres
1180067 ? Ss 0:02
/mnt/data/buildfarm/buildroot/HEAD/inst/bin/postgres -D data-C
1180069 ? Ss 0:00 postgres: checkpointer
1180070 ? Ss 0:00 postgres: background writer
1180072 ? Ss 0:00 postgres: walwriter
1180073 ? Ss 0:01 postgres: autovacuum launcher
1180074 ? Ss 0:00 postgres: logical replication launcher
1180107 ? Ss 0:05 postgres: buildfarm
contrib_regression_amcheck [local] INSERT
1180111 ? Ss 0:00 postgres: autovacuum worker
1180134 ? Ss 0:00 postgres: autovacuum worker
1180135 ? Ss 0:00 postgres: autovacuum worker
1374029 pts/0 S+ 0:00 grep --color=auto postgres
So there's PID 1180107, executing an insert, but not progressing. The
backtrace looks like this (first couple lines, full backtrace attached):
#0 0x0000007fa64b8ddc in __GI_epoll_pwait (epfd=5, events=0x55ad6285a8,
maxevents=1, timeout=timeout@entry=-1, set=set@entry=0x0) at
../sysdeps/unix/sysv/linux/epoll_pwait.c:42
#1 0x0000007fa64b8fe8 in epoll_wait (epfd=<optimized out>,
events=<optimized out>, maxevents=<optimized out>, timeout=timeout@entry=-1)
at ../sysdeps/unix/sysv/linux/epoll_wait.c:32
#2 0x000000558f043588 in WaitEventSetWaitBlock (nevents=1,
occurred_events=0x7ff8ed4e18, cur_timeout=-1, set=0x55ad628540) at
latch.c:1571
#3 WaitEventSetWait (set=0x55ad628540, timeout=timeout@entry=-1,
occurred_events=occurred_events@entry=0x7ff8ed4e18,
nevents=nevents@entry=1,
wait_event_info=wait_event_info@entry=134217781) at latch.c:1519
#4 0x000000558f043778 in WaitLatch (latch=<optimized out>,
wakeEvents=wakeEvents@entry=33, timeout=timeout@entry=-1,
wait_event_info=wait_event_info@entry=134217781)
at latch.c:538
#5 0x000000558f052274 in ConditionVariableTimedSleep (cv=0x7f9ac9deb0,
timeout=timeout@entry=-1,
wait_event_info=wait_event_info@entry=134217781) at condition_variable.c:163
#6 0x000000558f05286c in ConditionVariableTimedSleep
(wait_event_info=134217781, timeout=-1, cv=<optimized out>) at
condition_variable.c:135
#7 0x000000558ed2fc90 in AdvanceXLInsertBuffer
(upto=upto@entry=608174080, tli=tli@entry=1,
opportunistic=opportunistic@entry=false) at xlog.c:2224
So, it's stuck in AdvanceXLInsertBuffer ... interesting. Another
interesting fact is it's testing 75dfde13639, which is just a couple
commits after 6a2275b895:
commit 6a2275b8953a4462d44daf001bdd60b3d48f0946
Author: Alexander Korotkov <akorotkov@postgresql.org>
Date: Mon Feb 17 04:19:01 2025 +0200
Get rid of WALBufMappingLock
Allow multiple backends to initialize WAL buffers concurrently.
This way `MemSet((char *) NewPage, 0, XLOG_BLCKSZ);` can run in
parallel without taking a single LWLock in exclusive mode.
...
which reworked AdvanceXLInsertBuffer() quite a bit, it seems. OTOH the
last (successful) run on widorbird was on eaf502747b, which already
includes 6a2275b895, so maybe it's unrelated.
Is there something else I could collect from the stuck instance, before
I restart it?
regards
--
Tomas Vondra
Вложения
Re: suspicious lockup on widowbird in AdvanceXLInsertBuffer (could it be due to 6a2275b8953?)
От
Peter Geoghegan
Дата:
On Wed, Feb 26, 2025 at 5:09 PM Tomas Vondra <tomas@vondra.me> wrote: > So, it's stuck in AdvanceXLInsertBuffer ... interesting. Another > interesting fact is it's testing 75dfde13639, which is just a couple > commits after 6a2275b895: > which reworked AdvanceXLInsertBuffer() quite a bit, it seems. OTOH the > last (successful) run on widorbird was on eaf502747b, which already > includes 6a2275b895, so maybe it's unrelated. Are you aware that Alexander reverted this only a day later, in commit 3fb58625? -- Peter Geoghegan
On 2/26/25 23:13, Peter Geoghegan wrote: > On Wed, Feb 26, 2025 at 5:09 PM Tomas Vondra <tomas@vondra.me> wrote: >> So, it's stuck in AdvanceXLInsertBuffer ... interesting. Another >> interesting fact is it's testing 75dfde13639, which is just a couple >> commits after 6a2275b895: > >> which reworked AdvanceXLInsertBuffer() quite a bit, it seems. OTOH the >> last (successful) run on widorbird was on eaf502747b, which already >> includes 6a2275b895, so maybe it's unrelated. > > Are you aware that Alexander reverted this only a day later, in commit 3fb58625? > Oh, I forgot about that. I guess if the theory is the commit might be causing this, and it's been already reverted, I should just restart the instance, so that it tests with the revert. regards -- Tomas Vondra
Re: suspicious lockup on widowbird in AdvanceXLInsertBuffer (could it be due to 6a2275b8953?)
От
Peter Geoghegan
Дата:
On Wed, Feb 26, 2025 at 5:16 PM Tomas Vondra <tomas@vondra.me> wrote: > Oh, I forgot about that. I guess if the theory is the commit might be > causing this, and it's been already reverted, I should just restart the > instance, so that it tests with the revert. According to the revert commit's commit message there was also problems on batta at the time. Might have been the same issue that you saw on widowbird. Alexander? -- Peter Geoghegan
Re: suspicious lockup on widowbird in AdvanceXLInsertBuffer (could it be due to 6a2275b8953?)
От
Michael Paquier
Дата:
On Wed, Feb 26, 2025 at 05:23:51PM -0500, Peter Geoghegan wrote: > According to the revert commit's commit message there was also > problems on batta at the time. Might have been the same issue that you > saw on widowbird. Alexander? This problem has already been discussed, and I saw the same backtrace as you do on my hosts: https://www.postgresql.org/message-id/Z7PSeT7AbrjmiCCV@paquier.xyz It is not the first time where we have a problematic patch causing some animals to freeze, so I would recommend using a wait_timeout => 43200 or similar in the configuration of your buildfarm animals so as they would unfreeze themselves without the need for a manual intervention. This setting has saved me quite a few maintenance cycles. -- Michael
Вложения
Re: suspicious lockup on widowbird in AdvanceXLInsertBuffer (could it be due to 6a2275b8953?)
От
Alexander Korotkov
Дата:
On Thu, Feb 27, 2025 at 12:24 AM Peter Geoghegan <pg@bowt.ie> wrote: > > On Wed, Feb 26, 2025 at 5:16 PM Tomas Vondra <tomas@vondra.me> wrote: > > Oh, I forgot about that. I guess if the theory is the commit might be > > causing this, and it's been already reverted, I should just restart the > > instance, so that it tests with the revert. > > According to the revert commit's commit message there was also > problems on batta at the time. Might have been the same issue that you > saw on widowbird. Alexander? Yes, the backtrace posted by Tomas from widowbird looks quite the same as posted by Michael from batta. My apologies for committing not well-tested patch. ------ Regards, Alexander Korotkov Supabase