Discussion: Instability of phycodurus in pg_upgrade tests with JIT
Hi all,

I have spotted a couple of buildfarm failures for buildfarm member
phycodurus on REL_14_STABLE and REL_13_STABLE:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus&dt=2025-10-15%2009%3A12%3A36

These are sporadic, pointing at a backtrace with JIT in some cases on
REL_13_STABLE. Short extract with a broken free:

#18 0x00007fd9eed38242 in llvm::AttributeSet::addAttribute (this=0x7fff3ec6af40, C=..., Indices=..., A=...) at /home/bf/src/llvm-project-3.9/llvm/lib/IR/Attributes.cpp:882
#19 0x00007fd9eee51d42 in llvm::Function::addAttribute (this=0x55714a9a01e8, i=4294967295, Attr=...) at /home/bf/src/llvm-project-3.9/llvm/lib/IR/Function.cpp:377
#20 0x00007fd9eedbf113 in LLVMAddAttributeAtIndex (F=0x55714a9a01e8, Idx=4294967295, A=0x55714a3f82e0) at /home/bf/src/llvm-project-3.9/llvm/lib/IR/Core.cpp:1845
#21 0x00007fd9fbe2b393 in llvm_copy_attributes_at_index (v_from=v_from@entry=0x55714a34ab28, v_to=v_to@entry=0x55714a9a01e8, index=index@entry=4294967295) at /home/bf/bf-build/phycodurus/REL_13_STABLE/pgsql.build/../pgsql/src/backend/jit/llvm/llvmjit.c:551
#22 0x00007fd9fbe2c2df in llvm_copy_attributes (v_from=0x55714a34ab28, v_to=v_to@entry=0x55714a9a01e8) at /home/bf/bf-build/phycodurus/REL_13_STABLE/pgsql.build/../pgsql/src/backend/jit/llvm/llvmjit.c:566
#23 0x00007fd9fbe34b28 in llvm_compile_expr (state=0x55714a3a80b8) at /home/bf/bf-build/phycodurus/REL_13_STABLE/pgsql.build/../pgsql/src/backend/jit/llvm/llvmjit_expr.c:158
#24 0x00005571479f5448 in jit_compile_expr (state=state@entry=0x55714a3a80b8) at /home/bf/bf-build/phycodurus/REL_13_STABLE/pgsql.build/../pgsql/src/backend/jit/jit.c:177

REL_14_STABLE points at a crash, without a backtrace. It looks like only
this host is seeing such failures for the upgrade test, for only these
two branches. Is that something we'd better act on even for v13, which
is going to be EOL soon?

Thanks,
--
Michael
Michael Paquier <michael@paquier.xyz> writes:
> I have spotted a couple of buildfarm failures for buildfarm member
> phycodurus on REL_14_STABLE and REL_13_STABLE:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus&dt=2025-10-15%2009%3A12%3A36

phycodurus seems to be running a remarkably ancient LLVM version.
I wonder if we should just write these off as "probably an LLVM bug".

			regards, tom lane
Hello Tom and Michael,
16.10.2025 02:39, Tom Lane wrote:
> Michael Paquier <michael@paquier.xyz> writes:
>> I have spotted a couple of buildfarm failures for buildfarm member
>> phycodurus on REL_14_STABLE and REL_13_STABLE:
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus&dt=2025-10-15%2009%3A12%3A36
>
> phycodurus seems to be running a remarkably ancient LLVM version.
> I wonder if we should just write these off as "probably an LLVM bug".
I collected all of such failures here:
https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#check-pg_upgrade_fails_on_LLVM-enabled_animals_due_to_double_free_or_corruption
Masao-san was going to dig into that:
https://www.postgresql.org/message-id/CAHGQGwFcjccSYX+Ap8meEbCccUei-B4tmYsBFu4wMEixKi90fQ@mail.gmail.com
Best regards,
Alexander
On Thu, Oct 16, 2025 at 10:00:00PM +0300, Alexander Lakhin wrote:
> I collected all of such failures here:
> https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#check-pg_upgrade_fails_on_LLVM-enabled_animals_due_to_double_free_or_corruption
>
> Masao-san was going to dig into that:
> https://www.postgresql.org/message-id/CAHGQGwFcjccSYX+Ap8meEbCccUei-B4tmYsBFu4wMEixKi90fQ@mail.gmail.com

Good to know. Thanks for the information, Alexander.
--
Michael
On Fri, Oct 17, 2025 at 8:32 AM Michael Paquier <michael@paquier.xyz> wrote:
> On Thu, Oct 16, 2025 at 10:00:00PM +0300, Alexander Lakhin wrote:
>> I collected all of such failures here:
>> https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#check-pg_upgrade_fails_on_LLVM-enabled_animals_due_to_double_free_or_corruption
>>
>> Masao-san was going to dig into that:
>> https://www.postgresql.org/message-id/CAHGQGwFcjccSYX+Ap8meEbCccUei-B4tmYsBFu4wMEixKi90fQ@mail.gmail.com

I tried that briefly, but unfortunately I still have no idea what caused
this failure or what triggered the double-free issue shown below…

-----------------------------------
[New LWP 978394]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/home/bf/bf-build/petalura/REL_13_STABLE/pgsql.build/tmp_install/home/bf/bf-build/petalura/REL_13_STABLE/inst/bin/postgres '' '' '' '' '''.
Program terminated with signal SIGABRT, Aborted.
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
warning: 44	./nptl/pthread_kill.c: No such file or directory
#1  0x00007f6b19e9e9ff in __pthread_kill_internal (threadid=<optimized out>, signo=6) at ./nptl/pthread_kill.c:89
#2  0x00007f6b19e49cc2 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007f6b19e324ac in __GI_abort () at ./stdlib/abort.c:73
#4  0x00007f6b19e33291 in __libc_message_impl (fmt=fmt@entry=0x7f6b19fb532d "%s\n") at ../sysdeps/posix/libc_fatal.c:134
#5  0x00007f6b19ea8465 in malloc_printerr (str=str@entry=0x7f6b19fb86f8 "double free or corruption (!prev)") at ./malloc/malloc.c:5829
#6  0x00007f6b19eaa56c in _int_free_merge_chunk (av=av@entry=0x7f6b19ff1ac0 <main_arena>, p=p@entry=0xfba29e0, size=272) at ./malloc/malloc.c:4721
#7  0x00007f6b19eaa6c6 in _int_free_chunk (av=av@entry=0x7f6b19ff1ac0 <main_arena>, p=p@entry=0xfba29e0, size=<optimized out>, have_lock=<optimized out>, have_lock@entry=0) at ./malloc/malloc.c:4667
#8  0x00007f6b19ead3c0 in _int_free (av=0x7f6b19ff1ac0 <main_arena>, p=0xfba29e0, have_lock=0) at ./malloc/malloc.c:4699
#9  __GI___libc_free (mem=<optimized out>) at ./malloc/malloc.c:3476
#10 0x00007f6b1a29053c in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#11 0x00007f6b1a290574 in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#12 0x00007f6b1b2b7fc2 in _dl_call_fini (closure_map=closure_map@entry=0x7f6b1ae49660) at ./elf/dl-call_fini.c:43
#13 0x00007f6b1b2bae72 in _dl_fini () at ./elf/dl-fini.c:120
#14 0x00007f6b19e4c291 in __run_exit_handlers (status=0, listp=0x7f6b19ff1680 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at ./stdlib/exit.c:118
#15 0x00007f6b19e4c35a in __GI_exit (status=<optimized out>) at ./stdlib/exit.c:148
#16 0x000000000078d80c in proc_exit (code=0) at /home/bf/bf-build/petalura/REL_13_STABLE/pgsql.build/../pgsql/src/backend/storage/ipc/ipc.c:156
#17 0x00000000007b44e1 in PostgresMain (argc=1, argv=<optimized out>, dbname=<optimized out>, username=<optimized out>) at /home/bf/bf-build/petalura/REL_13_STABLE/pgsql.build/../pgsql/src/backend/tcop/postgres.c:4604
#18 0x000000000073498b in BackendRun (port=0xf8562a0) at /home/bf/bf-build/petalura/REL_13_STABLE/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:4561
#19 0x0000000000734337 in BackendStartup (port=<optimized out>) at /home/bf/bf-build/petalura/REL_13_STABLE/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:4245
#20 0x0000000000733b33 in ServerLoop () at /home/bf/bf-build/petalura/REL_13_STABLE/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:1744
#21 0x0000000000731e47 in PostmasterMain (argc=<optimized out>, argv=<optimized out>) at /home/bf/bf-build/petalura/REL_13_STABLE/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:1417
#22 0x0000000000693d89 in main (argc=6, argv=0xf7d90c0) at /home/bf/bf-build/petalura/REL_13_STABLE/pgsql.build/../pgsql/src/backend/main/main.c:212
$1 = {si_signo = 6, si_errno = 0, si_code = -6, _sifields = {_pad = {978394, 1000, 0 <repeats 26 times>}, _kill = {si_pid = 978394, si_uid = 1000}, _timer = {si_tid = 978394, si_overrun = 1000, si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _rt = {si_pid = 978394, si_uid = 1000, si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _sigchld = {si_pid = 978394, si_uid = 1000, si_status = 0, si_utime = 0, si_stime = 0}, _sigfault = {si_addr = 0x3e8000eedda}, _sigpoll = {si_band = 4294968274394, si_fd = 0}, _sigsys = {_call_addr = 0x3e8000eedda, _syscall = 0, _arch = 0}}}

Regards,
--
Fujii Masao
Hello Andres,
17.10.2025 08:21, Fujii Masao wrote:
> On Fri, Oct 17, 2025 at 8:32 AM Michael Paquier <michael@paquier.xyz> wrote:
>> On Thu, Oct 16, 2025 at 10:00:00PM +0300, Alexander Lakhin wrote:
>>> I collected all of such failures here:
>>> https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#check-pg_upgrade_fails_on_LLVM-enabled_animals_due_to_double_free_or_corruption
>>>
>>> Masao-san was going to dig into that:
>>> https://www.postgresql.org/message-id/CAHGQGwFcjccSYX+Ap8meEbCccUei-B4tmYsBFu4wMEixKi90fQ@mail.gmail.com
>
> I tried that briefly, but unfortunately I still have no idea what caused
> this failure or what triggered the double-free issue shown below…
I've been trying to reproduce the issue locally for several days, with
clang 3.9.0 and 4.0.1 compiled from sources with -DCMAKE_BUILD_TYPE=Debug
-DLLVM_ENABLE_ASSERTIONS=ON, running the buildfarm client (TestUpgrade) on
four different x86_64 systems (Debian and Ubuntu, though not the latest
versions), without a single failure so far.
(I've re-created the config from petalura/phycodurus: 'jit=1',
'jit_above_cost=0', 'jit_optimize_above_cost=1000'... and also tried
jit_optimize_above_cost=0...)
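Spelled out as a postgresql.conf fragment, the settings I used are (my paraphrase of the animals' extra_config; all three GUCs exist in these branches):

```
jit = on                         # 'jit=1' in the buildfarm extra_config
jit_above_cost = 0               # JIT-compile every plan
jit_optimize_above_cost = 1000   # also tried 0, i.e. always optimize
```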
I also tried to invoke a double free with a simple program and confirmed
that the double free is detected and the program aborted.
So if I have re-created all the conditions (based on the buildfarm logs)
correctly, the several hundred runs I performed should have been enough
to reproduce the issue; perhaps there is something specific to those
animals (petalura, phycodurus, desmoxytes, dragonet)... Maybe a buggy
libc update was installed there in September?
Meanwhile we've got a failure at stage Check (not pg_upgradeCheck), with a
release LLVM build [1]:
2025-10-21 17:15:16.261 CEST [1489783][client backend][:0] LOG: disconnection: session time: 0:00:03.177 user=bf database=regression host=[local]
corrupted size vs. prev_size while consolidating
Thus the initial suspicion that the issue was caused by dff7591a7 (the
first failure [2] happened right after it) now seems wrong.
Maybe you have an insight into the possible cause of these memory errors?
[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2025-10-21%2015%3A14%3A12
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus&dt=2025-09-16%2011%3A09%3A07
Best regards,
Alexander
Hi,

On 2025-10-15 19:39:03 -0400, Tom Lane wrote:
> Michael Paquier <michael@paquier.xyz> writes:
>> I have spotted a couple of buildfarm failures for buildfarm member
>> phycodurus on REL_14_STABLE and REL_13_STABLE:
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus&dt=2025-10-15%2009%3A12%3A36
>
> phycodurus seems to be running a remarkably ancient LLVM version.

It intentionally tests the oldest supported version... If we don't care,
I'm happy enough to just remove the animal.

> I wonder if we should just write these off as "probably an LLVM bug".

I'm not sure that's really convincing, given that REL_16_STABLE seems to
not have an issue?

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2025-10-15 19:39:03 -0400, Tom Lane wrote:
>> phycodurus seems to be running a remarkably ancient LLVM version.
> It intentionally tests the oldest supported version... If we don't care, I'm
> happy enough to just remove the animal.
Sure, we'd need to change our docs about the oldest supported LLVM
version if we go that way.
>> I wonder if we should just write these off as "probably an LLVM bug".
> I'm not sure that's really convincing, given that REL_16_STABLE seems to not
> have an issue?
The other side of that coin is that no other LLVM-using animal is
showing similar instability. Sure, it's plausible that we changed
something in v15 or so that stopped the problem, but is it worth the
effort to try to find out what? And if we did find it, would we
care to risk back-porting it?
(If you want to research this, I'm not standing in the way.
But I think there are better uses for your time.)
regards, tom lane
Hello Tom and Andres,
25.10.2025 00:31, Tom Lane wrote:
> Sure, we'd need to change our docs about the oldest supported LLVM
> version if we go that way.
>
>>> I wonder if we should just write these off as "probably an LLVM bug".
As I wrote upthread [1], I could not reproduce the issue with the same
old LLVM versions.
>> I'm not sure that's really convincing, given that REL_16_STABLE seems to
>> not have an issue?
>
> The other side of that coin is that no other LLVM-using animal is
> showing similar instability.  Sure, it's plausible that we changed
> something in v15 or so that stopped the problem, but is it worth the
> effort to try to find out what?  And if we did find it, would we
> care to risk back-porting it?
My collection [2] also contains reports from other animals: petalura,
desmoxytes, dragonet.
> (If you want to research this, I'm not standing in the way.
> But I think there are better uses for your time.)
I wanted to research this, but failed, to my disappointment.
[1] https://www.postgresql.org/message-id/563ee5af-8ee2-484f-b50a-1c8fbdd16171%40gmail.com
[2] https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#check-pg_upgrade_fails_on_LLVM-enabled_animals_due_to_double_free_or_corruption
Best regards,
Alexander
Alexander Lakhin <exclusion@gmail.com> writes:
> 25.10.2025 00:31, Tom Lane wrote:
>> The other side of that coin is that no other LLVM-using animal is
>> showing similar instability.
> My collection [2] contains also reports from other animals: petalura,
> desmoxytes, dragonet.
Hmm ... but none of those are running any LLVM newer than 4.0.1
(obsolete since 2017).
regards, tom lane