Discussion: Instability of phycodorus in pg_upgrade tests with JIT

Instability of phycodorus in pg_upgrade tests with JIT

From
Michael Paquier
Date:
Hi all,

I have spotted a couple of buildfarm failures for buildfarm member
phycodurus on REL_14_STABLE and REL_13_STABLE:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus&dt=2025-10-15%2009%3A12%3A36

These failures are sporadic. In some cases on REL_13_STABLE they come
with a backtrace pointing at JIT.  Here is a short extract with a broken free:
#18 0x00007fd9eed38242 in llvm::AttributeSet::addAttribute
(this=0x7fff3ec6af40, C=..., Indices=..., A=...) at
/home/bf/src/llvm-project-3.9/llvm/lib/IR/Attributes.cpp:882
#19 0x00007fd9eee51d42 in llvm::Function::addAttribute
(this=0x55714a9a01e8, i=4294967295, Attr=...) at
/home/bf/src/llvm-project-3.9/llvm/lib/IR/Function.cpp:377
#20 0x00007fd9eedbf113 in LLVMAddAttributeAtIndex (F=0x55714a9a01e8,
Idx=4294967295, A=0x55714a3f82e0) at
/home/bf/src/llvm-project-3.9/llvm/lib/IR/Core.cpp:1845
#21 0x00007fd9fbe2b393 in llvm_copy_attributes_at_index
(v_from=v_from@entry=0x55714a34ab28, v_to=v_to@entry=0x55714a9a01e8,
index=index@entry=4294967295) at
/home/bf/bf-build/phycodurus/REL_13_STABLE/pgsql.build/../pgsql/src/backend/jit/llvm/llvmjit.c:551
#22 0x00007fd9fbe2c2df in llvm_copy_attributes (v_from=0x55714a34ab28,
v_to=v_to@entry=0x55714a9a01e8) at
/home/bf/bf-build/phycodurus/REL_13_STABLE/pgsql.build/../pgsql/src/backend/jit/llvm/llvmjit.c:566
#23 0x00007fd9fbe34b28 in llvm_compile_expr (state=0x55714a3a80b8) at
/home/bf/bf-build/phycodurus/REL_13_STABLE/pgsql.build/../pgsql/src/backend/jit/llvm/llvmjit_expr.c:158
#24 0x00005571479f5448 in jit_compile_expr
(state=state@entry=0x55714a3a80b8) at
/home/bf/bf-build/phycodurus/REL_13_STABLE/pgsql.build/../pgsql/src/backend/jit/jit.c:177

REL_14_STABLE shows a crash, but without a backtrace.  It looks like
only this host is seeing such failures in the upgrade test, and only on
these two branches.  Is that something we should act on, even for v13,
which is going to be EOL soon?

Thanks,
--
Michael

Re: Instability of phycodorus in pg_upgrade tests with JIT

From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
> I have spotted a couple of buildfarm failures for buildfarm member
> phycodurus on REL_14_STABLE and REL_13_STABLE:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus&dt=2025-10-15%2009%3A12%3A36

phycodurus seems to be running a remarkably ancient LLVM version.
I wonder if we should just write these off as "probably an LLVM bug".

            regards, tom lane



Re: Instability of phycodorus in pg_upgrade tests with JIT

From
Alexander Lakhin
Date:
Hello Tom and Michael,

16.10.2025 02:39, Tom Lane wrote:
> Michael Paquier <michael@paquier.xyz> writes:
>> I have spotted a couple of buildfarm failures for buildfarm member
>> phycodurus on REL_14_STABLE and REL_13_STABLE:
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus&dt=2025-10-15%2009%3A12%3A36
> phycodurus seems to be running a remarkably ancient LLVM version.
> I wonder if we should just write these off as "probably an LLVM bug".

I collected all such failures here:
https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#check-pg_upgrade_fails_on_LLVM-enabled_animals_due_to_double_free_or_corruption

Masao-san was going to dig into that:
https://www.postgresql.org/message-id/CAHGQGwFcjccSYX+Ap8meEbCccUei-B4tmYsBFu4wMEixKi90fQ@mail.gmail.com

Best regards,
Alexander

Re: Instability of phycodorus in pg_upgrade tests with JIT

From
Michael Paquier
Date:
On Thu, Oct 16, 2025 at 10:00:00PM +0300, Alexander Lakhin wrote:
> I collected all such failures here:
> https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#check-pg_upgrade_fails_on_LLVM-enabled_animals_due_to_double_free_or_corruption
>
> Masao-san was going to dig into that:
> https://www.postgresql.org/message-id/CAHGQGwFcjccSYX+Ap8meEbCccUei-B4tmYsBFu4wMEixKi90fQ@mail.gmail.com

Good to know.  Thanks for the information, Alexander.
--
Michael

Re: Instability of phycodorus in pg_upgrade tests with JIT

From
Fujii Masao
Date:
On Fri, Oct 17, 2025 at 8:32 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Oct 16, 2025 at 10:00:00PM +0300, Alexander Lakhin wrote:
> > I collected all such failures here:
> > https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#check-pg_upgrade_fails_on_LLVM-enabled_animals_due_to_double_free_or_corruption
> >
> > Masao-san was going to dig into that:
> > https://www.postgresql.org/message-id/CAHGQGwFcjccSYX+Ap8meEbCccUei-B4tmYsBFu4wMEixKi90fQ@mail.gmail.com

I tried that briefly, but unfortunately I still have no idea what caused
this failure or what triggered the double-free issue shown below…

-----------------------------------
[New LWP 978394]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/home/bf/bf-build/petalura/REL_13_STABLE/pgsql.build/tmp_install/home/bf/bf-build/petalura/REL_13_STABLE/inst/bin/postgres '' '' '' '' '''.
Program terminated with signal SIGABRT, Aborted.
#0  __pthread_kill_implementation (threadid=<optimized out>,
signo=signo@entry=6, no_tid=no_tid@entry=0) at
./nptl/pthread_kill.c:44

warning: 44 ./nptl/pthread_kill.c: No such file or directory
#0  __pthread_kill_implementation (threadid=<optimized out>,
signo=signo@entry=6, no_tid=no_tid@entry=0) at
./nptl/pthread_kill.c:44
#1  0x00007f6b19e9e9ff in __pthread_kill_internal (threadid=<optimized
out>, signo=6) at ./nptl/pthread_kill.c:89
#2  0x00007f6b19e49cc2 in __GI_raise (sig=sig@entry=6) at
../sysdeps/posix/raise.c:26
#3  0x00007f6b19e324ac in __GI_abort () at ./stdlib/abort.c:73
#4  0x00007f6b19e33291 in __libc_message_impl
(fmt=fmt@entry=0x7f6b19fb532d "%s\n") at
../sysdeps/posix/libc_fatal.c:134
#5  0x00007f6b19ea8465 in malloc_printerr
(str=str@entry=0x7f6b19fb86f8 "double free or corruption (!prev)") at
./malloc/malloc.c:5829
#6  0x00007f6b19eaa56c in _int_free_merge_chunk
(av=av@entry=0x7f6b19ff1ac0 <main_arena>, p=p@entry=0xfba29e0,
size=272) at ./malloc/malloc.c:4721
#7  0x00007f6b19eaa6c6 in _int_free_chunk (av=av@entry=0x7f6b19ff1ac0
<main_arena>, p=p@entry=0xfba29e0, size=<optimized out>,
have_lock=<optimized out>, have_lock@entry=0) at
./malloc/malloc.c:4667
#8  0x00007f6b19ead3c0 in _int_free (av=0x7f6b19ff1ac0 <main_arena>,
p=0xfba29e0, have_lock=0) at ./malloc/malloc.c:4699
#9  __GI___libc_free (mem=<optimized out>) at ./malloc/malloc.c:3476
#10 0x00007f6b1a29053c in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#11 0x00007f6b1a290574 in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#12 0x00007f6b1b2b7fc2 in _dl_call_fini
(closure_map=closure_map@entry=0x7f6b1ae49660) at
./elf/dl-call_fini.c:43
#13 0x00007f6b1b2bae72 in _dl_fini () at ./elf/dl-fini.c:120
#14 0x00007f6b19e4c291 in __run_exit_handlers (status=0,
listp=0x7f6b19ff1680 <__exit_funcs>,
run_list_atexit=run_list_atexit@entry=true,
run_dtors=run_dtors@entry=true) at ./stdlib/exit.c:118
#15 0x00007f6b19e4c35a in __GI_exit (status=<optimized out>) at
./stdlib/exit.c:148
#16 0x000000000078d80c in proc_exit (code=0) at
/home/bf/bf-build/petalura/REL_13_STABLE/pgsql.build/../pgsql/src/backend/storage/ipc/ipc.c:156
#17 0x00000000007b44e1 in PostgresMain (argc=1, argv=<optimized out>,
dbname=<optimized out>, username=<optimized out>) at
/home/bf/bf-build/petalura/REL_13_STABLE/pgsql.build/../pgsql/src/backend/tcop/postgres.c:4604
#18 0x000000000073498b in BackendRun (port=0xf8562a0) at
/home/bf/bf-build/petalura/REL_13_STABLE/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:4561
#19 0x0000000000734337 in BackendStartup (port=<optimized out>) at
/home/bf/bf-build/petalura/REL_13_STABLE/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:4245
#20 0x0000000000733b33 in ServerLoop () at
/home/bf/bf-build/petalura/REL_13_STABLE/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:1744
#21 0x0000000000731e47 in PostmasterMain (argc=<optimized out>,
argv=<optimized out>) at
/home/bf/bf-build/petalura/REL_13_STABLE/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:1417
#22 0x0000000000693d89 in main (argc=6, argv=0xf7d90c0) at
/home/bf/bf-build/petalura/REL_13_STABLE/pgsql.build/../pgsql/src/backend/main/main.c:212
$1 = {si_signo = 6, si_errno = 0, si_code = -6, _sifields = {_pad =
{978394, 1000, 0 <repeats 26 times>}, _kill = {si_pid = 978394, si_uid
= 1000}, _timer = {si_tid = 978394, si_overrun = 1000, si_sigval =
{sival_int = 0, sival_ptr = 0x0}}, _rt = {si_pid = 978394, si_uid =
1000, si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _sigchld =
{si_pid = 978394, si_uid = 1000, si_status = 0, si_utime = 0, si_stime
= 0}, _sigfault = {si_addr = 0x3e8000eedda}, _sigpoll = {si_band =
4294968274394, si_fd = 0}, _sigsys = {_call_addr = 0x3e8000eedda,
_syscall = 0, _arch = 0}}}

Regards,

--
Fujii Masao



Re: Instability of phycodorus in pg_upgrade tests with JIT

From
Alexander Lakhin
Date:
Hello Andres,

17.10.2025 08:21, Fujii Masao wrote:
> On Fri, Oct 17, 2025 at 8:32 AM Michael Paquier <michael@paquier.xyz> wrote:
>> On Thu, Oct 16, 2025 at 10:00:00PM +0300, Alexander Lakhin wrote:
>>> I collected all such failures here:
>>> https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#check-pg_upgrade_fails_on_LLVM-enabled_animals_due_to_double_free_or_corruption
>>>
>>> Masao-san was going to dig into that:
>>> https://www.postgresql.org/message-id/CAHGQGwFcjccSYX+Ap8meEbCccUei-B4tmYsBFu4wMEixKi90fQ@mail.gmail.com
> I tried that briefly, but unfortunately I still have no idea what caused
> this failure or what triggered the double-free issue shown below…

I've been trying to reproduce the issue locally for several days, with
clang 3.9.0 and 4.0.1 compiled from source with -DCMAKE_BUILD_TYPE=Debug
-DLLVM_ENABLE_ASSERTIONS=ON, running the buildfarm client (TestUpgrade) on
four different x86_64 systems (Debian and Ubuntu, though not the latest
versions), without a single failure so far.

(I re-created the configuration of petalura/phycodurus: 'jit=1',
'jit_above_cost=0', 'jit_optimize_above_cost=1000'...; I also tried
jit_optimize_above_cost=0.)

I also triggered a double free in a simple test program and confirmed
that it is detected and the program is aborted.
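
A check of that kind could look roughly like the following minimal C sketch. It is not the program used here; it assumes Linux with glibc, whose allocator reports heap corruption through malloc_printerr() and abort() (frames #3 to #5 in the backtrace above), and the helper name run_double_free_child() is hypothetical. The double free runs in a forked child so that the caller can inspect how the child terminated.

```c
/*
 * Sketch only: perform a deliberate double free in a forked child and report
 * how the child ended.  Assumes Linux + glibc, whose allocator is expected to
 * detect the second free() and call abort(), i.e. raise SIGABRT.
 */
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper.  Returns the signal number that terminated the child,
 * 0 if the child exited normally (e.g. the double free went unnoticed), or
 * -1 if fork() or waitpid() failed.
 */
int run_double_free_child(void)
{
    pid_t pid = fork();
    int status;

    if (pid < 0)
        return -1;

    if (pid == 0)
    {
        /* volatile keeps the compiler from reasoning about or dropping p */
        char *volatile p = malloc(64);

        if (p == NULL)
            _exit(2);
        free(p);
        free(p);        /* deliberate double free: undefined behaviour */
        _exit(0);       /* reached only if the allocator did not notice */
    }

    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

On a reasonably recent glibc, run_double_free_child() would be expected to return SIGABRT, with glibc printing a message such as "free(): double free detected in tcache 2" on stderr. Older glibc releases, other allocators or sanitizer builds may behave differently, which is why the result is inspected rather than assumed.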

So if I re-created all the conditions (based on the buildfarm logs)
correctly, the several hundred runs I performed should have been enough to
reproduce the issue, so probably there is something specific about those
animals (petalura, phycodurus, desmoxytes, dragonet). Maybe a buggy libc
update was installed there in September?
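
The "several hundred runs should be enough" argument can be made concrete with a simple model. Assuming each run fails independently with some fixed probability p (an assumption: the thread gives no failure rate, and the values of p and n below are purely illustrative), the chance of seeing no failure at all in n runs is:

```latex
P(\text{no failure in } n \text{ runs}) = (1 - p)^{n} \approx e^{-np},
\qquad
p = 0.01,\; n = 300 \;\Rightarrow\; 0.99^{300} \approx 0.049
```

Under that model, a few hundred clean runs would be fairly strong evidence against a per-run failure rate of around 1%, but would say little about much rarer failures.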

Meanwhile, we got a failure at the Check stage (not pg_upgradeCheck), with
a release build of LLVM [1]:
2025-10-21 17:15:16.261 CEST [1489783][client backend][:0] LOG:  disconnection: session time: 0:00:03.177 user=bf database=regression host=[local]
corrupted size vs. prev_size while consolidating

So the initial suspicion that the issue was caused by dff7591a7 (because
the first failure [2] happened right after it) now seems wrong.

Do you have any insight into the possible cause of these memory errors?

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2025-10-21%2015%3A14%3A12
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus&dt=2025-09-16%2011%3A09%3A07

Best regards,
Alexander

Re: Instability of phycodorus in pg_upgrade tests with JIT

From
Andres Freund
Date:
Hi,

On 2025-10-15 19:39:03 -0400, Tom Lane wrote:
> Michael Paquier <michael@paquier.xyz> writes:
> > I have spotted a couple of buildfarm failures for buildfarm member
> > phycodurus on REL_14_STABLE and REL_13_STABLE:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus&dt=2025-10-15%2009%3A12%3A36
> 
> phycodurus seems to be running a remarkably ancient LLVM version.

It intentionally tests the oldest supported version... If we don't care, I'm
happy enough to just remove the animal.


> I wonder if we should just write these off as "probably an LLVM bug".

I'm not sure that's really convincing, given that REL_16_STABLE seems to not
have an issue?

Greetings,

Andres Freund



Re: Instability of phycodorus in pg_upgrade tests with JIT

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2025-10-15 19:39:03 -0400, Tom Lane wrote:
>> phycodurus seems to be running a remarkably ancient LLVM version.

> It intentionally tests the oldest supported version... If we don't care, I'm
> happy enough to just remove the animal.

Sure, we'd need to change our docs about the oldest supported LLVM
version if we go that way.

>> I wonder if we should just write these off as "probably an LLVM bug".

> I'm not sure that's really convincing, given that REL_16_STABLE seems to not
> have an issue?

The other side of that coin is that no other LLVM-using animal is
showing similar instability.  Sure, it's plausible that we changed
something in v15 or so that stopped the problem, but is it worth the
effort to try to find out what?  And if we did find it, would we
care to risk back-porting it?

(If you want to research this, I'm not standing in the way.
But I think there are better uses for your time.)

            regards, tom lane



Re: Instability of phycodorus in pg_upgrade tests with JIT

From
Alexander Lakhin
Date:
Hello Tom and Andres,

25.10.2025 00:31, Tom Lane wrote:
> Sure, we'd need to change our docs about the oldest supported LLVM
> version if we go that way.
>
>>> I wonder if we should just write these off as "probably an LLVM bug".

As I wrote upthread, I could not reproduce the issue with the same old
LLVM versions.

>> I'm not sure that's really convincing, given that REL_16_STABLE seems to not
>> have an issue?
> The other side of that coin is that no other LLVM-using animal is
> showing similar instability.  Sure, it's plausible that we changed
> something in v15 or so that stopped the problem, but is it worth the
> effort to try to find out what?  And if we did find it, would we
> care to risk back-porting it?

My collection [2] also contains reports from other animals: petalura,
desmoxytes, dragonet.

> (If you want to research this, I'm not standing in the way.
> But I think there are better uses for your time.)

I wanted to research this, but, to my disappointment, I failed.


[1] https://www.postgresql.org/message-id/563ee5af-8ee2-484f-b50a-1c8fbdd16171%40gmail.com
[2] https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#check-pg_upgrade_fails_on_LLVM-enabled_animals_due_to_double_free_or_corruption

Best regards,
Alexander

Re: Instability of phycodorus in pg_upgrade tests with JIT

From
Tom Lane
Date:
Alexander Lakhin <exclusion@gmail.com> writes:
> 25.10.2025 00:31, Tom Lane wrote:
>> The other side of that coin is that no other LLVM-using animal is
>> showing similar instability.

> My collection [2] also contains reports from other animals: petalura,
> desmoxytes, dragonet.

Hmm ... but none of those are running any LLVM newer than 4.0.1
(obsolete since 2017).

            regards, tom lane