[MASSMAIL]elog/ereport VS misleading backtrace_function function address

Поиск
Список
Период
Сортировка
От Jakub Wartak
Тема [MASSMAIL]elog/ereport VS misleading backtrace_function function address
Дата
Msg-id CAKZiRmwEojZ1awF2TD_URwHgOJd5Bx4NKaiTSiQM32NDNqODSQ@mail.gmail.com
обсуждение исходный текст
Ответы Re: elog/ereport VS misleading backtrace_function function address  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-hackers
Hi -hackers,

While chasing some other bug I've learned that backtrace_functions
might be misleading with top elog/ereport() address.

Reproducer:

# using Tom's reproducer on master:
wget https://www.postgresql.org/message-id/attachment/112394/ri-collation-bug-example.sql
echo '' >> ri-collation-bug-example.sql
echo '\errverbose' >> ri-collation-bug-example.sql
-- based on grepping the source code locations many others could go in here:
psql -p 5437 -c "alter system set backtrace_functions =
'RI_FKey_cascade_del,get_collation_isdeterministic';"
psql -p 5437 -c "select pg_reload_conf();"
dropdb -p 5437 test1 ; createdb -p 5437 test1 ; psql -p 5437 -d test1
-f ri-collation-bug-example.sql

gives (note "get_collation_isdeterministic"):
    psql:ri-collation-bug-example.sql:42: ERROR:  cache lookup failed
for collation 0
    psql:ri-collation-bug-example.sql:44: error: ERROR:  XX000: cache
lookup failed for collation 0
    LOCATION:  get_collation_isdeterministic, lsyscache.c:1088

and in log note the 0x14c6bb:
2024-03-27 14:39:13.065 CET [12899] postgres@test1 ERROR:  cache
lookup failed for collation 0
2024-03-27 14:39:13.065 CET [12899] postgres@test1 BACKTRACE:
        postgres: 16/main: postgres test1 [local] DELETE(+0x14c6bb)
[0x5565c5a9c6bb]
        postgres: 16/main: postgres test1 [local]
DELETE(RI_FKey_cascade_del+0x323) [0x5565c5ec0973]
        postgres: 16/main: postgres test1 [local] DELETE(+0x2d401f)
[0x5565c5c2401f]
        postgres: 16/main: postgres test1 [local] DELETE(+0x2d5a04)
[0x5565c5c25a04]
        postgres: 16/main: postgres test1 [local]
DELETE(AfterTriggerEndQuery+0x78) [0x5565c5c2a918]
[..]
2024-03-27 14:39:13.065 CET [12899] postgres@test1 STATEMENT:  delete
from revisions where id='5c617ce7-688d-4bea-9d66-c0f0ebc635da';

based on \errverbose it is OK (error matches the code, thanks to
Alvaro for this hint):

     9   bool
     8   get_collation_isdeterministic(Oid colloid)
     7   {
     6   │   HeapTuple>  tp;
     5   │   Form_pg_collation colltup;
     4   │   bool>   >   result;
     3   │
     2   │   tp = SearchSysCache1(COLLOID, ObjectIdGetDatum(colloid));
     1   │   if (!HeapTupleIsValid(tp))
  1088   │   │   elog(ERROR, "cache lookup failed for collation %u", colloid);
[..]


but based on backtrace address 0x14c6bb (!) and it resolves
differently which seems to be highly misleading (note the
"get_language_name.cold"):

$ addr2line -e /usr/lib/postgresql/16/bin/postgres -a -f 0x14c6bb
0x000000000014c6bb
get_language_name.cold
./build/src/backend/utils/cache/./build/../src/backend/utils/cache/lsyscache.c:1181

When i disassemble the get_collation_isdeterministic() (and I know the
name from \errverbose):

Dump of assembler code for function get_collation_isdeterministic:
Address range 0x5c7680 to 0x5c76c1:
   0x00000000005c7680 <+0>:     push   %rbp
[..]
   0x00000000005c7694 <+20>:    call   0x5d63b0 <SearchSysCache1>
   0x00000000005c7699 <+25>:    test   %rax,%rax
   0x00000000005c769c <+28>:    je     0x14c686
<get_collation_isdeterministic.cold>
   0x00000000005c76a2 <+34>:    mov    %rax,%rdi
[..]
   0x00000000005c76bf <+63>:    leave
   0x00000000005c76c0 <+64>:    ret
Address range 0x14c686 to 0x14c6bb:
   0x000000000014c686 <-4698106>:       xor    %esi,%esi
   0x000000000014c688 <-4698104>:       mov    $0x15,%edi
   0x000000000014c68d <-4698099>:       call   0x14ec86 <errstart_cold>
   0x000000000014c692 <-4698094>:       mov    %r12d,%esi
   0x000000000014c695 <-4698091>:       lea    0x5028dc(%rip),%rdi
   # 0x64ef78
   0x000000000014c69c <-4698084>:       xor    %eax,%eax
   0x000000000014c69e <-4698082>:       call   0x5df320 <errmsg_internal>
   0x000000000014c6a3 <-4698077>:       lea    0x6311a6(%rip),%rdx
   # 0x77d850 <__func__.34>
   0x000000000014c6aa <-4698070>:       mov    $0x440,%esi
   0x000000000014c6af <-4698065>:       lea    0x630c8a(%rip),%rdi
   # 0x77d340
   0x000000000014c6b6 <-4698058>:       call   0x5df100 <errfinish>
   << NOTE the exact 0x14c6bb is even missing here(!)

notice the interesting thing here: according to GDB
get_collation_isdeterministic() is @ 0x5c7680 + jump to 0x14c686
<get_collation_isdeterministic.cold> till 0x14c6bb (but without it)
without any stack setup for that .cold. But backtrace() just captured
the elog/ereport (cold) at the end of 0x14c6bb instead, so if I take
that exact address from backtrace_functions output as it is
("DELETE(+0x14c6bb)"), it also shows WRONG function (just like
addr2line):

(gdb) disassemble 0x14c6bb
Dump of assembler code for function get_language_name:
Address range 0x5c7780 to 0x5c77ee:
[..]
   0x00000000005c77ed <+109>:   ret
Address range 0x14c6bb to 0x14c6f0:
   0x000000000014c6bb <-4698309>:       xor    %esi,%esi
[..]
   0x000000000014c6eb <-4698261>:       call   0x5df100 <errfinish>
End of assembler dump.

so this is wrong (as are failing in "get_collation_isdeterministic"
not in "get_language_name").

I was very confused at the beginning with the main question being: why
are we compiling elog/ereport() so that it is incompatible with
backtrace? When looking for it I've found two threads [1] by David
that were actually helpful in understanding that this was done for
performance (unlikley(), cold attribute and similiar type of
discussions). Now my thought is that for >= ERROR in ereport_domain we
could add something more (?) before pg_unreachable() that would help
the backtrace to resolve the symbol correctly

If I try non-portable PoC with x86 nop instruction:

--- a/src/include/utils/elog.h
+++ b/src/include/utils/elog.h
@@ -145,8 +145,10 @@ struct Node;
                        errstart_cold(elevel, domain) : \
                        errstart(elevel, domain)) \
                        __VA_ARGS__, errfinish(__FILE__, __LINE__, __func__); \
-               if (__builtin_constant_p(elevel) && (elevel) >= ERROR) \
+               if (__builtin_constant_p(elevel) && (elevel) >= ERROR) { \
+                       __asm__ __volatile__("nop"); \
                        pg_unreachable(); \
+               } \

it partially works and the address can be now properly resolved!

Somewhat backtrace() still is unable to lookup the to do so in log itself:
        postgres: postgres test1 [local] DELETE(+0x15f84c) [0x55703bf5284c]
        postgres: postgres test1 [local]
DELETE(RI_FKey_cascade_del+0x2b0) [0x55703c34c850]

but at least:

postgres@hive:~$ addr2line -e /usr/pgsql17/bin/postgres -a -f 0x15f84c
0x000000000015f84c
get_collation_isdeterministic
/git/postgres/src/backend/utils/cache/lsyscache.c:1062 (discriminator 4)
postgres@hive:~$

in gdb it will point right to our new nop:
   0x000000000015f840 <-4490944>:       lea    0x608306(%rip),%rdi
   # 0x767b4d
   0x000000000015f847 <-4490937>:       call   0x5be1d0 <errfinish>
   0x000000000015f84c <-4490932>:       nop

Thoughts? Does it make sense to post a patch (for pg18?)? How to do it
in $arch-independent way? I've tried also to try to find a way with
e.g. -rdynamic to show that real function name, but it looks like
without some more serious unwind library it seems unrealistic (?) to
resolve that get_collation_isdeterministic.cold

-J.

[1] -
https://www.postgresql.org/message-id/flat/CAApHDvoWoxtbWiqZxrhO%2Bi9NoG56AWHDzuDDD%2B1OEf4PxzFhig%40mail.gmail.com#611566bd6fd06f27e51dbc3148a673d3
[2] - https://www.postgresql.org/message-id/flat/CAKJS1f8yqRW3qx2CO9r4bqqvA2Vx68%3D3awbh8CJWTP9zXeoHMw%40mail.gmail.com



В списке pgsql-hackers по дате отправления:

Предыдущее
От: "Andrey M. Borodin"
Дата:
Сообщение: Re: Support logical replication of DDLs
Следующее
От: Amit Kapila
Дата:
Сообщение: Re: Support logical replication of DDLs