Thread: Segmentation Fault


Segmentation Fault

From: Robert Sanford
Date: Wed, Sep 13, 2023, 4:53 PM
PostgreSQL v15.4
Running on Ubuntu Jammy
ARM64 x8
64 GB RAM

I have been trying to tune the performance of a query and wasn't having a lot of luck, so I decided to run VACUUM ANALYZE to see if that would help.

The Postgres server crashed.

I then went through the tables individually and ran VACUUM ANALYZE on each one. That completed successfully.

I ran my query to see if that had helped performance. Segmentation fault. The server restarted itself and came back to life. I reran the query and it crashed again. Every time I run that query I get the same segmentation fault; other queries run just fine.

I tried attaching GDB to the PID that the log file says is the main startup PID. That seems to have worked, in that it tells me which libraries were loaded. I'm not sure whether that is sufficient; I'm pretty sure it isn't.

I set up GDB following the DebuggingProgramCrash page on the Ubuntu Wiki, which says "If you are on Ubuntu Jammy or later, you do not need to worry about installing debug symbol packages anymore." I am running on Jammy, so that should be good.
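As far as I can tell, that works through debuginfod, with GDB fetching debug symbols on demand when it attaches. A rough sketch of what that looks like (the PID is just a placeholder for whichever process is being attached to):

export DEBUGINFOD_URLS="https://debuginfod.ubuntu.com"
gdb -p <pid>
(gdb) set debuginfod enabled on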

I then tried getting a Valgrind trace from PostgreSQL. I used "systemctl stop postgresql" to stop the service, then launched Valgrind using the command line below my signature. Everything worked just fine, except that the query took 10x as long as it did before I started the VACUUM ANALYZE.

So I restarted the service, reran my query, and it failed with the same log entries showing a Signal 11: Segmentation fault.

Running it inside Valgrind stopped the error from happening, which is strange to me.

What can I do to provide enough information to figure out why this is crashing?

rjsjr



Valgrind Command
valgrind --leak-check=no --gen-suppressions=all \
--suppressions=src/tools/valgrind.supp --time-stamp=yes \
--error-markers=VALGRINDERROR-BEGIN,VALGRINDERROR-END \
--log-file=/datadrive/%p.log --trace-children=yes \
/usr/lib/postgresql/15/bin/postgres --config-file=/etc/postgresql/15/main/postgresql.conf \
--log_line_prefix="%m %p " \
--log_statement=all --shared_buffers=64MB 2>&1 | tee /datadrive/postmaster.log


GDB backtrace output
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
0x0000ffffb9c6e844 in __GI___select (nfds=9, readfds=0xffffd0849280, writefds=0x0, exceptfds=0x0, timeout=0xffffd08491e8) at ../sysdeps/unix/sysv/linux/select.c:69
69      ../sysdeps/unix/sysv/linux/select.c: No such file or directory.
(gdb) bt
#0  0x0000ffffb9c6e844 in __GI___select (nfds=9, readfds=0xffffd0849280, writefds=0x0, exceptfds=0x0, timeout=0xffffd08491e8) at ../sysdeps/unix/sysv/linux/select.c:69
#1  0x0000aaaab91b4a1c in ?? ()
#2  0x0000aaaab91b63cc in PostmasterMain ()
#3  0x0000aaaab8f053d8 in main ()


PostgreSQL Log File
2023-09-13 20:20:21.198 UTC [295031] LOG:  starting PostgreSQL 15.4 (Ubuntu 15.4-1.pgdg20.04+1) on aarch64-unknown-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, 64-bit
2023-09-13 20:20:21.198 UTC [295031] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2023-09-13 20:20:21.198 UTC [295031] LOG:  listening on IPv6 address "::", port 5432
2023-09-13 20:20:21.203 UTC [295031] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2023-09-13 20:20:21.218 UTC [295037] LOG:  database system was shut down at 2023-09-13 20:20:20 UTC
2023-09-13 20:20:21.228 UTC [295031] LOG:  database system is ready to accept connections
2023-09-13 20:21:21.604 UTC [295031] LOG:  server process (PID 295102) was terminated by signal 11: Segmentation fault
2023-09-13 20:21:21.604 UTC [295031] DETAIL:  Failed process was running: --explain analyze
        select
                a.equipmenttype
                , a.technology
                , count(*)
        from
                newfaultentrylineitems a
        where
                a.databaseguid = '9a277d83-30eb-4736-99f2-d028ff84611e'
                and a.assessmentyearmonth >= 202209
                and a.assessmentyearmonth < 202309
        group by a.equipmenttype, a.technology
        order by count(*) desc
2023-09-13 20:21:21.604 UTC [295031] LOG:  terminating any other active server processes
2023-09-13 20:21:21.610 UTC [295031] LOG:  all server processes terminated; reinitializing
2023-09-13 20:21:22.074 UTC [295109] LOG:  database system was interrupted; last known up at 2023-09-13 20:20:21 UTC
2023-09-13 20:21:22.137 UTC [295112] FATAL:  the database system is in recovery mode
2023-09-13 20:21:22.187 UTC [295109] LOG:  database system was not properly shut down; automatic recovery in progress
2023-09-13 20:21:22.195 UTC [295109] LOG:  redo starts at 17/6AFACF40
2023-09-13 20:21:22.195 UTC [295109] LOG:  invalid record length at 17/6AFACF78: wanted 24, got 0
2023-09-13 20:21:22.195 UTC [295109] LOG:  redo done at 17/6AFACF40 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2023-09-13 20:21:22.208 UTC [295110] LOG:  checkpoint starting: end-of-recovery immediate wait
2023-09-13 20:21:22.383 UTC [295113] FATAL:  the database system is not yet accepting connections
2023-09-13 20:21:22.383 UTC [295113] DETAIL:  Consistent recovery state has not been yet reached.
2023-09-13 20:21:22.428 UTC [295110] LOG:  checkpoint complete: wrote 3 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.199 s, sync=0.004 s, total=0.224 s; sync files=2, longest=0.003>
2023-09-13 20:21:22.435 UTC [295031] LOG:  database system is ready to accept connections
2023-09-13 20:40:24.416 UTC [295031] LOG:  server process (PID 295140) was terminated by signal 11: Segmentation fault
2023-09-13 20:40:24.416 UTC [295031] DETAIL:  Failed process was running: --explain analyze
        select
                a.equipmenttype
                , a.technology
                , count(*)
        from
                newfaultentrylineitems a
        where
                a.databaseguid = '9a277d83-30eb-4736-99f2-d028ff84611e'
                and a.assessmentyearmonth >= 202209
                and a.assessmentyearmonth < 202309
        group by a.equipmenttype, a.technology
        order by count(*) desc
2023-09-13 20:40:24.416 UTC [295031] LOG:  terminating any other active server processes
2023-09-13 20:40:24.666 UTC [295523] FATAL:  the database system is in recovery mode
2023-09-13 20:40:24.922 UTC [295524] FATAL:  the database system is in recovery mode
2023-09-13 20:40:25.182 UTC [295525] FATAL:  the database system is in recovery mode
2023-09-13 20:40:25.379 UTC [295526] FATAL:  the database system is in recovery mode
2023-09-13 20:40:25.574 UTC [295527] FATAL:  the database system is in recovery mode
2023-09-13 20:40:25.775 UTC [295528] FATAL:  the database system is in recovery mode
2023-09-13 20:40:25.977 UTC [295529] FATAL:  the database system is in recovery mode
2023-09-13 20:40:26.179 UTC [295530] FATAL:  the database system is in recovery mode
2023-09-13 20:40:26.377 UTC [295531] FATAL:  the database system is in recovery mode
2023-09-13 20:40:26.578 UTC [295532] FATAL:  the database system is in recovery mode
2023-09-13 20:40:26.777 UTC [295533] FATAL:  the database system is in recovery mode
2023-09-13 20:40:26.973 UTC [295534] FATAL:  the database system is in recovery mode
2023-09-13 20:40:27.166 UTC [295535] FATAL:  the database system is in recovery mode
2023-09-13 20:40:27.372 UTC [295536] FATAL:  the database system is in recovery mode
2023-09-13 20:40:27.568 UTC [295537] FATAL:  the database system is in recovery mode
2023-09-13 20:40:27.766 UTC [295538] FATAL:  the database system is in recovery mode
2023-09-13 20:40:27.961 UTC [295539] FATAL:  the database system is in recovery mode
2023-09-13 20:40:28.163 UTC [295540] FATAL:  the database system is in recovery mode
2023-09-13 20:40:28.389 UTC [295541] FATAL:  the database system is in recovery mode
2023-09-13 20:40:28.589 UTC [295542] FATAL:  the database system is in recovery mode
2023-09-13 20:40:28.784 UTC [295543] FATAL:  the database system is in recovery mode
2023-09-13 20:40:28.978 UTC [295544] FATAL:  the database system is in recovery mode
2023-09-13 20:40:29.114 UTC [295031] LOG:  issuing SIGKILL to recalcitrant children
2023-09-13 20:40:29.178 UTC [295545] FATAL:  the database system is in recovery mode
2023-09-13 20:40:29.403 UTC [295546] FATAL:  the database system is in recovery mode
2023-09-13 20:40:29.603 UTC [295547] FATAL:  the database system is in recovery mode
2023-09-13 20:40:29.798 UTC [295548] FATAL:  the database system is in recovery mode
2023-09-13 20:40:30.017 UTC [295549] FATAL:  the database system is in recovery mode
2023-09-13 20:40:30.214 UTC [295550] FATAL:  the database system is in recovery mode
2023-09-13 20:40:58.414 UTC [295031] LOG:  all server processes terminated; reinitializing
2023-09-13 20:40:58.876 UTC [295553] LOG:  database system was interrupted; last known up at 2023-09-13 20:21:22 UTC
2023-09-13 20:40:58.986 UTC [295553] LOG:  database system was not properly shut down; automatic recovery in progress
2023-09-13 20:40:58.992 UTC [295553] LOG:  redo starts at 17/6AFACFF0
2023-09-13 20:40:58.992 UTC [295553] LOG:  invalid record length at 17/6AFAD028: wanted 24, got 0
2023-09-13 20:40:58.992 UTC [295553] LOG:  redo done at 17/6AFACFF0 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2023-09-13 20:40:59.006 UTC [295554] LOG:  checkpoint starting: end-of-recovery immediate wait
2023-09-13 20:40:59.228 UTC [295554] LOG:  checkpoint complete: wrote 3 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.198 s, sync=0.004 s, total=0.228 s; sync files=2, longest=0.003>
2023-09-13 20:40:59.237 UTC [295031] LOG:  database system is ready to accept connections





Re: Segmentation Fault

From: Robert Sanford
Date:
I realized that when I was running GDB I had just taken a snapshot of the moment it failed. Here I've added the output after continuing and requesting a backtrace.

rjsjr


Continuing.
[Detaching after fork from child process 297277]
[Detaching after fork from child process 297278]
[Detaching after fork from child process 297279]
[Detaching after fork from child process 297280]
[Detaching after fork from child process 297281]
[Detaching after fork from child process 297282]
[Detaching after fork from child process 297283]
[Detaching after fork from child process 297284]
[Detaching after fork from child process 297286]
[Detaching after fork from child process 297287]
[Detaching after fork from child process 297289]

Program received signal SIGUSR1, User defined signal 1.
0x0000ffffbbd6e844 in __GI___select (nfds=9, readfds=0xffffde28eab0, writefds=0x0, exceptfds=0x0, timeout=0xffffde28ea18) at ../sysdeps/unix/sysv/linux/select.c:69
69      in ../sysdeps/unix/sysv/linux/select.c
(gdb) bt
#0  0x0000ffffbbd6e844 in __GI___select (nfds=9, readfds=0xffffde28eab0, writefds=0x0, exceptfds=0x0, timeout=0xffffde28ea18) at ../sysdeps/unix/sysv/linux/select.c:69
#1  0x0000aaaab8124a1c in ?? ()
#2  0x0000aaaab81263cc in PostmasterMain ()
#3  0x0000aaaab7e753d8 in main ()
(gdb) c
Continuing.
[Detaching after fork from child process 297301]
[Detaching after fork from child process 297304]

Program received signal SIGUSR1, User defined signal 1.
0x0000ffffbbd6e844 in __GI___select (nfds=9, readfds=0xffffde28eab0, writefds=0x0, exceptfds=0x0, timeout=0xffffde28ea18) at ../sysdeps/unix/sysv/linux/select.c:69
69      in ../sysdeps/unix/sysv/linux/select.c
(gdb) bt
#0  0x0000ffffbbd6e844 in __GI___select (nfds=9, readfds=0xffffde28eab0, writefds=0x0, exceptfds=0x0, timeout=0xffffde28ea18) at ../sysdeps/unix/sysv/linux/select.c:69
#1  0x0000aaaab8124a1c in ?? ()
#2  0x0000aaaab81263cc in PostmasterMain ()
#3  0x0000aaaab7e753d8 in main ()
(gdb) c
Continuing.
[Detaching after fork from child process 297306]
[Detaching after fork from child process 297311]
[Detaching after fork from child process 297320]

Program received signal SIGUSR1, User defined signal 1.
0x0000ffffbbd6e844 in __GI___select (nfds=9, readfds=0xffffde28eab0, writefds=0x0, exceptfds=0x0, timeout=0xffffde28ea18) at ../sysdeps/unix/sysv/linux/select.c:69
69      in ../sysdeps/unix/sysv/linux/select.c
(gdb) c
Continuing.
[Detaching after fork from child process 297330]

Program received signal SIGUSR1, User defined signal 1.
0x0000ffffbbd6e844 in __GI___select (nfds=9, readfds=0xffffde28eab0, writefds=0x0, exceptfds=0x0, timeout=0xffffde28ea18) at ../sysdeps/unix/sysv/linux/select.c:69
69      in ../sysdeps/unix/sysv/linux/select.c
(gdb) c
Continuing.
[Detaching after fork from child process 297332]
[Detaching after fork from child process 297333]
[Detaching after fork from child process 297334]
[Detaching after fork from child process 297335]
[Detaching after fork from child process 297336]

Program received signal SIGUSR1, User defined signal 1.
0x0000ffffbbd6e844 in __GI___select (nfds=9, readfds=0xffffde28eab0, writefds=0x0, exceptfds=0x0, timeout=0xffffde28ea18) at ../sysdeps/unix/sysv/linux/select.c:69
69      in ../sysdeps/unix/sysv/linux/select.c
(gdb) c
Continuing.
[Detaching after fork from child process 297337]
[Detaching after fork from child process 297338]
[Detaching after fork from child process 297339]
[Detaching after fork from child process 297344]
[Detaching after fork from child process 297347]

Program received signal SIGUSR1, User defined signal 1.
0x0000ffffbbd6e844 in __GI___select (nfds=9, readfds=0xffffde28eab0, writefds=0x0, exceptfds=0x0, timeout=0xffffde28ea18) at ../sysdeps/unix/sysv/linux/select.c:69
69      in ../sysdeps/unix/sysv/linux/select.c
(gdb) c
Continuing.
[Detaching after fork from child process 297351]
[Detaching after fork from child process 297352]
[Detaching after fork from child process 297353]
[Detaching after fork from child process 297354]
[Detaching after fork from child process 297355]







Re: Segmentation Fault

From: Tom Lane
Date: Wed, Sep 13, 2023, 6:58 PM
Robert Sanford <wobbet@gmail.com> writes:
> I realized that when I was running GDB I just took a snap of when it
> failed. Here I've added after continuing and requesting a backtrace.

This is still quite unhelpful; it appears to show normal behavior in
a process that's not the failing one.  There is some advice about
collecting useful backtraces at

https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend

I'd counsel the method of enabling core dumps and then attaching
gdb to a core file, rather than dealing with live processes.
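Roughly (a sketch only; the binary path below is the one from your Valgrind command line, and where the core lands depends on your core_pattern):

ulimit -c unlimited        # in the environment that starts the postmaster
# reproduce the crash, then open the core file:
gdb /usr/lib/postgresql/15/bin/postgres /path/to/core
(gdb) bt full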

            regards, tom lane



Re: Segmentation Fault

From: Laurenz Albe
Date:
On Wed, 2023-09-13 at 16:53 -0500, Robert Sanford wrote:
> The Postgres server crashed.

The PostgreSQL log file may contain interesting information.

Yours,
Laurenz Albe



Re: Segmentation Fault

From: Robert Sanford
Date: Thu, Sep 14, 2023, 8:27 AM
Woke up this morning excited to create a core dump and... it worked.

Not only does it work, it works fast. Query time dropped from 5.5s to 1.2s. That was what I was hoping to get out of the VACUUM ANALYZE.

Very confused. I don't get it. I hope it doesn't happen again, but if it does I'll get the core dump. 

rjsjr



Re: Segmentation Fault

From: Robert Sanford
Date: Thu, Sep 14, 2023, 4:29 PM
And now it's happening again...

I'm trying to get a core dump but I can't seem to actually generate one. 

When I try to set ulimit -c from the command line, I get a message that I'm not allowed to do that. So I went in and updated /etc/security/limits.conf for the postgres user as follows. 
postgres         hard    core            unlimited
postgres         soft    core            unlimited

I then restarted the postgres server. When I look at the limits in the pid folder I see...
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes

My core_pattern is set to "/tmp/core-%e-%s-%u-%g-%p-%t"

I run the query that forces the crash, but there is no core file in any of:
/tmp/
/etc/postgresql/15
/etc/postgresql/15/main
/var/lib/postgresql/15
/var/lib/postgresql/15/main
/usr/lib/postgresql/15
/usr/lib/postgresql/15/lib
/usr/lib/postgresql/15/bin

GDB (more below) is even able to detect that there's a segmentation fault.

What am I doing wrong? Why can't I get a core file anywhere? 

rjsjr

From the GDB shell... which looks identical to what's in the log file... 
(gdb) set pagination off
(gdb) set logging file /datadrive/pgdebug.txt
(gdb) set logging on
Warning: 'set logging on', an alias for the command 'set logging enabled', is deprecated.
Use 'set logging enabled on'.

Copying output to /datadrive/pgdebug.txt.
Copying debug output to /datadrive/pgdebug.txt.
(gdb) b errfinish
Breakpoint 1 at 0xaaaabef843b8
(gdb) cont
Continuing.

Program received signal SIGUSR1, User defined signal 1.
0x0000fff7735c766c in ?? () from /lib/aarch64-linux-gnu/libLLVM-10.so.1
(gdb) bt
#0  0x0000fff7735c766c in ?? () from /lib/aarch64-linux-gnu/libLLVM-10.so.1
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) bt
#0  0x0000fff7735c766c in ?? () from /lib/aarch64-linux-gnu/libLLVM-10.so.1
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) cont
Continuing.

Program received signal SIGUSR1, User defined signal 1.
0x0000fff7735c766c in ?? () from /lib/aarch64-linux-gnu/libLLVM-10.so.1
(gdb) bt
#0  0x0000fff7735c766c in ?? () from /lib/aarch64-linux-gnu/libLLVM-10.so.1
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) cont
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x0000fff7add7a000 in ?? ()
(gdb) bt
#0  0x0000fff7add7a000 in ?? ()
#1  0x0000aaaacc627ca0 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) cont
Continuing.

Program terminated with signal SIGSEGV, Segmentation fault.
The program no longer exists.
(gdb)




Re: Segmentation Fault

From: "Euler Taveira"
Date:
On Thu, Sep 14, 2023, at 4:29 PM, Robert Sanford wrote:
> And now it's happening again...
>
> I'm trying to get a core dump but I can't seem to actually generate one.
>
> When I try to set ulimit -c from the command line, I get a message that I'm not allowed to do that. So I went in and updated /etc/security/limits.conf for the postgres user as follows.
> postgres         hard    core            unlimited
> postgres         soft    core            unlimited
>
> I then restarted the postgres server. When I look at the limits in the pid folder I see...
> Limit                     Soft Limit           Hard Limit           Units
> Max cpu time              unlimited            unlimited            seconds
> Max file size             unlimited            unlimited            bytes
> Max data size             unlimited            unlimited            bytes
> Max stack size            8388608              unlimited            bytes
> Max core file size        0                    unlimited            bytes

Are you using systemd? If so, execute

sudo systemctl show -p LimitCORE postgresql@15-main.service

(Replace service name accordingly.)

If it is not showing infinity as the value, you need to edit the service file using
the following command:

sudo systemctl edit postgresql@15-main.service

and add

[Service]
LimitCORE=infinity

Restart the service afterwards and check again with the systemctl show command.
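For example (assuming the same service name as above):

sudo systemctl daemon-reload
sudo systemctl restart postgresql@15-main.service
sudo systemctl show -p LimitCORE -p LimitCORESoft postgresql@15-main.service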

You should also check where the operating system is saving the core files.

cat /proc/sys/kernel/core_pattern

If it says just "core", it means the core file is stored in the PGDATA directory. Some OSes
use a command/script to send the core files to another location.
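On Ubuntu the pattern is often a pipe to a helper (apport or systemd-coredump) rather than a plain path, in which case no file appears in the directory you expect. A quick check, just as a sketch:

cat /proc/sys/kernel/core_pattern
# a leading "|", e.g. "|/usr/share/apport/apport ..." or "|/lib/systemd/systemd-coredump ...",
# means a helper receives the dump; with systemd-coredump it can be retrieved with:
coredumpctl list postgres
coredumpctl gdb <pid>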


--
Euler Taveira

Re: Segmentation Fault

From: Tom Lane
Date: Thu, Sep 14, 2023, 7:17 PM
Robert Sanford <wobbet@gmail.com> writes:
> I'm trying to get a core dump but I can't seem to actually generate one.
> ...
> I then restarted the postgres server. When I look at the limits in the pid
> folder I see...
> Limit                     Soft Limit           Hard Limit           Units
> Max cpu time              unlimited            unlimited            seconds
> Max file size             unlimited            unlimited            bytes
> Max data size             unlimited            unlimited            bytes
> Max stack size            8388608              unlimited            bytes
> Max core file size        0                    unlimited            bytes

This shows that you are now allowed to set "ulimit -c unlimited",
but you have not actually done so, at least not in the session that
you're using to inspect these settings.

If you are starting the server "by hand" (from a manual shell
command), it should work to do "ulimit -c unlimited" before issuing
the command to start the server.  If you are using infrastructure such
as systemd or initscripts, then you'll need to figure out how to get
it to set the ulimit in the environment of the server process.

BTW, have you tried disabling JIT to see if that stops the crashes?
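(As an example of one way to do that, reversible, assuming a superuser connection:

psql -c "ALTER SYSTEM SET jit = off;" -c "SELECT pg_reload_conf();"

or simply "SET jit = off;" in the session that runs the problem query.)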

            regards, tom lane



Re: Segmentation Fault

From: Robert Sanford
Date:
Turning JIT off seems to have solved the problem so far. But in the past I've gone a couple of hours without it happening and then it starts again, so I'm still kind of focused on trying to get a core dump.

Here are some of my notes from trying to get one last night... 

I typically use systemctl to launch postgres but have also tested running it manually. If I get another failure, I will try adding the ulimit call before starting it manually on my next set of tests. If I don't get another failure, I'll re-enable JIT and try to induce it again.
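The manual run would look roughly like this (a sketch; the binary and config paths are the same ones from my Valgrind command):

sudo -u postgres bash -c 'ulimit -c unlimited; exec /usr/lib/postgresql/15/bin/postgres --config-file=/etc/postgresql/15/main/postgresql.conf'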

I have 32 GB of shared memory configured for PostgreSQL but only 10 GB of drive space on my OS drive. I'm wondering if that's preventing it from writing the core dump. I'm going to move my core dump location to my data drive, which has well over 100 GB free...
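Something along these lines is what I'm planning to try (the directory name is just my guess at a sensible location):

sudo mkdir -p /datadrive/cores && sudo chmod 1777 /datadrive/cores
sudo sysctl -w kernel.core_pattern=/datadrive/cores/core-%e-%s-%u-%g-%p-%t
echo 'kernel.core_pattern=/datadrive/cores/core-%e-%s-%u-%g-%p-%t' | sudo tee /etc/sysctl.d/60-core-pattern.conf   # persist across reboots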

When running systemctl list I notice that versions 12 and 14 of PostgreSQL are also running. I would hope that doesn't cause an issue, but it also doesn't seem ideal. How do I remove them from systemd startup?
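(My untested guess at how to do that would be something like:

pg_lsclusters
sudo systemctl disable --now postgresql@12-main.service postgresql@14-main.service

but I'd appreciate confirmation that that's the right way on Ubuntu.)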

Running the systemctl show command, I got the following info...
LimitCPU=infinity
LimitCPUSoft=infinity
LimitFSIZE=infinity
LimitFSIZESoft=infinity
LimitDATA=infinity
LimitDATASoft=infinity
LimitSTACK=infinity
LimitSTACKSoft=8388608
LimitCORE=infinity
LimitCORESoft=0
LimitRSS=infinity
LimitRSSSoft=infinity
LimitNOFILE=524288
LimitNOFILESoft=1024

My pgdata directory is set in my postgresql.conf file as
data_directory = '/var/lib/postgresql/15/main'

PGDATA is not set in the environment. 

My core_pattern is set to "/tmp/core-%e-%s-%u-%g-%p-%t"




The contents of that directory are
azureuser@kpidb-dev:/etc$ sudo ls -al /var/lib/postgresql/15/main
total 120
drwx------ 20 postgres postgres  4096 Sep 14 19:13 .
drwxr-xr-x  3 postgres postgres  4096 Aug 18 20:44 ..
-rw-------  1 postgres postgres     3 Aug 18 20:44 PG_VERSION
drwx------  5 postgres postgres  4096 Aug 18 20:44 base
-rw-------  1 postgres postgres    44 Sep 14 19:13 current_logfiles
drwx------  2 postgres postgres  4096 Sep 14 19:39 global
drwx------  2 postgres postgres  4096 Sep 14 19:13 log
drwx------  2 postgres postgres  4096 Aug 18 20:44 pg_commit_ts
drwx------  2 postgres postgres  4096 Aug 18 20:44 pg_dynshmem
drwx------  4 postgres postgres  4096 Sep 14 19:39 pg_logical
drwx------  4 postgres postgres  4096 Aug 18 20:44 pg_multixact
drwx------  2 postgres postgres  4096 Aug 18 20:44 pg_notify
drwx------  2 postgres postgres  4096 Aug 18 20:44 pg_replslot
drwx------  2 postgres postgres  4096 Aug 18 20:44 pg_serial
drwx------  2 postgres postgres  4096 Aug 18 20:44 pg_snapshots
drwx------  2 postgres postgres  4096 Sep 14 19:13 pg_stat
drwx------  2 postgres postgres  4096 Aug 29 01:44 pg_stat_tmp
drwx------  2 postgres postgres  4096 Sep  5 14:36 pg_subtrans
drwx------  2 postgres postgres  4096 Aug 30 14:05 pg_tblspc
drwx------  2 postgres postgres  4096 Aug 18 20:44 pg_twophase
drwx------  3 postgres postgres 20480 Sep 13 20:16 pg_wal
drwx------  2 postgres postgres  4096 Aug 18 20:44 pg_xact
-rw-------  1 postgres postgres    88 Aug 18 20:44 postgresql.auto.conf
-rw-------  1 postgres postgres   238 Sep 14 19:13 postmaster.opts
-rw-------  1 postgres postgres   101 Sep 14 19:39 postmaster.pid





