Discussion: Segfault on postgresql 12.3
Hi all,

I just had strange behavior on my postgresql instance, with postgresql restarting automatically.

Looking through the logs, I found a segfault in kern.log:

[12:24:09]root@db12:~# cat /var/log/kern.log
2020-08-21T12:00:01.436378+02:00 db12 kernel: postgres[177990]: segfault at 0 ip 00005636d2d844f1 sp 00007fff4fa69910 error 4 in postgres[5636d2cb7000+775000]

I've also enabled core dumps; the file output is:

[12:24:13]root@db12:~# file /data/postgresql/12/main/core
/data/postgresql/12/main/core: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'postgres: 12/main: supervision neteven2 localhost(34868) SELECT', real uid: 110, effective uid: 110, real gid: 114, effective gid: 114, execfn: '/usr/lib/postgresql/12/bin/postgres', platform: 'x86_64'

In the logs, I have these messages:

2020-08-21 12:00:01.451 CEST [274137]: [299-1] user=,db=,app=,client= LOG: server process (PID 177990) was terminated by signal 11: Segmentation fault
2020-08-21 12:00:01.451 CEST [274137]: [300-1] user=,db=,app=,client= DETAIL: Failed process was running: SELECT usename,count(*) FROM pg_stat_activity WHERE pid != pg_backend_pid() GROUP BY usename ORDER BY 1
..
2020-08-21 12:00:02.776 CEST [274137]: [302-1] user=,db=,app=,client= LOG: archiver process (PID 274215) exited with exit code 1
2020-08-21 12:00:02.774 CEST [274214]: [1-1] user=,db=,app=,client= WARNING: terminating connection because of crash of another server process
2020-08-21 12:00:02.774 CEST [274214]: [2-1] user=,db=,app=,client= DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-08-21 12:00:02.774 CEST [274214]: [3-1] user=,db=,app=,client= HINT: In a moment you should be able to reconnect to the database and repeat your command.
(many times until full restart)

I'm on version 12.3, on a dedicated on-prem host.

root@db12:~# dpkg -l | grep postgresql
ii  pgdg-keyring              2018.2             all    keyring for apt.postgresql.org
ii  postgresql-12             12.3-1.pgdg90+1    amd64  object-relational SQL database, version 12 server
ii  postgresql-12-repmgr      5.1.0-1.stretch+1  amd64  replication manager for PostgreSQL 12
ii  postgresql-client-12      12.3-1.pgdg90+1    amd64  front-end programs for PostgreSQL 12
ii  postgresql-client-common  215.pgdg90+1       all    manager for multiple PostgreSQL client versions
ii  postgresql-common         215.pgdg90+1       all    PostgreSQL database-cluster manager
ii  postgresql-server-dev-12  12.3-1.pgdg90+1    amd64  development files for PostgreSQL 12 server-side programming

Could you please help me find the root cause?

Best regards
thomas
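As a reading aid for the kern.log line in the message above, here is a minimal Python sketch that splits a kernel segfault message into its fields. The names `PATTERN` and `decode_segfault` are illustrative only, and the interpretation of the `error` bits assumes the usual x86 page-fault error code layout (bit 0: protection violation, bit 1: write, bit 2: user mode); treat it as a sketch rather than an authoritative decoder.

```python
import re

# Kernel message copied from the kern.log excerpt in this thread.
LINE = ("postgres[177990]: segfault at 0 ip 00005636d2d844f1 "
        "sp 00007fff4fa69910 error 4 in postgres[5636d2cb7000+775000]")

# All numeric fields are assumed to be printed in hexadecimal.
PATTERN = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Return the fields of a kernel segfault line as a dict."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("not a recognised segfault line")
    err = int(m["err"], 16)
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    return {
        "object": m["obj"],
        "fault_address": int(m["addr"], 16),
        # Offset of the faulting instruction inside the reported mapping.
        "offset": ip - base,
        # Assumed x86 page-fault error code bits.
        "protection_violation": bool(err & 1),
        "write": bool(err & 2),
        "user_mode": bool(err & 4),
    }


info = decode_segfault(LINE)
print(hex(info["offset"]), info)
```

Under these assumptions, the message describes a user-mode read from address 0 on a page that is not mapped, which is consistent with a NULL pointer dereference inside the postgres binary.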
On Fri, Aug 21, 2020 at 2:25 PM Thomas SIMON <tsimon@neteven.com> wrote:
> [...]
> I'm on 12.3 version, on a dedicated host on prem.

Note that version 12.4 is now available, however I don't see any relevant fix.

> [...]
> Could you please help me to find what is the root cause ?

This is unfortunately not enough information to find the root issue.

Do you have any custom extension? Is there any chance you can get a backtrace of the generated coredump? See
https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#Getting_a_trace_from_a_randomly_crashing_backend
for more details on how to do that.
Hi Julien,

thanks for answering me.

The only extension I use is repmgr.

I've tried to use gdb to see something (I don't know if I'm using it correctly); the backtrace is below:

[16:03:13]root@db13:/tmp$ gdb -q -c /tmp/core /usr/lib/postgresql/12/bin/postgres
Reading symbols from /usr/lib/postgresql/12/bin/postgres...(no debugging symbols found)...done.
[New LWP 177990]

warning: Unexpected size of section `.reg-xstate/177990' in core file.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: 12/main: supervision neteven2 localhost(34868) SELECT '.
Program terminated with signal SIGSEGV, Segmentation fault.

warning: Unexpected size of section `.reg-xstate/177990' in core file.
#0 0x00005636d2d844f1 in equalTupleDescs ()
(gdb)
(gdb) bt
#0 0x00005636d2d844f1 in equalTupleDescs ()
#1 0x00005636d31a65cf in ?? ()
#2 0x00005636d31b5fd3 in hash_search_with_hash_value ()
#3 0x00005636d31a83b1 in assign_record_type_typmod ()
#4 0x00005636d31b4855 in ?? ()
#5 0x00005636d31b4b43 in get_expr_result_type ()
#6 0x00005636d31b4b7b in get_expr_result_tupdesc ()
#7 0x00005636d2e8bcce in get_rte_attribute_is_dropped ()
#8 0x00005636d303fc7a in AcquireRewriteLocks ()
#9 0x00005636d304039e in ?? ()
#10 0x00005636d3043aa2 in QueryRewrite ()
#11 0x00005636d307de90 in ?? ()
#12 0x00005636d307df70 in pg_analyze_and_rewrite ()
#13 0x00005636d307e68f in ?? ()
#14 0x00005636d30804ad in PostgresMain ()
#15 0x00005636d2d73f00 in ?? ()
#16 0x00005636d3006f89 in PostmasterMain ()
#17 0x00005636d2d75128 in main ()
(gdb) cont
The program is not being run.

thomas
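The backtrace in this message contains several `??` frames, and gdb reports "(no debugging symbols found)", which suggests the binary was examined without its debug symbols. As a rough illustration, the Python sketch below separates resolved from unresolved frames in the posted backtrace. The helper name `split_frames` and the regular expression are assumptions made for this example, not part of gdb or PostgreSQL.

```python
import re

# Backtrace copied from the gdb session posted in this thread.
BACKTRACE = """\
#0 0x00005636d2d844f1 in equalTupleDescs ()
#1 0x00005636d31a65cf in ?? ()
#2 0x00005636d31b5fd3 in hash_search_with_hash_value ()
#3 0x00005636d31a83b1 in assign_record_type_typmod ()
#4 0x00005636d31b4855 in ?? ()
#5 0x00005636d31b4b43 in get_expr_result_type ()
#6 0x00005636d31b4b7b in get_expr_result_tupdesc ()
#7 0x00005636d2e8bcce in get_rte_attribute_is_dropped ()
#8 0x00005636d303fc7a in AcquireRewriteLocks ()
#9 0x00005636d304039e in ?? ()
#10 0x00005636d3043aa2 in QueryRewrite ()
#11 0x00005636d307de90 in ?? ()
#12 0x00005636d307df70 in pg_analyze_and_rewrite ()
#13 0x00005636d307e68f in ?? ()
#14 0x00005636d30804ad in PostgresMain ()
#15 0x00005636d2d73f00 in ?? ()
#16 0x00005636d3006f89 in PostmasterMain ()
#17 0x00005636d2d75128 in main ()
"""

# Matches lines such as "#3 0x0000... in some_function ()".
FRAME = re.compile(r"^#(\d+)\s+0x[0-9a-f]+ in (\S+) \(")


def split_frames(text):
    """Return (resolved, unresolved) lists of (frame number, name)."""
    resolved, unresolved = [], []
    for line in text.splitlines():
        m = FRAME.match(line.strip())
        if m is None:
            continue  # skip prompts, warnings and other gdb output
        frame = (int(m.group(1)), m.group(2))
        (unresolved if frame[1] == "??" else resolved).append(frame)
    return resolved, unresolved


resolved, unresolved = split_frames(BACKTRACE)
print("resolved:", [name for _, name in resolved])
print("unresolved frame numbers:", [n for n, _ in unresolved])
```

A backtrace taken with debug symbols installed, as described on the wiki page linked earlier in the thread, would normally name these frames and show their arguments, which gives much more to work with.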
On Fri, Aug 21, 2020 at 4:13 PM Thomas SIMON <tsimon@neteven.com> wrote:
> [...]
> The only extension I use is repmgr.

Ok, this shouldn't be a problem.

> I've tried to use gdb to see something (I don't know if i use it
> correctly) , below the backtrace :
> [...]
> #0 0x00005636d2d844f1 in equalTupleDescs ()
> [...]

Thanks!

I don't see any obvious problem in that code, and it's something that hasn't changed in a long time, so I'm starting to think this could be a hardware problem. Do you have any alarming messages in your system logs and/or dmesg?
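For anyone wanting to check the system logs for such messages, a minimal Python sketch of a keyword filter follows. The marker list is an illustrative assumption (substrings that often show up in Linux kernel messages about failing hardware), and `suspicious_lines` is a hypothetical helper name; an empty result does not prove the hardware is healthy.

```python
# Substrings that commonly appear in Linux kernel messages about failing
# hardware. This list is an illustrative assumption, not an exhaustive one.
HARDWARE_MARKERS = ("Hardware Error", "Machine check", "EDAC", "I/O error")


def suspicious_lines(log_lines, markers=HARDWARE_MARKERS):
    """Return the log lines containing any marker (case-insensitive)."""
    lowered = [marker.lower() for marker in markers]
    return [
        line.rstrip("\n")
        for line in log_lines
        if any(marker in line.lower() for marker in lowered)
    ]
```

It could be used on a saved log, for example by passing the open file object for /var/log/kern.log or the captured output of dmesg split into lines.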
On 22/08/2020 at 10:11, Julien Rouhaud wrote:
> [...]
> I don't see any obvious problem in that code, and that's something
> that didn't change for a long time so I'm starting to think this could
> be some hardware problem. Do you have any alarming messages in your
> system logs and/or dmesg?

I checked again, but found nothing relevant.

From what you say, there is not much we can do, so I'm going to keep this in mind and wait to see if it happens again...