Discussion: PostgreSQL backend process crashed with signal 9
In case of simultaneously calling a plpythonu stored procedure and locking a
test table, I get a reproducible segfault of the PostgreSQL backend process.

Steps to reproduce:

0) initialize the database schema and data:

$ psql -h localhost -p 5434 -U test < init_schema.sql
$ cat init_schema.sql
DROP TABLE IF EXISTS test;

CREATE TABLE IF NOT EXISTS test (
    id BIGSERIAL PRIMARY KEY,
    amount BIGINT,
    counter BIGINT DEFAULT 0
);
CREATE INDEX ON test (counter) WHERE counter > 0;

INSERT INTO test (amount) VALUES (1);
INSERT INTO test (amount) VALUES (2);
INSERT INTO test (amount) VALUES (3);
INSERT INTO test (amount) VALUES (4);
INSERT INTO test (amount) VALUES (5);

CREATE OR REPLACE FUNCTION test_plpy_postgres_crash(record_id INTEGER)
RETURNS boolean AS $$
select_query = """
    SELECT * FROM test WHERE id = $1
"""
if "select_plan" in GD:
    select_plan = GD["select_plan"]
else:
    select_plan = plpy.prepare(select_query, ["integer"])
    GD["select_plan"] = select_plan

select_cursor = plpy.cursor(select_plan, [record_id])
return True
$$ LANGUAGE plpythonu;

1) open 2 psql client sessions to PostgreSQL 9.5.2:

$ psql -h localhost -p 5434 -U test
$ psql -h localhost -p 5434 -U test

2) enter the following statements in one of the client sessions:

test=# select pg_sleep(1) ; begin; lock test ; select sum(counter) from test ; select pg_sleep(60);
 pg_sleep
----------

(1 row)

BEGIN
LOCK TABLE
 sum
-----
   0
(1 row)

3) after the table lock has been taken successfully, enter the following
statement in the second client session:

test=# SELECT test_plpy_postgres_crash(trunc(random() * 5 + 1)::integer);

4) wait several seconds and press CTRL+C in the second client session

5) the crash of the PostgreSQL backend process occurs:

test=# select pg_sleep(1) ; begin; lock test ; select sum(counter) from test ; select pg_sleep(60);
 pg_sleep
----------

(1 row)

BEGIN
LOCK TABLE
 sum
-----
   0
(1 row)

WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
SSL SYSCALL error: EOF detected
The connection to the server was lost. Attempting reset: Failed.

test=# SELECT test_plpy_postgres_crash(trunc(random() * 5 + 1)::integer);
^CCancel request sent
SSL SYSCALL error: EOF detected
The connection to the server was lost. Attempting reset: Failed.

Dmesg log:

$ dmesg | tail
[1904728.889471] postgres[21138]: segfault at fffffffffffffff0 ip 00007f25462c7724 sp 00007fff96c155a0 error 5 in postgres[7f2545eb9000+5f1000]

PostgreSQL version:

test=# select version();
                                             version
-------------------------------------------------------------------------------------------------
 PostgreSQL 9.5.2 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2, 64-bit
(1 row)

System info:

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 14.04.4 LTS
Release:        14.04
Codename:       trusty

$ uname -a
Linux aromanov 3.13.0-79-generic #123-Ubuntu SMP Fri Feb 19 14:27:58 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

--
Best regards,
Aleksey Romanov
Web Developer, Wargaming.net
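The two-session dance above can also be driven from a single script. A
minimal sketch, assuming the same host/port/user as in the report; the
script name, sleep durations, and the use of `timeout -s INT` (which stands
in for the manual CTRL+C, since psql turns SIGINT into a cancel request)
are illustrative, not part of the original report:

#!/bin/sh
# Hypothetical reproduction driver for the steps above.
PSQL="psql -h localhost -p 5434 -U test"

# Session 1: take the table lock and hold it for 60 seconds.
$PSQL -c "begin; lock test; select sum(counter) from test; select pg_sleep(60); commit;" &

sleep 2   # give session 1 time to acquire the lock

# Session 2: call the function (it blocks behind the lock), then cancel it
# after 5 seconds, mimicking the manual CTRL+C.
timeout -s INT 5 \
    $PSQL -c "SELECT test_plpy_postgres_crash(trunc(random() * 5 + 1)::integer);"

wait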
Aleksey Romanov <drednout.by@gmail.com> writes:
> In case of simultaneously calling a plpythonu stored procedure and locking
> a test table, I get a reproducible segfault of the PostgreSQL backend
> process.

Hm, I can't reproduce that in either HEAD or 9.5.  I get

^CCancel request sent
ERROR:  spiexceptions.QueryCanceled: canceling statement due to user request
CONTEXT:  Traceback (most recent call last):
  PL/Python function "test_plpy_postgres_crash", line 8, in <module>
    select_plan = plpy.prepare(select_query, ["integer"])
PL/Python function "test_plpy_postgres_crash"

			regards, tom lane
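That traceback shows the cancel surfacing inside the function as an ordinary
plpy.spiexceptions exception, so in principle the function itself can trap
it. A sketch against the same schema; the function name and warning text are
made up for illustration, and whether continuing after a trapped cancel is
sensible depends on the application:

CREATE OR REPLACE FUNCTION test_plpy_trap_cancel(record_id INTEGER)
RETURNS boolean AS $$
# SPI errors surface as classes in plpy.spiexceptions, keyed by SQLSTATE,
# exactly as in the traceback above.
from plpy import spiexceptions
try:
    plan = plpy.prepare("SELECT * FROM test WHERE id = $1", ["integer"])
    plpy.cursor(plan, [record_id])
except spiexceptions.QueryCanceled:
    # the statement was cancelled (e.g. CTRL+C) while waiting on the lock
    plpy.warning("query cancelled while waiting for the table lock")
    return False
return True
$$ LANGUAGE plpythonu;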
Pavel Suderevsky <psuderevsky@gmail.com> writes:
> I've faced similar behaviour in a much simpler situation. It crashed on a
> simple nested loop, and all other child postmaster processes were
> terminated likewise.

>> 2016-04-05 18:37:31 MSK LOG:  server process (PID 2848) was terminated by
>> signal 9: Killed

Well, signal 9 is not an internal-to-Postgres error; that's something
outside the database killing the process.  If you're on a Linux machine
it's most likely the dreaded OOM killer doing it.  The usual
recommendation is to configure your system to reduce or eliminate
memory overcommit:

http://www.postgresql.org/docs/9.5/static/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT

The OP's log did not indicate a signal 9, though --- that was SIGSEGV
(sig 11, typically).

			regards, tom lane
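For reference, a sketch of the overcommit settings that page describes; the
ratio value here is only illustrative and needs sizing for the machine:

$ sysctl -w vm.overcommit_memory=2   # fail allocations instead of overcommitting
$ sysctl -w vm.overcommit_ratio=80   # hypothetical value; see the docs for sizing

# to persist the settings across reboots:
$ cat >> /etc/sysctl.conf <<'EOF'
vm.overcommit_memory = 2
vm.overcommit_ratio = 80
EOF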
Pavel Suderevsky <psuderevsky@gmail.com> writes:
> Yes, the OOM killer did the job, but is it normal that such a lightweight
> query consumes so much memory that the OOM killer gets invoked?

[ shrug... ]  The OOM killer is widely considered broken.  Its heuristics
don't interact terribly well with processes using large amounts of shared
memory.

			regards, tom lane
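The same kernel-resources page also describes shielding the postmaster
itself from the OOM killer in its start script. A sketch; the binary path is
an assumption, and the PG_OOM_ADJUST_* variables require 9.4 or later:

#!/bin/sh
# Sketch of the OOM-killer exemption described in the kernel-resources docs;
# must run as root.
echo -1000 > /proc/self/oom_score_adj              # exempt the postmaster itself
export PG_OOM_ADJUST_FILE=/proc/self/oom_score_adj
export PG_OOM_ADJUST_VALUE=0                       # child backends stay killable
exec /usr/lib/postgresql/9.5/bin/postgres -D "$PGDATA"   # assumed path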
Yes, the OOM killer did the job, but is it normal that such a lightweight
query consumes so much memory that the OOM killer gets invoked?

> [dbtest3] mars_gedik=# select pg_size_pretty(pg_relation_size('node_statuses'));
>  pg_size_pretty
> ----------------
>  11 MB
> (1 row)

> [dbtest3] mars_gedik=# select pg_size_pretty(pg_relation_size('node_status_values'));
>  pg_size_pretty
> ----------------
>  11 MB
> (1 row)
> [ shrug... ]  The OOM killer is widely considered broken.  Its heuristics
> don't interact terribly well with processes using large amounts of shared
> memory.

Actually the issue is not the OOM killer invocation, but the amount of
memory that PostgreSQL consumes for a not very heavy operation.

Observing memory:

> [root@dbtest3 ~]# for i in `seq 25`; do free -m && sleep 1 ; done
>               total        used        free      shared  buff/cache   available
> Mem:            993          64         831           3          97         811
> Swap:          1231         181        1050

Starting the SQL:

> [dbtest3] mars_gedik=# SELECT * FROM node_statuses ns INNER JOIN
> node_status_values nsv ON ns.node_id = nsv.node_id where ns.node_id = 1147;

Total memory consumed by the process, including shared memory (according to
the pmap utility):

> [root@dbtest3 ~]# for i in `seq 25`; do pmap 16227 | tail -1 && sleep 1; done
> total 189172K
> total 220264K
> total 438276K
> total 657148K
> total 861352K
> total 1137396K
> total 1320612K
> total 1564020K
> total 1809472K
> total 2038492K

Observing memory once again:

>               total        used        free      shared  buff/cache   available
> Mem:            993         881          62           2          49          18
> Swap:          1231        1113         118

And the OOM killer kicks in.

If I set vm.overcommit_memory=2, then I get:

> [dbtest3] mars_gedik=# SELECT * FROM node_statuses ns INNER JOIN
> node_status_values nsv ON ns.node_id = nsv.node_id where ns.node_id = 1147;
> out of memory for query result

But the process stays alive and the OOM killer is not invoked.

> [root@dbtest3 ~]# for i in `seq 25`; do pmap 16479 | tail -1 && sleep 1; done
> total 189172K
> total 205776K
> total 419796K
> total 644212K
> total 860560K
> total 1110732K
> total 1320480K
> total 484948K
> total 400404K
> total 313236K
> total 189172K
> total 189172K
> total 189172K
> total 189172K
> total 189172K
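Worth noting: "out of memory for query result" is libpq's client-side
message, raised when the client cannot buffer the full result set, which
suggests at least part of the growth here is psql accumulating the rows. A
sketch of bounding the client's buffer instead; the batch size is
illustrative:

-- with FETCH_COUNT set, psql retrieves rows through a cursor in batches
-- instead of buffering the entire result in memory
\set FETCH_COUNT 10000
SELECT * FROM node_statuses ns
INNER JOIN node_status_values nsv ON ns.node_id = nsv.node_id
WHERE ns.node_id = 1147;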
I've faced similar behaviour in a much simpler situation. It crashed on a
simple nested loop, and all other child postmaster processes were terminated
likewise.

Reproduced on:
1. 9.5.1 + Ubuntu 14.04.4 LTS
2. 9.5.1 + CentOS 7.2.1511
3. 9.5.2 + CentOS 7.2.1511

DDL:

> platform_db=# \d node_statuses
>            Table "platform_db.node_statuses"
>         Column         |          Type           |            Modifiers
> -----------------------+-------------------------+---------------------------------
>  node_id               | integer                 | not null
>  status_id             | integer                 | not null
>  msg                   | character varying(1024) | default NULL::character varying
>  node_status_timestamp | bigint                  | not null
> Indexes:
>     "NODE_STATUSES_NODE_ID_FK" btree (node_id)
>     "NODE_STATUSES_NODE_ID_TIMESTAMP_IDX" btree (node_id, node_status_timestamp)
>     "NODE_STATUSES_TIMESTAMP_IDX" btree (node_status_timestamp)
>
> platform_db=# \d node_status_values
>       Table "platform_db.node_status_values"
>         Column        |  Type   | Modifiers
> ----------------------+---------+-----------
>  node_id              | integer | not null
>  status_id            | integer | not null
>  node_value_timestamp | bigint  | not null
> Indexes:
>     "NODE_STATUS_VALUES_NODE_ID_FK" btree (node_id)
>     "NODE_STATUS_VALUES_NODE_ID_TIMESTAMP_IDX" btree (node_id, node_value_timestamp)

SQL:

> SELECT * FROM node_statuses ns INNER JOIN node_status_values nsv ON
> ns.node_id = nsv.node_id where ns.node_id = 1147;

Postgres log:

> 2016-04-05 18:37:31 MSK LOG:  server process (PID 2848) was terminated by signal 9: Killed
> 2016-04-05 18:37:31 MSK DETAIL:  Failed process was running: COMMIT
> 2016-04-05 18:37:32 MSK LOG:  terminating any other active server processes
> 2016-04-05 18:37:32 MSK WARNING:  terminating connection because of crash of another server process
> 2016-04-05 18:37:32 MSK DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> 2016-04-05 18:37:32 MSK HINT:  In a moment you should be able to reconnect to the database and repeat your command.
> 2016-04-05 18:37:32 MSK 127.0.0.1 platform_db@platform_db [unknown] % WARNING:  terminating connection because of crash of another server process
> 2016-04-05 18:37:32 MSK 127.0.0.1 platform_db@platform_db [unknown] % DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> 2016-04-05 18:37:32 MSK 127.0.0.1 platform_db@platform_db [unknown] % HINT:  In a moment you should be able to reconnect to the database and repeat your command.
> 2016-04-05 18:37:32 MSK ::1 zabbix@zabbix [unknown] % WARNING:  terminating connection because of crash of another server process
> 2016-04-05 18:37:32 MSK ::1 zabbix@zabbix [unknown] % DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> 2016-04-05 18:37:32 MSK ::1 zabbix@zabbix [unknown] % HINT:  In a moment you should be able to reconnect to the database and repeat your command.
> 2016-04-05 18:37:37 MSK LOG:  statistics collector process (PID 1012) was terminated by signal 9: Killed
> 2016-04-05 18:37:37 MSK LOG:  all server processes terminated; reinitializing
> 2016-04-05 18:37:39 MSK LOG:  database system was interrupted; last known up at 2016-04-05 18:31:44 MSK
> 2016-04-05 18:37:39 MSK ::1 zabbix@zabbix [unknown] % FATAL:  the database system is in recovery mode
> 2016-04-05 18:37:41 MSK LOG:  database system was not properly shut down; automatic recovery in progress
> 2016-04-05 18:37:41 MSK LOG:  redo starts at 298/AB146828
> 2016-04-05 18:37:41 MSK ::1 zabbix@zabbix [unknown] % FATAL:  the database system is in recovery mode
> 2016-04-05 18:37:44 MSK [local] postgres@platform_db [unknown] % FATAL:  the database system is in recovery mode
> 2016-04-05 18:37:46 MSK LOG:  invalid record length at 298/C07ACA78
> 2016-04-05 18:37:46 MSK LOG:  redo done at 298/C07ACA50
> 2016-04-05 18:37:46 MSK LOG:  last completed transaction was at log time 2016-04-05 18:37:28.455638+03
> 2016-04-05 18:37:48 MSK ::1 zabbix@zabbix [unknown] % FATAL:  the database system is in recovery mode
> 2016-04-05 18:38:08 MSK LOG:  MultiXact member wraparound protections are now enabled
> 2016-04-05 18:38:08 MSK LOG:  autovacuum launcher started
> 2016-04-05 18:38:08 MSK LOG:  database system is ready to accept connections

In CentOS, just:

> 2016-04-06 13:58:58 GMT [local] postgres@platform_db psql % LOG:  could not send data to client: Broken pipe
> 2016-04-06 13:58:58 GMT [local] postgres@platform_db psql % STATEMENT:  SELECT * FROM node_statuses ns INNER JOIN node_status_values nsv ON ns.node_id = nsv.node_id where ns.node_id = 1147;
> 2016-04-06 13:58:58 GMT [local] postgres@platform_db psql % FATAL:  connection to client lost

System log [ubuntu]:

> Apr 5 18:37:30 db_server kernel: [3905852.822122] postgres invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
> Apr 5 18:37:30 db_server kernel: [3905852.822131] postgres cpuset=/ mems_allowed=0-1
> Apr 5 18:37:30 db_server kernel: [3905852.822136] CPU: 3 PID: 9191 Comm: postgres Not tainted 3.13.0-77-generic #121-Ubuntu
> Apr 5 18:37:30 db_server kernel: [3905852.822139] Hardware name: Supermicro H8DMR-82/H8DMR-82, BIOS 080014 12/15/2006
> ...
> Apr 5 18:37:31 db_server kernel: [3905854.860083] Out of memory: Kill process 2847 (postgres) score 391 or sacrifice child
> Apr 5 18:37:31 db_server kernel: [3905854.860201] Killed process 2847 (postgres) total-vm:6613344kB, anon-rss:34224kB, file-rss:6387036kB

Counts:

> platform_db=# SELECT count(*) FROM node_statuses ns where node_id = 1147;
>  count
> -------
>   6819
> (1 row)

> platform_db=# SELECT count(*) FROM node_status_values nsv where node_id = 1147;
>  count
> -------
>   6814
> (1 row)

> platform_db=# SELECT count(*) FROM node_statuses ns INNER JOIN
> node_status_values nsv ON ns.node_id = nsv.node_id where ns.node_id = 1147;
>   count
> ----------
>  46423776
> (1 row)

Query plan:

> Nested Loop  (cost=477.14..570040.76 rows=45261752 width=53) (actual time=3.223..252614.947 rows=46382904 loops=1)
>   ->  Bitmap Heap Scan on node_statuses ns  (cost=241.60..2336.38 rows=6862 width=37) (actual time=1.724..21.248 rows=6813 loops=1)
>         Recheck Cond: (node_id = 1147)
>         Heap Blocks: exact=1397
>         ->  Bitmap Index Scan on "NODE_STATUSES_NODE_ID_FK"  (cost=0.00..239.89 rows=6862 width=0) (actual time=1.472..1.472 rows=6813 loops=1)
>               Index Cond: (node_id = 1147)
>   ->  Materialize  (cost=235.54..1948.97 rows=6596 width=16) (actual time=0.002..12.092 rows=6808 loops=6813)
>         ->  Bitmap Heap Scan on node_status_values nsv  (cost=235.54..1915.99 rows=6596 width=16) (actual time=1.471..17.282 rows=6808 loops=1)
>               Recheck Cond: (node_id = 1147)
>               Heap Blocks: exact=1357
>               ->  Bitmap Index Scan on "NODE_STATUS_VALUES_NODE_ID_FK"  (cost=0.00..233.89 rows=6596 width=0) (actual time=1.238..1.238 rows=6808 loops=1)
>                     Index Cond: (node_id = 1147)
> Planning time: 0.210 ms
> Execution time: 334897.948 ms

Memory on the host:

>              total       used       free     shared    buffers     cached
> Mem:           15G        15G       453M       2.7G       195M        12G
> -/+ buffers/cache:       2.2G        13G
> Swap:           0B         0B         0B
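Those counts explain the blow-up: the join condition matches on node_id
alone, so every node_statuses row for the node pairs with every
node_status_values row. A quick check against the numbers above:

    6813 * 6808 rows           =  46,382,904  (exactly the Nested Loop's actual rows)
    46,382,904 rows * 53 bytes ~= 2.3 GiB of raw tuple data (plan width=53)

which is consistent with the roughly 2 GB of growth visible in the pmap
output earlier in the thread.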
On Wed, 6 Apr 2016 21:22:23 +0300
Pavel Suderevsky <psuderevsky@gmail.com> wrote:

> > [ shrug... ]  The OOM killer is widely considered broken.  Its heuristics
> > don't interact terribly well with processes using large amounts of shared
> > memory.
>
> Actually the issue is not the OOM killer invocation, but the amount of
> memory that PostgreSQL consumes for a not very heavy operation.

I suggest you open a new bug for this. The original report was for a SEGV,
which should never happen. This report is for excessive memory growth
leading to OOM. Both cause a crash, but they are different issues, and tying
them together may delay solving either of them.

-dg

--
David Gould              510 282 0869          daveg@sonic.net
If simplicity worked, the world would be overrun with insects.