Thread: PostgreSQL backend process crashed with signal 9


PostgreSQL backend process crashed with signal 9

From:
Aleksey Romanov
Date:
When a plpythonu stored procedure is called while the test table is locked by
another session, I get a reproducible segfault of the PostgreSQL backend process.

Steps to reproduce:
0) initialize database schema and data:
$ psql -h localhost -p 5434 -U test < init_schema.sql
$ cat init_schema.sql
DROP TABLE IF EXISTS test;
CREATE TABLE IF NOT EXISTS test
(
    id BIGSERIAL PRIMARY KEY,
    amount BIGINT,
    counter BIGINT DEFAULT 0
);
CREATE INDEX ON test (counter) where counter > 0;
INSERT INTO test (amount) VALUES (1);
INSERT INTO test (amount) VALUES (2);
INSERT INTO test (amount) VALUES (3);
INSERT INTO test (amount) VALUES (4);
INSERT INTO test (amount) VALUES (5);


CREATE OR REPLACE FUNCTION test_plpy_postgres_crash(record_id INTEGER)
    RETURNS boolean
AS $$
    select_query = """
        SELECT * FROM test WHERE id = $1
    """
    if "select_plan" in GD:
        select_plan = GD["select_plan"]
    else:
        select_plan = plpy.prepare(select_query, ["integer"])
        GD["select_plan"] = select_plan

    select_cursor = plpy.cursor(select_plan, [record_id])
    return True
$$ LANGUAGE plpythonu;
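
(For context only, and not needed to reproduce the crash: in normal use the
cursor returned by plpy.cursor() would be consumed inside the function before
it returns, roughly like this sketch:

    rows = select_cursor.fetch(1)   # plpy cursors provide fetch(n) and close()
    select_cursor.close()

Here the function returns True without ever fetching, and the backend still
crashes.)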


1) open 2 psql client sessions to PostgreSQL 9.5.2:
$ psql -h localhost -p 5434 -U test
$ psql -h localhost -p 5434 -U test

2) enter the following statement in one of the client sessions:
test=# select pg_sleep(1) ; begin;  lock test ; select sum(counter) from
test ; select pg_sleep(60);
 pg_sleep
----------

(1 row)

BEGIN
LOCK TABLE
 sum
-----
   0
(1 row)

3) after the table lock has been taken successfully, enter the following
statement in the second client session:
test=# SELECT test_plpy_postgres_crash(trunc(random() * 5 + 1)::integer);

4) wait several seconds and press CTRL+C in the second client session
5) the PostgreSQL backend process crash occurs:
test=# select pg_sleep(1) ; begin;  lock test ; select sum(counter) from
test ; select pg_sleep(60);
 pg_sleep
----------

(1 row)

BEGIN
LOCK TABLE
 sum
-----
   0
(1 row)

WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
SSL SYSCALL error: EOF detected
The connection to the server was lost. Attempting reset: Failed.



test=# SELECT test_plpy_postgres_crash(trunc(random() * 5 + 1)::integer);
^CCancel request sent
SSL SYSCALL error: EOF detected
The connection to the server was lost. Attempting reset: Failed.


Dmesg log:
$ dmesg| tail
[1904728.889471] postgres[21138]: segfault at fffffffffffffff0 ip
00007f25462c7724 sp 00007fff96c155a0 error 5 in
postgres[7f2545eb9000+5f1000]
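
A backtrace could be captured roughly along these lines (the binary path
assumes the stock Ubuntu package layout; adjust as needed):

$ ulimit -c unlimited   # must be in effect for the environment the postmaster starts in
$ # reproduce the crash, then open the core file:
$ gdb /usr/lib/postgresql/9.5/bin/postgres /path/to/core
(gdb) bt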


PostgreSQL version:
test=# select version();
                                             version

-------------------------------------------------------------------------------------------------
 PostgreSQL 9.5.2 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
4.8.2-19ubuntu1) 4.8.2, 64-bit
(1 row)

System info:
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 14.04.4 LTS
Release:        14.04
Codename:       trusty
$ uname -a
Linux aromanov 3.13.0-79-generic #123-Ubuntu SMP Fri Feb 19 14:27:58 UTC
2016 x86_64 x86_64 x86_64 GNU/Linux


--
Best regards,
Aleksey Romanov
Web Developer, Wargaming.net

Re: PostgreSQL backend process crashed with signal 9

From:
Tom Lane
Date:
Aleksey Romanov <drednout.by@gmail.com> writes:
> When a plpythonu stored procedure is called while the test table is locked by
> another session, I get a reproducible segfault of the PostgreSQL backend process.

Hm, I can't reproduce that in either HEAD or 9.5.  I get

^CCancel request sent
ERROR:  spiexceptions.QueryCanceled: canceling statement due to user request
CONTEXT:  Traceback (most recent call last):
  PL/Python function "test_plpy_postgres_crash", line 8, in <module>
    select_plan = plpy.prepare(select_query, ["integer"])
PL/Python function "test_plpy_postgres_crash"

            regards, tom lane

Re: PostgreSQL backend process crashed with signal 9

From:
Tom Lane
Date:
Pavel Suderevsky <psuderevsky@gmail.com> writes:
> I've faced similar behaviour in a much simpler situation. It crashed on a
> simple nested loop, and all other child postmaster processes were terminated
> as well.

>> 2016-04-05 18:37:31 MSK  LOG:  server process (PID 2848) was terminated by
>> signal 9: Killed

Well, signal 9 is not an internal-to-Postgres error; that's something
outside the database killing the process.  If you're on a Linux machine
it's most likely the dreaded OOM killer doing it.  The usual
recommendation is to configure your system to reduce or eliminate
memory overcommit:

http://www.postgresql.org/docs/9.5/static/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT
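
For instance, something along these lines (illustrative values; see the page
above for the trade-offs):

    # /etc/sysctl.conf, then apply with "sysctl -p"
    vm.overcommit_memory = 2
    vm.overcommit_ratio = 80   # percent of RAM counted toward the commit limit, on top of swap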

The OP's log did not indicate a signal 9, though --- that was SIGSEGV
(sig 11, typically).

            regards, tom lane

Re: PostgreSQL backend process crashed with signal 9

From:
Tom Lane
Date:
Pavel Suderevsky <psuderevsky@gmail.com> writes:
> Yes, the OOM killer did the job, but is it normal that such a lightweight
> query consumes so much memory that the OOM killer gets invoked?

[ shrug... ]  The OOM killer is widely considered broken.  Its heuristics
don't interact terribly well with processes using large amounts of shared
memory.
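
One common mitigation (also covered on the kernel-resources page linked
earlier in this thread) is to shield the postmaster itself from the OOM killer
from its start script, for example:

    # in the postmaster start script, before exec'ing postgres
    echo -1000 > /proc/self/oom_score_adj

(Child backends inherit that value unless the server is configured to reset it
for them.)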

            regards, tom lane

Re: PostgreSQL backend process crashed with signal 9

From:
Pavel Suderevsky
Date:
Yes, the OOM killer did the job, but is it normal that such a lightweight
query consumes so much memory that the OOM killer gets invoked?

> [dbtest3] mars_gedik=# select
> pg_size_pretty(pg_relation_size('node_statuses'));
>  pg_size_pretty
> ----------------
>  11 MB
> (1 row)
> [dbtest3] mars_gedik=# select
> pg_size_pretty(pg_relation_size('node_status_values'));
>  pg_size_pretty
> ----------------
>  11 MB
> (1 row)



2016-04-06 17:46 GMT+03:00 Tom Lane <tgl@sss.pgh.pa.us>:

> Pavel Suderevsky <psuderevsky@gmail.com> writes:
> > I've faced similar behaviour in a much simpler situation. It crashed on a
> > simple nested loop, and all other child postmaster processes were
> > terminated as well.
>
> >> 2016-04-05 18:37:31 MSK  LOG:  server process (PID 2848) was terminated by
> >> signal 9: Killed
>
> Well, signal 9 is not an internal-to-Postgres error; that's something
> outside the database killing the process.  If you're on a Linux machine
> it's most likely the dreaded OOM killer doing it.  The usual
> recommendation is to configure your system to reduce or eliminate
> memory overcommit:
>
>
> http://www.postgresql.org/docs/9.5/static/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT
>
> The OP's log did not indicate a signal 9, though --- that was SIGSEGV
> (sig 11, typically).
>
>                         regards, tom lane
>

Re: PostgreSQL backend process crashed with signal 9

From:
Pavel Suderevsky
Date:
>
> [ shrug... ]  The OOM killer is widely considered broken.  Its heuristics
> don't interact terribly well with processes using large amounts of shared
> memory.


Actually, the issue is not the OOM killer invocation itself but the amount of
memory that PostgreSQL consumes for a not very heavy operation.

Observing Memory:

> [root@dbtest3 ~]# for i in `seq 25`; do free -m && sleep 1 ; done
>               total        used        free      shared  buff/cache   available
> Mem:            993          64         831           3          97         811
> Swap:          1231         181        1050


Starting SQL:

> [dbtest3] mars_gedik=# SELECT * FROM node_statuses ns INNER JOIN
> node_status_values nsv ON ns.node_id = nsv.node_id where ns.node_id = 1147;
>


Total memory consumed by the process, including shared memory (according to
the pmap utility):

> [root@dbtest3 ~]# for i in `seq 25`; do pmap 16227 | tail -1 && sleep 1;
> done
>  total           189172K
>  total           220264K
>  total           438276K
>  total           657148K
>  total           861352K
>  total          1137396K
>  total          1320612K
>  total          1564020K
>  total          1809472K
>  total          2038492K


Observing memory once again:

>                total        used        free      shared  buff/cache   available
> Mem:            993         881          62           2          49          18
> Swap:          1231        1113         118


And then the OOM killer kicks in.

If I set vm.overcommit_memory=2 then I get:

> [dbtest3] mars_gedik=# SELECT * FROM node_statuses ns INNER JOIN
> node_status_values nsv ON ns.node_id = nsv.node_id where ns.node_id = 1147;
> out of memory for query result

But the process stays alive and the OOM killer hasn't been invoked.

> [root@dbtest3 ~]# for i in `seq 25`; do pmap 16479 | tail -1 && sleep 1;
> done
>  total           189172K
>  total           205776K
>  total           419796K
>  total           644212K
>  total           860560K
>  total          1110732K
>  total          1320480K
>  total           484948K
>  total           400404K
>  total           313236K
>  total           189172K
>  total           189172K
>  total           189172K
>  total           189172K
>  total           189172K


2016-04-06 18:00 GMT+03:00 Tom Lane <tgl@sss.pgh.pa.us>:

> Pavel Suderevsky <psuderevsky@gmail.com> writes:
> > Yes, the OOM killer did the job, but is it normal that such a lightweight
> > query consumes so much memory that the OOM killer gets invoked?
>
> [ shrug... ]  The OOM killer is widely considered broken.  Its heuristics
> don't interact terribly well with processes using large amounts of shared
> memory.
>
>                         regards, tom lane
>

Re: PostgreSQL backend process crashed with signal 9

From:
Pavel Suderevsky
Date:
I've faced similar behaviour in a much simpler situation. It crashed on a
simple nested loop, and all other child postmaster processes were terminated
as well.

Reproduced on:
1. 9.5.1 + Ubuntu 14.04.4 LTS
2. 9.5.1 + CentOS 7.2.1511
3. 9.5.2 + CentOS 7.2.1511

DDL:

> platform_db=# \d node_statuses
>                          Table "platform_db.node_statuses"
>         Column         |          Type           |            Modifiers
> -----------------------+-------------------------+----------------------------------
>  node_id               | integer                 | not null
>  status_id             | integer                 | not null
>  msg                   | character varying(1024) | default NULL::character varying
>  node_status_timestamp | bigint                  | not null
> Indexes:
>     "NODE_STATUSES_NODE_ID_FK" btree (node_id)
>     "NODE_STATUSES_NODE_ID_TIMESTAMP_IDX" btree (node_id, node_status_timestamp)
>     "NODE_STATUSES_TIMESTAMP_IDX" btree (node_status_timestamp)
>
> platform_db=# \d node_status_values
>    Table "platform_db.node_status_values"
>         Column        |  Type   | Modifiers
> ----------------------+---------+-----------
>  node_id              | integer | not null
>  status_id            | integer | not null
>  node_value_timestamp | bigint  | not null
> Indexes:
>     "NODE_STATUS_VALUES_NODE_ID_FK" btree (node_id)
>     "NODE_STATUS_VALUES_NODE_ID_TIMESTAMP_IDX" btree (node_id, node_value_timestamp)

SQL:

> SELECT * FROM node_statuses ns INNER JOIN node_status_values nsv ON
> ns.node_id = nsv.node_id where ns.node_id = 1147;


Postgres log:

> 2016-04-05 18:37:31 MSK  LOG:  server process (PID 2848) was terminated by
> signal 9: Killed
> 2016-04-05 18:37:31 MSK  DETAIL:  Failed process was running: COMMIT
> 2016-04-05 18:37:32 MSK  LOG:  terminating any other active server
> processes
> 2016-04-05 18:37:32 MSK  WARNING:  terminating connection because of crash
> of another server process
> 2016-04-05 18:37:32 MSK  DETAIL:  The postmaster has commanded this server
> process to roll back the current transaction and exit, because another
> server process exited abnormally and possibly corrupted shared memory.
> 2016-04-05 18:37:32 MSK  HINT:  In a moment you should be able to
> reconnect to the database and repeat your command.
> 2016-04-05 18:37:32 MSK 127.0.0.1 platform_db@platform_db [unknown] %
> WARNING:  terminating connection because of crash of another server process
> 2016-04-05 18:37:32 MSK 127.0.0.1 platform_db@platform_db [unknown] %
> DETAIL:  The postmaster has commanded this server process to roll back the
> current transaction and exit, because another
> server process exited abnormally and possibly corrupted shared memory.
> 2016-04-05 18:37:32 MSK 127.0.0.1 platform_db@platform_db [unknown] %
> HINT:  In a moment you should be able to reconnect to the database and
> repeat your command.
> 2016-04-05 18:37:32 MSK ::1 zabbix@zabbix [unknown] % WARNING:
>  terminating connection because of crash of another server process
> 2016-04-05 18:37:32 MSK ::1 zabbix@zabbix [unknown] % DETAIL:  The
> postmaster has commanded this server process to roll back the current
> transaction and exit, because another server process
>  exited abnormally and possibly corrupted shared memory.
> 2016-04-05 18:37:32 MSK ::1 zabbix@zabbix [unknown] % HINT:  In a moment
> you should be able to reconnect to the database and repeat your command.
> 2016-04-05 18:37:32 MSK ::1 zabbix@zabbix [unknown] % WARNING:
>  terminating connection because of crash of another server process
> 2016-04-05 18:37:32 MSK ::1 zabbix@zabbix [unknown] % DETAIL:  The
> postmaster has commanded this server process to roll back the current
> transaction and exit, because another server process
>  exited abnormally and possibly corrupted shared memory.
> 2016-04-05 18:37:37 MSK  LOG:  statistics collector process (PID 1012) was
> terminated by signal 9: Killed
> 2016-04-05 18:37:37 MSK  LOG:  all server processes terminated;
> reinitializing
> 2016-04-05 18:37:39 MSK  LOG:  database system was interrupted; last known
> up at 2016-04-05 18:31:44 MSK
> 2016-04-05 18:37:39 MSK ::1 zabbix@zabbix [unknown] % FATAL:  the
> database system is in recovery mode
> 2016-04-05 18:37:41 MSK  LOG:  database system was not properly shut down;
> automatic recovery in progress
> 2016-04-05 18:37:41 MSK  LOG:  redo starts at 298/AB146828
> 2016-04-05 18:37:41 MSK ::1 zabbix@zabbix [unknown] % FATAL:  the
> database system is in recovery mode
> 2016-04-05 18:37:44 MSK [local] postgres@platform_db [unknown] % FATAL:
>  the database system is in recovery mode
> 2016-04-05 18:37:46 MSK  LOG:  invalid record length at 298/C07ACA78
> 2016-04-05 18:37:46 MSK  LOG:  redo done at 298/C07ACA50
> 2016-04-05 18:37:46 MSK  LOG:  last completed transaction was at log time
> 2016-04-05 18:37:28.455638+03
> 2016-04-05 18:37:48 MSK ::1 zabbix@zabbix [unknown] % FATAL:  the
> database system is in recovery mode
> 2016-04-05 18:38:08 MSK  LOG:  MultiXact member wraparound protections are
> now enabled
> 2016-04-05 18:38:08 MSK  LOG:  autovacuum launcher started
> 2016-04-05 18:38:08 MSK  LOG:  database system is ready to accept
> connections

On CentOS, just:

> 2016-04-06 13:58:58 GMT [local] postgres@platform_db psql % LOG:  could
> not send data to client: Broken pipe
> 2016-04-06 13:58:58 GMT [local] postgres@platform_db psql % STATEMENT:
>  SELECT * FROM node_statuses ns INNER JOIN node_status_values nsv ON
> ns.node_id = nsv.node_id where ns.node_id = 1147;
> 2016-04-06 13:58:58 GMT [local] postgres@platform_db psql % FATAL:
>  connection to client lost


System log [ubuntu]:

> Apr  5 18:37:30 db_server kernel: [3905852.822122] postgres invoked
> oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
> Apr  5 18:37:30 db_server kernel: [3905852.822131] postgres cpuset=/
> mems_allowed=0-1
> Apr  5 18:37:30 db_server kernel: [3905852.822136] CPU: 3 PID: 9191 Comm:
> postgres Not tainted 3.13.0-77-generic #121-Ubuntu
> Apr  5 18:37:30 db_server kernel: [3905852.822139] Hardware name:
> Supermicro H8DMR-82/H8DMR-82, BIOS 080014  12/15/2006
> ...
> Apr  5 18:37:31 db_server kernel: [3905854.860083] Out of memory: Kill
> process 2847 (postgres) score 391 or sacrifice child
> Apr  5 18:37:31 db_server kernel: [3905854.860201] Killed process 2847
> (postgres) total-vm:6613344kB, anon-rss:34224kB, file-rss:6387036kB


Counts:

> platform_db=# SELECT count(*) FROM node_statuses ns where node_id = 1147;
>  count
> -------
>   6819
> (1 row)
> platform_db=# SELECT count(*) FROM node_status_values nsv where node_id =
> 1147;
>  count
> -------
>   6814
> (1 row)
> platform_db=# SELECT count(*) FROM node_statuses ns INNER JOIN
> node_status_values nsv ON ns.node_id = nsv.node_id where ns.node_id = 1147;
>
>   count
> ----------
>  46423776
> (1 row)
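
That last number is roughly what you would expect: joining only on node_id
pairs every matching node_statuses row with every matching node_status_values
row, i.e. about 6819 * 6814 ≈ 46.5 million rows, in line with the ~46.4 million
actually returned. A quick sanity check (illustrative, not from the session
above):

    SELECT (SELECT count(*) FROM node_statuses       WHERE node_id = 1147)
         * (SELECT count(*) FROM node_status_values  WHERE node_id = 1147);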


Query plan:

> Nested Loop  (cost=477.14..570040.76 rows=45261752 width=53) (actual time=3.223..252614.947 rows=46382904 loops=1)
>    ->  Bitmap Heap Scan on node_statuses ns  (cost=241.60..2336.38 rows=6862 width=37) (actual time=1.724..21.248 rows=6813 loops=1)
>          Recheck Cond: (node_id = 1147)
>          Heap Blocks: exact=1397
>          ->  Bitmap Index Scan on "NODE_STATUSES_NODE_ID_FK"  (cost=0.00..239.89 rows=6862 width=0) (actual time=1.472..1.472 rows=6813 loops=1)
>                Index Cond: (node_id = 1147)
>    ->  Materialize  (cost=235.54..1948.97 rows=6596 width=16) (actual time=0.002..12.092 rows=6808 loops=6813)
>          ->  Bitmap Heap Scan on node_status_values nsv  (cost=235.54..1915.99 rows=6596 width=16) (actual time=1.471..17.282 rows=6808 loops=1)
>                Recheck Cond: (node_id = 1147)
>                Heap Blocks: exact=1357
>                ->  Bitmap Index Scan on "NODE_STATUS_VALUES_NODE_ID_FK"  (cost=0.00..233.89 rows=6596 width=0) (actual time=1.238..1.238 rows=6808 loops=1)
>                      Index Cond: (node_id = 1147)
>  Planning time: 0.210 ms
>  Execution time: 334897.948 ms


Memory on the host:

>              total       used       free     shared    buffers     cached
> Mem:           15G        15G       453M       2.7G       195M        12G
> -/+ buffers/cache:       2.2G        13G
> Swap:           0B         0B         0B

Re: PostgreSQL backend process crashed with signal 9

From:
David Gould
Date:
On Wed, 6 Apr 2016 21:22:23 +0300
Pavel Suderevsky <psuderevsky@gmail.com> wrote:

> >
> > [ shrug... ]  The OOM killer is widely considered broken.  Its heuristics
> > don't interact terribly well with processes using large amounts of shared
> > memory.
>
>
> Actually, the issue is not the OOM killer invocation itself but the amount of
> memory that PostgreSQL consumes for a not very heavy operation.

I suggest you open a new bug for this. The original report was for a SEGV,
which should never happen. This report is for excessive memory growth leading
to OOM. Both cause a crash, but they are different issues and tying them
together may delay solving either of them.

-dg

--
David Gould              510 282 0869         daveg@sonic.net
If simplicity worked, the world would be overrun with insects.