Thread: BUG #13490: Segmentation fault on pg_stat_activity


BUG #13490: Segmentation fault on pg_stat_activity

From: michael@bommaritollc.com
Date:
The following bug has been logged on the website:

Bug reference:      13490
Logged by:          Michael Bommarito
Email address:      michael@bommaritollc.com
PostgreSQL version: 9.5alpha1
Operating system:   Ubuntu 14.04.2 LTS
Description:

Environment: Up-to-date instance running Trusty and using apt.postgresql.org
postgresql-9.5-alpha1.  Database was upgraded using pg_dump/restore.  All
other application behavior as expected.

Not sure if this is related to #13286 or #12918, but if so, it is still
affecting the binaries in apt.  It's pretty critical, as we are flying
blind without access to this view.

Crash example 1:
2015-07-07 16:42:45 UTC [26785-2] LOG:  server process (PID 11836) was
terminated by signal 11: Segmentation fault
2015-07-07 16:42:45 UTC [26785-3] DETAIL:  Failed process was running:
select * from pg_stat_activity ;

Crash example 2:
https://github.com/ankane/pghero/issues/30

Re: BUG #13490: Segmentation fault on pg_stat_activity

From: Michael Paquier
Date:
On Wed, Jul 8, 2015 at 1:52 AM,  <michael@bommaritollc.com> wrote:
> Environment: Up-to-date instance running Trusty and using apt.postgresql.org
> postgresql-9.5-alpha1.  Database was upgraded using pg_dump/restore.  All
> other application behavior as expected.
>
> Not sure if this is related to #13286 or #12918, but if so, still affecting
> the binaries in apt.  Pretty critical to fly blind without access to this.
>
> Crash example 1:
> 2015-07-07 16:42:45 UTC [26785-2] LOG:  server process (PID 11836) was
> terminated by signal 11: Segmentation fault
> 2015-07-07 16:42:45 UTC [26785-3] DETAIL:  Failed process was running:
> select * from pg_stat_activity ;
>
> Crash example 2:
> https://github.com/ankane/pghero/issues/30

If this link goes away we would lose the test case, so it is better to
add it directly in the email:
SELECT pid, state, application_name AS source, age(now(), xact_start)
AS duration, waiting, query, xact_start AS started_at FROM
pg_stat_activity WHERE query <> '<insufficient privilege>' AND
state <> 'idle' AND pid <> pg_backend_pid() AND now() - query_start >
interval '60 seconds' ORDER BY query_start DESC;

This seems unrelated to #13286, which refers to pg_terminate_backend()
crashing, and in case 1 presented above the query used for the crash is
simpler. It may, however, be related to #12918, so would it be possible
to get at least a stack trace showing which code path is failing?
--
Michael

Re: BUG #13490: Segmentation fault on pg_stat_activity

From: Michael Paquier
Date:
On Wed, Jul 8, 2015 at 10:22 AM, Michael Bommarito
<michael@bommaritollc.com> wrote:
>   I'll try to generate an isolated core dump and stack trace this weekend
> when the system is not in use.

That would be nice. I have let pgbench -C run for one hour with select
* from pg_stat_activity running every second (\watch 1) in parallel
but I could not reproduce the issue on HEAD.
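
A rough sketch of that reproduction attempt, for anyone who wants to retry
it (the scale factor, client count and database name below are illustrative
assumptions, not the exact values used):

# one session hammering the server with reconnecting clients
$ createdb bench
$ pgbench -i -s 10 bench
$ pgbench -C -c 8 -T 3600 bench

# a second session polling pg_stat_activity once per second
$ psql -d bench
bench=# SELECT * FROM pg_stat_activity \watch 1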
--
Michael

Re: BUG #13490: Segmentation fault on pg_stat_activity

From: Michael Bommarito
Date:
I'll try to generate an isolated core dump and stack trace this weekend
when the system is not in use.

  I also confirmed that the earlier patch
2897e069c1f4309a237a1fa4bf5d2fcd9c4b9868 for 12918 is applied in our
version, so either the patch didn't handle all cases or it's unrelated.
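
One way to double-check that from a git checkout (a sketch; the
REL9_5_ALPHA1 tag name is an assumption about how the alpha release is
tagged in the PostgreSQL repository):

$ git clone https://git.postgresql.org/git/postgresql.git && cd postgresql
$ git merge-base --is-ancestor 2897e069c1f4309a237a1fa4bf5d2fcd9c4b9868 REL9_5_ALPHA1 \
      && echo "fix for #12918 is included in 9.5alpha1"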

Thanks,
Michael J. Bommarito II, CEO
Bommarito Consulting, LLC
*Web:* http://www.bommaritollc.com
*Mobile:* +1 (646) 450-3387

On Tue, Jul 7, 2015 at 9:17 PM, Michael Paquier <michael.paquier@gmail.com>
wrote:

> On Wed, Jul 8, 2015 at 1:52 AM,  <michael@bommaritollc.com> wrote:
> > Environment: Up-to-date instance running Trusty and using
> apt.postgresql.org
> > postgresql-9.5-alpha1.  Database was upgraded using pg_dump/restore.  All
> > other application behavior as expected.
> >
> > Not sure if this is related to #13286 or #12918, but if so, still
> affecting
> > the binaries in apt.  Pretty critical to fly blind without access to
> this.
> >
> > Crash example 1:
> > 2015-07-07 16:42:45 UTC [26785-2] LOG:  server process (PID 11836) was
> > terminated by signal 11: Segmentation fault
> > 2015-07-07 16:42:45 UTC [26785-3] DETAIL:  Failed process was running:
> > select * from pg_stat_activity ;
> >
> > Crash example 2:
> > https://github.com/ankane/pghero/issues/30
>
> If this link goes away we would lose the test case, so it is better to
> add it directly in the email:
> SELECT pid, state, application_name AS source, age(now(), xact_start)
> AS duration, waiting, query, xact_start AS started_at FROM
> pg_stat_activity WHERE query <> '<insufficient privilege>' AND
> state <> 'idle' AND pid <> pg_backend_pid() AND now() - query_start >
> interval '60 seconds' ORDER BY query_start DESC;
>
> This seems unrelated to #13286 which refers to pg_terminate_backend()
> crashing, and in the case 1 presented above the query used for the
> crash is more simple. Now this may be related to #12918, so could it
> be possible to see at least a stack trace to see what is the code path
> failing?
> --
> Michael
>

Re: BUG #13490: Segmentation fault on pg_stat_activity

From: Michael Paquier
Date:
On Sat, Jul 11, 2015 at 10:07 PM, Michael Bommarito <
michael@bommaritollc.com> wrote:

>   Here is the offending query and gdb session/stacktrace output.  Please
> let me know if we can provide anything else from gdb or logs that can be
> anonymized.
>

Thanks.


> 2015-07-11 12:57:41 UTC [12803-7] LOG:  server process (PID 20696) was
> terminated by signal 11: Segmentation fault
> 2015-07-11 12:57:41 UTC [12803-8] DETAIL:  Failed process was running:
> SELECT COUNT(*) FROM pg_stat_activity WHERE pid <> pg_backend_pid()
>

12803 is the PID of the parent postgres process, right?

[...]
> Core was generated by `postgres: postgres databasename 127.0.0.1(42063)
> BIND                             '.
>

Interesting. So you are using the extended query protocol.


> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  get_tle_by_resno (tlist=0x7fd0d5da27c0, resno=resno@entry=6) at
> /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/parser/parse_relation.c:2832
> 2832
>  /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/parser/parse_relation.c:
> No such file or directory.
> (gdb) bt
> #0  get_tle_by_resno (tlist=0x7fd0d5da27c0, resno=resno@entry=6) at
> /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/parser/parse_relation.c:2832
> #1  0x00007fd0d47cb9dd in pullup_replace_vars_callback
> (var=0x7fd0d5d9e958, context=0x7fff52170620) at
> /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/prep/prepjointree.c:2074
> [...]
> #15 0x00007fd0d4929760 in BuildCachedPlan (plansource=plansource@entry=0x7fd0d5d7d940,
> qlist=0x7fd0d5d30cf8, qlist@entry=0x0, boundParams=boundParams@entry=0x0)
> at
> /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/utils/cache/plancache.c:951
>

And the crash happens in the planner when building a cached plan. This may
be enough information for coming up with a reproducible test case. I'll take
a closer look on Monday.
--
Michael

Re: BUG #13490: Segmentation fault on pg_stat_activity

From: Michael Bommarito
Date:
Hello Michael,
  Here is the offending query and gdb session/stacktrace output.  Please
let me know if we can provide anything else from gdb or logs that can be
anonymized.

*Query:*
2015-07-11 12:57:41 UTC [12803-7] LOG:  server process (PID 20696) was
terminated by signal 11: Segmentation fault
2015-07-11 12:57:41 UTC [12803-8] DETAIL:  Failed process was running:
SELECT COUNT(*) FROM pg_stat_activity WHERE pid <> pg_backend_pid()


*GDB:*
root@hostname:/var/lib/postgresql/9.5/main# gdb -q -c
/var/lib/postgresql/9.5/main/core /usr/lib/postgresql/9.5/bin/postgres
Reading symbols from /usr/lib/postgresql/9.5/bin/postgres...Reading symbols
from
/usr/lib/debug/.build-id/7b/273fca376bfad4ea4bb0c4d3099d62ecc9c124.debug...done.
done.
[New LWP 16028]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: postgres databasename 127.0.0.1(42063)
BIND                             '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  get_tle_by_resno (tlist=0x7fd0d5da27c0, resno=resno@entry=6) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/parser/parse_relation.c:2832
2832
 /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/parser/parse_relation.c:
No such file or directory.
(gdb) bt
#0  get_tle_by_resno (tlist=0x7fd0d5da27c0, resno=resno@entry=6) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/parser/parse_relation.c:2832
#1  0x00007fd0d47cb9dd in pullup_replace_vars_callback (var=0x7fd0d5d9e958,
context=0x7fff52170620) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/prep/prepjointree.c:2074
#2  0x00007fd0d481c3ea in replace_rte_variables_mutator (node=<optimized
out>, context=0x7fff52170620) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/rewrite/rewriteManip.c:1149
#3  0x00007fd0d478152c in expression_tree_mutator (node=0x7fd0d5d9e908,
mutator=0x7fd0d481c390 <replace_rte_variables_mutator>,
context=0x7fff52170620) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/nodes/nodeFuncs.c:2769
#4  0x00007fd0d47812b3 in expression_tree_mutator (node=<optimized out>,
mutator=0x7fd0d481c390 <replace_rte_variables_mutator>,
context=0x7fff52170620) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/nodes/nodeFuncs.c:2675
#5  0x00007fd0d481cc64 in replace_rte_variables (node=<optimized out>,
target_varno=<optimized out>, sublevels_up=sublevels_up@entry=0,
callback=callback@entry=0x7fd0d47cb880 <pullup_replace_vars_callback>,
callback_arg=callback_arg@entry=0x7fff521706c0,
outer_hasSubLinks=0x7fd0d5d30d6e "")
    at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/rewrite/rewriteManip.c:1115
#6  0x00007fd0d47cd1c7 in pullup_replace_vars (context=0x7fff521706c0,
expr=<optimized out>) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/prep/prepjointree.c:1982
#7  pull_up_simple_subquery (deletion_ok=<optimized out>,
containing_appendrel=0x0, lowest_nulling_outer_join=0x0,
lowest_outer_join=0x0, rte=0x7fd0d5d30ea8, jtnode=<optimized out>,
root=0x7fd0d5d9ee48) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/prep/prepjointree.c:1030
#8  pull_up_subqueries_recurse (root=root@entry=0x7fd0d5d9ee48,
jtnode=<optimized out>, lowest_outer_join=lowest_outer_join@entry=0x0,
lowest_nulling_outer_join=lowest_nulling_outer_join@entry=0x0,
containing_appendrel=containing_appendrel@entry=0x0, deletion_ok=<optimized
out>)
    at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/prep/prepjointree.c:696
#9  0x00007fd0d47cc989 in pull_up_subqueries_recurse
(root=root@entry=0x7fd0d5d9ee48,
jtnode=0x7fd0d5d9e6c0, lowest_outer_join=lowest_outer_join@entry=0x0,
lowest_nulling_outer_join=lowest_nulling_outer_join@entry=0x0,
containing_appendrel=containing_appendrel@entry=0x0, deletion_ok=<optimized
out>,
    deletion_ok@entry=0 '\000') at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/prep/prepjointree.c:762
#10 0x00007fd0d47cd639 in pull_up_subqueries (root=root@entry=0x7fd0d5d9ee48)
at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/prep/prepjointree.c:614
#11 0x00007fd0d47c5014 in subquery_planner (glob=glob@entry=0x7fd0d5d9edb8,
parse=parse@entry=0x7fd0d5d30d48, parent_root=parent_root@entry=0x0,
hasRecursion=hasRecursion@entry=0 '\000', tuple_fraction=0,
subroot=subroot@entry=0x7fff52170908)
    at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/plan/planner.c:374
#12 0x00007fd0d47c5975 in standard_planner (parse=0x7fd0d5d30d48,
cursorOptions=0, boundParams=0x0) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/plan/planner.c:229
#13 0x00007fd0d4848034 in pg_plan_query (querytree=<optimized out>,
cursorOptions=cursorOptions@entry=0, boundParams=boundParams@entry=0x0) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/tcop/postgres.c:809
#14 0x00007fd0d4848124 in pg_plan_queries
(querytrees=querytrees@entry=0x7fd0d5d30cf8,
cursorOptions=0, boundParams=boundParams@entry=0x0) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/tcop/postgres.c:868
#15 0x00007fd0d4929760 in BuildCachedPlan
(plansource=plansource@entry=0x7fd0d5d7d940,
qlist=0x7fd0d5d30cf8, qlist@entry=0x0, boundParams=boundParams@entry=0x0)
at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/utils/cache/plancache.c:951
#16 0x00007fd0d4929a98 in GetCachedPlan
(plansource=plansource@entry=0x7fd0d5d7d940,
boundParams=boundParams@entry=0x0, useResOwner=useResOwner@entry=0 '\000')
at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/utils/cache/plancache.c:1165
#17 0x00007fd0d48497ab in exec_bind_message (input_message=0x7fff52170be0)
at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/tcop/postgres.c:1774
#18 PostgresMain (argc=<optimized out>, argv=argv@entry=0x7fd0d5c8d950,
dbname=0x7fd0d5c8d840 "databasename", username=<optimized out>) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/tcop/postgres.c:4071
#19 0x00007fd0d45f239c in BackendRun (port=0x7fd0d5cd2c00) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/postmaster/postmaster.c:4159
#20 BackendStartup (port=0x7fd0d5cd2c00) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/postmaster/postmaster.c:3835
#21 ServerLoop () at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/postmaster/postmaster.c:1609
#22 0x00007fd0d47f18e1 in PostmasterMain (argc=5, argv=<optimized out>) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/postmaster/postmaster.c:1254
#23 0x00007fd0d45f30cd in main (argc=5, argv=0x7fd0d5c8c970) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/main/main.c:221


Thanks,
Michael J. Bommarito II, CEO
Bommarito Consulting, LLC
*Web:* http://www.bommaritollc.com
*Mobile:* +1 (646) 450-3387

On Tue, Jul 7, 2015 at 11:49 PM, Michael Paquier <michael.paquier@gmail.com>
wrote:

> On Wed, Jul 8, 2015 at 10:22 AM, Michael Bommarito
> <michael@bommaritollc.com> wrote:
> >   I'll try to generate an isolated core dump and stack trace this weekend
> > when the system is not in use.
>
> That would be nice. I have let pgbench -C run for one hour with select
> * from pg_stat_activity running every second (\watch 1) in parallel
> but I could not reproduce the issue on HEAD.
> --
> Michael
>

Re: BUG #13490: Segmentation fault on pg_stat_activity

From: Michael Bommarito
Date:
Yes, 12803 was the PID of the parent postgres process.

While the extended query protocol was being used by this client and query,
we've also been able to trigger it via unprepared statements in `psql` or
pgAdmin.

Great, thank you!  I'll keep the core around for now so we can provide
additional details; if we need to test another execution path for segfault,
it will need to wait until the weekend again.

Thanks,
Michael J. Bommarito II, CEO
Bommarito Consulting, LLC
*Web:* http://www.bommaritollc.com
*Mobile:* +1 (646) 450-3387

On Sat, Jul 11, 2015 at 9:18 AM, Michael Paquier <michael.paquier@gmail.com>
wrote:

>
>
> On Sat, Jul 11, 2015 at 10:07 PM, Michael Bommarito <
> michael@bommaritollc.com> wrote:
>
>>   Here is the offending query and gdb session/stacktrace output.  Please
>> let me know if we can provide anything else from gdb or logs that can be
>> anonymized.
>>
>
> Thanks.
>
>
>> 2015-07-11 12:57:41 UTC [12803-7] LOG:  server process (PID 20696) was
>> terminated by signal 11: Segmentation fault
>> 2015-07-11 12:57:41 UTC [12803-8] DETAIL:  Failed process was running:
>> SELECT COUNT(*) FROM pg_stat_activity WHERE pid <> pg_backend_pid()
>>
>
> 12803 is the number of the process PID, right?
>
> [...]
>> Core was generated by `postgres: postgres databasename 127.0.0.1(42063)
>> BIND                             '.
>>
>
> Interesting. So you are using the extended query protocol.
>
>
>> Program terminated with signal SIGSEGV, Segmentation fault.
>> #0  get_tle_by_resno (tlist=0x7fd0d5da27c0, resno=resno@entry=6) at
>> /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/parser/parse_relation.c:2832
>> 2832
>>  /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/parser/parse_relation.c:
>> No such file or directory.
>> (gdb) bt
>> #0  get_tle_by_resno (tlist=0x7fd0d5da27c0, resno=resno@entry=6) at
>> /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/parser/parse_relation.c:2832
>> #1  0x00007fd0d47cb9dd in pullup_replace_vars_callback
>> (var=0x7fd0d5d9e958, context=0x7fff52170620) at
>> /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/prep/prepjointree.c:2074
>> [...]
>> #15 0x00007fd0d4929760 in BuildCachedPlan (plansource=plansource@entry=0x7fd0d5d7d940,
>> qlist=0x7fd0d5d30cf8, qlist@entry=0x0, boundParams=boundParams@entry=0x0)
>> at
>> /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/utils/cache/plancache.c:951
>>
>
> And the crash happens in the planner when building a cached plan. This may
> be enough information for coming up with a reproducible test case. I'll get
> a closer look on Monday.
> --
> Michael
>

Re: BUG #13490: Segmentation fault on pg_stat_activity

From: Tom Lane
Date:
Michael Bommarito <michael@bommaritollc.com> writes:
> Hello Michael,
>   Here is the offending query and gdb session/stacktrace output.  Please
> let me know if we can provide anything else from gdb or logs that can be
> anonymized.

> *Query:*
> 2015-07-11 12:57:41 UTC [12803-7] LOG:  server process (PID 20696) was
> terminated by signal 11: Segmentation fault
> 2015-07-11 12:57:41 UTC [12803-8] DETAIL:  Failed process was running:
> SELECT COUNT(*) FROM pg_stat_activity WHERE pid <> pg_backend_pid()

I tried tracing through the logic using the line numbers shown in this
stack trace, and soon decided that they didn't seem to match the query
you show above --- but on closer look, that's to be expected, because
this core is evidently from PID 16028:

> [New LWP 16028]

If you can identify the exact query that 16028 was running from your logs,
that would be helpful.  Or, if you still have this same core file laying
about, "p debug_query_string" in gdb would probably be a more trustworthy
guide than trying to match up log entries.

>> That would be nice. I have let pgbench -C run for one hour with select
>> * from pg_stat_activity running every second (\watch 1) in parallel
>> but I could not reproduce the issue on HEAD.

Yeah, I tried similar experiments without result.  Presumably there is
some other triggering condition here, but it's hard to guess what.
I tried things like doing VACUUM FULL on pg_database and pg_authid
to force replan cycles on the view, but no crash.

            regards, tom lane

Re: BUG #13490: Segmentation fault on pg_stat_activity

From: Tom Lane
Date:
Michael Bommarito <michael@bommaritollc.com> writes:
> Here's the session with debug_query_string:
> (gdb) printf "%s\n", debug_query_string
> SELECT application_name AS source, client_addr AS ip, COUNT(*) AS
> total_connections FROM pg_stat_activity WHERE pid <> pg_backend_pid() GROUP
> BY application_name, ip ORDER BY COUNT(*) DESC, application_name ASC,
> client_addr ASC

Thanks.  This still doesn't match the stack trace: in particular, this
stack frame

#3  0x00007fd0d478152c in expression_tree_mutator (node=0x7fd0d5d9e908,
mutator=0x7fd0d481c390 <replace_rte_variables_mutator>,
context=0x7fff52170620) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/nodes/nodeFuncs.c:2769

indicates that we found a PlaceHolderInfo node in the expression tree that
pullup_replace_vars() was applied to, but so far as I can see no such node
should exist in the query tree generated by this query.  The most likely
theory seems to be that something clobbered the query tree while it was
sitting in the plancache, causing this recursive function to follow a
bogus pointer.  But that doesn't leave us with a lot to go on.

What can you tell us about the environment this is happening in?
How is the client-side code executing the failing queries?  (We know
it's using extended query protocol, but is it preparing a statement
and then executing it repeatedly, or just using a one-shot unnamed
prepared statement?)  What nondefault settings are in use on the
server side?  Do you have any extensions loaded, such as
pg_stat_statements or auto_explain?
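
(A quick way to collect the server-side part of that, as a sketch to run
against the affected database:)

databasename=# SHOW shared_preload_libraries;
databasename=# SELECT name, setting FROM pg_settings WHERE source NOT IN ('default', 'override');
databasename=# SELECT extname FROM pg_extension;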

            regards, tom lane

Re: BUG #13490: Segmentation fault on pg_stat_activity

From: Michael Bommarito
Date:
This particular instance is from pghero, which is a monitoring tool.  It
can be reproduced simply by querying pg_stat_activity in psql as well.
Pghero is using prepared statements via Ruby, from a quick skim of their
GitHub repo.

We have pg_stat_statements enabled, and can reproduce without the pghero
setup as well.  No other extensions are loaded.
On Jul 12, 2015 2:37 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> Michael Bommarito <michael@bommaritollc.com> writes:
> > Here's the session with debug_query_string:
> > (gdb) printf "%s\n", debug_query_string
> > SELECT application_name AS source, client_addr AS ip, COUNT(*) AS
> > total_connections FROM pg_stat_activity WHERE pid <> pg_backend_pid()
> GROUP
> > BY application_name, ip ORDER BY COUNT(*) DESC, application_name ASC,
> > client_addr ASC
>
> Thanks.  This still doesn't match the stack trace: in particular, this
> stack frame
>
> #3  0x00007fd0d478152c in expression_tree_mutator (node=0x7fd0d5d9e908,
> mutator=0x7fd0d481c390 <replace_rte_variables_mutator>,
> context=0x7fff52170620) at
>
> /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/nodes/nodeFuncs.c:2769
>
> indicates that we found a PlaceHolderInfo node in the expression tree that
> pullup_replace_vars() was applied to, but so far as I can see no such node
> should exist in the query tree generated by this query.  The most likely
> theory seems to be that something clobbered the query tree while it was
> sitting in the plancache, causing this recursive function to follow a
> bogus pointer.  But that doesn't leave us with a lot to go on.
>
> What can you tell us about the environment this is happening in?
> How is the client-side code executing the failing queries?  (We know
> it's using extended query protocol, but is it preparing a statement
> and then executing it repeatedly, or just using a one-shot unnamed
> prepared statement?)  What nondefault settings are in use on the
> server side?  Do you have any extensions loaded, such as
> pg_stat_statements or auto_explain?
>
>                         regards, tom lane
>

Re: BUG #13490: Segmentation fault on pg_stat_activity

From: Michael Bommarito
Date:
Here's the session with debug_query_string:

ubuntu@hostname:~⟫ sudo gdb -q -c /var/lib/postgresql/9.5/main/core
/usr/lib/postgresql/9.5/bin/postgres
Reading symbols from /usr/lib/postgresql/9.5/bin/postgres...Reading symbols
from
/usr/lib/debug/.build-id/7b/273fca376bfad4ea4bb0c4d3099d62ecc9c124.debug...done.
done.
[New LWP 16028]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: postgres database 127.0.0.1(42063) BIND
                          '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  get_tle_by_resno (tlist=0x7fd0d5da27c0, resno=resno@entry=6) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/parser/parse_relation.c:2832
2832
 /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/parser/parse_relation.c:
No such file or directory.
(gdb) printf "%s\n", debug_query_string
SELECT application_name AS source, client_addr AS ip, COUNT(*) AS
total_connections FROM pg_stat_activity WHERE pid <> pg_backend_pid() GROUP
BY application_name, ip ORDER BY COUNT(*) DESC, application_name ASC,
client_addr ASC
(gdb)


Thanks,
Michael J. Bommarito II, CEO
Bommarito Consulting, LLC
*Web:* http://www.bommaritollc.com
*Mobile:* +1 (646) 450-3387

On Sat, Jul 11, 2015 at 6:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Michael Bommarito <michael@bommaritollc.com> writes:
> > Hello Michael,
> >   Here is the offending query and gdb session/stacktrace output.  Please
> > let me know if we can provide anything else from gdb or logs that can be
> > anonymized.
>
> > *Query:*
> > 2015-07-11 12:57:41 UTC [12803-7] LOG:  server process (PID 20696) was
> > terminated by signal 11: Segmentation fault
> > 2015-07-11 12:57:41 UTC [12803-8] DETAIL:  Failed process was running:
> > SELECT COUNT(*) FROM pg_stat_activity WHERE pid <> pg_backend_pid()
>
> I tried tracing through the logic using the line numbers shown in this
> stack trace, and soon decided that they didn't seem to match the query
> you show above --- but on closer look, that's to be expected, because
> this core is evidently from PID 16028:
>
> > [New LWP 16028]
>
> If you can identify the exact query that 16028 was running from your logs,
> that would be helpful.  Or, if you still have this same core file laying
> about, "p debug_query_string" in gdb would probably be a more trustworthy
> guide than trying to match up log entries.
>
> >> That would be nice. I have let pgbench -C run for one hour with select
> >> * from pg_stat_activity running every second (\watch 1) in parallel
> >> but I could not reproduce the issue on HEAD.
>
> Yeah, I tried similar experiments without result.  Presumably there is
> some other triggering condition here, but it's hard to guess what.
> I tried things like doing VACUUM FULL on pg_database and pg_authid
> to force replan cycles on the view, but no crash.
>
>                         regards, tom lane
>

Re: BUG #13490: Segmentation fault on pg_stat_activity

From: Michael Paquier
Date:
On Mon, Jul 13, 2015 at 4:16 AM, Michael Bommarito
<michael@bommaritollc.com> wrote:
> This particular instance is from pghero, which is a monitoring tool.   It
> can be reproduced simply by querying stat_activity in psql as well.  Pghero
> is using prepared statements via ruby from a quick skim on their github
> repo.
>
> We have pg_stat_statements enabled, and can reproduce without pghero setup
> as well.  No other extensions loaded.
>
> On Jul 12, 2015 2:37 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
>>
>> Michael Bommarito <michael@bommaritollc.com> writes:
>> > Here's the session with debug_query_string:
>> > (gdb) printf "%s\n", debug_query_string
>> > SELECT application_name AS source, client_addr AS ip, COUNT(*) AS
>> > total_connections FROM pg_stat_activity WHERE pid <> pg_backend_pid()
>> > GROUP
>> > BY application_name, ip ORDER BY COUNT(*) DESC, application_name ASC,
>> > client_addr ASC
>>
>> Thanks.  This still doesn't match the stack trace: in particular, this
>> stack frame
>>
>> #3  0x00007fd0d478152c in expression_tree_mutator (node=0x7fd0d5d9e908,
>> mutator=0x7fd0d481c390 <replace_rte_variables_mutator>,
>> context=0x7fff52170620) at
>>
>> /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/nodes/nodeFuncs.c:2769
>>
>> indicates that we found a PlaceHolderInfo node in the expression tree that
>> pullup_replace_vars() was applied to, but so far as I can see no such node
>> should exist in the query tree generated by this query.  The most likely
>> theory seems to be that something clobbered the query tree while it was
>> sitting in the plancache, causing this recursive function to follow a
>> bogus pointer.  But that doesn't leave us with a lot to go on.
>>
>> What can you tell us about the environment this is happening in?
>> How is the client-side code executing the failing queries?  (We know
>> it's using extended query protocol, but is it preparing a statement
>> and then executing it repeatedly, or just using a one-shot unnamed
>> prepared statement?)  What nondefault settings are in use on the
>> server side?  Do you have any extensions loaded, such as
>> pg_stat_statements or auto_explain?

FWIW, I have been fooling around with the query reported in the backtrace
upthread by playing a bit with the extended query protocol to send BIND
messages with PQdescribePrepared and PQsendDescribePrepared, as well as
with psql, and while I am able to reproduce stack traces close to what you
had, I am not seeing any crashes. I have also played a bit with pghero
with pgbench running in parallel and there were no problems, with and
without pg_stat_statements loaded.

In the backtrace you sent previously
(http://www.postgresql.org/message-id/CAN=rtBipwKdHCtmXH3r4GNfUhF9e4ZfJbqcj7s_Ec9e2Mbf_LA@mail.gmail.com),
what is the value of MyProcPid?  Is it 12803 or 20696? If it is the
former, do you have a backtrace for process 20696? What we may be
looking at now is actually a side effect of the real problem, and as
long as we do not have a real test case, I am afraid that finding the
root problem is rather difficult.
--
Michael

Re: BUG #13490: Segmentation fault on pg_stat_activity

From: Michael Bommarito
Date:
Here are the locals per frame, all the way back up to BackendStartup, in
case it helps as well.  I will try to reproduce this upcoming weekend with
a "baseline" version of postgresql.conf and another system with a clean
import.

(gdb) bt
#0  get_tle_by_resno (tlist=0x7fd0d5da27c0, resno=resno@entry=6) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/parser/parse_relation.c:2832
#1  0x00007fd0d47cb9dd in pullup_replace_vars_callback (var=0x7fd0d5d9e958,
context=0x7fff52170620) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/prep/prepjointree.c:2074
#2  0x00007fd0d481c3ea in replace_rte_variables_mutator (node=<optimized
out>, context=0x7fff52170620) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/rewrite/rewriteManip.c:1149
#3  0x00007fd0d478152c in expression_tree_mutator (node=0x7fd0d5d9e908,
mutator=0x7fd0d481c390 <replace_rte_variables_mutator>,
context=0x7fff52170620) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/nodes/nodeFuncs.c:2769
#4  0x00007fd0d47812b3 in expression_tree_mutator (node=<optimized out>,
mutator=0x7fd0d481c390 <replace_rte_variables_mutator>,
context=0x7fff52170620) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/nodes/nodeFuncs.c:2675
#5  0x00007fd0d481cc64 in replace_rte_variables (node=<optimized out>,
target_varno=<optimized out>, sublevels_up=sublevels_up@entry=0,
callback=callback@entry=0x7fd0d47cb880 <pullup_replace_vars_callback>,
callback_arg=callback_arg@entry=0x7fff521706c0,
outer_hasSubLinks=0x7fd0d5d30d6e "")
    at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/rewrite/rewriteManip.c:1115
#6  0x00007fd0d47cd1c7 in pullup_replace_vars (context=0x7fff521706c0,
expr=<optimized out>) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/prep/prepjointree.c:1982
#7  pull_up_simple_subquery (deletion_ok=<optimized out>,
containing_appendrel=0x0, lowest_nulling_outer_join=0x0,
lowest_outer_join=0x0, rte=0x7fd0d5d30ea8, jtnode=<optimized out>,
root=0x7fd0d5d9ee48) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/prep/prepjointree.c:1030
#8  pull_up_subqueries_recurse (root=root@entry=0x7fd0d5d9ee48,
jtnode=<optimized out>, lowest_outer_join=lowest_outer_join@entry=0x0,
lowest_nulling_outer_join=lowest_nulling_outer_join@entry=0x0,
containing_appendrel=containing_appendrel@entry=0x0, deletion_ok=<optimized
out>)
    at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/prep/prepjointree.c:696
#9  0x00007fd0d47cc989 in pull_up_subqueries_recurse
(root=root@entry=0x7fd0d5d9ee48,
jtnode=0x7fd0d5d9e6c0, lowest_outer_join=lowest_outer_join@entry=0x0,
lowest_nulling_outer_join=lowest_nulling_outer_join@entry=0x0,
containing_appendrel=containing_appendrel@entry=0x0, deletion_ok=<optimized
out>,
    deletion_ok@entry=0 '\000') at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/prep/prepjointree.c:762
#10 0x00007fd0d47cd639 in pull_up_subqueries (root=root@entry=0x7fd0d5d9ee48)
at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/prep/prepjointree.c:614
#11 0x00007fd0d47c5014 in subquery_planner (glob=glob@entry=0x7fd0d5d9edb8,
parse=parse@entry=0x7fd0d5d30d48, parent_root=parent_root@entry=0x0,
hasRecursion=hasRecursion@entry=0 '\000', tuple_fraction=0,
subroot=subroot@entry=0x7fff52170908)
    at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/plan/planner.c:374
#12 0x00007fd0d47c5975 in standard_planner (parse=0x7fd0d5d30d48,
cursorOptions=0, boundParams=0x0) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/plan/planner.c:229
#13 0x00007fd0d4848034 in pg_plan_query (querytree=<optimized out>,
cursorOptions=cursorOptions@entry=0, boundParams=boundParams@entry=0x0) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/tcop/postgres.c:809
#14 0x00007fd0d4848124 in pg_plan_queries
(querytrees=querytrees@entry=0x7fd0d5d30cf8,
cursorOptions=0, boundParams=boundParams@entry=0x0) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/tcop/postgres.c:868
#15 0x00007fd0d4929760 in BuildCachedPlan
(plansource=plansource@entry=0x7fd0d5d7d940,
qlist=0x7fd0d5d30cf8, qlist@entry=0x0, boundParams=boundParams@entry=0x0)
at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/utils/cache/plancache.c:951
#16 0x00007fd0d4929a98 in GetCachedPlan
(plansource=plansource@entry=0x7fd0d5d7d940,
boundParams=boundParams@entry=0x0, useResOwner=useResOwner@entry=0 '\000')
at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/utils/cache/plancache.c:1165
#17 0x00007fd0d48497ab in exec_bind_message (input_message=0x7fff52170be0)
at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/tcop/postgres.c:1774
#18 PostgresMain (argc=<optimized out>, argv=argv@entry=0x7fd0d5c8d950,
dbname=0x7fd0d5c8d840 "databasename", username=<optimized out>) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/tcop/postgres.c:4071
#19 0x00007fd0d45f239c in BackendRun (port=0x7fd0d5cd2c00) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/postmaster/postmaster.c:4159
#20 BackendStartup (port=0x7fd0d5cd2c00) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/postmaster/postmaster.c:3835
#21 ServerLoop () at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/postmaster/postmaster.c:1609
#22 0x00007fd0d47f18e1 in PostmasterMain (argc=5, argv=<optimized out>) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/postmaster/postmaster.c:1254
#23 0x00007fd0d45f30cd in main (argc=5, argv=0x7fd0d5c8c970) at
/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/main/main.c:221
(gdb) info locals
tle = 0x0
l = 0x7fd0d5da2940
(gdb) select-frame 1
(gdb) info locals
tle = <optimized out>
rcon = 0x7fff521706c0
varattno = 6
newnode = <optimized out>
__func__ = "pullup_replace_vars_callback"
(gdb) select-frame 2
(gdb) info locals
newnode = <optimized out>
var = <optimized out>
__func__ = "replace_rte_variables_mutator"
(gdb) select-frame 3
(gdb) info locals
phinfo = <optimized out>
newnode = <optimized out>
__func__ = "expression_tree_mutator"
(gdb) select-frame 4
(gdb) info locals
resultlist = 0x0
temp = 0x7fd0d5d9e8e8
__func__ = "expression_tree_mutator"
(gdb) select-frame 5
(gdb) info locals
result = <optimized out>
context = {callback = 0x7fd0d47cb880 <pullup_replace_vars_callback>,
callback_arg = 0x7fff521706c0, target_varno = 1, sublevels_up = 0,
inserted_sublink = 0 '\000'}
__func__ = "replace_rte_variables"
(gdb) select-frame 6
(gdb) info locals
No locals.
(gdb) select-frame 7
(gdb) info locals
parse = 0x7fd0d5d30d48
subquery = <optimized out>
rvcontext = {root = 0x7fd0d5d9ee48, targetlist = 0x7fd0d5da27c0, target_rte
= 0x7fd0d5d30ea8, relids = 0x0, outer_hasSubLinks = 0x7fd0d5d30d6e "",
varno = 1, need_phvs = 0 '\000', wrap_non_vars = 0 '\000', rv_cache =
0x7fd0d5da2960}
varno = 1
subroot = 0x7fd0d5da2560
lc = <optimized out>
(gdb) select-frame 8
(gdb) info locals
varno = <optimized out>
rte = 0x7fd0d5d30ea8
__func__ = "pull_up_subqueries_recurse"
(gdb) select-frame 9
(gdb) info locals
sub_deletion_ok = <optimized out>
f = 0x7fd0d5d9e6c0
have_undeleted_child = 0 '\000'
l = 0x7fd0d5d9e720
__func__ = "pull_up_subqueries_recurse"
(gdb) select-frame 10
(gdb) info locals
No locals.
(gdb) select-frame 11
(gdb) info locals
root = 0x7fd0d5d9ee48
plan = <optimized out>
newWithCheckOptions = <optimized out>
newHaving = <optimized out>
hasOuterJoins = <optimized out>
l = <optimized out>
(gdb) select-frame 12
(gdb) info locals
result = <optimized out>
glob = 0x7fd0d5d9edb8
tuple_fraction = <optimized out>
root = 0x7fd0d5d9ed18
top_plan = <optimized out>
lp = <optimized out>
lr = <optimized out>
(gdb) select-frame 13
(gdb) info locals
plan = <optimized out>
(gdb) select-frame 14
(gdb) info locals
query = <optimized out>
stmt = <optimized out>
stmt_list = 0x0
query_list = 0x7fd0d5d30d28
(gdb) select-frame 15
(gdb) info locals
plan = <optimized out>
plist = <optimized out>
snapshot_set = 0 '\000'
spi_pushed = <optimized out>
plan_context = <optimized out>
oldcxt = 0x7fd0d5c8ccb0
(gdb) select-frame 16
(gdb) info locals
plan = <optimized out>
qlist = 0x0
customplan = 0 '\000'
__func__ = "GetCachedPlan"
(gdb) select-frame 17
(gdb) info locals
pformats = 0x0
psrc = 0x7fd0d5d7d940
portal = 0x7fd0d5ca6da0
query_string = 0x7fd0d5cd17a0 "SELECT application_name AS source,
client_addr AS ip, COUNT(*) AS total_connections FROM pg_stat_activity
WHERE pid <> pg_backend_pid() GROUP BY application_name, ip ORDER BY
COUNT(*) DESC, applicatio"...
portal_name = 0x7fd0d5d308d0 ""
stmt_name = 0x7fd0d5d308d1 ""
numPFormats = 0
saved_stmt_name = 0x0
rformats = 0x7fd0d5d30ce0
params = 0x0
save_log_statement_stats = 0 '\000'
msec_str = "SELECT
1\000\061\000\324\320\177\000\000\220\r\027R\377\177\000\000ᕔ\324\320\177\000"
numParams = 0
numRFormats = 1
cplan = <optimized out>
snapshot_set = <optimized out>
(gdb) select-frame 18
(gdb) info locals
firstchar = -707588896
input_message = {data = 0x7fd0d5d308d0 "", len = 10, maxlen = 1024, cursor
= 10}
local_sigjmp_buf = {{__jmpbuf = {140734570630176, -6783158886466612332,
140534916634656, 1, 0, 140534916918272, -6783158886368046188,
-6808758974801407084}, __mask_was_saved = 1, __saved_mask = {__val = {0,
140534916918272, 140534862700952, 140534916634800, 206158430256,
140734570630512, 140734570630304,
        140534916918272, 52, 140734570630432, 140534895011200, 1024,
140734570630464, 140734570630628, 0, 140734570630400}}}}
send_ready_for_query = 0 '\000'
__func__ = "PostgresMain"
(gdb) select-frame 19
(gdb) info locals
ac = 1
secs = 489934357
usecs = 656467
i = 1
av = 0x7fd0d5c8d950
maxac = <optimized out>



Thanks,
Michael J. Bommarito II, CEO
Bommarito Consulting, LLC
*Web:* http://www.bommaritollc.com
*Mobile:* +1 (646) 450-3387

On Mon, Jul 13, 2015 at 2:54 AM, Michael Paquier <michael.paquier@gmail.com>
wrote:

> On Mon, Jul 13, 2015 at 4:16 AM, Michael Bommarito
> <michael@bommaritollc.com> wrote:
> > This particular instance is from pghero, which is a monitoring tool.   It
> > can be reproduced simply by querying stat_activity in psql as well.
> Pghero
> > is using prepared statements via ruby from a quick skim on their github
> > repo.
> >
> > We have pg_stat_statements enabled, and can reproduce without pghero
> setup
> > as well.  No other extensions loaded.
> >
> > On Jul 12, 2015 2:37 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
> >>
> >> Michael Bommarito <michael@bommaritollc.com> writes:
> >> > Here's the session with debug_query_string:
> >> > (gdb) printf "%s\n", debug_query_string
> >> > SELECT application_name AS source, client_addr AS ip, COUNT(*) AS
> >> > total_connections FROM pg_stat_activity WHERE pid <> pg_backend_pid()
> >> > GROUP
> >> > BY application_name, ip ORDER BY COUNT(*) DESC, application_name ASC,
> >> > client_addr ASC
> >>
> >> Thanks.  This still doesn't match the stack trace: in particular, this
> >> stack frame
> >>
> >> #3  0x00007fd0d478152c in expression_tree_mutator (node=0x7fd0d5d9e908,
> >> mutator=0x7fd0d481c390 <replace_rte_variables_mutator>,
> >> context=0x7fff52170620) at
> >>
> >>
> /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/nodes/nodeFuncs.c:2769
> >>
> >> indicates that we found a PlaceHolderInfo node in the expression tree
> that
> >> pullup_replace_vars() was applied to, but so far as I can see no such
> node
> >> should exist in the query tree generated by this query.  The most likely
> >> theory seems to be that something clobbered the query tree while it was
> >> sitting in the plancache, causing this recursive function to follow a
> >> bogus pointer.  But that doesn't leave us with a lot to go on.
> >>
> >> What can you tell us about the environment this is happening in?
> >> How is the client-side code executing the failing queries?  (We know
> >> it's using extended query protocol, but is it preparing a statement
> >> and then executing it repeatedly, or just using a one-shot unnamed
> >> prepared statement?)  What nondefault settings are in use on the
> >> server side?  Do you have any extensions loaded, such as
> >> pg_stat_statements or auto_explain?
>
> FWIW, I have been fooling around with the query reported in the back
> trace upthread by playing a bit with the extended query protocol to
> send BIND messages with PQdescribePrepared and PQsendDescribePrepared,
> as well as with psql and while I am able to reproduce stack traces
> close to what you had I am not seeing any crashes. I have as well
> played a bit with pghero with pgbench running in parallel and there
> were no problems, with and without pg_stat_statements loaded.
>
> In the backtrace you send previously
> (
> http://www.postgresql.org/message-id/CAN=rtBipwKdHCtmXH3r4GNfUhF9e4ZfJbqcj7s_Ec9e2Mbf_LA@mail.gmail.com
> ),
> what is the value of MyProcPid?  Is it 12803 or 20696? If it is the
> former, do you have a backtrace for process 20696? What we may be
> looking at now is actually a side effect of the real problem, and as
> long as we do not have a real test case, I am afraid that finding the
> root problem is rather difficult.
> --
> Michael
>

Re: BUG #13490: Segmentation fault on pg_stat_activity

From: Michael Paquier
Date:
On Mon, Jul 13, 2015 at 10:43 PM, Michael Bommarito
<michael@bommaritollc.com> wrote:
> (gdb) bt
> #0  get_tle_by_resno (tlist=0x7fd0d5da27c0, resno=resno@entry=6) at
> /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/parser/parse_relation.c:2832
> [...]
> (gdb) info locals
> tle = 0x0
> l = 0x7fd0d5da2940

So basically this is crashing because this tle is NULL: get_tle_by_resno
does not cope well with a NULL item being extracted from the list. I am
wondering what is breaking that.
--
Michael

Re: BUG #13490: Segmentation fault on pg_stat_activity

From: Tom Lane
Date:
Michael Bommarito <michael@bommaritollc.com> writes:
> If you can provide a patch that performs input validation in
> get_tle_by_resno and logs the condition, I can compile and test with it.

Wouldn't prove anything one way or another.  Somehow, a corrupt query tree
is being fed to the planner; what we need to understand is what conditions
cause that to happen.  I doubt that getting more details at the point
where the code trips over the corruption will teach us that.

            regards, tom lane

Re: BUG #13490: Segmentation fault on pg_stat_activity

From: Michael Bommarito
Date:
If you can provide a patch that performs input validation in
get_tle_by_resno and logs the condition, I can compile and test with it.

Thanks,
Michael J. Bommarito II, CEO
Bommarito Consulting, LLC
*Web:* http://www.bommaritollc.com
*Mobile:* +1 (646) 450-3387

On Tue, Jul 14, 2015 at 3:33 AM, Michael Paquier <michael.paquier@gmail.com>
wrote:

> On Mon, Jul 13, 2015 at 10:43 PM, Michael Bommarito
> <michael@bommaritollc.com> wrote:
> > (gdb) bt
> > #0  get_tle_by_resno (tlist=0x7fd0d5da27c0, resno=resno@entry=6) at
> >
> /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/parser/parse_relation.c:2832
> > [...]
> > (gdb) info locals
> > tle = 0x0
> > l = 0x7fd0d5da2940
>
> So basically this is crashing because this tle is NULL, and
> get_tle_by_resno does not like that much when this item is extracted
> from the list. I am wondering what is breaking that.
> --
> Michael
>

Re: BUG #13490: Segmentation fault on pg_stat_activity

From: Tom Lane
Date:
Michael Bommarito <michael@bommaritollc.com> writes:
> Compiled from source with --debug using 9.5alpha1 with -O0 -ggdb
> -fno-omit-frame-pointer -mno-red-zone.  Reset to default postgresql.conf.
> Was able to generate segfaults repeatedly loading pghero dashboard.  First,
> some errors from the pg logs.
> 2015-07-18 15:07:58 UTC [27112-1] postgres@database ERROR:  attribute
> number 2 exceeds number of columns 0
> 2015-07-18 15:08:23 UTC [27112-3] postgres@database ERROR:  invalid
> varattno 66
> 2015-07-18 15:17:19 UTC [3605-1] postgres@database ERROR:  tupdesc
> reference 0x2bdd8a8 is not owned by resource owner Portal
> 2015-07-18 15:48:38 UTC [10281-1] postgres@database ERROR:  tupdesc
> reference 0xf77248 is not owned by resource owner Portal

I'm starting to wonder about flaky hardware and/or compiler bugs.  9.5
seems completely unreliable for you, but we're not seeing any similar
reports from anyone else.  Is 9.4 reliable on the same machine + same
build environment?

> Program received signal SIGQUIT, Quit.

Catching a SIGQUIT isn't very helpful; that just means you're one of
the server processes that got force-terminated after some other one
crashed.

            regards, tom lane

Re: BUG #13490: Segmentation fault on pg_stat_activity

From: Michael Bommarito
Date:
Compiled from source with --debug using 9.5alpha1 with -O0 -ggdb
-fno-omit-frame-pointer -mno-red-zone.  Reset to default postgresql.conf.
Was able to generate segfaults repeatedly loading pghero dashboard.  First,
some errors from the pg logs.
========================================================================
========================================================
2015-07-18 15:07:58 UTC [27112-1] postgres@database ERROR:  attribute
number 2 exceeds number of columns 0

2015-07-18 15:07:58 UTC [27112-2] postgres@database STATEMENT:  SELECT
application_name AS source, client_addr AS ip, COUNT(*) AS
total_connections FROM pg_stat_activity WHERE pid <> pg_backend_pid() GROUP
BY application_name, ip ORDER BY COUNT(*) DESC, application_name ASC,
client_addr ASC

2015-07-18 15:08:23 UTC [27112-3] postgres@database ERROR:  invalid
varattno 66

2015-07-18 15:08:23 UTC [27112-4] postgres@database STATEMENT:  SELECT
relname AS table, indexrelname AS index,
pg_size_pretty(pg_relation_size(i.indexrelid)) AS index_size, idx_scan as
index_scans FROM pg_stat_user_indexes ui INNER JOIN pg_index i ON
ui.indexrelid = i.indexrelid WHERE NOT indisunique AND idx_scan < 50 ORDER
BY pg_relation_size(i.indexrelid) DESC, relname ASC

2015-07-18 15:17:19 UTC [3605-1] postgres@database ERROR:  tupdesc
reference 0x2bdd8a8 is not owned by resource owner Portal
========================================================================
========================================================


Next, I attached to the backend PID under sudo gdb and ran `cont`:
========================================================================
========================================================
2015-07-18 15:48:38 UTC [10281-1] postgres@database ERROR:  tupdesc
reference 0xf77248 is not owned by resource owner Portal
2015-07-18 15:48:54 UTC [8812-4] LOG:  server process (PID 10538) was
terminated by signal 11: Segmentation fault
2015-07-18 15:48:54 UTC [8812-5] LOG:  terminating any other active server
processes
2015-07-18 15:48:54 UTC [10523-1] postgres@database WARNING:  terminating
connection because of crash of another server process
2015-07-18 15:48:54 UTC [10523-2] postgres@database DETAIL:  The postmaster
has commanded this server process to roll back the current transaction and
exit, because another server process exited abnormally and possibly
corrupted shared memory.
2015-07-18 15:48:54 UTC [10523-3] postgres@database HINT:  In a moment you
should be able to reconnect to the database and repeat your command.
2015-07-18 15:48:54 UTC [10239-1] postgres@database WARNING:  terminating
connection because of crash of another server process
2015-07-18 15:48:54 UTC [10239-2] postgres@database DETAIL:  The postmaster
has commanded this server process to roll back the current transaction and
exit, because another server process exited abnormally and possibly
corrupted shared memory.
2015-07-18 15:48:54 UTC [10239-3] postgres@database HINT:  In a moment you
should be able to reconnect to the database and repeat your command.
2015-07-18 15:48:54 UTC [10522-1] postgres@database WARNING:  terminating
connection because of crash of another server process
2015-07-18 15:48:54 UTC [10522-2] postgres@database DETAIL:  The postmaster
has commanded this server process to roll back the current transaction and
exit, because another server process exited abnormally and possibly
corrupted shared memory.
2015-07-18 15:48:54 UTC [10522-3] postgres@database HINT:  In a moment you
should be able to reconnect to the database and repeat your command.
2015-07-18 15:48:54 UTC [10409-1] postgres@database WARNING:  terminating
connection because of crash of another server process
2015-07-18 15:48:54 UTC [10409-2] postgres@database DETAIL:  The postmaster
has commanded this server process to roll back the current transaction and
exit, because another server process exited abnormally and possibly
corrupted shared memory.
2015-07-18 15:48:54 UTC [10409-3] postgres@database HINT:  In a moment you
should be able to reconnect to the database and repeat your command.
2015-07-18 15:48:54 UTC [10408-1] postgres@database WARNING:  terminating
connection because of crash of another server process
2015-07-18 15:48:54 UTC [10408-2] postgres@database DETAIL:  The postmaster
has commanded this server process to roll back the current transaction and
exit, because another server process exited abnormally and possibly
corrupted shared memory.
2015-07-18 15:48:54 UTC [10408-3] postgres@database HINT:  In a moment you
should be able to reconnect to the database and repeat your command.

Program received signal SIGQUIT, Quit.
0x00007f84fe78c110 in __poll_nocancel () at
../sysdeps/unix/syscall-template.S:81
81      in ../sysdeps/unix/syscall-template.S
(gdb) bt full
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address
0x7ffc65188bd8:
#0  0x00007f84fe78c110 in __poll_nocancel () at
../sysdeps/unix/syscall-template.S:81
No locals.
Cannot access memory at address 0x7ffc65188bd8
========================================================================
========================================================

Thanks,
Michael J. Bommarito II, CEO
Bommarito Consulting, LLC
*Web:* http://www.bommaritollc.com
*Mobile:* +1 (646) 450-3387

On Tue, Jul 14, 2015 at 8:45 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Michael Bommarito <michael@bommaritollc.com> writes:
> > If you can provide a patch that performs input validation in
> > get_tle_by_resno and logs the condition, I can compile and test with it.
>
> Wouldn't prove anything one way or another.  Somehow, a corrupt query tree
> is being fed to the planner; what we need to understand is what conditions
> cause that to happen.  I doubt that getting more details at the point
> where the code trips over the corruption will teach us that.
>
>                         regards, tom lane
>

Re: BUG #13490: Segmentation fault on pg_stat_activity

From: Michael Bommarito
Date:
After exporting and re-importing (pg9.5 -> 9.5), this has gone away.  We
changed quite a few variables in the import (non-parallel, letting full
index builds proceed, etc.), so we're not 100% sure what the issue was; as
you suggested, it doesn't look like pg9.5 itself was the problem (though
we're not sure about pg_dump/restore).
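
(For reference, the single-job re-import described above would look roughly
like this; the database name and dump path are assumptions:)

$ pg_dump -Fc -f /backup/database.dump database
$ dropdb database && createdb database
$ pg_restore --jobs=1 --exit-on-error -d database /backup/database.dump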

Thanks,
Michael J. Bommarito II, CEO
Bommarito Consulting, LLC
*Web:* http://www.bommaritollc.com
*Mobile:* +1 (646) 450-3387

On Sat, Jul 18, 2015 at 12:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Michael Bommarito <michael@bommaritollc.com> writes:
> > Compiled from source with --debug using 9.5alpha1 with -O0 -ggdb
> > -fno-omit-frame-pointer -mno-red-zone.  Reset to default postgresql.conf.
> > Was able to generate segfaults repeatedly loading pghero dashboard.
> First,
> > some errors from the pg logs.
> > 2015-07-18 15:07:58 UTC [27112-1] postgres@database ERROR:  attribute
> > number 2 exceeds number of columns 0
> > 2015-07-18 15:08:23 UTC [27112-3] postgres@database ERROR:  invalid
> > varattno 66
> > 2015-07-18 15:17:19 UTC [3605-1] postgres@database ERROR:  tupdesc
> > reference 0x2bdd8a8 is not owned by resource owner Portal
> > 2015-07-18 15:48:38 UTC [10281-1] postgres@database ERROR:  tupdesc
> > reference 0xf77248 is not owned by resource owner Portal
>
> I'm starting to wonder about flaky hardware and/or compiler bugs.  9.5
> seems completely unreliable for you, but we're not seeing any similar
> reports from anyone else.  Is 9.4 reliable on the same machine + same
> build environment?
>
> > Program received signal SIGQUIT, Quit.
>
> Catching a SIGQUIT isn't very helpful; that just means you're one of
> the server processes that got force-terminated after some other one
> crashed.
>
>                         regards, tom lane
>

Re: BUG #13490: Segmentation fault on pg_stat_activity

From: Alvaro Herrera
Date:
Michael Bommarito wrote:
> After exporting and re-importing (pg9.5 -> 9.5), this has gone away.  We
> changed quite a few variables in the import (non-parallel, letting full
> index builds proceed, etc.), so we're not 100% sure what the issue was; as
> you suggested, it doesn't look like pg9.5 itself was the problem (though
> we're not sure about pg_dump/restore).

Maybe you had an inconsistent build.
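
(If so, a full clean rebuild along these lines should rule it out; a sketch,
assuming --enable-debug is what the earlier "--debug" referred to:)

$ make distclean
$ ./configure --enable-debug CFLAGS="-O0 -ggdb -fno-omit-frame-pointer -mno-red-zone"
$ make && make install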

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services