Re: [sqlsmith] Failed assertions on parallel worker shutdown

From: Amit Kapila
Subject: Re: [sqlsmith] Failed assertions on parallel worker shutdown
Date:
Msg-id: CAA4eK1KOKGqmz9bGu+Z42qhRwMbm4R5rfnqsLCNqFs9j14jzEA@mail.gmail.com
In reply to: Re: [sqlsmith] Failed assertions on parallel worker shutdown (Andreas Seltenreich <seltenreich@gmx.de>)
Replies: Re: [sqlsmith] Failed assertions on parallel worker shutdown
         Re: [sqlsmith] Failed assertions on parallel worker shutdown
List: pgsql-hackers
On Tue, May 24, 2016 at 6:36 PM, Andreas Seltenreich <seltenreich@gmx.de> wrote:
> Each of the sent plans was collected when a worker dumped core due to
> the failed assertion. More core dumps than plans were actually
> observed, since with this failed assertion, multiple workers usually
> trip on it and dump core simultaneously.
>
> The following query corresponds to plan2:
>
> --8<---------------cut here---------------start------------->8---
> select
>   pg_catalog.pg_stat_get_bgwriter_requested_checkpoints() as c0,
>   subq_0.c3 as c1, subq_0.c1 as c2, 31 as c3, 18 as c4,
>   (select unique1 from public.bprime limit 1 offset 9) as c5,
>   subq_0.c2 as c6
> from
>   (select ref_0.tablename as c0, ref_0.inherited as c1,
>     ref_0.histogram_bounds as c2, 100 as c3
>   from
>     pg_catalog.pg_stats as ref_0
>   where 49 is not NULL limit 55) as subq_0
> where true
> limit 58;
> --8<---------------cut here---------------end--------------->8---
I am able to reproduce the assertion failure (it occurs once in two or three runs, but always at the same place) that you reported upthread with the above query. It seems to me the issue here is that while the workers are still writing tuples into the tuple queues, the master backend has detached from the queues. The master backend detaches from the tuple queues because of the Limit clause: after processing the number of tuples the Limit clause requires, it shuts down the nodes in the below part of code:
ExecutePlan()
{
..
    if (TupIsNull(slot))
    {
        /* Allow nodes to release or shut down resources. */
        (void) ExecShutdownNode(planstate);
        break;
    }
..
}
Now, it is quite possible that a worker has written part of its data when the queue gets detached. The callers of shm_mq (tqueueReceiveSlot/shm_mq_send) don't handle SHM_MQ_DETACHED, so the worker keeps sending more data (the next tuple), which leads to the assertion in the below code:
shm_mq_sendv()
{
..
    /* Write the actual data bytes into the buffer. */
    Assert(mqh->mqh_partial_bytes <= nbytes);
..
}
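To make the failure mode concrete, here is a minimal, self-contained C simulation (not the actual PostgreSQL shm_mq code; the type and function names are invented for illustration). It models only the partial-write bookkeeping: a send interrupted by detach leaves the partial-bytes counter behind, so a later, shorter message violates the asserted invariant.

```c
#include <stdbool.h>
#include <stddef.h>

/* Invented stand-in for shm_mq_handle: only the two fields involved in
 * the failure mode are modeled. */
typedef enum { MQ_SUCCESS, MQ_DETACHED } mq_result;

typedef struct
{
    bool   detached;      /* has the receiver detached from the queue? */
    size_t partial_bytes; /* bytes of the current message already written */
} mq_handle;

/* The invariant that the real code asserts on entry to the write loop. */
static bool
invariant_holds(const mq_handle *mqh, size_t nbytes)
{
    return mqh->partial_bytes <= nbytes;
}

/* Sketch of a send that notices detach mid-message.  Like the real
 * shm_mq_sendv, it resets partial_bytes only on the success path, so a
 * detached send leaves stale state behind. */
static mq_result
mq_send(mq_handle *mqh, size_t nbytes)
{
    if (mqh->detached)
    {
        /* Pretend half the message was written before the detach was
         * noticed; the counter survives into the next call. */
        mqh->partial_bytes = nbytes / 2;
        return MQ_DETACHED;
    }
    mqh->partial_bytes = 0; /* reset happens only on success */
    return MQ_SUCCESS;
}
```

A caller that ignores MQ_DETACHED and next tries to send a smaller message then finds partial_bytes greater than nbytes, which is exactly the condition the Assert trips on.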
I think the workers should stop processing tuples after the tuple queues get detached. This will not only handle the above situation gracefully, but will also speed up queries where a Limit clause is present on top of a Gather node. A patch for the same is attached to this mail (this was part of the original parallel seq scan patch, but was not applied; as far as I remember, we thought such an optimization might not be required for the initial version).
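The intended behavior can be sketched with a toy queue (invented names; this is not the attached patch itself): the worker's produce loop bails out on the first detached send instead of fetching further tuples.

```c
#include <stdbool.h>

typedef enum { SEND_OK, SEND_DETACHED } send_result;

/* Toy tuple queue: a detach flag plus a count of tuples accepted. */
typedef struct
{
    bool detached;
    int  received;
} tuple_queue;

static send_result
send_tuple(tuple_queue *q)
{
    if (q->detached)
        return SEND_DETACHED; /* leader has gone away */
    q->received++;
    return SEND_OK;
}

/* The worker stops producing as soon as the queue reports detach,
 * rather than generating (and trying to send) the remaining tuples. */
static int
produce_tuples(tuple_queue *q, int ntuples)
{
    int sent = 0;
    while (sent < ntuples)
    {
        if (send_tuple(q) == SEND_DETACHED)
            break;
        sent++;
    }
    return sent;
}
```

With a Limit on top of Gather, the leader detaches early; under this scheme the workers do no further scan work, which is the speed-up mentioned above.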
Another approach to fix this issue could be to reset mqh_partial_bytes and mqh_length_word_complete in shm_mq_sendv in case of SHM_MQ_DETACHED. These are currently reset only in case of success.
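A minimal sketch of this alternative (again with invented names, not the real shm_mq internals): clear the partial-write bookkeeping whenever detach is detected, so the invariant holds on any subsequent call even if the caller keeps sending.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { MQ_OK, MQ_GONE } mq_status;

/* Invented handle; only the partial-write bookkeeping is modeled. */
typedef struct
{
    bool   detached;
    size_t partial_bytes;
} mq_state;

static bool
partial_ok(const mq_state *mq, size_t nbytes)
{
    return mq->partial_bytes <= nbytes;
}

/* Variant of the send path that resets the partial-write state on detach
 * as well as on success, so no stale count can outlive a failed send. */
static mq_status
mq_send_reset(mq_state *mq, size_t nbytes)
{
    (void) nbytes; /* a real implementation would write the payload here */

    if (mq->detached)
    {
        mq->partial_bytes = 0; /* the proposed reset */
        return MQ_GONE;
    }
    mq->partial_bytes = 0; /* the success path already resets */
    return MQ_OK;
}
```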
I have added this issue to the list of PostgreSQL 9.6 open items at https://wiki.postgresql.org/wiki/PostgreSQL_9.6_Open_Items
The steps to reproduce it on the regression database are:
1. Create enough rows in pg_statistic so that a parallel plan can be selected.
2. Use the below procedure to ensure sufficient rows get created in pg_statistic:
CREATE OR REPLACE FUNCTION populate_pg_stat() RETURNS int AS
$BODY$
DECLARE
    count int;
BEGIN
    FOR count IN 1 .. 50000
    LOOP
        EXECUTE 'CREATE TABLE t' || count || ' (c1 int)';
        EXECUTE 'INSERT INTO t' || count || ' VALUES (generate_series(1,10))';
        EXECUTE 'ANALYZE t' || count;
    END LOOP;
    RETURN 1;
END
$BODY$
LANGUAGE plpgsql;
3. Set the below GUCs and run the query under Explain Analyze:
set parallel_tuple_cost = 0;
set parallel_setup_cost = 0;
Explain Analyze select
pg_catalog.pg_stat_get_bgwriter_requested_checkpoints() as c0,
subq_0.c3 as c1, subq_0.c1 as c2, 31 as c3, 18 as c4,
(select unique1 from public.bprime limit 1 offset 9) as c5,
subq_0.c2 as c6
from
(select ref_0.tablename as c0, ref_0.inherited as c1,
ref_0.histogram_bounds as c2, 100 as c3
from
pg_catalog.pg_stats as ref_0
where 49 is not NULL limit 55) as subq_0
where true
limit 58;
Attachments