Re: [sqlsmith] Failed assertions on parallel worker shutdown

From: Amit Kapila
Subject: Re: [sqlsmith] Failed assertions on parallel worker shutdown
Date:
Msg-id: CAA4eK1KOKGqmz9bGu+Z42qhRwMbm4R5rfnqsLCNqFs9j14jzEA@mail.gmail.com
In reply to: Re: [sqlsmith] Failed assertions on parallel worker shutdown (Andreas Seltenreich <seltenreich@gmx.de>)
Replies: Re: [sqlsmith] Failed assertions on parallel worker shutdown
         Re: [sqlsmith] Failed assertions on parallel worker shutdown
List: pgsql-hackers
On Tue, May 24, 2016 at 6:36 PM, Andreas Seltenreich <seltenreich@gmx.de> wrote:
> Each of the sent plans was collected when a worker dumped core due to
> the failed assertion. More core dumps than plans were actually
> observed, since with this failed assertion, multiple workers usually
> trip on it and dump core simultaneously.
>
> The following query corresponds to plan2:
>
> --8<---------------cut here---------------start------------->8---
> select
>   pg_catalog.pg_stat_get_bgwriter_requested_checkpoints() as c0,
>   subq_0.c3 as c1, subq_0.c1 as c2, 31 as c3, 18 as c4,
>   (select unique1 from public.bprime limit 1 offset 9) as c5,
>   subq_0.c2 as c6
> from
>   (select ref_0.tablename as c0, ref_0.inherited as c1,
>     ref_0.histogram_bounds as c2, 100 as c3
>   from
>     pg_catalog.pg_stats as ref_0
>   where 49 is not NULL limit 55) as subq_0
> where true
> limit 58;
> --8<---------------cut here---------------end--------------->8---
I am able to reproduce the assertion failure (it occurs once in two or three runs, but always at the same place) that you reported upthread with the above query. It seems to me the issue here is that while the workers are still writing tuples into the tuple queues, the master backend has detached from the queues. The master backend detaches from the tuple queues because of the Limit clause: after processing the number of tuples the Limit clause requires, it shuts down the nodes in the below part of code:
ExecutePlan()
{
..
    if (TupIsNull(slot))
    {
        /* Allow nodes to release or shut down resources. */
        (void) ExecShutdownNode(planstate);
        break;
    }
..
}
Now, it is quite possible that a worker has written part of its data when the queue gets detached. The callers of shm_mq (tqueueReceiveSlot/shm_mq_send) don't handle SHM_MQ_DETACHED, so the worker keeps sending more data (the next tuple), which leads to the assertion in the below code:
shm_mq_sendv()
{
..
    /* Write the actual data bytes into the buffer. */
    Assert(mqh->mqh_partial_bytes <= nbytes);
..
}
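To make the failure mode concrete, here is a minimal, self-contained C simulation (not the actual PostgreSQL shm_mq code; the type and function names are invented for illustration). It models only the partial-write bookkeeping: a send interrupted by detach leaves the partial-bytes counter behind, so a later, shorter message violates the asserted invariant.

```c
#include <stdbool.h>
#include <stddef.h>

/* Invented stand-in for shm_mq_handle: only the two fields involved in
 * the failure mode are modeled. */
typedef enum { MQ_SUCCESS, MQ_DETACHED } mq_result;

typedef struct
{
    bool   detached;      /* has the receiver detached from the queue? */
    size_t partial_bytes; /* bytes of the current message already written */
} mq_handle;

/* The invariant that the real code asserts on entry to the write loop. */
static bool
invariant_holds(const mq_handle *mqh, size_t nbytes)
{
    return mqh->partial_bytes <= nbytes;
}

/* Sketch of a send that notices detach mid-message.  Like the real
 * shm_mq_sendv, it resets partial_bytes only on the success path, so a
 * detached send leaves stale state behind. */
static mq_result
mq_send(mq_handle *mqh, size_t nbytes)
{
    if (mqh->detached)
    {
        /* Pretend half the message was written before the detach was
         * noticed; the counter survives into the next call. */
        mqh->partial_bytes = nbytes / 2;
        return MQ_DETACHED;
    }
    mqh->partial_bytes = 0; /* reset happens only on success */
    return MQ_SUCCESS;
}
```

A caller that ignores MQ_DETACHED and next tries to send a smaller message then finds partial_bytes greater than nbytes, which is exactly the condition the Assert trips on.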
I think the workers should stop processing tuples after the tuple queues get detached. This will not only handle the above situation gracefully, but will also speed up queries where a Limit clause is present on top of a Gather node. A patch for the same is attached to this mail (this was part of the original parallel seq scan patch, but was not applied; as far as I remember, we thought such an optimization might not be required for the initial version).
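The intended behavior can be sketched with a toy queue (invented names; this is not the attached patch itself): the worker's produce loop bails out on the first detached send instead of fetching further tuples.

```c
#include <stdbool.h>

typedef enum { SEND_OK, SEND_DETACHED } send_result;

/* Toy tuple queue: a detach flag plus a count of tuples accepted. */
typedef struct
{
    bool detached;
    int  received;
} tuple_queue;

static send_result
send_tuple(tuple_queue *q)
{
    if (q->detached)
        return SEND_DETACHED; /* leader has gone away */
    q->received++;
    return SEND_OK;
}

/* The worker stops producing as soon as the queue reports detach,
 * rather than generating (and trying to send) the remaining tuples. */
static int
produce_tuples(tuple_queue *q, int ntuples)
{
    int sent = 0;
    while (sent < ntuples)
    {
        if (send_tuple(q) == SEND_DETACHED)
            break;
        sent++;
    }
    return sent;
}
```

With a Limit on top of Gather, the leader detaches early; under this scheme the workers do no further scan work, which is the speed-up mentioned above.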
Another approach to fix this issue could be to reset mqh_partial_bytes and mqh_length_word_complete in shm_mq_sendv in case of SHM_MQ_DETACHED. These are currently reset only in case of success.
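A minimal sketch of this alternative (again with invented names, not the real shm_mq internals): clear the partial-write bookkeeping whenever detach is detected, so the invariant holds on any subsequent call even if the caller keeps sending.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { MQ_OK, MQ_GONE } mq_status;

/* Invented handle; only the partial-write bookkeeping is modeled. */
typedef struct
{
    bool   detached;
    size_t partial_bytes;
} mq_state;

static bool
partial_ok(const mq_state *mq, size_t nbytes)
{
    return mq->partial_bytes <= nbytes;
}

/* Variant of the send path that resets the partial-write state on detach
 * as well as on success, so no stale count can outlive a failed send. */
static mq_status
mq_send_reset(mq_state *mq, size_t nbytes)
{
    (void) nbytes; /* a real implementation would write the payload here */

    if (mq->detached)
    {
        mq->partial_bytes = 0; /* the proposed reset */
        return MQ_GONE;
    }
    mq->partial_bytes = 0; /* the success path already resets */
    return MQ_OK;
}
```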
I have added this issue to the list of PostgreSQL 9.6 open items at https://wiki.postgresql.org/wiki/PostgreSQL_9.6_Open_Items
The steps to reproduce it on the regression database are:
1. Create enough rows in pg_statistic so that a parallel plan can be selected.
2. Use the below procedure to ensure sufficient rows get created in pg_statistic:
CREATE OR REPLACE FUNCTION populate_pg_stat() RETURNS int AS
$BODY$
DECLARE
    count int;
BEGIN
    FOR count IN 1 .. 50000
    LOOP
        EXECUTE 'CREATE TABLE t' || count || ' (c1 int)';
        EXECUTE 'INSERT INTO t' || count || ' VALUES (generate_series(1,10))';
        EXECUTE 'ANALYZE t' || count;
    END LOOP;
    RETURN 1;
END
$BODY$
LANGUAGE plpgsql;
3. Set the below GUCs and run the query under Explain Analyze:
set parallel_tuple_cost = 0;
set parallel_setup_cost = 0;
Explain Analyze select
pg_catalog.pg_stat_get_bgwriter_requested_checkpoints() as c0,
subq_0.c3 as c1, subq_0.c1 as c2, 31 as c3, 18 as c4,
(select unique1 from public.bprime limit 1 offset 9) as c5,
subq_0.c2 as c6
from
(select ref_0.tablename as c0, ref_0.inherited as c1,
ref_0.histogram_bounds as c2, 100 as c3
from
pg_catalog.pg_stats as ref_0
where 49 is not NULL limit 55) as subq_0
where true
limit 58;
Attachments