On Fri, Jan 5, 2018 at 5:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The early returns indicate that that problem is fixed;
Thanks for your help and patience with that. I've made a list over
here so we don't lose track of the various things that should be
improved in this area, and will start a new thread when I have patches
to propose: https://wiki.postgresql.org/wiki/Parallel_Hash
> but now that the
> noise level is down, it's possible to see that brolga is showing an actual
> crash in the PHJ test, perhaps one time in four. So we're not out of
> the woods yet. It seems to consistently look like this:
>
> 2017-12-21 17:34:52.092 EST [2252:4] LOG: background worker "parallel worker" (PID 3584) was terminated by signal 11
> 2017-12-21 17:34:52.092 EST [2252:5] DETAIL: Failed process was running: select count(*) from foo
> left join (select b1.id, b1.t from bar b1 join bar b2 using (id)) ss
> on foo.id < ss.id + 1 and foo.id > ss.id - 1;
> 2017-12-21 17:34:52.092 EST [2252:6] LOG: terminating any other active server processes
That is a test of a parallel-aware hash join with a rescan (i.e., the
workers get restarted repeatedly by the Gather node, which reuses the
DSM segment; maybe I misunderstood some detail of the protocol for
that). I'll go and review that code and try to reproduce the failure.
Andrew, on the off-chance: do you have a core dump on brolga that you
could pull a backtrace out of?
--
Thomas Munro
http://www.enterprisedb.com