Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)
From | Tomas Vondra |
---|---|
Subject | Re: lockup in parallel hash join on dikkop (freebsd 14.0-current) |
Date | |
Msg-id | 4dcd8d2b-efd6-4ede-1c43-f2dbd760ea3e@enterprisedb.com |
In reply to | Re: lockup in parallel hash join on dikkop (freebsd 14.0-current) (Thomas Munro <thomas.munro@gmail.com>) |
Replies |
Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)
|
List | pgsql-hackers |
On 1/29/23 18:26, Thomas Munro wrote:
> On Mon, Jan 30, 2023 at 1:53 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> So I did that - same configure options as the buildfarm client, and a
>> 'make check' (with only tests up to the 'join' suite, because that's
>> where it got stuck before). And it took only ~15 runs (~1h) to hit this
>> again on dikkop.
>
> That's good news.
>
>> I managed to collect the fstat/procstat stuff Thomas asked for, and the
>> backtraces - attached. I still have the core files, in case we look at
>> something. As before, running gcore on the second worker (29081) gets
>> this unstuck - it sends some signal that apparently wakes it up.
>
> Thanks! As expected, no bytes in the pipe for any those processes.
> Unfortunately I gave the wrong procstat command, it should be -i, not
> -j. Does "procstat -i /path/to/core | grep USR1" show P (pending) for
> that stuck process? Silly question really, I don't really expect
> poll() to be misbehaving in such a basic way.
>

It shows "--C" for all three processes, which should mean "will be caught".

> I was talking to Andres on IM about this yesterday and he pointed out
> a potential out-of-order hazard: WaitEventSetWait() sets "waiting" (to
> tell the signal handler to write to the self-pipe) and then reads
> latch->is_set with neither compiler nor memory barrier, which doesn't
> seem right because we might see a value of latch->is_set from before
> "waiting" was true, and yet the signal handler might also have run
> while "waiting" was false so the self-pipe doesn't save us, despite
> the length of the comment about that. Can you reproduce it with this
> change?
>

Will do, but I'll wait for another lockup to see how frequent it
actually is. I'm now at ~90 runs total, and it didn't happen again yet.
So hitting it after 15 runs might have been a bit of a luck.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
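
For readers following the "waiting" / latch->is_set discussion quoted above, here is a minimal sketch of the store-then-load pattern with a full barrier on both sides. It is not PostgreSQL source and not the patch referred to in the thread. It uses plain C11 atomics, and every name in it (demo_waiting, demo_is_set, demo_wakeups, demo_set_latch, demo_prepare_to_wait) is hypothetical. It also folds "set the latch" and "signal handler decides whether to write to the self-pipe" into one function, which is a simplification.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the two flags from the discussion. */
static atomic_bool demo_waiting;  /* waiter is about to block */
static atomic_bool demo_is_set;   /* latch has been set */
static atomic_int  demo_wakeups;  /* stands in for bytes written to the self-pipe */

/*
 * Setter side: publish is_set, then look at waiting.
 * The full fence keeps the load of demo_waiting from being
 * satisfied before the store to demo_is_set is visible.
 */
static void
demo_set_latch(void)
{
    atomic_store_explicit(&demo_is_set, true, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    if (atomic_load_explicit(&demo_waiting, memory_order_relaxed))
        atomic_fetch_add_explicit(&demo_wakeups, 1, memory_order_relaxed);
}

/*
 * Waiter side: publish waiting, then look at is_set.
 * Returns true if the latch is already set, in which case
 * the caller must not go to sleep.
 */
static bool
demo_prepare_to_wait(void)
{
    atomic_store_explicit(&demo_waiting, true, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&demo_is_set, memory_order_relaxed);
}
```

With both fences in place, at least one side observes the other's store: either the waiter sees is_set and skips sleeping, or the setter sees waiting and issues a wakeup. Without a barrier on the waiter side, the read of is_set may effectively happen before "waiting" becomes visible, which is the hazard described in the quoted text.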