Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker
From | Andres Freund |
---|---|
Subject | Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker |
Date | |
Msg-id | 20230725033946.mstudugbav7yghui@awork3.anarazel.de |
In response to | Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker (Masahiko Sawada <sawada.mshk@gmail.com>) |
Responses | Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker |
List | pgsql-bugs |
Hi,

Thanks for finding / debugging this issue!

On 2023-07-21 17:01:11 +0900, Masahiko Sawada wrote:
> I've reproduced the issue in my environment with the provided script.
> The crashed process is not a parallel vacuum worker actually but a
> parallel worker for rebuilding the index. The scenario seems that when
> detecting a deadlock, the process removes itself from the wait queue
> by RemoveFromWaitQueue() (called by CheckDeadLock()), and then
> RemoveFromWaitQueue() is called again by LockErrorCleanup() while
> aborting the transaction. With commit 5764f611e, in
> RemoveFromWaitQueue() we remove the process from the wait queue using
> dclist_delete_from():
>
>     /* Remove proc from lock's wait queue */
>     dclist_delete_from(&waitLock->waitProcs, &proc->links);
>     :
>     /* Clean up the proc's own state, and pass it the ok/fail signal */
>     proc->waitLock = NULL;
>     proc->waitProcLock = NULL;
>     proc->waitStatus = PROC_WAIT_STATUS_ERROR;
>
> However, since dclist_delete_from() doesn't clear proc->links, in
> LockErrorCleanup(), dlist_node_is_detached() still returns false:
>
>     if (!dlist_node_is_detached(&MyProc->links))
>     {
>         /* We could not have been granted the lock yet */
>         RemoveFromWaitQueue(MyProc, lockAwaited->hashcode);
>     }

Indeed :(

> leading to calling RemoveFromWaitQueue() again. I think we should use
> dclist_delete_from_thoroughly() instead. With the attached patch, the
> issue doesn't happen in my environment.

Yep. Do you want to push that fix, or should I?

> Another thing I noticed is that the Assert(waitLock) in
> RemoveFromWaitQueue() is useless actually, since we access *waitLock
> before that:
>
> I think we should fix it as well. This fix is also included in the
> attached patch.

Don't really have an opinion on that. It's been this way for longer, afaict.

Greetings,

Andres Freund
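
For readers following the thread, the double-removal path described in the quoted report can be illustrated with a minimal, self-contained C sketch. It does not use PostgreSQL's ilist.h or lock manager; the types and functions here (node, list, list_delete, list_delete_thoroughly, node_is_detached, simulate_double_cleanup) are hypothetical stand-ins, and the sketch assumes only what the report states: a plain delete leaves the removed node's own pointers untouched, so a later "is this node detached?" check still answers no.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a list node; NOT PostgreSQL's dlist_node. */
typedef struct node
{
    struct node *prev;
    struct node *next;
} node;

/* Circular doubly linked list with a sentinel head and an element count. */
typedef struct list
{
    node head;
    int  count;
} list;

static void list_init(list *l)
{
    l->head.prev = l->head.next = &l->head;
    l->count = 0;
}

static void list_push_tail(list *l, node *n)
{
    n->prev = l->head.prev;
    n->next = &l->head;
    l->head.prev->next = n;
    l->head.prev = n;
    l->count++;
}

/* Unlink n from the list, but leave n's own pointers stale. */
static void list_delete(list *l, node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    l->count--;
}

/* Unlink n and also reset its pointers, so it reads as detached afterwards. */
static void list_delete_thoroughly(list *l, node *n)
{
    list_delete(l, n);
    n->prev = NULL;
    n->next = NULL;
}

static bool node_is_detached(const node *n)
{
    return n->next == NULL;
}

/*
 * Mimics the reported sequence: a first removal (deadlock detected), then an
 * error-cleanup path that removes again unless the node looks detached.
 * Returns how many times the removal path would have run. The second removal
 * is only counted, not performed, to keep the sketch well-defined.
 */
int simulate_double_cleanup(bool thorough)
{
    list queue;
    node waiter = {NULL, NULL};
    int  removals = 0;

    list_init(&queue);
    list_push_tail(&queue, &waiter);

    /* First removal, as on deadlock detection. */
    if (thorough)
        list_delete_thoroughly(&queue, &waiter);
    else
        list_delete(&queue, &waiter);
    removals++;

    /* Cleanup path: guarded only by the detachment check. */
    if (!node_is_detached(&waiter))
        removals++;

    return removals;
}
```

In this sketch, simulate_double_cleanup(false) takes the removal path twice, while simulate_double_cleanup(true) takes it once, which is the behavioural difference the proposed switch to the "thoroughly" variant relies on.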