Re: Broken order-of-operations in parallel query latch manipulation
От | Tom Lane |
---|---|
Тема | Re: Broken order-of-operations in parallel query latch manipulation |
Дата | |
Msg-id | 6446.1470064523@sss.pgh.pa.us обсуждение исходный текст |
Ответ на | Re: Broken order-of-operations in parallel query latch manipulation (Amit Kapila <amit.kapila16@gmail.com>) |
Ответы |
Re: Broken order-of-operations in parallel query latch manipulation
|
Список | pgsql-hackers |
Amit Kapila <amit.kapila16@gmail.com> writes: > On Mon, Aug 1, 2016 at 1:58 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I believe this is wrong and the CHECK_FOR_INTERRUPTS needs to be before >> or after the two latch operations. As-is, if the reason somebody set >> our latch was to get us to notice that a CHECK_FOR_INTERRUPTS condition >> happened, there's a race condition where we'd fail to realize that. > I could see that in nodeGather.c, it might fail to notice the SetLatch > done by worker process or spuriously woken up due to SetLatch for some > unrelated reason. However, I don't see what problem it can cause > apart from one extra loop cycle where it will try to process the tuple > when actually there is no tuple in the queue. Consider the following sequence of events: 1. gather_readnext reaches the WaitLatch, and is allowed to pass through it for some unrelated reason, perhaps some long-since-handled SIGUSR1 from a worker process. 2. gather_readnext does CHECK_FOR_INTERRUPTS(), and sees nothing pending. 3. A SIGINT is received. StatementCancelHandler sets QueryCancelPending and does SetLatch(MyLatch). 4. gather_readnext does ResetLatch(MyLatch). 5. gather_readnext runs through its loop again, finds nothing to do, and reaches the WaitLatch. 6. The process is now sleeping on its latch, and might sit there a long time before noticing the pending query cancel. Obviously the window for this race condition is pretty tight --- there's not many instructions between steps 2 and 4. But it can happen. If memory serves, we've had actual field reports for race condition bugs where the window that was being hit wasn't much more than a single instruction. Also, it's entirely possible that the bug could be masked, if there was another CHECK_FOR_INTERRUPTS lurking anywhere in the code called within the loop. That doesn't excuse this coding practice, though. BTW, now that I look at it, CHECK_FOR_INTERRUPTS subsumes HandleParallelMessages(), which means the direct call to the latter at the top of gather_readnext's loop is pretty bogus. I now think the right fix in gather_readnext is to move the CHECK_FOR_INTERRUPTS macro to the top of the loop, replacing that call. The places in shm_mq.c that have this issue should probably look like ProcWaitForSignal, though. regards, tom lane
В списке pgsql-hackers по дате отправления: