Discussion: Re: Potential deadlock in pgaio_io_wait()
On Mon, Aug 4, 2025 at 5:54 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> I doubt it's very easy to reproduce with simple queries, but I assume
> if you had a SQL function that acquires a central LWLock and you ran
> concurrent queries SELECT COUNT(*) FROM t WHERE locking_function(x)

Hmm, that's a bad example, as it has the wrong lock scope.  It would
probably need a dedicated test with low-level functions to demonstrate.
What I was trying to convey is that, as far as I know, this is not a
problem that can be hit in practice without great effort, but it does
break an intended pgaio architectural principle as I understand it.

Also, I accidentally sent that to -bugs by fat-fingering an
autocompletion.  Moving to -hackers.
I discussed this off-list with Andres who provided the following review:

* +1 for the analysis
* +1 for the solution
* his benchmark machine shows no regression under heavy IO submission workload
* needs better comments

I had expected that patch to be rejected as too slow.  I was thinking
that it should be enough to insert a memory barrier and then do an
unlocked check for an empty waitlist in ConditionVariableBroadcast()
to avoid a spinlock acquire/release.  That caused a few weird hangs I
didn't understand, which is why I didn't post that version last time,
but he pointed out that the other side also needs a memory barrier in
ConditionVariablePrepareToSleep() and the existing spinlock
acquire/release is not enough.  Still processing that, but given that
ConditionVariableBroadcast() performance is already sufficient, there
is no need for fancy tricks for this problem and my "naive" patch is
actually fine.  So here's a version that just adds some comments and
corrects a minor thinko.

I also thought of a small optimisation, presented in the -B patch.
It's a bit of a shame to wait for backend->submit_cv and then also
ioh->cv in io_method=worker.  It's just a bunch of IPC ping-pong for
nothing.  So I figured it should be allowed to fall all the way
through based on its lack of ->wait_one.  Worth bothering with?

On a superficial note:

 AIO_IO_COMPLETION	"Waiting for another process to complete IO."
+AIO_IO_SUBMIT	"Waiting for another process to submit IO."
 AIO_IO_URING_SUBMIT	"Waiting for IO submission via io_uring."
 AIO_IO_URING_EXECUTION	"Waiting for IO execution via io_uring."

We're inconsistent in our choice of noun or verb.  I went with _SUBMIT
to match the following line, rather than _SUBMISSION to match the
preceding line.  Shrug.
Attachments
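A rough, simplified sketch of the waiting-side behaviour described above (illustrative only, not the attached patch): the waiter finds someone else's IO still in PGAIO_HS_DEFINED or PGAIO_HS_STAGED, sleeps on the owner's submit_cv until the handle has reached PGAIO_HS_SUBMITTED, and only then waits for the IO itself, via the IO method's ->wait_one() when one is provided.  The names pgaio_io_get_state() and wait_until_submitted_then_wait(), the ordered state comparisons, and WAIT_EVENT_AIO_IO_SUBMIT are assumptions for readability; generation rechecks and error paths are omitted.

#include "postgres.h"

#include "storage/aio_internal.h"
#include "storage/condition_variable.h"
#include "utils/wait_event.h"

/*
 * Simplified sketch of the waiting side.  Real code must also recheck the
 * handle's generation; pgaio_io_get_state() and WAIT_EVENT_AIO_IO_SUBMIT
 * are assumed names.
 */
static void
wait_until_submitted_then_wait(PgAioHandle *ioh, PgAioBackend *owner,
                               uint64 ref_generation)
{
    /*
     * The owner only promises to advance the IO to PGAIO_HS_SUBMITTED, so
     * sleep on its submit_cv until that has happened.
     */
    while (pgaio_io_get_state(ioh) < PGAIO_HS_SUBMITTED)
        ConditionVariableSleep(&owner->submit_cv, WAIT_EVENT_AIO_IO_SUBMIT);
    ConditionVariableCancelSleep();

    if (pgaio_method_ops->wait_one != NULL)
    {
        /* e.g. io_method=io_uring: let the IO method drive the wait */
        pgaio_method_ops->wait_one(ioh, ref_generation);
    }
    else
    {
        /* e.g. io_method=worker: wait for another process to complete it */
        while (pgaio_io_get_state(ioh) < PGAIO_HS_COMPLETED_SHARED)
            ConditionVariableSleep(&ioh->cv, WAIT_EVENT_AIO_IO_COMPLETION);
        ConditionVariableCancelSleep();
    }
}

The -B variant discussed above amounts to skipping the submit_cv wait entirely when the IO method has no ->wait_one, since such a waiter ends up sleeping on ioh->cv anyway.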
Hi,

On 2025-08-15 17:39:30 +1200, Thomas Munro wrote:
> I discussed this off-list with Andres who provided the following review:
>
> * +1 for the analysis
> * +1 for the solution
> * his benchmark machine shows no regression under heavy IO submission workload
> * needs better comments
>
> I had expected that patch to be rejected as too slow.  I was thinking
> that it should be enough to insert a memory barrier and then do an
> unlocked check for an empty waitlist in ConditionVariableBroadcast()
> to avoid a spinlock acquire/release.  That caused a few weird hangs I
> didn't understand, which is why I didn't post that version last time,
> but he pointed out that the other side also needs a memory barrier in
> ConditionVariablePrepareToSleep() and the existing spinlock
> acquire/release is not enough.

FTR, the reason that we would need a barrier there is that we define
SpinLockRelease() as a non-atomic volatile store on x86.  That means
there is no guarantee that another backend sees an up-to-date view of
memory when reading a memory location *without* acquiring the
spinlock.  The reason this is correct for spinlocks is that x86 has
strongly ordered stores, and the *acquisition* of a spinlock will only
succeed once the release of the spinlock has become visible.  But if
you read something *without* the spinlock, you're not guaranteed to
see all the changes made before the spinlock was released.

I do wonder if we accidentally rely on SpinLockRelease() being a
barrier anywhere...

> I also thought of a small optimisation, presented in the -B patch.
> It's a bit of a shame to wait for backend->submit_cv and then also
> ioh->cv in io_method=worker.  It's just a bunch of IPC ping-pong for
> nothing.  So I figured it should be allowed to fall all the way
> through based on its lack of ->wait_one.  Worth bothering with?

I don't think I'd go there without concrete evidence that it helps -
it makes it harder to understand when to wait for what.  Not terribly
so, but enough that I'd not go there without some measurable benefit.

> On a superficial note:
>
>  AIO_IO_COMPLETION	"Waiting for another process to complete IO."
> +AIO_IO_SUBMIT	"Waiting for another process to submit IO."
>  AIO_IO_URING_SUBMIT	"Waiting for IO submission via io_uring."
>  AIO_IO_URING_EXECUTION	"Waiting for IO execution via io_uring."
>
> We're inconsistent in our choice of noun or verb.  I went with _SUBMIT
> to match the following line, rather than _SUBMISSION to match the
> preceding line.  Shrug.

FWIW, I think so far it's a verb when the process is doing the work
(e.g. the process is calling io_uring_enter(to_submit = ...)).  It's a
noun if we wait for somebody *else* to do the completion, since we're
not doing the work ourselves.

> From ec2e1e21054f00918d3b28ce01129bc06de37790 Mon Sep 17 00:00:00 2001
> From: Thomas Munro <thomas.munro@gmail.com>
> Date: Sun, 3 Aug 2025 23:07:56 +1200
> Subject: [PATCH v2 1/2] aio: Fix pgaio_io_wait() for staged IOs.
>
> Previously, pgaio_io_wait()'s cases for PGAIO_HS_DEFINED and
> PGAIO_HS_STAGED fell through to waiting for completion.  The owner only
> promises to advance it to PGAIO_HS_SUBMITTED.  The waiter needs to be
> prepared to call ->wait_one() itself once the IO is submitted in order
> to guarantee progress and avoid deadlocks on IO methods that provide
> ->wait_one().
>
> Introduce a new per-backend condition variable submit_cv, woken by
> pgaio_submit_stage(), and use it to wait for the state to advance.  The
> new broadcast doesn't seem to cause any measurable slowdown, so ideas
> for optimizing the common no-waiters case were abandoned for now.
>
> It may not be possible to reach any real deadlock with existing AIO
> users, but that situation could change.  There's also no reason the
> waiter shouldn't begin to wait via the IO method as soon as possible
> even without a deadlock.
>
> Picked up by testing a proposed IO method that has ->wait_one(), like
> io_method=io_uring, and code review.

LGTM.

Greetings,

Andres Freund
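To make the barrier question concrete, here is a sketch of the abandoned broadcast fast path (illustrative only; the function name is invented, while ConditionVariable, proclist_is_empty() and pg_memory_barrier() are existing primitives).  The broadcaster publishes its state change and then reads the wait list without taking the spinlock; unless the sleeping side also has a full barrier between adding itself to the wait list and rechecking the state, i.e. in ConditionVariablePrepareToSleep(), each side's unlocked read can miss the other side's store, which would be consistent with the hangs described above.

#include "postgres.h"

#include "port/atomics.h"
#include "lib/proclist.h"
#include "storage/condition_variable.h"

/*
 * Hypothetical fast path for the broadcast side: skip the spinlock when no
 * one appears to be waiting.
 */
static void
ConditionVariableBroadcastIfWaiters(ConditionVariable *cv)
{
    /*
     * Order our earlier stores (e.g. advancing the IO's state) before the
     * unlocked read of the wait list.  SpinLockRelease() on the sleeper's
     * side is just a volatile store on x86, so it provides no ordering for
     * readers that never take the spinlock.
     */
    pg_memory_barrier();

    if (proclist_is_empty(&cv->wakeup))
        return;                 /* nobody visibly waiting: skip the spinlock */

    ConditionVariableBroadcast(cv);     /* slow path: spinlock and wake all */
}

Since ConditionVariableBroadcast() showed no measurable regression under a heavy IO submission workload, the simple spinlocked version was kept and this fast path was not pursued.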
On 19/08/2025 03:07, Andres Freund wrote:
> On 2025-08-15 17:39:30 +1200, Thomas Munro wrote:
>> From ec2e1e21054f00918d3b28ce01129bc06de37790 Mon Sep 17 00:00:00 2001
>> From: Thomas Munro <thomas.munro@gmail.com>
>> Date: Sun, 3 Aug 2025 23:07:56 +1200
>> Subject: [PATCH v2 1/2] aio: Fix pgaio_io_wait() for staged IOs.
>>
>> Previously, pgaio_io_wait()'s cases for PGAIO_HS_DEFINED and
>> PGAIO_HS_STAGED fell through to waiting for completion.  The owner only
>> promises to advance it to PGAIO_HS_SUBMITTED.  The waiter needs to be
>> prepared to call ->wait_one() itself once the IO is submitted in order
>> to guarantee progress and avoid deadlocks on IO methods that provide
>> ->wait_one().
>>
>> Introduce a new per-backend condition variable submit_cv, woken by
>> pgaio_submit_stage(), and use it to wait for the state to advance.  The
>> new broadcast doesn't seem to cause any measurable slowdown, so ideas
>> for optimizing the common no-waiters case were abandoned for now.
>>
>> It may not be possible to reach any real deadlock with existing AIO
>> users, but that situation could change.  There's also no reason the
>> waiter shouldn't begin to wait via the IO method as soon as possible
>> even without a deadlock.
>>
>> Picked up by testing a proposed IO method that has ->wait_one(), like
>> io_method=io_uring, and code review.
>
> LGTM.

I just independently noticed this same issue, wrote a little test to
reproduce it, and was about to report it when I noticed that you had
found it already.  Attached is the repro script.

Both of the proposed patches seem fine to me.  I'm inclined to go with
the first patch (v2-0001-aio-Fix-pgaio_io_wait-for-staged-IOs.patch),
without the extra optimization, unless we can actually measure a
performance difference.

- Heikki