Thread: Re: Potential deadlock in pgaio_io_wait()

Re: Potential deadlock in pgaio_io_wait()

From: Thomas Munro
Date:
On Mon, Aug 4, 2025 at 5:54 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> I doubt it's very easy to reproduce with simple queries, but I assume
> if you had a SQL function that acquires a central LWLock and you ran
> concurrent queries SELECT COUNT(*) FROM t WHERE locking_function(x)

Hmm, that's a bad example, as it has the wrong lock scope.  It would
probably need a dedicated test with low-level functions to demonstrate.
What I was trying to convey is that, as far as I know, this isn't a problem
that can be hit in practice without great effort, but it does break an
intended pgaio architectural principle as I understand it.

Also, I accidentally sent that to -bugs by fat-fingering an
autocompletion.  Moving to -hackers.



Re: Potential deadlock in pgaio_io_wait()

From: Thomas Munro
Date:
I discussed this off-list with Andres who provided the following review:

 * +1 for the analysis
 * +1 for the solution
 * his benchmark machine shows no regression under heavy IO submission workload
 * needs better comments

I had expected that patch to be rejected as too slow.  I was thinking
that it should be enough to insert a memory barrier and then do an
unlocked check for an empty waitlist in ConditionVariableBroadcast()
to avoid a spinlock acquire/release.  That caused a few weird hangs I
didn't understand, which is why I didn't post that version last time,
but he pointed out that the other side also needs a memory barrier in
ConditionVariablePrepareToSleep() and the existing spinlock
acquire/release is not enough.  I'm still processing that, but given that
plain ConditionVariableBroadcast() is already fast enough here (no
measurable regression), there is no need for fancy tricks for this problem
and my "naive" patch is actually fine.
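
To make that concrete, this is roughly the shape of the fast path I was
experimenting with; it's a sketch only (the wrapper name is invented), and
as discussed it is not safe without a matching barrier on the sleeping side:

#include "postgres.h"
#include "port/atomics.h"
#include "storage/condition_variable.h"
#include "storage/proclist.h"

/*
 * Sketch: skip the spinlock entirely when nobody is sleeping on the CV.
 * The barrier orders our earlier shared-memory stores (e.g. the IO state
 * change) against the unlocked read of the wait list; without a matching
 * barrier in ConditionVariablePrepareToSleep() a waiter can still be missed.
 */
static void
ConditionVariableBroadcastSketch(ConditionVariable *cv)
{
	pg_memory_barrier();

	if (proclist_is_empty(&cv->wakeup))
		return;					/* nobody is (visibly) waiting */

	/* Slow path: the existing spinlock-protected wakeup loop. */
	ConditionVariableBroadcast(cv);
}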

So here's a version that just adds some comments and corrects a minor thinko.

I also thought of a small optimisation, presented in the -B patch.
It's a bit of a shame to wait for backend->submit_cv and then also
ioh->cv in io_method=worker.  It's just a bunch of IPC ping-pong for
nothing.  So I figured it should be allowed to fall all the way
through based on its lack of ->wait_one.  Worth bothering with?
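
Expressed as code, the -B test amounts to something like the following
sketch (the helper name is invented; pgaio_method_ops is assumed to be the
usual pointer to the active IoMethodOps):

#include "postgres.h"
#include "storage/aio_internal.h"

/*
 * Sketch: is there any point waiting on the owner's submit_cv?  If the
 * active IO method has no ->wait_one() callback (io_method=worker), being
 * woken at submit time would only let us go straight back to sleep on
 * ioh->cv, so the waiter may as well fall through to the completion wait.
 */
static inline bool
pgaio_wait_for_submit_is_useful(void)
{
	return pgaio_method_ops->wait_one != NULL;
}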

On a superficial note:

 AIO_IO_COMPLETION      "Waiting for another process to complete IO."
+AIO_IO_SUBMIT  "Waiting for another process to submit IO."
 AIO_IO_URING_SUBMIT    "Waiting for IO submission via io_uring."
 AIO_IO_URING_EXECUTION "Waiting for IO execution via io_uring."

We're inconsistent in our choice of noun or verb.  I went with _SUBMIT
to match the following line, rather than _SUBMISSION to match the
preceding line.  Shrug.

Attachments

Re: Potential deadlock in pgaio_io_wait()

From: Andres Freund
Date:
Hi,

On 2025-08-15 17:39:30 +1200, Thomas Munro wrote:
> I discussed this off-list with Andres who provided the following review:
> 
>  * +1 for the analysis
>  * +1 for the solution
>  * his benchmark machine shows no regression under heavy IO submission workload
>  * needs better comments
> 
> I had expected that patch to be rejected as too slow.  I was thinking
> that it should be enough to insert a memory barrier and then do an
> unlocked check for an empty waitlist in ConditionVariableBroadcast()
> to avoid a spinlock acquire/release.  That caused a few weird hangs I
> didn't understand, which is why I didn't post that version last time,
> but he pointed out that the other side also needs a memory barrier in
> ConditionVariablePrepareToSleep() and the existing spinlock
> acquire/release is not enough.

FTR, the reason we would need a barrier there is that we define
SpinLockRelease() as a non-atomic volatile store on x86.  That means there is
no guarantee that another backend sees an up-to-date view of memory when it
reads a memory location *without* acquiring the spinlock.  The reason this is
correct for spinlocks themselves is that x86 has strongly ordered stores, and
the *acquisition* of a spinlock can only succeed once the release of the
spinlock has become visible.  But if you read something *without* taking the
spinlock, you're not guaranteed to see all the changes made while it was held.
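
To spell out the pairing (a simplified sketch, not a proposal; the real
ConditionVariablePrepareToSleep() also manages cv_sleep_target and so on):
the waiter has to publish itself on the wait list and then execute a full
barrier before re-checking its condition, matching the barrier in front of
the waker's unlocked check:

#include "postgres.h"
#include "port/atomics.h"
#include "storage/condition_variable.h"
#include "storage/proc.h"
#include "storage/proclist.h"
#include "storage/spin.h"

static void
ConditionVariablePrepareToSleepSketch(ConditionVariable *cv)
{
	/* Publish ourselves on the wait list. */
	SpinLockAcquire(&cv->mutex);
	proclist_push_tail(&cv->wakeup, MyProcNumber, cvWaitLink);
	SpinLockRelease(&cv->mutex);	/* x86: plain non-atomic store */

	/*
	 * Without this, our addition to the wait list can still be sitting in
	 * the store buffer when the caller re-checks its condition; the waker's
	 * unlocked check can then see an empty list even though we are about to
	 * sleep, and the wakeup is lost.
	 */
	pg_memory_barrier();
}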

I do wonder if we accidentally rely on SpinLockRelease() being a barrier
anywhere...


> I also thought of a small optimisation, presented in the -B patch.
> It's a bit of a shame to wait for backend->submit_cv and then also
> ioh->cv in io_method=worker.  It's just a bunch of IPC ping-pong for
> nothing.  So I figured it should be allowed to fall all the way
> through based on its lack of ->wait_one.  Worth bothering with?

I don't think I'd go there without concrete evidence that it helps: it makes
it harder to understand when to wait for what.  Not terribly so, but enough
that I'd want some measurable benefit first.


> On a superficial note:
> 
>  AIO_IO_COMPLETION      "Waiting for another process to complete IO."
> +AIO_IO_SUBMIT  "Waiting for another process to submit IO."
>  AIO_IO_URING_SUBMIT    "Waiting for IO submission via io_uring."
>  AIO_IO_URING_EXECUTION "Waiting for IO execution via io_uring."


> We're inconsistent in our choice of noun or verb.  I went with _SUBMIT
> to match the following line, rather than _SUBMISSION to match the
> preceding line.  Shrug.

FWIW, I think the pattern so far is that it's a verb when the process is doing
the work itself (e.g. the process is calling io_uring_enter(to_submit = ...)),
and a noun when we're waiting for somebody *else* to do the completion, since
we're not doing the work ourselves.



> From ec2e1e21054f00918d3b28ce01129bc06de37790 Mon Sep 17 00:00:00 2001
> From: Thomas Munro <thomas.munro@gmail.com>
> Date: Sun, 3 Aug 2025 23:07:56 +1200
> Subject: [PATCH v2 1/2] aio: Fix pgaio_io_wait() for staged IOs.
> 
> Previously, pgaio_io_wait()'s cases for PGAIO_HS_DEFINED and
> PGAIO_HS_STAGED fell through to waiting for completion.  The owner only
> promises to advance it to PGAIO_HS_SUBMITTED.  The waiter needs to be
> prepared to call ->wait_one() itself once the IO is submitted in order
> to guarantee progress and avoid deadlocks on IO methods that provide
> ->wait_one().
> 
> Introduce a new per-backend condition variable submit_cv, woken by
> pgaio_submit_stage(), and use it to wait for the state to advance.  The
> new broadcast doesn't seem to cause any measurable slowdown, so ideas
> for optimizing the common no-waiters case were abandoned for now.
> 
> It may not be possible to reach any real deadlock with existing AIO
> users, but that situation could change.  There's also no reason the
> waiter shouldn't begin to wait via the IO method as soon as possible
> even without a deadlock.
> 
> Picked up by testing a proposed IO method that has ->wait_one(), like
> io_method=io_uring, and code review.

LGTM.

Greetings,

Andres Freund



Re: Potential deadlock in pgaio_io_wait()

From: Heikki Linnakangas
Date:
On 19/08/2025 03:07, Andres Freund wrote:
> On 2025-08-15 17:39:30 +1200, Thomas Munro wrote:
>>  From ec2e1e21054f00918d3b28ce01129bc06de37790 Mon Sep 17 00:00:00 2001
>> From: Thomas Munro <thomas.munro@gmail.com>
>> Date: Sun, 3 Aug 2025 23:07:56 +1200
>> Subject: [PATCH v2 1/2] aio: Fix pgaio_io_wait() for staged IOs.
>>
>> Previously, pgaio_io_wait()'s cases for PGAIO_HS_DEFINED and
>> PGAIO_HS_STAGED fell through to waiting for completion.  The owner only
>> promises to advance it to PGAIO_HS_SUBMITTED.  The waiter needs to be
>> prepared to call ->wait_one() itself once the IO is submitted in order
>> to guarantee progress and avoid deadlocks on IO methods that provide
>> ->wait_one().
>>
>> Introduce a new per-backend condition variable submit_cv, woken by
>> pgaio_submit_stage(), and use it to wait for the state to advance.  The
>> new broadcast doesn't seem to cause any measurable slowdown, so ideas
>> for optimizing the common no-waiters case were abandoned for now.
>>
>> It may not be possible to reach any real deadlock with existing AIO
>> users, but that situation could change.  There's also no reason the
>> waiter shouldn't begin to wait via the IO method as soon as possible
>> even without a deadlock.
>>
>> Picked up by testing a proposed IO method that has ->wait_one(), like
>> io_method=io_uring, and code review.
> 
> LGTM.

I just independently noticed this same issue, wrote a little test to 
reproduce it, and was about to report it when I noticed that you'd already 
found it. Attached is the repro script.

Both of the proposed patches seem fine to me. I'm inclined to go with 
the first patch (v2-0001-aio-Fix-pgaio_io_wait-for-staged-IOs.patch), 
without the extra optimization, unless we can actually measure a 
performance difference.

- Heikki

Attachments