Re: Non-reproducible AIO failure

From Konstantin Knizhnik
Subject Re: Non-reproducible AIO failure
Date
Msg-id d19c1400-bb7e-4644-b4c5-1acfbf79a223@garret.ru
In response to Re: Non-reproducible AIO failure  (Alexander Lakhin <exclusion@gmail.com>)
Responses Re: Non-reproducible AIO failure
List pgsql-hackers
On 12/06/2025 4:13 pm, Andres Freund wrote:
> Hi,
>
> On 2025-06-12 15:12:00 +0300, Konstantin Knizhnik wrote:
>> Reproduced it once again with a write-protected io handle.
>> But once again - no access violation, just assert failure.
>>
>> Previously "op" field was overwritten somewhere between `pgaio_io_reclaim`
>> and `AsyncReadBuffers`:
>>
>> !!!pgaio_io_reclaim [20376]| ioh: 0x1019bc000, ioh->op: 0, ioh->generation:
>> 19346
>> !!!AsyncReadBuffers [20376] (1)| blocknum: 21, ioh: 0x1019bc000, ioh->op: 1,
>> ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0, ioh->generation: 19346
>>
>> Now it is overwritten after print in AsyncReadBuffers:
>>
>> !!!pgaio_io_reclaim [88932]| ioh: 0x105a5c000, ioh->op: 0, ioh->generation:
>> 42848
>> !!!pgaio_io_acquire_nb[88932]| ioh: 0x105a5c000, ioh->op: 0,
>> ioh->generation: 42848
>> !!!AsyncReadBuffers [88932] (1)| blocknum: 10, ioh: 0x105a5c000, ioh->op: 0,
>> ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0, ioh->generation: 42848
>> !!!pgaio_io_before_start| ioh: 0x105a5c000, ioh->op: 1, ioh->state: 1,
>> ioh->result: 0, ioh->num_callbacks: 2, ioh->generation: 42848
>>
>> In this run I prohibit writes to io handle in `pgaio_io_acquire_nb` and
>> reenable them in `AsyncReadBuffer`.
> I'm reasonably certain I found the issue, I think it's a missing memory
> barrier on the read side. The CPU is reordering the read (or just using a
> cached value) of ->distilled_result to be before the load of ->state.
>
> But it'll take a bit to verify that that's the issue...

That is great!
But I wonder how it squares with your previous statement:

> There shouldn't be any concurrent accesses here, so I don't really see how the
> above would explain the problem (the IO can only ever be modified by one
> backend, initially the "owning backend", then, when submitted, by the IO
> worker, and then again by the backend).


This is what I am observing myself: the "op" field is modified and fetched
by the same process.
Certainly a process can be rescheduled to some other CPU. But if such a
reschedule could cause the loss of a stored value, then nothing would work,
would it?
So assume there is some variable "x" which the process sets with "x=1"
on CPU1; then the process is rescheduled to CPU2, where it executes
"x=2"; then it is rescheduled back to CPU1 and finds that "x==1". And to
prevent this we would need to explicitly enforce some memory barrier.
Unless I am missing something, nothing can work with such a memory model.
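
For contrast, here is a minimal sketch of the only case where I would
expect a read barrier to matter: two different threads handing a value
over. It assumes plain C11 atomics and pthreads rather than PostgreSQL's
own barrier primitives, and all names (DemoHandle, completer,
run_handoff) are hypothetical stand-ins, not the real AIO structures.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical stand-in for an IO handle; not the real PgAioHandle. */
typedef struct
{
    int         result;     /* plain field, written before publishing */
    atomic_int  state;      /* 0 = in progress, 1 = completed */
} DemoHandle;

/* Plays the role of whoever completes the IO (a different thread). */
static void *
completer(void *arg)
{
    DemoHandle *h = arg;

    h->result = 42;
    /* release: the store to result cannot move after this store */
    atomic_store_explicit(&h->state, 1, memory_order_release);
    return NULL;
}

/* Returns the result seen by the waiter, or -1 if no thread was started. */
int
run_handoff(void)
{
    DemoHandle  h = {.result = 0};
    pthread_t   t;
    int         seen;

    atomic_init(&h.state, 0);
    if (pthread_create(&t, NULL, completer, &h) != 0)
        return -1;

    /* acquire: the load of result below cannot move before this load */
    while (atomic_load_explicit(&h.state, memory_order_acquire) != 1)
        ;
    seen = h.result;

    pthread_join(t, NULL);
    return seen;
}
```

With the acquire/release pair the waiter is guaranteed to see the result
written before the state change. Within a single thread no such pairing
is needed: a thread always observes its own earlier stores in program
order, even if the kernel migrates it to another CPU in between.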




