Re: Non-reproducible AIO failure

From Andres Freund
Subject Re: Non-reproducible AIO failure
Msg-id 2dkz7azclpeiqcmouamdixyn5xhlzy4rvikxrbovyzvi6rnv5c@pz7o7osv2ahf
In reply to Re: Non-reproducible AIO failure  (Konstantin Knizhnik <knizhnik@garret.ru>)
Responses Re: Non-reproducible AIO failure
Re: Non-reproducible AIO failure
List pgsql-hackers
Hi,

On 2025-06-12 16:30:54 +0300, Konstantin Knizhnik wrote:
> On 12/06/2025 4:13 pm, Andres Freund wrote:
> > On 2025-06-12 15:12:00 +0300, Konstantin Knizhnik wrote:
> > I'm reasonably certain I found the issue; I think it's a missing memory
> > barrier on the read side. The CPU is reordering the read (or just using a
> > cached value) of ->distilled_result to be before the load of ->state.
> > 
> > But it'll take a bit to verify that that's the issue...
> 
> It is great!
> But I wonder how it correlates with your previous statement:
> 
> > There shouldn't be any concurrent accesses here, so I don't really see how the
> > above would explain the problem (the IO can only ever be modified by one
> > backend, initially the "owning backend", then, when submitted, by the IO
> > worker, and then again by the backend).

The problem appears to be in that switch between "when submitted, by the IO
worker" and "then again by the backend".  It's not concurrent access in the
sense of two processes writing to the same value, it's that when switching
from the worker updating ->distilled_result to the issuer looking at that, the
issuer didn't ensure that no outdated version of ->distilled_result could be
used.

Basically, the problem is that the worker would

1) set ->distilled_result
2) perform a write memory barrier
3) set ->state to COMPLETED_SHARED

and then the issuer of the IO would:

4) check ->state is COMPLETED_SHARED
5) use ->distilled_result

The problem is that there currently is no read barrier between 4 & 5, which
means the load of ->distilled_result can be satisfied before the load of
->state, so an outdated ->distilled_result could be used.
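
To make the ordering concrete, here is a minimal standalone sketch of steps
1-5 with the missing read-side barrier added. It is not the actual AIO code:
the struct, the sketch_* names and the hard-coded result value are all made up
for illustration, and it uses C11 fences plus pthreads so that it compiles on
its own. In the tree the corresponding primitives would presumably be
pg_write_barrier() and pg_read_barrier().

```c
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical stand-ins for the real handle states and fields. */
enum sketch_state { SKETCH_IN_PROGRESS = 0, SKETCH_COMPLETED_SHARED = 1 };

struct sketch_io
{
    int        distilled_result; /* plain payload, written by the worker */
    atomic_int state;            /* flag the issuer checks */
};

/* Worker side: steps 1-3. */
static void
sketch_worker_complete(struct sketch_io *io, int result)
{
    io->distilled_result = result;              /* 1) set the result */
    atomic_thread_fence(memory_order_release);  /* 2) write barrier */
    atomic_store_explicit(&io->state, SKETCH_COMPLETED_SHARED,
                          memory_order_relaxed); /* 3) publish the state */
}

/* Issuer side: steps 4-5, with the barrier that was missing in between. */
static int
sketch_issuer_wait(struct sketch_io *io)
{
    /* 4) spin until the state says the IO completed */
    while (atomic_load_explicit(&io->state, memory_order_relaxed)
           != SKETCH_COMPLETED_SHARED)
        ;

    /* read barrier: the result may not be loaded ahead of the state */
    atomic_thread_fence(memory_order_acquire);

    return io->distilled_result;                /* 5) use the result */
}

static void *
sketch_worker_main(void *arg)
{
    sketch_worker_complete((struct sketch_io *) arg, 42);
    return NULL;
}

/* Runs one worker/issuer hand-off and returns what the issuer observed. */
static int
sketch_round_trip(void)
{
    struct sketch_io io = {.distilled_result = -1};
    pthread_t   worker;
    int         seen;

    atomic_init(&io.state, SKETCH_IN_PROGRESS);
    if (pthread_create(&worker, NULL, sketch_worker_main, &io) != 0)
        return -1;
    seen = sketch_issuer_wait(&io);
    pthread_join(worker, NULL);
    return seen;
}
```

Without the acquire fence the sketch would have the same hole as described
above: nothing would stop the load in step 5 from being satisfied before the
load in step 4.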


This also explains why the issue looked so weird - eventually, after fprintfs,
after a core dump, etc, the updated ->distilled_result would "arrive"
in the issuing process, and suddenly look correct.



> This is what I am observing myself: the "op" field is modified and fetched by
> the same process.

Right - but I don't think the ->op field being wrong was actually part of the
issue.


> Certainly a process can be rescheduled to some other CPU. But if such a
> reschedule can cause loss of a stored value, then nothing will work, will it?

Yes, that'd be completely broken - but that isn't the issue here.

Greetings,

Andres Freund


