Re: Non-reproducible AIO failure
From | Andres Freund |
---|---|
Subject | Re: Non-reproducible AIO failure |
Date | |
Msg-id | 2dkz7azclpeiqcmouamdixyn5xhlzy4rvikxrbovyzvi6rnv5c@pz7o7osv2ahf |
In response to | Re: Non-reproducible AIO failure (Konstantin Knizhnik <knizhnik@garret.ru>) |
Responses |
Re: Non-reproducible AIO failure
Re: Non-reproducible AIO failure |
List | pgsql-hackers |
Hi,

On 2025-06-12 16:30:54 +0300, Konstantin Knizhnik wrote:
> On 12/06/2025 4:13 pm, Andres Freund wrote:
> > On 2025-06-12 15:12:00 +0300, Konstantin Knizhnik wrote:
> > I'm reasonably certain I found the issue, I think it's a missing memory
> > barrier on the read side. The CPU is reordering the read (or just using a
> > cached value) of ->distilled_result to be before the load of ->state.
> >
> > But it'll take a bit to verify that that's the issue...
>
> It is great!
> But I wonder how it correlates with your previous statement:
>
> > There shouldn't be any concurrent accesses here, so I don't really see how the
> > above would explain the problem (the IO can only ever be modified by one
> > backend, initially the "owning backend", then, when submitted, by the IO
> > worker, and then again by the backend).

The problem appears to be in that switch between "when submitted, by the IO
worker" and "then again by the backend". It's not concurrent access in the
sense of two processes writing to the same value, it's that when switching
from the worker updating ->distilled_result to the issuer looking at that, the
issuer didn't ensure that no outdated version of ->distilled_result could be
used.

Basically, the problem is that the worker would

1) set ->distilled_result
2) perform a write memory barrier
3) set ->state to COMPLETED_SHARED

and then the issuer of the IO would:

4) check ->state is COMPLETED_SHARED
5) use ->distilled_result

The problem is that there currently is no barrier between 4 & 5, which means
an outdated ->distilled_result could be used.

This also explains why the issue looked so weird - eventually, after fprintfs,
after a core dump, etc, the updated ->distilled_result result would "arrive"
in the issuing process, and suddenly look correct.

> This is what I am observing myself: "op" field is modified and fetched by
> the same process.

Right - but I don't think the ->op field being wrong was actually part of the
issue though.

> Certainly process can be rescheduled to some other CPU. But if such
> reschedule can cause loose of stored value, then nothing will work, will it?

Yes, that'd be completely broken - but isn't the issue here.

Greetings,

Andres Freund
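
For illustration only, the five steps above can be sketched with portable C11 atomics. This is a minimal sketch and not the PostgreSQL code or the eventual fix: `io_handle`, `io_state`, `worker_complete()` and `issuer_try_consume()` are hypothetical names, and `distilled_result` is reduced to a plain `int`. The release and acquire fences stand in for a write barrier and a read barrier respectively (in PostgreSQL those roles are played by `pg_write_barrier()` and `pg_read_barrier()`).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not PostgreSQL's definitions. */
typedef enum { IO_SUBMITTED, IO_COMPLETED_SHARED } io_state;

typedef struct
{
    int         distilled_result;   /* plain (non-atomic) payload */
    _Atomic int state;              /* holds an io_state, published last */
} io_handle;

/* Worker side: steps 1-3. */
static void
worker_complete(io_handle *io, int result)
{
    assert(io != NULL);

    io->distilled_result = result;                  /* 1) set the result */
    atomic_thread_fence(memory_order_release);      /* 2) write barrier */
    atomic_store_explicit(&io->state,               /* 3) publish the state */
                          IO_COMPLETED_SHARED,
                          memory_order_relaxed);
}

/*
 * Issuer side: steps 4-5, with the barrier that was missing between them.
 * Returns false if the IO has not completed yet; *result is left untouched.
 */
static bool
issuer_try_consume(io_handle *io, int *result)
{
    /* 4) check the state */
    if (atomic_load_explicit(&io->state, memory_order_relaxed)
        != IO_COMPLETED_SHARED)
        return false;

    /*
     * Read barrier: without it, the load of distilled_result below may be
     * satisfied before the load of state above, yielding a stale value.
     */
    atomic_thread_fence(memory_order_acquire);

    *result = io->distilled_result;                 /* 5) use the result */
    return true;
}
```

The pairing matters: the write barrier on the worker side only orders the worker's stores. The reader needs its own barrier between the state check and the payload read for the ordering to be visible on its side, which is the gap between steps 4 and 5.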