Re: Non-reproducible AIO failure
From | Andres Freund |
---|---|
Subject | Re: Non-reproducible AIO failure |
Date | |
Msg-id | r57i4bh6fs3vpprav7kynh2bbyma3mw5t3lxnzua43cszft5y7@itwb5kchyhfo |
In reply to | Re: Non-reproducible AIO failure (Konstantin Knizhnik <knizhnik@garret.ru>) |
List | pgsql-hackers |

On 2025-06-17 17:54:12 +0300, Konstantin Knizhnik wrote:
>
> On 12/06/2025 4:57 pm, Andres Freund wrote:
> > The problem appears to be in that switch between "when submitted, by the IO
> > worker" and "then again by the backend". It's not concurrent access in the
> > sense of two processes writing to the same value, it's that when switching
> > from the worker updating ->distilled_result to the issuer looking at that, the
> > issuer didn't ensure that no outdated version of ->distilled_result could be
> > used.
> >
> > Basically, the problem is that the worker would
> >
> > 1) set ->distilled_result
> > 2) perform a write memory barrier
> > 3) set ->state to COMPLETED_SHARED
> >
> > and then the issuer of the IO would:
> >
> > 4) check ->state is COMPLETED_SHARED
> > 5) use ->distilled_result
> >
> > The problem is that there currently is no barrier between 4 & 5, which means
> > an outdated ->distilled_result could be used.
> >
> >
> > This also explains why the issue looked so weird - eventually, after fprintfs,
> > after a core dump, etc, the updated ->distilled_result result would "arrive"
> > in the issuing process, and suddenly look correct.
>
> Sorry, I realized that I do not completely understand how it can explain the
> assertion failure in `pgaio_io_before_start`:
>
> Assert(ioh->op == PGAIO_OP_INVALID);

I don't think it can - this must be an independent bug from the one that Tom
and I were encountering.

Greetings,

Andres Freund
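
For readers following the numbered steps in the quoted text, the ordering can be sketched in portable C11. This is only an illustration under assumptions: the struct, enum, and function names below are hypothetical stand-ins rather than PostgreSQL's actual AIO definitions, and C11 fences are used in place of PostgreSQL's own pg_write_barrier() / pg_read_barrier() macros. The point it shows is that the reader needs a barrier between checking the state and using the result, pairing with the writer's barrier.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins; not PostgreSQL's real definitions. */
enum sketch_state
{
    SKETCH_IN_FLIGHT = 0,
    SKETCH_COMPLETED_SHARED = 1
};

struct sketch_io
{
    int         distilled_result;   /* plain, non-atomic payload */
    atomic_int  state;              /* flag that publishes the payload */
};

/* Worker side: steps 1-3 from the quoted description. */
static void
sketch_worker_complete(struct sketch_io *io, int result)
{
    io->distilled_result = result;                  /* 1) set the result */
    atomic_thread_fence(memory_order_release);      /* 2) write barrier */
    atomic_store_explicit(&io->state, SKETCH_COMPLETED_SHARED,
                          memory_order_relaxed);    /* 3) set the state */
}

/*
 * Issuer side: steps 4-5, with a barrier between them. Without the acquire
 * fence, the load of distilled_result could be satisfied with a stale value
 * even though the state check already observed COMPLETED_SHARED.
 */
static bool
sketch_issuer_try_consume(struct sketch_io *io, int *result)
{
    if (atomic_load_explicit(&io->state, memory_order_relaxed)
        != SKETCH_COMPLETED_SHARED)                 /* 4) check the state */
        return false;

    atomic_thread_fence(memory_order_acquire);      /* read barrier */
    *result = io->distilled_result;                 /* 5) use the result */
    return true;
}
```

A caller would invoke sketch_worker_complete() from the completing side and poll sketch_issuer_try_consume() from the issuing side. The release fence orders the payload store before the state store, and the acquire fence orders the state load before the payload load, so a reader that sees COMPLETED_SHARED also sees the matching result.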