Re: Non-reproducible AIO failure
From | Andres Freund |
---|---|
Subject | Re: Non-reproducible AIO failure |
Date | |
Msg-id | of6nnksyqlbqikhpiwspalskgtx5dax6te2dwn3ojmj5k7obh4@hrteef7hiwvp |
In reply to | Re: Non-reproducible AIO failure (Konstantin Knizhnik <knizhnik@garret.ru>) |
Responses | Re: Non-reproducible AIO failure |
List | pgsql-hackers |
Hi,

On 2025-06-16 20:22:00 -0400, Tom Lane wrote:
> Konstantin Knizhnik <knizhnik@garret.ru> writes:
> > On 16/06/2025 6:11 pm, Andres Freund wrote:
> >> I unfortunately can't repro this issue so far.
>
> > But unfortunately it means that the problem is not fixed.
>
> FWIW, I get similar results to Andres' on a Mac Mini M4 Pro
> using MacPorts' current compiler release (clang version 19.1.7).
> The currently-proposed test case fails within a few minutes on
> e9a3615a5^ but doesn't fail in a couple of hours on e9a3615a5.

I'm surprised it takes that long, given it takes seconds to reproduce here
with the config parameters I outlined.

Did you try cranking up the concurrency a bit? Yours has more cores than
mine, and I found that that makes a huge difference.

> However, I cannot repro that on a slightly older Mini M1 using Apple's
> current release (clang-1700.0.13.5, which per wikipedia is really LLVM
> 19.1.4). It seems to work fine even without e9a3615a5. So the whole
> thing is still depressingly phase-of-the-moon-dependent.

It's not entirely surprising that an M1 would have a harder time
reproducing the issue: more cores, larger caches and a larger out-of-order
execution window will make it more likely that the missing memory barriers
have a visible effect.

I'm reasonably sure that e9a3615a5 quashed that specific issue - I could
repro it within seconds with e9a3615a5^ and with e9a3615a5 I ran it for
several days without a single failure...

> I don't doubt that Konstantin has found a different issue, but
> it's hard to be sure about the fix unless we can get it to be
> more reproducible. Neither of my machines has ever shown the
> symptom he's getting.

I've not been able to reproduce that symptom a single time either so far.

The assertion continues to be inexplicable to me. It shows, within a single
process, memory in shared memory going "backwards". But not always, just
very occasionally. Because this is before the IO is defined, there's no
concurrent access whatsoever.

I stole^Wgot my partner's m1 macbook for a bit, trying to reproduce the
issue there. It has
"Apple clang version 16.0.0 (clang-1600.0.26.6)"
on
"Darwin Kernel Version 24.3.0"

That's the same Apple-clang version that Alexander reported being able to
reproduce the issue on [1], but unfortunately it's a newer kernel version.

No dice in the first 55 test iterations.

Konstantin, Alexander - are you using the same device to reproduce this or
different ones? I wonder if this somehow depends on some MDM / corporate
enforcement tooling running or such.

What does:
- profiles status -type enrollment
- kextstat -l
show?

Greetings,

Andres Freund

[1] https://postgr.es/m/92b33ab2-0596-40fe-9db6-a6d821d08e8a%40gmail.com