Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)
From | Alexander Lakhin |
---|---|
Subject | Re: lockup in parallel hash join on dikkop (freebsd 14.0-current) |
Date | |
Msg-id | ee0e1ae4-ff12-7d56-72a8-a70e492d6287@gmail.com |
In response to | Re: lockup in parallel hash join on dikkop (freebsd 14.0-current) (Thomas Munro <thomas.munro@gmail.com>) |
Responses | Re: lockup in parallel hash join on dikkop (freebsd 14.0-current) |
List | pgsql-hackers |
Hi Thomas,

08.09.2023 22:39, Thomas Munro wrote:
>> With debugging logging added I see (on 7389aad63~1) that one process
>> really sends SIGURG to another, and the latter reaches poll(), but it
>> just got no signal, its signal handler not called and poll() just waits...
> Thanks for working so hard on this Alexander. That is a surprising
> discovery! So changes to the signal handler arrangements in the
> *postmaster* before the child was forked affected this?

Yes, I think we are dealing with something like that. I can try to deduce a minimal change that affects reproduction of the issue, but maybe it's not that important. Perhaps we should now think about escalating the problem to the FreeBSD developers? I wonder what kind of reproducer they would find acceptable: only a standalone C program, or would a script that compiles/installs postgres and runs our test do too?

>> So it looks like the ARM weak memory model is not the root cause of the
>> issue. But as far as I can see, it's still specific to FreeBSD (but not
>> specific to a compiler - I used gcc and clang with the same success).
> Idea: FreeBSD 13 introduced a new mechanism called sigfastblock[1],
> which lets system libraries control signal blocking with atomic memory
> tricks in a word of user space memory. I have no particular theory
> for why it would be going wrong here (I don't expect us to be using
> any of the stuff that would use it, though I don't understand it in
> detail so that doesn't say much), but it occurred to me that all
> reports so far have been on 13.x or 14. I wonder... If you have a
> good fast recipe for reproducing this, could you also try it on
> FreeBSD 12.4?

It was a lucky guess!
I checked the reproduction on FreeBSD 13.1-RELEASE releng/13.1-n250148-fc952ac2212 and got the same results as on FreeBSD 14:
REL_12_STABLE - failed on iteration 3
REL_15_STABLE - failed on iteration 1
REL_16_STABLE - 10 iterations with no failure

But on FreeBSD 12.4-RELEASE r372781:
REL_12_STABLE - 20 iterations with no failure
REL_15_STABLE - 20 iterations with no failure

BTW, I also retested 7389aad63 on FreeBSD 14 and got no failure for 100 iterations.

Best regards,
Alexander