Re: Why is src/test/modules/committs/t/002_standby.pl flaky?
От | Thomas Munro |
---|---|
Тема | Re: Why is src/test/modules/committs/t/002_standby.pl flaky? |
Дата | |
Msg-id | CA+hUKG+o_QzN1LrEPGJLa6JiZU4DpLRybkwZjdCErs2qsB6axg@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: Why is src/test/modules/committs/t/002_standby.pl flaky? (Alexander Lakhin <exclusion@gmail.com>) |
Ответы |
Re: Why is src/test/modules/committs/t/002_standby.pl flaky?
Re: Why is src/test/modules/committs/t/002_standby.pl flaky? Re: Why is src/test/modules/committs/t/002_standby.pl flaky? |
Список | pgsql-hackers |
On Tue, Jan 11, 2022 at 6:00 AM Alexander Lakhin <exclusion@gmail.com> wrote: > 10.01.2022 12:40, Thomas Munro wrote: > > This is super quick-and-dirty code (and doesn't handle some errors or > > socket changes correctly), but does it detect the closed socket? > Yes, it fixes the behaviour and makes the 002_standby test pass (100 of > 100 iterations). Thanks for testing. That result does seem to confirm the hypothesis that FD_CLOSE is reported only once for the socket on graceful shutdown (that is, it's edge-triggered and incidentally you won't get FD_READ), so you need to keep track of it carefully. Incidentally, another observation is that your WSAPoll() test appears to be returning POLLHUP where at least Linux, FreeBSD and Solaris would not: a socket that is only half shut down (the primary shut down its end gracefully, but walreceiver did not), so I suspect Windows' POLLHUP might have POLLRDHUP semantics. > I'm yet to find out whether the other > WaitLatchOrSocket' users (e. g. postgres_fdw) can suffer from the > disconnected socket state, but this approach definitely works for > walreceiver. I see where you're going: there might be safe call sequences and unsafe call sequences, and maybe walreceiver is asking for trouble by double-polling. I'm not sure about that; I got the impression recently that it's possible to get FD_CLOSE while you still have buffered data to read, so then the next recv() will return > 0 and then we don't have any state left anywhere to remember that we saw FD_CLOSE, even if you're careful to poll and read in the ideal sequence. I could be wrong, and it would be nice if there is an easy fix along those lines... The documentation around FD_CLOSE is unclear. I do plan to make a higher quality patch like the one I showed (material from earlier unfinished work[1] that needs a bit more infrastructure), but to me that's new feature/efficiency work, not something we'd want to back-patch. Hmm, one thing I'm still unclear on: did this problem really start with 6051857fc/ed52c3707? My initial email in this thread lists similar failures going back further, doesn't it? (And what's tern doing mixed up in this mess?) [1] https://www.postgresql.org/message-id/flat/CA%2BhUKGJPaygh-6WHEd0FnH89GrkTpVyN_ew9ckv3%2BnwjmLcSeg%40mail.gmail.com#aa33ec3e7ad85499f35dd1434a139c3f
В списке pgsql-hackers по дате отправления: