Re: conchuela timeouts since 2021-10-09 system upgrade
From | Tom Lane |
---|---|
Subject | Re: conchuela timeouts since 2021-10-09 system upgrade |
Date | |
Msg-id | 83446.1635258579@sss.pgh.pa.us |
In response to | Re: conchuela timeouts since 2021-10-09 system upgrade (Noah Misch <noah@leadboat.com>) |
Responses | Re: conchuela timeouts since 2021-10-09 system upgrade |
List | pgsql-bugs |
Noah Misch <noah@leadboat.com> writes:
> On Tue, Oct 26, 2021 at 02:03:54AM -0400, Tom Lane wrote:
>> Or more
>> practically, use advisory locks in that script to enforce that only one
>> runs at once.

> The author did try that.

Hmm ... that ought to have done the trick, I'd think. However:

> Both sound doable, but I don't expect either to fix prairiedog's trouble.

Yeah :-(. I think this test is somehow stumbling over a pre-existing bug.

>> So what we have is that libpq thinks it's sent the next DROP INDEX,
>> but the backend hasn't seen it.

> Thanks for isolating that.

The plot thickens. When I went back to look at that machine this morning, I found this in the postmaster log:

2021-10-26 02:52:09.324 EDT [1013] 002_cic.pl LOG: statement: DROP INDEX CONCURRENTLY idx;
2021-10-26 02:52:09.352 EDT [1013] 002_cic.pl LOG: could not send data to client: Broken pipe
2021-10-26 02:52:09.352 EDT [1013] 002_cic.pl FATAL: connection to client lost

The timestamps correspond (more or less anyway) to when I killed off the stuck test run and went to bed. So the DROP command *was* sent, and it was eventually received by the backend, but it seems to have taken killing the pgbench process to do it.

I think this probably exonerates the pgbench/libpq side of things, and instead we have to wonder about a backend or kernel bug. A kernel bug could possibly explain the unexplainable connection to what's happening on some other file descriptor. I'd be prepared to believe that prairiedog's ancient macOS version has some weird bug preventing kevent() from noticing available data ... but (a) surely conchuela wouldn't share such a bug, and (b) we've been using kevent() for a couple years now, so how come we didn't see this before?

Still baffled. I'm currently experimenting to see if the bug reproduces when latch.c is made to use poll() instead of kevent(). But the failure rate was low enough that it'll be hours before I can say confidently that it doesn't (unless, of course, it does).

regards, tom lane
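
For readers unfamiliar with the readiness primitives discussed above, the following is a minimal sketch of the behavior the message expects from poll() or kevent(): once one end of a connection has sent data, the other end should be reported readable. It is only an illustration under stated assumptions. It uses Python's `select.poll` on a local socket pair as a stand-in for the client/backend connection, the helper names `readable_within` and `demo` are hypothetical, and it is not PostgreSQL's latch.c code or a reproduction of the reported failure.

```python
import select
import socket


def readable_within(sock, timeout_ms):
    """Return True if poll() reports sock as readable within timeout_ms."""
    poller = select.poll()
    poller.register(sock.fileno(), select.POLLIN)
    events = poller.poll(timeout_ms)
    return any(
        fd == sock.fileno() and bool(mask & select.POLLIN)
        for fd, mask in events
    )


def demo():
    # Hypothetical stand-in for the client <-> backend connection.
    client, backend = socket.socketpair()
    try:
        # Nothing has been sent yet, so the backend end should not be readable.
        before = readable_within(backend, 0)
        # The "client" sends a command; the kernel should now flag the
        # backend end as readable.
        client.sendall(b"DROP INDEX CONCURRENTLY idx;")
        after = readable_within(backend, 1000)
        return before, after
    finally:
        client.close()
        backend.close()


if __name__ == "__main__":
    print(demo())
```

The symptom described in the message corresponds to the second check never becoming true on the affected machines even though the data had been sent, which is why swapping the wait primitive is a reasonable way to narrow down where the fault lies.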