Обсуждение: REL_15_STABLE: pgbench tests randomly failing on CI, Windows only
Hi, Somehow these tests have recently become unstable and have failed a few times: https://github.com/postgres/postgres/commits/REL_15_STABLE The failures are like: [22:32:26.722] # Failed test 'pgbench simple update stdout /(?^:builtin: simple update)/' [22:32:26.722] # at t/001_pgbench_with_server.pl line 119. [22:32:26.722] # 'pgbench (15.4) [22:32:26.722] # ' [22:32:26.722] # doesn't match '(?^:builtin: simple update)'
On Mon, Sep 04, 2023 at 03:18:40PM +1200, Thomas Munro wrote: > Somehow these tests have recently become unstable and have failed a few times: > > https://github.com/postgres/postgres/commits/REL_15_STABLE > > The failures are like: > > [22:32:26.722] # Failed test 'pgbench simple update stdout > /(?^:builtin: simple update)/' > [22:32:26.722] # at t/001_pgbench_with_server.pl line 119. > [22:32:26.722] # 'pgbench (15.4) > [22:32:26.722] # ' > [22:32:26.722] # doesn't match '(?^:builtin: simple update)' Fun. That's a test of "pgbench -C". The test harness isn't reporting pgbench's stderr, so I hacked things to get that and the actual file descriptor values being assigned. The failure case gets "pgbench: error: too many client connections for select()" in stderr, from this pgbench.c function: static void add_socket_to_set(socket_set *sa, int fd, int idx) { if (fd < 0 || fd >= FD_SETSIZE) { /* * Doing a hard exit here is a bit grotty, but it doesn't seem worth * complicating the API to make it less grotty. */ pg_fatal("too many client connections for select()"); } FD_SET(fd, &sa->fds); if (fd > sa->maxfd) sa->maxfd = fd; } The "fd >= FD_SETSIZE" check is irrelevant on Windows. See comments in the attached patch; in brief, Windows assigns FDs and uses FD_SETSIZE differently. The first associated failure was commit dea12a1 (2023-08-03); as a doc commit, it's an innocent victim. Bisect blamed 8488bab "ci: Use windows VMs instead of windows containers" (2023-02), long before the failures began. I'll guess some 2023-08 Windows update or reconfiguration altered file descriptor assignment, hence the onset of failures. In my tests of v16, the highest file descriptor was 948. I could make v16 fail by changing --client=5 to --client=90 in the test. With the attached patch and --client=90, v16 peaked at file descriptor 2040. Thanks, nm P.S. Later, we should change test code so the pgbench stderr can't grow an extra line without that line appearing in test logs.
Вложения
On Mon, Oct 9, 2023 at 3:25 PM Noah Misch <noah@leadboat.com> wrote: > The "fd >= FD_SETSIZE" check is irrelevant on Windows. See comments in the > attached patch; in brief, Windows assigns FDs and uses FD_SETSIZE differently. > The first associated failure was commit dea12a1 (2023-08-03); as a doc commit, > it's an innocent victim. Bisect blamed 8488bab "ci: Use windows VMs instead > of windows containers" (2023-02), long before the failures began. I'll guess > some 2023-08 Windows update or reconfiguration altered file descriptor > assignment, hence the onset of failures. In my tests of v16, the highest file > descriptor was 948. I could make v16 fail by changing --client=5 to > --client=90 in the test. With the attached patch and --client=90, v16 peaked > at file descriptor 2040. Ohhh. Thanks for figuring that out. I'd never noticed that quirk. I didn't/can't test it but the patch looks reasonable after reading the referenced docs. Maybe instead of just "out of range" I'd say "out of range for select()" since otherwise it might seem a little baffling -- what range? Random water cooler speculation about future ideas: I wonder whether/when we can eventually ditch select() and use WSAPoll() for this on Windows, which is supposed to be like poll(). There's a comment explaining that we prefer select() because it has a higher resolution sleep than poll() (us vs ms), so we wouldn't want to use poll() on Unixen, but we know that Windows doesn't even use high resolution timers for any user space APIs anyway so that's just not a concern on that platform. The usual reason nobody ever uses WSAPoll() is because the Curl guys told[1] everyone that it's terrible in a way that would quite specifically break our usage. But I wonder, because the documentation now says "As of Windows 10 version 2004, when a TCP socket fails to connect, (POLLHUP | POLLERR | POLLWRNORM) is indicated", it *sounds* like it might have been fixed ~3 years ago? One idea would be to hide it inside WaitEventSet, and let WaitEventSet be used in front end code (we couldn't use the WaitForMultipleObjects() version because it's hard-limited to 64 events, but in front end code we don't need latches and other stuff, so we could have a sockets-only version with WSAPoll()). The idea of using WaitEventSet for pgbench on Unix was already mentioned a couple of times by others for general scalability reasons -- that way we could finish up using a better kernel interface on all supported platforms. We'd probably want to add high resolution time-out support (I already posted a patch for that somewhere), or we'd be back to 1ms timeouts. [1] https://daniel.haxx.se/blog/2012/10/10/wsapoll-is-broken/
On Mon, Oct 09, 2023 at 04:22:52PM +1300, Thomas Munro wrote: > On Mon, Oct 9, 2023 at 3:25 PM Noah Misch <noah@leadboat.com> wrote: > > The "fd >= FD_SETSIZE" check is irrelevant on Windows. See comments in the > > attached patch; in brief, Windows assigns FDs and uses FD_SETSIZE differently. > > The first associated failure was commit dea12a1 (2023-08-03); as a doc commit, > > it's an innocent victim. Bisect blamed 8488bab "ci: Use windows VMs instead > > of windows containers" (2023-02), long before the failures began. I'll guess > > some 2023-08 Windows update or reconfiguration altered file descriptor > > assignment, hence the onset of failures. In my tests of v16, the highest file > > descriptor was 948. I could make v16 fail by changing --client=5 to > > --client=90 in the test. With the attached patch and --client=90, v16 peaked > > at file descriptor 2040. > > Ohhh. Thanks for figuring that out. I'd never noticed that quirk. I > didn't/can't test it but the patch looks reasonable after reading the > referenced docs. For what it's worth, I did all that testing through CI, using hacks like https://cirrus-ci.com/task/5352974499708928 to get the relevant information. I didn't bother trying non-CI, since buildfarm animals aren't failing. > Maybe instead of just "out of range" I'd say "out of > range for select()" since otherwise it might seem a little baffling -- > what range? I was trying to follow this from the style guide: Avoid mentioning called function names, either; instead say what the code was trying to do: BAD: open() failed: %m BETTER: could not open file %s: %m I didn't think of any phrasing that clearly explained things without the reader consulting the code. I considered these: "socket file descriptor out of range: %d" [what range?] "socket file descriptor out of range for select(): %d" [style guide violation] "socket file descriptor out of range for checking whether ready for reading: %d" [?] "socket file descriptor out of range for synchronous I/O multiplexing: %d" [term from POSIX] > Random water cooler speculation about future ideas: I wonder > whether/when we can eventually ditch select() and use WSAPoll() for > this on Windows, which is supposed to be like poll(). There's a > comment explaining that we prefer select() because it has a higher > resolution sleep than poll() (us vs ms), so we wouldn't want to use > poll() on Unixen, but we know that Windows doesn't even use high > resolution timers for any user space APIs anyway so that's just not a > concern on that platform. The usual reason nobody ever uses WSAPoll() > is because the Curl guys told[1] everyone that it's terrible in a way > that would quite specifically break our usage. But I wonder, because > the documentation now says "As of Windows 10 version 2004, when a TCP > socket fails to connect, (POLLHUP | POLLERR | POLLWRNORM) is > indicated", it *sounds* like it might have been fixed ~3 years ago? > One idea would be to hide it inside WaitEventSet, and let WaitEventSet > be used in front end code (we couldn't use the > WaitForMultipleObjects() version because it's hard-limited to 64 > events, but in front end code we don't need latches and other stuff, > so we could have a sockets-only version with WSAPoll()). The idea of > using WaitEventSet for pgbench on Unix was already mentioned a couple > of times by others for general scalability reasons -- that way we > could finish up using a better kernel interface on all supported > platforms. We'd probably want to add high resolution time-out support > (I already posted a patch for that somewhere), or we'd be back to 1ms > timeouts. > > [1] https://daniel.haxx.se/blog/2012/10/10/wsapoll-is-broken/ Interesting. I have no current knowledge to add there.
On Sun, Oct 8, 2023 at 9:10 PM Noah Misch <noah@leadboat.com> wrote:
I didn't think of any phrasing that clearly explained things without the
reader consulting the code. I considered these:
"socket file descriptor out of range: %d" [what range?]
Quick drive-by...but it seems that < 0 is a distinctly different problem than exceeding FD_SETSIZE. I'm unsure what would cause the former but the error for the later seems like:
error: "Requested socket file descriptor %d exceeds connection limit of %d", fd, FD_SETSIZE-1
hint: Reduce the requested number of concurrent connections
In short, the concept of range doesn't seem applicable here. There is an error state at the max, and some invalid system state condition where the position within a set is somehow negative. These should be separated - with the < 0 check happening first.
And apparently this condition isn't applicable when dealing with jobs in connect_slot? Also, I see that for connections we immediately issue FD_SET so this check on the boundary of the file descriptor makes sense. But the remaining code in connect_slot doesn't involve FD_SET so the test for the file descriptor falling within its maximum seems to be coming out of nowhere. Likely this is all good, and the lack of symmetry is just due to the natural progressive development of the code, but it stands out to the uninitiated looking at this patch.
David J.
On Sun, Oct 08, 2023 at 10:00:03PM -0700, David G. Johnston wrote: > On Sun, Oct 8, 2023 at 9:10 PM Noah Misch <noah@leadboat.com> wrote: > > I didn't think of any phrasing that clearly explained things without the > > reader consulting the code. I considered these: I'm leaning toward: "socket file descriptor out of range for select(): %d" [style guide violation] A true style guide purist might bury it in a detail message: ERROR: socket file descriptor out of range: %d DETAIL: select() accepts file descriptors from 0 to 1023, inclusive, in this build. HINT: Try fewer concurrent database clients. > > "socket file descriptor out of range: %d" [what range?] > > > Quick drive-by...but it seems that < 0 is a distinctly different problem > than exceeding FD_SETSIZE. I'm unsure what would cause the former but the > error for the later seems like: > > error: "Requested socket file descriptor %d exceeds connection limit of > %d", fd, FD_SETSIZE-1 > hint: Reduce the requested number of concurrent connections > > In short, the concept of range doesn't seem applicable here. There is an > error state at the max, and some invalid system state condition where the > position within a set is somehow negative. These should be separated - > with the < 0 check happening first. I view it as: the range of select()-able file descriptors is [0,FD_SETSIZE). Negative is out of range. > And apparently this condition isn't applicable when dealing with jobs in > connect_slot? For both pgbench.c and parallel_slot.c, there are sufficient negative-FD checks elsewhere in the file. Ideally, either both files would have redundant checks, or neither file would. I didn't feel the need to mess with that part of the status quo. > Also, I see that for connections we immediately issue FD_SET > so this check on the boundary of the file descriptor makes sense. But the > remaining code in connect_slot doesn't involve FD_SET so the test for the > file descriptor falling within its maximum seems to be coming out of > nowhere. Likely this is all good, and the lack of symmetry is just due to > the natural progressive development of the code, but it stands out to the > uninitiated looking at this patch. True. The approach in parallel_slot.c is to check the FD number each time it opens a socket, after which its loop with FD_SET() doesn't recheck. That's a bit more efficient than the pgbench.c way, because each file may call FD_SET() many times per socket. Again, I didn't mess with that part of the status quo.