Обсуждение: REL_15_STABLE: pgbench tests randomly failing on CI, Windows only

Поиск
Список
Период
Сортировка

REL_15_STABLE: pgbench tests randomly failing on CI, Windows only

От
Thomas Munro
Дата:
Hi,

Somehow these tests have recently become unstable and have failed a few times:

https://github.com/postgres/postgres/commits/REL_15_STABLE

The failures are like:

[22:32:26.722] # Failed test 'pgbench simple update stdout
/(?^:builtin: simple update)/'
[22:32:26.722] # at t/001_pgbench_with_server.pl line 119.
[22:32:26.722] # 'pgbench (15.4)
[22:32:26.722] # '
[22:32:26.722] # doesn't match '(?^:builtin: simple update)'



Re: REL_15_STABLE: pgbench tests randomly failing on CI, Windows only

От
Noah Misch
Дата:
On Mon, Sep 04, 2023 at 03:18:40PM +1200, Thomas Munro wrote:
> Somehow these tests have recently become unstable and have failed a few times:
> 
> https://github.com/postgres/postgres/commits/REL_15_STABLE
> 
> The failures are like:
> 
> [22:32:26.722] # Failed test 'pgbench simple update stdout
> /(?^:builtin: simple update)/'
> [22:32:26.722] # at t/001_pgbench_with_server.pl line 119.
> [22:32:26.722] # 'pgbench (15.4)
> [22:32:26.722] # '
> [22:32:26.722] # doesn't match '(?^:builtin: simple update)'

Fun.  That's a test of "pgbench -C".  The test harness isn't reporting
pgbench's stderr, so I hacked things to get that and the actual file
descriptor values being assigned.  The failure case gets "pgbench: error: too
many client connections for select()" in stderr, from this pgbench.c function:

static void
add_socket_to_set(socket_set *sa, int fd, int idx)
{
    if (fd < 0 || fd >= FD_SETSIZE)
    {
        /*
         * Doing a hard exit here is a bit grotty, but it doesn't seem worth
         * complicating the API to make it less grotty.
         */
        pg_fatal("too many client connections for select()");
    }
    FD_SET(fd, &sa->fds);
    if (fd > sa->maxfd)
        sa->maxfd = fd;
}

The "fd >= FD_SETSIZE" check is irrelevant on Windows.  See comments in the
attached patch; in brief, Windows assigns FDs and uses FD_SETSIZE differently.
The first associated failure was commit dea12a1 (2023-08-03); as a doc commit,
it's an innocent victim.  Bisect blamed 8488bab "ci: Use windows VMs instead
of windows containers" (2023-02), long before the failures began.  I'll guess
some 2023-08 Windows update or reconfiguration altered file descriptor
assignment, hence the onset of failures.  In my tests of v16, the highest file
descriptor was 948.  I could make v16 fail by changing --client=5 to
--client=90 in the test.  With the attached patch and --client=90, v16 peaked
at file descriptor 2040.

Thanks,
nm

P.S. Later, we should change test code so the pgbench stderr can't grow an
extra line without that line appearing in test logs.

Вложения

Re: REL_15_STABLE: pgbench tests randomly failing on CI, Windows only

От
Thomas Munro
Дата:
On Mon, Oct 9, 2023 at 3:25 PM Noah Misch <noah@leadboat.com> wrote:
> The "fd >= FD_SETSIZE" check is irrelevant on Windows.  See comments in the
> attached patch; in brief, Windows assigns FDs and uses FD_SETSIZE differently.
> The first associated failure was commit dea12a1 (2023-08-03); as a doc commit,
> it's an innocent victim.  Bisect blamed 8488bab "ci: Use windows VMs instead
> of windows containers" (2023-02), long before the failures began.  I'll guess
> some 2023-08 Windows update or reconfiguration altered file descriptor
> assignment, hence the onset of failures.  In my tests of v16, the highest file
> descriptor was 948.  I could make v16 fail by changing --client=5 to
> --client=90 in the test.  With the attached patch and --client=90, v16 peaked
> at file descriptor 2040.

Ohhh.  Thanks for figuring that out.  I'd never noticed that quirk.  I
didn't/can't test it but the patch looks reasonable after reading the
referenced docs.  Maybe instead of just "out of range" I'd say "out of
range for select()" since otherwise it might seem a little baffling --
what range?

Random water cooler speculation about future ideas:  I wonder
whether/when we can eventually ditch select() and use WSAPoll() for
this on Windows, which is supposed to be like poll().  There's a
comment explaining that we prefer select() because it has a higher
resolution sleep than poll() (us vs ms), so we wouldn't want to use
poll() on Unixen, but we know that Windows doesn't even use high
resolution timers for any user space APIs anyway so that's just not a
concern on that platform.  The usual reason nobody ever uses WSAPoll()
is because the Curl guys told[1] everyone that it's terrible in a way
that would quite specifically break our usage.  But I wonder, because
the documentation now says "As of Windows 10 version 2004, when a TCP
socket fails to connect, (POLLHUP | POLLERR | POLLWRNORM) is
indicated", it *sounds* like it might have been fixed ~3 years ago?
One idea would be to hide it inside WaitEventSet, and let WaitEventSet
be used in front end code (we couldn't use the
WaitForMultipleObjects() version because it's hard-limited to 64
events, but in front end code we don't need latches and other stuff,
so we could have a sockets-only version with WSAPoll()).  The idea of
using WaitEventSet for pgbench on Unix was already mentioned a couple
of times by others for general scalability reasons -- that way we
could finish up using a better kernel interface on all supported
platforms.  We'd probably want to add high resolution time-out support
(I already posted a patch for that somewhere), or we'd be back to 1ms
timeouts.

[1] https://daniel.haxx.se/blog/2012/10/10/wsapoll-is-broken/



Re: REL_15_STABLE: pgbench tests randomly failing on CI, Windows only

От
Noah Misch
Дата:
On Mon, Oct 09, 2023 at 04:22:52PM +1300, Thomas Munro wrote:
> On Mon, Oct 9, 2023 at 3:25 PM Noah Misch <noah@leadboat.com> wrote:
> > The "fd >= FD_SETSIZE" check is irrelevant on Windows.  See comments in the
> > attached patch; in brief, Windows assigns FDs and uses FD_SETSIZE differently.
> > The first associated failure was commit dea12a1 (2023-08-03); as a doc commit,
> > it's an innocent victim.  Bisect blamed 8488bab "ci: Use windows VMs instead
> > of windows containers" (2023-02), long before the failures began.  I'll guess
> > some 2023-08 Windows update or reconfiguration altered file descriptor
> > assignment, hence the onset of failures.  In my tests of v16, the highest file
> > descriptor was 948.  I could make v16 fail by changing --client=5 to
> > --client=90 in the test.  With the attached patch and --client=90, v16 peaked
> > at file descriptor 2040.
> 
> Ohhh.  Thanks for figuring that out.  I'd never noticed that quirk.  I
> didn't/can't test it but the patch looks reasonable after reading the
> referenced docs.

For what it's worth, I did all that testing through CI, using hacks like
https://cirrus-ci.com/task/5352974499708928 to get the relevant information.
I didn't bother trying non-CI, since buildfarm animals aren't failing.

> Maybe instead of just "out of range" I'd say "out of
> range for select()" since otherwise it might seem a little baffling --
> what range?

I was trying to follow this from the style guide:

  Avoid mentioning called function names, either; instead say what the code was trying to do:
  BAD:    open() failed: %m
  BETTER: could not open file %s: %m

I didn't think of any phrasing that clearly explained things without the
reader consulting the code.  I considered these:

  "socket file descriptor out of range: %d" [what range?]
  "socket file descriptor out of range for select(): %d" [style guide violation]
  "socket file descriptor out of range for checking whether ready for reading: %d" [?]
  "socket file descriptor out of range for synchronous I/O multiplexing: %d" [term from POSIX]

> Random water cooler speculation about future ideas:  I wonder
> whether/when we can eventually ditch select() and use WSAPoll() for
> this on Windows, which is supposed to be like poll().  There's a
> comment explaining that we prefer select() because it has a higher
> resolution sleep than poll() (us vs ms), so we wouldn't want to use
> poll() on Unixen, but we know that Windows doesn't even use high
> resolution timers for any user space APIs anyway so that's just not a
> concern on that platform.  The usual reason nobody ever uses WSAPoll()
> is because the Curl guys told[1] everyone that it's terrible in a way
> that would quite specifically break our usage.  But I wonder, because
> the documentation now says "As of Windows 10 version 2004, when a TCP
> socket fails to connect, (POLLHUP | POLLERR | POLLWRNORM) is
> indicated", it *sounds* like it might have been fixed ~3 years ago?
> One idea would be to hide it inside WaitEventSet, and let WaitEventSet
> be used in front end code (we couldn't use the
> WaitForMultipleObjects() version because it's hard-limited to 64
> events, but in front end code we don't need latches and other stuff,
> so we could have a sockets-only version with WSAPoll()).  The idea of
> using WaitEventSet for pgbench on Unix was already mentioned a couple
> of times by others for general scalability reasons -- that way we
> could finish up using a better kernel interface on all supported
> platforms.  We'd probably want to add high resolution time-out support
> (I already posted a patch for that somewhere), or we'd be back to 1ms
> timeouts.
> 
> [1] https://daniel.haxx.se/blog/2012/10/10/wsapoll-is-broken/

Interesting.  I have no current knowledge to add there.



Re: REL_15_STABLE: pgbench tests randomly failing on CI, Windows only

От
"David G. Johnston"
Дата:
On Sun, Oct 8, 2023 at 9:10 PM Noah Misch <noah@leadboat.com> wrote:

I didn't think of any phrasing that clearly explained things without the
reader consulting the code.  I considered these:

  "socket file descriptor out of range: %d" [what range?]


Quick drive-by...but it seems that < 0 is a distinctly different problem than exceeding FD_SETSIZE.  I'm unsure what would cause the former but the error for the later seems like:

error: "Requested socket file descriptor %d exceeds connection limit of %d", fd, FD_SETSIZE-1
hint: Reduce the requested number of concurrent connections

In short, the concept of range doesn't seem applicable here.  There is an error state at the max, and some invalid system state condition where the position within a set is somehow negative.  These should be separated - with the < 0 check happening first.

And apparently this condition isn't applicable when dealing with jobs in connect_slot?  Also, I see that for connections we immediately issue FD_SET so this check on the boundary of the file descriptor makes sense.  But the remaining code in connect_slot doesn't involve FD_SET so the test for the file descriptor falling within its maximum seems to be coming out of nowhere.  Likely this is all good, and the lack of symmetry is just due to the natural progressive development of the code, but it stands out to the uninitiated looking at this patch.

David J.

Re: REL_15_STABLE: pgbench tests randomly failing on CI, Windows only

От
Noah Misch
Дата:
On Sun, Oct 08, 2023 at 10:00:03PM -0700, David G. Johnston wrote:
> On Sun, Oct 8, 2023 at 9:10 PM Noah Misch <noah@leadboat.com> wrote:
> > I didn't think of any phrasing that clearly explained things without the
> > reader consulting the code.  I considered these:

I'm leaning toward:

  "socket file descriptor out of range for select(): %d" [style guide violation]

A true style guide purist might bury it in a detail message:

  ERROR: socket file descriptor out of range: %d
  DETAIL: select() accepts file descriptors from 0 to 1023, inclusive, in this build.
  HINT: Try fewer concurrent database clients.

> >   "socket file descriptor out of range: %d" [what range?]
> >
> Quick drive-by...but it seems that < 0 is a distinctly different problem
> than exceeding FD_SETSIZE.  I'm unsure what would cause the former but the
> error for the later seems like:
> 
> error: "Requested socket file descriptor %d exceeds connection limit of
> %d", fd, FD_SETSIZE-1
> hint: Reduce the requested number of concurrent connections
> 
> In short, the concept of range doesn't seem applicable here.  There is an
> error state at the max, and some invalid system state condition where the
> position within a set is somehow negative.  These should be separated -
> with the < 0 check happening first.

I view it as: the range of select()-able file descriptors is [0,FD_SETSIZE).
Negative is out of range.

> And apparently this condition isn't applicable when dealing with jobs in
> connect_slot?

For both pgbench.c and parallel_slot.c, there are sufficient negative-FD
checks elsewhere in the file.  Ideally, either both files would have redundant
checks, or neither file would.  I didn't feel the need to mess with that part
of the status quo.

> Also, I see that for connections we immediately issue FD_SET
> so this check on the boundary of the file descriptor makes sense.  But the
> remaining code in connect_slot doesn't involve FD_SET so the test for the
> file descriptor falling within its maximum seems to be coming out of
> nowhere.  Likely this is all good, and the lack of symmetry is just due to
> the natural progressive development of the code, but it stands out to the
> uninitiated looking at this patch.

True.  The approach in parallel_slot.c is to check the FD number each time it
opens a socket, after which its loop with FD_SET() doesn't recheck.  That's a
bit more efficient than the pgbench.c way, because each file may call FD_SET()
many times per socket.  Again, I didn't mess with that part of the status quo.