Re: REL_15_STABLE: pgbench tests randomly failing on CI, Windows only

Поиск
Список
Период
Сортировка
От Thomas Munro
Тема Re: REL_15_STABLE: pgbench tests randomly failing on CI, Windows only
Дата
Msg-id CA+hUKGKyc_K0bF-j1ds0yqxhiO9O5=qaJY5N3Km2BLMci1omfg@mail.gmail.com
обсуждение исходный текст
Ответ на Re: REL_15_STABLE: pgbench tests randomly failing on CI, Windows only  (Noah Misch <noah@leadboat.com>)
Ответы Re: REL_15_STABLE: pgbench tests randomly failing on CI, Windows only  (Noah Misch <noah@leadboat.com>)
Список pgsql-hackers
On Mon, Oct 9, 2023 at 3:25 PM Noah Misch <noah@leadboat.com> wrote:
> The "fd >= FD_SETSIZE" check is irrelevant on Windows.  See comments in the
> attached patch; in brief, Windows assigns FDs and uses FD_SETSIZE differently.
> The first associated failure was commit dea12a1 (2023-08-03); as a doc commit,
> it's an innocent victim.  Bisect blamed 8488bab "ci: Use windows VMs instead
> of windows containers" (2023-02), long before the failures began.  I'll guess
> some 2023-08 Windows update or reconfiguration altered file descriptor
> assignment, hence the onset of failures.  In my tests of v16, the highest file
> descriptor was 948.  I could make v16 fail by changing --client=5 to
> --client=90 in the test.  With the attached patch and --client=90, v16 peaked
> at file descriptor 2040.

Ohhh.  Thanks for figuring that out.  I'd never noticed that quirk.  I
didn't/can't test it but the patch looks reasonable after reading the
referenced docs.  Maybe instead of just "out of range" I'd say "out of
range for select()" since otherwise it might seem a little baffling --
what range?

Random water cooler speculation about future ideas:  I wonder
whether/when we can eventually ditch select() and use WSAPoll() for
this on Windows, which is supposed to be like poll().  There's a
comment explaining that we prefer select() because it has a higher
resolution sleep than poll() (us vs ms), so we wouldn't want to use
poll() on Unixen, but we know that Windows doesn't even use high
resolution timers for any user space APIs anyway so that's just not a
concern on that platform.  The usual reason nobody ever uses WSAPoll()
is because the Curl guys told[1] everyone that it's terrible in a way
that would quite specifically break our usage.  But I wonder, because
the documentation now says "As of Windows 10 version 2004, when a TCP
socket fails to connect, (POLLHUP | POLLERR | POLLWRNORM) is
indicated", it *sounds* like it might have been fixed ~3 years ago?
One idea would be to hide it inside WaitEventSet, and let WaitEventSet
be used in front end code (we couldn't use the
WaitForMultipleObjects() version because it's hard-limited to 64
events, but in front end code we don't need latches and other stuff,
so we could have a sockets-only version with WSAPoll()).  The idea of
using WaitEventSet for pgbench on Unix was already mentioned a couple
of times by others for general scalability reasons -- that way we
could finish up using a better kernel interface on all supported
platforms.  We'd probably want to add high resolution time-out support
(I already posted a patch for that somewhere), or we'd be back to 1ms
timeouts.

[1] https://daniel.haxx.se/blog/2012/10/10/wsapoll-is-broken/



В списке pgsql-hackers по дате отправления:

Предыдущее
От: "David G. Johnston"
Дата:
Сообщение: Re: Fix output of zero privileges in psql
Следующее
От: Noah Misch
Дата:
Сообщение: Re: REL_15_STABLE: pgbench tests randomly failing on CI, Windows only