Possible explanation for Win32 stats regression test failures
From | Tom Lane |
---|---|
Subject | Possible explanation for Win32 stats regression test failures |
Date | |
Msg-id | 599.1153067067@sss.pgh.pa.us |
Responses | Re: Possible explanation for Win32 stats regression test failures |
List | pgsql-hackers |
The latest buildfarm report from trout,
http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=trout&dt=2006-07-16%2014:36:19
shows a failure mode that we've seen recently on snake, but not for a long time on any non-Windows machines: the stats test fails with symptoms suggesting that the stats counters aren't getting incremented.

Dave Page spotted the reason for this during the recent code sprint. The stats collector is dying with

    FATAL: could not read statistics message: A blocking operation was interrupted by a call to WSACancelBlockingCall.

If you look through the above-mentioned report's postmaster log, you'll see several occurrences of this, indicating that the stats collector is being restarted by the postmaster and then dying again.

After a bit of digging in our code, I realized that the above text is probably the system's translation of WSAEINTR, which we equate EINTR to, and thus that what's happening is just "recv() returned EINTR, even though the socket had already tested read-ready". I'm not sure whether that's considered normal behavior on Unixen but it is clearly possible with our Win32 implementation of recv() --- any pending signal will make it happen.

So it seems an appropriate fix for the stats collector is

      len = recv(pgStatSock, (char *) &msg,
                 sizeof(PgStat_Msg), 0);
      if (len < 0)
    + {
    +     if (errno == EINTR)
    +         continue;
          ereport(ERROR,
                  (errcode_for_socket_access(),
                   errmsg("could not read statistics message: %m")));
    + }

and we had better look around to make sure all other calls of send() and recv() treat EINTR as expected too.

But ... AFAICS the only signal that could plausibly be arriving at the stats collector is SIGALRM from its own use of setitimer() to schedule stats file writes. So it seems that this failure occurs when the alarm fires between the select() and recv() calls; which is possible but it seems a mighty narrow window. So I'm not 100% convinced that this is the correct explanation of the problem --- we've seen snake fail this way repeatedly, and here we have trout doing it three times within one regression run. Can anyone think of a reason why the timing might fall just so with a higher probability than one would expect? Perhaps pgwin32_select() has got a problem that makes it not dispatch signals as it seems to be trying to do?

regards, tom lane
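
As an illustration of the "treat EINTR as expected" point made for other send() and recv() call sites, here is a minimal sketch of a retry wrapper. It assumes plain POSIX sockets rather than the Win32 emulation layer discussed above, and the names recv_retry_eintr and send_retry_eintr are hypothetical; they are not existing PostgreSQL functions. The stats collector patch itself uses `continue` inside its own loop instead of a wrapper like this.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Hypothetical helper (not part of PostgreSQL): call recv() and retry
 * for as long as it fails with EINTR, i.e. it was interrupted by a
 * signal before any data was transferred. Any other error is returned
 * to the caller with errno left intact.
 */
ssize_t
recv_retry_eintr(int sock, void *buf, size_t len, int flags)
{
    ssize_t n;

    assert(buf != NULL);
    do
    {
        n = recv(sock, buf, len, flags);
    } while (n < 0 && errno == EINTR);

    return n;
}

/*
 * Hypothetical counterpart for send(), with the same retry rule.
 */
ssize_t
send_retry_eintr(int sock, const void *buf, size_t len, int flags)
{
    ssize_t n;

    assert(buf != NULL);
    do
    {
        n = send(sock, buf, len, flags);
    } while (n < 0 && errno == EINTR);

    return n;
}
```

A caller would use these exactly like recv() and send(), and report an error only when the return value is still negative after the retries. Whether a blind retry is appropriate depends on the call site: code that relies on a signal to break out of a blocking call would need to check its own flags before looping.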