Non-systematic handling of EINTR/EAGAIN/EWOULDBLOCK
From | Alexander Lakhin |
---|---|
Subject | Non-systematic handling of EINTR/EAGAIN/EWOULDBLOCK |
Date | |
Msg-id | f9bebfe6-cee4-ed87-d4e6-29b5ca4be08d@gmail.com |
List | pgsql-hackers |
Hello hackers,

Looking at a recent failure on the buildfarm:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=morepork&dt=2024-04-30%2020%3A48%3A34

```
# poll_query_until timed out executing this query:
# SELECT archived_count FROM pg_stat_archiver
# expecting this output:
# 1
# last actual query output:
# 0
# with stderr:
# Looks like your test exited with 29 just after 4.
[23:01:41] t/020_archive_status.pl .............. Dubious, test returned 29 (wstat 7424, 0x1d00)
Failed 12/16 subtests
```

with the following error in the log:

```
2024-04-30 22:57:27.931 CEST [83115:1] LOG: archive command failed with exit code 1
2024-04-30 22:57:27.931 CEST [83115:2] DETAIL: The failed archive command was: cp "pg_wal/000000010000000000000001_does_not_exist" "000000010000000000000001_does_not_exist"
...
2024-04-30 22:57:28.070 CEST [47962:2] [unknown] LOG: connection authorized: user=pgbf database=postgres application_name=020_archive_status.pl
2024-04-30 22:57:28.072 CEST [47962:3] 020_archive_status.pl LOG: statement: SELECT archived_count FROM pg_stat_archiver
2024-04-30 22:57:28.073 CEST [83115:3] LOG: could not send to statistics collector: Resource temporarily unavailable
```

and the corresponding code (on REL_13_STABLE):

```c
static void
pgstat_send(void *msg, int len)
{
    int         rc;

    if (pgStatSock == PGINVALID_SOCKET)
        return;

    ((PgStat_MsgHdr *) msg)->m_size = len;

    /* We'll retry after EINTR, but ignore all other failures */
    do
    {
        rc = send(pgStatSock, msg, len, 0);
    } while (rc < 0 && errno == EINTR);

#ifdef USE_ASSERT_CHECKING
    /* In debug builds, log send failures ... */
    if (rc < 0)
        elog(LOG, "could not send to statistics collector: %m");
#endif
}
```

I wonder whether this retry should be performed after EAGAIN (Resource temporarily unavailable) and EWOULDBLOCK as well.

With a simple send() wrapper (PFA) activated with LD_PRELOAD, I could easily reproduce this failure when running `make -s check -C src/test/recovery/ PROVE_TESTS="t/020*"` on REL_13_STABLE:

```
t/020_archive_status.pl .. 1/16 # poll_query_until timed out executing this query:
# SELECT archived_count FROM pg_stat_archiver
# expecting this output:
# 1
# last actual query output:
# 0
# with stderr:
# Looks like your test exited with 29 just after 4.
t/020_archive_status.pl .. Dubious, test returned 29 (wstat 7424, 0x1d00)
Failed 12/16 subtests
```

I also reproduced another failure (which, unfortunately, lacks useful diagnostics):
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=morepork&dt=2022-11-10%2015%3A30%3A16

```
...
t/020_archive_status.pl .. 8/16 # poll_query_until timed out executing this query:
# SELECT last_archived_wal FROM pg_stat_archiver
# expecting this output:
# 000000010000000000000002
# last actual query output:
# 000000010000000000000001
# with stderr:
# Looks like your test exited with 29 just after 13.
t/020_archive_status.pl .. Dubious, test returned 29 (wstat 7424, 0x1d00)
Failed 3/16 subtests
...
```

The "n == 64" condition in the cranky send() is needed to target exactly these failures. Without this restriction, the test (and also `make check`) just hangs because of:

```c
        if (errno == EINTR)
            continue;       /* Ok if we were interrupted */

        /*
         * Ok if no data writable without blocking, and the socket is in
         * non-blocking mode.
         */
        if (errno == EAGAIN ||
            errno == EWOULDBLOCK)
        {
            return 0;
        }
```

in internal_flush_buffer().

On the other hand, even with:

```c
int
send(int s, const void *buf, size_t n, int flags)
{
    if (rand() % 10000 == 0)
    {
        errno = EINTR;
        return -1;
    }
    return real_send(s, buf, n, flags);
}
```

`make check` fails with many miscellaneous errors...

Best regards,
Alexander
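On the question of also retrying pgstat_send() after EAGAIN/EWOULDBLOCK, one possible shape is a bounded retry. This is a minimal sketch, not PostgreSQL code: `send_with_retry()` and its `max_waits`/`wait_ms` parameters are hypothetical. Since a statistics message is fire-and-forget, the sketch keeps the existing immediate retry after EINTR, but after EAGAIN/EWOULDBLOCK it waits at most `max_waits` times (via `poll()` with `POLLOUT`) for the socket to become writable instead of spinning or blocking indefinitely.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical helper: send one message, retrying immediately after EINTR
 * (as pgstat_send() already does) and, additionally, after EAGAIN or
 * EWOULDBLOCK, waiting up to wait_ms milliseconds for buffer space at most
 * max_waits times.  Returns the number of bytes sent, or -1 with errno set.
 */
ssize_t
send_with_retry(int sock, const void *msg, size_t len, int flags,
                int max_waits, int wait_ms)
{
    int         waits = 0;

    for (;;)
    {
        ssize_t     rc = send(sock, msg, len, flags);

        if (rc >= 0)
            return rc;
        if (errno == EINTR)
            continue;           /* interrupted: retry right away */
        if ((errno == EAGAIN || errno == EWOULDBLOCK) && waits < max_waits)
        {
            struct pollfd pfd = {.fd = sock, .events = POLLOUT};

            waits++;
            /* Bounded wait for the kernel to report free buffer space. */
            if (poll(&pfd, 1, wait_ms) < 0 && errno != EINTR)
                return -1;
            continue;
        }
        return -1;              /* other error, or retries exhausted */
    }
}
```

The bound is the design choice that matters here: an unbounded wait would let a full statistics socket stall the sending backend, while giving up after a few short waits still reports the failure to the caller, as the current code does for all non-EINTR errors. On a datagram socket such as the statistics socket, each message is either sent whole or not at all.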
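The attached send() wrapper is not reproduced in this archive. For readers who want to try the same kind of fault injection, here is a hypothetical sketch of an LD_PRELOAD shim. It also makes the last snippet's `real_send` concrete by resolving libc's send() with `dlsym(RTLD_NEXT, "send")`. Only the 64-byte target comes from the message; `TARGET_LEN`, `FAIL_EVERY` and the deterministic counter (used instead of `rand()`) are assumptions for illustration. The shim declares send() with the POSIX return type `ssize_t`, so it does not conflict with the prototype in `<sys/socket.h>`.

```c
#define _GNU_SOURCE             /* for RTLD_NEXT; must precede all includes */
#include <dlfcn.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical knobs: fail every FAIL_EVERY-th send() of TARGET_LEN bytes. */
#define TARGET_LEN 64
#define FAIL_EVERY 3

typedef ssize_t (*send_fn) (int, const void *, size_t, int);

/*
 * Interposed send(): when preloaded (or linked into the program), it
 * shadows libc's send() and forwards all other calls to the next
 * definition in the symbol search order, i.e. the real one.
 */
ssize_t
send(int s, const void *buf, size_t n, int flags)
{
    static send_fn real_send = NULL;
    static unsigned long target_calls = 0;

    if (real_send == NULL)
        real_send = (send_fn) dlsym(RTLD_NEXT, "send");

    /* Pretend the socket buffer is full for messages of the target size. */
    if (n == TARGET_LEN && ++target_calls % FAIL_EVERY == 0)
    {
        errno = EAGAIN;
        return -1;
    }
    return real_send(s, buf, n, flags);
}
```

Built as a shared object, e.g. `cc -shared -fPIC -o cranky_send.so cranky_send.c -ldl` (with glibc 2.34 and later the `-ldl` is no longer required), it can be activated by setting `LD_PRELOAD=/path/to/cranky_send.so` in the environment of the processes under test; the file name is illustrative.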
Attachments