wal_sender_timeout should ignore server-side latency
From | Noah Misch |
---|---|
Subject | wal_sender_timeout should ignore server-side latency |
Date | |
Msg-id | 20180826034600.GA1105084@rfd.leadboat.com |
List | pgsql-hackers |
WalSndLoop() does this, simplifying considerably:

  for (;;)
  {
      /* does: last_reply_timestamp = GetCurrentTimestamp() */
      ProcessRepliesIfAny();
      send_data();   /* e.g. XLogSendPhysical(), which calls XLogRead() */
      WalSndCheckTimeOut(GetCurrentTimestamp());
  }

A consequence is that any time spent in the send_data() callback counts
against the timeout.  In particular, if a single send_data() takes longer than
wal_sender_timeout, the client is powerless to prevent a timeout.  This
disagrees with the wal_sender_timeout documentation ("Terminate replication
connections that are inactive longer than the specified number of
milliseconds. This is useful for the sending server to detect a standby crash
or network outage").  I find it undesirable.

The fix, attached, is to interpret the timeout relative to a timestamp taken
before ProcessRepliesIfAny() polls the socket.  If that timestamp is
wal_sender_timeout later than the last reply, we can terminate with
confidence.  This adds one gettimeofday() per ProcessRepliesIfAny() finding no
replies, which feels cheap enough.

We've seen a number of wal_sender_timeout buildfarm failures on systems with
I/O performance trouble:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2018-08-16%2020:55:57
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2018-06-30%2020:38:10
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2018-04-12%2018:12:36
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2018-01-13%2005:01:17
https://postgr.es/m/flat/20170604211229.GA1528911@rfd.leadboat.com

Fixing $SUBJECT won't necessarily cure that, because an I/O stall on the
client side can still cause a failure.  We'd need something like threads or
async I/O to avoid that.  I mention a less-important corner case in the
WalSndCheckTimeOut() header comment.
You can simulate slow XLogSendPhysical() to explore these problems on any
system:

--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -65,2 +65,3 @@
 #include "libpq/pqformat.h"
+#include "libpq/pqsignal.h"
 #include "miscadmin.h"
@@ -2731,2 +2732,5 @@ XLogSendPhysical(void)
 	enlargeStringInfo(&output_message, nbytes);
+	PG_SETMASK(&BlockSig);
+	pg_usleep(65 * 1000 * 1000);
+	PG_SETMASK(&UnBlockSig);
 	XLogRead(&output_message.data[output_message.len], startptr, nbytes);
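
For readers who want to see the two timeout rules side by side without building
a server, here is a rough standalone sketch.  It is not the attached patch and
uses no PostgreSQL code: the simulated millisecond clock, the names sim_ms,
before_poll and simulate(), and the assumption that a responsive client always
has a reply waiting when the socket is polled are all made up for
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical simulated clock in milliseconds; not PostgreSQL's TimestampTz. */
typedef int64_t sim_ms;

/*
 * Model a few iterations of a walsender-like loop.  Returns true if the
 * sender would terminate the connection for exceeding the timeout.
 *
 * use_new_rule:   false = compare the last reply against "now" sampled after
 *                 send_data(); true = compare it against the timestamp taken
 *                 before the socket was polled.
 * client_replies: whether each poll finds a reply (assumed, for illustration).
 * send_cost:      simulated server-side time spent in send_data().
 */
static bool
simulate(bool use_new_rule, bool client_replies,
         sim_ms send_cost, sim_ms timeout, int iterations)
{
    sim_ms now = 0;
    sim_ms last_reply = 0;

    assert(timeout > 0 && send_cost >= 0);

    for (int i = 0; i < iterations; i++)
    {
        /* Timestamp taken before polling the socket. */
        sim_ms before_poll = now;

        /* A responsive client's reply is credited to that timestamp. */
        if (client_replies)
            last_reply = before_poll;

        /* Server-side latency, e.g. a slow read of WAL from disk. */
        now += send_cost;

        /* Old rule charges send_cost to the client; new rule does not. */
        sim_ms check_time = use_new_rule ? before_poll : now;

        if (check_time >= last_reply + timeout)
            return true;
    }
    return false;
}
```

With a 60000 ms timeout and a 65000 ms send_cost (matching the pg_usleep()
above), simulate(false, true, 65000, 60000, 3) reports a timeout even though
the client replies on every poll, while simulate(true, true, 65000, 60000, 3)
does not.  A silent client is still caught under the new rule on the second
iteration: simulate(true, false, 65000, 60000, 3) returns true.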