Actual RC of "restore_command" is relevant for DB startup
From | Gunnar "Nick" Bluth |
---|---|
Subject | Actual RC of "restore_command" is relevant for DB startup |
Date | |
Msg-id | 57177C46.6040604@elster.de |
List | pgsql-docs |
Hello,

I've just stumbled across an oddity with "restore_command" while setting up a fresh environment with segmented (i.e., firewalled) networks.

I configured the restore_command as found in the PGBARMan docs (using ssh) and was a bit stunned that, after a restart, I saw this in the logs:

    2016-04-20 13:22:45 CEST [3788]: [2-1] db=,user= FATAL: could not restore file "00000002.history" from archive: child process exited with exit code 255
    2016-04-20 13:22:45 CEST [3786]: [3-1] db=,user= LOG: startup process (PID 3788) exited with exit code 1
    2016-04-20 13:22:45 CEST [3786]: [4-1] db=,user= LOG: aborting startup due to startup process failure

Which was obviously caused by:

    ssh: connect to host <archive server> port 22: Connection timed out
    rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
    rsync error: unexplained error (code 255) at io.c(226) [Receiver=3.1.0]

Now, the firewall does not let ssh through (yet), so the root cause is quite obvious. However, the docs[1] only state that:

"(...) if the command was terminated by a signal (other than SIGTERM, which is used as part of a database server shutdown) or an error by the shell (such as command not found), then recovery will abort and the server will not start up."

In [2], Kevin Grittner stated that it might be that the command's return code (RC) should be <= 255, otherwise it will be assessed as "failed badly; give up".

And indeed, after amending the restore_command with a "|| exit 1", the server starts up just fine, using replication to fetch the missing WALs.

That is OK for me right now as a workaround. However, had I found this not while setting everything up from scratch but during a disaster (or simply during a downtime or very high load of the archive server while restarting a slave), this (basically undocumented!) behavior would have caused me quite a headache.

I reckon only few users will expect a connection timeout to fall into the category of "command not found"...

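For illustration, the "|| exit 1" workaround can also be written as a small wrapper. This is only a minimal sketch and is not taken from the PGBARMan docs: the function name clamp_exit is hypothetical, and the failing fetch is simulated with sh -c 'exit 255' instead of a real ssh/rsync call. It maps any non-zero return code of the wrapped command to 1, which is the value that let the server start up in my case.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption, not part of PostgreSQL or PGBARMan).
# Runs the given command and reports any failure as return code 1,
# so that e.g. a 255 from a failed ssh connection is not passed on unchanged.
clamp_exit() {
    "$@" || return 1
}

# Simulated fetch that fails the way the ssh/rsync call did (RC 255).
clamp_exit sh -c 'exit 255' || echo "clamped RC: $?"
```

In a real setup the wrapped command would be the actual fetch command with its %f and %p arguments. The trade-off is that genuinely hard failures are hidden as well, since everything is reported as a plain RC 1.
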
Maybe the part "error by the shell (such as command not found)" could be changed to "error by the shell (RC > 254, e.g. command not found or ssh connection failure)"? (Or to whatever the real behaviour actually is; I didn't check the sources...)

[1] http://www.postgresql.org/docs/current/static/archive-recovery-settings.html
[2] http://stackoverflow.com/questions/10524458/postgresql-9-1-streaming-replication-restore-command-special-meaning-of-exit-co

Best regards,
--
Gunnar "Nick" Bluth
DBA ELSTER
Tel: +49 911/991-4665
Mobile: +49 172/8853339