Обсуждение: restore_command return code behaviour
Dear hackers,
We encountered an issue with restore_command when using scp on a v16 version. When SCP cannot connect to a host, it returns a return code of 255 (I won't discuss the decision to use such a return code). The return code of the restore_command is tested at [1] by calling wait_result_is_any_signal() [2]. If the code is greater than 125 or 128, a FATAL error level is generated in [1] in ereport, which leads to the standby shutting down [3]. This is perfectly fine for us.
However, the documentation [4] states the following:
> An exception is that if the command was terminated by a signal (other than SIGTERM, which is used as part of a database server shutdown) or an error by the shell (such as command not found), then recovery will abort and the server will not start up.
Could we perhaps improve the documentation by stating that return codes over 125 or (at least) 128 will lead to the server not starting?
This may help people better understand the behaviour of the restore_command and quickly solve these kinds of issues without having to examine the source code.
If you agree, we can suggest a patch for the documentation.
Any thoughts would be much appreciated!
On Thu, Jul 24, 2025 at 2:18 PM Jean-Christophe Arnu <jcarnu@gmail.com> wrote: > Could we perhaps improve the documentation by stating that return codes over 125 or (at least) 128 will lead to the servernot starting? > > This may help people better understand the behaviour of the restore_command and quickly solve these kinds of issues withouthaving to examine the source code. > > If you agree, we can suggest a patch for the documentation. If you've tripped over it, I think we should consider a docs patch. To confirm -- you're happy with the behavior as-is? --Jacob
> If you agree, we can suggest a patch for the documentation.
If you've tripped over it, I think we should consider a docs patch.
To confirm -- you're happy with the behavior as-is?
restore_command returning over 128 can be wrapped in a shell script to catch the given rc *if server shutdown is not the desired behaviour. I agree that the docs should be as clear as possible about this behaviour (which may also be an API, that should not be changed).
On fri. 25 jul. 2025, 00:01, Jacob Champion <jacob.champion@enterprisedb.com> wrote :Here are the places where I think the details should be added :- GUC documentation [1]- Backup and Restore [2]The other mention of restore_command does not involve (or require) return code details.If there are no objections, I'll start writing a patch proposal on Monday.[1] https://www.postgresql.org/docs/18/runtime-config-wal.html#GUC-RESTORE-COMMAND : file doc/src/sgml/config.sgml[2] https://www.postgresql.org/docs/16/continuous-archiving.html#BACKUP-PITR-RECOVERY file: doc/src/sgml/backup.sgml
Вложения
Here's a first version of this tiny doc patch.Hope this is clear enough.
Вложения
Hi Jean-Christophe, On Mon, 28 Jul 2025 10:23:07 +0200 Jean-Christophe Arnu <jcarnu@gmail.com> wrote: > Le lun. 28 juil. 2025 à 10:01, Jean-Christophe Arnu <jcarnu@gmail.com> a > écrit : > > > Here's a first version of this tiny doc patch. > > > > Hope this is clear enough. > > > > Sorry, the previous version had a bad wording for one file. Please do not > review it. The following version is the correct one. a signal (other than <systemitem>SIGTERM</systemitem>, which is used as part of a database server shutdown) or an error by the shell (such as command not found), then recovery will abort and the server will not start - up. + up. However, the server will also not start if the command returns a code + of 128 and above. It seems redundant with the explanation in this paragraph when you know that a code greater than 125 is returned on shell error or signal. As I'm sure you already know, this behavior is documented on the archive_command side using these words: « When the archive command is terminated by a signal (other than SIGTERM that is used as part of a server shutdown) or an error by the shell with an exit status greater than 125 (such as command not found), or if the archive function emits an ERROR or FATAL, the archiver process aborts and gets restarted by the postmaster. » So I assume we could keep the same documentation style for the restore_command side: « An exception is that if the command was terminated by a signal (other than SIGTERM, which is used as part of a database server shutdown) or an error by the shell **with an exit status greater than 125** (such as command not found), then recovery will abort and the server will not start up. » What do you think?
command not found), then recovery will abort and the server will not start
- up.
+ up. However, the server will also not start if the command returns a code
+ of 128 and above.
It seems redundant with the explanation in this paragraph when you know that a
code greater than 125 is returned on shell error or signal.
As I'm sure you already know, this behavior is documented on the
archive_command side using these words:
So I assume we could keep the same documentation style for the restore_command
side:
«
An exception is that if the command was terminated by a signal (other than
SIGTERM, which is used as part of a database server shutdown) or an error by
the shell **with an exit status greater than 125** (such as command not
found), then recovery will abort and the server will not start up.
»
What do you think?
--
Вложения
On Mon, Jul 28, 2025 at 8:19 AM Jean-Christophe Arnu <jcarnu@gmail.com> wrote: > You're also right. That's more consistent and easier to read. > Thank you for pointing this out. I agree that reusing archive_command's wording is probably the way to go. I think archive_cleanup_command and recovery_end_command have the same issue; opinions on changing those as well? Nitpick mode: I think the current wording for archive_command could be misleading. > an error by the shell with an exit status greater than 125 (such as command not found) The phrase "by the shell" is not really true here (that's kind of the point of the thread, I'd argue) and I'm wondering if we should move that to the parenthetical. But I don't like any of my draft ideas so far, and I don't want to get in the way of a small improvement by demanding perfection. Thanks, --Jacob
I agree that reusing archive_command's wording is probably the way to
go. I think archive_cleanup_command and recovery_end_command have the
same issue; opinions on changing those as well?
Nitpick mode: I think the current wording for archive_command could be
misleading.
> an error by the shell with an exit status greater than 125 (such as command not found)
The phrase "by the shell" is not really true here (that's kind of the
point of the thread, I'd argue) and I'm wondering if we should move
that to the parenthetical. But I don't like any of my draft ideas so
far, and I don't want to get in the way of a small improvement by
demanding perfection.
than SIGTERM which is used as part of a database server shutdown)
and the server will not start up.
The recovery will be aborted and the server will stop if any of the following events occur:
- the command was terminated by a signal other than SIGTERM (which is used as part of a database server shutdown);
- the command returns an exit code greater than 125
- the shell returns an error (such as 'command not found')
The former has a 'heavier' style; the latter has the benefit of clearly showing each condition for shutting down the server (but it breaks the GUC style, where bullet points are only used for defining possible values).
--
On Mon, Jul 28, 2025 at 1:58 PM Jean-Christophe Arnu <jcarnu@gmail.com> wrote: > Or > > The recovery will be aborted and the server will stop if any of the following events occur: > - the command was terminated by a signal other than SIGTERM (which is used as part of a database server shutdown); > - the command returns an exit code greater than 125 > - the shell returns an error (such as 'command not found') > > The former has a 'heavier' style; the latter has the benefit of clearly showing each condition for shutting down the server(but it breaks the GUC style, where bullet points are only used for defining possible values). I like the latter. Riffing on that, we could collapse the bullet points and reuse a bit of the current wording: Recovery will abort and the server will not start up if any of the following events occur: the command is terminated by a signal other than SIGTERM (which is used as part of a database server shutdown); the command returns an exit status greater than 125; or the shell returns an error, such as "command not found". WDYT? --Jacob
Recovery will abort and the server will not start up if any of the
following events occur: the command is terminated by a signal other
than SIGTERM (which is used as part of a database server shutdown);
the command returns an exit status greater than 125; or the shell
returns an error, such as "command not found".
On Mon, Jul 28, 2025 at 2:42 PM David G. Johnston <david.g.johnston@gmail.com> wrote: > I don’t understand calling out sigterm as an exception, the same abort-and-shutdown action happens there too. RestoreArchivedFile() has a special case for SIGTERM, though? > And in any case signals are turned into exit status values anyway so naming them specifically seems redundant. > > The “Command not found” error is defined by POSIX as exit status 127 making its specification redundant with > 125 I don't think either is redundant from the point of view of the targeted audience, who may not understand the overlap in the POSIX specification. "My command returned", "my command died", and "my command never ran" are interesting cases to have to consider (and I think it's unfortunate that we can't reliably tell them apart). --Jacob
On Mon, Jul 28, 2025 at 2:42 PM David G. Johnston
<david.g.johnston@gmail.com> wrote:
> I don’t understand calling out sigterm as an exception, the same abort-and-shutdown action happens there too.
RestoreArchivedFile() has a special case for SIGTERM, though?
> And in any case signals are turned into exit status values anyway so naming them specifically seems redundant.
>
> The “Command not found” error is defined by POSIX as exit status 127 making its specification redundant with > 125
I don't think either is redundant from the point of view of the
targeted audience, who may not understand the overlap in the POSIX
specification. "My command returned", "my command died", and "my
command never ran" are interesting cases to have to consider (and I
think it's unfortunate that we can't reliably tell them apart).
On Monday, July 28, 2025, Jacob Champion <jacob.champion@enterprisedb.com> wrote:On Mon, Jul 28, 2025 at 2:42 PM David G. Johnston
<david.g.johnston@gmail.com> wrote:
> I don’t understand calling out sigterm as an exception, the same abort-and-shutdown action happens there too.
RestoreArchivedFile() has a special case for SIGTERM, though?I don’t see anything calling out sigterm by name. It seems to be handled the same as any other signal exit and, because of the “true” being passed into wait_result_is_any_signal, identically to exit statuses 126-127 as well.
> And in any case signals are turned into exit status values anyway so naming them specifically seems redundant.Ok, yeah, the C API differentiates this reality since an unhandled signal precludes an exit status from being created. I’m fine with “…unhandled signals causing command termination, or exit statues > 125, result in abort and stop”.
Thank you!
Should we stick to the original wording, changing it slightly?
Or should we use the previous proposal (in that thread) with some adaptations?
following events occur: the command is terminated by an unhandled
signal or the command returns an exit status greater than 125
(including "command not found").
--
On Mon, Jul 28, 2025 at 3:38 PM David G. Johnston <david.g.johnston@gmail.com> wrote: > On Monday, July 28, 2025, Jacob Champion <jacob.champion@enterprisedb.com> wrote: >> RestoreArchivedFile() has a special case for SIGTERM, though? > > I don’t see anything calling out sigterm by name. It's got a comment explaining the separate behavior, right before the call to wait_result_is_signal(rc, SIGTERM): https://git.postgresql.org/cgit/postgresql.git/tree/src/backend/access/transam/xlogarchive.c?id=412036c22d6a605340dbe397da1fb12fccd3897f#n254 > I don’t really see a point in describing the special meanings ascribed to 126 and 127 here since the error message willhandle that should the need arise. I don't think Jean-Christophe's stated problem is with the server's error message (Jean-Christophe, please jump in and correct me) -- it's in knowing how to design the command so that the system behaves correctly. I'm not very excited about removing information at the same time that we're solving a lack-of-information problem. (At least not without consensus that the information is irrelevant, and personally I think the cases described so far in this thread are relevant to people writing these commands.) Thanks, --Jacob
On Mon, Jul 28, 2025 at 3:38 PM David G. Johnston
<david.g.johnston@gmail.com> wrote:
> On Monday, July 28, 2025, Jacob Champion <jacob.champion@enterprisedb.com> wrote:
>> RestoreArchivedFile() has a special case for SIGTERM, though?
>
> I don’t see anything calling out sigterm by name.
It's got a comment explaining the separate behavior, right before the
call to wait_result_is_signal(rc, SIGTERM):
https://git.postgresql.org/cgit/postgresql.git/tree/src/ backend/access/transam/ xlogarchive.c?id= 412036c22d6a605340dbe397da1fb1 2fccd3897f#n254
> I don’t really see a point in describing the special meanings ascribed to 126 and 127 here since the error message will handle that should the need arise.
I don't think Jean-Christophe's stated problem is with the server's
error message (Jean-Christophe, please jump in and correct me) -- it's
in knowing how to design the command so that the system behaves
correctly. I'm not very excited about removing information at the same
time that we're solving a lack-of-information problem. (At least not
without consensus that the information is irrelevant, and personally I
think the cases described so far in this thread are relevant to people
writing these commands.)
I don't think Jean-Christophe's stated problem is with the server's
error message (Jean-Christophe, please jump in and correct me) -- it's
in knowing how to design the command so that the system behaves
correctly. I'm not very excited about removing information at the same
time that we're solving a lack-of-information problem. (At least not
without consensus that the information is irrelevant, and personally I
think the cases described so far in this thread are relevant to people
writing these commands.)
On Mon, Jul 28, 2025 at 1:58 PM Jean-Christophe Arnu <jcarnu@gmail.com> wrote:
> Or
>
> The recovery will be aborted and the server will stop if any of the following events occur:
> - the command was terminated by a signal other than SIGTERM (which is used as part of a database server shutdown);
> - the command returns an exit code greater than 125
> - the shell returns an error (such as 'command not found')
>
> The former has a 'heavier' style; the latter has the benefit of clearly showing each condition for shutting down the server (but it breaks the GUC style, where bullet points are only used for defining possible values).
I like the latter. Riffing on that, we could collapse the bullet
points and reuse a bit of the current wording:
Recovery will abort and the server will not start up if any of the
following events occur: the command is terminated by a signal other
than SIGTERM (which is used as part of a database server shutdown);
the command returns an exit status greater than 125; or the shell
returns an error, such as "command not found".
How about:Recovery will abort and the server will not start up upon any of the following:the shell is unable to execute the command (due to it not being found or executable),the command returns an exit status greater than 125, or a non-SIGTERM signalterminates the shell process. SIGTERM allows a clean shutdown to happen, if there is one.Mostly just changing the order a bit but goes from most immediate when making a change (bad command written into the GUC) to least immediate or surprising (SIGTERM). The other two flow from there.