Обсуждение: restore_command return code behaviour

Поиск
Список
Период
Сортировка

restore_command return code behaviour

От
Jean-Christophe Arnu
Дата:

Dear hackers,

We encountered an issue with restore_command when using scp on a v16 version. When SCP cannot connect to a host, it returns a return code of 255 (I won't discuss the decision to use such a return code). The return code of the restore_command is tested at [1] by calling wait_result_is_any_signal() [2]. If the code is greater than 125 or 128, a FATAL error level is generated in [1] in ereport, which leads to the standby shutting down [3]. This is perfectly fine for us.

However, the documentation [4] states the following:

> It is important for the command to return a zero exit status only if it succeeds. The command will be asked for file names that are not present in the archive; it must return nonzero when so asked. (...)
> An exception is that if the command was terminated by a signal (other than SIGTERM, which is used as part of a database server shutdown) or an error by the shell (such as command not found), then recovery will abort and the server will not start up.

Could we perhaps improve the documentation by stating that return codes over 125 or (at least) 128 will lead to the server not starting?

This may help people better understand the behaviour of the restore_command and quickly solve these kinds of issues without having to examine the source code.

If you agree, we can suggest a patch for the documentation.

Any thoughts would be much appreciated!


Thank you


--
Jean-Christophe Arnu

Re: restore_command return code behaviour

От
Jacob Champion
Дата:
On Thu, Jul 24, 2025 at 2:18 PM Jean-Christophe Arnu <jcarnu@gmail.com> wrote:
> Could we perhaps improve the documentation by stating that return codes over 125 or (at least) 128 will lead to the
servernot starting? 
>
> This may help people better understand the behaviour of the restore_command and quickly solve these kinds of issues
withouthaving to examine the source code. 
>
> If you agree, we can suggest a patch for the documentation.

If you've tripped over it, I think we should consider a docs patch.

To confirm -- you're happy with the behavior as-is?

--Jacob



Re: restore_command return code behaviour

От
Jean-Christophe Arnu
Дата:
On fri. 25 jul. 2025,  00:01, Jacob Champion <jacob.champion@enterprisedb.com> wrote :

> If you agree, we can suggest a patch for the documentation.

If you've tripped over it, I think we should consider a docs patch.

We've spent time on it, guessing it was a return code-driven behaviour, analysing the code, and finally using debugger to confirm it. If we can avoid that round trip (and any erroneous analyses stating "it does not work") for other users, that would be ideal.
 

To confirm -- you're happy with the behavior as-is?

I'm completely fine with it.
restore_command returning over 128 can be wrapped in a shell script to catch the given rc *if server shutdown is not the desired behaviour. I agree that the docs should be as clear as possible about this behaviour (which may also be an API, that should not be changed).

Here are the places where I think  the details should be added :
- GUC documentation [1]
- Backup and Restore [2]

The other mention of restore_command does not involve (or require) return code details.
If there are no objections, I'll start writing a patch proposal on Monday.


Re: restore_command return code behaviour

От
Jean-Christophe Arnu
Дата:

On friday, 25 jul. 2025 à 09:35, Jean-Christophe Arnu <jcarnu@gmail.com> wrote :
On fri. 25 jul. 2025,  00:01, Jacob Champion <jacob.champion@enterprisedb.com> wrote :

Here are the places where I think  the details should be added :
- GUC documentation [1]
- Backup and Restore [2]

The other mention of restore_command does not involve (or require) return code details.
If there are no objections, I'll start writing a patch proposal on Monday.



Here's a first version of this tiny doc patch.

Hope this is clear enough.
--
Jean-Christophe Arnu
Вложения

Re: restore_command return code behaviour

От
Jean-Christophe Arnu
Дата:


Le lun. 28 juil. 2025 à 10:01, Jean-Christophe Arnu <jcarnu@gmail.com> a écrit :
Here's a first version of this tiny doc patch.

Hope this is clear enough.

Sorry, the previous version had a bad wording for one file. Please do not review it. The following version is the correct one.

 
--
Jean-Christophe Arnu
Вложения

Re: restore_command return code behaviour

От
Jehan-Guillaume de Rorthais
Дата:
Hi Jean-Christophe,

On Mon, 28 Jul 2025 10:23:07 +0200
Jean-Christophe Arnu <jcarnu@gmail.com> wrote:

> Le lun. 28 juil. 2025 à 10:01, Jean-Christophe Arnu <jcarnu@gmail.com> a
> écrit :
>
> > Here's a first version of this tiny doc patch.
> >
> > Hope this is clear enough.
> >
>
> Sorry, the previous version had a bad wording for one file. Please do not
> review it. The following version is the correct one.

       a signal (other than <systemitem>SIGTERM</systemitem>, which is used as
       part of a database server shutdown) or an error by the shell (such as
       command not found), then recovery will abort and the server will not start
  -    up.
  +    up. However, the server will also not start if the command returns a code
  +    of 128 and above.

It seems redundant with the explanation in this paragraph when you know that a
code greater than 125 is returned on shell error or signal.

As I'm sure you already know, this behavior is documented on the
archive_command side using these words:

«
  When the archive command is terminated by a signal (other than SIGTERM that
  is used as part of a server shutdown) or an error by the shell with an exit
  status greater than 125 (such as command not found), or if the archive
  function emits an ERROR or FATAL, the archiver process aborts and gets
  restarted by the postmaster.
»

So I assume we could keep the same documentation style for the restore_command
side:

«
  An exception is that if the command was terminated by a signal (other than
  SIGTERM, which is used as part of a database server shutdown) or an error by
  the shell **with an exit status greater than 125** (such as command not
  found), then recovery will abort and the server will not start up.
»

What do you think?



Re: restore_command return code behaviour

От
Jean-Christophe Arnu
Дата:
Hi Jehan-Guillaume,

Le lun. 28 juil. 2025 à 12:32, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> a écrit :
       command not found), then recovery will abort and the server will not start
  -    up.
  +    up. However, the server will also not start if the command returns a code
  +    of 128 and above.

It seems redundant with the explanation in this paragraph when you know that a
code greater than 125 is returned on shell error or signal.

You're right.
 
As I'm sure you already know, this behavior is documented on the
archive_command side using these words:
[...]
So I assume we could keep the same documentation style for the restore_command
side:

«
  An exception is that if the command was terminated by a signal (other than
  SIGTERM, which is used as part of a database server shutdown) or an error by
  the shell **with an exit status greater than 125** (such as command not
  found), then recovery will abort and the server will not start up.
»

What do you think?

You're also right. That's more consistent and easier to read.
Thank you for pointing this out.


--
Jean-Christophe Arnu
Вложения

Re: restore_command return code behaviour

От
Jacob Champion
Дата:
On Mon, Jul 28, 2025 at 8:19 AM Jean-Christophe Arnu <jcarnu@gmail.com> wrote:
> You're also right. That's more consistent and easier to read.
> Thank you for pointing this out.

I agree that reusing archive_command's wording is probably the way to
go. I think archive_cleanup_command and recovery_end_command have the
same issue; opinions on changing those as well?

Nitpick mode: I think the current wording for archive_command could be
misleading.

> an error by the shell with an exit status greater than 125 (such as command not found)

The phrase "by the shell" is not really true here (that's kind of the
point of the thread, I'd argue) and I'm wondering if we should move
that to the parenthetical. But I don't like any of my draft ideas so
far, and I don't want to get in the way of a small improvement by
demanding perfection.

Thanks,
--Jacob



Re: restore_command return code behaviour

От
Jean-Christophe Arnu
Дата:

Le lun. 28 juil. 2025 à 20:15, Jacob Champion <jacob.champion@enterprisedb.com> a écrit :
I agree that reusing archive_command's wording is probably the way to
go. I think archive_cleanup_command and recovery_end_command have the
same issue; opinions on changing those as well?

I agree, we could include both of those  in the  fix. But let's refine the wording.
 
Nitpick mode: I think the current wording for archive_command could be
misleading.

> an error by the shell with an exit status greater than 125 (such as command not found)

The phrase "by the shell" is not really true here (that's kind of the
point of the thread, I'd argue) and I'm wondering if we should move
that to the parenthetical. But I don't like any of my draft ideas so
far, and I don't want to get in the way of a small improvement by
demanding perfection.

English is not my native language but let's try some wordings:

Proposals (scheme):

An exception is that if the command was terminated by a signal (other
than SIGTERM which is used as part of a database server shutdown) 
or a command returned with an exit code greater than 125, or  an error by the shell  
(such as command not found), then recovery will abort
and the server will not start up.

Or

The recovery will be aborted and the server will stop if any of the following events occur:
- the command was terminated by a signal other than SIGTERM (which is used as part of a database server shutdown);
- the command returns an exit code greater than 125
- the shell returns an error (such as 'command not found')

The former has a 'heavier' style; the latter has the benefit of clearly showing each condition for shutting down the server (but it breaks the GUC style, where bullet points are only used for defining possible values).



--
Jean-Christophe Arnu

Re: restore_command return code behaviour

От
Jacob Champion
Дата:
On Mon, Jul 28, 2025 at 1:58 PM Jean-Christophe Arnu <jcarnu@gmail.com> wrote:
> Or
>
> The recovery will be aborted and the server will stop if any of the following events occur:
> - the command was terminated by a signal other than SIGTERM (which is used as part of a database server shutdown);
> - the command returns an exit code greater than 125
> - the shell returns an error (such as 'command not found')
>
> The former has a 'heavier' style; the latter has the benefit of clearly showing each condition for shutting down the
server(but it breaks the GUC style, where bullet points are only used for defining possible values). 

I like the latter. Riffing on that, we could collapse the bullet
points and reuse a bit of the current wording:

Recovery will abort and the server will not start up if any of the
following events occur: the command is terminated by a signal other
than SIGTERM (which is used as part of a database server shutdown);
the command returns an exit status greater than 125; or the shell
returns an error, such as "command not found".

WDYT?
--Jacob



Re: restore_command return code behaviour

От
"David G. Johnston"
Дата:
On Monday, July 28, 2025, Jacob Champion <jacob.champion@enterprisedb.com> wrote:

Recovery will abort and the server will not start up if any of the
following events occur: the command is terminated by a signal other
than SIGTERM (which is used as part of a database server shutdown);
the command returns an exit status greater than 125; or the shell
returns an error, such as "command not found".

I don’t understand calling out sigterm as an exception, the same abort-and-shutdown action happens there too.  And in any case signals are turned into exit status values anyway so naming them specifically seems redundant.

The “Command not found” error is defined by POSIX as exit status 127 making its specification redundant with > 125

“The server interprets the exit status per POSIX conventions: 0 for success, and any value between 1-125 to gracefully handle the case where a file name is supplied that does not exist, or exhibits a temporary failure during processing.  All other exit status values are considered fatal, causing recovery to be aborted and the server stopped.”

David J.


Re: restore_command return code behaviour

От
Jacob Champion
Дата:
On Mon, Jul 28, 2025 at 2:42 PM David G. Johnston
<david.g.johnston@gmail.com> wrote:
> I don’t understand calling out sigterm as an exception, the same abort-and-shutdown action happens there too.

RestoreArchivedFile() has a special case for SIGTERM, though?

> And in any case signals are turned into exit status values anyway so naming them specifically seems redundant.
>
> The “Command not found” error is defined by POSIX as exit status 127 making its specification redundant with > 125

I don't think either is redundant from the point of view of the
targeted audience, who may not understand the overlap in the POSIX
specification. "My command returned", "my command died", and "my
command never ran" are interesting cases to have to consider (and I
think it's unfortunate that we can't reliably tell them apart).

--Jacob



Re: restore_command return code behaviour

От
"David G. Johnston"
Дата:
On Monday, July 28, 2025, Jacob Champion <jacob.champion@enterprisedb.com> wrote:
On Mon, Jul 28, 2025 at 2:42 PM David G. Johnston
<david.g.johnston@gmail.com> wrote:
> I don’t understand calling out sigterm as an exception, the same abort-and-shutdown action happens there too.

RestoreArchivedFile() has a special case for SIGTERM, though?

I don’t see anything calling out sigterm by name.  It seems to be handled the same as any other signal exit and, because of the “true” being passed into wait_result_is_any_signal, identically to exit statuses 126-127 as well.
 

> And in any case signals are turned into exit status values anyway so naming them specifically seems redundant.

Ok, yeah, the C API differentiates this reality since an unhandled signal precludes an exit status from being created.  I’m fine with “…unhandled signals causing command termination, or exit statues > 125, result in abort and stop”.
 
>
> The “Command not found” error is defined by POSIX as exit status 127 making its specification redundant with > 125

I don't think either is redundant from the point of view of the
targeted audience, who may not understand the overlap in the POSIX
specification. "My command returned", "my command died", and "my
command never ran" are interesting cases to have to consider (and I
think it's unfortunate that we can't reliably tell them apart).

wait_result_to_str seems to be sufficient.  Sure, the command author can choose to violate conventions, but that’s life.

I don’t really see a point in describing the special meanings ascribed to 126 and 127 here since the error message will handle that should the need arise.

David J.

Re: restore_command return code behaviour

От
Jean-Christophe Arnu
Дата:

Le mar. 29 juil. 2025 à 00:38, David G. Johnston <david.g.johnston@gmail.com> a écrit :
On Monday, July 28, 2025, Jacob Champion <jacob.champion@enterprisedb.com> wrote:
On Mon, Jul 28, 2025 at 2:42 PM David G. Johnston
<david.g.johnston@gmail.com> wrote:
> I don’t understand calling out sigterm as an exception, the same abort-and-shutdown action happens there too.

RestoreArchivedFile() has a special case for SIGTERM, though?

I don’t see anything calling out sigterm by name.  It seems to be handled the same as any other signal exit and, because of the “true” being passed into wait_result_is_any_signal, identically to exit statuses 126-127 as well.
 

> And in any case signals are turned into exit status values anyway so naming them specifically seems redundant.

Ok, yeah, the C API differentiates this reality since an unhandled signal precludes an exit status from being created.  I’m fine with “…unhandled signals causing command termination, or exit statues > 125, result in abort and stop”.
 

Thank you!

Should we stick to the original wording, changing it slightly?



An exception is that if the command was terminated by an
unhandled signal causing command termination or an exit status 
above 125 (including command not found), then recovery will abort 
and the server will not start up.


Or should we use the previous proposal (in that thread) with some adaptations?


Recovery will abort and the server will not start up if any of the
following events occur: the command is terminated by an unhandled
signal or the command returns an exit status greater than 125
(including "command not found").


--
Jean-Christophe Arnu

Re: restore_command return code behaviour

От
Jacob Champion
Дата:
On Mon, Jul 28, 2025 at 3:38 PM David G. Johnston
<david.g.johnston@gmail.com> wrote:
> On Monday, July 28, 2025, Jacob Champion <jacob.champion@enterprisedb.com> wrote:
>> RestoreArchivedFile() has a special case for SIGTERM, though?
>
> I don’t see anything calling out sigterm by name.

It's got a comment explaining the separate behavior, right before the
call to wait_result_is_signal(rc, SIGTERM):


https://git.postgresql.org/cgit/postgresql.git/tree/src/backend/access/transam/xlogarchive.c?id=412036c22d6a605340dbe397da1fb12fccd3897f#n254

> I don’t really see a point in describing the special meanings ascribed to 126 and 127 here since the error message
willhandle that should the need arise. 

I don't think Jean-Christophe's stated problem is with the server's
error message (Jean-Christophe, please jump in and correct me) -- it's
in knowing how to design the command so that the system behaves
correctly. I'm not very excited about removing information at the same
time that we're solving a lack-of-information problem. (At least not
without consensus that the information is irrelevant, and personally I
think the cases described so far in this thread are relevant to people
writing these commands.)

Thanks,
--Jacob



Re: restore_command return code behaviour

От
"David G. Johnston"
Дата:
On Wednesday, July 30, 2025, Jacob Champion <jacob.champion@enterprisedb.com> wrote:
On Mon, Jul 28, 2025 at 3:38 PM David G. Johnston
<david.g.johnston@gmail.com> wrote:
> On Monday, July 28, 2025, Jacob Champion <jacob.champion@enterprisedb.com> wrote:
>> RestoreArchivedFile() has a special case for SIGTERM, though?
>
> I don’t see anything calling out sigterm by name.

It's got a comment explaining the separate behavior, right before the
call to wait_result_is_signal(rc, SIGTERM):

    https://git.postgresql.org/cgit/postgresql.git/tree/src/backend/access/transam/xlogarchive.c?id=412036c22d6a605340dbe397da1fb12fccd3897f#n254


Oops, I was looking at fe_utils not xlogarchive.

 
> I don’t really see a point in describing the special meanings ascribed to 126 and 127 here since the error message will handle that should the need arise.

I don't think Jean-Christophe's stated problem is with the server's
error message (Jean-Christophe, please jump in and correct me) -- it's
in knowing how to design the command so that the system behaves
correctly. I'm not very excited about removing information at the same
time that we're solving a lack-of-information problem. (At least not
without consensus that the information is irrelevant, and personally I
think the cases described so far in this thread are relevant to people
writing these commands.)

Works for me.  I do see the value in pointing out “command could not be executed” behavior without referring to exit status codes.  But it does make the text longer and maybe we could centralize it somewhere by leveraging the reference design.  But acknowledge that most of them will not have the knowledge memorized and being a bit more verbose would be a useful shortcut for them.

David J.


Re: restore_command return code behaviour

От
Jean-Christophe Arnu
Дата:


I don't think Jean-Christophe's stated problem is with the server's
error message (Jean-Christophe, please jump in and correct me) -- it's
in knowing how to design the command so that the system behaves
correctly. I'm not very excited about removing information at the same
time that we're solving a lack-of-information problem. (At least not
without consensus that the information is irrelevant, and personally I
think the cases described so far in this thread are relevant to people
writing these commands.)

That's right it has nothing to do with the error message. That proposal was only intended to help any people with the server behaviour in such cases. 

Not everybody reminds its POSIX return codes if not practiced. And it may be hard  (or take long) to understand why the server is stopped with a simple scp returning 255 in some cases.

Just adding rhe "over 126 return code leads to server stop" (or so) to the current doc version seems enough to me.

Jean-Christophe Arnu

Re: restore_command return code behaviour

От
"David G. Johnston"
Дата:
On Mon, Jul 28, 2025 at 2:22 PM Jacob Champion <jacob.champion@enterprisedb.com> wrote:
On Mon, Jul 28, 2025 at 1:58 PM Jean-Christophe Arnu <jcarnu@gmail.com> wrote:
> Or
>
> The recovery will be aborted and the server will stop if any of the following events occur:
> - the command was terminated by a signal other than SIGTERM (which is used as part of a database server shutdown);
> - the command returns an exit code greater than 125
> - the shell returns an error (such as 'command not found')
>
> The former has a 'heavier' style; the latter has the benefit of clearly showing each condition for shutting down the server (but it breaks the GUC style, where bullet points are only used for defining possible values).

I like the latter. Riffing on that, we could collapse the bullet
points and reuse a bit of the current wording:

Recovery will abort and the server will not start up if any of the
following events occur: the command is terminated by a signal other
than SIGTERM (which is used as part of a database server shutdown);
the command returns an exit status greater than 125; or the shell
returns an error, such as "command not found".


How about:

Recovery will abort and the server will not start up upon any of the following:
the shell is unable to execute the command (due to it not being found or executable),
the command returns an exit status greater than 125, or a non-SIGTERM signal
terminates the shell process.  SIGTERM allows a clean shutdown to happen, if there is one.


Mostly just changing the order a bit but goes from most immediate when making a change (bad command written into the GUC) to least immediate or surprising (SIGTERM).  The other two flow from there.

David J.


Re: restore_command return code behaviour

От
Jean-Christophe Arnu
Дата:

Le lun. 4 août 2025, 04:45, David G. Johnston <david.g.johnston@gmail.com> a écrit :

How about:

Recovery will abort and the server will not start up upon any of the following:
the shell is unable to execute the command (due to it not being found or executable),
the command returns an exit status greater than 125, or a non-SIGTERM signal
terminates the shell process.  SIGTERM allows a clean shutdown to happen, if there is one.


Mostly just changing the order a bit but goes from most immediate when making a change (bad command written into the GUC) to least immediate or surprising (SIGTERM).  The other two flow from there.


I agree. I like this concise way to explain the different cases.