Обсуждение: Return value of pg_promote()
Hi All, At present, pg_promote() returns true to the caller on successful promotion of standby, however it returns false in multiple scenarios which includes: 1) The SIGUSR1 signal could not be sent to the postmaster process. 2) The postmaster died during standby promotion. 3) Standby couldn't be promoted within the specified wait time. For an application calling this function, if pg_promote returns false, it is hard to interpret the reason behind it. So I think we should *only* allow pg_promote to return false when the server could not be promoted in the given wait time and in other scenarios it should just throw an error (FATAL, ERROR ... depending on the type of failure that occurred). Please let me know your thoughts on this change. thanks.! -- With Regards, Ashutosh Sharma.
On Tue, 2023-06-06 at 16:35 +0530, Ashutosh Sharma wrote: > At present, pg_promote() returns true to the caller on successful > promotion of standby, however it returns false in multiple scenarios > which includes: > > 1) The SIGUSR1 signal could not be sent to the postmaster process. > 2) The postmaster died during standby promotion. > 3) Standby couldn't be promoted within the specified wait time. > > For an application calling this function, if pg_promote returns false, > it is hard to interpret the reason behind it. So I think we should > *only* allow pg_promote to return false when the server could not be > promoted in the given wait time and in other scenarios it should just > throw an error (FATAL, ERROR ... depending on the type of failure that > occurred). Please let me know your thoughts on this change. thanks.! As the original author, I'd say that that sounds reasonable, particularly in case #1. If the postmaster dies, we are going to die too, so it probably doesn't matter much. But I think an error is certainly also correct in that case. Yours, Laurenz Albe
On 2023/06/07 2:00, Laurenz Albe wrote: > On Tue, 2023-06-06 at 16:35 +0530, Ashutosh Sharma wrote: >> At present, pg_promote() returns true to the caller on successful >> promotion of standby, however it returns false in multiple scenarios >> which includes: >> >> 1) The SIGUSR1 signal could not be sent to the postmaster process. >> 2) The postmaster died during standby promotion. >> 3) Standby couldn't be promoted within the specified wait time. >> >> For an application calling this function, if pg_promote returns false, >> it is hard to interpret the reason behind it. So I think we should >> *only* allow pg_promote to return false when the server could not be >> promoted in the given wait time and in other scenarios it should just >> throw an error (FATAL, ERROR ... depending on the type of failure that >> occurred). Please let me know your thoughts on this change. thanks.! > > As the original author, I'd say that that sounds reasonable, particularly > in case #1. If the postmaster dies, we are going to die too, so it > probably doesn't matter much. But I think an error is certainly also > correct in that case. +1 Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On Wed, Jun 7, 2023 at 9:55 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > > > > On 2023/06/07 2:00, Laurenz Albe wrote: > > On Tue, 2023-06-06 at 16:35 +0530, Ashutosh Sharma wrote: > >> At present, pg_promote() returns true to the caller on successful > >> promotion of standby, however it returns false in multiple scenarios > >> which includes: > >> > >> 1) The SIGUSR1 signal could not be sent to the postmaster process. > >> 2) The postmaster died during standby promotion. > >> 3) Standby couldn't be promoted within the specified wait time. > >> > >> For an application calling this function, if pg_promote returns false, > >> it is hard to interpret the reason behind it. So I think we should > >> *only* allow pg_promote to return false when the server could not be > >> promoted in the given wait time and in other scenarios it should just > >> throw an error (FATAL, ERROR ... depending on the type of failure that > >> occurred). Please let me know your thoughts on this change. thanks.! > > > > As the original author, I'd say that that sounds reasonable, particularly > > in case #1. If the postmaster dies, we are going to die too, so it > > probably doesn't matter much. But I think an error is certainly also > > correct in that case. > > +1 > Thanks for sharing your thoughts, Laurenz and Fujii-san. I've prepared a patch that makes pg_promote error out if it couldn't send SIGUSR1 to the postmaster or if the postmaster died in the middle of standby promotion. PFA. Please note that now (with this patch) pg_promote only returns false if the standby could not be promoted within the given wait time. In case of any kind of failure, it just reports an error based on the type of failure that occurred. -- With Regards, Ashutosh Sharma.
Вложения
On Thu, Jun 08, 2023 at 04:53:50PM +0530, Ashutosh Sharma wrote: > Thanks for sharing your thoughts, Laurenz and Fujii-san. I've prepared > a patch that makes pg_promote error out if it couldn't send SIGUSR1 to > the postmaster or if the postmaster died in the middle of standby > promotion. PFA. Please note that now (with this patch) pg_promote only > returns false if the standby could not be promoted within the given > wait time. In case of any kind of failure, it just reports an error > based on the type of failure that occurred. if (kill(PostmasterPid, SIGUSR1) != 0) { - ereport(WARNING, - (errmsg("failed to send signal to postmaster: %m"))); (void) unlink(PROMOTE_SIGNAL_FILE); - PG_RETURN_BOOL(false); + ereport(ERROR, + (errmsg("failed to send signal to postmaster: %m"))); } Shouldn't you assign an error code to this one rather than the default one for internal errors, like ERRCODE_SYSTEM_ERROR? /* return immediately if waiting was not requested */ @@ -744,7 +743,9 @@ pg_promote(PG_FUNCTION_ARGS) * necessity for manual cleanup of all postmaster children. */ if (rc & WL_POSTMASTER_DEATH) - PG_RETURN_BOOL(false); + ereport(FATAL, + (errcode(ERRCODE_ADMIN_SHUTDOWN), + errmsg("terminating connection due to unexpected postmaster exit"))); I would add an errcontext here, to let somebody know that the connection died while waiting for the promotion to be processed, say "while waiting on promotion". -- Michael
Вложения
On Wed, Aug 16, 2023 at 05:02:09PM +0900, Michael Paquier wrote: > if (kill(PostmasterPid, SIGUSR1) != 0) > { > - ereport(WARNING, > - (errmsg("failed to send signal to postmaster: %m"))); > (void) unlink(PROMOTE_SIGNAL_FILE); > - PG_RETURN_BOOL(false); > + ereport(ERROR, > + (errmsg("failed to send signal to postmaster: %m"))); > } > > Shouldn't you assign an error code to this one rather than the > default one for internal errors, like ERRCODE_SYSTEM_ERROR? > > /* return immediately if waiting was not requested */ > @@ -744,7 +743,9 @@ pg_promote(PG_FUNCTION_ARGS) > * necessity for manual cleanup of all postmaster children. > */ > if (rc & WL_POSTMASTER_DEATH) > - PG_RETURN_BOOL(false); > + ereport(FATAL, > + (errcode(ERRCODE_ADMIN_SHUTDOWN), > + errmsg("terminating connection due to unexpected postmaster exit"))); > > I would add an errcontext here, to let somebody know that the > connection died while waiting for the promotion to be processed, say > "while waiting on promotion". I have just noticed that we do not have a CF entry for this proposal, so I have added one with Laurenz as author: https://commitfest.postgresql.org/44/4504/ For now the patch is waiting on author. Could you address my last review? -- Michael
Вложения
Hi Michael, On Thu, Aug 17, 2023 at 6:07 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Wed, Aug 16, 2023 at 05:02:09PM +0900, Michael Paquier wrote: > > if (kill(PostmasterPid, SIGUSR1) != 0) > > { > > - ereport(WARNING, > > - (errmsg("failed to send signal to postmaster: %m"))); > > (void) unlink(PROMOTE_SIGNAL_FILE); > > - PG_RETURN_BOOL(false); > > + ereport(ERROR, > > + (errmsg("failed to send signal to postmaster: %m"))); > > } > > > > Shouldn't you assign an error code to this one rather than the > > default one for internal errors, like ERRCODE_SYSTEM_ERROR? > > > > /* return immediately if waiting was not requested */ > > @@ -744,7 +743,9 @@ pg_promote(PG_FUNCTION_ARGS) > > * necessity for manual cleanup of all postmaster children. > > */ > > if (rc & WL_POSTMASTER_DEATH) > > - PG_RETURN_BOOL(false); > > + ereport(FATAL, > > + (errcode(ERRCODE_ADMIN_SHUTDOWN), > > + errmsg("terminating connection due to unexpected postmaster exit"))); > > > > I would add an errcontext here, to let somebody know that the > > connection died while waiting for the promotion to be processed, say > > "while waiting on promotion". > > I have just noticed that we do not have a CF entry for this proposal, > so I have added one with Laurenz as author: > https://commitfest.postgresql.org/44/4504/ > > For now the patch is waiting on author. Could you address my > last review? Thanks for reviewing the patch and adding a CF entry for it. PFA patch that addresses your review comments. And... Sorry for the delayed response. I totally missed it. -- With Regards, Ashutosh Sharma.
Вложения
On Mon, Aug 28, 2023 at 11:50:45AM +0530, Ashutosh Sharma wrote: > Thanks for reviewing the patch and adding a CF entry for it. PFA patch > that addresses your review comments. That looks OK seen from here. Perhaps others have more comments? > And... Sorry for the delayed response. I totally missed it. No problem. -- Michael
Вложения
On Thu, 2023-08-17 at 09:37 +0900, Michael Paquier wrote: > I have just noticed that we do not have a CF entry for this proposal, > so I have added one with Laurenz as author: > https://commitfest.postgresql.org/44/4504/ I have changed the author to Fujii Masao. Yours, Laurenz Albe
On Mon, Aug 28, 2023 at 02:09:37PM +0200, Laurenz Albe wrote: > On Thu, 2023-08-17 at 09:37 +0900, Michael Paquier wrote: > > I have just noticed that we do not have a CF entry for this proposal, > > so I have added one with Laurenz as author: > > https://commitfest.postgresql.org/44/4504/ > > I have changed the author to Fujii Masao. Still incorrect, as the author is Ashutosh Sharma. Fujii-san has provided some feedback, though. -- Michael