Обсуждение: Improve the connection failure error messages

Поиск
Список
Период
Сортировка

Improve the connection failure error messages

От
Nisha Moond
Дата:
Hi Hackers,

Various sections of the code utilize the walrcv_connect() function,
employed by various processes such as walreceiver, logical replication
apply worker, etc., to establish connections with other hosts.
Presently, in case of connection failures, the error message lacks
information about the specific process attempting to connect and
encountering the failure.

The provided patch enhances error messages for such connection
failures, offering more details on the processes that failed to
establish a connection.

--
Thanks,
Nisha

Вложения

Re: Improve the connection failure error messages

От
Peter Smith
Дата:
Thanks for the patch! Here are a couple of review comments for it.

======
src/backend/commands/subscriptioncmds.c

1.
@@ -742,7 +742,7 @@ CreateSubscription(ParseState *pstate,
CreateSubscriptionStmt *stmt,
  if (!wrconn)
  ereport(ERROR,
  (errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the publisher: %s", err)));
+ errmsg("\"%s\" could not connect to the publisher: %s", stmt->subname, err)));

In practice, these commands give errors like:

test_sub=# create subscription sub1 connection 'dbname=bogus' publication pub1;
ERROR:  could not connect to the publisher: connection to server on
socket "/tmp/.s.PGSQL.5432" failed: FATAL:  database "bogus" does not
exist

and logs like:

2024-01-12 12:45:05.177 AEDT [13108] ERROR:  could not connect to the
publisher: connection to server on socket "/tmp/.s.PGSQL.5432" failed:
FATAL:  database "bogus" does not exist
2024-01-12 12:45:05.177 AEDT [13108] STATEMENT:  create subscription
sub1 connection 'dbname=bogus' publication pub1;

Since the subscription name is already obvious from the statement that
caused the error I'm not sure it benefits much to add this to the
error message (but maybe it is useful if the error message was somehow
read in isolation from the statement).

Anyway, I felt at least it should include the word "subscription" for
better consistency with the other messages in this patch:

SUGGESTION
subscription \"%s\" could not connect to the publisher: %s

======

2.
+ appname = cluster_name[0] ? cluster_name : "walreceiver";
+
  /* Establish the connection to the primary for XLOG streaming */
- wrconn = walrcv_connect(conninfo, false, false,
- cluster_name[0] ? cluster_name : "walreceiver",
- &err);
+ wrconn = walrcv_connect(conninfo, false, false, appname, &err);
  if (!wrconn)
  ereport(ERROR,
  (errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the primary server: %s", err)));
+ errmsg("%s could not connect to the primary server: %s", appname, err)));

I think your new %s should be quoted according to the guidelines at [1].

======
src/test/regress/expected/subscription.out

3.
Apparently, there is no existing regression test case for the ALTER
"could not connect" message because if there was, it would have
failed. Maybe a test should be added?

======
[1] https://www.postgresql.org/docs/current/error-style-guide.html

Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Improve the connection failure error messages

От
Nisha Moond
Дата:
Thanks for the review. Attached v2 patch with suggested changes.
Please find my response inline.

On Fri, Jan 12, 2024 at 8:20 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Thanks for the patch! Here are a couple of review comments for it.
>
> ======
> src/backend/commands/subscriptioncmds.c
>
> 1.
> @@ -742,7 +742,7 @@ CreateSubscription(ParseState *pstate,
> CreateSubscriptionStmt *stmt,
>   if (!wrconn)
>   ereport(ERROR,
>   (errcode(ERRCODE_CONNECTION_FAILURE),
> - errmsg("could not connect to the publisher: %s", err)));
> + errmsg("\"%s\" could not connect to the publisher: %s", stmt->subname, err)));
>
> In practice, these commands give errors like:
>
> test_sub=# create subscription sub1 connection 'dbname=bogus' publication pub1;
> ERROR:  could not connect to the publisher: connection to server on
> socket "/tmp/.s.PGSQL.5432" failed: FATAL:  database "bogus" does not
> exist
>
> and logs like:
>
> 2024-01-12 12:45:05.177 AEDT [13108] ERROR:  could not connect to the
> publisher: connection to server on socket "/tmp/.s.PGSQL.5432" failed:
> FATAL:  database "bogus" does not exist
> 2024-01-12 12:45:05.177 AEDT [13108] STATEMENT:  create subscription
> sub1 connection 'dbname=bogus' publication pub1;
>
> Since the subscription name is already obvious from the statement that
> caused the error I'm not sure it benefits much to add this to the
> error message (but maybe it is useful if the error message was somehow
> read in isolation from the statement).
>
> Anyway, I felt at least it should include the word "subscription" for
> better consistency with the other messages in this patch:
>
> SUGGESTION
> subscription \"%s\" could not connect to the publisher: %s

Done.

> ======
>
> 2.
> + appname = cluster_name[0] ? cluster_name : "walreceiver";
> +
>   /* Establish the connection to the primary for XLOG streaming */
> - wrconn = walrcv_connect(conninfo, false, false,
> - cluster_name[0] ? cluster_name : "walreceiver",
> - &err);
> + wrconn = walrcv_connect(conninfo, false, false, appname, &err);
>   if (!wrconn)
>   ereport(ERROR,
>   (errcode(ERRCODE_CONNECTION_FAILURE),
> - errmsg("could not connect to the primary server: %s", err)));
> + errmsg("%s could not connect to the primary server: %s", appname, err)));
>
> I think your new %s should be quoted according to the guidelines at [1].

Done.

> ======
> src/test/regress/expected/subscription.out
>
> 3.
> Apparently, there is no existing regression test case for the ALTER
> "could not connect" message because if there was, it would have
> failed. Maybe a test should be added?
>
The ALTER SUBSCRIPTION command does not error out on the user
interface if updated with a bad connection string and the connection
failure error can only be seen in the respective log file.
Due to this behavior, it is not possible to add a test to show the
error message as it is done for CREATE SUBSCRIPTION.
Let me know if you think there is another way to add this test.

--
Thanks,
Nisha

Вложения

Re: Improve the connection failure error messages

От
Aleksander Alekseev
Дата:
Hi,

Thanks for the patch.

> Due to this behavior, it is not possible to add a test to show the
> error message as it is done for CREATE SUBSCRIPTION.
> Let me know if you think there is another way to add this test.

I believe it can be done with TAP tests, see for instance:

contrib/auto_explain/t/001_auto_explain.pl

However I wouldn't insist on including the test in scope of this
particular patch. Such a test doesn't currently exist, it can be added
as a separate patch, and whether this is actually a useful test (all
the tests consume time after all...) is somewhat debatable. Personally
I agree that it would be nice to have though.

This being said...

> The ALTER SUBSCRIPTION command does not error out on the user
> interface if updated with a bad connection string and the connection
> failure error can only be seen in the respective log file.

I wonder if we should fix this. Again, not necessarily in scope of
this work, but if this is not a difficult task, again, it would be
nice to have.

Other than that v2 looks good.

-- 
Best regards,
Aleksander Alekseev



Re: Improve the connection failure error messages

От
Peter Smith
Дата:
On Sat, Jan 13, 2024 at 12:36 AM Aleksander Alekseev
<aleksander@timescale.com> wrote:
>
> Hi,
>
> Thanks for the patch.
>
> > Due to this behavior, it is not possible to add a test to show the
> > error message as it is done for CREATE SUBSCRIPTION.
> > Let me know if you think there is another way to add this test.
>
> I believe it can be done with TAP tests, see for instance:
>
> contrib/auto_explain/t/001_auto_explain.pl
>
> However I wouldn't insist on including the test in scope of this
> particular patch. Such a test doesn't currently exist, it can be added
> as a separate patch, and whether this is actually a useful test (all
> the tests consume time after all...) is somewhat debatable. Personally
> I agree that it would be nice to have though.
>
> This being said...
>
> > The ALTER SUBSCRIPTION command does not error out on the user
> > interface if updated with a bad connection string and the connection
> > failure error can only be seen in the respective log file.
>
> I wonder if we should fix this. Again, not necessarily in scope of
> this work, but if this is not a difficult task, again, it would be
> nice to have.
>
> Other than that v2 looks good.
>

OK.  I see now that any ALTER of the subscription's connection, even
to some value that fails, will restart a new worker (like ALTER of any
other subscription parameters). For a bad connection, it will continue
to relaunch-worker/ERROR over and over.

test_sub=# \r2024-01-17 09:34:28.665 AEDT [11274] LOG:  logical
replication apply worker for subscription "sub4" has started
2024-01-17 09:34:28.666 AEDT [11274] ERROR:  could not connect to the
publisher: invalid port number: "-1"
2024-01-17 09:34:28.667 AEDT [928] LOG:  background worker "logical
replication apply worker" (PID 11274) exited with exit code 1
dRs su2024-01-17 09:34:33.669 AEDT [11391] LOG:  logical replication
apply worker for subscription "sub4" has started
2024-01-17 09:34:33.669 AEDT [11391] ERROR:  could not connect to the
publisher: invalid port number: "-1"
2024-01-17 09:34:33.670 AEDT [928] LOG:  background worker "logical
replication apply worker" (PID 11391) exited with exit code 1
b4
...

I don't really have any opinion if that behaviour is good or bad, but
anyway, it is deliberate, and IMO it is outside the scope of your
patch, so v2 patch LGTM.

~~

BTW, while experimenting with the bad connection ALTER I also tried
setting 'disable_on_error' like below:

ALTER SUBSCRIPTION sub4 SET (disable_on_error);
ALTER SUBSCRIPTION sub4 CONNECTION 'port = -1';

...but here the subscription did not become DISABLED as I expected it
would do on the next connection error iteration. It remains enabled
and just continues to loop relaunch/ERROR indefinitely same as before.

That looks like it may be a bug. Thoughts?

======
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Improve the connection failure error messages

От
Nisha Moond
Дата:
Thanks for reviewing, please find my response inline.

On Wed, Jan 17, 2024 at 4:56 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Sat, Jan 13, 2024 at 12:36 AM Aleksander Alekseev
> <aleksander@timescale.com> wrote:
> >
> > Hi,
> >
> > Thanks for the patch.
> >
> > > Due to this behavior, it is not possible to add a test to show the
> > > error message as it is done for CREATE SUBSCRIPTION.
> > > Let me know if you think there is another way to add this test.
> >
> > I believe it can be done with TAP tests, see for instance:
> >
> > contrib/auto_explain/t/001_auto_explain.pl
> >
> > However I wouldn't insist on including the test in scope of this
> > particular patch. Such a test doesn't currently exist, it can be added
> > as a separate patch, and whether this is actually a useful test (all
> > the tests consume time after all...) is somewhat debatable. Personally
> > I agree that it would be nice to have though.
> >
> > This being said...
> >
> > > The ALTER SUBSCRIPTION command does not error out on the user
> > > interface if updated with a bad connection string and the connection
> > > failure error can only be seen in the respective log file.
> >
> > I wonder if we should fix this. Again, not necessarily in scope of
> > this work, but if this is not a difficult task, again, it would be
> > nice to have.
> >
> > Other than that v2 looks good.
> >
>
> OK.  I see now that any ALTER of the subscription's connection, even
> to some value that fails, will restart a new worker (like ALTER of any
> other subscription parameters). For a bad connection, it will continue
> to relaunch-worker/ERROR over and over.
>
> test_sub=# \r2024-01-17 09:34:28.665 AEDT [11274] LOG:  logical
> replication apply worker for subscription "sub4" has started
> 2024-01-17 09:34:28.666 AEDT [11274] ERROR:  could not connect to the
> publisher: invalid port number: "-1"
> 2024-01-17 09:34:28.667 AEDT [928] LOG:  background worker "logical
> replication apply worker" (PID 11274) exited with exit code 1
> dRs su2024-01-17 09:34:33.669 AEDT [11391] LOG:  logical replication
> apply worker for subscription "sub4" has started
> 2024-01-17 09:34:33.669 AEDT [11391] ERROR:  could not connect to the
> publisher: invalid port number: "-1"
> 2024-01-17 09:34:33.670 AEDT [928] LOG:  background worker "logical
> replication apply worker" (PID 11391) exited with exit code 1
> b4
> ...
>
> I don't really have any opinion if that behaviour is good or bad, but
> anyway, it is deliberate, and IMO it is outside the scope of your
> patch, so v2 patch LGTM.

Upon code review, the ALTER SUBSCRIPTION updates the connection string
after checking for parse and a few obvious errors and does not attempt
to establish a connection. It is the apply worker running for the
respective subscription that will try to connect and fail in case of a
bad connection string.
To me, it seems an intentional design behavior and I agree that
deciding to change or maintain this behavior is out of this patch's
scope.

--
Thanks,
Nisha



Re: Improve the connection failure error messages

От
Nisha Moond
Дата:
>
> ~~
>
> BTW, while experimenting with the bad connection ALTER I also tried
> setting 'disable_on_error' like below:
>
> ALTER SUBSCRIPTION sub4 SET (disable_on_error);
> ALTER SUBSCRIPTION sub4 CONNECTION 'port = -1';
>
> ...but here the subscription did not become DISABLED as I expected it
> would do on the next connection error iteration. It remains enabled
> and just continues to loop relaunch/ERROR indefinitely same as before.
>
> That looks like it may be a bug. Thoughts?
>
Ideally, if the already running apply worker in
"LogicalRepApplyLoop()" has any exception/error it will be handled and
the subscription will be disabled if 'disable_on_error' is set -

start_apply(XLogRecPtr origin_startpos)
{
PG_TRY();
{
LogicalRepApplyLoop(origin_startpos);
}
PG_CATCH();
{
if (MySubscription->disableonerr)
DisableSubscriptionAndExit();
...

What is happening in this case is that the control reaches the function -
run_apply_worker() -> start_apply() -> LogicalRepApplyLoop ->
maybe_reread_subscription()
...
/*
* Exit if any parameter that affects the remote connection was changed.
* The launcher will start a new worker but note that the parallel apply
* worker won't restart if the streaming option's value is changed from
* 'parallel' to any other value or the server decides not to stream the
* in-progress transaction.
*/
if (strcmp(newsub->conninfo, MySubscription->conninfo) != 0 ||
...

and it sees a change in the parameter and calls apply_worker_exit().
This will exit the current process, without throwing an exception to
the caller and the postmaster will try to restart the apply worker.
The new apply worker, before reaching the start_apply() [where we
handle exception], will hit the code to establish the connection to
the publisher -

ApplyWorkerMain() -> run_apply_worker() -
...
LogRepWorkerWalRcvConn = walrcv_connect(MySubscription->conninfo,
true /* replication */ ,
true,
must_use_password,
MySubscription->name, &err);

if (LogRepWorkerWalRcvConn == NULL)
  ereport(ERROR,
  (errcode(ERRCODE_CONNECTION_FAILURE),
   errmsg("could not connect to the publisher: %s", err)));
...
and due to the bad connection string in the subscription, it will error out.
[28680] ERROR:  could not connect to the publisher: invalid port number: "-1"
[3196] LOG:  background worker "logical replication apply worker" (PID
28680) exited with exit code 1

Now, the postmaster keeps trying to restart the apply worker and it
will keep failing until the connection string is corrected or the
subscription is disabled manually.

I think this is a bug that needs to be handled in run_apply_worker()
when disable_on_error is set.
IMO, this bug-fix discussion deserves a separate thread. Thoughts?



Re: Improve the connection failure error messages

От
Peter Smith
Дата:
On Wed, Jan 17, 2024 at 7:15 PM Nisha Moond <nisha.moond412@gmail.com> wrote:
>
> >
> > ~~
> >
> > BTW, while experimenting with the bad connection ALTER I also tried
> > setting 'disable_on_error' like below:
> >
> > ALTER SUBSCRIPTION sub4 SET (disable_on_error);
> > ALTER SUBSCRIPTION sub4 CONNECTION 'port = -1';
> >
> > ...but here the subscription did not become DISABLED as I expected it
> > would do on the next connection error iteration. It remains enabled
> > and just continues to loop relaunch/ERROR indefinitely same as before.
> >
> > That looks like it may be a bug. Thoughts?
> >
> Ideally, if the already running apply worker in
> "LogicalRepApplyLoop()" has any exception/error it will be handled and
> the subscription will be disabled if 'disable_on_error' is set -
>
> start_apply(XLogRecPtr origin_startpos)
> {
> PG_TRY();
> {
> LogicalRepApplyLoop(origin_startpos);
> }
> PG_CATCH();
> {
> if (MySubscription->disableonerr)
> DisableSubscriptionAndExit();
> ...
>
> What is happening in this case is that the control reaches the function -
> run_apply_worker() -> start_apply() -> LogicalRepApplyLoop ->
> maybe_reread_subscription()
> ...
> /*
> * Exit if any parameter that affects the remote connection was changed.
> * The launcher will start a new worker but note that the parallel apply
> * worker won't restart if the streaming option's value is changed from
> * 'parallel' to any other value or the server decides not to stream the
> * in-progress transaction.
> */
> if (strcmp(newsub->conninfo, MySubscription->conninfo) != 0 ||
> ...
>
> and it sees a change in the parameter and calls apply_worker_exit().
> This will exit the current process, without throwing an exception to
> the caller and the postmaster will try to restart the apply worker.
> The new apply worker, before reaching the start_apply() [where we
> handle exception], will hit the code to establish the connection to
> the publisher -
>
> ApplyWorkerMain() -> run_apply_worker() -
> ...
> LogRepWorkerWalRcvConn = walrcv_connect(MySubscription->conninfo,
> true /* replication */ ,
> true,
> must_use_password,
> MySubscription->name, &err);
>
> if (LogRepWorkerWalRcvConn == NULL)
>   ereport(ERROR,
>   (errcode(ERRCODE_CONNECTION_FAILURE),
>    errmsg("could not connect to the publisher: %s", err)));
> ...
> and due to the bad connection string in the subscription, it will error out.
> [28680] ERROR:  could not connect to the publisher: invalid port number: "-1"
> [3196] LOG:  background worker "logical replication apply worker" (PID
> 28680) exited with exit code 1
>
> Now, the postmaster keeps trying to restart the apply worker and it
> will keep failing until the connection string is corrected or the
> subscription is disabled manually.
>
> I think this is a bug that needs to be handled in run_apply_worker()
> when disable_on_error is set.
> IMO, this bug-fix discussion deserves a separate thread. Thoughts?

Hi Nisha,

Thanks for your analysis -- it is the same as my understanding.

As suggested, I have created a new thread for any further discussion
related to this 'disable_on_error' topic [1].

======
[1]
https://www.postgresql.org/message-id/flat/CAHut%2BPuEsekA3e7ThwzWr%2BUs4x%3DLzkF7DSrED1UsZTUqNrhCUQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Improve the connection failure error messages

От
Nisha Moond
Дата:
On Fri, Jan 12, 2024 at 7:06 PM Aleksander Alekseev
<aleksander@timescale.com> wrote:
>
> Hi,
>
> Thanks for the patch.
>
> > Due to this behavior, it is not possible to add a test to show the
> > error message as it is done for CREATE SUBSCRIPTION.
> > Let me know if you think there is another way to add this test.
>
> I believe it can be done with TAP tests, see for instance:
>
> contrib/auto_explain/t/001_auto_explain.pl
>
> However I wouldn't insist on including the test in scope of this
> particular patch. Such a test doesn't currently exist, it can be added
> as a separate patch, and whether this is actually a useful test (all
> the tests consume time after all...) is somewhat debatable. Personally
> I agree that it would be nice to have though.
Thank you for providing this information. Yes, I can write a TAP test
to check the log for the same error message.
I'll attempt it and perform a time analysis. I'm unsure where to
appropriately add this test. Any suggestions?

Following your suggestion, I won't include the test in the scope of
this patch. Instead, I'll start a new thread once I have sufficient
information required.

--
Thanks,
Nisha



Re: Improve the connection failure error messages

От
Peter Smith
Дата:
AFAIK some recent commits patches (e,g [1]  for the "slot sync"
development) have created some more cases of "could not connect..."
messages. So, you might need to enhance your patch to deal with any
new ones in the latest HEAD.

======
[1] https://github.com/postgres/postgres/commit/776621a5e4796fa214b6b29a7ca134f6c138572a

Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Improve the connection failure error messages

От
Nisha Moond
Дата:

AFAIK some recent commits patches (e,g [1]  for the "slot sync"
development) have created some more cases of "could not connect..."
messages. So, you might need to enhance your patch to deal with any
new ones in the latest HEAD.

======
[1] https://github.com/postgres/postgres/commit/776621a5e4796fa214b6b29a7ca134f6c138572a

Thank you for the update.
The v3 patch has the changes needed as per the latest HEAD.

--
Thanks,
Nisha
Вложения

Re: Improve the connection failure error messages

От
Peter Smith
Дата:
On Wed, Jan 31, 2024 at 9:58 PM Nisha Moond <nisha.moond412@gmail.com> wrote:
>
>
>> AFAIK some recent commits patches (e,g [1]  for the "slot sync"
>> development) have created some more cases of "could not connect..."
>> messages. So, you might need to enhance your patch to deal with any
>> new ones in the latest HEAD.
>>
>> ======
>> [1] https://github.com/postgres/postgres/commit/776621a5e4796fa214b6b29a7ca134f6c138572a
>>
> Thank you for the update.
> The v3 patch has the changes needed as per the latest HEAD.
>

Hi, just going by visual inspection of the v2/v3 patch diffs, the
latest v3 LGTM.

======
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Improve the connection failure error messages

От
Peter Smith
Дата:
FYI -- some more code has been pushed since this patch was last
updated. AFAICT perhaps you'll want to update this patch again for the
following new connection messages on HEAD:

- slotfuncs.c [1]
- slotsync.c [2]

----------
[1]
https://github.com/postgres/postgres/blob/0b84f5c419a300dc1b1a70cf63b9907208e52643/src/backend/replication/slotfuncs.c#L989
[2]
https://github.com/postgres/postgres/blob/0b84f5c419a300dc1b1a70cf63b9907208e52643/src/backend/replication/logical/slotsync.c#L1258

Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Improve the connection failure error messages

От
Nisha Moond
Дата:
On Wed, Mar 13, 2024 at 11:16 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> FYI -- some more code has been pushed since this patch was last
> updated. AFAICT perhaps you'll want to update this patch again for the
> following new connection messages on HEAD:
>
> - slotfuncs.c [1]
> - slotsync.c [2]
>
> ----------
> [1]
https://github.com/postgres/postgres/blob/0b84f5c419a300dc1b1a70cf63b9907208e52643/src/backend/replication/slotfuncs.c#L989
> [2]
https://github.com/postgres/postgres/blob/0b84f5c419a300dc1b1a70cf63b9907208e52643/src/backend/replication/logical/slotsync.c#L1258
>
Thanks for the update.
Here is the v4 patch with changes required in slotfuncs.c and slotsync.c files.

--
Thanks,
Nisha

Вложения

Re: Improve the connection failure error messages

От
Peter Smith
Дата:
Hi, just by visual inspection of the v3/v4 patch diffs, the latest v4 LGTM.

======
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Improve the connection failure error messages

От
Daniel Gustafsson
Дата:
> On 22 Mar 2024, at 11:42, Nisha Moond <nisha.moond412@gmail.com> wrote:

> Here is the v4 patch with changes required in slotfuncs.c and slotsync.c files.

-    errmsg("could not connect to the primary server: %s", err));
+    errmsg("\"%s\" could not connect to the primary server: %s", app_name.data, err));

Messages like this should perhaps have translator comments to indicate what the
leading "%s" will contain?

--
Daniel Gustafsson