Обсуждение: Inconsistent error handling in START_REPLICATION command

Поиск
Список
Период
Сортировка

Inconsistent error handling in START_REPLICATION command

От
"Shulgin, Oleksandr"
Дата:
Hackers,

It looks like there's an inconsistency in error handling during START_REPLICATION command of replication protocol:

$ psql postgres://localhost/psycopg2test?replication=database
psql (9.6devel)
Type "help" for help.

psycopg2test=# IDENTIFY_SYSTEM;
      systemid       | timeline |  xlogpos  |    dbname    
---------------------+----------+-----------+--------------
 6235978519197579707 |        1 | 0/2CE0F78 | psycopg2test
(1 row)

psycopg2test=# START_REPLICATION SLOT "TEST1" LOGICAL 0/0 ("invalid" "value");
ERROR:  syntax error

1) syntax errors are reported and client can retry with corrected command:

psycopg2test=# START_REPLICATION SLOT "TEST1" LOGICAL 0/0 ("invalid" 'value');
ERROR:  replication slot name "TEST1" contains invalid character
HINT:  Replication slot names may only contain lower case letters, numbers, and the underscore character.

2) further errors are reported and we can retry:

psycopg2test=# START_REPLICATION SLOT "test1" LOGICAL 0/0 ("invalid" 'value');
ERROR:  replication slot "test1" does not exist

psycopg2test=# CREATE_REPLICATION_SLOT "test1" LOGICAL "test_decoding";
 slot_name | consistent_point | snapshot_name | output_plugin 
-----------+------------------+---------------+---------------
 test1     | 0/2CF5340        | 0000088C-1    | test_decoding
(1 row)

psycopg2test=# START_REPLICATION SLOT "test1" LOGICAL 0/0 ("invalid" 'value');
unexpected PQresultStatus: 8

The last command results in the following output sent to the server log:

ERROR:  option "invalid" = "value" is unknown
CONTEXT:  slot "test1", output plugin "test_decoding", in the startup callback

But the client has no way to learn about the error, nor can it restart with correct one (the server has entered COPY protocol mode):

psycopg2test=# START_REPLICATION SLOT "test1" LOGICAL 0/0;
PQexec not allowed during COPY BOTH

I recall Craig Rigner mentioning this issue in context of the pglogical_output plugin, but I thought that was something to do with the startup packet.  The behavior above doesn't strike me as very consistent, we should be able to detect and report errors during output plugin startup and let the client retry with the corrected command as we do for syntax or other errors.

I didn't look in the code yet, but if someone knows off top of the head the reason to this behavior, I'd be glad to learn it.

Cheers.
--
Oleksandr "Alex" Shulgin
Database Engineer

Mobile: +49 160 84-90-639
Email: oleksandr.shulgin@zalando.de

Re: Inconsistent error handling in START_REPLICATION command

От
"Shulgin, Oleksandr"
Дата:
On Tue, Jan 5, 2016 at 10:39 AM, Shulgin, Oleksandr <oleksandr.shulgin@zalando.de> wrote:
Hackers,

It looks like there's an inconsistency in error handling during START_REPLICATION command of replication protocol:

$ psql postgres://localhost/psycopg2test?replication=database
psql (9.6devel)
Type "help" for help.

psycopg2test=# IDENTIFY_SYSTEM;
      systemid       | timeline |  xlogpos  |    dbname    
---------------------+----------+-----------+--------------
 6235978519197579707 |        1 | 0/2CE0F78 | psycopg2test
(1 row)

psycopg2test=# START_REPLICATION SLOT "TEST1" LOGICAL 0/0 ("invalid" "value");
ERROR:  syntax error

1) syntax errors are reported and client can retry with corrected command:

psycopg2test=# START_REPLICATION SLOT "TEST1" LOGICAL 0/0 ("invalid" 'value');
ERROR:  replication slot name "TEST1" contains invalid character
HINT:  Replication slot names may only contain lower case letters, numbers, and the underscore character.

2) further errors are reported and we can retry:

psycopg2test=# START_REPLICATION SLOT "test1" LOGICAL 0/0 ("invalid" 'value');
ERROR:  replication slot "test1" does not exist

psycopg2test=# CREATE_REPLICATION_SLOT "test1" LOGICAL "test_decoding";
 slot_name | consistent_point | snapshot_name | output_plugin 
-----------+------------------+---------------+---------------
 test1     | 0/2CF5340        | 0000088C-1    | test_decoding
(1 row)

psycopg2test=# START_REPLICATION SLOT "test1" LOGICAL 0/0 ("invalid" 'value');
unexpected PQresultStatus: 8

The last command results in the following output sent to the server log:

ERROR:  option "invalid" = "value" is unknown
CONTEXT:  slot "test1", output plugin "test_decoding", in the startup callback

But the client has no way to learn about the error, nor can it restart with correct one (the server has entered COPY protocol mode):

psycopg2test=# START_REPLICATION SLOT "test1" LOGICAL 0/0;
PQexec not allowed during COPY BOTH

I recall Craig Rigner mentioning this issue in context of the pglogical_output plugin, but I thought that was something to do with the startup packet.  The behavior above doesn't strike me as very consistent, we should be able to detect and report errors during output plugin startup and let the client retry with the corrected command as we do for syntax or other errors.

I didn't look in the code yet, but if someone knows off top of the head the reason to this behavior, I'd be glad to learn it.

Looks like the attached patch fixes it for me:

psycopg2test=# START_REPLICATION SLOT "test1" LOGICAL 0/0 ("invalid" 'value');
ERROR:  option "invalid" = "value" is unknown
CONTEXT:  slot "test1", output plugin "test_decoding", in the startup callback

psycopg2test=# START_REPLICATION SLOT "test1" LOGICAL 0/0;
unexpected PQresultStatus: 8

There was a similar problem with PHYSICAL replication--if the requested start LSN is in the future, the error is not reported to the client and connection is unusable:

psycopg2test=# START_REPLICATION SLOT "test2" PHYSICAL 1FF0/0;
unexpected PQresultStatus: 8

psycopg2test=# START_REPLICATION SLOT "test2" PHYSICAL 0/0;
PQexec not allowed during COPY BOTH

After the patch:

psycopg2test=# START_REPLICATION SLOT "test2" PHYSICAL 1FF0/0;
ERROR:  requested starting point 1FF0/0 is ahead of the WAL flush position of this server 0/2DE6C28

psycopg2test=# START_REPLICATION SLOT "test2" PHYSICAL 0/0;
unexpected PQresultStatus: 8


Actually, would it make sense to move the block that sends CopyBothResponse message to the WalSndLoop() instead?  I think you can't get any closer than that...

--
Alex

Вложения

Re: Inconsistent error handling in START_REPLICATION command

От
"Shulgin, Oleksandr"
Дата:
On Tue, Jan 5, 2016 at 11:35 AM, Shulgin, Oleksandr <oleksandr.shulgin@zalando.de> wrote:
On Tue, Jan 5, 2016 at 10:39 AM, Shulgin, Oleksandr <oleksandr.shulgin@zalando.de> wrote:

I didn't look in the code yet, but if someone knows off top of the head the reason to this behavior, I'd be glad to learn it.

Looks like the attached patch fixes it for me:
 
Should I add it to the current commitfest?  It seems to me, the patch is trivial and unambiguous, so that might be an overkill...

Thanks.
--
Alex

Re: Inconsistent error handling in START_REPLICATION command

От
"Shulgin, Oleksandr"
Дата:
On Wed, Jan 6, 2016 at 11:38 AM, Shulgin, Oleksandr <oleksandr.shulgin@zalando.de> wrote:
On Tue, Jan 5, 2016 at 11:35 AM, Shulgin, Oleksandr <oleksandr.shulgin@zalando.de> wrote:

Looks like the attached patch fixes it for me:
 
Should I add it to the current commitfest?  It seems to me, the patch is trivial and unambiguous, so that might be an overkill...


Re: Inconsistent error handling in START_REPLICATION command

От
Craig Ringer
Дата:


On 5 January 2016 at 18:35, Shulgin, Oleksandr <oleksandr.shulgin@zalando.de> wrote:
 
psycopg2test=# START_REPLICATION SLOT "test1" LOGICAL 0/0 ("invalid" 'value');
unexpected PQresultStatus: 8

I think the point here is that START_REPLICATION issues useful errors and returns to the normal protocol up until the logical decoding startup callback is invoked.

It enters COPY BOTH mode before it invokes the startup callback. The client has no way to unilaterally terminate COPY BOTH mode and return to the normal walsender protocol. The server doesn't do it when there's an ERROR. So the only option the client has for recovery is to disconnect and reconnect.
 
I recall Craig Rigner mentioning this issue in context of the pglogical_output plugin, but I thought that was something to do with the startup packet.  The behavior above doesn't strike me as very consistent, we should be able to detect and report errors during output plugin startup and let the client retry with the corrected command as we do for syntax or other errors.

I agree. This would also let clients have some limited facility for negotiating options with the output plugin, albeit via the very clumsy channel of libpq protocol error message records and their text.


 
Looks like the attached patch fixes it for me:

For those who didn't read it, the patch moves the start of COPY BOTH mode to after the logical decoding startup callback is invoked.
 
This certainly fixes the immediate issue. It also forecloses something else I'd really like though, which is the ability for plugins to send output to clients from within their decoding startup hook.

For that reason I'd actually like to enter COPY BOTH mode before the startup callback, as is currently the case. So I'd like to wrap the decoding startup callback in a PG_TRY that catches an ERROR raised by the startup callback (if any) and exits COPY BOTH mode before re-throwing.

Sound reasonable?


I'm looking toward a future where the decoding startup callback can emit a structured, plugin-specific CopyData-wrapped message with information the downstream might need to negotiate a connection. It could then exit back to walsender mode by throwing an ERROR. Or, better, by setting some new decoding context field that's checked after each callback to see if the plugin wants to exit COPY BOTH mode and return to walsender protocol mode without spamming the logs with an ERROR. That way we can finally do two-way negotiation between downstream and upstream.

(Related to this, I also want to add a new decoding callback that lets the downstream send arbitrary CopyData back up the wire to the upstream where it's passed to the decoding callback. This would give the downstream a backchannel to change settings, request things from the output plugin, etc, w/o lots of disconnects and reconnects.)

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Inconsistent error handling in START_REPLICATION command

От
Craig Ringer
Дата:
On 20 January 2016 at 15:28, Craig Ringer <craig@2ndquadrant.com> wrote:

For that reason I'd actually like to enter COPY BOTH mode before the startup callback, as is currently the case. So I'd like to wrap the decoding startup callback in a PG_TRY that catches an ERROR raised by the startup callback (if any) and exits COPY BOTH mode before re-throwing.

 
I hoped this'd be as simple as:

      PG_TRY();
      {
          logical_decoding_ctx = CreateDecodingContext(
                                                     cmd->startpoint, cmd->options,
                                                       logical_read_xlog_page,
                                              WalSndPrepareWrite, WalSndWriteData);
      }
      PG_CATCH();
      {
          pq_putmessage_noblock('c', NULL, 0);
          pq_flush();
          PG_RE_THROW();
      }
      PG_END_TRY();
  

but that's doesn't look sufficient, presumably as the client never has a chance to ack the exit from CopyBoth mode. Or maybe libpq just isn't prepared to cope with existing COPY then immediately getting an error.

What you have now works and fixes the immediate problem. Preserving the early start of COPY mode isn't actually useful until we're also prepared to accept writes from the output plugin's startup callback at that point, which we are not yet.

Maybe we should just apply your patch as-is, then move the start of COPY back if/when support for emitting output from within a logical decoding startup callback gets added.


--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Inconsistent error handling in START_REPLICATION command

От
Robert Haas
Дата:
On Wed, Jan 20, 2016 at 2:28 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> It enters COPY BOTH mode before it invokes the startup callback. The client
> has no way to unilaterally terminate COPY BOTH mode and return to the normal
> walsender protocol. The server doesn't do it when there's an ERROR. So the
> only option the client has for recovery is to disconnect and reconnect.

This isn't my understanding of how the protocol works, and it's not
what the documentation says either.

Per http://www.postgresql.org/docs/current/static/protocol-flow.html:

===
In the event of a backend-detected error during copy-both mode, the
backend will issue an ErrorResponse message, discard frontend messages
until a Sync message is received, and then issue ReadyForQuery and
return to normal processing. The frontend should treat receipt of
ErrorResponse as terminating the copy in both directions; no CopyDone
should be sent in this case.
===

So it's true that the client can't unilaterally terminate COPY BOTH
mode; it can only send CopyDone.  But an error on the server side
should do so.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Inconsistent error handling in START_REPLICATION command

От
"Shulgin, Oleksandr"
Дата:
On Thu, Jan 21, 2016 at 3:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Jan 20, 2016 at 2:28 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> It enters COPY BOTH mode before it invokes the startup callback. The client
> has no way to unilaterally terminate COPY BOTH mode and return to the normal
> walsender protocol. The server doesn't do it when there's an ERROR. So the
> only option the client has for recovery is to disconnect and reconnect.

This isn't my understanding of how the protocol works, and it's not
what the documentation says either.

Per http://www.postgresql.org/docs/current/static/protocol-flow.html:

===
In the event of a backend-detected error during copy-both mode, the
backend will issue an ErrorResponse message, discard frontend messages
until a Sync message is received, and then issue ReadyForQuery and
return to normal processing. The frontend should treat receipt of
ErrorResponse as terminating the copy in both directions; no CopyDone
should be sent in this case.
===

So it's true that the client can't unilaterally terminate COPY BOTH
mode; it can only send CopyDone.  But an error on the server side
should do so.

Hm, you're right.  Even though the server sends COPY_BOTH message before the plugin startup, an attempt by the client to actually read the data messages results in ErrorResponse handled on the client, then the client can re-submit the corrected START_REPLICATION command and continue without the need to reconnect.  This cannot be actually tested in psql, but I can verify the behavior with my python scripts.

Then I don't have a preference for the early error reporting in this case.  If the current behavior potentially allows for more flexible error reporting, I'm for keeping it.

--
Alex

Re: Inconsistent error handling in START_REPLICATION command

От
David Steele
Дата:
On 1/21/16 9:53 AM, Shulgin, Oleksandr wrote:
On Thu, Jan 21, 2016 at 3:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:

So it's true that the client can't unilaterally terminate COPY BOTH
mode; it can only send CopyDone.  But an error on the server side
should do so.

Hm, you're right.  Even though the server sends COPY_BOTH message before the plugin startup, an attempt by the client to actually read the data messages results in ErrorResponse handled on the client, then the client can re-submit the corrected START_REPLICATION command and continue without the need to reconnect.  This cannot be actually tested in psql, but I can verify the behavior with my python scripts.

Then I don't have a preference for the early error reporting in this case.  If the current behavior potentially allows for more flexible error reporting, I'm for keeping it.

It looks like a decision needs to be made here whether to apply this patch, send it back to the author, or reject it so I'm marking it "Ready for Committer".

Robert, since you were participating in this conversation can you have a look?

-- 
-David
david@pgmasters.net
Вложения

Re: Inconsistent error handling in START_REPLICATION command

От
Robert Haas
Дата:
On Fri, Mar 11, 2016 at 11:36 AM, David Steele <david@pgmasters.net> wrote:
> It looks like a decision needs to be made here whether to apply this patch,
> send it back to the author, or reject it so I'm marking it "Ready for
> Committer".
>
> Robert, since you were participating in this conversation can you have a
> look?

Who, me?  *looks around*

OK, I had a look.  I don't think this is a bug.  The original report
is complaining about this:

=================
psycopg2test=# START_REPLICATION SLOT "test1" LOGICAL 0/0 ("invalid" 'value');
unexpected PQresultStatus: 8

The last command results in the following output sent to the server log:

ERROR:  option "invalid" = "value" is unknown
CONTEXT:  slot "test1", output plugin "test_decoding", in the startup callback
=================

But that's just because psql knows utterly zero about copy-both mode.
So the problem here isn't that the server is doing anything it
shouldn't be doing; the server code is totally legitimate.  It's just
that the client is stupid.

Now, we could commit this patch anyway.  It won't hurt anything.  But
it won't help very much, either.  The fundamental issue is that
issuing a START_REPLICATION_SLOT command in psql can't be expected to
work.  If you do that, it's either going to fail with a server error,
or it's going to fail with "unexpected PQresultStatus: 8".  Committing
this patch would cause this particular example to fail in the first
way rather than the second one, and I'm not direly opposed to that if
somebody really feels strongly about it.  But it seems like the wrong
thing to focus on.  I think the real question is: why are you
insisting on issuing a command in psql that can NEVER succeed there?
If you want that to work, you're going to need to add copy-both
support to psql, which might be a fun project but has nothing to do
with this patch.

So I am inclined to think it's fine to reject this, unless somebody
wants to argue that there's some advantage to changing it than I am
missing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Inconsistent error handling in START_REPLICATION command

От
Tom Lane
Дата:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Mar 11, 2016 at 11:36 AM, David Steele <david@pgmasters.net> wrote:
>> It looks like a decision needs to be made here whether to apply this patch,
>> send it back to the author, or reject it so I'm marking it "Ready for
>> Committer".
>> 
>> Robert, since you were participating in this conversation can you have a
>> look?

> Who, me?  *looks around*
> 
> OK, I had a look.  I don't think this is a bug.

I agree, and Craig's complaint that this change reduces flexibility
for plugins seems to tilt the balance against it.  I see that Alex
himself accepted that argument in his last message in the thread
(<CACACo5Sm-tY-9YijuLbXmcLmSkYtcTaVQh_1j4tADfjC+1-bCw@mail.gmail.com>).
So let's mark it rejected and move on.

> unexpected PQresultStatus: 8

It is a bit annoying that psql doesn't print "COPY_BOTH" here.
But fixing that is not this patch, either.
        regards, tom lane



Re: Inconsistent error handling in START_REPLICATION command

От
David Steele
Дата:
On 3/11/16 1:11 PM, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Fri, Mar 11, 2016 at 11:36 AM, David Steele <david@pgmasters.net> wrote:
>>> It looks like a decision needs to be made here whether to apply this patch,
>>> send it back to the author, or reject it so I'm marking it "Ready for
>>> Committer".
>>>
>>> Robert, since you were participating in this conversation can you have a
>>> look?
>
>> Who, me?  *looks around*
>>
>> OK, I had a look.  I don't think this is a bug.
>
> I agree, and Craig's complaint that this change reduces flexibility
> for plugins seems to tilt the balance against it.  I see that Alex
> himself accepted that argument in his last message in the thread
> (<CACACo5Sm-tY-9YijuLbXmcLmSkYtcTaVQh_1j4tADfjC+1-bCw@mail.gmail.com>).
>
> So let's mark it rejected and move on.

+1 and done.

--
-David
david@pgmasters.net