Обсуждение: Inconsistent error handling in START_REPLICATION command
Hackers,
It looks like there's an inconsistency in error handling during START_REPLICATION command of replication protocol:
$ psql postgres://localhost/psycopg2test?replication=database
psql (9.6devel)
Type "help" for help.
psycopg2test=# IDENTIFY_SYSTEM;
systemid | timeline | xlogpos | dbname
---------------------+----------+-----------+--------------
6235978519197579707 | 1 | 0/2CE0F78 | psycopg2test
(1 row)
psycopg2test=# START_REPLICATION SLOT "TEST1" LOGICAL 0/0 ("invalid" "value");
ERROR: syntax error
1) syntax errors are reported and client can retry with corrected command:
psycopg2test=# START_REPLICATION SLOT "TEST1" LOGICAL 0/0 ("invalid" 'value');
ERROR: replication slot name "TEST1" contains invalid character
HINT: Replication slot names may only contain lower case letters, numbers, and the underscore character.
2) further errors are reported and we can retry:
psycopg2test=# START_REPLICATION SLOT "test1" LOGICAL 0/0 ("invalid" 'value');
ERROR: replication slot "test1" does not exist
psycopg2test=# CREATE_REPLICATION_SLOT "test1" LOGICAL "test_decoding";
slot_name | consistent_point | snapshot_name | output_plugin
-----------+------------------+---------------+---------------
test1 | 0/2CF5340 | 0000088C-1 | test_decoding
(1 row)
psycopg2test=# START_REPLICATION SLOT "test1" LOGICAL 0/0 ("invalid" 'value');
unexpected PQresultStatus: 8
The last command results in the following output sent to the server log:
ERROR: option "invalid" = "value" is unknown
CONTEXT: slot "test1", output plugin "test_decoding", in the startup callback
But the client has no way to learn about the error, nor can it restart with correct one (the server has entered COPY protocol mode):
psycopg2test=# START_REPLICATION SLOT "test1" LOGICAL 0/0;
PQexec not allowed during COPY BOTH
I recall Craig Rigner mentioning this issue in context of the pglogical_output plugin, but I thought that was something to do with the startup packet. The behavior above doesn't strike me as very consistent, we should be able to detect and report errors during output plugin startup and let the client retry with the corrected command as we do for syntax or other errors.
I didn't look in the code yet, but if someone knows off top of the head the reason to this behavior, I'd be glad to learn it.
Cheers.
-- Oleksandr "Alex" Shulgin
Database Engineer
Mobile: +49 160 84-90-639
Email: oleksandr.shulgin@zalando.de
Database Engineer
Mobile: +49 160 84-90-639
Email: oleksandr.shulgin@zalando.de
On Tue, Jan 5, 2016 at 10:39 AM, Shulgin, Oleksandr <oleksandr.shulgin@zalando.de> wrote:
Hackers,It looks like there's an inconsistency in error handling during START_REPLICATION command of replication protocol:$ psql postgres://localhost/psycopg2test?replication=databasepsql (9.6devel)Type "help" for help.psycopg2test=# IDENTIFY_SYSTEM;systemid | timeline | xlogpos | dbname---------------------+----------+-----------+--------------6235978519197579707 | 1 | 0/2CE0F78 | psycopg2test(1 row)psycopg2test=# START_REPLICATION SLOT "TEST1" LOGICAL 0/0 ("invalid" "value");ERROR: syntax error1) syntax errors are reported and client can retry with corrected command:psycopg2test=# START_REPLICATION SLOT "TEST1" LOGICAL 0/0 ("invalid" 'value');ERROR: replication slot name "TEST1" contains invalid characterHINT: Replication slot names may only contain lower case letters, numbers, and the underscore character.2) further errors are reported and we can retry:psycopg2test=# START_REPLICATION SLOT "test1" LOGICAL 0/0 ("invalid" 'value');ERROR: replication slot "test1" does not existpsycopg2test=# CREATE_REPLICATION_SLOT "test1" LOGICAL "test_decoding";slot_name | consistent_point | snapshot_name | output_plugin-----------+------------------+---------------+---------------test1 | 0/2CF5340 | 0000088C-1 | test_decoding(1 row)psycopg2test=# START_REPLICATION SLOT "test1" LOGICAL 0/0 ("invalid" 'value');unexpected PQresultStatus: 8The last command results in the following output sent to the server log:ERROR: option "invalid" = "value" is unknownCONTEXT: slot "test1", output plugin "test_decoding", in the startup callbackBut the client has no way to learn about the error, nor can it restart with correct one (the server has entered COPY protocol mode):psycopg2test=# START_REPLICATION SLOT "test1" LOGICAL 0/0;PQexec not allowed during COPY BOTHI recall Craig Rigner mentioning this issue in context of the pglogical_output plugin, but I thought that was something to do with the startup packet. The behavior above doesn't strike me as very consistent, we should be able to detect and report errors during output plugin startup and let the client retry with the corrected command as we do for syntax or other errors.I didn't look in the code yet, but if someone knows off top of the head the reason to this behavior, I'd be glad to learn it.
Looks like the attached patch fixes it for me:
psycopg2test=# START_REPLICATION SLOT "test1" LOGICAL 0/0 ("invalid" 'value');
ERROR: option "invalid" = "value" is unknown
CONTEXT: slot "test1", output plugin "test_decoding", in the startup callback
psycopg2test=# START_REPLICATION SLOT "test1" LOGICAL 0/0;
unexpected PQresultStatus: 8
There was a similar problem with PHYSICAL replication--if the requested start LSN is in the future, the error is not reported to the client and connection is unusable:
psycopg2test=# START_REPLICATION SLOT "test2" PHYSICAL 1FF0/0;
unexpected PQresultStatus: 8
psycopg2test=# START_REPLICATION SLOT "test2" PHYSICAL 0/0;
PQexec not allowed during COPY BOTH
After the patch:
psycopg2test=# START_REPLICATION SLOT "test2" PHYSICAL 1FF0/0;
ERROR: requested starting point 1FF0/0 is ahead of the WAL flush position of this server 0/2DE6C28
psycopg2test=# START_REPLICATION SLOT "test2" PHYSICAL 0/0;
unexpected PQresultStatus: 8
Actually, would it make sense to move the block that sends CopyBothResponse message to the WalSndLoop() instead? I think you can't get any closer than that...
--
Alex
Вложения
On Tue, Jan 5, 2016 at 11:35 AM, Shulgin, Oleksandr <oleksandr.shulgin@zalando.de> wrote:
On Tue, Jan 5, 2016 at 10:39 AM, Shulgin, Oleksandr <oleksandr.shulgin@zalando.de> wrote:I didn't look in the code yet, but if someone knows off top of the head the reason to this behavior, I'd be glad to learn it.Looks like the attached patch fixes it for me:
Should I add it to the current commitfest? It seems to me, the patch is trivial and unambiguous, so that might be an overkill...
Thanks.
--
Alex
On Wed, Jan 6, 2016 at 11:38 AM, Shulgin, Oleksandr <oleksandr.shulgin@zalando.de> wrote:
On Tue, Jan 5, 2016 at 11:35 AM, Shulgin, Oleksandr <oleksandr.shulgin@zalando.de> wrote:Looks like the attached patch fixes it for me:Should I add it to the current commitfest? It seems to me, the patch is trivial and unambiguous, so that might be an overkill...
Added to the Open one: https://commitfest.postgresql.org/9/474/
On 5 January 2016 at 18:35, Shulgin, Oleksandr <oleksandr.shulgin@zalando.de> wrote:
psycopg2test=# START_REPLICATION SLOT "test1" LOGICAL 0/0 ("invalid" 'value');unexpected PQresultStatus: 8
I think the point here is that START_REPLICATION issues useful errors and returns to the normal protocol up until the logical decoding startup callback is invoked.
It enters COPY BOTH mode before it invokes the startup callback. The client has no way to unilaterally terminate COPY BOTH mode and return to the normal walsender protocol. The server doesn't do it when there's an ERROR. So the only option the client has for recovery is to disconnect and reconnect.
I recall Craig Rigner mentioning this issue in context of the pglogical_output plugin, but I thought that was something to do with the startup packet. The behavior above doesn't strike me as very consistent, we should be able to detect and report errors during output plugin startup and let the client retry with the corrected command as we do for syntax or other errors.
I agree. This would also let clients have some limited facility for negotiating options with the output plugin, albeit via the very clumsy channel of libpq protocol error message records and their text.
Looks like the attached patch fixes it for me:
For those who didn't read it, the patch moves the start of COPY BOTH mode to after the logical decoding startup callback is invoked.
This certainly fixes the immediate issue. It also forecloses something else I'd really like though, which is the ability for plugins to send output to clients from within their decoding startup hook.
For that reason I'd actually like to enter COPY BOTH mode before the startup callback, as is currently the case. So I'd like to wrap the decoding startup callback in a PG_TRY that catches an ERROR raised by the startup callback (if any) and exits COPY BOTH mode before re-throwing.
Sound reasonable?
I'm looking toward a future where the decoding startup callback can emit a structured, plugin-specific CopyData-wrapped message with information the downstream might need to negotiate a connection. It could then exit back to walsender mode by throwing an ERROR. Or, better, by setting some new decoding context field that's checked after each callback to see if the plugin wants to exit COPY BOTH mode and return to walsender protocol mode without spamming the logs with an ERROR. That way we can finally do two-way negotiation between downstream and upstream.
(Related to this, I also want to add a new decoding callback that lets the downstream send arbitrary CopyData back up the wire to the upstream where it's passed to the decoding callback. This would give the downstream a backchannel to change settings, request things from the output plugin, etc, w/o lots of disconnects and reconnects.)
On 20 January 2016 at 15:28, Craig Ringer <craig@2ndquadrant.com> wrote:
I hoped this'd be as simple as:For that reason I'd actually like to enter COPY BOTH mode before the startup callback, as is currently the case. So I'd like to wrap the decoding startup callback in a PG_TRY that catches an ERROR raised by the startup callback (if any) and exits COPY BOTH mode before re-throwing.
PG_TRY();
{
logical_decoding_ctx = CreateDecodingContext(
cmd->startpoint, cmd->options,
logical_read_xlog_page,
WalSndPrepareWrite, WalSndWriteData);
}
PG_CATCH();
{
pq_putmessage_noblock('c', NULL, 0);
pq_flush();
PG_RE_THROW();
}
PG_END_TRY();
but that's doesn't look sufficient, presumably as the client never has a chance to ack the exit from CopyBoth mode. Or maybe libpq just isn't prepared to cope with existing COPY then immediately getting an error.
What you have now works and fixes the immediate problem. Preserving the early start of COPY mode isn't actually useful until we're also prepared to accept writes from the output plugin's startup callback at that point, which we are not yet.
Maybe we should just apply your patch as-is, then move the start of COPY back if/when support for emitting output from within a logical decoding startup callback gets added.
On Wed, Jan 20, 2016 at 2:28 AM, Craig Ringer <craig@2ndquadrant.com> wrote: > It enters COPY BOTH mode before it invokes the startup callback. The client > has no way to unilaterally terminate COPY BOTH mode and return to the normal > walsender protocol. The server doesn't do it when there's an ERROR. So the > only option the client has for recovery is to disconnect and reconnect. This isn't my understanding of how the protocol works, and it's not what the documentation says either. Per http://www.postgresql.org/docs/current/static/protocol-flow.html: === In the event of a backend-detected error during copy-both mode, the backend will issue an ErrorResponse message, discard frontend messages until a Sync message is received, and then issue ReadyForQuery and return to normal processing. The frontend should treat receipt of ErrorResponse as terminating the copy in both directions; no CopyDone should be sent in this case. === So it's true that the client can't unilaterally terminate COPY BOTH mode; it can only send CopyDone. But an error on the server side should do so. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jan 21, 2016 at 3:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Jan 20, 2016 at 2:28 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> It enters COPY BOTH mode before it invokes the startup callback. The client
> has no way to unilaterally terminate COPY BOTH mode and return to the normal
> walsender protocol. The server doesn't do it when there's an ERROR. So the
> only option the client has for recovery is to disconnect and reconnect.
This isn't my understanding of how the protocol works, and it's not
what the documentation says either.
Per http://www.postgresql.org/docs/current/static/protocol-flow.html:
===
In the event of a backend-detected error during copy-both mode, the
backend will issue an ErrorResponse message, discard frontend messages
until a Sync message is received, and then issue ReadyForQuery and
return to normal processing. The frontend should treat receipt of
ErrorResponse as terminating the copy in both directions; no CopyDone
should be sent in this case.
===
So it's true that the client can't unilaterally terminate COPY BOTH
mode; it can only send CopyDone. But an error on the server side
should do so.
Hm, you're right. Even though the server sends COPY_BOTH message before the plugin startup, an attempt by the client to actually read the data messages results in ErrorResponse handled on the client, then the client can re-submit the corrected START_REPLICATION command and continue without the need to reconnect. This cannot be actually tested in psql, but I can verify the behavior with my python scripts.
Then I don't have a preference for the early error reporting in this case. If the current behavior potentially allows for more flexible error reporting, I'm for keeping it.
--
Alex
Alex
On 1/21/16 9:53 AM, Shulgin, Oleksandr wrote:
It looks like a decision needs to be made here whether to apply this patch, send it back to the author, or reject it so I'm marking it "Ready for Committer".
Robert, since you were participating in this conversation can you have a look?
On Thu, Jan 21, 2016 at 3:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:
So it's true that the client can't unilaterally terminate COPY BOTH
mode; it can only send CopyDone. But an error on the server side
should do so.Hm, you're right. Even though the server sends COPY_BOTH message before the plugin startup, an attempt by the client to actually read the data messages results in ErrorResponse handled on the client, then the client can re-submit the corrected START_REPLICATION command and continue without the need to reconnect. This cannot be actually tested in psql, but I can verify the behavior with my python scripts.Then I don't have a preference for the early error reporting in this case. If the current behavior potentially allows for more flexible error reporting, I'm for keeping it.
It looks like a decision needs to be made here whether to apply this patch, send it back to the author, or reject it so I'm marking it "Ready for Committer".
Robert, since you were participating in this conversation can you have a look?
-- -David david@pgmasters.net
Вложения
On Fri, Mar 11, 2016 at 11:36 AM, David Steele <david@pgmasters.net> wrote: > It looks like a decision needs to be made here whether to apply this patch, > send it back to the author, or reject it so I'm marking it "Ready for > Committer". > > Robert, since you were participating in this conversation can you have a > look? Who, me? *looks around* OK, I had a look. I don't think this is a bug. The original report is complaining about this: ================= psycopg2test=# START_REPLICATION SLOT "test1" LOGICAL 0/0 ("invalid" 'value'); unexpected PQresultStatus: 8 The last command results in the following output sent to the server log: ERROR: option "invalid" = "value" is unknown CONTEXT: slot "test1", output plugin "test_decoding", in the startup callback ================= But that's just because psql knows utterly zero about copy-both mode. So the problem here isn't that the server is doing anything it shouldn't be doing; the server code is totally legitimate. It's just that the client is stupid. Now, we could commit this patch anyway. It won't hurt anything. But it won't help very much, either. The fundamental issue is that issuing a START_REPLICATION_SLOT command in psql can't be expected to work. If you do that, it's either going to fail with a server error, or it's going to fail with "unexpected PQresultStatus: 8". Committing this patch would cause this particular example to fail in the first way rather than the second one, and I'm not direly opposed to that if somebody really feels strongly about it. But it seems like the wrong thing to focus on. I think the real question is: why are you insisting on issuing a command in psql that can NEVER succeed there? If you want that to work, you're going to need to add copy-both support to psql, which might be a fun project but has nothing to do with this patch. So I am inclined to think it's fine to reject this, unless somebody wants to argue that there's some advantage to changing it than I am missing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Mar 11, 2016 at 11:36 AM, David Steele <david@pgmasters.net> wrote: >> It looks like a decision needs to be made here whether to apply this patch, >> send it back to the author, or reject it so I'm marking it "Ready for >> Committer". >> >> Robert, since you were participating in this conversation can you have a >> look? > Who, me? *looks around* > > OK, I had a look. I don't think this is a bug. I agree, and Craig's complaint that this change reduces flexibility for plugins seems to tilt the balance against it. I see that Alex himself accepted that argument in his last message in the thread (<CACACo5Sm-tY-9YijuLbXmcLmSkYtcTaVQh_1j4tADfjC+1-bCw@mail.gmail.com>). So let's mark it rejected and move on. > unexpected PQresultStatus: 8 It is a bit annoying that psql doesn't print "COPY_BOTH" here. But fixing that is not this patch, either. regards, tom lane
On 3/11/16 1:11 PM, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Fri, Mar 11, 2016 at 11:36 AM, David Steele <david@pgmasters.net> wrote: >>> It looks like a decision needs to be made here whether to apply this patch, >>> send it back to the author, or reject it so I'm marking it "Ready for >>> Committer". >>> >>> Robert, since you were participating in this conversation can you have a >>> look? > >> Who, me? *looks around* >> >> OK, I had a look. I don't think this is a bug. > > I agree, and Craig's complaint that this change reduces flexibility > for plugins seems to tilt the balance against it. I see that Alex > himself accepted that argument in his last message in the thread > (<CACACo5Sm-tY-9YijuLbXmcLmSkYtcTaVQh_1j4tADfjC+1-bCw@mail.gmail.com>). > > So let's mark it rejected and move on. +1 and done. -- -David david@pgmasters.net