Discussion: Re: Accept recovery conflict interrupt on blocked writing
Bonjour Anthonin,

On Mon, Jan 13, 2025 at 11:31 PM Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com> wrote:
> To avoid blocking recovery for an extended period of time, this patch
> changes client write interrupts by handling recovery conflict
> interrupts instead of ignoring them. Since the interrupt happens while
> we're likely to have partially written results on the socket, there's
> no easy way to keep protocol sync so the session needs to be
> terminated.

Right. Before commit 0da096d7 in v17, the recovery conflict code running in a signal handler would have set ProcDiePending, so this looks like an unintended regression due to that commit.

In your patch you have CHECK_FOR_INTERRUPTS(), but I think that means it could also service other interrupts that we probably don't want to introduce without more study, no? For example if ProcessRecoveryConflictInterrupts() returns without throwing, and another pending interrupt flag is also set. Thinking of other proc signals/interrupts, of course it's not really acceptable that we don't process at least also global barriers here, ie for a time controllable by a client, but let's call that an independent problem. I think the minimal thing for this specific problem might be to use ProcessRecoveryConflictInterrupts() directly.

I understand why you created DoingBlockingWrite, based on the variable named DoingQueryRead. I wonder if it would be better to model it on QueryCancelHoldoffCount, inventing QueryCancelPromoteToFatalCount, but I'm not sure.

The read from the socket also seems to have a variant of the problem, unless I'm missing something, except in that case I'm not sure it's new. For sending you're proposing that our only choice is to kill the session, which makes sense, but the equivalent read-side stay-in-sync strategy is to keep deferring until the client gets around to sending a complete message, if I read that right.
Hmm, I wonder if we could write tests for both directions in 031_recovery_conflict.pl, using Perl to do raw pq protocol over TCP, as seen in some other recent proposals... I might have a crack at that soon unless you want to.
Hi,

On 2025-01-17 17:01:53 +0100, Anthonin Bonnefoy wrote:
> I've cleaned up the tests: I've created a dedicated PgProto
> (definitely open to suggestions for a better name...) module
> containing all the helpers to send and receive messages on a raw
> socket in 0001.

Might be worth using it in src/test/postmaster/t/002_start_stop.pl? That has e.g. code to send a startup message.

> Also, I think the DoingBlockingWrite variable was unnecessary? Calling
> ProcessRecoveryConflictInterrupts should be enough, as I don't think
> QueryCancelHoldoffCount can be >0 when writing on a socket, so it should
> always be possible to cancel the statement.

I'd probably add an assertion to that effect; it's too easy for that to change in a few years otherwise.

> Previously, all interrupts except process dying were ignored while a
> process was blocked writing to a socket. If the connection to the client
> was broken (no clean FIN nor RST), a process sending results to the
> client could be stuck for 924s until the TCP retransmission timeout is
> reached. During this time, it was possible for the process to conflict
> with recovery: For example, the process returning results can have a
> conflicting buffer pin.
>
> To avoid blocking recovery for an extended period of time, this patch
> changes client write interrupts by handling recovery conflict interrupts
> instead of ignoring them. Since the interrupt happens while we're likely
> to have partially written results on the socket, there's no easy way to
> keep protocol sync so the session needs to be terminated.

I don't see anything implementing the promotion of ERRORs to FATAL? You're preventing the error message being sent to the client, but I don't think that causes the connection to be terminated. The pre-existing code doesn't have that problem, because it's only active when ProcDiePending is already set.
In fact, the test your patch added goes through ProcessRecoveryConflictInterrupts() multiple times:

2025-01-17 12:52:47.842 EST [3376411] LOG: recovery still waiting after 20.709 ms: recovery conflict on buffer pin
2025-01-17 12:52:47.842 EST [3376411] CONTEXT: WAL redo at 0/3462288 for Heap2/PRUNE_VACUUM_SCAN: , isCatalogRel: F, nplans: 0, nredirected: 0, ndead: 0, nun>
3376451: recovery conflict interrupt while blocked
3376451: recovery conflict processing done
write(8192) = -1: 11/Resource temporarily unavailable
3376451: recovery conflict interrupt while blocked
3376451: recovery conflict processing done
...
write(8192) = -1: 11/Resource temporarily unavailable
3376451: recovery conflict interrupt while blocked
3376451: recovery conflict processing done
write(8192) = -1: 11/Resource temporarily unavailable
3376451: recovery conflict interrupt while blocked
2025-01-17 12:52:48.072 EST [3376451] 031_recovery_conflict.pl ERROR: canceling statement due to conflict with recovery
2025-01-17 12:52:48.072 EST [3376451] 031_recovery_conflict.pl DETAIL: User was holding shared buffer pin for too long.
2025-01-17 12:52:48.072 EST [3376451] 031_recovery_conflict.pl STATEMENT: BEGIN;
	DECLARE test_recovery_conflict_cursor CURSOR FOR SELECT b FROM test_recovery_conflict_table1;
	FETCH FORWARD FROM test_recovery_conflict_cursor;
	SELECT generate_series(1, 100000);
backend 3376451> 2025-01-17 12:52:48.072 EST [3376411] LOG: recovery finished waiting after 250.681 ms: recovery conflict on buffer pin

I don't actually know why the conflict ends up being resolved after a bunch of retries.

Note also the "backend>" (to which I added the PID to identify it) which gets emitted. Just setting whereToSendOutput = DestNone has side effects when the process isn't actually exiting...
> @@ -581,6 +582,22 @@ ProcessClientWriteInterrupt(bool blocked)
>  	else
>  		SetLatch(MyLatch);
>  }
> +	else if (blocked && RecoveryConflictPending)
> +	{
> +		/*
> +		 * We're conflicting with recovery while being blocked writing. This
> +		 * can happen when the process is returning results and no ACK is
> +		 * received (broken connection, client overloaded...), eventually
> +		 * saturating the socket buffer while the process holds a page pin
> +		 * that eventually conflicts with recovery.
> +		 */
> +		if (InterruptHoldoffCount == 0 && CritSectionCount == 0)

I was about to say that we shouldn't get here with CritSectionCount != 0, but unfortunately I'm not sure that's true. Thinking of e.g. debug messages or such. And it's of course a copy of existing code...

Playing around with this it's unfortunately worse - we very commonly get to ProcessClientWriteInterrupt() with InterruptHoldoffCount > 0, including just after processing recovery conflict interrupts. The reason for that is that if we do throw an error, the first thing we do is to hold interrupts (c.f. sigsetjmp() in PostgresMain()) and then we call EmitErrorReport().

Unfortunately I suspect that this means any protection we'd get from this version of the patch is rather incomplete - if the recovery conflict is triggered while not blocked, we'll ERROR out and report the message with interrupts held. That cycle of recovery conflict signalling wouldn't ever be resolved in such a case, I think. Of course we do re-signal recovery conflict interrupts rather aggressively - but I'm not sure that's going to help reliably.

I've previously suggested that we ought to report some errors to the client by trying to send the data in a non-blocking way, and just terminate if we can't send the data immediately. IIRC Tom was concerned that that would lead to too many lost messages. Perhaps we could instead implement this with a timer that terminates connections when stuck trying to write out an error message?
> +		{
> +			if (whereToSendOutput == DestRemote)
> +				whereToSendOutput = DestNone;
> +			ProcessRecoveryConflictInterrupts();
> +		}
> +	}

This largely duplicates the existing code in ProcessClientWriteInterrupt(). Perhaps it's worth flipping the order of branches so that we first check that we're neither blocked nor have InterruptHoldoffCount == 0 && CritSectionCount == 0?

Greetings,

Andres Freund
Hi,

On 2025-01-17 13:03:35 -0500, Andres Freund wrote:
> I don't see anything implementing the promotion of ERRORs to FATAL? You're
> preventing the error message being sent to the client, but I don't think that
> causes the connection to be terminated. The pre-existing code doesn't have
> that problem, because it's only active when ProcDiePending is already set.
>
> In fact, the test your patch added goes through
> ProcessRecoveryConflictInterrupts() multiple times:
>
> 2025-01-17 12:52:47.842 EST [3376411] LOG: recovery still waiting after 20.709 ms: recovery conflict on buffer pin
> 2025-01-17 12:52:47.842 EST [3376411] CONTEXT: WAL redo at 0/3462288 for Heap2/PRUNE_VACUUM_SCAN: , isCatalogRel: F, nplans: 0, nredirected: 0, ndead: 0, nun>
> 3376451: recovery conflict interrupt while blocked
> 3376451: recovery conflict processing done
> write(8192) = -1: 11/Resource temporarily unavailable
> 3376451: recovery conflict interrupt while blocked
> 3376451: recovery conflict processing done
> ...
> write(8192) = -1: 11/Resource temporarily unavailable
> 3376451: recovery conflict interrupt while blocked
> 3376451: recovery conflict processing done
> write(8192) = -1: 11/Resource temporarily unavailable
> 3376451: recovery conflict interrupt while blocked
> 2025-01-17 12:52:48.072 EST [3376451] 031_recovery_conflict.pl ERROR: canceling statement due to conflict with recovery
> 2025-01-17 12:52:48.072 EST [3376451] 031_recovery_conflict.pl DETAIL: User was holding shared buffer pin for too long.
> 2025-01-17 12:52:48.072 EST [3376451] 031_recovery_conflict.pl STATEMENT:
> 	BEGIN;
> 	DECLARE test_recovery_conflict_cursor CURSOR FOR SELECT b FROM test_recovery_conflict_table1;
> 	FETCH FORWARD FROM test_recovery_conflict_cursor;
> 	SELECT generate_series(1, 100000);
>
> backend 3376451> 2025-01-17 12:52:48.072 EST [3376411] LOG: recovery finished waiting after 250.681 ms: recovery conflict on buffer pin
>
> I don't actually know why the conflict ends up being resolved after a bunch of
> retries.

It's because the test sets deadlock_timeout lower than max_standby_streaming_delay.

> Note also the "backend>" (to which I added the PID to identify it) which gets
> emitted. Just setting whereToSendOutput = DestNone has side effects when not
> actually in a process exit status...

I think the only reason the patch works at all is that we end up in InteractiveBackend()'s EOF handling, because InteractiveBackend() reads from stdin. That's closed here, but I don't think we have any guarantee that stdin isn't something that can be read from.

Greetings,

Andres Freund
Hi,

On 2025-01-17 13:03:35 -0500, Andres Freund wrote:
> > Previously, all interrupts except process dying were ignored while a
> > process was blocked writing to a socket. If the connection to the client
> > was broken (no clean FIN nor RST), a process sending results to the
> > client could be stuck for 924s until the TCP retransmission timeout is
> > reached. During this time, it was possible for the process to conflict
> > with recovery: For example, the process returning results can have a
> > conflicting buffer pin.
> >
> > To avoid blocking recovery for an extended period of time, this patch
> > changes client write interrupts by handling recovery conflict interrupts
> > instead of ignoring them. Since the interrupt happens while we're likely
> > to have partially written results on the socket, there's no easy way to
> > keep protocol sync so the session needs to be terminated.
>
> I don't see anything implementing the promotion of ERRORs to FATAL? You're
> preventing the error message being sent to the client, but I don't think that
> causes the connection to be terminated. The pre-existing code doesn't have
> that problem, because it's only active when ProcDiePending is already set.

The easiest way to do that promotion is probably to set ExitOnAnyError to true before entering ProcessRecoveryConflictInterrupts(). That does prevent ending up in InteractiveBackend(). But I suspect this won't suffice:

> > @@ -581,6 +582,22 @@ ProcessClientWriteInterrupt(bool blocked)
> >  	else
> >  		SetLatch(MyLatch);
> >  }
> > +	else if (blocked && RecoveryConflictPending)
> > +	{
> > +		/*
> > +		 * We're conflicting with recovery while being blocked writing. This
> > +		 * can happen when the process is returning results and no ACK is
> > +		 * received (broken connection, client overloaded...), eventually
> > +		 * saturating the socket buffer while the process holds a page pin
> > +		 * that eventually conflicts with recovery.
> > +		 */
> > +		if (InterruptHoldoffCount == 0 && CritSectionCount == 0)
>
> I was about to say that we shouldn't get here with CritSectionCount != 0, but
> unfortunately I'm not sure that's true. Thinking of e.g. debug messages or
> such. And it's of course a copy of existing code...
>
> Playing around with this it's unfortunately worse - we very commonly get to
> ProcessClientWriteInterrupt() with InterruptHoldoffCount > 0. Including just
> after processing recovery conflict interrupts. The reason for that is that if
> we do throw an error the first thing we do is to hold interrupts
> (c.f. sigsetjmp() in PostgresMain()) and then we call EmitErrorReport().
>
> Unfortunately I suspect that this means any protection we'd get from this
> version of the patch is rather incomplete - if the recovery conflict is
> triggered while not blocked, we'll ERROR out and report the message with
> interrupts held. That cycle of recovery conflict signalling wouldn't ever be
> resolved in such a case, I think. Of course we do re-signal recovery conflict
> interrupts rather aggressively - but I'm not sure that's going to help
> reliably.

This unfortunately indeed is true. If I, instead of the generate_series(), add the following to the pgproto query:

DO \$\$BEGIN RAISE 'endless scream: %', repeat('a', 1000000); END;\$\$;

the conflict doesn't get handled. Example log, with a fair bit of debugging added:

2025-01-17 13:31:46.226 EST [3407842] 031_recovery_conflict.pl ERROR: endless scream: aa...
2025-01-17 13:31:46.226 EST [3407842] 031_recovery_conflict.pl CONTEXT: PL/pgSQL function inline_code_block line 1 at RAISE
2025-01-17 13:31:46.226 EST [3407842] 031_recovery_conflict.pl STATEMENT: BEGIN;
	DECLARE test_recovery_conflict_cursor CURSOR FOR SELECT b FROM test_recovery_conflict_table1;
	FETCH FORWARD FROM test_recovery_conflict_cursor;
	DO $$BEGIN RAISE 'endless scream: %', repeat('a', 1000000); END;$$;
3407842: start frontend error
3407842: ProcessClientWriteInterrupt blocked: 0, IHC: 1, CSC: 0, ProcDieP: 0, RecoveryConflictPending: 0
3407842: write(8192) = 8192: 0/Success
3407842: ProcessClientWriteInterrupt blocked: 0, IHC: 1, CSC: 0, ProcDieP: 0, RecoveryConflictPending: 0
3407842: ProcessClientWriteInterrupt blocked: 0, IHC: 1, CSC: 0, ProcDieP: 0, RecoveryConflictPending: 0
3407842: write(992023) = 219264: 0/Success
3407842: ProcessClientWriteInterrupt blocked: 0, IHC: 1, CSC: 0, ProcDieP: 0, RecoveryConflictPending: 0
3407842: ProcessClientWriteInterrupt blocked: 0, IHC: 1, CSC: 0, ProcDieP: 0, RecoveryConflictPending: 0
3407842: write(772759) = -1: 11/Resource temporarily unavailable
2025-01-17 13:31:46.231 EST [3407800] DEBUG: reaping dead processes
2025-01-17 13:31:46.232 EST [3407800] DEBUG: releasing pm child slot 3
2025-01-17 13:31:46.232 EST [3407800] DEBUG: client backend (PID 3407844) exited with exit code 0
2025-01-17 13:31:46.407 EST [3407804] DEBUG: sendtime 2025-01-17 13:31:46.407326-05 receipttime 2025-01-17 13:31:46.40737-05 replication apply delay 0 ms transfer latency 1 ms
2025-01-17 13:31:46.407 EST [3407804] DEBUG: sending write 0/3460F78 flush 0/3460DA0 apply 0/3460DA0
2025-01-17 13:31:46.407 EST [3407804] DEBUG: sending write 0/3460F78 flush 0/3460F78 apply 0/3460DA0
3407842: ProcessClientWriteInterrupt blocked: 1, IHC: 1, CSC: 0, ProcDieP: 0, RecoveryConflictPending: 1
3407832: !HoldingBufferPinThatDelaysRecovery
3407842: write(772759) = -1: 11/Resource temporarily unavailable
2025-01-17 13:31:46.707 EST [3407803] LOG: recovery
still waiting after 300.099 ms: recovery conflict on buffer pin
2025-01-17 13:31:46.707 EST [3407803] CONTEXT: WAL redo at 0/3460DA0 for Heap2/PRUNE_VACUUM_SCAN: , isCatalogRel: F, nplans: 0, nredirected: 0, ndead: 0, nunused: 1, unused: [21]; blkref #0: rel 1663/16385/16386, blk 0
3407832: !HoldingBufferPinThatDelaysRecovery
3407842: ProcessClientWriteInterrupt blocked: 1, IHC: 1, CSC: 0, ProcDieP: 0, RecoveryConflictPending: 1
3407842: write(772759) = -1: 11/Resource temporarily unavailable
3407842: ProcessClientWriteInterrupt blocked: 1, IHC: 1, CSC: 0, ProcDieP: 0, RecoveryConflictPending: 1
3407842: write(772759) = -1: 11/Resource temporarily unavailable
2025-01-17 13:31:56.417 EST [3407804] DEBUG: sending write 0/3460F78 flush 0/3460F78 apply 0/3460DA0
2025-01-17 13:32:06.427 EST [3407804] DEBUG: sending write 0/3460F78 flush 0/3460F78 apply 0/3460DA0
2025-01-17 13:32:07.224 EST [3407804] DEBUG: sendtime 2025-01-17 13:32:07.224295-05 receipttime 2025-01-17 13:32:07.224413-05 replication apply delay 20817 ms transfer latency 1 ms

Greetings,

Andres Freund