Discussion: [BUG] [PATCH] Allow physical replication slots to recover from archive after invalidation
Hi hackers,

I'd like to report a regression in PostgreSQL 18 regarding physical replication slot invalidation and propose a fix.

It's my first time sending any type of contribution, so please let me know if I did anything incorrectly and I'll fix it ASAP.

It's also my first time writing any code in the postgres project, so if the logic or anything I used is incorrect, let me know.

CCing Amit, since he committed f41d8468 and 8709dcc.

## Problem

Commit f41d8468 introduced an ERROR when trying to acquire an invalidated replication slot. While this is correct for logical replication slots (which cannot safely recover after invalidation), it breaks recovery for physical replication slots.

Later, commit 8709dcc improved upon this code to prevent a race condition and moved the check to after the slot was already acquired.

In PostgreSQL 17 and earlier, when a physical replication slot was invalidated due to max_slot_wal_keep_size, the slot could still be reacquired if the required WAL became available through restore_command or archive recovery in the standby. This is a common operational scenario:

- Temporary network issues
- Planned maintenance windows
- Standbys temporarily falling behind

After commit f41d8468, physical replication slots cannot be reacquired once invalidated, even when the required WAL is available via archive recovery. The standby remains stuck recovering from archive and cannot resume streaming replication, demanding manual intervention (slot recreation).

## Reproduction

1. Set up primary with physical replication slot and small max_slot_wal_keep_size
2. Configure standby with restore_command for archive recovery
3. Stop standby and generate enough WAL on primary to invalidate the slot
4. Restart standby - it can access WAL from archive but gets: "FATAL: can no longer access replication slot"

In PG17, streaming would resume. In PG18, it stays permanently broken.
## Proposed Fix

The attached patch differentiates between logical and physical slots in ReplicationSlotAcquire():

- Logical slots: Raise ERROR as before (cannot safely recover)
- Physical slots: Log a warning but allow acquisition (can recover)

This restores the PG17 behavior for physical slots while maintaining safety guarantees for logical slots.

The patch includes a TAP test that:

- Demonstrates the issue
- Verifies the fix works
- Ensures physical slots can recover after invalidation

## Testing

Tested on both master and REL_18_STABLE:

- All existing regression tests pass
- New TAP test passes
- Manual testing confirms standbys can recover

## Backpatching

This should be backpatched to PostgreSQL 18 where the regression was introduced.

Patches attached:

- v1-0001-Allow-physical-replication-slots-to-recover-after-invalidation.patch (for master)
- v1-0001-Allow-physical-replication-slots-to-recover-after-invalidation-pg18.patch (for REL_18_STABLE)

Best regards,
Joao Foltran
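The patch itself is not reproduced in this thread, so the following is only a minimal sketch of the decision described under "Proposed Fix", not the actual change to ReplicationSlotAcquire(). All names here (ModelSlotKind, ModelAcquireResult, model_acquire_check) are hypothetical stand-ins, and the real function reports problems through ereport() rather than a return code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; these are not PostgreSQL types. */
typedef enum
{
    MODEL_SLOT_PHYSICAL,
    MODEL_SLOT_LOGICAL
} ModelSlotKind;

typedef enum
{
    MODEL_ACQUIRE_OK,       /* slot is valid, acquire silently */
    MODEL_ACQUIRE_WARNING,  /* slot is invalidated, warn but acquire */
    MODEL_ACQUIRE_ERROR     /* slot is invalidated, refuse to acquire */
} ModelAcquireResult;

/*
 * Model of the proposed rule: an invalidated logical slot is rejected,
 * an invalidated physical slot is acquired with a warning.
 */
static ModelAcquireResult
model_acquire_check(ModelSlotKind kind, bool invalidated)
{
    assert(kind == MODEL_SLOT_PHYSICAL || kind == MODEL_SLOT_LOGICAL);

    if (!invalidated)
        return MODEL_ACQUIRE_OK;
    if (kind == MODEL_SLOT_LOGICAL)
        return MODEL_ACQUIRE_ERROR;
    return MODEL_ACQUIRE_WARNING;
}
```

Under this model, model_acquire_check(MODEL_SLOT_PHYSICAL, true) yields MODEL_ACQUIRE_WARNING, while the same call for a logical slot yields MODEL_ACQUIRE_ERROR. The sketch covers only the acquire-time decision; it says nothing about whether the acquired slot retains WAL afterwards.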
RE: [BUG] [PATCH] Allow physical replication slots to recover from archive after invalidation
From: "Zhijie Hou (Fujitsu)"
Date:
On Tuesday, December 16, 2025 2:54 AM Joao Foltran <joao@foltrandba.com> wrote:
> Hi hackers,
>
> I'd like to report a regression in PostgreSQL 18 regarding physical replication
> slot invalidation and propose a fix.
>
> It's my first time sending any type of contribution, so please let me know if I
> did anything incorrectly and I'll fix it ASAP.
>
> It's also my first time writing any code in the postgres project, so if
> the logic or anything I used is incorrect, let me know.
>
> CCing Amit, since he committed f41d8468 and 8709dcc.
>
> ## Problem
>
> Commit f41d8468 introduced an ERROR when trying to acquire an invalidated
> replication slot. While this is correct for logical replication slots (which cannot
> safely recover after invalidation), it breaks recovery for physical replication
> slots.
>
> Later, commit 8709dcc improved upon this code to prevent a race condition
> and moved the check to after the slot was already acquired.
>
> In PostgreSQL 17 and earlier, when a physical replication slot was invalidated
> due to max_slot_wal_keep_size, the slot could still be reacquired if the
> required WAL became available through restore_command or archive
> recovery in the standby. This is a common operational scenario:
>
> - Temporary network issues
> - Planned maintenance windows
> - Standbys temporarily falling behind

I think the ability to acquire an invalidated slot represents a potentially risky behavior. AFAICS, we do not currently support recovering invalidated slots. This implies that once a slot becomes invalidated, it does not offer any protection anymore. Even if the restart_lsn or xmin is valid for such a slot, WAL and rows can be removed at any time. For further clarification, please refer to ReplicationSlotsComputeRequiredLSN(), where we deliberately exclude counting the restart_lsn for an invalidated slot.

> After commit f41d8468, physical replication slots cannot be reacquired once
> invalidated, even when the required WAL is available via archive recovery.
> The standby remains stuck recovering from archive and cannot resume
> streaming replication, demanding manual intervention (slot recreation).

I think even if the WAL is temporarily available via archive recovery, since the slot cannot protect any further WAL and rows from being removed, it could cause problems later.

Best Regards,
Hou zj
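To make the point about ReplicationSlotsComputeRequiredLSN() concrete, here is a simplified model of a "minimum restart_lsn over valid slots" computation. It is a sketch under stated assumptions, not the PostgreSQL source: ModelSlot, ModelInvalCause and model_compute_required_lsn are hypothetical names, and locking and shared-memory details are omitted entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for PostgreSQL types; not the real definitions. */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

typedef enum
{
    MODEL_INVAL_NONE,
    MODEL_INVAL_WAL_REMOVED
} ModelInvalCause;

typedef struct
{
    bool            in_use;
    XLogRecPtr      restart_lsn;
    ModelInvalCause invalidated;
} ModelSlot;

/*
 * Return the oldest restart_lsn among valid, in-use slots, or
 * InvalidXLogRecPtr if no slot constrains WAL removal.
 */
static XLogRecPtr
model_compute_required_lsn(const ModelSlot *slots, int nslots)
{
    XLogRecPtr  min_required = InvalidXLogRecPtr;

    assert(slots != NULL || nslots == 0);

    for (int i = 0; i < nslots; i++)
    {
        const ModelSlot *s = &slots[i];

        if (!s->in_use)
            continue;
        /* An invalidated slot is skipped even if its restart_lsn is set. */
        if (s->invalidated != MODEL_INVAL_NONE)
            continue;
        if (s->restart_lsn == InvalidXLogRecPtr)
            continue;
        if (min_required == InvalidXLogRecPtr || s->restart_lsn < min_required)
            min_required = s->restart_lsn;
    }
    return min_required;
}
```

With one invalidated slot at 0x1000 and one valid slot at 0x5000, this model returns 0x5000: WAL between the two positions is no longer held back on behalf of the invalidated slot, which is why merely reacquiring such a slot gives the standby no protection.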
Hi Zhijie,
Thank you for clarifying this behavior to me! I've tested it and the
slot really doesn't hold back WAL anymore once it has been invalidated,
due to the check inside ReplicationSlotsComputeRequiredLSN().

You are correct that simply letting the slot be reacquired and
continue working would be dangerous, possibly leading to lost WAL.
Can we then check whether the standby was able to reconnect and start
streaming successfully, and then change the slot's information so that
it is considered inside ReplicationSlotsComputeRequiredLSN() again?

Example:

In XLogSendPhysical(), after we have seen that the first record was sent:

    /* In XLogSendPhysical() after XLogReadRecord() succeeds */
    if (first_record_sent &&
        MyReplicationSlot &&
        SlotIsPhysical(MyReplicationSlot) &&
        MyReplicationSlot->data.invalidated != RS_INVAL_NONE)
    {
        /* Clear invalidation - we successfully read WAL */
    }

This would clear the invalidation only after we know for sure that it
can continue streaming WAL without problems.

After we clear the invalidation, the slot should be able to start
holding back WAL again, right?
Regards,
Joao Foltran
On Tue, Dec 16, 2025 at 12:15 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
> [...]
On Tue, Dec 16, 2025 at 9:54 AM Joao Foltran <joao@foltrandba.com> wrote:
>
> Thank you for clarifying this behavior to me! I've tested it and the
> slot really doesn't hold back WAL anymore once it has been invalidated,
> due to the check inside ReplicationSlotsComputeRequiredLSN().
>
> You are correct that simply letting the slot be reacquired and
> continue working would be dangerous, possibly leading to lost WAL.
> Can we then check whether the standby was able to reconnect and start
> streaming successfully, and then change the slot's information so that
> it is considered inside ReplicationSlotsComputeRequiredLSN() again?
>
> Example:
>
> In XLogSendPhysical(), after we have seen that the first record was sent:
>
>     /* In XLogSendPhysical() after XLogReadRecord() succeeds */
>     if (first_record_sent &&
>         MyReplicationSlot &&
>         SlotIsPhysical(MyReplicationSlot) &&
>         MyReplicationSlot->data.invalidated != RS_INVAL_NONE)
>     {
>         /* Clear invalidation - we successfully read WAL */
>     }
>
> This would clear the invalidation only after we know for sure that it
> can continue streaming WAL without problems.
>
The slots could be invalidated due to other reasons like
RS_INVAL_IDLE_TIMEOUT as well. It doesn't sound like a good idea to
clear the invalidation flag of the slot, because tomorrow we could
decide to invalidate for other reasons as well. I think it would be
better to do the required forensics on invalid slots and re-create the
slot if we want to retain the required WAL. Why don't you prefer to
re-create the slot once it is invalidated?
--
With Regards,
Amit Kapila.
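The objection about other invalidation causes can be illustrated with a small model. This is a sketch under assumptions, not PostgreSQL code: ModelCause and both helper functions are hypothetical, MODEL_CAUSE_WAL_REMOVED stands in for invalidation due to max_slot_wal_keep_size, and the claim that a successful WAL read is only evidence about that one cause is an interpretation of the discussion above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical causes, loosely named after those discussed in the thread. */
typedef enum
{
    MODEL_CAUSE_NONE,
    MODEL_CAUSE_WAL_REMOVED,   /* e.g. max_slot_wal_keep_size exceeded */
    MODEL_CAUSE_IDLE_TIMEOUT   /* e.g. slot unused for too long */
} ModelCause;

/* The condition from the proposed snippet: any invalidation matches. */
static bool
model_blanket_check_matches(ModelCause cause)
{
    return cause != MODEL_CAUSE_NONE;
}

/*
 * Whether "we successfully read WAL" says anything about the cause.
 * In this model it is only relevant when the slot was invalidated
 * because its WAL had been removed.
 */
static bool
model_wal_read_addresses_cause(ModelCause cause)
{
    switch (cause)
    {
        case MODEL_CAUSE_WAL_REMOVED:
            return true;
        case MODEL_CAUSE_NONE:
        case MODEL_CAUSE_IDLE_TIMEOUT:
            return false;
    }
    assert(false);              /* unknown cause: fail loudly */
    return false;
}
```

For MODEL_CAUSE_IDLE_TIMEOUT the blanket check matches although the WAL read is not evidence about that cause, so a condition of the form `invalidated != RS_INVAL_NONE` would also clear invalidations it was never meant to handle, including any cause added later.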