Thread: [BUG] [PATCH] Allow physical replication slots to recover from archive after invalidation

[BUG] [PATCH] Allow physical replication slots to recover from archive after invalidation

From: Joao Foltran
Date: December 15, 2025, 21:54:23

Hi hackers,

I'd like to report a regression in PostgreSQL 18 regarding physical
replication slot invalidation and propose a fix.

This is my first contribution of any kind, so please let me
know if I did anything incorrectly and I'll fix it ASAP.

It's also my first time writing code for the PostgreSQL
project, so if the logic or anything else I used is incorrect, let me know.

CCing Amit, since he committed f41d8468 and 8709dcc.

## Problem

Commit f41d8468 introduced an ERROR when trying to acquire an invalidated
replication slot. While this is correct for logical replication slots
(which cannot safely recover after invalidation), it breaks recovery
for physical replication slots.

Later, commit 8709dcc improved upon this code to prevent a race
condition and moved the check to after the slot was already acquired.

In PostgreSQL 17 and earlier, when a physical replication slot was
invalidated due to max_slot_wal_keep_size, the slot could still be
reacquired if the required WAL became available through restore_command
or archive recovery in the standby. This is a common operational scenario:

- Temporary network issues
- Planned maintenance windows
- Standbys temporarily falling behind

After commit f41d8468, physical replication slots cannot be reacquired
once invalidated, even when the required WAL is available via archive
recovery. The standby remains stuck recovering from archive and cannot
resume streaming replication, demanding manual intervention (slot recreation).

## Reproduction

1. Set up primary with physical replication slot and small
max_slot_wal_keep_size
2. Configure standby with restore_command for archive recovery
3. Stop standby and generate enough WAL on primary to invalidate the slot
4. Restart standby - it can access WAL from archive but gets:
   "FATAL: can no longer access replication slot"

In PG17, streaming would resume. In PG18, it stays permanently broken.

## Proposed Fix

The attached patch differentiates between logical and physical slots in
ReplicationSlotAcquire():

- Logical slots: Raise ERROR as before (cannot safely recover)
- Physical slots: Log a warning but allow acquisition (can recover)

This restores the PG17 behavior for physical slots while maintaining
safety guarantees for logical slots.
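
The decision described above can be sketched as a toy model in C. This is not the attached patch and does not use PostgreSQL's real types: `ToySlotState`, `ToyAcquireResult`, and `toy_acquire_slot()` are hypothetical names invented for illustration, and the real ReplicationSlotAcquire() raises errors through ereport() rather than returning a code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome codes; the real code reports via ereport(). */
typedef enum ToyAcquireResult
{
    TOY_ACQUIRE_OK,      /* slot is valid, acquire normally */
    TOY_ACQUIRE_WARNING, /* invalidated physical slot: warn, still acquire */
    TOY_ACQUIRE_ERROR    /* invalidated logical slot: refuse */
} ToyAcquireResult;

/* Hypothetical, heavily reduced view of a replication slot. */
typedef struct ToySlotState
{
    bool is_logical;  /* true for logical slots, false for physical */
    bool invalidated; /* true once the slot has been invalidated */
} ToySlotState;

/* Decide how an acquisition attempt should be handled under the proposal. */
ToyAcquireResult
toy_acquire_slot(const ToySlotState *slot)
{
    assert(slot != NULL);

    if (!slot->invalidated)
        return TOY_ACQUIRE_OK;
    if (slot->is_logical)
        return TOY_ACQUIRE_ERROR;
    return TOY_ACQUIRE_WARNING;
}
```

Under this model, only the combination "physical and invalidated" changes behavior relative to raising an error for every invalidated slot.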

The patch includes a TAP test that:
- Demonstrates the issue
- Verifies the fix works
- Ensures physical slots can recover after invalidation

## Testing

Tested on both master and REL_18_STABLE:
- All existing regression tests pass
- New TAP test passes
- Manual testing confirms standbys can recover

## Backpatching

This should be backpatched to PostgreSQL 18 where the regression was
introduced.

Patches attached:
- v1-0001-Allow-physical-replication-slots-to-recover-after-invalidation.patch (for master)
- v1-0001-Allow-physical-replication-slots-to-recover-after-invalidation-pg18.patch (for REL_18_STABLE)

Best regards,

Joao Foltran

RE: [BUG] [PATCH] Allow physical replication slots to recover from archive after invalidation

From: "Zhijie Hou (Fujitsu)"
Date: December 16, 2025, 06:15:29

On Tuesday, December 16, 2025 2:54 AM Joao Foltran <joao@foltrandba.com> wrote:
> Hi hackers,
> 
> I'd like to report a regression in PostgreSQL 18 regarding physical replication
> slot invalidation and propose a fix.
> 
> This is my first contribution of any kind, so please let me know if I
> did anything incorrectly and I'll fix it ASAP.
> 
> It's also my first time writing code for the PostgreSQL project, so if
> the logic or anything else I used is incorrect, let me know.
> 
> CCing Amit, since he committed f41d8468 and 8709dcc.
> 
> ## Problem
> 
> Commit f41d8468 introduced an ERROR when trying to acquire an invalidated
> replication slot. While this is correct for logical replication slots (which cannot
> safely recover after invalidation), it breaks recovery for physical replication
> slots.
> 
> Later, commit 8709dcc improved upon this code to prevent a race condition
> and moved the check to after the slot was already acquired.
> 
> In PostgreSQL 17 and earlier, when a physical replication slot was invalidated
> due to max_slot_wal_keep_size, the slot could still be reacquired if the
> required WAL became available through restore_command or archive
> recovery in the standby. This is a common operational scenario:
> 
> - Temporary network issues
> - Planned maintenance windows
> - Standbys temporarily falling behind

I think the ability to acquire an invalidated slot is
potentially risky behavior. AFAICS, we do not currently support
recovering invalidated slots. This implies that once a slot is invalidated,
it no longer offers any protection. Even if the restart_lsn or xmin is valid for
such a slot, WAL and rows can be removed at any time. For further clarification,
please refer to ReplicationSlotsComputeRequiredLSN(), where we deliberately
exclude the restart_lsn of an invalidated slot from the computation.
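
To illustrate why an invalidated slot protects nothing, here is a minimal toy model in C of a "minimum required LSN" computation that skips invalidated slots. It is only a sketch of the idea described above, not the code of ReplicationSlotsComputeRequiredLSN(); `ToySlotLsn`, `ToyLsn`, `TOY_INVALID_LSN`, and `toy_compute_required_lsn()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; not PostgreSQL's real definitions. */
typedef uint64_t ToyLsn;
#define TOY_INVALID_LSN ((ToyLsn) 0)

typedef struct ToySlotLsn
{
    ToyLsn restart_lsn; /* oldest WAL position the slot still needs */
    bool   invalidated; /* true once the slot has been invalidated */
} ToySlotLsn;

/*
 * Return the oldest restart_lsn among valid slots, or TOY_INVALID_LSN if
 * no slot constrains WAL removal. Invalidated slots are skipped, so their
 * restart_lsn never holds WAL back, even if it still looks valid.
 */
ToyLsn
toy_compute_required_lsn(const ToySlotLsn *slots, size_t nslots)
{
    ToyLsn min_required = TOY_INVALID_LSN;

    assert(slots != NULL || nslots == 0);

    for (size_t i = 0; i < nslots; i++)
    {
        if (slots[i].invalidated)
            continue; /* offers no protection */
        if (slots[i].restart_lsn == TOY_INVALID_LSN)
            continue; /* slot has not reserved any WAL */
        if (min_required == TOY_INVALID_LSN ||
            slots[i].restart_lsn < min_required)
            min_required = slots[i].restart_lsn;
    }
    return min_required;
}
```

In this model, a standby that reacquires an invalidated slot at LSN 100 while another valid slot sits at LSN 500 gets a required LSN of 500, so WAL between 100 and 500 can be removed underneath it.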

> 
> After commit f41d8468, physical replication slots cannot be reacquired once
> invalidated, even when the required WAL is available via archive recovery.
> The standby remains stuck recovering from archive and cannot resume
> streaming replication, demanding manual intervention (slot recreation).
> 

I think even if the WAL is temporarily available via archive recovery, since the slot
cannot protect any further WAL and rows from being removed, it could cause
problems later.

Best Regards,
Hou zj

Re: [BUG] [PATCH] Allow physical replication slots to recover from archive after invalidation

From: Joao Foltran
Date: December 16, 2025, 07:24:40

Hi Zhijie,

Thank you for clarifying this behavior! I've tested it, and the slot
really doesn't hold back WAL anymore once it has been invalidated, due
to the check inside ReplicationSlotsComputeRequiredLSN().

You are correct that simply letting the slot be reacquired and
continue working would be dangerous, possibly leading to lost WAL.
Could we instead check whether the standby was able to reconnect and start
streaming successfully, and then update the slot's information so that it is
considered inside ReplicationSlotsComputeRequiredLSN() again?

Example:

in XLogSendPhysical(), after we have seen that the first record was sent:

// In XLogSendPhysical() after XLogReadRecord() succeeds
if (first_record_sent &&
    MyReplicationSlot &&
    SlotIsPhysical(MyReplicationSlot) &&
    MyReplicationSlot->data.invalidated != RS_INVAL_NONE)
{
    // Clear invalidation - we successfully read WAL
}

This would clear the invalidation only after we know for sure that the
slot can continue streaming WAL without problems.

After we clear the invalidation, the slot should be able to start
holding back WAL again, right?

Regards,
Joao Foltran

Re: [BUG] [PATCH] Allow physical replication slots to recover from archive after invalidation

From: Amit Kapila
Date: December 16, 2025, 12:15:01

On Tue, Dec 16, 2025 at 9:54 AM Joao Foltran <joao@foltrandba.com> wrote:
>
> Thank you for clarifying this behavior to me! I've tested it and it
> really doesn't hold back WAL anymore once it has been invalidated due
> to the check inside ReplicationSlotsComputeRequiredLSN().
>
> You are correct that simply letting the slot be reacquired and
> continue working would be dangerous, possibly leading to lost WAL.
> Can we then check if the standby was able to reconnect and start
> streaming successfully and then change the slot's information for it to
> be considered inside ReplicationSlotsComputeRequiredLSN() again?
>
> Example:
>
> in XLogSendPhysical(), after we have seen that the first record was sent:
>
> // In XLogSendPhysical() after XLogReadRecord() succeeds
> if (first_record_sent &&
>     MyReplicationSlot &&
>     SlotIsPhysical(MyReplicationSlot) &&
>     MyReplicationSlot->data.invalidated != RS_INVAL_NONE)
> {
>     // Clear invalidation - we successfully read WAL
> }
>
> This would clear the invalidation only after we know for sure that it
> can continue streaming WAL without problems.
>

The slots could be invalidated due to other reasons like
RS_INVAL_IDLE_TIMEOUT as well. It doesn't sound like a good idea to clear
the invalidation flag of the slot because tomorrow we could decide to
invalidate due to other reasons as well. I think it would be better to
do the required forensics on invalid slots and re-create the slot if
we want to retain the required WAL. Why don't you prefer to re-create
it once the slot is invalidated?

--
With Regards,
Amit Kapila.

Re: [BUG] [PATCH] Allow physical replication slots to recover from archive after invalidation

From: Joao Foltran
Date: January 6, 00:55:49

> The slots could be invalidated due to other reasons like
> RS_INVAL_IDLE_TIMEOUT as well.

We could restrict "revalidation" to only those invalidation reasons
that can be resolved this way.
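
A minimal sketch of such a filter, as a toy model in C. The enum only mirrors the style of PostgreSQL's invalidation cause names (the thread itself mentions RS_INVAL_NONE and RS_INVAL_IDLE_TIMEOUT); the `TOY_` values, both function names, and above all the choice of which cause counts as recoverable are assumptions made for illustration, not something the thread has agreed on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical causes, loosely modeled on PostgreSQL's RS_INVAL_* names. */
typedef enum ToyInvalCause
{
    TOY_INVAL_NONE,
    TOY_INVAL_WAL_REMOVED,
    TOY_INVAL_HORIZON,
    TOY_INVAL_WAL_LEVEL,
    TOY_INVAL_IDLE_TIMEOUT
} ToyInvalCause;

/*
 * Assumption for this sketch: only invalidation caused by removed WAL can
 * be undone once the standby has caught up from the archive.
 */
bool
toy_cause_is_revalidatable(ToyInvalCause cause)
{
    switch (cause)
    {
        case TOY_INVAL_WAL_REMOVED:
            return true;
        case TOY_INVAL_NONE:
        case TOY_INVAL_HORIZON:
        case TOY_INVAL_WAL_LEVEL:
        case TOY_INVAL_IDLE_TIMEOUT:
            return false;
    }
    return false; /* unknown future causes are never revalidated */
}

/* Clear the cause if it is recoverable; report whether anything changed. */
bool
toy_try_revalidate(ToyInvalCause *invalidated)
{
    assert(invalidated != NULL);

    if (!toy_cause_is_revalidatable(*invalidated))
        return false;
    *invalidated = TOY_INVAL_NONE;
    return true;
}
```

Listing every cause explicitly, with a conservative fallback, means a newly added cause would not be silently treated as recoverable.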

As for recreating vs. not recreating the slots: in environments with
many clusters under disk space constraints, this would help
tremendously. There are probably many users who would prefer
self-healing in situations where it is possible.

Self-healing doesn't mean not reporting it. Users can later check the
logs for the reason it happened and prevent it from happening in
the future.

If making this the default is a concern, could it be a flag on the slot?
Something like "self-healing: true". That way, any possible self-healing
operations are enabled only for that slot. It would also allow future
self-healing enhancements to sit behind the same flag, so they do not
run when someone prefers error+investigate instead
of self-heal+investigate.

--
Regards,
João Foltran

Re: [BUG] [PATCH] Allow physical replication slots to recover from archive after invalidation

From: Amit Kapila
Date: January 14, 14:20:53

On Tue, Jan 6, 2026 at 3:26 AM Joao Foltran <joao@foltrandba.com> wrote:
>
> > The slots could be invalidated due to other reasons like
> > RS_INVAL_IDLE_TIMEOUT as well.
>
> We could just filter which invalidation reasons could be "revalidated"
> for only reasons that can be resolved this way.
>

Can we make the slot valid even if the required WAL is made available
afterwards? What about the rows removed due to the slot's xmin?

--
With Regards,
Amit Kapila.

Re: [BUG] [PATCH] Allow physical replication slots to recover from archive after invalidation

From: Joao Foltran
Date: January 22, 22:41:13

Hi Amit!

Unless hot_standby_feedback = on, xmin would be null on the
physical replication slot.

But even with that parameter enabled, as long as we know that the standby
has already caught up using the archived WAL, the xmin
wouldn't matter, since we don't need those rows to be visible anymore.

I've attached a simple patch and test that revalidate the slot
after it is lost. The patch does not yet filter on anything besides
whether the slot is physical or logical, but we can add filters for
specific invalidation causes.

Let me know what you think.

Regards,
João Foltran

Attachments:
- v1-0001-Revalidate-physical-slot-after-standby-recovery-pg18.patch
