Re: POC: enable logical decoding when wal_level = 'replica' without a server restart

Поиск
Список
Период
Сортировка
От Masahiko Sawada
Тема Re: POC: enable logical decoding when wal_level = 'replica' without a server restart
Дата
Msg-id CAD21AoBs9d9ie5syRYT1Bxu6A3Q7SVRxvPEaZZyVnhkmO=GNWg@mail.gmail.com
обсуждение исходный текст
Ответ на Re: POC: enable logical decoding when wal_level = 'replica' without a server restart  (Amit Kapila <amit.kapila16@gmail.com>)
Список pgsql-hackers
On Wed, Oct 1, 2025 at 10:48 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Oct 1, 2025 at 4:31 PM Hayato Kuroda (Fujitsu)
> <kuroda.hayato@fujitsu.com> wrote:
> >
> > Dear Sawada-san,
> >
> > > If we implement these ideas, we can simplify the patch quite well as
> > > we no longer need the lazy behavior nor wait for the recovery to
> > > complete. I've attached a PoC patch that can be applied on top of the
> > > v15 patch.
> >
> > In 0002, I found an assertion failure. Steps:
> >
> > 0. There is a streaming replication system and only primary has a logical slot.
> > 1. Attached to a startup process and set a break at UpdateLogicalDecodingStatusEndOfRecovery.
> > 2. Sent a promote signal to the standby and ensured the startup stopped.
> > 3. Established new connection to the standby
> > 4. Attached to the backend process and set a break at create_logical_replication_slot.
> > 5. Tried to create a new slot on the standby and ensured the backend stopped
> > 6. Moved the startup process till WaitForProcSignalBarrier().
> > 7. Moved the backend process till WaitForProcSignalBarrier(). Both processes could go ahead.
> > 8. Moved the backend till ReplicationSlotReserveWal() and restart_lsn was set.
> > 9. Detached from the startup process. Recovery state became "DONE".
> > 10. Detached from the backend. It would crash at xlog_decode().
> >
> > Some data was obtained by the gdb, see [1].
> >
> > Direct cause is that restart_lsn of the slot points the value before STATUS_CHANGE(false).
> > Per my analysis, ReplicationSlotReserveWal() uses GetXLogReplayRecPtr(NULL) as the
> > initial decode point, which is the last record the standby receives from the primary.
> > However, the standby can generate additional record, STATUS_CHANGE (false) in
> > this case. After the recovery, the decoder would read the STATUS_CHANGE record,
> > but it breaks our assumption.
> >
> > Per my understanding, this cannot happen with 0001 because EnsureLogicalDecodingEnabled()
> > waits till RecoveryInProgress() becomes false.
> >
> > How should we fix the issue? One approach is to remove the Assert() and ereport(ERROR),
> > but even in the case the slot may not be able to establish the consistent snapshot.
> >
>
> The other point to consider is that during promotion after
> UpdateLogicalDecodingStatusEndOfRecovery(), we have multiple things
> that seems to be necessary to perform before backends are allowed to
> write. For example, refer to comments: "If any of the critical GUCs
> have changed, log them before we allow backends to write WAL.*/. I
> think the key thing is that before we set state DB_IN_PRODUCTION in
> ControlFile and mark SharedRecoverstate as RECOVERY_STATE_DONE,
> backends shouldn't be allowed to write WAL. If we want to take an
> exception for writing a WAL during slot_creation before the
> RECOVERY_STATE_DONE is set, we should analyze and explain in comments
> why it is okay to take this exception.

Agreed.

As the discussion is becoming more complex, let me summarize our
discussion about the delay_status_change flag and lazy behavior.

The delay_status_change flag was created to handle a specific timing
issue: there's a brief window where backend processes can
enable/disable logical decoding but cannot write the STATUS_CHANGE
record. This occurs because after the startup process updates the
logical decoding status (in
UpdateLogicalDecodingStatusEndOfRecovery()), backend processes cannot
write WAL records until the startup sets SharedRecoveryState to
RECOVERY_STATE_DONE. The idea is to delay any logical decoding status
changes during this window until WAL writing is permitted system-wide.
An alternative idea being discussed is to allow an exception for
STATUS_CHANGE records, letting them be written even during this
window. While this alternative is simpler and technically feasible, it
could be risky as it breaks the general rule that 'backends cannot
write WAL records until recovery completes.'

When the process exits or raises an ERROR,  the process needs to clean
up temporary and ephemeral slots, which might disable logical
decoding. This deactivation process may involve waiting - either for
concurrent activation/deactivation processes to finish or due to the
delay_status_flag (if implemented). However, waiting during user-level
cleanup (in before_shmem_exit callbacks) isn't ideal since the process
blocks all interrupts. To address this, we introduced lazy behavior,
which delegates the deactivation process to the checkpointer, allowing
it to disable logical decoding asynchronously. This way, the
deactivation during user-level cleanup only needs to disable logical
decoding in shared memory and send signals.

While we've discussed that if we don't use the idea of the
delay_status_flag we don't need the lazy behavior either, I find that
we still need lazy behavior to handle waits during concurrent status
changes. Moreover, since we need lazy behavior anyway, the benefits of
implementing the exception-based approach seem limited.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



В списке pgsql-hackers по дате отправления: