Re: POC: enable logical decoding when wal_level = 'replica' without a server restart
От | Masahiko Sawada |
---|---|
Тема | Re: POC: enable logical decoding when wal_level = 'replica' without a server restart |
Дата | |
Msg-id | CAD21AoDM0CZh6F5j4KTHyLMbjQ5xqHQtp6xY=H8MY9WDiPeumQ@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: POC: enable logical decoding when wal_level = 'replica' without a server restart (shveta malik <shveta.malik@gmail.com>) |
Ответы |
Re: POC: enable logical decoding when wal_level = 'replica' without a server restart
|
Список | pgsql-hackers |
On Thu, Sep 25, 2025 at 10:43 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Fri, Sep 26, 2025 at 12:46 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Thu, Sep 25, 2025 at 4:57 AM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > On Tue, Sep 23, 2025 at 3:28 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > > > > > I've attached the updated patch. It incorporates all comments I got so > > > > far and implements to lazily disable logical decoding. It's used only > > > > when the process tries to disable logical decoding during process > > > > exit. > > > > > > > > > > I am resuming the review now. I agree with the discussion of lazily > > > disabling logical decoding on ERROR or process-exit for temp-slot. > > > > > > Few initial comments: > > > > Thank you for the comments! > > > > > > > > 1) > > > I see that on standby too, during proc-exit, we set 'pending_disable'. > > > But it never resets it, as DisableLogicalDecodingIfNecessary is no-op > > > on standby. And thus the checkpoint keeps on attempting to reset it > > > everytime. Do we even need to set it on standby? > > > > > > Logfile has repeated: 'start completing pending logical decoding > > > disable request' > > > > Ugh, I missed that part. I think that standbys should not delegate the > > deactivation to the checkpointer uless the deactivation is actually > > required. > > > > > 2) > > > + ereport(LOG, > > > + (errmsg("skip disabling logical decoding as during process exit"))); > > > > > > 'as' not needed. > > > > I've fixed the above two points and attached the new version patch. > > > > Thanks. > > 1) > Currently, in the existing implementation, if a promotion is in > progress (delay_status_change = true) and, during that time, a process > exits (causing a temporary slot to be released), then on the standby, > we may end up setting pending_disable. As a result, the checkpointer > will have to wait for the transition to complete before it can proceed > with disabling logical decoding (if needed). > > a) > This means the checkpoint may be delayed further, depending on how > long it takes for all processes to respond to ProcSignalBarrier(). > > b) > Additionally, consider the case where the promotion fails midway > (after UpdateLogicalDecodingStatusEndOfRecovery). If the checkpointer > still sees RecoveryInProgress and delay_status_change as true, could > it end up waiting indefinitely for the transition to complete? In my > testing, when promotion fails and the startup process exits, it > usually causes the rest of the processes, including the checkpointer, > to terminate as well. So, it seems that a dangling pending_disable > state may not actually occur on standby in practice. > > I believe scenario (b) can't really happen, but I still wanted to > check with you. I think so. We don't allow the system to continue starting up if the startup process exits with an error. > I am not sure if (a) is a real concern — what’s your take on it? Since the startup sets delay_status_change=true after creating a checkpoint by PerformRecoveryXLogAction(), I thought that it would not be a problem in practice even if the checkpointer ends up waiting for the recovery to complete. On the other hand, once the process delegated the deactivation to the checkpointer, it would also be okay not to disable logical decoding at its first attempt. One required change would be that if the checkpointer also skips the deactivation when delay_status_change=true, the startup would need to wake up the checkpointer after completing the recovery. Otherwise, the checkpointer might not disable logical decoding until the next checkpoint time. I wanted to avoid adding this part but I'm open to other opinions. > 2) > As per discussion in [1], there was a proposal to implement lazily > disabling decoding both in ERROR and proc-exit scenarios. But I see it > only implemented in proc-exit scenario. Are we planning to do it for > ERROR as well? After more thoughts, I realized that I missed the fact that we actually wrote an ABORT record during the process shutdown. ShutdownPostgres() that calls AbortOutOfAnyTransaction() is the last callback in before_shmem_exit callbacks. So it's probably okay to write STATUS_CHANGE record to disable logical decoding even during process shutdown. As for the race condition at the end of recovery between the startup process and processes updating the logical decoding status, we use delay_status_change flag so that any logical decoding status change initiated in the particular window (i.e., between the startup sets delay_status_change and the recovery completes) has to wait for the startup to complete all end-of-recovery actions. An alternative idea would be that we allow processes to write STATUS_CHANGE records in the particular window even during recovery, by using LocalSetXLogInsertAllowed(). If we implement these ideas, we can simplify the patch quite well as we no longer need the lazy behavior nor wait for the recovery to complete. I've attached a PoC patch that can be applied on top of the v15 patch. Feedback is very welcome. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Вложения
В списке pgsql-hackers по дате отправления: