Discussion: [PATCH] Accept connections post recovery without waiting for RemoveOldXlogFiles
[PATCH] Accept connections post recovery without waiting for RemoveOldXlogFiles
From
Nitin Motiani
Date:
Hi Hackers,
I'd like to propose a patch to allow accepting connections post recovery without waiting for the removal of old xlog files.
Why : We have seen instances where crash recovery takes very long (tens of minutes to hours) when a large number of accumulated WAL files need to be cleaned up (e.g., cleaning up 2M old WAL files took close to 4 hours).
This WAL accumulation is usually caused by :
1. Inactive replication slot
2. PITR failing to keep up
In the above cases when the resolution (deleting inactive slot/disabling PITR) is followed by a crash (before checkpoint could run), we see the recovery take a very long time. Note that in these cases the actual WAL replay is done relatively quickly and most of the delay is due to RemoveOldXlogFiles().
How : This patch solves this issue by running RemoveOldXlogFiles() separately and asynchronously. This is achieved by doing two things :
1. Skip RemoveOldXlogFiles() for an END_OF_RECOVERY checkpoint. This will ensure that the recovery finishes sooner and postgres can start accepting connections.
2. After the recovery we run another checkpoint without CHECKPOINT_WAIT. This is done in StartupXLOG(). This will lead to some extra work but that should be minuscule as it is run right after the recovery. And the majority of work done by this checkpoint will be in RemoveOldXlogFiles() which can now run asynchronously.
I considered a couple of alternative solutions before attempting this.
1. One option could be to simply skip the removal of old xlog files during recovery and let a later checkpoint take care of that. But with a large checkpoint_timeout, the accumulated WAL could then stay on disk for much longer.
2. Another approach might be to separate out RemoveOldXlogFiles() in a new request. This might also be doable by creating a special checkpoint flag like CHECKPOINT_ONLY_DELETE_OLD_FILES and using that in RequestCheckpoint(). This way we can have the second checkpoint only take care of file deletion. I ended up picking my approach over this because that can be done with a smaller change which might make it safer and less error-prone.
I would like to know what folks think of these alternative approaches vs the current one.
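The ordering of the two steps described above can be illustrated with a toy model. This is only a sketch and not the patch itself: `checkpoint_steps` and `startup_after_redo` are hypothetical names, the flag values are illustrative stand-ins for the CHECKPOINT_* bits, and the follow-up checkpoint is printed in sequence here although in the real server it would run in the checkpointer while clients connect.

```shell
# Toy model of the proposed ordering; these are NOT PostgreSQL functions.
# Flag values are illustrative stand-ins for the CHECKPOINT_* bits.
CHECKPOINT_END_OF_RECOVERY=2
CHECKPOINT_WAIT=32

# Print the steps a checkpoint with the given flags takes in this model.
checkpoint_steps() {
    flags=$1
    echo "write checkpoint"
    # Proposed change: an end-of-recovery checkpoint skips WAL removal.
    if [ $(( $flags & $CHECKPOINT_END_OF_RECOVERY )) -eq 0 ]; then
        echo "remove old WAL files"
    fi
}

# Print the sequence that follows redo in this model.
startup_after_redo() {
    # 1. End-of-recovery checkpoint: startup waits for it, removal is skipped.
    checkpoint_steps $(( $CHECKPOINT_END_OF_RECOVERY | $CHECKPOINT_WAIT ))
    echo "accept connections"
    # 2. Follow-up checkpoint requested without CHECKPOINT_WAIT, so nothing
    #    blocks on it; it is the one that removes the old WAL files.
    checkpoint_steps 0
}

startup_after_redo
```

In this model "accept connections" is printed before "remove old WAL files", which is the property the patch is after.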
Repro Steps : To repro this, I inserted and deleted a few billion rows while keeping an inactive replication slot at the publisher. I changed wal_segsize to 1MB to increase the number of files for a smaller amount of data. With 1.5TB worth of WAL files, I could consistently reproduce a 40-minute delay. With 300GB it was around 10 minutes. With the proposed patch, connections started being accepted right after redo was done.
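To put the WAL volumes above in terms of file counts, here is a rough calculation. It assumes binary units (1 TB = 1024 * 1024 MB) and that every segment is full-size; `wal_segment_count` is a hypothetical helper, not a PostgreSQL tool.

```shell
# Estimate how many WAL segment files a given WAL volume corresponds to.
# Arguments: total WAL size in MB, segment size in MB.
wal_segment_count() {
    total_mb=$1
    seg_mb=$2
    echo $(( $total_mb / $seg_mb ))
}

wal_segment_count $(( 1536 * 1024 )) 1    # 1.5 TB with 1 MB segments
wal_segment_count $(( 300 * 1024 )) 1     # 300 GB with 1 MB segments
wal_segment_count $(( 1536 * 1024 )) 16   # 1.5 TB with 16 MB segments
```

Under these assumptions the three calls print 1572864, 307200 and 98304, so the 1 MB segment size gives roughly 1.5 million files for the 1.5TB case.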
These are the steps to reproduce this.
1. I created the instance with following settings :
wal-segsize=1MB (pg_ctl -D $PUB_DATA init -o --wal-segsize=1)
checkpoint_timeout=86400 to stop periodic checkpoint from running (echo "checkpoint_timeout = 86400" >> $PUB_DATA/postgresql.conf)
max_wal_size=102400 to avoid too many checkpoints during transactions (echo "max_wal_size = 102400" >> $PUB_DATA/postgresql.conf)
2. I created a database test and then ran the following commands :
create table t_pub(id int);
alter table t_pub replica identity full;
create publication p;
alter publication p add table t_pub;
3. I created a subscriber instance (to create an inactive replication slot). I didn't change any config settings for this instance. Here also I created a db test and ran the following commands :
create table t_pub(id int);
create subscription s connection 'application_name=test host=localhost user=nitinmotiani dbname=test port=5001' publication p;
alter subscription s refresh publication;
4. I stopped the subscriber instance and checked on the first instance (publisher) that there was an inactive replication slot by running the following command :
select slot_name, active from pg_replication_slots;
5. I inserted and deleted data by running the following pair of commands multiple times :
insert into t_pub select generate_series(1, 2000000000);
delete from t_pub;
Depending on the number of times these are run, we can get different amounts of data in the WAL directory.
6. I dropped the replication slot using the following :
select pg_drop_replication_slot('s');
7. I killed one of the postgres processes to trigger crash recovery.
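Before triggering the crash in step 7, it can be useful to check how many files have actually accumulated in the WAL directory. A minimal sketch, assuming the WAL directory is `$PUB_DATA/pg_wal` and that counting regular files at the top level is close enough (it also counts any timeline history or backup label files there); `count_wal_files` is a hypothetical helper.

```shell
# Count regular files directly inside a WAL directory.
# Subdirectories such as archive_status are not descended into.
count_wal_files() {
    find "$1" -maxdepth 1 -type f | wc -l | tr -d ' '
}

# Example (path is an assumption based on the repro settings):
#   count_wal_files "$PUB_DATA/pg_wal"
```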
By checking the logs, I confirmed that the patch reduces the crash recovery time significantly. With the patch, while the removal of old xlog files was running asynchronously, I also ran a few queries, created another table, inserted data, etc., and it all worked.
I'm attaching the patch here. Currently I have not added any tests as I would like to get feedback on this approach to solve the problem vs the alternatives. Please let me know what you think.
Thanks & Regards,
Nitin Motiani
Google
Re: [PATCH] Accept connections post recovery without waiting for RemoveOldXlogFiles
From
Dilip Kumar
Date:
On Mon, Sep 8, 2025 at 3:03 PM Nitin Motiani <nitinmotiani@google.com> wrote:
>
> Hi Hackers,
>
> I'd like to propose a patch to allow accepting connections post recovery without waiting for the removal of old xlog files.
>
> Why : We have seen instances where the crash recovery takes very long (tens of minutes to hours) if a large number of accumulated WAL files need to be cleaned up (eg : Cleaning up 2M old WAL files took close to 4 hours).
>
> This WAL accumulation is usually caused by :
>
> 1. Inactive replication slot
> 2. PITR failing to keep up
>
> In the above cases when the resolution (deleting inactive slot/disabling PITR) is followed by a crash (before checkpoint could run), we see the recovery take a very long time. Note that in these cases the actual WAL replay is done relatively quickly and most of the delay is due to RemoveOldXlogFiles().

It makes sense to improve this.

> How : This patch solves this issue by running RemoveOldXlogFiles() separately and async. This is achieved by doing two things :
>
> 1. Skip RemoveOldXlogFiles() for an END_OF_RECOVERY checkpoint. This will ensure that the recovery finishes sooner and postgres can start accepting connections.
> 2. After the recovery we run another checkpoint without CHECKPOINT_WAIT. This is done in StartupXLOG(). This will lead to some extra work but that should be minuscule as it is run right after the recovery. And the majority of work done by this checkpoint will be in RemoveOldXlogFiles() which can now run asynchronously.
>
> I considered a couple of alternative solutions before attempting this.
>
> 1. One option could be to simply skip the removal of old xlog files during recovery and let a later checkpoint take care of that. But in case of large checkpoint_timeout, this could lead to bloat for longer.
>
> 2. Another approach might be to separate out RemoveOldXlogFiles() in a new request. This might also be doable by creating a special checkpoint flag like CHECKPOINT_ONLY_DELETE_OLD_FILES and using that in RequestCheckpoint(). This way we can have the second checkpoint only take care of file deletion. I ended up picking my approach over this because that can be done with a smaller change which might make it safer and less error-prone.

One of the advantages of this approach over forcing an extra checkpoint is that you don't need to loop through the entire buffer pool just to find out mostly nothing is dirty, but yeah this may create some extra flags and extra checks in checkpointer code.

--
Regards,
Dilip Kumar
Google
Re: [PATCH] Accept connections post recovery without waiting for RemoveOldXlogFiles
From
Amit Kapila
Date:
On Mon, Sep 8, 2025 at 3:03 PM Nitin Motiani <nitinmotiani@google.com> wrote:
>
> I'd like to propose a patch to allow accepting connections post recovery without waiting for the removal of old xlog files.
>
> Why : We have seen instances where the crash recovery takes very long (tens of minutes to hours) if a large number of accumulated WAL files need to be cleaned up (eg : Cleaning up 2M old WAL files took close to 4 hours).
>
> This WAL accumulation is usually caused by :
>
> 1. Inactive replication slot
> 2. PITR failing to keep up
>
> In the above cases when the resolution (deleting inactive slot/disabling PITR) is followed by a crash (before checkpoint could run), we see the recovery take a very long time. Note that in these cases the actual WAL replay is done relatively quickly and most of the delay is due to RemoveOldXlogFiles().
>

Isn't it better to fix the reasons for WAL accumulation? Because even without recovery, this can fill up the disk. For example, one can use idle_replication_slot_timeout for inactive slots. Similarly, we can see what leads to slow PITR and try to avoid that.

--
With Regards,
Amit Kapila.
Re: [PATCH] Accept connections post recovery without waiting for RemoveOldXlogFiles
From
Dilip Kumar
Date:
On Tue, Sep 9, 2025 at 12:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Sep 8, 2025 at 3:03 PM Nitin Motiani <nitinmotiani@google.com> wrote:
> >
> > I'd like to propose a patch to allow accepting connections post recovery without waiting for the removal of old xlog files.
> >
> > Why : We have seen instances where the crash recovery takes very long (tens of minutes to hours) if a large number of accumulated WAL files need to be cleaned up (eg : Cleaning up 2M old WAL files took close to 4 hours).
> >
> > This WAL accumulation is usually caused by :
> >
> > 1. Inactive replication slot
> > 2. PITR failing to keep up
> >
> > In the above cases when the resolution (deleting inactive slot/disabling PITR) is followed by a crash (before checkpoint could run), we see the recovery take a very long time. Note that in these cases the actual WAL replay is done relatively quickly and most of the delay is due to RemoveOldXlogFiles().
> >
>
> Isn't it better to fix the reasons for WAL accumulation? Because even
> without recovery, this can fill up the disk. For example, one can use
> idle_replication_slot_timeout for inactive slots. Similarly, we can
> see what leads to slow PITR and try to avoid that.

I agree that in the ideal world it's better if someone can set 'idle_replication_slot_timeout' correctly so that we don't even create WAL accumulation. But that's not always the case with the user and there are situations where WAL gets accumulated. In this context, the goal is to address the problem after it has already happened, minimizing additional downtime for the user. I feel this is a reasonable goal although we can think more about whether it is worth issuing the extra checkpoint for improving this situation.

--
Regards,
Dilip Kumar
Google
Re: [PATCH] Accept connections post recovery without waiting for RemoveOldXlogFiles
From
Fujii Masao
Date:
On Mon, Sep 8, 2025 at 6:33 PM Nitin Motiani <nitinmotiani@google.com> wrote:
>
> Hi Hackers,
>
> I'd like to propose a patch to allow accepting connections post recovery without waiting for the removal of old xlog files.

As another idea, could crash recovery avoid waiting for the end-of-recovery checkpoint itself to finish, similar to archive recovery? In other words, crash recovery would write the end-of-recovery WAL record and request a checkpoint, but not block until it completes. Thought?

One concern, though: in your case, the first checkpoint after crash recovery could take a very long time, since it needs to remove a large number of WAL files. This could delay subsequent checkpoints beyond checkpoint_timeout. If so, perhaps we'd need to limit how many WAL files a single checkpoint can remove.

Regards,

--
Fujii Masao
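The idea of limiting how many WAL files a single checkpoint removes can be pictured with a toy model that deletes at most a fixed number of files per pass. This is only an illustration on a scratch directory, not how the checkpointer works and not something to run against a real pg_wal; `remove_wal_batch` and the limit of 2 are hypothetical.

```shell
# Toy model: remove at most $2 regular files from directory $1, in name
# order, and print how many were removed. Do NOT point this at a real pg_wal.
remove_wal_batch() {
    dir=$1
    limit=$2
    removed=0
    for f in "$dir"/*; do
        [ -f "$f" ] || continue               # skip subdirectories, empty glob
        [ "$removed" -lt "$limit" ] || break  # cap reached for this pass
        rm -f -- "$f"
        removed=$(( $removed + 1 ))
    done
    echo "$removed"
}

# Demo on a scratch directory holding five fake segment files.
demo_dir=$(mktemp -d)
for i in 1 2 3 4 5; do
    touch "$demo_dir/00000001000000000000000$i"
done
remove_wal_batch "$demo_dir" 2   # first pass removes 2
remove_wal_batch "$demo_dir" 2   # second pass removes 2
remove_wal_batch "$demo_dir" 2   # third pass removes the last 1
rm -rf "$demo_dir"
```

Each pass stays short, at the cost of the backlog taking several passes to clear, which is the trade-off raised in the replies about bloat lasting longer.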
Re: [PATCH] Accept connections post recovery without waiting for RemoveOldXlogFiles
From
Dilip Kumar
Date:
On Tue, Sep 9, 2025 at 1:23 PM Fujii Masao <masao.fujii@gmail.com> wrote:
>
> On Mon, Sep 8, 2025 at 6:33 PM Nitin Motiani <nitinmotiani@google.com> wrote:
> >
> > Hi Hackers,
> >
> > I'd like to propose a patch to allow accepting connections post recovery without waiting for the removal of old xlog files.
>
> As another idea, could crash recovery avoid waiting for the end-of-recovery
> checkpoint itself to finish, similar to archive recovery? In other words,
> crash recovery would write the end-of-recovery WAL record and request
> a checkpoint, but not block until it completes. Thought?

Thanks for your input Fujii. The end-of-recovery checkpoint needs to set checkpoint.redo to serve as a new recovery starting point. This prevents a full recovery from the previous checkpoint in the event of a crash. However, setting checkpoint.redo requires that no other backend is generating concurrent WAL. For this reason, the end-of-recovery checkpoint cannot run concurrently.

--
Regards,
Dilip Kumar
Google
Re: [PATCH] Accept connections post recovery without waiting for RemoveOldXlogFiles
From
Nitin Motiani
Date:
On Tue, Sep 9, 2025 at 1:23 PM Fujii Masao <masao.fujii@gmail.com> wrote:
>
> On Mon, Sep 8, 2025 at 6:33 PM Nitin Motiani <nitinmotiani@google.com> wrote:
> >
> > Hi Hackers,
> >
> > I'd like to propose a patch to allow accepting connections post recovery without waiting for the removal of old xlog files.
>
> As another idea, could crash recovery avoid waiting for the end-of-recovery
> checkpoint itself to finish, similar to archive recovery? In other words,
> crash recovery would write the end-of-recovery WAL record and request
> a checkpoint, but not block until it completes. Thought?
>

Thanks for the feedback Fujii. I'll look into this. Although based on Dilip's reply it is probably not feasible.

> One concern, though: in your case, the first checkpoint after crash recovery
> could take a very long time, since it needs to remove a large number of
> WAL files. This could delay subsequent checkpoints beyond checkpoint_timeout.
> If so, perhaps we'd need to limit how many WAL files a single checkpoint
> can remove.
>

The limiting of WAL files is something we only want to do for this checkpoint or in general for all checkpoints? A couple of thoughts on these options :

1. If we only do it for the post end-of-recovery checkpoint, we will have to add special handling for that case and perhaps that reduces the simplicity of this approach. Also if we just do it for the first checkpoint after recovery, a future checkpoint might again spend a lot of time removing these files and delay subsequent checkpoints.

2. We can do it for all checkpoints but that can cause the bloat to last for a far longer period. One alternative might be to provide a guc to set the num/size of wal files or a timeout for this step which would require some tuning from the users.

Also what do you think of the simple method of skipping removal of files at recovery time and let the future checkpoints take care of it? One reason I went with this solution over the others was that in the current state, the system is down for all the time of removal of files. But with this, the only thing which might be delayed is the checkpoint and that seems like an improvement. But it would be great to get your thoughts on this and the other alternatives.

Thanks & Regards,
Nitin Motiani
Google