Discussion: Questions about the continuity of WAL archiving
Hi,
There is a scenario: the current timeline of the PostgreSQL primary node is 1, and the latest WAL file is 100. The standby node has also received up to WAL file 100. However, the latest WAL file archived is only file 80. If the primary node crashes at this point and the standby is promoted to the new primary, archiving will resume from file 100 on timeline 2. As a result, WAL files from 81 to 100 on timeline 1 will be missing from the archive.
Is there a good solution to prevent this situation?
Regards,
Pixian Shi
On 8/7/25 20:20, px shi wrote:
> Hi,
> There is a scenario: the current timeline of the PostgreSQL primary node
> is 1, and the latest WAL file is 100. The standby node has also received
> up to WAL file 100. However, the latest WAL file archived is only file
> 80. If the primary node crashes at this point and the standby is
> promoted to the new primary, archiving will resume from file 100 on
> timeline 2. As a result, WAL files from 81 to 100 on timeline 1 will be
> missing from the archive.

What are you planning to do with the archived files?

Also, is it not the case that once the primary crashes you are in a split-brain situation and can't really trust its timeline anymore?

> Is there a good solution to prevent this situation?
>
> Regards,
> Pixian Shi

--
Adrian Klaver
adrian.klaver@aklaver.com
Thank you for your reply.
The archived files can be used for PITR (Point-In-Time Recovery), allowing recovery to any point between WAL 80 and 100 on timeline 1.
Additionally, if there's a backup taken during timeline 1 and a switchover to a new primary has occurred without taking a new full backup yet, these WAL logs can still be used to recover to any point on timeline 2.
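(For context, the restore I have in mind is the standard one: restore a base backup taken on timeline 1 and replay archived WAL up to a target. Roughly, with a placeholder stanza name and timestamp:

    # postgresql.conf of the instance restored from the timeline-1 base backup
    restore_command          = 'pgbackrest --stanza=main archive-get %f "%p"'
    recovery_target_time     = '2025-08-07 12:00:00+00'   # any point covered by archived WAL
    recovery_target_timeline = 'current'                  # stay on timeline 1 instead of following the newest
    # then create recovery.signal in the data directory and start the server

Without files 81 to 100 in the archive, any target past file 80 on timeline 1 is unreachable.)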
Regards,
Pixian Shi
On 8/7/25 22:50, px shi wrote:
> Thank you for your reply.
> The archived files can be used for PITR (Point-In-Time Recovery),
> allowing recovery to any point between WAL 80 and 100 on timeline 1.
> Additionally, if there's a backup taken during timeline 1 and a
> switchover to a new primary has occurred without taking a new full
> backup yet, these WAL logs can still be used to recover to any point on
> timeline 2.

Alright, I see. Two things:

1) What is the current archiving setup on the primary and why is it lagging?

2) Have you looked at archiving off the standby node while it is in standby, per:

https://www.postgresql.org/docs/current/warm-standby.html#CONTINUOUS-ARCHIVING-IN-STANDBY

--
Adrian Klaver
adrian.klaver@aklaver.com
> There is a scenario: the current timeline of the PostgreSQL primary node is 1, and the latest WAL file is 100. The standby node has also received up to WAL file 100. However, the latest WAL file archived is only file 80. If the primary node crashes at this point and the standby is promoted to the new primary, archiving will resume from file 100 on timeline 2. As a result, WAL files from 81 to 100 on timeline 1 will be missing from the archive.
>
> Is there a good solution to prevent this situation?
I'm still not clear on what the problem here is, other than your archiving not keeping up; the best solution is to fix that lag.

Yes, you would lose some ability to do easy PITR for segments 80-100, but it could still be done by resurrecting your crashed primary, or by carefully grabbing the WAL files from the replica before they get recycled. You can set archive_mode=always on the replicas to help with this.
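For example, something along these lines on each replica (a sketch; the stanza name is a placeholder, and changing archive_mode needs a restart):

    # postgresql.conf on the standby
    archive_mode    = always    # archive WAL received via streaming, not only WAL generated locally
    archive_command = 'pgbackrest --stanza=main archive-push %p'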
Cheers,
Greg
--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support
On Fri, Aug 8, 2025 at 2:26 PM Greg Sabino Mullane <htamfids@gmail.com> wrote:
> > Is there a good solution to prevent this situation?
>
> I'm still not clear on what the problem here is, other than your archiving not keeping up; the best solution is to fix that lag. Yes, you would lose some ability to do easy PITR for segments 80-100, but it could still be done by resurrecting your crashed primary, or by carefully grabbing the WAL files from the replica before they get recycled. You can set archive_mode=always on the replicas to help with this.
Bog-standard PgBackRest retains all WAL files required for a full backup set and its associated differential/incremental backups, no? I've certainly done more than one --type=time --target="${RestoreUntil}" restore without giving a second thought to timelines or whether the WAL exists.
Maybe I've just ignored the problem, since it (seemingly) does everything for PITR backups.
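To be concrete, the kind of invocation I mean (stanza name, timestamp, and options here are illustrative only):

    pgbackrest --stanza=main --delta \
        --type=time --target="2025-08-08 09:00:00+00" \
        --target-action=promote restore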
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!
> 1) What is the current archiving setup on the primary and why is it lagging?

The archive command uses pgBackRest to archive to S3. Because the WAL is uploaded to S3, archiving is slow, which has caused it to lag.

> 2) Have you looked at archiving off the standby node while it is in standby?

Currently, archiving on the standby node is disabled. Is it recommended to share the WAL archive between the primary and standby nodes to avoid interruptions in archiving?
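Concretely, the archiving setup is roughly of this shape (the stanza name is a placeholder):

    # postgresql.conf on the primary
    archive_mode    = on
    archive_command = 'pgbackrest --stanza=main archive-push %p'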
> I'm still not clear on what the problem here is, other than your archiving not keeping up.

In my scenario, archive_mode is not set to always on the replicas, which may leave gaps in the archived WAL.

> You can set archive_mode=always on the replicas to help with this.

Yes, that can work. Is this the recommended configuration for production use?
> Bog-standard PgBackRest retains all WAL files required for a full backup set and its associated differential/incremental backups.
Yes, WAL files are continuous under normal circumstances. However, if the primary node crashes under high load, the archived WAL logs on S3 may be discontinuous.
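A rough way to spot such a hole is to feed a plain listing of the archived segment names for one timeline into a small script like this (a sketch only; it assumes the default 16MB wal_segment_size and gawk, and the listing could come from pgbackrest repo-ls, aws s3 ls, or a plain ls of a filesystem archive):

    # wal-gaps.sh -- report holes in a list of archived WAL segment names
    # usage: <listing of segment names for ONE timeline> | sh wal-gaps.sh
    sort -u | gawk '
        match($0, /^[0-9A-F]{24}/) {
            name = substr($0, 1, 24)
            if (name == last) next          # skip .backup/.partial companions of the same segment
            seq = strtonum("0x" substr(name, 9, 8)) * 256 + strtonum("0x" substr(name, 17, 8))
            if (last != "" && seq != prev + 1)
                printf "gap after %s: %d segment(s) missing\n", last, seq - prev - 1
            prev = seq; last = name
        }'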
On 8/12/25 01:24, px shi wrote:
> > 1) What is the current archiving setup on the primary and why is it lagging?
>
> The archive command uses pgBackRest to archive to S3. Because the WAL is uploaded to S3, archiving is slow, which has caused it to lag.
>
> > 2) Have you looked at archiving off the standby node while it is in standby?
>
> Currently, archiving on the standby node is disabled. Is it recommended to share the WAL archive between the primary and standby nodes to avoid interruptions in archiving?

Given that you are using a less than capable storage solution (S3), why do you think pushing the WAL from the standby to S3 would perform any better than what is happening with the primary WAL?

The solution is to use a more capable storage platform.

--
Adrian Klaver
adrian.klaver@aklaver.com
On Tue, 12 Aug 2025 at 17:14, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
> Given that you are using a less than capable storage solution (S3), why do
> you think pushing the WAL from the standby to S3 would perform any
> better than what is happening with the primary WAL?
>
> The solution is to use a more capable storage platform.

That is an interesting point you make, Adrian. S3 seems quite popular for this type of archiving. What would you suggest as a more capable (and cost effective) storage platform?

Regards
Bob
On Tue, Aug 12, 2025 at 4:37 AM px shi <spxlyy123@gmail.com> wrote:
> > Bog-standard PgBackRest retains all WAL files required for a full backup set and its associated differential/incremental backups.
>
> Yes, WAL files are continuous under normal circumstances. However, if the primary node crashes under high load, the archived WAL logs on S3 may be discontinuous.
1) PG does not purge WAL files that are needed for immediate crash recovery.
2) PgBackRest can archive (compressed and encrypted) WAL files to S3. https://pgbackrest.org/user-guide-rhel.html#s3-support
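For the archives, the relevant pgbackrest.conf bits look roughly like this (bucket, endpoint, key, stanza, and path values are all placeholders):

    [global]
    repo1-type=s3
    repo1-path=/pgbackrest
    repo1-s3-bucket=my-wal-archive
    repo1-s3-endpoint=s3.us-east-1.amazonaws.com
    repo1-s3-region=us-east-1
    repo1-s3-key=<access-key>
    repo1-s3-key-secret=<secret-key>
    # encrypt the repository and compress WAL before upload
    repo1-cipher-type=aes-256-cbc
    repo1-cipher-pass=<passphrase>
    compress-type=zst

    [main]
    pg1-path=/var/lib/postgresql/17/main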
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!
On 8/12/25 10:40, Bob Jolliffe wrote:
> On Tue, 12 Aug 2025 at 17:14, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
> > The solution is to use a more capable storage platform.
>
> That is an interesting point you make, Adrian. S3 seems quite popular
> for this type of archiving. What would you suggest as a more capable

Yes, but from here:

https://pgbackrest.org/user-guide-rhel.html#s3-support

File creation time in S3 is relatively slow so backup/restore performance is improved by enabling file bundling.

Where file bundling is explained here:

https://pgbackrest.org/user-guide-rhel.html#backup/bundle

Though I don't think it would help in this case.

> (and cost effective) storage platform?

I would say anything that does not use object storage and instead uses block storage, so you are not doing the conversion. I have no specific recommendations, as this is not something I do (archive to the cloud).

--
Adrian Klaver
adrian.klaver@aklaver.com
Hi, Adrian
> Given that you are using a less than capable storage solution (S3), why do
> you think pushing the WAL from the standby to S3 would perform any
> better than what is happening with the primary WAL?
What I mean is setting archive_mode to on on the primary and to always on the standby. This way, even if the primary crashes, the standby can still archive WAL files that the primary did not manage to archive.
> The solution is to use a more capable storage platform.
However, I believe that even if we use a more capable storage platform, it is still impossible to archive WAL files in real time. As long as real-time archiving cannot be achieved, there will always be some WAL files that are not archived if the primary node crashes.
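That lag is at least easy to watch; a query along these lines, run on the primary, shows how far the archiver is behind the WAL that has already been written:

    -- last segment handed to archive_command vs. the segment currently being written
    SELECT last_archived_wal,
           last_archived_time,
           pg_walfile_name(pg_current_wal_lsn()) AS current_wal_file,
           failed_count
    FROM pg_stat_archiver;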
How often does your primary node crash, and then not recover due to WAL corruption or WAL files not existing?

If it's _ever_ happened, you should _fix that_ instead of rolling your own WAL archival process.
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!
> How often does your primary node crash, and then not recover due to WAL corruption or WAL files not existing?
>
> If it's _ever_ happened, you should _fix that_ instead of rolling your own WAL archival process.
I once encountered a case where the recovery process failed to restore to the latest LSN due to missing WAL files in the archive. The root cause was multiple failovers between primary and standby. During one of the switchovers, the primary crashed before completing the archiving of all WAL files. When the standby was promoted to primary, it began archiving WAL files for the new timeline, resulting in a gap between the WAL files of the two timelines. Moreover, no base backup was taken during this period.
On Tue, Aug 12, 2025 at 10:24 PM px shi <spxlyy123@gmail.com> wrote:
> > How often does your primary node crash, and then not recover due to WAL corruption or WAL files not existing?
> >
> > If it's _ever_ happened, you should _fix that_ instead of rolling your own WAL archival process.
>
> I once encountered a case where the recovery process failed to restore to the latest LSN due to missing WAL files in the archive. The root cause was multiple failovers between primary and standby. During one of the switchovers, the primary crashed before completing the archiving of all WAL files. When the standby was promoted to primary, it began archiving WAL files for the new timeline, resulting in a gap between the WAL files of the two timelines. Moreover, no base backup was taken during this period.

I am not sure what the problem is here either, other than something seriously wrong with the configuration of PostgreSQL and pgBackRest.
The replica should be receiving the WAL via a replication slot using streaming, meaning the primary will keep the WAL until the replica has caught up. If the replica becomes disconnected and max_slot_wal_keep_size is exceeded, the replica's restore_command can take over and fetch WAL from the archive to catch the replica up. This assumes hot_standby_feedback is on, so WAL replay won't be delayed by snapshot locks on the replica.

If all the above is true, the replica should never lag behind unless its disk I/O layer is way undersized compared to the primary. S3 is being talked about, so it makes me wonder about the disk I/O configuration on the primary vs the replica; I can see this causing lag under high load where the replica's I/O layer is the bottleneck.
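Roughly, that wiring looks like this (the slot, host, and stanza names are only examples):

    -- on the primary: a physical slot so WAL is retained until the replica has received it
    SELECT pg_create_physical_replication_slot('standby1');

    # postgresql.conf on the primary
    max_slot_wal_keep_size = 50GB    # cap how much WAL a stalled slot may pin

    # postgresql.conf on the standby
    primary_conninfo     = 'host=primary-host user=replicator'
    primary_slot_name    = 'standby1'
    hot_standby_feedback = on
    restore_command      = 'pgbackrest --stanza=main archive-get %f "%p"'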
If pgBackRest can't keep up with WAL archiving, then as others have stated you need to configure asynchronous archiving. The number of workers depends on the load. I have a server running 8 parallel workers to archive 1 TB of WAL daily, and another server that, during maintenance tasks, generates around 10,000 WAL files in about 2 hours using 6 pgBackRest workers, all to S3 buckets.
The above statement makes me wonder if there is some kind of high-availability monitor running, such as pg_auto_failover, that is promoting a replica and then converting the former primary into a replica of the recently promoted one.

If that matches what is happening, it is very easy to mess up the configuration for WAL archiving and backups. Part of the process of promoting a replica is to make sure WAL archiving is working; right after being promoted, the replica kicks off autovacuum to rebuild things like the free space map, which generates a lot of WAL.
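A quick post-promotion sanity check looks something like this (the stanza name again is just an example):

    # confirm the promoted node can reach the repository and that archiving works end to end
    pgbackrest --stanza=main check

    # or from psql: force a segment switch and watch it show up in the archiver stats
    psql -c "SELECT pg_switch_wal();"
    psql -c "SELECT last_archived_wal, last_failed_wal FROM pg_stat_archiver;"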
If you are losing WAL files, the configuration is wrong somewhere.

There is just not enough information on the series of events and the configuration to tell what the root cause is, other than misconfiguration.
Thanks
Justin
Here’s a scenario: The latest WAL file on the primary node is 0000000100000000000000AF, and the standby node has also received up to 0000000100000000000000AF. However, the latest WAL file that has been successfully archived from the primary is only 0000000100000000000000A1 (WAL files from A2 to AE have not yet been archived). If the primary crashes at this point, triggering a failover, the new primary will start generating and archiving WAL on a new timeline (2), beginning with 0000000200000000000000AF. It will not backfill the missing WAL files from timeline 1 (0000000100000000000000A2 to 0000000100000000000000AE). As a result, while the new primary does not have any local WAL gaps, the archive directory will contain a gap in that WAL range.
I’m not sure if I explained it clearly.
On Wed, Aug 13, 2025 at 1:48 AM px shi <spxlyy123@gmail.com> wrote:
> Here’s a scenario: The latest WAL file on the primary node is 0000000100000000000000AF, and the standby node has also received up to 0000000100000000000000AF. However, the latest WAL file that has been successfully archived from the primary is only 0000000100000000000000A1 (WAL files from A2 to AE have not yet been archived). [...] As a result, while the new primary does not have any local WAL gaps, the archive directory will contain a gap in that WAL range.
>
> I’m not sure if I explained it clearly.
This will happen if a lagging replica is promoted before it has had a chance to catch up; that is working as designed. There are several tools available to tell us whether the replica is in sync before promoting. In the above case a lagging replica was promoted, so it stops looking at the previous timeline and will NOT look for the missing WAL files from that timeline; the replica does not even know they exist anymore.

The data in the previous timeline is not accessible anymore from the promoted replica; it is working on a new timeline. The only place the old-timeline WAL files are accessible is on the crashed primary, which never archived or streamed those WAL files to the replica.

Promoting an out-of-sync or lagging replica will result in loss of data.
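For completeness: if the crashed primary's data directory does come back intact, the leftover timeline-1 segments sitting in its pg_wal can in principle be pushed into the same archive by hand. A rough sketch, assuming a stanza named 'main' and the old server kept shut down:

    cd /var/lib/postgresql/17/main/pg_wal
    for seg in 00000001*; do
        # archive-push is the same command archive_command runs; it should accept
        # segments that are not yet present in the repository
        pgbackrest --stanza=main archive-push "$PWD/$seg" || break
    done

Whether that is worth the trouble depends on how badly PITR coverage of timeline 1 is needed.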
Does this answer the question here?
On 8/12/25 22:48, px shi wrote:
> Here’s a scenario: The latest WAL file on the primary node is 0000000100000000000000AF, and the standby node has also received up to 0000000100000000000000AF. However, the latest WAL file that has been successfully archived from the primary is only 0000000100000000000000A1 (WAL files from A2 to AE have not yet been archived). [...] As a result, while the new primary does not have any local WAL gaps, the archive directory will contain a gap in that WAL range.
>
> I’m not sure if I explained it clearly.

Why does it matter?

1) Your standby is starting off up to date.

2) You can do a pg_basebackup from the new primary as a base for the restart of the old primary. Assuming you have archiving set up on the new primary, the restarted primary can catch up.

3) If you don't want to do 2), then you need an archive location that can deal with the velocity of the WAL archiving.

--
Adrian Klaver
adrian.klaver@aklaver.com
On Tue, Aug 12, 2025 at 3:20 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
> File creation time in S3 is relatively slow so backup/restore
> performance is improved by enabling file bundling.
Just to be clear for the archives, pgBackRest's file bundling only applies to backups, not to WAL, which is what the OP is dealing with here.
Pixian Shi, what sort of WAL volume are we dealing with (in WAL segments generated per minute)?
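A rough way to measure it from psql, if that is easier than counting archived files (a sketch; \gset is psql-only):

    -- sample the WAL insert position twice, one minute apart
    SELECT pg_current_wal_lsn() AS start_lsn \gset
    SELECT pg_sleep(60);
    SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), :'start_lsn')) AS wal_per_minute;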
Cheers,
Greg
--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support