Обсуждение: Postgres restore sometimes restores to a point 2 days in the past
I'm running postgres 16.6
If a WAL archive is asked that doesn't exist yet, the script naturally cannot find it, and exits with status code 1. This is the end of the recovery.
My backup strategy is: basebackup and WAL archive. These get uploaded to the cloud.
The restore is on an isolated machine and is performed daily. It downloads the basebackup, unpacks it, sets a recovery.signal, and a script is provided as restore_command, to download the WAL archives %f and unpack them into %p
In the script, the final unpacking is simply "gzip -dc %f > %p". The gz files are first checked with "gzip -t".
There are a few tables that are known to receive new entries multiple times per day. However, the state of the recovery showed the latest item to be 2 days in the past. Checking the live DB, there are an expected amount of items since that ID.
I checked the logs, the last WAL archive that got downloaded is indeed the last one that was available. The one that failed to download on the restore machine, was uploaded to the cloud 8 minutes later, according to the upload logs on the live DB.
The postgres logs themselves seem perfectly normal. It logs all these WAL recoveries, switches the timeline, and becomes available.
What could be going wrong? My main issue is that I don't know where to start looking, since nothing in the logs seems abnormal.
Regards,
Koen De Groote
On Fri, 2025-01-31 at 10:47 +0100, Koen De Groote wrote: > I'm running postgres 16.6 > > My backup strategy is: basebackup and WAL archive. These get uploaded to the cloud. > > The restore is on an isolated machine and is performed daily. It downloads the > basebackup, unpacks it, sets a recovery.signal, and a script is provided as > restore_command, to download the WAL archives %f and unpack them into %p > > In the script, the final unpacking is simply "gzip -dc %f > %p". The gz files > are first checked with "gzip -t". > > If a WAL archive is asked that doesn't exist yet, the script naturally cannot > find it, and exits with status code 1. This is the end of the recovery. > > There are a few tables that are known to receive new entries multiple times > per day. However, the state of the recovery showed the latest item to be 2 > days in the past. Checking the live DB, there are an expected amount of items > since that ID. > > I checked the logs, the last WAL archive that got downloaded is indeed the > last one that was available. The one that failed to download on the restore > machine, was uploaded to the cloud 8 minutes later, according to the upload > logs on the live DB. > > The postgres logs themselves seem perfectly normal. It logs all these WAL > recoveries, switches the timeline, and becomes available. > > What could be going wrong? My main issue is that I don't know where to start > looking, since nothing in the logs seems abnormal. I don't know, that all sounds like it is working as it should. If the last WAL archive that got downloaded by the "restore_command" is indeed the last one that was available, recovery did just what it is supposed to. If new WAL segments get archived later, that's too late. Perhaps you are looking for replication, not for restoring a backup, which is necessarily not totally up to date. Yours, Laurenz Albe
On 1/31/25 01:47, Koen De Groote wrote: Comments in line. > I'm running postgres 16.6 > > My backup strategy is: basebackup and WAL archive. These get uploaded to > the cloud. > > The restore is on an isolated machine and is performed daily. It > downloads the basebackup, unpacks it, sets a recovery.signal, and a > script is provided as restore_command, to download the WAL archives %f > and unpack them into %p > What is the complete pg_basebackup command? > In the script, the final unpacking is simply "gzip -dc %f > %p". The gz > files are first checked with "gzip -t". > > If a WAL archive is asked that doesn't exist yet, the script naturally > cannot find it, and exits with status code 1. This is the end of the > recovery. I don't understand the above. What is determining that a particular WAL file should be asked for? > > There are a few tables that are known to receive new entries multiple > times per day. However, the state of the recovery showed the latest item > to be 2 days in the past. Checking the live DB, there are an expected > amount of items since that ID. How active is the primary database you are pulling from? > > I checked the logs, the last WAL archive that got downloaded is indeed > the last one that was available. The one that failed to download on the > restore machine, was uploaded to the cloud 8 minutes later, according to > the upload logs on the live DB. Available where? If that was the last one available how could the subsequent one be a failure to download? > > The postgres logs themselves seem perfectly normal. It logs all these > WAL recoveries, switches the timeline, and becomes available. > > What could be going wrong? My main issue is that I don't know where to > start looking, since nothing in the logs seems abnormal. > > Regards, > Koen De Groote -- Adrian Klaver adrian.klaver@aklaver.com
No, it's meant to be an off-site restore, as to do a daily check if the restore actually works.
Regards,
Koen De Groote
On Fri, Jan 31, 2025 at 2:30 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
On Fri, 2025-01-31 at 10:47 +0100, Koen De Groote wrote:
> I'm running postgres 16.6
>
> My backup strategy is: basebackup and WAL archive. These get uploaded to the cloud.
>
> The restore is on an isolated machine and is performed daily. It downloads the
> basebackup, unpacks it, sets a recovery.signal, and a script is provided as
> restore_command, to download the WAL archives %f and unpack them into %p
>
> In the script, the final unpacking is simply "gzip -dc %f > %p". The gz files
> are first checked with "gzip -t".
>
> If a WAL archive is asked that doesn't exist yet, the script naturally cannot
> find it, and exits with status code 1. This is the end of the recovery.
>
> There are a few tables that are known to receive new entries multiple times
> per day. However, the state of the recovery showed the latest item to be 2
> days in the past. Checking the live DB, there are an expected amount of items
> since that ID.
>
> I checked the logs, the last WAL archive that got downloaded is indeed the
> last one that was available. The one that failed to download on the restore
> machine, was uploaded to the cloud 8 minutes later, according to the upload
> logs on the live DB.
>
> The postgres logs themselves seem perfectly normal. It logs all these WAL
> recoveries, switches the timeline, and becomes available.
>
> What could be going wrong? My main issue is that I don't know where to start
> looking, since nothing in the logs seems abnormal.
I don't know, that all sounds like it is working as it should.
If the last WAL archive that got downloaded by the "restore_command" is indeed
the last one that was available, recovery did just what it is supposed to.
If new WAL segments get archived later, that's too late.
Perhaps you are looking for replication, not for restoring a backup, which is
necessarily not totally up to date.
Yours,
Laurenz Albe
> What is the complete pg_basebackup command?
> What is determining that a particular WAL file should be asked for?
The command: pg_basebackup -h <IP> -p <PORT> -U <USERNAME> -D <ABSOLUTE_PATH> -Ft -z -P -v --wal-method=none
So basically the same as the 2nd example here: https://www.postgresql.org/docs/16/app-pgbasebackup.html except for the verbose flag and the wal-method flag.
The wal-method is none for 2 reasons:
1/ Experience teaches that, in the event of storage being on a network, timeouts to write WAL archives tot he network location can cause WAL creation during a basebackup to be considered failed, and that causes the entire basebackup to be considered failed, even if a retry occurs. Any failure during a basebackup will cause postgres to auto-delete it at the very end of pg_basebackup, declaring it "unusable". This is extremely bad in backups that take very long. Better to not include WAL files in the basebackup, and just get them after the fact.
2/ All my WAL files are archived and uploaded to the cloud. So, I can just have them downloaded.
This has worked for months on end, and so has restoring.
The wal-method is none for 2 reasons:
1/ Experience teaches that, in the event of storage being on a network, timeouts to write WAL archives tot he network location can cause WAL creation during a basebackup to be considered failed, and that causes the entire basebackup to be considered failed, even if a retry occurs. Any failure during a basebackup will cause postgres to auto-delete it at the very end of pg_basebackup, declaring it "unusable". This is extremely bad in backups that take very long. Better to not include WAL files in the basebackup, and just get them after the fact.
2/ All my WAL files are archived and uploaded to the cloud. So, I can just have them downloaded.
This has worked for months on end, and so has restoring.
> I don't understand the above.
> What is determining that a particular WAL file should be asked for?
The postgres server itself does this. Here's the documentation: https://www.postgresql.org/docs/16/runtime-config-wal.html#GUC-RESTORE-COMMAND
In practice, Postgres will see the "standby.signal" file and start asking for WAL files. It will read the database it has and determine what the next WAL filename should be. And then it asks for it. And it will keep asking for these hexadecimal filenames, 1 at a time, for as long as the command or set of commands provided to "restore_command" returns exit code 0. If the process receives any other exit code, it stops recovery, switches timeline, and considers the database to be up and running at the state its in.
It's constantly asking "I want this file now" and the script I have as the restore command will attempt to download it from the cloud. Then it will attempt to unzip it and move it into place. If any of these steps fails, I return exit code 1.
> How active is the primary database you are pulling from?
Very active, plus automated testing to ensure everything is still running, this will generate multiple items per day on its own.
> Available where?
The cloud, as I stated: WAL files get archived and these archived files are then uploaded to the cloud.
Regards,
Koen De Groote
On Fri, Jan 31, 2025 at 5:50 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 1/31/25 01:47, Koen De Groote wrote:
Comments in line.
> I'm running postgres 16.6
>
> My backup strategy is: basebackup and WAL archive. These get uploaded to
> the cloud.
>
> The restore is on an isolated machine and is performed daily. It
> downloads the basebackup, unpacks it, sets a recovery.signal, and a
> script is provided as restore_command, to download the WAL archives %f
> and unpack them into %p
>
What is the complete pg_basebackup command?
> In the script, the final unpacking is simply "gzip -dc %f > %p". The gz
> files are first checked with "gzip -t".
>
> If a WAL archive is asked that doesn't exist yet, the script naturally
> cannot find it, and exits with status code 1. This is the end of the
> recovery.
I don't understand the above.
What is determining that a particular WAL file should be asked for?
>
> There are a few tables that are known to receive new entries multiple
> times per day. However, the state of the recovery showed the latest item
> to be 2 days in the past. Checking the live DB, there are an expected
> amount of items since that ID.
How active is the primary database you are pulling from?
>
> I checked the logs, the last WAL archive that got downloaded is indeed
> the last one that was available. The one that failed to download on the
> restore machine, was uploaded to the cloud 8 minutes later, according to
> the upload logs on the live DB.
Available where?
If that was the last one available how could the subsequent one be a
failure to download?
>
> The postgres logs themselves seem perfectly normal. It logs all these
> WAL recoveries, switches the timeline, and becomes available.
>
> What could be going wrong? My main issue is that I don't know where to
> start looking, since nothing in the logs seems abnormal.
>
> Regards,
> Koen De Groote
--
Adrian Klaver
adrian.klaver@aklaver.com
Any advice on a different mailing list that something like this would be more suited to?
Regards,
Koen De Groote
On Fri, Jan 31, 2025 at 8:38 PM Koen De Groote <kdg.dev@gmail.com> wrote:
No, it's meant to be an off-site restore, as to do a daily check if the restore actually works.Regards,Koen De GrooteOn Fri, Jan 31, 2025 at 2:30 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:On Fri, 2025-01-31 at 10:47 +0100, Koen De Groote wrote:
> I'm running postgres 16.6
>
> My backup strategy is: basebackup and WAL archive. These get uploaded to the cloud.
>
> The restore is on an isolated machine and is performed daily. It downloads the
> basebackup, unpacks it, sets a recovery.signal, and a script is provided as
> restore_command, to download the WAL archives %f and unpack them into %p
>
> In the script, the final unpacking is simply "gzip -dc %f > %p". The gz files
> are first checked with "gzip -t".
>
> If a WAL archive is asked that doesn't exist yet, the script naturally cannot
> find it, and exits with status code 1. This is the end of the recovery.
>
> There are a few tables that are known to receive new entries multiple times
> per day. However, the state of the recovery showed the latest item to be 2
> days in the past. Checking the live DB, there are an expected amount of items
> since that ID.
>
> I checked the logs, the last WAL archive that got downloaded is indeed the
> last one that was available. The one that failed to download on the restore
> machine, was uploaded to the cloud 8 minutes later, according to the upload
> logs on the live DB.
>
> The postgres logs themselves seem perfectly normal. It logs all these WAL
> recoveries, switches the timeline, and becomes available.
>
> What could be going wrong? My main issue is that I don't know where to start
> looking, since nothing in the logs seems abnormal.
I don't know, that all sounds like it is working as it should.
If the last WAL archive that got downloaded by the "restore_command" is indeed
the last one that was available, recovery did just what it is supposed to.
If new WAL segments get archived later, that's too late.
Perhaps you are looking for replication, not for restoring a backup, which is
necessarily not totally up to date.
Yours,
Laurenz Albe
On 1/31/25 12:10, Koen De Groote wrote: > > What is the complete pg_basebackup command? > 2/ All my WAL files are archived and uploaded to the cloud. So, I can > just have them downloaded. I should have asked earlier what is the archive command? Are > > What is determining that a particular WAL file should be asked for? > > The postgres server itself does this. Here's the documentation: > https://www.postgresql.org/docs/16/runtime-config-wal.html#GUC-RESTORE-COMMAND <https://www.postgresql.org/docs/16/runtime-config-wal.html#GUC-RESTORE-COMMAND> > > And here: https://www.postgresql.org/docs/current/warm-standby.html > <https://www.postgresql.org/docs/current/warm-standby.html> > > In practice, Postgres will see the "standby.signal" file and start In your OP you say: "It downloads the basebackup, unpacks it, sets a recovery.signal, ..." Are you setting standby.signal or recovery.signal or both? > asking for WAL files. It will read the database it has and determine > what the next WAL filename should be. And then it asks for it. And it > will keep asking for these hexadecimal filenames, 1 at a time, for as > long as the command or set of commands provided to "restore_command" > returns exit code 0. If the process receives any other exit code, it > stops recovery, switches timeline, and considers the database to be up > and running at the state its in. > > It's constantly asking "I want this file now" and the script I have as > the restore command will attempt to download it from the cloud. Then it > will attempt to unzip it and move it into place. If any of these steps > fails, I return exit code 1. > > Regards, > Koen De Groote > > -- Adrian Klaver adrian.klaver@aklaver.com
> I should have asked earlier what is the archive command
The example from the documentation, but with GZIP. So from the documentation: https://www.postgresql.org/docs/16/continuous-archiving.html#BACKUP-ARCHIVING-WAL
Which becomes this: archive_command = 'test ! -f /mnt/server/archivedir/%f.gz && gzip -c %p /mnt/server/archivedir/%f.gz'
> Are you setting standby.signal or recovery.signal or both
Sorry, I keep confusing them. I checked my deployment code, it's recovery.signal and only that.
Regards,
Koen De Groote
On Fri, Jan 31, 2025 at 10:26 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 1/31/25 12:10, Koen De Groote wrote:
> > What is the complete pg_basebackup command?
> 2/ All my WAL files are archived and uploaded to the cloud. So, I can
> just have them downloaded.
I should have asked earlier what is the archive command?
Are
> > What is determining that a particular WAL file should be asked for?
>
> The postgres server itself does this. Here's the documentation:
> https://www.postgresql.org/docs/16/runtime-config-wal.html#GUC-RESTORE-COMMAND <https://www.postgresql.org/docs/16/runtime-config-wal.html#GUC-RESTORE-COMMAND>
>
> And here: https://www.postgresql.org/docs/current/warm-standby.html
> <https://www.postgresql.org/docs/current/warm-standby.html>
>
> In practice, Postgres will see the "standby.signal" file and start
In your OP you say:
"It downloads the basebackup, unpacks it, sets a recovery.signal, ..."
Are you setting standby.signal or recovery.signal or both?
> asking for WAL files. It will read the database it has and determine
> what the next WAL filename should be. And then it asks for it. And it
> will keep asking for these hexadecimal filenames, 1 at a time, for as
> long as the command or set of commands provided to "restore_command"
> returns exit code 0. If the process receives any other exit code, it
> stops recovery, switches timeline, and considers the database to be up
> and running at the state its in.
>
> It's constantly asking "I want this file now" and the script I have as
> the restore command will attempt to download it from the cloud. Then it
> will attempt to unzip it and move it into place. If any of these steps
> fails, I return exit code 1.
>
> Regards,
> Koen De Groote
>
>
--
Adrian Klaver
adrian.klaver@aklaver.com
On 1/31/25 13:52, Koen De Groote wrote: > > I should have asked earlier what is the archive command > > The example from the documentation, but with GZIP. So from the > documentation: > https://www.postgresql.org/docs/16/continuous-archiving.html#BACKUP-ARCHIVING-WAL <https://www.postgresql.org/docs/16/continuous-archiving.html#BACKUP-ARCHIVING-WAL> > > Which becomes this: archive_command = 'test ! -f > /mnt/server/archivedir/%f.gz && gzip -c %p /mnt/server/archivedir/%f.gz' > > > > Are you setting standby.signal or recovery.signal or both > > Sorry, I keep confusing them. I checked my deployment code, it's > recovery.signal and only that. Do you have any of the Recovery Target settings from below set?: https://www.postgresql.org/docs/current/runtime-config-wal.html#RUNTIME-CONFIG-WAL-RECOVERY-TARGET > > Regards, > Koen De Groote > > -- Adrian Klaver adrian.klaver@aklaver.com