Обсуждение: Postgres restore sometimes restores to a point 2 days in the past

Поиск

Список

Период

Сортировка

Postgres restore sometimes restores to a point 2 days in the past

От

Koen De Groote

Дата:

31 января 2025 г., 12:47:17

I'm running postgres 16.6

My backup strategy is: basebackup and WAL archive. These get uploaded to the cloud.

The restore is on an isolated machine and is performed daily. It downloads the basebackup, unpacks it, sets a recovery.signal, and a script is provided as restore_command, to download the WAL archives %f and unpack them into %p

In the script, the final unpacking is simply "gzip -dc %f > %p". The gz files are first checked with "gzip -t".

If a WAL archive is asked that doesn't exist yet, the script naturally cannot find it, and exits with status code 1. This is the end of the recovery.

There are a few tables that are known to receive new entries multiple times per day. However, the state of the recovery showed the latest item to be 2 days in the past. Checking the live DB, there are an expected amount of items since that ID.

I checked the logs, the last WAL archive that got downloaded is indeed the last one that was available. The one that failed to download on the restore machine, was uploaded to the cloud 8 minutes later, according to the upload logs on the live DB.

The postgres logs themselves seem perfectly normal. It logs all these WAL recoveries, switches the timeline, and becomes available.

What could be going wrong? My main issue is that I don't know where to start looking, since nothing in the logs seems abnormal.

Regards,

Koen De Groote

Re: Postgres restore sometimes restores to a point 2 days in the past

От

Laurenz Albe

Дата:

31 января 2025 г., 16:30:01

On Fri, 2025-01-31 at 10:47 +0100, Koen De Groote wrote:
> I'm running postgres 16.6
>
> My backup strategy is: basebackup and WAL archive. These get uploaded to the cloud.
>
> The restore is on an isolated machine and is performed daily. It downloads the
> basebackup, unpacks it, sets a recovery.signal, and a script is provided as
> restore_command, to download the WAL archives %f and unpack them into %p
>
> In the script, the final unpacking is simply "gzip -dc %f > %p". The gz files
> are first checked with "gzip -t".
>
> If a WAL archive is asked that doesn't exist yet, the script naturally cannot
> find it, and exits with status code 1. This is the end of the recovery.
>
> There are a few tables that are known to receive new entries multiple times
> per day. However, the state of the recovery showed the latest item to be 2
> days in the past. Checking the live DB, there are an expected amount of items
> since that ID.
>
> I checked the logs, the last WAL archive that got downloaded is indeed the
> last one that was available. The one that failed to download on the restore
> machine, was uploaded to the cloud 8 minutes later, according to the upload
> logs on the live DB.
>
> The postgres logs themselves seem perfectly normal. It logs all these WAL
> recoveries, switches the timeline, and becomes available.
>
> What could be going wrong? My main issue is that I don't know where to start
> looking, since nothing in the logs seems abnormal.

I don't know, that all sounds like it is working as it should.

If the last WAL archive that got downloaded by the "restore_command" is indeed
the last one that was available, recovery did just what it is supposed to.
If new WAL segments get archived later, that's too late.

Perhaps you are looking for replication, not for restoring a backup, which is
necessarily not totally up to date.

Yours,
Laurenz Albe

Re: Postgres restore sometimes restores to a point 2 days in the past

От

Adrian Klaver

Дата:

31 января 2025 г., 19:50:34

On 1/31/25 01:47, Koen De Groote wrote:

Comments in line.

> I'm running postgres 16.6
> 
> My backup strategy is: basebackup and WAL archive. These get uploaded to 
> the cloud.
> 
> The restore is on an isolated machine and is performed daily. It 
> downloads the basebackup, unpacks it, sets a recovery.signal, and a 
> script is provided as restore_command, to download the WAL archives %f 
> and unpack them into %p
> 

What is the complete pg_basebackup command?

> In the script, the final unpacking is simply "gzip -dc %f > %p". The gz 
> files are first checked with "gzip -t".
> 
> If a WAL archive is asked that doesn't exist yet, the script naturally 
> cannot find it, and exits with status code 1. This is the end of the 
> recovery.

I don't understand the above.

What is determining that a particular WAL file should be asked for?

> 
> There are a few tables that are known to receive new entries multiple 
> times per day. However, the state of the recovery showed the latest item 
> to be 2 days in the past. Checking the live DB, there are an expected 
> amount of items since that ID.

How active is the primary database you are pulling from?

> 
> I checked the logs, the last WAL archive that got downloaded is indeed 
> the last one that was available. The one that failed to download on the 
> restore machine, was uploaded to the cloud 8 minutes later, according to 
> the upload logs on the live DB.

Available where?

If that was the last one available how could the subsequent one be a 
failure to download?

> 
> The postgres logs themselves seem perfectly normal. It logs all these 
> WAL recoveries, switches the timeline, and becomes available.
> 
> What could be going wrong? My main issue is that I don't know where to 
> start looking, since nothing in the logs seems abnormal.
> 
> Regards,
> Koen De Groote

-- 
Adrian Klaver
adrian.klaver@aklaver.com

Re: Postgres restore sometimes restores to a point 2 days in the past

От

Koen De Groote

Дата:

31 января 2025 г., 22:38:49

No, it's meant to be an off-site restore, as to do a daily check if the restore actually works.

Regards,

Koen De Groote

On Fri, Jan 31, 2025 at 2:30 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:

On Fri, 2025-01-31 at 10:47 +0100, Koen De Groote wrote:
> I'm running postgres 16.6
>
> My backup strategy is: basebackup and WAL archive. These get uploaded to the cloud.
>
> The restore is on an isolated machine and is performed daily. It downloads the
> basebackup, unpacks it, sets a recovery.signal, and a script is provided as
> restore_command, to download the WAL archives %f and unpack them into %p
>
> In the script, the final unpacking is simply "gzip -dc %f > %p". The gz files
> are first checked with "gzip -t".
>
> If a WAL archive is asked that doesn't exist yet, the script naturally cannot
> find it, and exits with status code 1. This is the end of the recovery.
>
> There are a few tables that are known to receive new entries multiple times
> per day. However, the state of the recovery showed the latest item to be 2
> days in the past. Checking the live DB, there are an expected amount of items
> since that ID.
>
> I checked the logs, the last WAL archive that got downloaded is indeed the
> last one that was available. The one that failed to download on the restore
> machine, was uploaded to the cloud 8 minutes later, according to the upload
> logs on the live DB.
>
> The postgres logs themselves seem perfectly normal. It logs all these WAL
> recoveries, switches the timeline, and becomes available.
>
> What could be going wrong? My main issue is that I don't know where to start
> looking, since nothing in the logs seems abnormal.

I don't know, that all sounds like it is working as it should.

If the last WAL archive that got downloaded by the "restore_command" is indeed
the last one that was available, recovery did just what it is supposed to.
If new WAL segments get archived later, that's too late.

Perhaps you are looking for replication, not for restoring a backup, which is
necessarily not totally up to date.

Yours,
Laurenz Albe

Re: Postgres restore sometimes restores to a point 2 days in the past

От

Koen De Groote

Дата:

31 января 2025 г., 23:10:38

> What is the complete pg_basebackup command?

The command: pg_basebackup -h <IP> -p <PORT> -U <USERNAME> -D <ABSOLUTE_PATH> -Ft -z -P -v --wal-method=none

So basically the same as the 2nd example here: https://www.postgresql.org/docs/16/app-pgbasebackup.html except for the verbose flag and the wal-method flag.

The wal-method is none for 2 reasons:
1/ Experience teaches that, in the event of storage being on a network, timeouts to write WAL archives tot he network location can cause WAL creation during a basebackup to be considered failed, and that causes the entire basebackup to be considered failed, even if a retry occurs. Any failure during a basebackup will cause postgres to auto-delete it at the very end of pg_basebackup, declaring it "unusable". This is extremely bad in backups that take very long. Better to not include WAL files in the basebackup, and just get them after the fact.
2/ All my WAL files are archived and uploaded to the cloud. So, I can just have them downloaded.

This has worked for months on end, and so has restoring.

> I don't understand the above.

> What is determining that a particular WAL file should be asked for?

The postgres server itself does this. Here's the documentation: https://www.postgresql.org/docs/16/runtime-config-wal.html#GUC-RESTORE-COMMAND

And here: https://www.postgresql.org/docs/current/warm-standby.html

In practice, Postgres will see the "standby.signal" file and start asking for WAL files. It will read the database it has and determine what the next WAL filename should be. And then it asks for it. And it will keep asking for these hexadecimal filenames, 1 at a time, for as long as the command or set of commands provided to "restore_command" returns exit code 0. If the process receives any other exit code, it stops recovery, switches timeline, and considers the database to be up and running at the state its in.

It's constantly asking "I want this file now" and the script I have as the restore command will attempt to download it from the cloud. Then it will attempt to unzip it and move it into place. If any of these steps fails, I return exit code 1.

> How active is the primary database you are pulling from?

Very active, plus automated testing to ensure everything is still running, this will generate multiple items per day on its own.

> Available where?

The cloud, as I stated: WAL files get archived and these archived files are then uploaded to the cloud.

See documentation: https://www.postgresql.org/docs/16/runtime-config-wal.html#GUC-ARCHIVE-COMMAND

Regards,

Koen De Groote

On Fri, Jan 31, 2025 at 5:50 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:

On 1/31/25 01:47, Koen De Groote wrote:

Comments in line.

> I'm running postgres 16.6
>
> My backup strategy is: basebackup and WAL archive. These get uploaded to
> the cloud.
>
> The restore is on an isolated machine and is performed daily. It
> downloads the basebackup, unpacks it, sets a recovery.signal, and a
> script is provided as restore_command, to download the WAL archives %f
> and unpack them into %p
>

What is the complete pg_basebackup command?

> In the script, the final unpacking is simply "gzip -dc %f > %p". The gz
> files are first checked with "gzip -t".
>
> If a WAL archive is asked that doesn't exist yet, the script naturally
> cannot find it, and exits with status code 1. This is the end of the
> recovery.

I don't understand the above.

What is determining that a particular WAL file should be asked for?

>
> There are a few tables that are known to receive new entries multiple
> times per day. However, the state of the recovery showed the latest item
> to be 2 days in the past. Checking the live DB, there are an expected
> amount of items since that ID.

How active is the primary database you are pulling from?

>
> I checked the logs, the last WAL archive that got downloaded is indeed
> the last one that was available. The one that failed to download on the
> restore machine, was uploaded to the cloud 8 minutes later, according to
> the upload logs on the live DB.

Available where?

If that was the last one available how could the subsequent one be a
failure to download?

>
> The postgres logs themselves seem perfectly normal. It logs all these
> WAL recoveries, switches the timeline, and becomes available.
>
> What could be going wrong? My main issue is that I don't know where to
> start looking, since nothing in the logs seems abnormal.
>
> Regards,
> Koen De Groote

--
Adrian Klaver
adrian.klaver@aklaver.com

Re: Postgres restore sometimes restores to a point 2 days in the past

От

Koen De Groote

Дата:

31 января 2025 г., 23:11:37

Any advice on a different mailing list that something like this would be more suited to?

Regards,

Koen De Groote

On Fri, Jan 31, 2025 at 8:38 PM Koen De Groote <kdg.dev@gmail.com> wrote:

No, it's meant to be an off-site restore, as to do a daily check if the restore actually works.

Regards,
Koen De Groote

On Fri, Jan 31, 2025 at 2:30 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
On Fri, 2025-01-31 at 10:47 +0100, Koen De Groote wrote:
> I'm running postgres 16.6
>
> My backup strategy is: basebackup and WAL archive. These get uploaded to the cloud.
>
> The restore is on an isolated machine and is performed daily. It downloads the
> basebackup, unpacks it, sets a recovery.signal, and a script is provided as
> restore_command, to download the WAL archives %f and unpack them into %p
>
> In the script, the final unpacking is simply "gzip -dc %f > %p". The gz files
> are first checked with "gzip -t".
>
> If a WAL archive is asked that doesn't exist yet, the script naturally cannot
> find it, and exits with status code 1. This is the end of the recovery.
>
> There are a few tables that are known to receive new entries multiple times
> per day. However, the state of the recovery showed the latest item to be 2
> days in the past. Checking the live DB, there are an expected amount of items
> since that ID.
>
> I checked the logs, the last WAL archive that got downloaded is indeed the
> last one that was available. The one that failed to download on the restore
> machine, was uploaded to the cloud 8 minutes later, according to the upload
> logs on the live DB.
>
> The postgres logs themselves seem perfectly normal. It logs all these WAL
> recoveries, switches the timeline, and becomes available.
>
> What could be going wrong? My main issue is that I don't know where to start
> looking, since nothing in the logs seems abnormal.

I don't know, that all sounds like it is working as it should.

If the last WAL archive that got downloaded by the "restore_command" is indeed
the last one that was available, recovery did just what it is supposed to.
If new WAL segments get archived later, that's too late.

Perhaps you are looking for replication, not for restoring a backup, which is
necessarily not totally up to date.

Yours,
Laurenz Albe

Re: Postgres restore sometimes restores to a point 2 days in the past

От

Adrian Klaver

Дата:

01 февраля 2025 г., 00:26:55

On 1/31/25 12:10, Koen De Groote wrote:
>  > What is the complete pg_basebackup command?

> 2/ All my WAL files are archived and uploaded to the cloud. So, I can 
> just have them downloaded.

I should have asked earlier what is the archive command?

Are
>  > What is determining that a particular WAL file should be asked for?
> 
> The postgres server itself does this. Here's the documentation: 
> https://www.postgresql.org/docs/16/runtime-config-wal.html#GUC-RESTORE-COMMAND
<https://www.postgresql.org/docs/16/runtime-config-wal.html#GUC-RESTORE-COMMAND>
> 
> And here: https://www.postgresql.org/docs/current/warm-standby.html 
> <https://www.postgresql.org/docs/current/warm-standby.html>
> 
> In practice, Postgres will see the "standby.signal" file and start 

In your OP you say:

"It downloads the basebackup, unpacks it, sets a recovery.signal, ..."

Are you setting standby.signal or recovery.signal or both?


> asking for WAL files. It will read the database it has and determine 
> what the next WAL filename should be. And then it asks for it. And it 
> will keep asking for these hexadecimal filenames, 1 at a time, for as 
> long as the command or set of commands provided to "restore_command" 
> returns exit code 0. If the process receives any other exit code, it 
> stops recovery, switches timeline, and considers the database to be up 
> and running at the state its in.
> 
> It's constantly asking "I want this file now" and the script I have as 
> the restore command will attempt to download it from the cloud. Then it 
> will attempt to unzip it and move it into place. If any of these steps 
> fails, I return exit code 1.
> 

> Regards,
> Koen De Groote
> 
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com

Re: Postgres restore sometimes restores to a point 2 days in the past

От

Koen De Groote

Дата:

01 февраля 2025 г., 00:52:39

> I should have asked earlier what is the archive command

The example from the documentation, but with GZIP. So from the documentation: https://www.postgresql.org/docs/16/continuous-archiving.html#BACKUP-ARCHIVING-WAL

Which becomes this: archive_command = 'test ! -f /mnt/server/archivedir/%f.gz && gzip -c %p /mnt/server/archivedir/%f.gz'

> Are you setting standby.signal or recovery.signal or both

Sorry, I keep confusing them. I checked my deployment code, it's recovery.signal and only that.

Regards,

Koen De Groote

On Fri, Jan 31, 2025 at 10:26 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:

On 1/31/25 12:10, Koen De Groote wrote:
> > What is the complete pg_basebackup command?

> 2/ All my WAL files are archived and uploaded to the cloud. So, I can
> just have them downloaded.

I should have asked earlier what is the archive command?

Are
> > What is determining that a particular WAL file should be asked for?
>
> The postgres server itself does this. Here's the documentation:
> https://www.postgresql.org/docs/16/runtime-config-wal.html#GUC-RESTORE-COMMAND <https://www.postgresql.org/docs/16/runtime-config-wal.html#GUC-RESTORE-COMMAND>
>
> And here: https://www.postgresql.org/docs/current/warm-standby.html
> <https://www.postgresql.org/docs/current/warm-standby.html>
>
> In practice, Postgres will see the "standby.signal" file and start

In your OP you say:

"It downloads the basebackup, unpacks it, sets a recovery.signal, ..."

Are you setting standby.signal or recovery.signal or both?

> asking for WAL files. It will read the database it has and determine
> what the next WAL filename should be. And then it asks for it. And it
> will keep asking for these hexadecimal filenames, 1 at a time, for as
> long as the command or set of commands provided to "restore_command"
> returns exit code 0. If the process receives any other exit code, it
> stops recovery, switches timeline, and considers the database to be up
> and running at the state its in.
>
> It's constantly asking "I want this file now" and the script I have as
> the restore command will attempt to download it from the cloud. Then it
> will attempt to unzip it and move it into place. If any of these steps
> fails, I return exit code 1.
>

> Regards,
> Koen De Groote
>
>

--
Adrian Klaver
adrian.klaver@aklaver.com

Re: Postgres restore sometimes restores to a point 2 days in the past

От

Adrian Klaver

Дата:

01 февраля 2025 г., 01:26:56

On 1/31/25 13:52, Koen De Groote wrote:
>  > I should have asked earlier what is the archive command
> 
> The example from the documentation, but with GZIP. So from the 
> documentation: 
> https://www.postgresql.org/docs/16/continuous-archiving.html#BACKUP-ARCHIVING-WAL
<https://www.postgresql.org/docs/16/continuous-archiving.html#BACKUP-ARCHIVING-WAL>
> 
> Which becomes this: archive_command = 'test ! -f 
> /mnt/server/archivedir/%f.gz && gzip -c %p /mnt/server/archivedir/%f.gz'
> 
> 
>  > Are you setting standby.signal or recovery.signal or both
> 
> Sorry, I keep confusing them. I checked my deployment code, it's 
> recovery.signal and only that.

Do you have any of the Recovery Target settings from below set?:

https://www.postgresql.org/docs/current/runtime-config-wal.html#RUNTIME-CONFIG-WAL-RECOVERY-TARGET


> 
> Regards,
> Koen De Groote
> 
> 

-- 
Adrian Klaver
adrian.klaver@aklaver.com

Re: Postgres restore sometimes restores to a point 2 days in the past

От

Koen De Groote

Дата:

19 февраля, 22:30:03

Sorry for reviving this old thread.

I found the cause for this a few days after these messages: The WAL file download was interrupted. It comes from S3, and for some reason the downloaded file was not the same size as the one on S3. So, some network error occurred.

PG tries to restore via the restore_command, which is "gzip - dc". The restore fails because the file is incomplete, and PG concludes it's not receiving files anymore and thus, that the restore is done.

This happens again the next day, because the system is set up to not download files already found on local mount.

Mystery solved.

On Fri, Jan 31, 2025 at 9:47 AM Koen De Groote <kdg.dev@gmail.com> wrote:

I'm running postgres 16.6

My backup strategy is: basebackup and WAL archive. These get uploaded to the cloud.

The restore is on an isolated machine and is performed daily. It downloads the basebackup, unpacks it, sets a recovery.signal, and a script is provided as restore_command, to download the WAL archives %f and unpack them into %p

In the script, the final unpacking is simply "gzip -dc %f > %p". The gz files are first checked with "gzip -t".

If a WAL archive is asked that doesn't exist yet, the script naturally cannot find it, and exits with status code 1. This is the end of the recovery.

There are a few tables that are known to receive new entries multiple times per day. However, the state of the recovery showed the latest item to be 2 days in the past. Checking the live DB, there are an expected amount of items since that ID.

I checked the logs, the last WAL archive that got downloaded is indeed the last one that was available. The one that failed to download on the restore machine, was uploaded to the cloud 8 minutes later, according to the upload logs on the live DB.

The postgres logs themselves seem perfectly normal. It logs all these WAL recoveries, switches the timeline, and becomes available.

What could be going wrong? My main issue is that I don't know where to start looking, since nothing in the logs seems abnormal.

Regards,
Koen De Groote

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Postgres restore sometimes restores to a point 2 days in the past