Discussion: WAL and archive disks full
Hi,
What would be the best course of action for resolving a situation whereby your
postgres instance had crashed due to the wal disk and archive wal disk becoming 100% full? Say
your backups have been failing and your 'monitoring' had not reported it correctly.
You can't start the instance because it needs to write to the WAL disk (which is full), but if you
manually move WAL files off the WAL disk, the archiver will fail because it can't find WAL files it
needs to archive. The instance may also still be in backup mode, because the backups had not
completed due to the disk full situation.
Being new to postgres, I'm trying to understand what actions need to be taken to get the instance
back up and running without compromising recoverability?
Thanks in advance.
Kieren Scott <kierenscott@hotmail.com> wrote:

> What would be the best course of action for resolving a situation
> whereby your postgres instance had crashed due to the wal disk and
> archive wal disk becoming 100% full? Say your backups have been
> failing and your 'monitoring' had not reported it correctly.
>
> You can't start the instance because it needs to write to the WAL
> disk (which is full), but if you manually move WAL files off the
> WAL disk, the archiver will fail because it can't find WAL files
> it needs to archive. The instance may also still be in backup
> mode, because the backups had not completed due to the disk full
> situation.
>
> Being new to postgres, I'm trying to understand what actions need
> to be taken to get the instance back up and running without
> compromising recoverability?

You will get more detailed advice if you avoid hypotheticals and say
exactly what's going on and what your priorities are. For starters,
are you OK with a situation which gets your primary database running
again and lets you start over with a new base backup, or is it
critical that you continue your backup stream without having to take
a new base backup? My advice would depend on the answer to that.

Also, it would be helpful to have an idea what your various mount
points are, how big they are, and what's on them. (If there's
something else *also* on the same mount point as the WAL files, that
might make a difference.) What do you mean, exactly, when you say
your wal disk and archive wal disk are 100% full? (Are those
separate mount points? Did the archive fail to restore, thereby
building up to where the archive later began to fail, or is it a
shared drive?)

-Kevin
Apologies for the hypothetical scenario, I was trying to gain a greater
understanding of what actions postgres would require in order to get the instance
started without any errors (such as archiver errors because wal files had been
wrongly manually deleted in order to free up space).
I'd be happy with a situation which lets us start over again with a new base backup.
We have separate mount points for the wal and archived wal filesystems. Nothing
apart from wal files is written to those filesystems.
I noticed a situation recently whereby our backup scripts had been failing, and the script
had subsequently not been clearing down the archive wal filesystem after a successful backup.
The wal filesystem was almost full because the archive_command couldn't copy wal files
to the archive filesystem.
Sorry it's a bit of a what-if scenario. I can envisage encountering a situation in the future
whereby we hit this problem, and I was trying to put a plan in place for how to deal with it.
Thanks in advance.
> Date: Mon, 23 Aug 2010 16:47:57 -0500
> From: Kevin.Grittner@wicourts.gov
> To: kierenscott@hotmail.com; pgsql-admin@postgresql.org
> Subject: Re: [ADMIN] WAL and archive disks full
Kieren Scott <kierenscott@hotmail.com> wrote:

> Sorry it's a bit of a what-if scenario. I can envisage
> encountering a situation in the future whereby we hit this
> problem, and I was trying to put a plan in place for how to deal
> with it.

Oh, OK. I was afraid you had actually *hit* this situation and were
being coy. No need to apologize for contingency planning! :-)

Hypothetically, in the situation where the stall originated with the
application of files from the archive, fixing that end and clearing
files from the archive directory (or moving or deleting old ones if
they were applying cleanly and just sitting there after application)
would allow the archive process to resume copying and cleaning up
files on the source database.

If someone panicked and deleted files from the pg_xlog directory,
well, the first thing is to try to make sure nobody does that. You
might be able to turn off archiving and get the server to come up.
If not, start by making a complete copy of your data directory and
all of its subdirectories while PostgreSQL is stopped, because you
may wander into trouble and want to try again. If you can't start
with archiving turned off, you might want to look at this:

http://www.postgresql.org/docs/current/interactive/app-pgresetxlog.html

Of course, you want to monitor closely to ensure your backups are
running correctly so you never need any of the above advice. ;-)

-Kevin
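As a rough illustration of the monitoring point above, here is a minimal Python sketch that warns when a filesystem is filling up or when WAL segments are piling up unarchived. It is only a sketch under stated assumptions: the function names, the thresholds, and the idea of counting `.ready` files in `pg_xlog/archive_status` as the "archive backlog" are illustrative choices, not something prescribed in this thread.

```python
import shutil
from pathlib import Path


def percent_used(path):
    """Return how full the filesystem holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def archive_backlog(pg_xlog):
    """Count WAL files still waiting to be archived (one .ready file each)."""
    status_dir = Path(pg_xlog) / "archive_status"
    return sum(1 for _ in status_dir.glob("*.ready"))


def check(paths, pg_xlog, max_percent=80.0, max_backlog=10):
    """Return a list of warning strings; an empty list means nothing to report.

    The default thresholds are arbitrary placeholders (assumption).
    """
    warnings = []
    for path in paths:
        used = percent_used(path)
        if used > max_percent:
            warnings.append(f"{path} is {used:.0f}% full")
    backlog = archive_backlog(pg_xlog)
    if backlog > max_backlog:
        warnings.append(f"{backlog} WAL segments waiting to be archived")
    return warnings
```

A scheduled job could call `check()` with the WAL and archive mount points and send the returned warnings to whoever is on call. A growing backlog is usually visible well before either disk reaches 100%.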
Thanks Kevin.
So if the wal filesystem is 100% full, can you actually start up postgres in archiving mode (so the archive process can resume copying)? Presumably postgres will try to write to the wal filesystem when you start it, fail because the filesystem is full, and then just shut down or abort? Wouldn't you have to free some space in the wal filesystem in order to get postgres up and running?
Thanks for your help.
> Date: Mon, 23 Aug 2010 17:41:59 -0500
> From: Kevin.Grittner@wicourts.gov
> To: kierenscott@hotmail.com; pgsql-admin@postgresql.org
> Subject: Re: [ADMIN] WAL and archive disks full
Kieren Scott <kierenscott@hotmail.com> writes:

> [ hypothetical scenario: ]
> You can't start the instance because it needs to write to the WAL disk
> (which is full), but if you manually move WAL files off the WAL disk,
> the archiver will fail because it can't find WAL files it needs to
> archive.

Uh, no, that shouldn't be a problem. You can manually move the same WAL
files that the archiver would move. Look into the
pg_xlog/archive_status subdirectory. Any WAL files that have a ".ready"
file in there can be moved to archive, and then you delete the .ready
file, and you're good to go.

Of course, if you don't have any .ready files, you're going to need to
look elsewhere for some disk space to reclaim :-(

regards, tom lane
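The manual procedure described above could be sketched in Python roughly as follows. This is an illustration only, under several assumptions that are not from the thread: the server is stopped, the archive is a plain directory that `archive_command` copies files into unchanged, the target has free space (another location if the archive disk is itself full), and the function names are hypothetical. It defaults to a dry run that only reports what would be moved.

```python
import shutil
from pathlib import Path

READY_SUFFIX = ".ready"


def ready_segments(pg_xlog):
    """Return the names of WAL files that have a .ready status file, sorted."""
    status_dir = Path(pg_xlog) / "archive_status"
    return sorted(p.name[:-len(READY_SUFFIX)]
                  for p in status_dir.glob("*" + READY_SUFFIX))


def move_ready_segments(pg_xlog, archive_dir, dry_run=True):
    """Move every .ready WAL file to archive_dir, then delete its .ready file.

    Returns the names that were (or, in a dry run, would be) moved.
    Assumes PostgreSQL is stopped and archive_dir has enough free space.
    """
    pg_xlog = Path(pg_xlog)
    archive_dir = Path(archive_dir)
    moved = []
    for name in ready_segments(pg_xlog):
        segment = pg_xlog / name
        marker = pg_xlog / "archive_status" / (name + READY_SUFFIX)
        target = archive_dir / name
        if not segment.is_file():
            continue  # status file without a WAL file: leave it for a human
        if target.exists():
            raise FileExistsError(f"{target} already exists in the archive")
        if not dry_run:
            shutil.move(str(segment), str(target))
            marker.unlink()  # only after the move has succeeded
        moved.append(name)
    return moved
```

Calling `move_ready_segments(pg_xlog, archive_dir)` first with the default `dry_run=True` shows which files qualify. Files without a `.ready` status file are never touched, which is the point of the advice: only move what the archiver itself would have moved.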
Thank you.
Kieren
> To: kierenscott@hotmail.com
> CC: pgsql-admin@postgresql.org
> Subject: Re: [ADMIN] WAL and archive disks full
> Date: Mon, 23 Aug 2010 20:17:02 -0400
> From: tgl@sss.pgh.pa.us