Thread: AW: Re: Backup and Recovery

AW: Re: Backup and Recovery

From
Zeugswetter Andreas SB
Date:
> > Also, isn't the WAL format rather bulky to archive hours and hours of?

If it were actually too bulky, then it needs to be made less so, since that
directly affects overall performance :-) 

> > I would expect high-level transaction redo records to be much more compact;
> > mixed into the WAL, such records shouldn't make the WAL grow much faster.

All redo records have to be at the tuple level, so what higher-level are you talking 
about ? (statement level redo records would not be able to reproduce the same
resulting table data (keyword: transaction isolation level)) 

> The page images are not needed and can be thrown away once the page is
> completely sync'ed to disk or a checkpoint happens.

Actually they should at least be kept another few seconds to allow "stupid"
disks to actually write the pages :-) But see previous mail, they can also 
help with various BAR restore solutions.

Andreas


Re: AW: Re: Backup and Recovery

From
Bruce Momjian
Date:
> > The page images are not needed and can be thrown away once the page is
> > completely sync'ed to disk or a checkpoint happens.
> 
> Actually they should at least be kept another few seconds to allow "stupid"
> disks to actually write the pages :-) But see previous mail, they can also 
> help with various BAR restore solutions.

Agreed.  They have to be kept a few seconds.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: Re: Backup and Recovery

From
ncm@zembu.com (Nathan Myers)
Date:
On Thu, Jul 05, 2001 at 02:27:01PM +0200, Zeugswetter Andreas SB wrote:
> 
> > Also, isn't the WAL format rather bulky to archive hours and hours of?
> 
> If it were actually too bulky, then it needs to be made less so, since
> that directly affects overall performance :-) 

ISTM that WAL record size trades off against lots of things, including 
(at least) complexity of recovery code, complexity of WAL generation 
code, usefulness in fixing corrupt table images, and processing time
it would take to produce smaller log entries.  

Complexity is always expensive, and CPU time spent "pre-sync" is a lot
more expensive than time spent in background.  That is, time spent
generating the raw log entries affects latency and peak capacity, 
where time in background mainly affects average system load.

For a WAL, the balance seems to be far to the side of simple-and-bulky.
For other uses, the balance is sure to be different.

> > > I would expect high-level transaction redo records to be much more
> > > compact; mixed into the WAL, such records shouldn't make the WAL
> > > grow much faster.
> 
> All redo records have to be at the tuple level, so what higher-level
> are you talking about ? (statement level redo records would not be
> able to reproduce the same resulting table data (keyword: transaction
> isolation level)) 

Statement-level redo records would be nice, but as you note they are 
rarely practical if done by the database.

Redo records that contain whole blocks may be much bulkier
than records of whole tuples.  Redo records of whole tuples may be much 
bulkier than those that just identify changed fields.
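
To make the size comparison concrete, here is a minimal sketch of the three granularities mentioned above. Everything in it is an assumption for illustration: the 8192-byte block size, the row layout, and the helper names block_record, tuple_record, and field_record. None of this is the actual WAL record format.

```python
PAGE_SIZE = 8192  # assumed block size in bytes, for illustration only


def block_record(page: bytes) -> bytes:
    """Block-level redo: log the whole page image."""
    return page


def tuple_record(row: dict) -> bytes:
    """Tuple-level redo: log every field of the new row version."""
    return repr(sorted(row.items())).encode()


def field_record(old: dict, new: dict) -> bytes:
    """Field-level redo: log only the fields whose values changed."""
    changed = {k: v for k, v in new.items() if old.get(k) != v}
    return repr(sorted(changed.items())).encode()


# A hypothetical row in which a single small field is updated.
old_row = {"id": 1, "name": "widget", "note": "x" * 200, "qty": 5}
new_row = dict(old_row, qty=6)
page = bytes(PAGE_SIZE)  # stand-in for the page that holds the row

sizes = {
    "block": len(block_record(page)),
    "tuple": len(tuple_record(new_row)),
    "field": len(field_record(old_row, new_row)),
}
print(sizes)
```

The encoding used here (repr of the items) is deliberately naive; the point is only the ordering of the three sizes, not the absolute numbers.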

Bulky logs mean more-frequent snapshot backups, and bulky log formats 
are less suitable for network transmission, and therefore less useful 
for replication.  Smaller redo records take more processing to generate, 
but that processing can be done off-line, and the result saves other 
costs.

Nathan Myers
ncm@zembu.com


Re: Re: Backup and Recovery

From
Bruce Momjian
Date:
> > > > I would expect high-level transaction redo records to be much more
> > > > compact; mixed into the WAL, such records shouldn't make the WAL
> > > > grow much faster.
> > 
> > All redo records have to be at the tuple level, so what higher-level
> > are you talking about ? (statement level redo records would not be
> > able to reproduce the same resulting table data (keyword: transaction
> > isolation level)) 
> 
> Statement-level redo records would be nice, but as you note they are 
> rarely practical if done by the database.
> 
> Redo records that contain whole blocks may be much bulkier
> than records of whole tuples.  Redo records of whole tuples may be much 
> bulkier than those that just identify changed fields.
> 
> Bulky logs mean more-frequent snapshot backups, and bulky log formats 
> are less suitable for network transmission, and therefore less useful 
> for replication.  Smaller redo records take more processing to generate, 
> but that processing can be done off-line, and the result saves other 
> costs.

Tom has identified that VACUUM generates huge WAL traffic because of the
writing of page pre-images in case the page is partially written to disk.
It would be nice to split those out into a separate WAL file _except_ it
would require two fsync() calls per commit (bad), so we are stuck.  Once the
page is flushed to disk after checkpoint, we don't really need those
pre-images anymore, hence the splitting of WAL page images and row
records for recovery purposes.

In other words, we keep the page images and row records in one file so
we can do one fsync, but once we have written the page, we don't want to
store them for later point-in-time recovery.



Re: Re: Backup and Recovery

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> In other words, we keep the page images and row records in one file so
> we can do one fsync, but once we have written the page, we don't want to
> store them for later point-in-time recovery.

What we'd want to do is strip the page images from the version of the
logs that's archived for recovery purposes.  Ideally the archiving
process would also discard records from aborted transactions, but I'm
not sure how hard that'd be to do.
        regards, tom lane
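
The stripping step described above can be sketched as a simple filter applied while copying the online log to the archive. This is only an illustration under stated assumptions: the record layout (dicts with a "kind" key) and the names strip_page_images and log_size are hypothetical and do not correspond to the real WAL format or to any existing tool.

```python
def strip_page_images(records):
    """Return the records to archive, with full-page images dropped."""
    return [rec for rec in records if rec["kind"] != "page_image"]


def log_size(records):
    """Total payload bytes carried by a list of records."""
    return sum(len(rec.get("payload", b"")) for rec in records)


# Hypothetical online log: one page image followed by a row change and a commit.
online_log = [
    {"kind": "page_image", "page": 7, "payload": bytes(8192)},
    {"kind": "row", "xid": 100, "tid": (7, 1), "payload": b"new tuple"},
    {"kind": "commit", "xid": 100},
]

archived_log = strip_page_images(online_log)
print(log_size(online_log), log_size(archived_log))
```

The filter is stateless, which is why it is the easy half of the proposal; discarding records from aborted transactions needs knowledge of each transaction's outcome and is not attempted here.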


Re: Re: Backup and Recovery

From
ncm@zembu.com (Nathan Myers)
Date:
On Thu, Jul 05, 2001 at 09:33:17PM -0400, Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > In other words, we keep the page images and row records in one file so
> > we can do one fsync, but once we have written the page, we don't want to
> > store them for later point-in-time recovery.
> 
> What we'd want to do is strip the page images from the version of the
> logs that's archived for recovery purposes.  

Am I correct in believing that the remaining row images would have to 
be applied to a clean table-image snapshot?  Maybe you can produce a 
clean table-image snapshot by making a dirty image copy, and then
replaying the WAL from the time you started copying up to the time
when you finish copying.  

How hard would it be to turn these row records into updates against a 
pg_dump image, assuming access to a good table-image file?

> Ideally the archiving process would also discard records from aborted
> transactions, but I'm not sure how hard that'd be to do.

A second pass over the WAL file -- or the log-picker daemon's 
first-pass output -- could eliminate the dead row images.  Handling 
WAL file boundaries might be tricky if one WAL file has dead row-images 
and the next has the abort-or-commit record.  Maybe the daemon has to 
look ahead into the next WAL file to know what to discard from the 
current file.  
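
A rough sketch of that two-pass idea, including the look-ahead into the next file, might look like this. The tuple-based record layout and the function names collect_outcomes and strip_aborted are assumptions made for the example, not anything that exists in the back-end.

```python
def collect_outcomes(segments):
    """First pass: map each transaction id to 'commit' or 'abort'."""
    outcomes = {}
    for segment in segments:
        for rec in segment:
            if rec[0] in ("commit", "abort"):
                outcomes[rec[1]] = rec[0]
    return outcomes


def strip_aborted(segment, outcomes):
    """Second pass: drop every record of a transaction known to have aborted.

    Records of transactions whose outcome is still unknown are kept,
    because they may yet commit in a later file.
    """
    return [rec for rec in segment if outcomes.get(rec[1]) != "abort"]


# Hypothetical WAL files: transaction 11 writes in the first file
# but its abort record only appears in the second.
seg0 = [("row", 10, "x"), ("row", 11, "y"), ("commit", 10)]
seg1 = [("abort", 11), ("row", 12, "z")]

without_lookahead = strip_aborted(seg0, collect_outcomes([seg0]))
with_lookahead = strip_aborted(seg0, collect_outcomes([seg0, seg1]))
print(without_lookahead)
print(with_lookahead)
```

Without the look-ahead, the dead row image of transaction 11 survives in the first file's output; scanning the next file first lets it be discarded.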

Would it be useful to mark points in a WAL file where there are no 
transactions with outstanding writes?

Nathan Myers
ncm@zembu.com


Re: Re: Backup and Recovery

From
Bruce Momjian
Date:
> On Thu, Jul 05, 2001 at 09:33:17PM -0400, Tom Lane wrote:
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > In other words, we keep the page images and row records in one file so
> > > we can do one fsync, but once we have written the page, we don't want to
> > > store them for later point-in-time recovery.
> > 
> > What we'd want to do is strip the page images from the version of the
> > logs that's archived for recovery purposes.  
> 
> Am I correct in believing that the remaining row images would have to 
> be applied to a clean table-image snapshot?  Maybe you can produce a 
> clean table-image snapshot by making a dirty image copy, and then
> replaying the WAL from the time you started copying up to the time
> when you finish copying.  

Good point.  You are going to need a tar image of the data files to
restore via WAL, and to skip all WAL records from before the tar image.  WAL
does some of the tricky stuff now as part of crash recovery, but it gets
more complicated for a point-in-time recovery because the binary images
were taken over time, not at a single point in time as in crash recovery.
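
The reason a copy taken over time can still be used is easiest to see in a toy model. The sketch below assumes redo records that carry the whole new row value, so that replaying one twice is harmless; the table-as-dict model, the (lsn, tid, value) layout, and the name replay are all hypothetical.

```python
def replay(snapshot, wal, start_lsn):
    """Apply every WAL record at or after start_lsn to a copy of the snapshot."""
    table = dict(snapshot)
    for lsn, tid, value in wal:
        if lsn < start_lsn:
            continue  # already reflected in the copy, skip it
        table[tid] = value  # whole-row redo: safe to apply more than once
    return table


# Log position 1 happened before the copy started; 2 and 3 happened during it.
wal = [(1, 1, "a1"), (2, 1, "a2"), (3, 2, "b3")]

# The copy read row 1 before record 2 and row 2 after record 3, so it
# holds a combination of values that never existed at any single moment.
dirty_copy = {1: "a1", 2: "b3"}

restored = replay(dirty_copy, wal, start_lsn=2)
print(restored)
```

Replaying from the log position at which the copy began brings every row up to at least the state at the end of the copy, whichever moment it happened to be read.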

> 
> How hard would it be to turn these row records into updates against a 
> pg_dump image, assuming access to a good table-image file?

pg_dump is very hard because WAL contains only tids.  No way to match
that to pg_dump-loaded rows.


> > Ideally the archiving process would also discard records from aborted
> > transactions, but I'm not sure how hard that'd be to do.
> 
> A second pass over the WAL file -- or the log-picker daemon's 
> first-pass output -- could eliminate the dead row images.  Handling 
> WAL file boundaries might be tricky if one WAL file has dead row-images 
> and the next has the abort-or-commit record.  Maybe the daemon has to 
> look ahead into the next WAL file to know what to discard from the 
> current file.  
> 
> Would it be useful to mark points in a WAL file where there are no 
> transactions with outstanding writes?

I think CHECKPOINT is as good as we are going to get in that area, but
of course there are outstanding transactions that are not going to be
picked up because they weren't committed before the checkpoint
completed.




Re: Re: Backup and Recovery

From
ncm@zembu.com (Nathan Myers)
Date:
On Fri, Jul 06, 2001 at 06:52:49AM -0400, Bruce Momjian wrote:
> Nathan wrote:
> > How hard would it be to turn these row records into updates against a 
> > pg_dump image, assuming access to a good table-image file?
> 
> pg_dump is very hard because WAL contains only tids.  No way to match
> that to pg_dump-loaded rows.

Maybe pg_dump can write out a mapping of TIDs to line numbers, and the
back-end can create a map of inserted records' line numbers when the dump 
is reloaded, so that the original TIDs can be traced to the new TIDs.
I guess this would require a new option on IMPORT.  I suppose the
mappings could be temporary tables.
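
As a sketch of how the two mappings would combine, assuming TIDs are (block, offset) pairs and that both the dump and the reload can report a line number per row (neither feature exists today, and the names build_tid_map and retarget are made up for this example):

```python
def build_tid_map(dump_map, reload_map):
    """Join the two mappings on line number: original TID -> new TID.

    dump_map:   line number -> TID the row had when it was dumped
    reload_map: line number -> TID the row received when reloaded
    """
    return {
        old_tid: reload_map[line]
        for line, old_tid in dump_map.items()
        if line in reload_map
    }


def retarget(wal_rows, tid_map):
    """Rewrite row records so they point at the reloaded rows.

    Records whose TID has no mapping are skipped in this sketch; a real
    tool would have to deal with rows created after the dump was taken.
    """
    return [(tid_map[tid], data) for tid, data in wal_rows if tid in tid_map]


dump_map = {1: (0, 1), 2: (0, 3), 3: (2, 5)}    # written at dump time
reload_map = {1: (0, 1), 2: (0, 2), 3: (0, 3)}  # built during reload
tid_map = build_tid_map(dump_map, reload_map)

wal_rows = [((0, 3), "update A"), ((2, 5), "update B")]
retargeted = retarget(wal_rows, tid_map)
print(retargeted)
```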

Nathan Myers
ncm@zembu.com