Обсуждение: ALTER DATABASE SET TABLESPACE vs crash safety

Поиск
Список
Период
Сортировка

ALTER DATABASE SET TABLESPACE vs crash safety

От
Tom Lane
Дата:
So I'm looking at the patch for ALTER DATABASE SET TABLESPACE, and
wondering about what happens if there's a system crash midway through.
The answer doesn't look too good: if the deletion pass has started,
your database is hosed.

I think we can fix this along the following lines:
1. Copy all files to new directory.2. Checkpoint (this is to prevent needing to replay   XLOG_DBASE_CREATE, for the
samereasons mentioned in   createdb()).3. Update the pg_database tuple.4. Commit transaction, start a new one.5. Remove
olddirectory and files.
 

If the system crashes before the commit in step 4, the ALTER SET didn't
happen, and the only bad consequence is some orphaned files copied to
the new directory.  If the system crashes after the commit, the ALTER
SET is committed, and the only bad consequence is some orphaned files
left behind in the old directory.

That is, that's true as long as the filesystem copy in fact pushed
everything to disk.  copydir() does an fsync() on each file it copies,
so I think we have done as much as we can to protect the data being
copied, but I wonder if anyone feels it's too dangerous?

Note that this is significantly more dangerous than CREATE DATABASE
(where presumably you have no great investment yet in the new DB)
or DROP DATABASE (where presumably you don't want the data anyway).
ALTER DATABASE SET TABLESPACE is pushing around what might be your
only copy of valuable data.

Comments?
        regards, tom lane


Re: ALTER DATABASE SET TABLESPACE vs crash safety

От
Alvaro Herrera
Дата:
Tom Lane wrote:

> That is, that's true as long as the filesystem copy in fact pushed
> everything to disk.  copydir() does an fsync() on each file it copies,
> so I think we have done as much as we can to protect the data being
> copied, but I wonder if anyone feels it's too dangerous?

Do we need to fsync the directory itself?  My fsync(2) manpage says
      Calling  fsync()  does  not  necessarily ensure that the entry in the directory      containing the file has also
reacheddisk.  For that an explicit fsync()  on  a      file descriptor for the directory is also needed.
 


-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: ALTER DATABASE SET TABLESPACE vs crash safety

От
Tom Lane
Дата:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Do we need to fsync the directory itself?  My fsync(2) manpage says

>        Calling  fsync()  does  not  necessarily ensure that the entry in the directory
>        containing the file has also reached disk.  For that an explicit fsync()  on  a
>        file descriptor for the directory is also needed.

Hmm ... I see that in the Linux manpage, but not on Darwin, HPUX, or in
the Single Unix Spec.  I'm inclined to argue that we've always expected
the filesystem to take care of its own metadata, and we've never seen
any indication that that's unsafe.  We don't try to "fsync the
directory" after a normal table create for instance.
        regards, tom lane


Re: ALTER DATABASE SET TABLESPACE vs crash safety

От
Alvaro Herrera
Дата:
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Do we need to fsync the directory itself?  My fsync(2) manpage says
> 
> >        Calling  fsync()  does  not  necessarily ensure that the entry in the directory
> >        containing the file has also reached disk.  For that an explicit fsync()  on  a
> >        file descriptor for the directory is also needed.
> 
> Hmm ... I see that in the Linux manpage, but not on Darwin, HPUX, or in
> the Single Unix Spec.  I'm inclined to argue that we've always expected
> the filesystem to take care of its own metadata, and we've never seen
> any indication that that's unsafe.  We don't try to "fsync the
> directory" after a normal table create for instance.

I dimly recall the Postfix guys got burned by this some time ago (mails
got lost after a crash because they didn't fsync the directory on which
they had just created the files before acknowledging the email delivery
to the remote server).  I guess this is the reason we require a
filesystem that journals metadata.

http://osdir.com/ml/file-systems.reiserfs.general/2003-09/msg00120.html

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: ALTER DATABASE SET TABLESPACE vs crash safety

От
Decibel!
Дата:
On Nov 7, 2008, at 9:53 AM, Tom Lane wrote:
> So I'm looking at the patch for ALTER DATABASE SET TABLESPACE, and
> wondering about what happens if there's a system crash midway through.
> The answer doesn't look too good: if the deletion pass has started,
> your database is hosed.

FWIW, I don't see this patch as being terribly useful in the real  
world until it can take place in the background, without locking  
stuff for a huge amount of time. That tells me that we should have a  
way to move objects to a new tablespace a little bit at a time. My  
guess is that such a facility would be something that runs in the  
background over many different transactions. Once everything had been  
moved, only then would it go and delete the old files.

But it's too late to get that kind of functionality into 8.4. :( So,  
is there enough demand for this feature to get it into 8.4 and  
possibly paint ourselves into a corner, or should we just wait until  
8.5?
-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828




Re: ALTER DATABASE SET TABLESPACE vs crash safety

От
Bernd Helmle
Дата:
--On Sonntag, November 09, 2008 18:25:50 -0600 Decibel! 
<decibel@decibel.org> wrote:

> On Nov 7, 2008, at 9:53 AM, Tom Lane wrote:

> FWIW, I don't see this patch as being terribly useful in the real world
> until it can take place in the background, without locking stuff for a
> huge amount of time. That tells me that we should have a way to move
> objects to a new tablespace a little bit at a time. My guess is that such
> a facility would be something that runs in the background over many
> different transactions. Once everything had been moved, only then would
> it go and delete the old files.

Of course, such a facility is much more complicater than what this patch 
does. If you don't want to exclusive lock the database you need to track 
all changes during copying the relations and later merge them into the new 
ones in the worst case. I don't see how you want to preserve a consistent 
state of the database otherwise.

>
> But it's too late to get that kind of functionality into 8.4. :( So, is
> there enough demand for this feature to get it into 8.4 and possibly
> paint ourselves into a corner, or should we just wait until 8.5?

This patch is already committed.

--  Thanks
                   Bernd