Re: Parallel pg_dump for 9.1
From | Stefan Kaltenbrunner
---|---
Subject | Re: Parallel pg_dump for 9.1
Date |
Msg-id | 4BB0E054.8090406@kaltenbrunner.cc
In reply to | Re: Parallel pg_dump for 9.1 (Robert Haas <robertmhaas@gmail.com>)
Responses | Re: Parallel pg_dump for 9.1 (Robert Haas <robertmhaas@gmail.com>)
List | pgsql-hackers
Robert Haas wrote:
> On Mon, Mar 29, 2010 at 10:46 AM, Joachim Wieland <joe@mcknight.de> wrote:
[...]
>> - Regarding the output of pg_dump I am proposing two solutions. The
>> first one is to introduce a new archive type "directory" where each
>> table and each blob is a file in a directory, similar to the
>> experimental "files" archive type. Also the idea has come up that you
>> should be able to specify multiple directories in order to make use of
>> several physical disk drives. Thinking this further, in order to
>> manage all the mess that you can create with this, every file of the
>> same backup needs to have a unique identifier and pg_restore should
>> have a check parameter that tells you if your backup directory is in a
>> sane and complete state (think about moving a file from one backup
>> directory to another one or trying to restore from two directories
>> which are from different backup sets...).
>
> I think that specifying several directories is a piece of complexity
> that would be best left alone for a first version of this. But a
> single directory with multiple files sounds pretty reasonable. Of
> course we'll also need to support that format in non-parallel mode,
> and in pg_restore.
>
>> The second solution to the single-file-problem is to generate no
>> output at all, i.e. whatever you export from your source database you
>> import directly into your target database, which in the end turns out
>> to be a parallel form of "pg_dump | psql".
>
> This is a very interesting idea but you might want to get the other
> thing merged first, as it's going to present a different set of
> issues.

I had some prior discussion with Joachim about this (and I suspect I had some influence on his decision to try to implement it). The reason why this is really needed is that the current pg_restore -j is actually a net loss (vs. "pg_dump | psql") in a LOT of scenarios that are basically "duplicate this database to that location" (or any migration, really).

The example at hand is a 240GB production database with around 850 tables. It takes ~145min to dump that database single-threaded (completely CPU bound); simply loading the SQL using psql can restore it in ~150min (again CPU bound, both for COPY and index creation); pg_restore -j8 brings the restore down to ~55min.

So if you do the math (and a bit of handwaving):

* using pg_dump | psql you get greatest(145,150) -> 150min
* using pg_dump -Z0 -Fc && pg_restore -j8 you get 145+55 -> 200min
* using a theoretical parallel pg_dump and the existing parallel restore you would get 50 (just a guess for how fast it might be) + 55 -> 105min
* a parallel dump & restore that can pipeline would end up at greatest(50,55) -> 55min

So a parallel dump alone would only give you about a 50% speedup in total time for a migration/upgrade/dump-to-devbox (150min down to ~105min), despite the fact that it uses 8x the resources. A pipelined solution would result in a ~3x speedup in total time (150min down to ~55min), and you don't even have to think about problems like having enough free disk space on the source/destination to hold a full temporary dump (if you don't, you might even have to add some transfer time as well). Rough sketches of both ideas follow at the end of this mail.

Stefan
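P.S. To make the "directory" archive idea concrete, a layout along these lines would give each table and each blob its own file plus a table of contents carrying the backup's unique identifier (the file names here are purely illustrative, not what Joachim's patch actually produces):

    dumpdir/
      toc.dat           # table of contents + unique backup id
      2301.dat          # COPY data for one table
      2302.dat          # ... one such file per table
      blob_16401.dat    # one file per large object

The proposed pg_restore check parameter would then just verify that every data file referenced in toc.dat is present and carries the same backup id, which catches both missing files and files moved in from a different backup set.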
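P.P.S. And a minimal sketch of what a poor man's pipelined parallel dump/restore could look like with today's tools (src and dst are placeholder database names, GNU xargs is assumed; this ignores dependency ordering, index creation order, and the fact that every worker takes its own snapshot, which is exactly the hard part a real implementation has to solve):

    # schema first, serially
    pg_dump --schema-only src | psql dst

    # then pipe each table's data through its own worker, 8 at a time
    psql -At -c "select schemaname || '.' || tablename from pg_tables
                 where schemaname not in ('pg_catalog','information_schema')" src \
      | xargs -P8 -I{} sh -c 'pg_dump --data-only -t {} src | psql dst'

Each worker is a straight "pg_dump | psql" pipe, so no intermediate dump ever hits the disk, and the total time is bounded by the slowest side of the pipe rather than by dump time plus restore time.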