parallel pg_restore design issues
From: Andrew Dunstan
Subject: parallel pg_restore design issues
Msg-id: 48E957C4.8060008@dunslane.net
Replies: Re: parallel pg_restore design issues
List: pgsql-hackers
There are a couple of open questions for parallel pg_restore.

First, we need a way to decide the boundary between the serially run "pre-data" section and the remainder of the items in the TOC. Currently the code uses the first TABLEDATA item as the boundary. That's not terribly robust (what if there aren't any?). Also, people have wanted to steer clear of hardcoding much knowledge of archive member types into pg_restore, as a way of future-proofing it somewhat. I'm wondering if we should have pg_dump explicitly mark items as pre-data, data, or post-data. For legacy archives we could still check for either a TABLEDATA item or something known to sort after those (i.e. a BLOB, BLOB COMMENT, CONSTRAINT, INDEX, RULE, TRIGGER or FK CONSTRAINT item).

Another item we have already discussed is how to prevent concurrent processes from trying to take conflicting locks. Here we really can't rely on pg_dump to help us out, as lock requirements might change (a little bird has already whispered in my ear about reducing the strength of the locks taken for FK CONSTRAINT items). I haven't got a really good answer here.

Last, there is the question of what algorithm to use in choosing the next item to run. Currently I am using "next item in the queue whose dependencies have been met", with no queue reordering. Another possible algorithm would reorder the queue by elevating any item whose dependencies have been met. This will mean all the indexes for a table will tend to be grouped together, which might well be a good thing, and will tend to limit the tendency to do all the data loading at once. Both of these could be modified by explicitly limiting TABLEDATA items to a certain proportion (say, one quarter) of the processing slots available, if other items are available. I'm actually somewhat inclined to make provision for all of these possibilities via a command line option, with the first being the default.
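For the legacy-archive case described above, the boundary check could be as simple as matching the TOC entry's description string against the known data-or-later types. A minimal sketch (the enum and function names here are hypothetical, not existing pg_restore symbols):

```c
#include <string.h>

/* Hypothetical section markers; pg_dump does not currently emit these. */
typedef enum { SECTION_PRE_DATA, SECTION_DATA, SECTION_POST_DATA } TocSection;

/* Entry types known to sort with or after the data section in legacy
 * archives, per the list in the message above. */
static const char *const data_or_later[] = {
    "TABLE DATA", "BLOB", "BLOB COMMENT",
    "CONSTRAINT", "INDEX", "RULE", "TRIGGER", "FK CONSTRAINT",
    NULL
};

/* Return 1 if a legacy TOC entry with the given description marks the
 * start of (or falls after) the data section, else 0. */
int is_data_or_later(const char *desc)
{
    for (int i = 0; data_or_later[i] != NULL; i++)
        if (strcmp(desc, data_or_later[i]) == 0)
            return 1;
    return 0;
}
```

The first TOC entry for which this returns 1 would end the serially run pre-data section, so a missing TABLEDATA item no longer breaks the boundary detection.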
One size doesn't fit all, I suspect, and if it does we'll need lots of data before deciding what that size is. The extra logic won't really involve all that much code, and it will all be confined to a couple of functions. Thoughts? cheers andrew
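The default algorithm plus the proposed TABLEDATA cap could be sketched roughly as follows (the struct and field names are invented for illustration; the real dispatcher would work off the archive's TOC and dependency lists):

```c
#include <stddef.h>

/* Minimal stand-in for a TOC work item; fields are hypothetical. */
typedef struct {
    int deps_met;     /* have all dependencies been restored? */
    int is_tabledata; /* is this a TABLEDATA item? */
    int done;         /* already dispatched or finished? */
} WorkItem;

/* "Next item in the queue whose dependencies have been met", with no
 * reordering, but refusing a TABLEDATA item when data loads already
 * occupy max_data_slots worker slots (the "one quarter" cap discussed
 * above). Returns the queue index, or -1 if nothing is runnable. */
int next_item(const WorkItem *queue, int n,
              int data_running, int max_data_slots)
{
    for (int i = 0; i < n; i++) {
        if (queue[i].done || !queue[i].deps_met)
            continue;
        if (queue[i].is_tabledata && data_running >= max_data_slots)
            continue;   /* keep some slots free for non-data items */
        return i;
    }
    return -1;
}
```

The reordering variant would differ only in scanning for ready items and moving them to the head of the queue before this selection, which is why the extra logic stays confined to a couple of functions.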