Re: Spreading full-page writes
From | Heikki Linnakangas |
---|---|
Subject | Re: Spreading full-page writes |
Date | |
Msg-id | 538481FA.6040707@vmware.com |
In reply to | Re: Spreading full-page writes (Greg Stark <stark@mit.edu>) |
Responses | Re: Spreading full-page writes |
List | pgsql-hackers |

On 05/27/2014 02:42 PM, Greg Stark wrote:
> On Tue, May 27, 2014 at 10:07 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>>
>> On 05/26/2014 02:26 PM, Greg Stark wrote:
>>>
>>> Another idea would be to have separate checkpoints for each buffer
>>> partition. You would have to start recovery from the oldest checkpoint of
>>> any of the partitions.
>>
>> Yeah. Simon suggested that when we talked about this, but I didn't
>> understand how that works at the time. I think I do now. The key to making
>> it work is distinguishing, when starting recovery from the latest
>> checkpoint, whether a record for a given page can be replayed safely. I
>> used flags on WAL records in my proposal to achieve this, but using buffer
>> partitions is simpler.
>
> Interesting. I just thought of it independently.
>
> Incidentally you wouldn't actually want to use the buffer partitions
> per se since the new server might start up with a different number of
> partitions. You would want an algorithm for partitioning the block
> space that xlog replay can reliably reproduce regardless of the size
> of the buffer lock partition table. It might make sense to set it up
> so it coincidentally ensures all the buffers being flushed are in the
> same partition or maybe the reverse would be better. Probably it
> doesn't actually matter.

Since you will be flushing the buffers one "redo partition" at a time,
you would want to allow the OS to merge the writes within a partition as
much as possible. So my even-odd split would in fact be pretty bad. Some
sort of striping, e.g. mapping each contiguous 1 MB chunk to the same
partition, would be better.

> I'm assuming you would keep N checkpoint positions in the control
> file. That also means we can double the checkpoint timeout with only a
> marginal increase in the worst case recovery time. Since the worst
> case will be (1 + 1/n)*timeout's worth of wal to replay rather than
> 2*n. The amount of time for recovery would be much more predictable.

Good point.

- Heikki
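
The striping and per-partition checkpoint ideas discussed in the message above can be illustrated with a small sketch. This is not code from the thread or from PostgreSQL; it is a minimal illustration under stated assumptions: an 8 kB page size, a 1 MB stripe, LSNs modeled as plain integers, and a list of N checkpoint positions standing in for what the control file might hold. The function names `redo_partition`, `recovery_start`, and `is_replayable` are hypothetical. The replay rule shown (skip records older than the checkpoint of the partition that owns the page) is one reading of "whether a record for a given page can be replayed safely", not a confirmed design.

```python
# Hypothetical sketch of "redo partitions"; all names here are illustrative.

BLOCK_SIZE = 8192                  # assumed page size in bytes
STRIPE_BYTES = 1024 * 1024         # 1 MB stripe, as suggested in the message
BLOCKS_PER_STRIPE = STRIPE_BYTES // BLOCK_SIZE   # 128 blocks per stripe


def redo_partition(block_number: int, num_partitions: int) -> int:
    """Map a block to a redo partition by striping.

    Every contiguous 1 MB run of blocks lands in the same partition, so
    flushing one partition produces writes the OS can merge. The mapping
    depends only on the block number and the partition count, not on the
    size of the buffer lock partition table.
    """
    return (block_number // BLOCKS_PER_STRIPE) % num_partitions


def recovery_start(checkpoint_lsns: list[int]) -> int:
    """Recovery has to begin at the oldest of the per-partition checkpoints."""
    return min(checkpoint_lsns)


def is_replayable(record_lsn: int, block_number: int,
                  checkpoint_lsns: list[int]) -> bool:
    """Assumed rule: replay a record only if it is at or after the
    checkpoint of the partition that owns the page it modifies."""
    partition = redo_partition(block_number, len(checkpoint_lsns))
    return record_lsn >= checkpoint_lsns[partition]


if __name__ == "__main__":
    # Four partitions with made-up checkpoint positions.
    lsns = [300, 100, 200, 250]
    print("start replay at LSN", recovery_start(lsns))
    for blk in (0, 127, 128, 512):
        print("block", blk, "-> partition", redo_partition(blk, len(lsns)))
    print("replay record 150 for block 128:", is_replayable(150, 128, lsns))
    print("replay record 150 for block 0:", is_replayable(150, 0, lsns))
```

By contrast, an even-odd split (`block_number % 2`) would place neighbouring blocks in different partitions, so a single partition's flush would touch every other block and give the OS little to merge.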