Re: Minimizing Recovery Time (wal replication)
From | Greg Smith |
---|---|
Subject | Re: Minimizing Recovery Time (wal replication) |
Date | |
Msg-id | alpine.GSO.2.01.0904091916440.6276@westnet.com |
In response to | Minimizing Recovery Time (wal replication) (Bryan Murphy <bmurphy1976@gmail.com>) |
Responses | Re: Minimizing Recovery Time (wal replication) |
List | pgsql-general |
On Thu, 9 Apr 2009, Bryan Murphy wrote:

> (1) hot spare applies 70 to 75 wal files (~1.1g) in 2 to 3 min period

Yeah, if you ever let this many files queue up you're facing a long recovery time. You really need to get into a position where you're applying WAL files regularly enough that you don't ever fall this far behind.

> (2) hot spare pauses for 15 to 20 minutes, during this period pdflush
> consumes 99% IO (iotop). Dirty (from /proc/meminfo) spikes to ~760mb,
> remains at that level for the first 10 minutes, and then slowly ticks
> down to 0 for the second 10 minutes.

What does vmstat say about the bi/bo during this time period? It sounds like the volume of random I/O produced by recovery is just backing up as expected. Some quick math:

15GB RAM * 5% dirty_ratio = 750MB ; there's where your measured 760MB bottleneck is coming from.

750MB / 10 minutes = 1.25MB/s ; that's in the normal range for random writes with a single disk

Therefore my bet is that "vmstat 1" will show bo~=1250 the whole time you're waiting there, with matching figures from the iostat to the database disk during that period.

Basically your options here are:

1) Decrease the maximum possible segment backlog so you can never get this far behind

2) Increase the rate at which random I/O can be flushed to disk by either

   a) Improving things with a [better] battery-backed controller disk cache

   b) Stripe across more disks

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
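The quick math in the message can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, not a measurement tool. The function names are made up for illustration. It assumes, as the message does, round decimal units (1GB = 1000MB), and it assumes that vmstat's bo column is counted in blocks of roughly 1KB.

```python
def dirty_ceiling_mb(ram_gb: float, dirty_pct: float) -> float:
    """Approximate dirty-page ceiling in MB (assumes 1GB = 1000MB, as in the message)."""
    return ram_gb * 1000 * dirty_pct / 100


def drain_rate_mb_per_s(dirty_mb: float, minutes: float) -> float:
    """Average write rate needed to flush dirty_mb in the given number of minutes."""
    return dirty_mb / (minutes * 60)


def expected_vmstat_bo(rate_mb_per_s: float) -> float:
    """Rough vmstat 'bo' figure, assuming one block is about 1KB."""
    return rate_mb_per_s * 1000


# Figures taken from the message: 15GB RAM, 5% dirty ratio, 10 minute drain.
ceiling = dirty_ceiling_mb(15, 5)
rate = drain_rate_mb_per_s(ceiling, 10)
bo = expected_vmstat_bo(rate)

print(f"dirty ceiling: {ceiling:.0f} MB")
print(f"drain rate:    {rate:.2f} MB/s")
print(f"expected bo:   {bo:.0f} blocks/s")
```

Plugging in a different RAM size or drain time shows how quickly the stall grows when the flush rate stays near what a single disk manages for random writes.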