Re: [GENERAL] Strange checkpoint behavior - checkpoints take a long time

Поиск
Список
Период
Сортировка
От Vladimir Nicolici
Тема Re: [GENERAL] Strange checkpoint behavior - checkpoints take a long time
Дата
Msg-id 59dcd18c.8dbadf0a.94538.4a20@mx.google.com
обсуждение исходный текст
Ответ на Re: [GENERAL] Strange checkpoint behavior - checkpoints take a long time  (Vladimir Nicolici <vladnc@gmail.com>)
Ответы Re: [GENERAL] Strange checkpoint behavior - checkpoints take a longtime
Список pgsql-general

I experimented some more with the settings this weekend, while doing some large write operations (deleting 200 million records from a table), and I realized that the database is capable of generating much more WAL than I estimated.

 

And it seems that spikes in write activity, when longer than a few minutes, can cause the checkpoint process to “panic” and start a checkpoint earlier, and trying to complete it as soon as possible, estimating, correctly, that if that level of activity continues it will hit the max_wal_size limit.

 

Based on that, I reduced the checkpoint_timeout from 30 minutes to 20 minutes, while keeping max_wal_size at 144GB . Alternatively I could have increased the maximum WAL size more, but I’m not sure it’s a good idea to set it higher than the shared buffers, which are also set at 144GB. After this change, on Monday all checkpoints were triggered by “time”, I didn’t have any more checkpoints triggered by “xlog”.

 

I also set checkpoint_completion_target to 0.5 to see if our hardware can handle concentrating the write activity for 20 minutes in just 10 minutes, and that worked very well too, checkpoints finished on time. The %util (busy%) for the SSDs as reported by sar was around 20% when not doing a checkpoint, and 60% during the checkpoint, so it seems the hardware will be able to handle future increases in activity just fine.

 

The lesson I learned here is that max_wal_size needs to be configured based on the *maximum* volume of wal the database can generate in the checkpoint_timeout interval. Initially I had it set based on the *average* volume of wal generated in that interval, setting it to 3 times that average, but that was not enough, triggering the unexpected behavior.

 

Thanks,

Vlad

В списке pgsql-general по дате отправления:

Предыдущее
От: Stephen Frost
Дата:
Сообщение: Re: [GENERAL] Permissions for Web App
Следующее
От: Achilleas Mantzios
Дата:
Сообщение: Re: [GENERAL] Strange checkpoint behavior - checkpoints take a longtime