Re: Tracing down buildfarm "postmaster does not shut down" failures
From | Andrew Dunstan |
---|---|
Subject | Re: Tracing down buildfarm "postmaster does not shut down" failures |
Date | |
Msg-id | 56B901B4.2020207@dunslane.net |
In reply to | Tracing down buildfarm "postmaster does not shut down" failures (Tom Lane <tgl@sss.pgh.pa.us>) |
List | pgsql-hackers |
On 02/08/2016 02:15 PM, Tom Lane wrote:
> Of late, by far the majority of the random-noise failures we see in the
> buildfarm have come from failure to shut down the postmaster in a
> reasonable timeframe.  An example is this current failure on hornet:
>
> http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hornet&dt=2016-02-08%2013%3A41%3A55
>
> waiting for server to shut down........................................................................................................................... failed
> pg_ctl: server does not shut down
> =========== db log file ==========
> 2016-02-08 15:14:35.783 UTC [56b8b00d.9e00e2:2] LOG:  received fast shutdown request
> 2016-02-08 15:14:35.783 UTC [56b8b00d.9e00e2:3] LOG:  aborting any active transactions
> 2016-02-08 15:14:35.783 UTC [56b8b00d.7e0028:2] LOG:  autovacuum launcher shutting down
> 2016-02-08 15:14:35.865 UTC [56b8b00d.2e5000a:1] LOG:  shutting down
>
> The buildfarm script runs pg_ctl stop with -t 120, and counting the dots
> shows that pg_ctl was honoring that, so apparently the postmaster took
> more than 2 minutes to shut down.
>
> Looking at other recent successful runs, stopdb-C-1 usually takes 30 to
> 40 or so seconds on this machine.  So it's possible that it was just so
> overloaded that it took three times that much time on this run, but I'm
> starting to think there may be more to it than that.  We've seen variants
> on this theme on half a dozen machines just in the past week --- and it
> seems to mostly happen in 9.5 and HEAD, which is fishy.
>
> The fact that we got "shutting down" but not "database system is shut
> down" indicates that the server was in ShutdownXLOG().  Normally the
> only component of that that takes much time is buffer flushing, but
> could something else be happening?  Or perhaps the checkpoint code
> has somehow not gotten the word to do its flushing at full speed?
>
> What I'd like to do to investigate this is put in a temporary HEAD-only
> patch that makes ShutdownXLOG() and its subroutines much chattier about
> how far they've gotten and what time it is, and also makes pg_ctl print
> out the current time if it gives up waiting.  A few failed runs with
> that in place will at least allow us to confirm or deny whether it's
> just that the shutdown checkpoint is sometimes really slow, or whether
> there's a bug lurking.
>
> Any objections?  Anybody have another idea for data to collect?
>
>

I think that's an excellent idea. This has been a minor running sore for
ages on slow machines.

cheers

andrew
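For what it's worth, both halves of the proposal are small. On the backend side it would amount to sprinkling elog(LOG, ...) progress markers through ShutdownXLOG() and its subroutines, since log_line_prefix timestamps then show where the time went. On the client side, it could look something like the sketch below, which pg_ctl might call just before reporting "server does not shut down". This is only a sketch under those assumptions, not the actual patch; the helper name report_giveup_time is made up for illustration.

#include <stdio.h>
#include <time.h>

/*
 * Hypothetical helper (illustration only, not the actual patch): print
 * the current wall-clock time when pg_ctl gives up waiting, so the
 * failure message can be lined up against the postmaster's own log
 * timestamps in the buildfarm output.
 */
static void
report_giveup_time(void)
{
	char		buf[128];
	time_t		now = time(NULL);

	strftime(buf, sizeof(buf), "%Y-%m-%d %H:%M:%S %Z", localtime(&now));
	fprintf(stderr, "pg_ctl: gave up waiting at %s\n", buf);
}

int
main(void)
{
	/* demo of the output format a failed wait would produce */
	report_giveup_time();
	return 0;
}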