Re: Tracing down buildfarm "postmaster does not shut down" failures
From | Andrew Dunstan |
---|---|
Subject | Re: Tracing down buildfarm "postmaster does not shut down" failures |
Date | |
Msg-id | 56B901B4.2020207@dunslane.net |
In reply to | Tracing down buildfarm "postmaster does not shut down" failures (Tom Lane <tgl@sss.pgh.pa.us>) |
List | pgsql-hackers |
On 02/08/2016 02:15 PM, Tom Lane wrote:
> Of late, by far the majority of the random-noise failures we see in the
> buildfarm have come from failure to shut down the postmaster in a
> reasonable timeframe.  An example is this current failure on hornet:
>
> http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hornet&dt=2016-02-08%2013%3A41%3A55
>
> waiting for server to shut down........................................................................................................................... failed
> pg_ctl: server does not shut down
> =========== db log file ==========
> 2016-02-08 15:14:35.783 UTC [56b8b00d.9e00e2:2] LOG:  received fast shutdown request
> 2016-02-08 15:14:35.783 UTC [56b8b00d.9e00e2:3] LOG:  aborting any active transactions
> 2016-02-08 15:14:35.783 UTC [56b8b00d.7e0028:2] LOG:  autovacuum launcher shutting down
> 2016-02-08 15:14:35.865 UTC [56b8b00d.2e5000a:1] LOG:  shutting down
>
> The buildfarm script runs pg_ctl stop with -t 120, and counting the dots
> shows that pg_ctl was honoring that, so apparently the postmaster took
> more than 2 minutes to shut down.
>
> Looking at other recent successful runs, stopdb-C-1 usually takes 30 to
> 40 or so seconds on this machine.  So it's possible that it was just so
> overloaded that it took three times that much time on this run, but I'm
> starting to think there may be more to it than that.  We've seen variants
> on this theme on half a dozen machines just in the past week --- and it
> seems to mostly happen in 9.5 and HEAD, which is fishy.
>
> The fact that we got "shutting down" but not "database system is shut
> down" indicates that the server was in ShutdownXLOG().  Normally the
> only component of that that takes much time is buffer flushing, but
> could something else be happening?  Or perhaps the checkpoint code
> has somehow not gotten the word to do its flushing at full speed?
>
> What I'd like to do to investigate this is put in a temporary HEAD-only
> patch that makes ShutdownXLOG() and its subroutines much chattier about
> how far they've gotten and what time it is, and also makes pg_ctl print
> out the current time if it gives up waiting.  A few failed runs with
> that in place will at least allow us to confirm or deny whether it's
> just that the shutdown checkpoint is sometimes really slow, or whether
> there's a bug lurking.
>
> Any objections?  Anybody have another idea for data to collect?
>
>

I think that's an excellent idea. This has been a minor running sore for
ages on slow machines.

cheers

andrew
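For what it's worth, both halves of the proposal are small. On the backend side it would amount to sprinkling elog(LOG, ...) progress markers through ShutdownXLOG() and its subroutines, since log_line_prefix timestamps then show where the time went. On the client side, it could look something like the sketch below, which pg_ctl might call just before reporting "server does not shut down". This is only a sketch under those assumptions, not the actual patch; the helper name report_giveup_time is made up for illustration.

#include <stdio.h>
#include <time.h>

/*
 * Hypothetical helper (illustration only, not the actual patch): print
 * the current wall-clock time when pg_ctl gives up waiting, so the
 * failure message can be lined up against the postmaster's own log
 * timestamps in the buildfarm output.
 */
static void
report_giveup_time(void)
{
	char		buf[128];
	time_t		now = time(NULL);

	strftime(buf, sizeof(buf), "%Y-%m-%d %H:%M:%S %Z", localtime(&now));
	fprintf(stderr, "pg_ctl: gave up waiting at %s\n", buf);
}

int
main(void)
{
	/* demo of the output format a failed wait would produce */
	report_giveup_time();
	return 0;
}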