Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)
From | Robert Haas
---|---
Subject | Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)
Date |
Msg-id | CA+Tgmobr+Q2WgWeasdbDNefVwJkAGALxA=-VtEGNtQgL1V2Ryw@mail.gmail.com
In reply to | Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?) (Robert Haas <robertmhaas@gmail.com>)
Responses | Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)
List | pgsql-hackers
On Fri, Oct 3, 2014 at 1:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Further debugging reveals that sigusr1_handler() gets called
> repeatedly, to start autovacuum workers, and it keeps waking up and
> starting them. But that doesn't cause the background workers to get
> started either, because although sigusr1_handler() contains a call to
> maybe_start_bgworker, it only does that if start_bgworker = true,
> which only happens if
> CheckPostmasterSignal(PMSIGNAL_BACKGROUND_WORKER_CHANGE) is true,
> which on these calls it isn't.
>
> And I think this might also be the missing ingredient answering
> Andres's question from before: why doesn't the 60-second
> select()-timeout cause the background worker to eventually start even
> if the SELECT doesn't get interrupted? There seems to be a SIGUSR1
> arriving about every 3 seconds, and I bet that's resetting the timer
> every time.

For now I propose to address this by committing the attached patch, which gets rid of the separate start_bgworker flag inside sigusr1_handler() and instead uses the same StartWorkerNeeded || HaveCrashedWorker test that we use inside ServerLoop() to decide whether to call maybe_start_bgworker(). Either more signals will arrive (in which case the signal handler will launch an additional background worker every time a signal arrives) or they won't (in which case the 60-second timeout will eventually expire, and ServerLoop() will kick into high gear and satisfy all outstanding requests). This isn't really right, because there might still be a quite noticeable delay waiting for workers to get launched, but at least the delay would be bounded to at most 60 seconds rather than, as at present, potentially infinite.

A totally correct fix will require a bit more thought. A search of the git history reveals that the problem of a signal restarting the timeout is not new: Tom fixed a similar problem back in 2007 by making the autovacuum launcher sleep for at most a second at a time. Such a fix isn't ideal here, because we really don't want an up-to-1-second delay launching a newly-registered background worker if there's a way to avoid that -- it's probably OK for launching daemons, but it's not so hot for parallel query. However, we could:

(1) Use the self-pipe trick. We could not directly use a latch, at least not without a new API, because we might be waiting on more than one socket.

(2) Have the postmaster not set SA_RESTART for the sigusr1 handler. I don't know how platform-independent this approach would be.

(3) Have sigusr1_handler repeatedly call maybe_start_bgworker() until StartWorkerNeeded becomes false, instead of just calling it once. ServerLoop() is carefully coded to call maybe_start_bgworker() just once per iteration, presumably to make sure the server stays responsive even if the bgworker-starting machinery is quite busy; looping inside the signal handler would give up that nice property unless we had some way to break out of the loop if there's activity on the socket.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
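[Editor's illustration of option (1): a minimal, self-contained C sketch of the self-pipe trick, not taken from the PostgreSQL postmaster; the names (wakeup_handler, self_pipe) and the 60-second timeout are illustrative. The handler writes a byte to a non-blocking pipe whose read end is part of the select() set, so a SIGUSR1 arriving while select() is blocked wakes it promptly even on platforms where SA_RESTART would otherwise restart the call with a fresh timeout.]

```c
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/select.h>
#include <unistd.h>

static int	self_pipe[2];

static void
wakeup_handler(int signo)
{
	int			save_errno = errno;

	(void) write(self_pipe[1], "x", 1); /* async-signal-safe wakeup */
	errno = save_errno;
}

int
main(void)
{
	struct sigaction sa;

	if (pipe(self_pipe) != 0)
		return 1;
	fcntl(self_pipe[0], F_SETFL, O_NONBLOCK);
	fcntl(self_pipe[1], F_SETFL, O_NONBLOCK);

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = wakeup_handler;
	sa.sa_flags = SA_RESTART;	/* harmless: the pipe still forces a wakeup */
	sigaction(SIGUSR1, &sa, NULL);

	for (;;)
	{
		fd_set		rfds;
		struct timeval tv = {60, 0};	/* ServerLoop-style timeout */
		char		buf[64];
		int			rc;

		FD_ZERO(&rfds);
		FD_SET(self_pipe[0], &rfds);
		/* a real server loop would also FD_SET() its listen sockets here */

		rc = select(self_pipe[0] + 1, &rfds, NULL, NULL, &tv);
		if (rc < 0 && errno == EINTR)
			continue;			/* loop; the byte in the pipe is still there */
		if (rc > 0 && FD_ISSET(self_pipe[0], &rfds))
		{
			while (read(self_pipe[0], buf, sizeof(buf)) > 0)
				;				/* drain the pipe, then act on the wakeup */
			printf("signal received; would start pending workers here\n");
		}
		else if (rc == 0)
			printf("timeout expired with no signal\n");
	}
}
```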
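[Editor's illustration of option (2), again a sketch rather than postmaster code: install the SIGUSR1 handler without SA_RESTART, so a signal delivered while select() is blocked makes it fail with EINTR instead of being transparently restarted with a fresh 60-second timeout. How uniformly different platforms honor this for select() is precisely the portability question raised in the message.]

```c
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/select.h>

static volatile sig_atomic_t work_pending = 0;

static void
sigusr1_handler(int signo)
{
	work_pending = 1;
}

int
main(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = sigusr1_handler;
	sa.sa_flags = 0;			/* note: no SA_RESTART */
	sigaction(SIGUSR1, &sa, NULL);

	for (;;)
	{
		struct timeval tv = {60, 0};

		/* no fds here, just the timed wait; a real loop would pass its sockets */
		if (select(0, NULL, NULL, NULL, &tv) < 0 && errno == EINTR)
			printf("wait interrupted by a signal before the timeout\n");

		if (work_pending)
		{
			work_pending = 0;
			printf("would call the worker-starting logic here\n");
		}
	}
}
```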
Attachments