Thread: [HACKERS] Too many autovacuum workers spawned during forced auto-vacuum
In a server where autovacuum is disabled and its databases reach the
autovacuum_freeze_max_age limit, an autovacuum is forced to prevent xid
wraparound issues. At this stage, when the server is loaded with a lot of DML
operations, an exceedingly high number of autovacuum workers keep getting
spawned; these do nothing and then quit.

The issue is: sometimes an autovacuum worker A1 finds it has no tables to
vacuum because another worker (or workers) B is already concurrently
vacuuming the tables. Worker A1 then calls vac_update_datfrozenxid() at the
end, which effectively requests a new autovacuum
(vac_update_datfrozenxid() -> vac_truncate_clog() -> SetTransactionIdLimit ->
send PMSIGNAL_START_AUTOVAC_LAUNCHER). *Immediately* a new autovacuum is
spawned, since it is an xid-wraparound vacuum. This new autovacuum chooses
the same database for the new worker A2, because the datfrozenxids aren't yet
updated: worker B is still not done. Worker A2 finds that the same worker(s)
B are still vacuuming the same tables. A2 again calls
vac_update_datfrozenxid(), which leads to spawning another autovacuum worker
A3, and the cycle keeps repeating until B finishes vacuuming and updates
datfrozenxid.

Steps to reproduce:

1. In a fresh cluster, create tables using the attached create.sql.gz.

2. Insert some data:

   INSERT INTO pgbench_history SELECT generate_series(1, 5402107, 1);
   UPDATE pgbench_history SET bid = 1, aid = 11100, delta = 500,
       mtime = '2017/1/1', filler = '19:00';

3. Set these GUCs in postgresql.conf and restart the server:

   autovacuum = off
   autovacuum_freeze_max_age = 100000  # Make auto-vacuum start as early as possible.

4. Run "pgbench -n -c 5 -t 2000000" and watch the log file.

5.
After the age(datfrozenxid) of the databases crosses the
autovacuum_freeze_max_age value, after 1-2 more minutes, a number of messages
like these will be seen:

   2017-01-13 14:50:12.304 IST [111811] LOG:  autovacuum launcher started
   2017-01-13 14:50:12.346 IST [111816] LOG:  autovacuum launcher started
   2017-01-13 14:50:12.825 IST [111818] LOG:  autovacuum launcher started

I see around 70 messages per second.

=== Fix ===

For fixing the issue, one approach I thought of was to make do_start_worker()
choose a different database after determining that the earlier database is
still being scanned by one of the workers in
AutoVacuumShmem->av_runningWorkers. The current logic is that if the
databases need xid-wraparound prevention, we pick the one with the oldest
datfrozenxid. So, for the fix, just choose the db with the second-oldest
datfrozenxid if the oldest one is still being processed. But this does not
solve the problem: once the other databases are done, or if there's a single
database, the workers will again go in a loop with the same database.

Instead, the attached patch (prevent_useless_vacuums.patch) prevents the
repeated cycle by noting that there's no point in doing whatever
vac_update_datfrozenxid() does if we didn't find anything to vacuum and
there's already another worker vacuuming the same database. Note that it uses
the wi_tableoid field to check concurrency. It does not use the wi_dboid
field to check for an already-processing worker, because using this field
might cause each of the workers to think that there is some other worker
vacuuming, and eventually no one vacuums. We have to be certain that the
other worker has already taken a table to vacuum.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Attachments
Amit Khandekar wrote:

> In a server where autovacuum is disabled and its databases reach
> autovacuum_freeze_max_age limit, an autovacuum is forced to prevent
> xid wraparound issues. At this stage, when the server is loaded with a
> lot of DML operations, an exceedingly high number of autovacuum
> workers keep on getting spawned, and these do not do anything, and
> then quit.

I think this is the same problem as reported in
https://www.postgresql.org/message-id/CAMkU=1yE4YyCC00W_GcNoOZ4X2qxF7x5DUAR_kMt-Ta=YPyFPQ@mail.gmail.com

> === Fix ===
[...]
> Instead, the attached patch (prevent_useless_vacuums.patch) prevents
> the repeated cycle by noting that there's no point in doing whatever
> vac_update_datfrozenxid() does, if we didn't find anything to vacuum
> and there's already another worker vacuuming the same database. Note
> that it uses wi_tableoid field to check concurrency. It does not use
> wi_dboid field to check for already-processing worker, because using
> this field might cause each of the workers to think that there is some
> other worker vacuuming, and eventually no one vacuums. We have to be
> certain that the other worker has already taken a table to vacuum.

Hmm, it seems reasonable to skip the end action if we didn't do any cleanup
after all. This would normally give enough time between vacuum attempts for
the first worker to make further progress and avoid causing a storm. I'm not
really sure that it fixes the problem completely, but perhaps it's enough.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 13 January 2017 at 19:15, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> I think this is the same problem as reported in
> https://www.postgresql.org/message-id/CAMkU=1yE4YyCC00W_GcNoOZ4X2qxF7x5DUAR_kMt-Ta=YPyFPQ@mail.gmail.com

Ah yes, this is the same problem. Not sure why I didn't land on that thread
when I tried to search pghackers using relevant keywords.

>> === Fix ===
> [...]
>> Instead, the attached patch (prevent_useless_vacuums.patch) prevents
>> the repeated cycle by noting that there's no point in doing whatever
>> vac_update_datfrozenxid() does, if we didn't find anything to vacuum
>> and there's already another worker vacuuming the same database. Note
>> that it uses wi_tableoid field to check concurrency. It does not use
>> wi_dboid field to check for already-processing worker, because using
>> this field might cause each of the workers to think that there is some
>> other worker vacuuming, and eventually no one vacuums. We have to be
>> certain that the other worker has already taken a table to vacuum.
>
> Hmm, it seems reasonable to skip the end action if we didn't do any
> cleanup after all. This would normally give enough time between vacuum
> attempts for the first worker to make further progress and avoid causing
> a storm. I'm not really sure that it fixes the problem completely, but
> perhaps it's enough.

I had thought about this: if we didn't clean up anything, skip the end
action unconditionally, without checking whether there was any concurrent
worker. But then I thought it is better to skip only if we know there is
another worker doing the same job, because:

a) There might be some reason we are calling vac_update_datfrozenxid()
unconditionally. But I am not sure whether it was intentionally kept like
that; I didn't get any leads from the history.

b) There's no harm in updating datfrozenxid if there was no other worker. In
this case, we *know* that there was indeed nothing to be cleaned up, so the
next time this database won't be chosen again, and there's no harm in just
calling this function.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Re: [HACKERS] Too many autovacuum workers spawned during forced auto-vacuum
From: Masahiko Sawada
Date:
On Mon, Jan 16, 2017 at 1:50 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > On 13 January 2017 at 19:15, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: >> I think this is the same problem as reported in >> https://www.postgresql.org/message-id/CAMkU=1yE4YyCC00W_GcNoOZ4X2qxF7x5DUAR_kMt-Ta=YPyFPQ@mail.gmail.com > > Ah yes, this is the same problem. Not sure why I didn't land on that > thread when I tried to search pghackers using relevant keywords. >> >>> === Fix === >> [...] >>> Instead, the attached patch (prevent_useless_vacuums.patch) prevents >>> the repeated cycle by noting that there's no point in doing whatever >>> vac_update_datfrozenxid() does, if we didn't find anything to vacuum >>> and there's already another worker vacuuming the same database. Note >>> that it uses wi_tableoid field to check concurrency. It does not use >>> wi_dboid field to check for already-processing worker, because using >>> this field might cause each of the workers to think that there is some >>> other worker vacuuming, and eventually no one vacuums. We have to be >>> certain that the other worker has already taken a table to vacuum. >> >> Hmm, it seems reasonable to skip the end action if we didn't do any >> cleanup after all. This would normally give enough time between vacuum >> attempts for the first worker to make further progress and avoid causing >> a storm. I'm not really sure that it fixes the problem completely, but >> perhaps it's enough. > > I had thought about this : if we didn't clean up anything, skip the > end action unconditionally without checking if there was any > concurrent worker. But then thought it is better to skip only if we > know there is another worker doing the same job, because : > a) there might be some reason we are just calling > vac_update_datfrozenxid() without any condition. But I am not sure > whether it was intentionally kept like that. Didn't get any leads from > the history. 
> b) it's no harm in updating datfrozenxid() it if there was no other
> worker. In this case, we *know* that there was indeed nothing to be
> cleaned up. So the next time this database won't be chosen again, so
> there's no harm just calling this function.
>

Since the autovacuum worker wakes up the autovacuum launcher after it is
launched, the launcher could try to spawn worker processes at a high
frequency if you have a database with a very large table in it that has just
passed autovacuum_freeze_max_age.

autovacuum.c:L1605
    /* wake up the launcher */
    if (AutoVacuumShmem->av_launcherpid != 0)
        kill(AutoVacuumShmem->av_launcherpid, SIGUSR2);

I think we should deal with this case as well.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On 16 January 2017 at 15:54, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Since autovacuum worker wakes up autovacuum launcher after launched
> the autovacuum launcher could try to spawn worker process at high
> frequently if you have database with very large table in it that has
> just passed autovacuum_freeze_max_age.
>
> autovacuum.c:L1605
>     /* wake up the launcher */
>     if (AutoVacuumShmem->av_launcherpid != 0)
>         kill(AutoVacuumShmem->av_launcherpid, SIGUSR2);
>
> I think we should deal with this case as well.

When autovacuum is enabled, after getting SIGUSR2, the worker is launched
only when it's time to launch. It doesn't look like it will be launched
immediately:

    /* We're OK to start a new worker */
    if (dlist_is_empty(&DatabaseList))
    {
        /* Special case when the list is empty */
    }
    else
    {
        ..........
        /* launch a worker if next_worker is right now or it is in the past */
        if (TimestampDifferenceExceeds(avdb->adl_next_worker,
                                       current_time, 0))
            launch_worker(current_time);
    }

So from the above, it looks as if there will not be a storm of workers.
Whereas, if autovacuum is disabled, the autovacuum launcher does not wait for
the worker to start; it just starts the worker and quits, so the issue won't
show up here:

    /*
     * In emergency mode, just start a worker (unless shutdown was requested)
     * and go away.
     */
    if (!AutoVacuumingActive())
    {
        if (!got_SIGTERM)
            do_start_worker();
        proc_exit(0);       /* done */
    }

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
On Fri, Jan 13, 2017 at 8:45 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Amit Khandekar wrote:
>> In a server where autovacuum is disabled and its databases reach
>> autovacuum_freeze_max_age limit, an autovacuum is forced to prevent
>> xid wraparound issues. At this stage, when the server is loaded with a
>> lot of DML operations, an exceedingly high number of autovacuum
>> workers keep on getting spawned, and these do not do anything, and
>> then quit.
>
> I think this is the same problem as reported in
> https://www.postgresql.org/message-id/CAMkU=1yE4YyCC00W_GcNoOZ4X2qxF7x5DUAR_kMt-Ta=YPyFPQ@mail.gmail.com

If I understand correctly, and it's possible that I don't, the issues are
distinct. I think that the issue in that thread has to do with the
autovacuum launcher starting workers over and over again in a tight loop,
whereas this issue seems to be about autovacuum workers restarting the
launcher over and over again in a tight loop. In that thread, it's the
autovacuum launcher that is looping, which can only happen when
autovacuum=on. In this thread, the autovacuum launcher is repeatedly exiting
and getting restarted, which can only happen when autovacuum=off.

In general, it seems we've been pretty cavalier about just how often it's
reasonable to start the autovacuum launcher when autovacuum=off. That code
probably doesn't see much real-world use. Foreground processes signal the
postmaster only once every 64K transactions, which on today's hardware can't
happen more than every couple of seconds if you're not using subtransactions
or intentionally burning XIDs, but hardware keeps getting faster, and you
might be using subtransactions. However, requiring that 65,536 transactions
pass between signals does serve as something of a rate limit. In the case
about which Amit is complaining, there's no rate limit at all.
As fast as the autovacuum launcher starts up, it spawns a worker and exits;
as fast as the worker can determine that it can't do anything useful, it
starts a new launcher. Clearly, some kind of rate control is needed here; the
only question is about where to put it.

I would be tempted to install something directly in postmaster.c. If
CheckPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER) && Shutdown ==
NoShutdown but we last set start_autovac_launcher = true less than 10
seconds ago, don't do it again. That limits us to launching the autovacuum
launcher at most six times a minute when autovacuum = off. You could argue
that defeats the point of the SendPostmasterSignal in SetTransactionIdLimit,
but I don't think so. If vacuuming the oldest database took less than 10
seconds, then we won't vacuum the next-oldest database until we hit the next
64K transaction ID boundary, but that can only cause a problem if we've got
so many databases that we don't get to them all before we run out of
transaction IDs, which is almost unthinkable. If you had ten million tiny
databases that all crossed the threshold at the same instant, it would take
you 640 million transaction IDs to visit them all. If you also had
autovacuum_freeze_max_age set very close to the upper limit for that
variable, you could conceivably have the system shut down before all of
those databases were reached. But that's a pretty artificial scenario. If
someone has that scenario, perhaps they should consider more sensible
configuration choices.

I wondered for a while why the existing guard in vac_update_datfrozenxid()
isn't sufficient to prevent this problem. That turns out to be due to Tom's
commit 794e3e81a0e8068de2606015352c1254cb071a78, which causes
ForceTransactionIdLimitUpdate() to always return true when we're past
xidVacLimit.
The commit doesn't contain much in the way of justification for the change,
but I think the issue must be that if the database nearest to wraparound is
dropped, we need some mechanism for eventually forcing xidVacLimit to get
updated, rather than just spewing warnings.

Another place where we could insert a guard is inside SetTransactionIdLimit
itself. This is a little tricky. The easy idea would be just to skip sending
the signal if xidVacLimit hasn't advanced, but that's wrong in the case
where there are multiple databases with exactly the same oldest XID;
vacuuming the first one doesn't change anything. It would be correct -- I
think -- to skip sending the signal when xidVacLimit doesn't advance and
vac_update_datfrozenxid() didn't change the current database's value either,
but that requires passing a flag down the call stack a few levels. That's
only mildly ugly so I'd be fine with it if it were the best fix, but there
seem to be better options.

Amit's chosen yet another possible place to insert the guard: teach
autovacuum that if a worker skips at least one table due to concurrent
autovacuum activity AND ends up vacuuming no tables, don't call
vac_update_datfrozenxid(). Since there is or was another worker running,
vac_update_datfrozenxid() either already has been called or will be when
that worker finishes. So that seems safe. If his patch were changed to skip
vac_update_datfrozenxid() in all cases where we do nothing rather than only
when we skip a table due to concurrent activity, we'd reintroduce the
dropped-database problem that was fixed by
794e3e81a0e8068de2606015352c1254cb071a78.

I'm not entirely sure whether Amit's fix is better or worse than the
postmaster-based fix.
It seems like a fairly fundamental weakness for the postmaster to have no
rate-limiting logic whatsoever here; it should be the postmaster's job to
judge whether it's getting swamped with signals, and if we fix it in the
postmaster then it stops systems with high rates of XID consumption from
going bonkers for that reason. On the other hand, if somebody does have a
scenario where repeatedly signaling the postmaster to start the launcher in
a tight loop is allowing the system to zip through many small databases
efficiently, Amit's fix will let that keep working, whereas throttling in
the postmaster will make it take longer to get to all of those databases. In
many cases, that could be an improvement, since it would tend to spread out
the datfrozenxid values better, but I can't quite shake the niggling fear
that there might be some case I'm not thinking of where it's problematic. So
I don't know.

As for the problem on the other thread, maybe we could extend Amit's
approach so that when a worker exits after having skipped some tables but
without having vacuumed any tables, we blacklist the database for some
period of time or some number of iterations: autovacuum workers aren't
allowed to choose that database until the blacklist entry expires. That way,
if it becomes evident that more autovacuum workers in that database would be
useless, other databases get a chance to attract some workers, at least for
some period of time. I'm not sure how to calibrate that exactly, but it's a
thought. I think we should fix this problem first, though; it's subject to a
narrower and less-speculative repair.

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jan 17, 2017 at 4:02 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Amit's chosen yet another possible place to insert the guard: teach
> autovacuum that if a worker skips at least one table due to concurrent
> autovacuum activity AND ends up vacuuming no tables, don't call
> vac_update_datfrozenxid(). Since there is or was another worker
> running, vac_update_datfrozenxid() either already has been called or
> will be when that worker finishes. So that seems safe. If his patch
> were changed to skip vac_update_datfrozenxid() in all cases where we
> do nothing rather than only when we skip a table due to concurrent
> activity, we'd reintroduce the dropped-database problem that was fixed
> by 794e3e81a0e8068de2606015352c1254cb071a78.

After sleeping on this, I'm inclined to go with Amit's fix for now. It seems
less likely to break anything in the back-branches than any other option I
can think up.

An updated version of that patch is attached. I changed the "if" statement
to avoid having an empty "if" clause and a non-empty "else" clause, and I
rewrote the comment based on my previous analysis. Barring objections, I'll
commit and back-patch this version.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments
Robert Haas wrote:

> After sleeping on this, I'm inclined to go with Amit's fix for now.
> It seems less likely to break anything in the back-branches than any
> other option I can think up.

Yeah, no objections here. Note the typo "imporatant" in the comment.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 18 January 2017 at 02:32, Robert Haas <robertmhaas@gmail.com> wrote:
> If I understand correctly, and it's possible that I don't, the issues
> are distinct. I think that the issue in that thread has to do with
> the autovacuum launcher starting workers over and over again in a
> tight loop, whereas this issue seems to be about autovacuum workers
> restarting the launcher over and over again in a tight loop. In that
> thread, it's the autovacuum launcher that is looping, which can only
> happen when autovacuum=on. In this thread, the autovacuum launcher is
> repeatedly exiting and getting restarted, which can only happen when
> autovacuum=off.

Yes, that's true: in the other thread, autovacuum is on. Although, I haven't
been able to work out why there would be a storm of workers spawned in the
case of autovacuum on. When it is on, the launcher starts a worker only when
it's time to start one.

> I would be tempted to install something directly in postmaster.c. If
> CheckPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER) && Shutdown ==
> NoShutdown but we last set start_autovac_launcher = true less than 10
> seconds ago, don't do it again.

My impression was that the postmaster is supposed to do just the minimal
work of starting the autovacuum launcher if it isn't running already, and
that the work of ensuring all these things keep going is the job of the
autovacuum launcher.

> That limits us to launching the
> autovacuum launcher at most six times a minute when autovacuum = off.
> You could argue that defeats the point of the SendPostmasterSignal in
> SetTransactionIdLimit, but I don't think so.
> [...]
> pretty artificial scenario. If someone has that scenario, perhaps
> they should consider more sensible configuration choices.

Yeah, this logic makes sense. But I guess, from looking at the code, it
seems that it was carefully made sure that in the case of autovacuum off, we
should clean up all databases as fast as possible, with multiple workers
cleaning up multiple tables in parallel.

Instead of the autovacuum launcher and worker together making sure that the
cycle of iterations keeps running, I was thinking the autovacuum launcher
itself should make sure it does not spawn another worker on the same
database if the previous one did nothing. But that seemed pretty invasive.
Re: [HACKERS] Too many autovacuum workers spawned during forced auto-vacuum
From: Michael Paquier
Date:
On Fri, Jan 20, 2017 at 4:11 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Robert Haas wrote: > >> After sleeping on this, I'm inclined to go with Amit's fix for now. >> It seems less likely to break anything in the back-branches than any >> other option I can think up. > > Yeah, no objections here. +1. -- Michael
On Fri, Jan 20, 2017 at 2:43 AM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Fri, Jan 20, 2017 at 4:11 AM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: >> Robert Haas wrote: >> >>> After sleeping on this, I'm inclined to go with Amit's fix for now. >>> It seems less likely to break anything in the back-branches than any >>> other option I can think up. >> >> Yeah, no objections here. > > +1. OK, committed and back-patched all the way. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 1/20/17 12:40 AM, Amit Khandekar wrote:
> My impression was that postmaster is supposed to just do a minimal
> work of starting auto-vacuum launcher if not already. And, the work of
> ensuring all the things keep going is the job of auto-vacuum launcher.

There's already a ton of logic in the launcher... ISTM it'd be nice to not
start adding additional logic to the postmaster. If we had a generic need
for rate-limiting the launching of things, maybe it wouldn't be that bad,
but AFAIK we don't.

>> That limits us to launching the
>> autovacuum launcher at most six times a minute when autovacuum = off.
>> [...]
>> pretty artificial scenario. If someone has that scenario, perhaps
>> they should consider more sensible configuration choices.
> Yeah this logic makes sense ...

I'm not sure that's true in the case of a significant number of databases
and a very high XID rate, but I might be missing something. In any case I
agree it's not worth worrying about. If you've disabled autovac you're
already running with scissors.

> But I guess, from looking at the code, it seems that it was carefully
> made sure that in case of auto-vacuum off, we should clean up all
> databases as fast as possible with multiple workers cleaning up
> multiple tables in parallel.
>
> Instead of autovacuum launcher and worker together making sure that
> the cycle of iterations keep on running, I was thinking the
> auto-vacuum launcher itself should make sure it does not spawn another
> worker on the same database if it did nothing. But that seemed pretty
> invasive.

IMHO we really need some more sophistication in scheduling for both launcher
and worker. Somewhere on my TODO list is allowing the worker to call a
user-defined SELECT to get a prioritized list, but since the launcher
doesn't connect to a database, that wouldn't work. What we could do rather
simply is honor adl_next_worker in the logic that looks for freeze,
something like the attached.

On another note, does anyone else find the database selection logic rather
difficult to trace through? The logic is kinda spread throughout several
functions. The naming of rebuild_database_list() and get_database_list() is
rather confusing too.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)