Discussion: another autovacuum scheduling thread
/me dons flame-proof suit
My goal with this thread is to produce some incremental autovacuum
scheduling improvements for v19, but realistically speaking, I know that
it's a bit of a long-shot.  There have been many discussions over the
years, and I've read through a few of them [0] [1] [2] [3] [4], but there
are certainly others I haven't found.  Since this seems to be a contentious
topic, I figured I'd start small to see if we can get _something_
committed.
While I am by no means wedded to a specific idea, my current concrete
proposal (proof-of-concept patch attached) is to start by ordering the
tables a worker will process by (M)XID age.  Here are the reasons:
* We already do some amount of prioritization, at the database level, for
databases at risk of wraparound, per the following comment from autovacuum.c:
     * Choose a database to connect to.  We pick the database that was least
     * recently auto-vacuumed, or one that needs vacuuming to prevent Xid
     * wraparound-related data loss.  If any db at risk of Xid wraparound is
     * found, we pick the one with oldest datfrozenxid, independently of
     * autovacuum times; similarly we pick the one with the oldest datminmxid
     * if any is in MultiXactId wraparound.  Note that those in Xid wraparound
     * danger are given more priority than those in multi wraparound danger.
However, we do no such prioritization of the tables within a database.  In
fact, the ordering of the tables is effectively random.  IMHO this gives us
quite a bit of wiggle room to experiment; since we are processing tables in
no specific order today, changing the order to something vacuuming-related
seems more likely to help than it is to harm.
* Prioritizing tables based on their (M)XID age might help avoid more
aggressive vacuums, not to mention wraparound.  Of course, there are
scenarios where this doesn't work.  For example, the age of a table may
have changed greatly between the time we recorded it and the time we
process it.  Or maybe there is another table in a different database that
is more important from a wraparound perspective.  We could complicate the
patch to try to handle some of these things, but I maintain that even some
basic, incremental scheduling improvements would be better than the status
quo.  And we can always change it further in the future to handle these
problems and to consider other things like bloat.
The attached patch works by storing the maximum of the XID age and the MXID
age in the list with the OIDs and sorting it prior to processing.
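For illustration, the resulting order is roughly what the following query
computes (a sketch only; the patch sorts the worker's in-memory list of OIDs
rather than running SQL):

    SELECT c.oid::regclass AS rel,
           GREATEST(age(c.relfrozenxid), mxid_age(c.relminmxid)) AS wrap_age
    FROM pg_class c
    WHERE c.relkind IN ('r', 'm', 't')
    ORDER BY wrap_age DESC;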
Thoughts?
[0] https://postgr.es/m/CA%2BTgmoafJPjB3WVqB3FrGWUU4NLRc3VHx8GXzLL-JM%2B%2BJPwK%2BQ%40mail.gmail.com
[1] https://postgr.es/m/CAEG8a3%2B3fwQbgzak%2Bh3Q7Bp%3DvK_aWhw1X7w7g5RCgEW9ufdvtA%40mail.gmail.com
[2] https://postgr.es/m/CAD21AoBUaSRBypA6pd9ZD%3DU-2TJCHtbyZRmrS91Nq0eVQ0B3BA%40mail.gmail.com
[3] https://postgr.es/m/CA%2BTgmobT3m%3D%2BdU5HF3VGVqiZ2O%2Bv6P5wN1Gj%2BPrq%2Bhj7dAm9AQ%40mail.gmail.com
[4] https://postgr.es/m/20130124215715.GE4528%40alvh.no-ip.org
-- 
nathan
Thanks for raising this topic!  I agree that autovacuum scheduling could be
improved.

> * Prioritizing tables based on their (M)XID age might help avoid more
> aggressive vacuums, not to mention wraparound.  Of course, there are
> scenarios where this doesn't work.  For example, the age of a table may
> have changed greatly between the time we recorded it and the time we
> process it.  Or maybe there is another table in a different database that
> is more important from a wraparound perspective.  We could complicate the
> patch to try to handle some of these things, but I maintain that even some
> basic, incremental scheduling improvements would be better than the status
> quo.  And we can always change it further in the future to handle these
> problems and to consider other things like bloat.

One risk I see with this approach is that we will end up autovacuuming
tables that also take the longest time to complete, which could cause
smaller, quick-to-process tables to be neglected.

It's not always the case that the oldest tables in terms of (M)XID age
are also the most expensive to vacuum, but that is often more true
than not.

I'm not saying that the current approach, which, as you mention, is
effectively random, is any better; however, this approach will likely
make it more common for large tables to saturate the workers.

But I also do see the merit of this approach when we know we are in
failsafe territory, because then I would want my oldest tables to be
autovacuumed first.

--
Sami Imseih
Amazon Web Services (AWS)
On 2025-Oct-08, Sami Imseih wrote:

> One risk I see with this approach is that we will end up autovacuuming
> tables that also take the longest time to complete, which could cause
> smaller, quick-to-process tables to be neglected.

Perhaps we can have autovacuum workers decide on a mode to use at startup
(or the launcher decides for them), and use different prioritization
heuristics depending on the mode.  For instance, if we're past the max
freeze age for any tables, then we know we have to first vacuum tables
with higher MXID ages regardless of size considerations; but if there's at
least one worker in that mode, then we use the mode where smaller
high-churn tables go first.

--
Álvaro Herrera               Breisgau, Deutschland
https://www.EnterpriseDB.com/
"We do not dare many things because they are difficult,
but they are difficult because we do not dare to do them" (Seneca)
Hi,

On 2025-10-08 10:18:17 -0500, Nathan Bossart wrote:
> However, we do no such prioritization of the tables within a database.  In
> fact, the ordering of the tables is effectively random.

We don't prioritize tables, but I don't think the order really is random?
Isn't it basically in the order in which the data is in pg_class?  That
typically won't change from one autovacuum pass to the next...

> * Prioritizing tables based on their (M)XID age might help avoid more
> aggressive vacuums, not to mention wraparound.  Of course, there are
> scenarios where this doesn't work.  For example, the age of a table may
> have changed greatly between the time we recorded it and the time we
> process it.
>
> Or maybe there is another table in a different database that
> is more important from a wraparound perspective.

That seems like something no ordering within a single AV worker can
address.  I think it's fine to just define that to be out of scope.

> We could complicate the patch to try to handle some of these things, but I
> maintain that even some basic, incremental scheduling improvements would be
> better than the status quo.  And we can always change it further in the
> future to handle these problems and to consider other things like bloat.

Agreed!  It doesn't take much to be better at scheduling than "order in
pg_class".

> The attached patch works by storing the maximum of the XID age and the MXID
> age in the list with the OIDs and sorting it prior to processing.

I think it may be worth trying to avoid reliably using the same order -
otherwise e.g. a corrupt index on the first scheduled table can cause
autovacuum to reliably fail on the same relation, never allowing it to
progress past that point.

Greetings,

Andres Freund
> I'm not saying that the current approach, which, as you mention, is
> effectively random, is any better; however, this approach will likely
> make it more common for large tables to saturate the workers.

Maybe it would be good to allocate some workers to the oldest tables and
other workers based on some random list?  This could balance things out
between the oldest (large) tables and everything else to avoid this
problem.

--
Sami Imseih
Amazon Web Services (AWS)
On Wed, 8 Oct 2025 12:06:29 -0500
Sami Imseih <samimseih@gmail.com> wrote:

> One risk I see with this approach is that we will end up autovacuuming
> tables that also take the longest time to complete, which could cause
> smaller, quick-to-process tables to be neglected.
>
> It's not always the case that the oldest tables in terms of (M)XID age
> are also the most expensive to vacuum, but that is often more true
> than not.

I think an approach of doing the largest objects first actually might work
really well for balancing work among autovacuum workers.  Many years ago I
designed a system to back up many databases with a pool of workers and
used this same simple & naive algorithm of just reverse sorting on db
size, and it worked remarkably well.  If you have one big thing then you
probably want someone to get started on that first.  As long as there's a
pool of workers available, as you work through the queue you can actually
end up with pretty optimal use of all the workers.

-Jeremy
On Thu, 9 Oct 2025 at 12:41, Jeremy Schneider <schneider@ardentperf.com> wrote:
> I think an approach of doing the largest objects first actually might
> work really well for balancing work among autovacuum workers.  Many
> years ago I designed a system to back up many databases with a pool of
> workers and used this same simple & naive algorithm of just reverse
> sorting on db size, and it worked remarkably well.  If you have one big
> thing then you probably want someone to get started on that first.  As
> long as there's a pool of workers available, as you work through the
> queue you can actually end up with pretty optimal use of all the
> workers.

I believe that methodology for processing work applies much better in
scenarios where there's no new work continually arriving and there are no
adverse effects from giving a lower priority to certain portions of the
work.  I don't think you can apply that so easily to autovacuum, as there
are scenarios where the work can pile up faster than it can be handled.
Also, smaller tables can bloat, in terms of growth proportional to the
original table size, much more quickly than larger tables, and that could
have huge consequences for queries to small tables which are not indexed
sufficiently to handle becoming bloated and large.

David
On Thu, 9 Oct 2025 12:59:23 +1300
David Rowley <dgrowleyml@gmail.com> wrote:

> I believe that methodology for processing work applies much better in
> scenarios where there's no new work continually arriving and there are
> no adverse effects from giving a lower priority to certain portions of
> the work.  I don't think you can apply that so easily to autovacuum, as
> there are scenarios where the work can pile up faster than it can be
> handled.  Also, smaller tables can bloat, in terms of growth
> proportional to the original table size, much more quickly than larger
> tables, and that could have huge consequences for queries to small
> tables which are not indexed sufficiently to handle becoming bloated
> and large.

I'm arguing that it works well with autovacuum.  I'm not saying there
aren't going to be certain workloads that it's suboptimal for.  We're
talking about sorting by (M)XID age.  As the clock continues to move
forward, any table that doesn't get processed naturally moves up the queue
for the next autovac run.  I think the concerns are minimal here and this
would be a good change in general.

-Jeremy

--
To know the thoughts and deeds that have marked man's progress is to feel
the great heart throbs of humanity through the centuries; and if one does
not feel in these pulsations a heavenward striving, one must indeed be
deaf to the harmonies of life.

Helen Keller, The Story Of My Life, 1902, 1903, 1905, introduction by
Ralph Barton Perry (Garden City, NY: Doubleday & Company, 1954), p90.
On Wed, 8 Oct 2025 17:27:27 -0700
Jeremy Schneider <schneider@ardentperf.com> wrote:

> On Thu, 9 Oct 2025 12:59:23 +1300
> David Rowley <dgrowleyml@gmail.com> wrote:
>
> > I believe that methodology for processing work applies much better
> > in scenarios where there's no new work continually arriving and
> > there are no adverse effects from giving a lower priority to certain
> > portions of the work.  I don't think you can apply that so easily to
> > autovacuum, as there are scenarios where the work can pile up faster
> > than it can be handled.  Also, smaller tables can bloat, in terms of
> > growth proportional to the original table size, much more quickly
> > than larger tables, and that could have huge consequences for
> > queries to small tables which are not indexed sufficiently to handle
> > becoming bloated and large.
>
> I'm arguing that it works well with autovacuum.  I'm not saying there
> aren't going to be certain workloads that it's suboptimal for.  We're
> talking about sorting by (M)XID age.  As the clock continues to move
> forward, any table that doesn't get processed naturally moves up the
> queue for the next autovac run.  I think the concerns are minimal here
> and this would be a good change in general.

Hmm, it doesn't work quite like that if the full queue needs to be
processed before the next iteration ~ but at steady state these small
tables are going to get processed at the same rate whether they were at
the top or bottom of the queue, right?  And in non-steady-state
conditions, this seems like a better order than pg_class ordering?

-Jeremy
On Thu, 9 Oct 2025 at 13:27, Jeremy Schneider <schneider@ardentperf.com> wrote:
> I'm arguing that it works well with autovacuum.  I'm not saying there
> aren't going to be certain workloads that it's suboptimal for.  We're
> talking about sorting by (M)XID age.  As the clock continues to move
> forward, any table that doesn't get processed naturally moves up the
> queue for the next autovac run.  I think the concerns are minimal here
> and this would be a good change in general.

I thought that if we're to have a priority queue, it would be hard to
argue against sorting by how far over the given autovacuum threshold the
table is.  Assume that a table which just meets the dead-row count
required to trigger autovacuum based on the autovacuum_vacuum_scale_factor
setting gets a priority of 1.0, while another table that has
n_mod_since_analyze at twice the autovacuum_analyze_scale_factor trigger
point gets priority 2.0.  Effectively, prioritise by the percentage over
the given threshold the table is.  That way, users could still tune things
when they weren't happy with the priority given to a table by adjusting
the corresponding reloption.

It just seems strange to me to only account for 1 of the 4 trigger points
for autovacuum when it's possible to account for all 4 without much extra
trouble.

David
On Thu, 9 Oct 2025 14:03:34 +1300
David Rowley <dgrowleyml@gmail.com> wrote:

> I thought that if we're to have a priority queue, it would be hard to
> argue against sorting by how far over the given autovacuum threshold
> the table is.  Assume that a table which just meets the dead-row count
> required to trigger autovacuum based on the
> autovacuum_vacuum_scale_factor setting gets a priority of 1.0, while
> another table that has n_mod_since_analyze at twice the
> autovacuum_analyze_scale_factor trigger point gets priority 2.0.
> Effectively, prioritise by the percentage over the given threshold the
> table is.  That way, users could still tune things when they weren't
> happy with the priority given to a table by adjusting the
> corresponding reloption.

If users are tuning this thing then I feel like we've already lost the
battle :)

On a healthy system, autovac runs continually and hits tables at regular
intervals based on their steady-state change rates.  We have existing
knobs (for better or worse) that people can use to tell PG to hit certain
tables more frequently, to get rid of sleeps/delays, etc.

With our fleet of PG databases here, my current approach is geared toward
setting log_autovacuum_min_duration to some conservative value fleet-wide,
then monitoring the logs for any cases where autovacuum runs longer than a
defined threshold.  I'm able to catch problems sooner this way, versus
monitoring on xid age alone.

Whenever there are problems with autovacuum, the actual issue is never
going to be resolved by what order autovacuum processes tables in.  I
don't think we should encourage any tunables here... to me it seems like
putting the focus entirely in the wrong place.

-Jeremy
On Wed, 8 Oct 2025 18:25:20 -0700
Jeremy Schneider <schneider@ardentperf.com> wrote:

> On Thu, 9 Oct 2025 14:03:34 +1300
> David Rowley <dgrowleyml@gmail.com> wrote:
>
> > I thought that if we're to have a priority queue, it would be hard
> > to argue against sorting by how far over the given autovacuum
> > threshold the table is.  Assume that a table which just meets the
> > dead-row count required to trigger autovacuum based on the
> > autovacuum_vacuum_scale_factor setting gets a priority of 1.0, while
> > another table that has n_mod_since_analyze at twice the
> > autovacuum_analyze_scale_factor trigger point gets priority 2.0.
> > Effectively, prioritise by the percentage over the given threshold
> > the table is.  That way, users could still tune things when they
> > weren't happy with the priority given to a table by adjusting the
> > corresponding reloption.
>
> If users are tuning this thing then I feel like we've already lost the
> battle :)

I replied too quickly.  Re-reading your email, I think you're proposing a
different algorithm, taking tuple counts into account.  No tunables.  Is
there a fully fleshed-out version of the proposed alternative algorithm
somewhere (one of the older threads?)  I guess this is why it's so hard to
get anything committed in this area...

-J
On Thu, 9 Oct 2025 at 14:47, Jeremy Schneider <schneider@ardentperf.com> wrote:
>
> On Wed, 8 Oct 2025 18:25:20 -0700
> Jeremy Schneider <schneider@ardentperf.com> wrote:
> > If users are tuning this thing then I feel like we've already lost
> > the battle :)
>
> I replied too quickly.  Re-reading your email, I think you're proposing
> a different algorithm, taking tuple counts into account.  No tunables.
> Is there a fully fleshed-out version of the proposed alternative
> algorithm somewhere (one of the older threads?)  I guess this is why
> it's so hard to get anything committed in this area...

It's along the lines of the "1a)" from [1].  I don't think that post does
a great job of explaining it.

I think the best way to understand it is to look at
relation_needs_vacanalyze() and see how it calculates boolean values for
its boolean output params.  Instead of calculating just a boolean value,
it would instead calculate a float4, where < 1.0 means don't do the
operation and anything >= 1.0 means do the operation.  For example, let's
say a table has 600 dead rows and the scale factor and threshold settings
mean that autovacuum will trigger at 200 (3 times more dead tuples than
the trigger point).  That would result in the value of 3.0 (600 / 200).
The priority for the relfrozenxid portion is basically age(relfrozenxid) /
autovacuum_freeze_max_age (plus we need to account for mxid by doing the
same for that and taking the maximum of the two values).  The overall
priority for autovacuum would be the maximum of each of those component
"scores".

Effectively, it's a method of aligning the different units of measure,
transactions or tuples, into a single value which is calculated based on
the very same values that we use today to trigger autovacuums.

David

[1] https://postgr.es/m/CAApHDvo8DWyt4CWhF=NPeRstz_78SteEuuNDfYO7cjp=7YTK4g@mail.gmail.com
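To make that concrete, the idea can be rendered roughly in SQL (a sketch
only, assuming the default autovacuum_vacuum_threshold = 50,
autovacuum_vacuum_scale_factor = 0.2, and autovacuum_freeze_max_age = 200
million, and ignoring reloptions and the insert/analyze components):

    SELECT s.relid::regclass AS rel,
           GREATEST(
               -- dead-tuple score: how far over the vacuum threshold we are
               s.n_dead_tup / (50 + 0.2 * GREATEST(c.reltuples, 0)),
               -- freeze score: fraction of autovacuum_freeze_max_age consumed
               age(c.relfrozenxid) / 200_000_000.0
           ) AS score
    FROM pg_stat_user_tables s
    JOIN pg_class c ON c.oid = s.relid
    ORDER BY score DESC;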
On Wed, Oct 08, 2025 at 01:37:22PM -0400, Andres Freund wrote:
> On 2025-10-08 10:18:17 -0500, Nathan Bossart wrote:
>> The attached patch works by storing the maximum of the XID age and the MXID
>> age in the list with the OIDs and sorting it prior to processing.
>
> I think it may be worth trying to avoid reliably using the same order -
> otherwise e.g. a corrupt index on the first scheduled table can cause
> autovacuum to reliably fail on the same relation, never allowing it to
> progress past that point.

Hm.  What if we kept a short array of "failed" tables in shared memory?
Each worker would consult this array before processing.  If a table is
there, the worker would remove it from the shared array and skip
processing it.  Then the next worker would try processing the table again.

I also wonder how hard it would be to gracefully catch the error and let
the worker continue with the rest of its list...

--
nathan
On Thu, Oct 09, 2025 at 04:13:23PM +1300, David Rowley wrote:
> I think the best way to understand it is to look at
> relation_needs_vacanalyze() and see how it calculates boolean values for
> its boolean output params.  Instead of calculating just a boolean value,
> it would instead calculate a float4, where < 1.0 means don't do the
> operation and anything >= 1.0 means do the operation.  For example, let's
> say a table has 600 dead rows and the scale factor and threshold settings
> mean that autovacuum will trigger at 200 (3 times more dead tuples than
> the trigger point).  That would result in the value of 3.0 (600 / 200).
> The priority for the relfrozenxid portion is basically age(relfrozenxid) /
> autovacuum_freeze_max_age (plus we need to account for mxid by doing the
> same for that and taking the maximum of the two values).  The overall
> priority for autovacuum would be the maximum of each of those component
> "scores".
>
> Effectively, it's a method of aligning the different units of measure,
> transactions or tuples, into a single value which is calculated based on
> the very same values that we use today to trigger autovacuums.

I like the idea of a "score" approach, but I'm worried that we'll never
come to an agreement on the formula to use.  Perhaps we'd have more luck
getting consensus on a multifaceted strategy if we kept it brutally simple.
IMHO it's worth a try...

--
nathan
Hi,

On 2025-10-09 11:01:16 -0500, Nathan Bossart wrote:
> On Wed, Oct 08, 2025 at 01:37:22PM -0400, Andres Freund wrote:
> > I think it may be worth trying to avoid reliably using the same order -
> > otherwise e.g. a corrupt index on the first scheduled table can cause
> > autovacuum to reliably fail on the same relation, never allowing it to
> > progress past that point.
>
> Hm.  What if we kept a short array of "failed" tables in shared memory?

I've thought about having that as part of pgstats...

> Each worker would consult this array before processing.  If a table is
> there, the worker would remove it from the shared array and skip
> processing it.  Then the next worker would try processing the table
> again.
>
> I also wonder how hard it would be to gracefully catch the error and let
> the worker continue with the rest of its list...

The main set of cases I've seen is when workers get hung up permanently in
corrupt indexes.  There never actually is an error; the autovacuums just
get terminated as part of whatever independent reason there is to restart.
The problem with that is that you'll never actually have vacuum fail...

Greetings,

Andres Freund
On Thu, Oct 09, 2025 at 12:15:31PM -0400, Andres Freund wrote:
> On 2025-10-09 11:01:16 -0500, Nathan Bossart wrote:
>> I also wonder how hard it would be to gracefully catch the error and let
>> the worker continue with the rest of its list...
>
> The main set of cases I've seen is when workers get hung up permanently
> in corrupt indexes.  There never actually is an error; the autovacuums
> just get terminated as part of whatever independent reason there is to
> restart.  The problem with that is that you'll never actually have
> vacuum fail...

Ah.  Wouldn't the other workers skip that table in that scenario?  I'm not
following the great advantage of varying the order in this case.  I
suppose the full set of workers might be able to process more tables
before one inevitably gets stuck.  Is that it?

--
nathan
On Thu, Oct 9, 2025 at 12:15 PM Andres Freund <andres@anarazel.de> wrote:
> > Each worker would consult this array before processing.  If a table is
> > there, the worker would remove it from the shared array and skip
> > processing it.  Then the next worker would try processing the table
> > again.
> >
> > I also wonder how hard it would be to gracefully catch the error and
> > let the worker continue with the rest of its list...
>
> The main set of cases I've seen is when workers get hung up permanently
> in corrupt indexes.

How recently was this?  I'm aware of problems like that that we discussed
around 2018, but they were greatly mitigated, first by your commit
3a01f68e, then by my commit c34787f9.

In general, there's no particularly good reason why (at least with nbtree
indexes) VACUUM should ever hang forever.  The access pattern is
overwhelmingly simple, sequential access.  The only exception is nbtree
page deletion (plus backtracking), where it isn't particularly hard to
just be very careful about self-deadlock.

> There never actually is an error; the autovacuums just get terminated
> as part of whatever independent reason there is to restart.

What do you mean?  In general I'd expect nbtree VACUUM of a corrupt index
to either not fail at all (we'll soldier on to the best of our ability
when page deletion encounters an inconsistency), or to get permanently
stuck due to locking the same page twice/self-deadlock (though as I said,
those problems were mitigated, and might even be almost impossible these
days).  Every other case involves some kind of error (e.g., an OOM is just
about possible).

I agree with you about using a perfectly deterministic order coming with
real downsides, without any upside.  Don't interpret what I've said as
expressing opposition to that idea.

--
Peter Geoghegan
On Thu, Oct 09, 2025 at 11:13:48AM -0500, Nathan Bossart wrote:
> On Thu, Oct 09, 2025 at 04:13:23PM +1300, David Rowley wrote:
>> I think the best way to understand it is to look at
>> relation_needs_vacanalyze() and see how it calculates boolean values
>> for its boolean output params.  Instead of calculating just a boolean
>> value, it would instead calculate a float4, where < 1.0 means don't do
>> the operation and anything >= 1.0 means do the operation.  For example,
>> let's say a table has 600 dead rows and the scale factor and threshold
>> settings mean that autovacuum will trigger at 200 (3 times more dead
>> tuples than the trigger point).  That would result in the value of 3.0
>> (600 / 200).  The priority for the relfrozenxid portion is basically
>> age(relfrozenxid) / autovacuum_freeze_max_age (plus we need to account
>> for mxid by doing the same for that and taking the maximum of the two
>> values).  The overall priority for autovacuum would be the maximum of
>> each of those component "scores".
>>
>> Effectively, it's a method of aligning the different units of measure,
>> transactions or tuples, into a single value which is calculated based
>> on the very same values that we use today to trigger autovacuums.
>
> I like the idea of a "score" approach, but I'm worried that we'll never
> come to an agreement on the formula to use.  Perhaps we'd have more luck
> getting consensus on a multifaceted strategy if we kept it brutally
> simple.  IMHO it's worth a try...

Here's a prototype of a "score" approach.  Two notes:

* I've given special priority to anti-wraparound vacuums.  I think this is
important to avoid focusing too much on bloat when wraparound is imminent.
In any case, we need a separate wraparound score in case autovacuum is
disabled.

* I didn't include the analyze threshold in the score because it doesn't
apply to TOAST tables, and it would therefore artificially lower their
priority.  Perhaps there is another way to deal with this.

This is very much just a prototype of the basic idea.  As-is, I think
it'll favor processing tables with lots of bloat unless we're in an
anti-wraparound scenario.  Maybe that's okay.  I'm not sure how scientific
we want to be about all of this, but I do intend to try some long-running
tests.

--
nathan
On Fri, Oct 10, 2025 at 1:31 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> Here's a prototype of a "score" approach.  Two notes:
>
> * I've given special priority to anti-wraparound vacuums.  I think this
> is important to avoid focusing too much on bloat when wraparound is
> imminent.  In any case, we need a separate wraparound score in case
> autovacuum is disabled.
>
> * I didn't include the analyze threshold in the score because it doesn't
> apply to TOAST tables, and it would therefore artificially lower their
> priority.  Perhaps there is another way to deal with this.
>
> This is very much just a prototype of the basic idea.  As-is, I think
> it'll favor processing tables with lots of bloat unless we're in an
> anti-wraparound scenario.  Maybe that's okay.  I'm not sure how
> scientific we want to be about all of this, but I do intend to try some
> long-running tests.

I think this is a reasonable starting point, although I'm surprised
that you chose to combine the sub-scores using + rather than Max.

I think it will take a lot of experimentation to figure out whether this
particular algorithm (or any other) works well in practice.  My intuition
(for whatever that is worth to you, which may not be much) is that what
will anger users is cases where we ignore a horrible problem to deal with
a routine problem.  Figuring out how to design the scoring system to avoid
such outcomes is the hard part of this problem, IMHO.

For this particular algorithm, the main hazards that spring to mind for me
are:

- The wraparound score can't be more than about 10, but the bloat
score could be arbitrarily large, especially for tables with few
tuples, so there may be lots of cases in which the wraparound score
has no impact on the behavior.

- The patch attempts to guard against this by disregarding the
non-wraparound portion of the score once the wraparound portion reaches
1.0, but that results in an abrupt behavior shift at that point.  Suddenly
we go from mostly ignoring the wraparound score to entirely ignoring the
bloat score.  This might result in the system abruptly ignoring tables
that are bloating extremely rapidly in favor of trying to catch up in a
wraparound situation that is not yet terribly urgent.

When I've thought about this problem -- and I can't claim to have
thought about it very hard -- it's seemed to me that we need to (1)
somehow normalize everything to somewhat similar units and (2) make
sure that severe wraparound danger always wins over every other
consideration, but mild wraparound danger can lose to severe bloat.

--
Robert Haas
EDB: http://www.enterprisedb.com
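To put a number on that first hazard (using the default
autovacuum_freeze_max_age of 200 million; the exact ceiling depends on
configuration):

    -- Table age is capped near 2 billion by the XID space, so a simple
    -- age-ratio wraparound score tops out around 10, while a dead-tuple
    -- ratio has no such ceiling:
    SELECT 2_000_000_000 / 200_000_000.0 AS max_wraparound_score;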
Thanks for taking a look.

On Fri, Oct 10, 2025 at 02:42:57PM -0400, Robert Haas wrote:
> I think this is a reasonable starting point, although I'm surprised
> that you chose to combine the sub-scores using + rather than Max.

My thinking was that we should consider as many factors as we can in the
score, not just the worst one.  If a table has medium bloat and medium
wraparound risk, should it always be lower in priority than something with
large bloat and small wraparound risk?  It seems worth exploring.  I am
curious why you first thought of Max.

> When I've thought about this problem -- and I can't claim to have
> thought about it very hard -- it's seemed to me that we need to (1)
> somehow normalize everything to somewhat similar units and (2) make
> sure that severe wraparound danger always wins over every other
> consideration, but mild wraparound danger can lose to severe bloat.

Agreed.  I need to think about this some more.  While I'm optimistic that
we could come up with some sort of normalization framework, I desperately
want to avoid super-complicated formulas and GUCs, as those seem like
sure-fire ways of ensuring nothing ever gets committed.

--
nathan
On Fri, Oct 10, 2025 at 3:44 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> On Fri, Oct 10, 2025 at 02:42:57PM -0400, Robert Haas wrote:
> > I think this is a reasonable starting point, although I'm surprised
> > that you chose to combine the sub-scores using + rather than Max.
>
> My thinking was that we should consider as many factors as we can in the
> score, not just the worst one.  If a table has medium bloat and medium
> wraparound risk, should it always be lower in priority than something
> with large bloat and small wraparound risk?  It seems worth exploring.
> I am curious why you first thought of Max.

The right answer depends a good bit on how exactly you do the scoring, but
it seems to me that it would be easy to overweight secondary problems.
Consider a table with an XID age of 900m and an MXID age of 900m, and
another table with an XID age of 1.8b.  I think it is VERY clear that the
second one is MUCH worse; but just adding things up will make them seem
equal.

> Agreed.  I need to think about this some more.  While I'm optimistic
> that we could come up with some sort of normalization framework, I
> desperately want to avoid super-complicated formulas and GUCs, as those
> seem like sure-fire ways of ensuring nothing ever gets committed.

IMHO, the trick here is to come up with something that's neither too
simple nor too complicated.  If it's too simple, we'll easily come up with
cases where it sucks, and possibly where it's worse than what we do now
(an impressive achievement, to be sure).  If it's too complicated, it will
be full of arbitrary things that will provoke dissent and probably not
work out well in practice.  I don't think we need something dramatically
awesome to make a change to the status quo, but if it's extremely easy to
think up simple scenarios in which a given idea will fail spectacularly,
I'd be inclined to suspect that there will be a lot of real-world
spectacular failures.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, 10 Oct 2025 16:24:51 -0400
Robert Haas <robertmhaas@gmail.com> wrote:

> I don't think we need something dramatically awesome to make a change
> to the status quo, but if it's extremely easy to think up simple
> scenarios in which a given idea will fail spectacularly, I'd be
> inclined to suspect that there will be a lot of real-world spectacular
> failures.

What does a real-world spectacular failure look like?  "If those 3 autovac
workers had processed tables in a different order, everything would have
been peachy"?

But if autovac is going to get jammed up long enough for the system to
wrap around, does it matter whether or not it did a one-time pass over a
bunch of small tables before it got jammed?  One particular table always
scoring high shouldn't block autovac from other tables, because it doesn't
start a new iteration until it goes all the way through the list from its
current iteration, right?  And one iteration of autovac needs to process
everything in the list... so it should take the same overall time
regardless of order?

The spectacular failures I've seen with autovac usually come down to
things like too much sleeping (cost_delay) or too few workers, where
better ordering would be nice but probably wouldn't fix any of the real
problems leading to the spectacular failures.

From Robert's 2024 pgconf.dev talk:

1. slow - forward progress not fast enough
2. stuck - no forward progress
3. spinning - not accomplishing anything
4. skipped - thinks not needed
5. starvation - can't keep up

I don't think any of these are really addressed by simply changing the
table order.

From Robert's 2022 email to -hackers:

> A few people have proposed scoring systems, which I think is closer
> to the right idea, because our basic goal is to start vacuuming any
> given table soon enough that we finish vacuuming it before some
> catastrophe strikes.
...
> If table A will cause wraparound in 2 hours and take 2 hours to
> vacuum, and table B will cause wraparound in 1 hour and take 10
> minutes to vacuum, table A is more urgent even though the catastrophe
> is further out.

Robert, it sounds to me like the main use case you're focused on here is
where wraparound is basically imminent - we are already screwed - and our
very last hope is that a last-ditch autovac can finish just in time.

Failsafe and dynamic cost updates were huge advancements.  Do we allow
dynamic adjustment of the worker count yet?

I hope y'all just pick something and commit it without getting too lost in
the details.  I honestly think that in the list of improvements around
autovac, this is the lowest priority on my list of hopes and dreams as a
user for wraparound prevention :) because if this ever matters to me for
avoiding wraparound, I was screwed long before we got to this point, and
this is not going to fix my underlying problems.

-Jeremy
On Sat, 11 Oct 2025 at 07:43, Robert Haas <robertmhaas@gmail.com> wrote:
> I think this is a reasonable starting point, although I'm surprised
> that you chose to combine the sub-scores using + rather than Max.
Adding up the component scores doesn't make sense to me either. That
means you could have 0.5 for inserted tuples, 0.5 for dead tuples and,
say, 0.1 for the analyze threshold, which all add up to 1.1, but no
component score is high enough for auto-vacuum to have to do anything
yet. With Max(), we'd clearly see that there's nothing to do since the
overall score isn't >= 1.0.
> - The wraparound score can't be more than about 10, but the bloat
> score could be arbitrarily large, especially for tables with few
> tuples, so there may be lots of cases in which the wraparound score
> has no impact on the behavior.
That's a good point. I think we definitely do want to make it so
tables in near danger of causing the database to stop accepting
transactions are dealt with ASAP.
Maybe the score calculation could change when the relevant age() goes
above vacuum_failsafe_age / vacuum_multixact_failsafe_age and start
scaling it very aggressively beyond that. There's plenty to debate,
but at a first cut, maybe something like the following (coded in SQL
for ease of result viewing):
select xidage as "age(relfrozenxid)",
       case xidage::float8 < current_setting('vacuum_failsafe_age')::float8
            when true
              then xidage / current_setting('autovacuum_freeze_max_age')::float8
            else power(xidage / current_setting('autovacuum_freeze_max_age')::float8,
                       xidage::float8 / 100_000_000)
       end as xid_age_score
from generate_series(0, 2_000_000_000, 100_000_000) xidage;
which gives 1e+20 for age of 2 billion. It would take quite an
unreasonable amount of bloat to score higher than that.
I guess someone might argue that we should start taking it more
seriously before the table's relfrozenxid age gets to
vacuum_failsafe_age. Maybe that's true; I just don't know where that
point should be. In any case, if a table's age gets that old, then
something's probably not configured very well and needs attention. I
did think that maybe we could keep the problem of auto-vacuum being
configured to run too slowly to a separate thread.
David
On Fri, Oct 10, 2025 at 6:00 PM Jeremy Schneider <schneider@ardentperf.com> wrote:
> The spectacular failures I've seen with autovac usually come down to
> things like too much sleeping (cost_delay) or too few workers, where
> better ordering would be nice but probably wouldn't fix any of the real
> problems leading to the spectacular failures.

Since I have said the same thing myself, I can hardly disagree.  However,
there are probably a few exceptions.  For instance, if autovacuum on a
certain table is failing repeatedly, or accomplishing nothing without
removing the apparent need to autovacuum, and that table happens to be the
first one in pg_class, it could divert a lot of attention from other
tables.

> Robert, it sounds to me like the main use case you're focused on here
> is where wraparound is basically imminent - we are already screwed -
> and our very last hope is that a last-ditch autovac can finish just in
> time.

Yes, I would argue that this is the scenario that really matters.  As you
say above, the main thing is having little enough sleeping and a
sufficient number of workers.  When that's the case, we can do the work in
any order and life will mostly be fine.  However, if we get into a
desperate situation by, say, having one table that can't be vacuumed, and
eventually someone fixes that, say by dropping the corrupt index that is
preventing vacuuming of that table, we might like it if autovacuum focused
on getting that table vacuumed rather than getting lost in the sauce.

Of course, if we have the pretty common situation where autovacuum gets
behind on all tables, say due to a stale replication slot, then this is
less critical, although a perfect system would probably prioritize
vacuuming the *largest* tables in this situation, since those will take
the longest to finish, and it's when a vacuum of every table in the
cluster has been *completed* that the XID horizons can advance.

> I hope y'all just pick something and commit it without getting too lost
> in the details.  I honestly think that in the list of improvements
> around autovac, this is the lowest priority on my list of hopes and
> dreams as a user for wraparound prevention :) because if this ever
> matters to me for avoiding wraparound, I was screwed long before we got
> to this point, and this is not going to fix my underlying problems.

I'm not sure if this was your intention, but to me this kind of reads like
"well, it's not going to matter anyway, so just do whatever and move on,"
and I don't agree with that.  I think that if we're not going to do
high-quality engineering here, we just shouldn't change anything at all.
It's better to keep having the same bad behavior than for each release to
have new and different bad behavior.

One possible positive result of leaning into this prioritization problem
is that whoever's working on it (Nathan, in this case) might gain some
useful insights about how to tackle some of the other problems in this
space.  All of this is hard enough that we haven't really had any major
improvements in this area since, I want to say, 8.3, and it's desirable to
break that logjam even if we don't all agree on which problems are most
urgent.  Even if I ultimately don't agree with whatever Nathan wants to do
or proposes, I'm glad he's trying to do something, which is (in my
experience) generally much better than making no effort at all.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Sun, Oct 12, 2025 at 07:27:10PM +1300, David Rowley wrote:
> On Sat, 11 Oct 2025 at 07:43, Robert Haas <robertmhaas@gmail.com> wrote:
>> I think this is a reasonable starting point, although I'm surprised
>> that you chose to combine the sub-scores using + rather than Max.
> 
> Adding up the component scores doesn't make sense to me either. That
> means you could have 0.5 for inserted tuples, 0.5 for dead tuples and,
> say, 0.1 for the analyze threshold, which all add up to 1.1, but no
> component score is high enough for auto-vacuum to have to do anything
> yet. With Max(), we'd clearly see that there's nothing to do since the
> overall score isn't >= 1.0.
In v3, I switched to Max().
> Maybe the score calculation could change when the relevant age() goes
> above vacuum_failsafe_age / vacuum_multixact_failsafe_age and start
> scaling it very aggressively beyond that. There's plenty to debate,
> but at a first cut, maybe something like the following (coded in SQL
> for ease of result viewing):
> 
> select xidage as "age(relfrozenxid)",
>        case xidage::float8 < current_setting('vacuum_failsafe_age')::float8
>             when true
>               then xidage / current_setting('autovacuum_freeze_max_age')::float8
>             else power(xidage / current_setting('autovacuum_freeze_max_age')::float8,
>                        xidage::float8 / 100_000_000)
>        end as xid_age_score
> from generate_series(0, 2_000_000_000, 100_000_000) xidage;
> 
> which gives 1e+20 for age of 2 billion. It would take quite an
> unreasonable amount of bloat to score higher than that.
> 
> I guess someone might argue that we should start taking it more
> seriously before the table's relfrozenxid age gets to
> vacuum_failsafe_age. Maybe that's true; I just don't know where that
> point should be. In any case, if a table's age gets that old, then
> something's probably not configured very well and needs attention. I
> did think that maybe we could keep the problem of auto-vacuum being
> configured to run too slowly to a separate thread.
I did something similar to this in v3, although I used the *_freeze_max_age
parameters as the point to start scaling aggressively, and I simply raised
the score to the power of 10.
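In SQL terms, that looks roughly like the following (a sketch assuming the
default autovacuum_freeze_max_age of 200 million; the patch computes this
in C, and details may differ):

    select xidage as "age(relfrozenxid)",
           case when xidage < 200_000_000 then xidage / 200_000_000.0
                else power(xidage / 200_000_000.0, 10)
           end as xid_age_score
    from generate_series(0, 2_000_000_000, 200_000_000) xidage;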
I've yet to do any real testing with this stuff.
-- 
nathan
On Wed, 22 Oct 2025 at 03:38, Nathan Bossart <nathandbossart@gmail.com> wrote:
> I did something similar to this in v3, although I used the *_freeze_max_age
> parameters as the point to start scaling aggressively, and I simply raised
> the score to the power of 10.
>
> I've yet to do any real testing with this stuff.

I've not tested it or compiled it, but the patch looks good.

I did think that a freeze vacuum isn't that big a deal if the table is
just over *freeze_max_age, and that the score should become aggressive
very quickly at the failsafe age instead, but that leaves a much smaller
window of time to do the freezing if autovacuum has been busy with other,
higher-priority tables.  Your scaling is much more gentle and comes out
(with standard settings) at a score of about 1 billion for a table at the
failsafe age, and about 1 million at half the failsafe age.  That seems
reasonable, as it's hard to imagine a table having a bloat score of 1
billion.

However, just thinking of non-standard settings... I do wonder if it'll
be aggressive enough if someone raises *freeze_max_age to 1 billion (it's
certainly common for people to raise this).  With a 1.6 billion
vacuum_failsafe_age, a table at the failsafe age then only scores about
110.  I guess there's no reason we couldn't keep your calculation and then
scale the score further once over vacuum_failsafe_age to ensure those
tables are the highest priority.  There is a danger that if a table scores
too low when age(relfrozenxid) > vacuum_failsafe_age, autovacuum dawdles
along handling bloated tables while oblivious to the nearing armageddon.

Is it worth writing a comment explaining the philosophy behind the scoring
system, to make it easier for people to understand that it aims to
standardise the priority of vacuums and unify the various trigger
thresholds into a single number that determines which tables are most
important to vacuum and/or analyze first?

Thanks for working on this.

David
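The two numbers above can be sanity-checked directly (a sketch; this
assumes the v3 score is (age / freeze_max_age)^10 once the table is past
freeze_max_age):

    select round(power(1_600_000_000 / 200_000_000.0, 10))   as with_default_max_age,  -- ~1.07 billion
           round(power(1_600_000_000 / 1_000_000_000.0, 10)) as with_1b_max_age;       -- ~110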
On Wed, Oct 22, 2025 at 09:07:33AM +1300, David Rowley wrote:
> However, just thinking of non-standard settings... I do wonder if it'll
> be aggressive enough if someone raises *freeze_max_age to 1 billion
> (it's certainly common for people to raise this).  With a 1.6 billion
> vacuum_failsafe_age, a table at the failsafe age then only scores about
> 110.  I guess there's no reason we couldn't keep your calculation and
> then scale the score further once over vacuum_failsafe_age to ensure
> those tables are the highest priority.  There is a danger that if a
> table scores too low when age(relfrozenxid) > vacuum_failsafe_age,
> autovacuum dawdles along handling bloated tables while oblivious to the
> nearing armageddon.

That's a good point.  I wonder if we should try to make the wraparound
score independent of the *_freeze_max_age parameters (once the table age
surpasses said parameters).  Else, different settings will greatly impact
how aggressively tables are prioritized the closer they are to wraparound.
Even if autovacuum_freeze_max_age is set to 200M, it's not critically
important for autovacuum to pick up tables right away as soon as their age
reaches 200M.  But if the parameter is set to 2B, we _do_ want autovacuum
to prioritize tables right away once their age reaches 2B.

> Is it worth writing a comment explaining the philosophy behind the
> scoring system, to make it easier for people to understand that it aims
> to standardise the priority of vacuums and unify the various trigger
> thresholds into a single number that determines which tables are most
> important to vacuum and/or analyze first?

Yes, I think so.

> Thanks for working on this.

I appreciate the discussion.

--
nathan
On Wed, Oct 22, 2025 at 01:40:11PM -0500, Nathan Bossart wrote:
> On Wed, Oct 22, 2025 at 09:07:33AM +1300, David Rowley wrote:
>> However, just thinking of non-standard settings... I do wonder if it'll
>> be aggressive enough if someone raises *freeze_max_age to 1 billion
>> (it's certainly common for people to raise this).  With a 1.6 billion
>> vacuum_failsafe_age, a table at the failsafe age then only scores about
>> 110.  I guess there's no reason we couldn't keep your calculation and
>> then scale the score further once over vacuum_failsafe_age to ensure
>> those tables are the highest priority.  There is a danger that if a
>> table scores too low when age(relfrozenxid) > vacuum_failsafe_age,
>> autovacuum dawdles along handling bloated tables while oblivious to the
>> nearing armageddon.
> 
> That's a good point.  I wonder if we should try to make the wraparound
> score independent of the *_freeze_max_age parameters (once the table age
> surpasses said parameters).  Else, different settings will greatly impact
> how aggressively tables are prioritized the closer they are to wraparound.
> Even if autovacuum_freeze_max_age is set to 200M, it's not critically
> important for autovacuum to pick up tables right away as soon as their age
> reaches 200M.  But if the parameter is set to 2B, we _do_ want autovacuum
> to prioritize tables right away once their age reaches 2B.
I'm imagining something a bit like the following:
    select xidage "age(relfrozenxid)",
    power(1.001, xidage::float8 / (select min_val
    from pg_settings where name = 'autovacuum_freeze_max_age')::float8)
    xid_age_score from generate_series(0,2_000_000_000,100_000_000) xidage;
     age(relfrozenxid) |   xid_age_score
    -------------------+--------------------
                     0 |                  1
             100000000 | 2.7169239322355936
             200000000 |   7.38167565355452
             300000000 | 20.055451243143093
             400000000 |  54.48913545427955
             500000000 |  148.0428361625591
             600000000 | 402.22112456608977
             700000000 |  1092.804199384323
             800000000 |  2969.065882554825
             900000000 |  8066.726152697397
            1000000000 | 21916.681339054314
            1100000000 | 59545.956045257895
            1200000000 |  161781.8330472099
            1300000000 |  439548.9340069078
            1400000000 | 1194221.0181920114
            1500000000 |  3244607.664704634
            1600000000 |   8815352.21495106
            1700000000 | 23950641.403886583
            1800000000 |  65072070.82261215
            1900000000 | 176795866.53808445
            2000000000 |  480340920.9176516
    (21 rows)
-- 
nathan
On Thu, 23 Oct 2025 at 07:58, Nathan Bossart <nathandbossart@gmail.com> wrote:
> > That's a good point.  I wonder if we should try to make the wraparound
> > score independent of the *_freeze_max_age parameters (once the table
> > age surpasses said parameters).  Else, different settings will greatly
> > impact how aggressively tables are prioritized the closer they are to
> > wraparound.  Even if autovacuum_freeze_max_age is set to 200M, it's
> > not critically important for autovacuum to pick up tables right away
> > as soon as their age reaches 200M.  But if the parameter is set to 2B,
> > we _do_ want autovacuum to prioritize tables right away once their age
> > reaches 2B.
>
> I'm imagining something a bit like the following:
>
>     select xidage "age(relfrozenxid)",
>     power(1.001, xidage::float8 / (select min_val
>     from pg_settings where name = 'autovacuum_freeze_max_age')::float8)
>     xid_age_score from generate_series(0,2_000_000_000,100_000_000) xidage;
>
>      age(relfrozenxid) |   xid_age_score
>     -------------------+--------------------
>                      0 |                  1
>              100000000 | 2.7169239322355936
>              200000000 |   7.38167565355452
>              300000000 | 20.055451243143093

This does start to put the score > 1 before the table reaches
autovacuum_freeze_max_age.  I don't think that's great, as the score of
1.0 was meant to represent that the table now requires some autovacuum
work.

The main reason I was trying to keep the score scaling with the percentage
over the given threshold was that I had imagined we could use the score to
start reducing the sleep time taken after each autovacuum_vacuum_cost_limit
is reached when the highest-scoring table persists in scoring high for too
long.  I was considering this as a fix for the misconfigured-autovacuum
problem that so many people have.  If we scaled it similarly to the query
above, the score would look high even before the table reaches the limit.
This is the reason I was scaling the score linearly with
autovacuum_freeze_max_age in the version I sent, and only scaling
exponentially after the failsafe age.

I wanted to talk about the "reducing the cost delay" feature separately,
so as not to load up this thread and widen the scope for varying opinions,
but in its most trivial form, the vacuum_cost_limit() code could be
adjusted to only sleep for autovacuum_vacuum_cost_delay / <the table's
score>.

I think the formula I proposed in [1] does this quite well.  The table
remains eligible to be autovacuumed with any score >= 1.0, and there's
still a huge window of time to freeze a table once it's over
autovacuum_freeze_max_age before there are issues, and the exponential
scaling once over the failsafe age should ensure that the table is at the
top of the list for when the failsafe code kicks in and removes the cost
limit.

If we had the varying sleep time as I mentioned above, the failsafe code
could even be removed, as the "autovacuum_vacuum_cost_delay / <table's
score>" calculation would effectively zero the sleep time for any table
over the failsafe age.

David

[1] https://postgr.es/m/CAApHDvqrd=SHVUytdRj55OWnLH98Rvtzqam5zq2f4XKRZa7t9Q@mail.gmail.com
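A toy illustration of that trivial form, assuming the default
autovacuum_vacuum_cost_delay of 2ms (the division is a sketch of the idea,
not committed behavior):

    select score,
           2.0 / score as effective_delay_ms
    from unnest(array[1.0, 2.0, 10.0, 1000.0, 1e9]) as t(score);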
On Thu, Oct 23, 2025 at 08:34:49AM +1300, David Rowley wrote:
> On Thu, 23 Oct 2025 at 07:58, Nathan Bossart <nathandbossart@gmail.com> wrote:
>> I'm imagining something a bit like the following:
>>
>>     select xidage "age(relfrozenxid)",
>>     power(1.001, xidage::float8 / (select min_val
>>     from pg_settings where name = 'autovacuum_freeze_max_age')::float8)
>>     xid_age_score from generate_series(0,2_000_000_000,100_000_000) xidage;
>>
>>      age(relfrozenxid) |   xid_age_score
>>     -------------------+--------------------
>>                      0 |                  1
>>              100000000 | 2.7169239322355936
>>              200000000 |   7.38167565355452
>>              300000000 | 20.055451243143093
>
> This does start to put the score > 1 before the table reaches
> autovacuum_freeze_max_age.  I don't think that's great, as the score of
> 1.0 was meant to represent that the table now requires some autovacuum
> work.

My thinking was that this formula would only be used once the table
reaches autovacuum_freeze_max_age.  If the age is less than that, we'd do
something else, such as dividing the age by the *_max_age setting.

> The main reason I was trying to keep the score scaling with the
> percentage over the given threshold was that I had imagined we could use
> the score to start reducing the sleep time taken after each
> autovacuum_vacuum_cost_limit is reached when the highest-scoring table
> persists in scoring high for too long.  I was considering this as a fix
> for the misconfigured-autovacuum problem that so many people have.  If
> we scaled it similarly to the query above, the score would look high
> even before the table reaches the limit.  This is the reason I was
> scaling the score linearly with autovacuum_freeze_max_age in the version
> I sent, and only scaling exponentially after the failsafe age.
>
> I wanted to talk about the "reducing the cost delay" feature separately,
> so as not to load up this thread and widen the scope for varying
> opinions, but in its most trivial form, the vacuum_cost_limit() code
> could be adjusted to only sleep for autovacuum_vacuum_cost_delay / <the
> table's score>.

I see.

> I think the formula I proposed in [1] does this quite well.  The table
> remains eligible to be autovacuumed with any score >= 1.0, and there's
> still a huge window of time to freeze a table once it's over
> autovacuum_freeze_max_age before there are issues, and the exponential
> scaling once over the failsafe age should ensure that the table is at
> the top of the list for when the failsafe code kicks in and removes the
> cost limit.

Yeah.  I'll update the patch with that formula.

--
nathan
> > I think the formula I proposed in [1] does this quite well.  The table
> > remains eligible to be autovacuumed with any score >= 1.0, and there's
> > still a huge window of time to freeze a table once it's over
> > autovacuum_freeze_max_age before there are issues, and the exponential
> > scaling once over the failsafe age should ensure that the table is at
> > the top of the list for when the failsafe code kicks in and removes
> > the cost limit.
>
> Yeah.  I'll update the patch with that formula.

I was looking at v3, and I understand the formula will be updated in the
next version.  However, do you think we should benchmark the approach of
using an intermediary list to store the eligible tables and sorting that
list, which may cause larger performance overhead for databases with
hundreds of tables that may all be eligible for autovacuum?  I do think
such cases are common out there, particularly in multi-tenant-type
databases, where each tenant could be one or more tables.  What do you
think?

--
Sami
On Thu, Oct 23, 2025 at 01:22:24PM -0500, Sami Imseih wrote:
> I was looking at v3, and I understand the formula will be updated in the
> next version.  However, do you think we should benchmark the approach of
> using an intermediary list to store the eligible tables and sorting that
> list?  It may cause larger performance overhead for databases with
> hundreds of tables that may all be eligible for autovacuum.  I do think
> such cases are common, particularly in multi-tenant type databases, where
> each tenant could be one or more tables.

We already have an intermediary list of table OIDs, so the additional
overhead is ultimately just the score calculation and the sort operation.
I'd be quite surprised if that added up to anything remotely worrisome,
even for thousands of eligible tables.

-- 
nathan
> On Thu, Oct 23, 2025 at 01:22:24PM -0500, Sami Imseih wrote:
>> I was looking at v3, and I understand the formula will be updated in the
>> next version.  However, do you think we should benchmark the approach of
>> using an intermediary list to store the eligible tables and sorting that
>> list?  It may cause larger performance overhead for databases with
>> hundreds of tables that may all be eligible for autovacuum.  I do think
>> such cases are common, particularly in multi-tenant type databases, where
>> each tenant could be one or more tables.
>
> We already have an intermediary list of table OIDs, so the additional
> overhead is ultimately just the score calculation and the sort operation.
> I'd be quite surprised if that added up to anything remotely worrisome,
> even for thousands of eligible tables.

Yeah, you’re correct, the list already exists; sorry I missed that.  My
main concern is the additional overhead of the sort operation, especially
if we have many eligible tables and an aggressive autovacuum_naptime.  I
don’t think we should make the existing performance with many relations
any worse by adding a sort.

That said, in such cases the sort may not even be the main performance
bottleneck, since the catalog scan itself already doesn’t scale well with
many relations.  With our current approach, we have more options to
improve this, but if we add a sort, we may not be able to avoid a full
scan.

-- 
Sami
On Fri, 24 Oct 2025 at 08:33, Sami Imseih <samimseih@gmail.com> wrote:
> Yeah, you’re correct, the list already exists; sorry I missed that.  My
> main concern is the additional overhead of the sort operation, especially
> if we have many eligible tables and an aggressive autovacuum_naptime.

It is true that there are reasons that millions of tables could suddenly
become eligible for autovacuum work with the consumption of a single xid,
but I imagine sorting the list of tables is probably the least of the
DBA's worries in that case, as sorting the tables_to_process list is going
to take a tiny fraction of the time that doing the vacuum work will take.

If your concern is that the sort could take too large a portion of
someone's 1-second autovacuum_naptime, then you also need to consider that
the list isn't likely to be very long, as there's very little time for
tables to become eligible in such a short naptime.  And if the tables are
piling up because autovacuum is configured to run too slowly, then let's
fix that at the root cause rather than be worried about improving one area
because another area needs work.  If we think like that, we'll remain
gridlocked and autovacuum will never be improved.  TBH, I think that
mindset has likely contributed quite a bit to the fact that we've made
about zero improvements in this area, despite nobody thinking that nothing
needs to be done.

There are also things that could be done if we were genuinely concerned
and had actual proof that this could reasonably be a problem.
sort_template.h would reduce the constant factor of the indirect function
call overhead by quite a bit.

On a quick test here with a table containing 1 million random float8
values, a Seq Scan and in-memory Sort, EXPLAIN ANALYZE reports the sort
took about 21ms: (actual time=172.273..193.824).  I really doubt anyone
will be concerned with 21ms when there's a list of 1 million tables
needing to be autovacuumed.

David
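David's 21ms figure is easy to sanity-check.  A variant of his test that
uses generate_series instead of a pre-built table (so the absolute numbers
will differ from his) looks like this:

    -- Sort one million random float8 values in memory; the gap between the
    -- Sort node's and the scan's "actual time" is the cost of the sort
    -- itself.  work_mem must be large enough for the sort to stay in memory.
    set work_mem = '128MB';
    explain (analyze)
    select r
    from (select random() as r from generate_series(1, 1000000)) as s
    order by r;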
> On Fri, 24 Oct 2025 at 08:33, Sami Imseih <samimseih@gmail.com> wrote:
>> Yeah, you’re correct, the list already exists; sorry I missed that.  My
>> main concern is the additional overhead of the sort operation, especially
>> if we have many eligible tables and an aggressive autovacuum_naptime.
>
> It is true that there are reasons that millions of tables could suddenly
> become eligible for autovacuum work with the consumption of a single xid,
> but I imagine sorting the list of tables is probably the least of the
> DBA's worries in that case, as sorting the tables_to_process list is going
> to take a tiny fraction of the time that doing the vacuum work will take.

Yes, in my last reply I did indicate that the sort will likely not be the
operation that tips the performance over, but rather the catalog scan
itself, which I have seen not scale well as the number of relations grows
(in cases of thousands or hundreds of thousands of tables).  If we are to
prioritize vacuuming by (M)XID age, then it will be hard to avoid the
catalog scan in a future improvement.

> TBH, I think that mindset has likely contributed quite a bit to the fact
> that we've made about zero improvements in this area, despite nobody
> thinking that nothing needs to be done.

I am not against this idea, just thinking out loud about the high
relation-count cases I have seen in the past.

-- 
Sami
On Fri, 24 Oct 2025 at 09:48, Sami Imseih <samimseih@gmail.com> wrote:
> Yes, in my last reply I did indicate that the sort will likely not be the
> operation that tips the performance over, but rather the catalog scan
> itself, which I have seen not scale well as the number of relations grows
> (in cases of thousands or hundreds of thousands of tables).  If we are to
> prioritize vacuuming by (M)XID age, then it will be hard to avoid the
> catalog scan in a future improvement.

I grant you that I could see that being a problem for a sufficiently large
number of tables and a small enough autovacuum_naptime, but I don't see
how anything being proposed here moves the goalposts on the requirement to
scan pg_class.  We at least need to get the relopts from somewhere, plus
reltuples, relpages, and relallfrozen.  We can't magic those values out of
thin air.

So, since nothing is changing in regards to the scan of pg_class or which
columns we need to look at in that table, I don't know why we'd consider
it a topic to discuss on this thread.  If this thread becomes a dumping
ground for unrelated problems, then nothing will be done to fix the
problem at hand.

David
Here is an updated patch based on the latest discussion.

-- 
nathan
Attachments
On Wed, Oct 22, 2025 at 3:35 PM David Rowley <dgrowleyml@gmail.com> wrote:
> If we had the varying sleep time as I mentioned above, the failsafe code
> could even be removed, as the "autovacuum_vacuum_cost_delay / <table's
> score>" calculation would effectively zero the sleep time for any table >
> failsafe age.

I'm not sure what you mean by "the failsafe could be removed".
Importantly, the failsafe will abandon all further index vacuuming.
That's why it's presented as something that you as a user are not supposed
to rely on.

-- 
Peter Geoghegan
On Sat, 25 Oct 2025 at 10:14, Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Oct 22, 2025 at 3:35 PM David Rowley <dgrowleyml@gmail.com> wrote:
>> If we had the varying sleep time as I mentioned above, the failsafe code
>> could even be removed, as the "autovacuum_vacuum_cost_delay / <table's
>> score>" calculation would effectively zero the sleep time for any table
>> > failsafe age.
>
> I'm not sure what you mean by "the failsafe could be removed".
> Importantly, the failsafe will abandon all further index vacuuming.
> That's why it's presented as something that you as a user are not
> supposed to rely on.

I didn't realise it did that too.  I thought it just dropped the delay to
zero.  In that case, I revoke the statement.

David
On Sat, 25 Oct 2025 at 04:08, Nathan Bossart <nathandbossart@gmail.com> wrote:
> Here is an updated patch based on the latest discussion.
Thanks. I've just had a look at it. A few comments and questions.
1) The subtraction here looks back to front:
+ xid_age = TransactionIdIsNormal(relfrozenxid) ? relfrozenxid - recentXid : 0;
+ mxid_age = MultiXactIdIsValid(relminmxid) ? relminmxid - recentMulti : 0;
2) Would it be better to move all the code that sets the xid_score and
mxid_score to under an "if (force_vacuum)"? Those two variables could
be declared in there too.
3) Could the following be refactored a bit so we only check the "relid
!= StatisticRelationId" condition once?
+ if (relid != StatisticRelationId &&
+ classForm->relkind != RELKIND_TOASTVALUE)
Something like:
/* ANALYZE refuses to work with pg_statistic and we don't analyze
toast tables */
if (anltuples > anlthresh && relid != StatisticRelationId &&
    classForm->relkind != RELKIND_TOASTVALUE)
{
    *doanalyze = true;
    // calc analyze score and Max with *score
}
else
  *doanalyze = false;
then delete:
/* ANALYZE refuses to work with pg_statistic */
if (relid == StatisticRelationId)
    *doanalyze = false;
4) Should these be TransactionIds?
+ uint32 xid_age;
+ uint32 mxid_age;
5) Instead of:
+ double score = 0.0;
Is it better to zero the score inside relation_needs_vacanalyze() so
it works the same as the other output parameters?
David
			
On Sun, Oct 26, 2025 at 02:25:48PM +1300, David Rowley wrote:
> Thanks. I've just had a look at it. A few comments and questions.

Thanks.

> 1) The subtraction here looks back to front:
>
> + xid_age = TransactionIdIsNormal(relfrozenxid) ? relfrozenxid - recentXid : 0;
> + mxid_age = MultiXactIdIsValid(relminmxid) ? relminmxid - recentMulti : 0;

D'oh.

> 2) Would it be better to move all the code that sets the xid_score and
> mxid_score to under an "if (force_vacuum)"? Those two variables could
> be declared in there too.

Seems reasonable.

> 3) Could the following be refactored a bit so we only check the "relid
> != StatisticRelationId" condition once?

Yes.  We can update the vacuum part to follow the same pattern, too.

> 4) Should these be TransactionIds?
>
> + uint32 xid_age;
> + uint32 mxid_age;

Probably.

> 5) Instead of:
>
> + double score = 0.0;
>
> Is it better to zero the score inside relation_needs_vacanalyze() so
> it works the same as the other output parameters?

My only concern about this is that some compilers might complain about
potentially-uninitialized uses.  But we can still zero it in the function
regardless.

-- 
nathan
Attachments
I spent some time looking at this, and I am not sure how much it will
move the goalposts, since most of the time the bottleneck for autovacuum
is the limited number of workers and large tables that take a long time
to process.  That said, this is a good change for the simple reason that
it is better to have a well-defined prioritization strategy for
autovacuum than something that is somewhat random, as mentioned earlier.

Just a couple of comments on v5:

1/ Should we add documentation explaining this prioritization behavior in [0]?

I wrote a SQL query that returns the tables and scores, which I found
useful when I was testing this out, so having the actual rules spelled
out in the docs would be super useful.  If we don't want to go that much
in depth, at minimum the docs should say: "Autovacuum prioritizes tables
based on how far they exceed their thresholds or if they are approaching
wraparound limits." so a DBA can understand this behavior.

2/
* The score is calculated as the maximum of the ratios of each of the table's
* relevant values to its threshold.  For example, if the number of inserted
* tuples is 100, and the insert threshold for the table is 80, the insert
* score is 1.25.

Should we consider clamping down on the score when reltuples = -1?
Otherwise the scores for such tables (new tables with a large amount of
ingested data) will be over-inflated.  Perhaps, if reltuples = -1 (# of
reltuples not known), we give a score of 0.5, so we are not
over-prioritizing but also not pushing the table to the bottom?

[0] https://www.postgresql.org/docs/current/routine-vacuuming.html#AUTOVACUUM

-- 
Sami Imseih
Amazon Web Services
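For readers without the attachment: a rough stand-in for such a scoring
query, covering only the dead-tuple part of the score and assuming the
default autovacuum_vacuum_threshold (50) and autovacuum_vacuum_scale_factor
(0.2) while ignoring per-table reloptions, might look like this (not the
script referenced above):

    -- Approximate the patch's vacuum score from stats, dead-tuple part only,
    -- using the documented threshold formula with default settings.
    select s.relname,
           s.n_dead_tup::float8
               / greatest(50 + 0.2 * c.reltuples, 1) as approx_vacuum_score
    from pg_stat_user_tables s
    join pg_class c on c.oid = s.relid
    order by approx_vacuum_score desc
    limit 20;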
On Mon, Oct 27, 2025 at 12:47:15PM -0500, Sami Imseih wrote:
> 1/ Should we add documentation explaining this prioritization behavior in [0]?
>
> I wrote a SQL query that returns the tables and scores, which I found
> useful when I was testing this out, so having the actual rules spelled
> out in the docs would be super useful.

Can you elaborate on how it would be useful?  I'd be open to adding a
short note that autovacuum attempts to prioritize the tables in a smart
way, but I'm not sure I see the value of documenting every detail.  I also
don't want to add too much friction to future changes to the
prioritization logic.

> If we don't want to go that much in depth, at minimum the docs should say:
>
> "Autovacuum prioritizes tables based on how far they exceed their thresholds
> or if they are approaching wraparound limits." so a DBA can understand
> this behavior.

Yeah, I would probably choose to keep it relatively vague like this.

> * The score is calculated as the maximum of the ratios of each of the table's
> * relevant values to its threshold.  For example, if the number of inserted
> * tuples is 100, and the insert threshold for the table is 80, the insert
> * score is 1.25.
>
> Should we consider clamping down on the score when reltuples = -1?
> Otherwise the scores for such tables (new tables with a large amount of
> ingested data) will be over-inflated.  Perhaps, if reltuples = -1 (# of
> reltuples not known), we give a score of 0.5, so we are not
> over-prioritizing but also not pushing the table to the bottom?

I'm not sure it's worth expending too much energy to deal with this.  In
the worst case, the table will be given an arbitrarily high priority the
first time it is vacuumed, but AFAICT that's it.  But that's already the
case, as the thresholds will be artificially low before the first
VACUUM/ANALYZE.

-- 
nathan
>> I wrote a SQL query that returns the tables and scores, which I found
>> useful when I was testing this out, so having the actual rules spelled
>> out in the docs would be super useful.
>
> Can you elaborate on how it would be useful?  I'd be open to adding a
> short note that autovacuum attempts to prioritize the tables in a smart
> way, but I'm not sure I see the value of documenting every detail.

We discuss the threshold calculations in the documentation, and users can
write scripts to monitor which tables are eligible.  However, there is
nothing that indicates which table autovacuum will work on next (I have
been asked that question by users a few times, sometimes out of curiosity,
or because they are monitoring vacuum activity and wondering when their
important table will get a vacuum cycle, or whether they should kick off a
manual vacuum).  With the scoring system, it will be much more difficult
to explain, unless someone walks through the code.

> I also don't want to add too much friction to future changes to the
> prioritization logic.

Maybe future changes are a good reason to document the way autovacuum
prioritizes, since this is a user-facing change.

>> If we don't want to go that much in depth, at minimum the docs should say:
>>
>> "Autovacuum prioritizes tables based on how far they exceed their thresholds
>> or if they are approaching wraparound limits." so a DBA can understand
>> this behavior.
>
> Yeah, I would probably choose to keep it relatively vague like this.

With all of the above said, starting with something small is definitely
better than nothing.

>> Should we consider clamping down on the score when reltuples = -1?
>> Otherwise the scores for such tables (new tables with a large amount of
>> ingested data) will be over-inflated.  Perhaps, if reltuples = -1 (# of
>> reltuples not known), we give a score of 0.5, so we are not
>> over-prioritizing but also not pushing the table to the bottom?
>
> I'm not sure it's worth expending too much energy to deal with this.  In
> the worst case, the table will be given an arbitrarily high priority the
> first time it is vacuumed, but AFAICT that's it.  But that's already the
> case, as the thresholds will be artificially low before the first
> VACUUM/ANALYZE.

I can think of scenarios where there may be workloads that create/drop
staging tables and load some data (like batch processing) where this may
become an issue, because we are now forcing such tables to the top of the
list, potentially preventing other tables from getting vacuum cycles.  It
could happen now, but the difference with this change is that we are
forcing these tables to the top of the priority list based on an unknown
value (pg_class.reltuples = -1).

-- 
Sami Imseih
Amazon Web Services (AWS)
The patch is starting to look good. Here's a review of v5:
1. I think the following code at the bottom of
relation_needs_vacanalyze() can be deleted. You've added the check to
ensure *doanalyze never gets set to true for pg_statistic.
/* ANALYZE refuses to work with pg_statistic */
if (relid == StatisticRelationId)
    *doanalyze = false;
2. As #1, but in recheck_relation_needs_vacanalyze(), the following I
think can now be removed:
/* ignore ANALYZE for toast tables */
if (classForm->relkind == RELKIND_TOASTVALUE)
    *doanalyze = false;
3) Would you be able to include what the idea behind the * 1.05 is in the
preceding comment?
On Tue, 28 Oct 2025 at 05:06, Nathan Bossart <nathandbossart@gmail.com> wrote:
> +        effective_xid_failsafe_age = Max(vacuum_failsafe_age,
> +                                         autovacuum_freeze_max_age * 1.05);
> +        effective_mxid_failsafe_age = Max(vacuum_multixact_failsafe_age,
> +                                          autovacuum_multixact_freeze_max_age * 1.05);
I assume it's to work around some strange configuration settings, but I
don't know for sure, or why 1.05 is a good value.
4. I think it might be neater to format the following as 3 separate "if" tests:
> +        if (force_vacuum ||
> +            vactuples > vacthresh ||
> +            (vac_ins_base_thresh >= 0 && instuples > vacinsthresh))
> +        {
> +            *dovacuum = true;
> +            *score = Max(*score, (double) vactuples / Max(vacthresh, 1));
> +            if (vac_ins_base_thresh >= 0)
> +                *score = Max(*score, (double) instuples / Max(vacinsthresh, 1));
> +        }
> +        else
> +            *dovacuum = false;
i.e.:
        if (force_vacuum)
            *dovacuum = true;
        if (vactuples > vacthresh)
        {
            *dovacuum = true;
            *score = Max(*score, (double) vactuples / Max(vacthresh, 1));
        }
        if (vac_ins_base_thresh >= 0 && instuples > vacinsthresh)
        {
            *dovacuum = true;
            *score = Max(*score, (double) instuples / Max(vacinsthresh, 1));
        }
and also get rid of all the "else *dovacuum = false;" (and *dovacuum =
false) in favour of setting those to false at the top of the function.
It's just getting harder to track that those parameters are getting
set in all cases when they're meant to be.
Doing that also gets rid of the duplicate "if (vac_ins_base_thresh >= 0)"
check and saves doing the score calculation when its inputs don't make
sense.  The current code relies on Max always picking the current *score
when the threshold isn't met.
David
			
On Tue, 28 Oct 2025 at 11:35, Sami Imseih <samimseih@gmail.com> wrote:
> We discuss the threshold calculations in the documentation, and users can
> write scripts to monitor which tables are eligible.  However, there is
> nothing that indicates which table autovacuum will work on next (I have
> been asked that question by users a few times, sometimes out of curiosity,
> or because they are monitoring vacuum activity and wondering when their
> important table will get a vacuum cycle, or whether they should kick off a
> manual vacuum).  With the scoring system, it will be much more difficult
> to explain, unless someone walks through the code.

I think it's reasonable to want to document how autovacuum prioritises
tables, but maybe not in too much detail.  Longer term, I think it would
be good to have a pg_catalog view for this which showed the relid or
schema/relname, and the output values of relation_needs_vacanalyze().  If
we had that and we documented that autovacuum workers work from that list,
but that they may have an older snapshot of it, then that might help make
the score easier to document.  It would also allow people to question the
scores, as I expect at least some people might not agree with the
priorities.  That would allow us to consider tuning the score calculation
if someone points out a deficiency with the current calculation.

Also, longer-term, it doesn't seem that unreasonable that the autovacuum
worker might want to refresh the tables_to_process list once it finishes a
table and autovacuum_naptime * $value units of time have passed since it
was last checked.  That would allow the worker to react accordingly when
scores have changed significantly since it last checked.  I mean, it might
be days between when autovacuum calculates the scores and when it finally
vacuums a table, if the list is long or it was tied up with large tables.
Other workers may have gotten to some of the tables too, so the score may
have dropped, then made its way above the threshold again, but to a lesser
extent.

David
On Tue, Oct 28, 2025 at 11:47:08AM +1300, David Rowley wrote:
> 1. I think the following code at the bottom of
> relation_needs_vacanalyze() can be deleted. You've added the check to
> ensure *doanalyze never gets set to true for pg_statistic.
> 
> /* ANALYZE refuses to work with pg_statistic */
> if (relid == StatisticRelationId)
>     *doanalyze = false;
> 
> 2. As #1, but in recheck_relation_needs_vacanalyze(), the following I
> think can now be removed:
> 
> /* ignore ANALYZE for toast tables */
> if (classForm->relkind == RELKIND_TOASTVALUE)
>     *doanalyze = false;
Removed.
> 3. Would you be able to include what the idea behind the * 1.05 in the
> preceding comment?
> 
> On Tue, 28 Oct 2025 at 05:06, Nathan Bossart <nathandbossart@gmail.com> wrote:
>> +        effective_xid_failsafe_age = Max(vacuum_failsafe_age,
>> +                                         autovacuum_freeze_max_age * 1.05);
>> +        effective_mxid_failsafe_age = Max(vacuum_multixact_failsafe_age,
>> +                                          autovacuum_multixact_freeze_max_age * 1.05);
> 
> I assume it's to workaround some strange configuration settings, but
> don't know for sure, or why 1.05 is a good value.
This is lifted from vacuum_xid_failsafe_check().  As noted in the docs, the
failsafe settings are silently limited to 105% of *_freeze_max_age.  I
expanded on this in the comment atop these lines.
> 4. I think it might be neater to format the following as 3 separate "if" tests:
> 
>> +        if (force_vacuum ||
>> +            vactuples > vacthresh ||
>> +            (vac_ins_base_thresh >= 0 && instuples > vacinsthresh))
>> +        {
>> +            *dovacuum = true;
>> +            *score = Max(*score, (double) vactuples / Max(vacthresh, 1));
>> +            if (vac_ins_base_thresh >= 0)
>> +                *score = Max(*score, (double) instuples / Max(vacinsthresh, 1));
>> +        }
>> +        else
>> +            *dovacuum = false;
> 
> i.e:
> 
>         if (force_vacuum)
>             *dovacuum = true;
> 
>         if (vactuples > vacthresh)
>         {
>             *dovacuum = true;
>             *score = Max(*score, (double) vactuples / Max(vacthresh, 1));
>         }
> 
>         if (vac_ins_base_thresh >= 0 && instuples > vacinsthresh)
>         {
>             *dovacuum = true;
>             *score = Max(*score, (double) instuples / Max(vacinsthresh, 1));
>         }
> 
> and also get rid of all the "else *dovacuum = false;" (and *dovacuum =
> false) in favour of setting those to false at the top of the function.
> It's just getting harder to track that those parameters are getting
> set in all cases when they're meant to be.
> 
> doing that also gets rid of the duplicative "if (vac_ins_base_thresh
> >= 0)" check and also saves doing the score calc when the inputs to it
> don't make sense. The current code is relying on Max always picking
> the current *score when the threshold isn't met.
Done.
-- 
nathan
			
Attachments
On Tue, Oct 28, 2025 at 12:16:28PM +1300, David Rowley wrote:
> I think it's reasonable to want to document how autovacuum prioritises
> tables, but maybe not in too much detail.  Longer term, I think it would
> be good to have a pg_catalog view for this which showed the relid or
> schema/relname, and the output values of relation_needs_vacanalyze().  If
> we had that and we documented that autovacuum workers work from that list,
> but that they may have an older snapshot of it, then that might help make
> the score easier to document.  It would also allow people to question the
> scores, as I expect at least some people might not agree with the
> priorities.  That would allow us to consider tuning the score calculation
> if someone points out a deficiency with the current calculation.
>
> Also, longer-term, it doesn't seem that unreasonable that the autovacuum
> worker might want to refresh the tables_to_process list once it finishes a
> table and autovacuum_naptime * $value units of time have passed since it
> was last checked.  That would allow the worker to react accordingly when
> scores have changed significantly since it last checked.  I mean, it might
> be days between when autovacuum calculates the scores and when it finally
> vacuums a table, if the list is long or it was tied up with large tables.
> Other workers may have gotten to some of the tables too, so the score may
> have dropped, then made its way above the threshold again, but to a lesser
> extent.

Agreed on both points.

-- 
nathan
> Done.

My compiler is complaining about v6:

../src/backend/postmaster/autovacuum.c:3293:32: warning: operation on
‘*score’ may be undefined [-Wsequence-point]
 3293 |                 *score = *score = Max(*score, (double) instuples / Max(vacinsthresh, 1));
[2/2] Linking target src/backend/postgres

Shouldn't it just be like below?

*score = Max(*score, (double) instuples / Max(vacinsthresh, 1));

-- 
Sami
Hi Nathan Bossart,
> + if (vactuples > vacthresh)
> + {
> + *dovacuum = true;
> + *score = Max(*score, (double) vactuples / Max(vacthresh, 1));
> + }
> +
> + if (vac_ins_base_thresh >= 0 && instuples > vacinsthresh)
> + {
> + *dovacuum = true;
> + *score = *score = Max(*score, (double) instuples / Max(vacinsthresh, 1));
> + }
I believe this (*score = *score = Max(*score, (double) instuples /
Max(vacinsthresh, 1));) must be a slip of the hand on your part, copying
an extra assignment.
I also suggest adding a debug log for the score:

    ereport(DEBUG2,
            (errmsg("autovacuum candidate: %s (score=%.3f)",
                    get_rel_name(table->oid), table->score)));
(errmsg("autovacuum candidate: %s (score=%.3f)",
get_rel_name(table->oid), table->score)));
> + effective_xid_failsafe_age = Max(vacuum_failsafe_age,
> + autovacuum_freeze_max_age * 1.05);
Typically, DBAs avoid setting autovacuum_freeze_max_age anywhere close to
vacuum_failsafe_age, so your logic will almost always end up using the
vacuum_failsafe_age value.
Would taking the average of the two be a better approach?
root@localhost:/data/pgsql/pg18data# grep vacuum_failsafe_age postgresql.conf
#vacuum_failsafe_age = 1600000000
root@localhost:/data/pgsql/pg18data# grep autovacuum_freeze_max_age postgresql.conf
#autovacuum_freeze_max_age = 200000000 # maximum XID age before forced vacuum
Thanks 
On Wed, Oct 29, 2025 at 6:45 AM Sami Imseih <samimseih@gmail.com> wrote:
> Done.
My compiler is complaining about v6
"../src/backend/postmaster/autovacuum.c:3293:32: warning: operation on
‘*score’ may be undefined [-Wsequence-point]
3293 | *score = *score = Max(*score, (double)
instuples / Max(vacinsthresh, 1));
[2/2] Linking target src/backend/postgres"
Shouldn't it just be like below?
*score = Max(*score, (double) instuples / Max(vacinsthresh, 1));
--
Sami
> On Tue, Oct 28, 2025 at 12:16:28PM +1300, David Rowley wrote:
>> I think it's reasonable to want to document how autovacuum prioritises
>> tables, but maybe not in too much detail.  Longer term, I think it would
>> be good to have a pg_catalog view for this which showed the relid or
>> schema/relname, and the output values of relation_needs_vacanalyze().  If
>> we had that and we documented that autovacuum workers work from that list,
>> but that they may have an older snapshot of it, then that might help make
>> the score easier to document.  It would also allow people to question the
>> scores, as I expect at least some people might not agree with the
>> priorities.  That would allow us to consider tuning the score calculation
>> if someone points out a deficiency with the current calculation.
>>
>> Also, longer-term, it doesn't seem that unreasonable that the autovacuum
>> worker might want to refresh the tables_to_process list once it finishes a
>> table and autovacuum_naptime * $value units of time have passed since it
>> was last checked.  That would allow the worker to react accordingly when
>> scores have changed significantly since it last checked.  I mean, it might
>> be days between when autovacuum calculates the scores and when it finally
>> vacuums a table, if the list is long or it was tied up with large tables.
>> Other workers may have gotten to some of the tables too, so the score may
>> have dropped, then made its way above the threshold again, but to a lesser
>> extent.
>
> Agreed on both points.

I think we do need some documentation about this behavior, which v6 is
still missing.

Another thing I have been contemplating is the change in prioritization,
and what the resulting difference in the order in which tables are
vacuumed means for workloads where autovacuum tuning was done under the
current assumptions and will no longer be beneficial.

Let's imagine staging tables that get created and dropped during some
batch processing window and that see huge data ingestion/changes.  The
current scan will make these less of a priority naturally in relation to
other permanent tables, but with the new priority, we are making these
staging tables more of a priority.  Users may now need to turn off
autovacuum on a per-table level to prevent this scenario.  That is just
one example.

What I am also trying to say is: should we provide a way, I hate to say a
GUC, for users to go back to the old behavior?  Or am I overstating the
risk here?

-- 
Sami Imseih
Amazon Web Services (AWS)
On Tue, Oct 28, 2025 at 05:44:37PM -0500, Sami Imseih wrote:
> My compiler is complaining about v6:
>
> ../src/backend/postmaster/autovacuum.c:3293:32: warning: operation on
> ‘*score’ may be undefined [-Wsequence-point]
>  3293 |                 *score = *score = Max(*score, (double)
> instuples / Max(vacinsthresh, 1));
> [2/2] Linking target src/backend/postgres
>
> Shouldn't it just be like below?
>
> *score = Max(*score, (double) instuples / Max(vacinsthresh, 1));

Oops.  I fixed that typo in v7.

-- 
nathan
Attachments
On Wed, Oct 29, 2025 at 11:10:55AM +0800, wenhui qiu wrote:
> Typically, DBAs avoid setting autovacuum_freeze_max_age anywhere close to
> vacuum_failsafe_age, so your logic will almost always end up using the
> vacuum_failsafe_age value.
> Would taking the average of the two be a better approach?

That approach would begin aggressively scaling the priority of tables
sooner, but I don't know if that's strictly better.  In any case, I'd like
to avoid making the score calculation too magical.

-- 
nathan
On Wed, Oct 29, 2025 at 10:24:17AM -0500, Sami Imseih wrote:
> I think we do need some documentation about this behavior, which v6 is
> still missing.

Would you be interested in giving that part a try?

> Another thing I have been contemplating is the change in prioritization,
> and what the resulting difference in the order in which tables are
> vacuumed means for workloads where autovacuum tuning was done under the
> current assumptions and will no longer be beneficial.
>
> Let's imagine staging tables that get created and dropped during some
> batch processing window and that see huge data ingestion/changes.  The
> current scan will make these less of a priority naturally in relation to
> other permanent tables, but with the new priority, we are making these
> staging tables more of a priority.  Users may now need to turn off
> autovacuum on a per-table level to prevent this scenario.  That is just
> one example.
>
> What I am also trying to say is: should we provide a way, I hate to say a
> GUC, for users to go back to the old behavior?  Or am I overstating the
> risk here?

It's probably worth testing out this scenario, but I can't say I'm
terribly worried.  Those kinds of tables are already getting chosen by
autovacuum earlier due to reltuples == -1, and this patch will just move
them to the front of the list that autovacuum creates.  In any case, I'd
really like to avoid a GUC or fallback switch here.

-- 
nathan
Hi Nathan,
> That approach would begin aggressively scaling the priority of tables
> sooner, but I don't know if that's strictly better. In any case, I'd like
> to avoid making the score calculation too magical.
In fact, with the introduction of the vacuum_max_eager_freeze_failure_rate
feature, if a table’s age still exceeds autovacuum_freeze_max_age by more
than 1.x times, it suggests that the vacuum freeze process is not
functioning properly.  Once the age surpasses vacuum_failsafe_age,
wraparound issues are likely to occur soon.  Taking the average of
vacuum_failsafe_age and autovacuum_freeze_max_age is not a complex
approach.  Under the default configuration, this average already exceeds
four times autovacuum_freeze_max_age.  At that stage, a DBA should have
already intervened to investigate and resolve why the table age is not
decreasing.
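With the default settings, the difference between the two formulas is easy
to see (this is just arithmetic on the defaults, not output from the
patch):

    -- Defaults: vacuum_failsafe_age = 1.6B, autovacuum_freeze_max_age = 200M.
    -- The patch's Max() clamp versus the proposed midpoint.
    select greatest(1600000000, 200000000 * 1.05) as patch_effective_age,
           (1600000000 + 200000000) / 2           as midpoint_effective_age;

So with the defaults, the aggressive scaling would start at 900M under the
averaging proposal rather than at 1.6B.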
Thanks 
On Thu, Oct 30, 2025 at 12:07 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
On Wed, Oct 29, 2025 at 10:24:17AM -0500, Sami Imseih wrote:
> I think we do need some documentation about this behavior, which v6 is
> still missing.
Would you be interested in giving that part a try?
> Another thing I have been contemplating is the change in prioritization,
> and what the resulting difference in the order in which tables are
> vacuumed means for workloads where autovacuum tuning was done under the
> current assumptions and will no longer be beneficial.
>
> Let's imagine staging tables that get created and dropped during
> some batch processing window and that see huge data
> ingestion/changes.  The current scan will make these less of a priority
> naturally in relation to other permanent tables, but with the new priority,
> we are making these staging tables more of a priority.  Users may now
> need to turn off autovacuum on a per-table level to prevent this
> scenario.  That is just one example.
>
> What I am also trying to say is: should we provide a way, I hate
> to say a GUC, for users to go back to the old behavior?  Or am I
> overstating the risk here?
It's probably worth testing out this scenario, but I can't say I'm terribly
worried. Those kinds of tables are already getting chosen by autovacuum
earlier due to reltuples == -1, and this patch will just move them to the
front of the list that autovacuum creates. In any case, I'd really like to
avoid a GUC or fallback switch here.
--
nathan
On Thu, 30 Oct 2025 at 15:58, wenhui qiu <qiuwenhuifx@gmail.com> wrote:
> In fact, with the introduction of the vacuum_max_eager_freeze_failure_rate
> feature, if a table’s age still exceeds autovacuum_freeze_max_age by more
> than 1.x times, it suggests that the vacuum freeze process is not
> functioning properly.  Once the age surpasses vacuum_failsafe_age,
> wraparound issues are likely to occur soon.  Taking the average of
> vacuum_failsafe_age and autovacuum_freeze_max_age is not a complex
> approach.  Under the default configuration, this average already exceeds
> four times autovacuum_freeze_max_age.  At that stage, a DBA should have
> already intervened to investigate and resolve why the table age is not
> decreasing.

I don't think anyone would like to modify PostgreSQL in any way that
increases the chances that a table gets as old as vacuum_failsafe_age.
Regardless of the order in which tables are vacuumed, if a table gets that
old, then vacuum is configured to run too slowly, or there are not enough
workers configured to cope with the given amount of work.  I think we need
to tackle prioritisation and rate limiting as two separate items.  Nathan
is proposing to improve the prioritisation in this thread, and it seems to
me that your concerns are with rate limiting.  I've suggested an idea that
might help with reducing the cost_delay based on the score of the table in
this thread.  I'd rather not introduce that as a topic for further
discussion here (I imagine Nathan agrees).  It's not as if the server is
going to consume 1 billion xids in 5 mins.  It's at least going to take a
day or several days for that to happen, and if autovacuum has not managed
to get on top of the workload in that time, then it's configured to run
too slowly and the cost_limit or delay needs to be adjusted.

My concern is that there are countless problems with autovacuum, and if
you try to lump them all into a single thread to fix them all at once,
we'll get nowhere.  Autovacuum was added to core in 8.1, 20 years ago, and
I don't believe we've done anything to change the rate limiting aside from
reducing the default cost_delay since then.  It'd be good to fix that at
some point, just not here, please.

FWIW, I agree with Nathan about keeping the score calculation non-magical.
The score should be simple and easy to document.  We can introduce
complexity to it as and when it's needed and when the supporting evidence
arrives, rather than from people waving their hands.

David
Hi,
     I think there might be some misunderstanding — I’m only suggesting changing
effective_xid_failsafe_age = Max(vacuum_failsafe_age,
autovacuum_freeze_max_age * 1.05);
to
effective_xid_failsafe_age = (vacuum_failsafe_age + autovacuum_freeze_max_age) / 2.0;
In the current logic, effective_xid_failsafe_age is almost always equal to vacuum_failsafe_age.
As a result, increasing the vacuum priority only when a table’s age reaches vacuum_failsafe_age is too late.
Thanks
On Thu, Oct 30, 2025 at 11:42 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Thu, 30 Oct 2025 at 15:58, wenhui qiu <qiuwenhuifx@gmail.com> wrote:
> In fact, with the introduction of the vacuum_max_eager_freeze_failure_rate
> feature, if a table’s age still exceeds autovacuum_freeze_max_age by more
> than 1.x times, it suggests that the vacuum freeze process is not
> functioning properly.  Once the age surpasses vacuum_failsafe_age,
> wraparound issues are likely to occur soon.  Taking the average of
> vacuum_failsafe_age and autovacuum_freeze_max_age is not a complex
> approach.  Under the default configuration, this average already exceeds
> four times autovacuum_freeze_max_age.  At that stage, a DBA should have
> already intervened to investigate and resolve why the table age is not
> decreasing.
I don't think anyone would like to modify PostgreSQL in any way that
increases the chances that a table gets as old as vacuum_failsafe_age.
Regardless of the order in which tables are vacuumed, if a table gets
as old as that then vacuum is configured to run too slowly, or there
are not enough workers configured to cope with the given amount of
work. I think we need to tackle prioritisation and rate limiting as
two separate items. Nathan is proposing to improve the prioritisation
in this thread and it seems to me that your concerns are with rate
limiting. I've suggested an idea that might help with reducing the
cost_delay based on the score of the table in this thread. I'd rather
not introduce that as a topic for further discussion here (I imagine
Nathan agrees). It's not as if the server is going to consume 1
billion xids in 5 mins. It's at least going to take a day or several days
for that to happen, and if autovacuum has not managed to get on
top of the workload in that time, then it's configured to run too
slowly and the cost_limit or delay needs to be adjusted.
My concern is that there are countless problems with autovacuum and if
you try and lump them all into a single thread to fix them all at
once, we'll get nowhere. Autovacuum was added to core in 8.1, 20 years
ago, and I don't believe we've done anything to change the rate limiting
aside from reducing the default cost_delay since then. It'd be good to
fix that at some point, just not here, please.
FWIW, I agree with Nathan about keeping the score calculation
non-magical. The score should be simple and easy to document. We can
introduce complexity to it as and when it's needed and when the
supporting evidence arrives, rather than from people waving their
hands.
David
On Thu, 30 Oct 2025 at 19:48, wenhui qiu <qiuwenhuifx@gmail.com> wrote:
> I think there might be some misunderstanding — I’m only suggesting changing
> effective_xid_failsafe_age = Max(vacuum_failsafe_age,
>                                  autovacuum_freeze_max_age * 1.05);
> to
> effective_xid_failsafe_age = (vacuum_failsafe_age + autovacuum_freeze_max_age) / 2.0;
> In the current logic, effective_xid_failsafe_age is almost always equal to vacuum_failsafe_age.
> As a result, increasing the vacuum priority only when a table’s age reaches vacuum_failsafe_age is too late.

I understand your proposal.  Autovacuum triggers a wraparound vacuum at
autovacuum_freeze_max_age, so if autovacuum still has not gotten to the
table by the time it is aged at vacuum_failsafe_age, autovacuum isn't
working quickly enough to get through the workload; the problem is the
speed of autovacuum, not the priority of autovacuum.

David
On Wed, Oct 29, 2025 at 11:51 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> Oops.  I fixed that typo in v7.

Are you planning to do some practical experimentation with this?  I feel
like it would be a good idea to set up some kind of test case where this
is expected to provide a benefit and see if it actually does; and also
maybe set up a test case where it will reorder the tables but no practical
difference in the outcome is expected, and verify that, in fact, nothing
changes.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Oct 30, 2025 at 04:05:19PM -0400, Robert Haas wrote:
> Are you planning to do some practical experimentation with this?  I feel
> like it would be a good idea to set up some kind of test case where this
> is expected to provide a benefit and see if it actually does; and also
> maybe set up a test case where it will reorder the tables but no
> practical difference in the outcome is expected, and verify that, in
> fact, nothing changes.

Yes.  I've been thinking through how I want to test this but have yet to
actually do so.  If you have ideas, I'm all ears.

-- 
nathan
> On Thu, Oct 30, 2025 at 04:05:19PM -0400, Robert Haas wrote:
>> Are you planning to do some practical experimentation with this?  I feel
>> like it would be a good idea to set up some kind of test case where this
>> is expected to provide a benefit and see if it actually does; and also
>> maybe set up a test case where it will reorder the tables but no
>> practical difference in the outcome is expected, and verify that, in
>> fact, nothing changes.
>
> Yes.  I've been thinking through how I want to test this but have yet to
> actually do so.  If you have ideas, I'm all ears.

FWIW, I've been putting some scripts together to test some workloads, and
I will share what I have shortly.

-- 
Sami Imseih
Amazon Web Services (AWS)
> FWIW, I've been putting some scripts together to test some workloads, and
> I will share what I have shortly.

Here is my attempt to test the behavior with the new prioritization.  I
wanted a way to run the same tests with different workloads, both with and
without the prioritization patch, and to see if anything stands out as
suspicious in terms of autovacuum or autoanalyze activity.  For example,
certain tables showing too little or too much autovacuum activity.

The scripts I put together (attached) run a busy update workload (OLTP)
and a separate batch workload.  They use pgbench to execute custom scripts
that are generated on the fly.  The results are summarized by the average
number of autovacuum and autoanalyze runs *per table*, along with some
other DML activity stats to ensure that the workloads being compared have
similar DML activity.

Using the scripts: place the attached scripts in a specific directory, and
modify the section under "Caller should adjust these values" in
run_workloads.sh to adjust the workload.  The scripts assume you have a
running cluster with your specific config file adjusted for the test.
Once ready, call run_workloads.sh, and at the end a summary will show up
as you see below.  Hopefully it works for you :)  The summary.sh script
can also be run while the workloads are executing.

Here is an example of a test I wanted to run based on the discussion [0].
This scenario is one that was mentioned, but there are others in which a
batch process performing inserts only is prioritized over the update
workload.  I ran this test for 10 minutes, using 200 clients for the
update workload and 5 clients for the batch workload, with the following
configuration:

```
max_connections = 1000
autovacuum_naptime = '10s'
shared_buffers = '4GB'
autovacuum_max_workers = 6
```

-- HEAD
```
Total Activity
-[ RECORD 1 ]-------------+----------
total_n_dead_tup          | 985183
total_n_mod_since_analyze | 220294866
total_reltuples           | 247690373
total_autovacuum_count    | 137
total_autoanalyze_count   | 470
total_n_tup_upd           | 7720012
total_n_tup_ins           | 446683000
table_count               | 105

Activity By Workload Type
-[ RECORD 1 ]-----------------+----------------
table_group                   | batch_tables
** avg_autovacuum_count       | 7.400
** avg_autoanalyze_count      | 8.000
avg_vacuum_count              | 0.000
avg_analyze_count             | 0.000
rows_inserted                 | 436683000
rows_updated                  | 0
rows_hot_updated              | 0
table_count                   | 5
-[ RECORD 2 ]-----------------+----------------
table_group                   | numbered_tables
** avg_autovacuum_count       | 1.000
** avg_autoanalyze_count      | 4.300
avg_vacuum_count              | 1.000
avg_analyze_count             | 0.000
rows_inserted                 | 10000000
rows_updated                  | 7720012
rows_hot_updated              | 7094573
table_count                   | 100
```

-- with v7 applied
```
Total Activity
-[ RECORD 1 ]-------------+----------
total_n_dead_tup          | 1233045
total_n_mod_since_analyze | 137843507
total_reltuples           | 350704437
total_autovacuum_count    | 146
total_autoanalyze_count   | 605
total_n_tup_upd           | 7896354
total_n_tup_ins           | 487974000
table_count               | 105

Activity By Workload Type
-[ RECORD 1 ]-----------------+----------------
table_group                   | batch_tables
** avg_autovacuum_count       | 11.000
** avg_autoanalyze_count      | 13.200
avg_vacuum_count              | 0.000
avg_analyze_count             | 0.000
rows_inserted                 | 477974000
rows_updated                  | 0
rows_hot_updated              | 0
table_count                   | 5
-[ RECORD 2 ]-----------------+----------------
table_group                   | numbered_tables
** avg_autovacuum_count       | 0.910
** avg_autoanalyze_count      | 5.390
avg_vacuum_count              | 1.000
avg_analyze_count             | 0.000
rows_inserted                 | 10000000
rows_updated                  | 7896354
rows_hot_updated              | 7123134
table_count                   | 100
```

The results above show what I expected: the batch tables receive higher
priority, as seen from the averages of autovacuum and autoanalyze runs.
This behavior is expected, but it may catch some users by surprise after
an upgrade, since certain tables will now receive more attention than
others.  Longer tests might also show more bloat accumulating on heavily
updated tables.  In such cases, a user may need to adjust autovacuum
settings on a per-table basis to restore the previous behavior.

So, I am not quite sure what the best way to test is, except for trying to
find these non-steady-state workloads and seeing the impact of the
prioritization change on (auto)vacuum/analyze activity.  Maybe there is a
better way?

[0] https://www.postgresql.org/message-id/aQI7tGEs8IOPxG64%40nathan

-- 
Sami Imseih
Amazon Web Services (AWS)
Attachments
On Thu, Oct 30, 2025 at 07:38:15PM -0500, Sami Imseih wrote:
> Here is my attempt to test the behavior with the new prioritization.

Thanks.

> The results above show what I expected: the batch tables receive higher
> priority, as seen from the averages of autovacuum and autoanalyze runs.
> This behavior is expected, but it may catch some users by surprise after
> an upgrade, since certain tables will now receive more attention than
> others.  Longer tests might also show more bloat accumulating on heavily
> updated tables.  In such cases, a user may need to adjust autovacuum
> settings on a per-table basis to restore the previous behavior.

Interesting.  From these results, it almost sounds as if we're further
amplifying the intended effect of commit 06eae9e.  That could be a good
thing.  Something else I'm curious about is datfrozenxid, i.e., whether
prioritization keeps the database (M)XID ages lower.

-- 
nathan
On Sat, 1 Nov 2025 at 09:12, Nathan Bossart <nathandbossart@gmail.com> wrote:
> On Thu, Oct 30, 2025 at 07:38:15PM -0500, Sami Imseih wrote:
>> The results above show what I expected: the batch tables receive higher
>> priority, as seen from the averages of autovacuum and autoanalyze runs.
>> This behavior is expected, but it may catch some users by surprise after
>> an upgrade, since certain tables will now receive more attention than
>> others.  Longer tests might also show more bloat accumulating on heavily
>> updated tables.  In such cases, a user may need to adjust autovacuum
>> settings on a per-table basis to restore the previous behavior.
>
> Interesting.  From these results, it almost sounds as if we're further
> amplifying the intended effect of commit 06eae9e.  That could be a good
> thing.  Something else I'm curious about is datfrozenxid, i.e., whether
> prioritization keeps the database (M)XID ages lower.

I wonder if it would be more realistic to throttle the work simulation to
a certain speed with pgbench -R rather than having it go flat out.  The
results show quite a bit higher "rows_inserted" for the batch_tables with
the patched version.  Sami didn't mention any changes to
vacuum_cost_limit, so I suspect that autovacuum would be getting quite far
behind on this run, which isn't ideal.  Rate limiting to something that
the given vacuum_cost_limit could keep up with seems more realistic.  The
fact that the patched version did more insert work in the batch tables
does seem a bit unfair, as that gave autovacuum more work to do in the
patched test run, which would result in the lower-scoring tables being
neglected more in the patched version.

This makes me wonder if we should log the score of the table when the
autovacuum starts for the table.  We do calculate the score again in
recheck_relation_needs_vacanalyze() just before doing the vacuum/analyze,
so maybe the score can be stored in the autovac_table struct and displayed
somewhere.  Maybe logging it along with log_autovacuum_min_duration /
log_autoanalyze_min_duration would be useful.  It might be good in there
for DBA analysis, to give some visibility into how bad things got before
autovacuum got around to working on a given table.

If we logged the score, we could do the "unpatched" test with the patched
code, just with commenting out the list_sort(tables_to_process,
TableToProcessComparator);  It'd then be interesting to zero the
log_auto*_min_duration settings and review the order differences and how
high the scores got.  Would the average score be higher or lower with the
patched version?  I'd guess lower since the higher scoring tables would
tend to get vacuumed later with the unpatched version and their score
would be even higher by the time autovacuum got to them.  I think if the
average score has gone down at the point that the vacuum starts, then
that's a very good thing.  Maybe we'd need to write a patch to recalculate
the "tables_to_process" List after a table is vacuumed and
autovacuum_naptime has elapsed for us to see this, else the priorities
might have become too outdated.  I'd expect that to be even more true when
vacuum_cost_limit is configured too low.

David
On Sat, 1 Nov 2025 at 14:50, David Rowley <dgrowleyml@gmail.com> wrote:
> If we logged the score, we could do the "unpatched" test with the
> patched code, just with commenting out the
> list_sort(tables_to_process, TableToProcessComparator); It'd then be
> interesting to zero the log_auto*_min_duration settings and review the
> order differences and how high the scores got. Would the average score
> be higher or lower with patched version? I'd guess lower since the
> higher scoring tables would tend to get vacuumed later with the
> unpatched version and their score would be even higher by the time
> autovacuum got to them. I think if the average score has gone down at
> the point that the vacuum starts, then that's a very good thing. Maybe
> we'd need to write a patch to recalculate the "tables_to_process" List
> after a table is vacuumed and autovacuum_naptime has elapsed for us to
> see this, else the priorities might have become too outdated. I'd
> expect that to be even more true when vacuum_cost_limit is configured
> too low.
I'm not yet sure how meaningful it is, but I tried adding the
following to recheck_relation_needs_vacanalyze():
elog(LOG, "Performing autovacuum of table \"%s\" with score = %f",
get_rel_name(relid), score);
then after grepping the logs and loading the data into a table and performing:
select case patched when true then 'v7' else 'master' end as patched,
       case when left(tab, 11) = 'table_batch' then 'table_batch_*'
            when left(tab, 6) = 'table_' then 'table_*'
            else 'other' end,
       avg(score) as avg_score,
       count(*) as count
from autovac
where score > 0 and score < 2000
group by rollup(1, 2)
order by 2, 1;
with vacuum_cost_limit = 5000, I got:
 patched |     case      |     avg_score      | count
---------+---------------+--------------------+-------
 master  | other         |  2.004997014705882 |    68
 v7      | other         | 1.9668087323943668 |    71
 master  | table_*       |  1.196698981375357 |  1396
 v7      | table_*       | 1.2134741693430646 |  1370
 master  | table_batch_* | 2.1887380086206902 |   116
 v7      | table_batch_* | 1.8882025693430664 |   137
 master  |               | 1.3043197367088595 |  1580
 v7      |               | 1.3059485323193893 |  1578
         |               | 1.3051336187460454 |  3158
It would still be good to do the rate limiting as there's more work
being done in the patched version. Seems to be about 1.1% more rows in
batch_tables and 0.48% more updates in the numbered_tables in the
patched version.
David