Discussion: vacuum_multixact_failsafe_age doesn't account for MultiXact member exhaustion
vacuum_multixact_failsafe_age doesn't account for MultiXact member exhaustion
From
Peter Geoghegan
Date:
Some of you might have already seen the recent postmortem from Metronome, which involves an outage caused by MultiXact Member exhaustion:

https://metronome.com/blog/root-cause-analysis-postgresql-multixact-member-exhaustion-incidents-may-2025

This reminded me of the fact that vacuum_multixact_failsafe_age has no concern for member space, except to the extent that mxid_age(relminmxid) is a proxy for how much member space remains from which the system can still allocate new Multis. We know from experience that age isn't a particularly good proxy for member space; that's why we taught VACUUM to determine whether or not to aggressively VACUUM on the basis of "effective_multixact_freeze_max_age", by calling MultiXactMemberFreezeThreshold().

ISTM that vacuum_xid_failsafe_check() should really be doing something similar. For example, it could prorate using vacuum_multixact_failsafe_age, calculating a member-space-wise threshold to trigger the failsafe at, independent of mxid_age(relminmxid) itself.

Separately, I don't think that it would be too hard to make VACUUM "inexpensively freeze" xmax MultiXacts in more cases where the system-wide OldestMxact is held back -- particularly when OldestXmin isn't held back by nearly as much. At one point I wrote a very simple patch that seems worth considering in this context:

https://postgr.es/m/CAH2-Wzn846=BKnYxC2v_8EC=4S+v9kMPjzsjq03j0wSWMzz9vg@mail.gmail.com

This would perhaps make it more reasonable to trigger the failsafe for member space exhaustion more proactively. I worry about the potential impact from VACUUM allocating new Multis, to freeze aggressively, in order to be able to advance relminmxid. FreezeMultiXactId() could likely stand to be more clever about avoiding allocating new multis, likely by being looser about applying cutoffs like FreezeLimit.

Basically, I suspect that we should be even more aggressive about freezing xmax when it doesn't necessitate allocating a Multi (per that old abandoned patch), while at the same time being less aggressive when we see that allocating a new multi is/might be required.

--
Peter Geoghegan
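
To make the prorating idea above concrete, here is a minimal sketch in Python. It is not PostgreSQL source and not a proposed patch. It assumes that MultiXactId age can span roughly 2^31 values, that member space holds 2^32 members (32-bit offsets), and that the member-space threshold is simply the same fraction of member space that vacuum_multixact_failsafe_age is of the age range. The function names and the `members_in_use` input are hypothetical.

```python
# Illustrative only: one possible reading of "prorate using
# vacuum_multixact_failsafe_age". All names here are hypothetical.

MAX_MULTIXACT_AGE = 2**31   # assumption: usable MultiXactId age range
MAX_MEMBER_SPACE = 2**32    # assumption: 32-bit member offsets

DEFAULT_FAILSAFE_AGE = 1_600_000_000  # default vacuum_multixact_failsafe_age


def member_failsafe_threshold(failsafe_age: int) -> int:
    """Scale the age-based failsafe setting onto member space.

    The threshold is the same fraction of member space that
    failsafe_age is of the MultiXactId age range.
    """
    return MAX_MEMBER_SPACE * failsafe_age // MAX_MULTIXACT_AGE


def should_trigger_failsafe(mxid_age: int,
                            members_in_use: int,
                            failsafe_age: int = DEFAULT_FAILSAFE_AGE) -> bool:
    """Trigger on the existing age test OR on member-space usage."""
    if mxid_age > failsafe_age:
        return True  # existing behaviour: relminmxid is too old
    # Hypothetical extra test, independent of mxid_age(relminmxid)
    return members_in_use > member_failsafe_threshold(failsafe_age)
```

Under these assumptions the default setting maps to a threshold of 3.2 billion members, so a table with a young relminmxid would still trigger the failsafe once member usage passes that point.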
Re: vacuum_multixact_failsafe_age doesn't account for MultiXact member exhaustion
From
Sami Imseih
Date:
> ISTM that vacuum_xid_failsafe_check() should really be doing something
> similar. For example, it could prorate using
> vacuum_multixact_failsafe_age, calculating a member-space-wise
> threshold to trigger the failsafe at, independent of
> mxid_age(relminmxid) itself.

+1. In the case described in the postmortem, running vacuum without index cleanup did help. So triggering the failsafe based on a member-space-wise threshold sounds like a good idea to me.

I also think exposing the members count [0] would be a good idea. One of the complaints in the postmortem is the lack of visibility into multixact members.

[0] https://www.postgresql.org/message-id/flat/CALdSSPi3Gh08NtcCn44uVeUAYGOT74sU6uei_06qUTa5rMK43g%40mail.gmail.com#bfd9ae766ef42f7599258183aa8ddb3b

--
Sami Imseih
Amazon Web Services (AWS)
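
As an illustration of what a "members count" could report, here is a minimal Python sketch of the arithmetic. It assumes 32-bit member offsets that wrap around, and it takes the oldest and next member offsets as plain inputs; how those values would be obtained or exposed is exactly what the thread in [0] discusses and is not modelled here. The function names are hypothetical.

```python
# Illustrative only: member-space usage from two offsets, assuming
# 32-bit offsets that wrap around. Names are hypothetical.

MEMBER_SPACE = 2**32  # assumption: 32-bit member offsets


def members_in_use(oldest_offset: int, next_offset: int) -> int:
    """Members between the oldest retained offset and the next one
    to be allocated, allowing for wraparound of the offset counter."""
    return (next_offset - oldest_offset) % MEMBER_SPACE


def member_space_used_fraction(oldest_offset: int, next_offset: int) -> float:
    """Fraction of total member space currently consumed."""
    return members_in_use(oldest_offset, next_offset) / MEMBER_SPACE
```

A monitoring check could alert on the fraction well before it approaches 1.0, which is the kind of visibility the postmortem says was missing.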