Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae
От | Robert Haas |
---|---|
Тема | Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae |
Дата | |
Msg-id | CA+Tgmob1BtWcP6R5-toVHB5wqHasPTSR2TJkcDCutMzaUYBaHQ@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae (Robert Haas <robertmhaas@gmail.com>) |
Ответы |
Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae
(Peter Geoghegan <pg@bowt.ie>)
Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae (Melanie Plageman <melanieplageman@gmail.com>) Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae (Andres Freund <andres@anarazel.de>) |
Список | pgsql-bugs |
On Fri, Mar 22, 2024 at 3:34 PM Robert Haas <robertmhaas@gmail.com> wrote: > I think we need to understand what's actually happening here before we > jump to solutions. I think it's clear that we can't determine a > relfrozenxid in a way that is untethered from the decisions made while > pruning, but surely whoever wrote this code intended for there to be > some relationship between GlobalVisState and > VacuumCutoffs->OldestXmin. For example, it may be that they intended > for VacuumCutoffs->OldestXmin to always contain an XID that is older > than any XID that the GlobalVisState could possibly have pruned; but > they might have done that incorrectly, in such a fashion that when > GlobalVisState->maybe_needed moves backward the intended invariant no > longer holds. OK, I did some more analysis of this. I kind of feel like maybe this should be getting more attention from Melanie (since it was the commit of her patch that triggered the addition of this to the open items list) or Andres (because it has been alleged to be related to the GlobalVisTest stuff), but here we are. Maybe this is already clear to some other people here, but I think the bug is essentially the result of this code in heap_vacuum_rel: vacrel->aggressive = vacuum_get_cutoffs(rel, params, &vacrel->cutoffs); vacrel->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel); vacrel->vistest = GlobalVisTestFor(rel); The idea here seems to be that because we call vacuum_get_cutoffs() before we call GlobalVisTestFor(), the cutoff established by the former should precede the cutoff established by the latter. vacrel->cutoffs.OldestXmin is propagated into vacrel->NewRelfrozenXid, and although the latter value seems like it can get updated later, it remains true that vacrel->cutoffs.OldestXmin affects how relfrozenxid ultimately gets set. However, that value is set in vacuum_get_cutoffs() to the result of GetOldestNonRemovableTransactionId(rel); and that function works with the results of ComputeXidHorizons(). Meanwhile, GlobalVisTestFor() creates a visibility test object which is ultimately going to be tested inside heap_page_prune() using GlobalVisTestIsRemovableXid() which is a wrapper around GlobalVisTestIsRemovableFullXid() which can call GlobalVisUpdate() which can call ... ta da ... ComputeXidHorizons(). So the behavior of ComputeXidHorizons() is the key point here. And sure enough, the header comment for that function says this: * Note: despite the above, it's possible for the calculated values to move * backwards on repeated calls. The calculated values are conservative, so * that anything older is definitely not considered as running by anyone * anymore, but the exact values calculated depend on a number of things. For * example, if there are no transactions running in the current database, the * horizon for normal tables will be latestCompletedXid. If a transaction * begins after that, its xmin will include in-progress transactions in other * databases that started earlier, so another call will return a lower value. * Nonetheless it is safe to vacuum a table in the current database with the * first result. There are also replication-related effects: a walsender * process can set its xmin based on transactions that are no longer running * on the primary but are still being replayed on the standby, thus possibly * making the values go backwards. In this case there is a possibility that * we lose data that the standby would like to have, but unless the standby * uses a replication slot to make its xmin persistent there is little we can * do about that --- data is only protected if the walsender runs continuously * while queries are executed on the standby. (The Hot Standby code deals * with such cases by failing standby queries that needed to access * already-removed data, so there's no integrity bug.) That means that heap_vacuum_rel() isn't guaranteeing anything by calling vacuum_get_cutoffs() before it calls GlobalVisTestFor(). I started wondering when this bug got introduced, and concluded that it probably dates all the way back to dc7420c2c9274a283779ec19718d2d16323640c0, when Andres first introduced the GlobalVisTest stuff. The code was substantially refactored afterwards by Peter Geoghegan, in b4af70cb210393c9c8f41643acf6b213e21178e7 and 73f6ec3d3c8d5786c54373e71a096e5acf78e7ca. However, as far as I can see, neither of those commits created the problem; they just moved code around, and in fact I'd say that the problem is easier to see now than it was before all of that refactoring because the two problematic initializations are now closer together than they were originally. I think that Peter may have been under the impression at the time of the latter commit that the order of initializing vacrel->cutoffs vs. vacrel->vistest was significant, because the commit message says: Explain the relationship between vacuumlazy.c's vistest and OldestXmin cutoffs. These closely related cutoffs are different in subtle but important ways. Also document a closely related rule: we must establish rel_pages _after_ OldestXmin to ensure that no XID < OldestXmin can be missed by lazy_scan_heap(). And a comment in that commit says: + * We expect vistest will always make heap_page_prune remove any deleted + * tuple whose xmax is < OldestXmin. lazy_scan_prune must never become Neither of these is perhaps 100% explicit about the ordering being relevant, but if that wasn't thought to make this code safe, then I don't know what was thought to make it safe. At any rate, unless I'm missing something, which is definitely possible, the actual bug dates to dc7420c2c, which means whatever fix we come up with here will need to be back-patched back to v14. But what exactly is the fix here? I now understand why there was discussion of preventing maybe_needed from going backward, but I'm not sure that's quite the right idea. I think the problem isn't exactly that the maybe_needed value that is updated by GlobalVisUpdateApply() can go backwards, but rather that the SOMETHING_oldest_removable value on which it depends can go backwards. AFAICS, the problem isn't so much that the GlobalVisTest itself can retreat to an older value, but that it can retreat compared to the OldestXmin obtained from calling GetOldestNonRemovableTransactionId(), which depends on the SOMETHING_oldest_removable threshold but *not* on maybe_needed. Perhaps, for some reason I'm not grokking at the moment, preventing maybe_needed from retreating would be good enough to prevent trouble in practice, but it doesn't appear to me to be the most principled solution as of this writing. I kind of feel like the code is just going at this in a way that's completely backwards. It feels like what we should be doing is looking at the actual states of the pages post-pruning, and specifically what the oldest (M)XIDs that are really present there are. But changing that seems like a lot to do in a fix that needs to be back-patched. It's tempting to look for a narrower fix, where we just find out the oldest horizon that was ever actually used. But I don't feel very good about doing that by tampering with ComputeXidHorizons(), either. That seems like it could have nasty side effects in terms of either performance or correctness, which is really scary for a fix that needs to be back-patched. I wonder if there's some way that we can get a more reliable idea of what XIDs were used for pruning that is strictly local to vacuum and can't affect the rest of the system (and is also hopefully not so conservative that it causes damage by unduly reducing relfrozenxid advancement). However, I don't have a good idea about what such a fix might look like right now, and I'm out of time for today. -- Robert Haas EDB: http://www.enterprisedb.com
В списке pgsql-bugs по дате отправления:
Следующее
От: Peter GeogheganДата:
Сообщение: Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae