Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()
From | Noah Misch |
---|---|
Subject | Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune() |
Date | |
Msg-id | 20240110193851.f0.nmisch@google.com |
In response to | Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune() (Peter Geoghegan <pg@bowt.ie>) |
Responses | Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune() |
List | pgsql-bugs |
On Wed, Jan 10, 2024 at 02:06:42PM -0500, Peter Geoghegan wrote:
> On Tue, Jan 9, 2024 at 6:16 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > On Tue, Jan 9, 2024 at 4:44 PM Noah Misch <noah@leadboat.com> wrote:
> > > On Tue, Jan 09, 2024 at 03:59:19PM -0500, Peter Geoghegan wrote:
> > > > Did the affected system that you investigated happen to have an
> > > > atypically high number of databases? The system 15.4 system that I saw
> > > > the problem on had almost 3,000 databases.
> > >
> > > No, single-digit database count here.
> >
> > My suspicion was that this factor might increase the propensity of
> > calls to GetOldestNonRemovableTransactionId (used to establish
> > VACUUM's OldestXmin) to not agree with the GlobalVis* based state used
> > by pruneheap.c, in the way that we need to worry about here (i.e.
> > inconsistencies that lead to VACUUM getting stuck inside
> > lazy_scan_prune's loop).
>
> Another question about your database/system: does VACUUM get stuck
> while scanning a page some time after it has already completed a round
> of index vacuuming?

I don't know. That particular system experienced the infinite loop only once.

> That's what I see here -- I don't think that pruning leaves behind
> even a single live heap tuple, despite scanning thousands of pages
> before reaching the page that it gets stuck on. Could be another red
> herring. But it doesn't seem impossible that some of the nbtree calls
> to procarray.c routines performed by code added by my commit
> 9dd963ae25, "Recycle nbtree pages deleted during same VACUUM", are
> somehow related. That is, that code could be part of the chain of
> events that cause the problem (whether or not the code itself is
> technically at fault).
>
> I'm referring to calls such as the
> "GetOldestNonRemovableTransactionId(NULL)" and
> "GlobalVisCheckRemovableFullXid()" calls that take place inside
> _bt_pendingfsm_finalize(). It's not like we do stuff like that in very
> many other places.

I see what you mean about the rarity and potential importance of "GetOldestNonRemovableTransactionId(NULL)". There's just one other caller, vac_update_datfrozenxid(), which calls it for an unrelated reason.
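
For readers following the thread, the kind of disagreement described in the quoted text can be modeled roughly as below. This is a toy Python sketch, not PostgreSQL code: `DeletedTuple`, `prune`, `is_dead`, `scan_page`, and the `max_retries` cap are all hypothetical names introduced for illustration, and plain integer comparison stands in for real XID comparison (which has to handle wraparound). It assumes only what the thread states: pruning uses one horizon (the GlobalVis* based state), the later check uses VACUUM's OldestXmin, and the page is retried while the two disagree.

```python
from dataclasses import dataclass


@dataclass
class DeletedTuple:
    # XID of the (assumed committed) transaction that deleted this tuple.
    xmax: int


def prune(page, prune_horizon):
    """Toy stand-in for pruning: drop tuples deleted before the pruning horizon."""
    return [t for t in page if not t.xmax < prune_horizon]


def is_dead(tup, oldest_xmin):
    """Toy stand-in for the post-prune check against VACUUM's OldestXmin."""
    return tup.xmax < oldest_xmin


def scan_page(page, oldest_xmin, prune_horizon, max_retries=5):
    """Prune, then re-check; retry while a tuple still looks dead.

    Returns the number of retries needed, or None if the cap is hit.
    The cap exists only so this model terminates; the loop discussed
    in the thread has no such limit.
    """
    for attempt in range(max_retries):
        page = prune(page, prune_horizon)
        if not any(is_dead(t, oldest_xmin) for t in page):
            return attempt
    return None


if __name__ == "__main__":
    page = [DeletedTuple(xmax=100)]
    # Horizons agree: pruning removes the tuple, no retry is needed.
    print(scan_page(page, oldest_xmin=150, prune_horizon=150))
    # Pruning horizon lags OldestXmin: the tuple survives pruning but still
    # looks dead to the check, so every retry repeats the same outcome.
    print(scan_page(page, oldest_xmin=150, prune_horizon=90))
```

In this model the retry only makes progress if pruning and the later check eventually agree, which is why a pruning horizon that stays behind OldestXmin shows up as a loop that never exits.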