Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()
From | Peter Geoghegan |
---|---|
Subject | Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune() |
Date | |
Msg-id | CAH2-Wznv94Q_Td8OS8bAN7fYLpfU6CGgjn6Xau5eJ_sDxEGeBA@mail.gmail.com |
In response to | Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune() (Peter Geoghegan <pg@bowt.ie>) |
Responses |
Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()
|
List | pgsql-bugs |
On Tue, Jan 9, 2024 at 6:16 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Jan 9, 2024 at 4:44 PM Noah Misch <noah@leadboat.com> wrote:
> > On Tue, Jan 09, 2024 at 03:59:19PM -0500, Peter Geoghegan wrote:
> > > Did the affected system that you investigated happen to have an
> > > atypically high number of databases? The 15.4 system that I saw
> > > the problem on had almost 3,000 databases.
> >
> > No, single-digit database count here.
>
> My suspicion was that this factor might increase the propensity of
> calls to GetOldestNonRemovableTransactionId (used to establish
> VACUUM's OldestXmin) to not agree with the GlobalVis* based state used
> by pruneheap.c, in the way that we need to worry about here (i.e.
> inconsistencies that lead to VACUUM getting stuck inside
> lazy_scan_prune's loop).

Another question about your database/system: does VACUUM get stuck while scanning a page some time after it has already completed a round of index vacuuming? And if so, does an nbtree bulk delete end up deleting and then recycling many index leaf pages (e.g., due to bulk range deletions)?

That's what I see here -- I don't think that pruning leaves behind even a single live heap tuple, despite scanning thousands of pages before reaching the page that it gets stuck on.

Could be another red herring. But it doesn't seem impossible that some of the nbtree calls to procarray.c routines performed by code added by my commit 9dd963ae25, "Recycle nbtree pages deleted during same VACUUM", are somehow related. That is, that code could be part of the chain of events that cause the problem (whether or not the code itself is technically at fault).

I'm referring to calls such as the "GetOldestNonRemovableTransactionId(NULL)" and "GlobalVisCheckRemovableFullXid()" calls that take place inside _bt_pendingfsm_finalize(). It's not like we do stuff like that in very many other places.

--
Peter Geoghegan
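The kind of inconsistency described above can be illustrated with a toy model. This is only a sketch and is not PostgreSQL code: `prune`, `scan_page`, the plain-integer XIDs, and the `max_retries` safety cap are all hypothetical names and simplifications introduced for illustration (real XIDs wrap around, and the real loop has no such cap). It assumes that pruning decides removability using one horizon while the subsequent per-tuple check uses VACUUM's OldestXmin, and that any tuple the second check still considers dead triggers another pruning attempt.

```python
def prune(page, prune_horizon):
    """Toy pruning: keep only tuples whose deleting XID is NOT older than
    prune_horizon. `page` is a list of deleter XIDs (plain ints, no wraparound)."""
    return [xmax for xmax in page if xmax >= prune_horizon]


def scan_page(page, oldest_xmin, prune_horizon, max_retries=5):
    """Return how many retries were needed before the page settled,
    or None if it never settled within max_retries (hypothetical cap
    that stands in for 'stuck forever')."""
    for attempt in range(max_retries):
        page = prune(page, prune_horizon)
        # Second check uses oldest_xmin: a tuple it still sees as dead
        # means pruning "should" have removed it, so try again.
        if not any(xmax < oldest_xmin for xmax in page):
            return attempt
    return None


# Horizons agree: one pruning pass is enough.
print(scan_page([100, 105], oldest_xmin=110, prune_horizon=110))
# Pruning's horizon lags behind oldest_xmin: tuple 105 is never pruned,
# yet is always judged dead, so the loop never settles.
print(scan_page([100, 105], oldest_xmin=110, prune_horizon=103))
```

In this model the loop only terminates when the pruning horizon is at least as advanced as `oldest_xmin`; whether and how the two can diverge in a real system is exactly the open question in this thread.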