Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From: Peter Geoghegan
Subject: Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()
Msg-id: CAH2-Wznv94Q_Td8OS8bAN7fYLpfU6CGgjn6Xau5eJ_sDxEGeBA@mail.gmail.com
In reply to: Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()  (Peter Geoghegan <pg@bowt.ie>)
Responses: Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()  (Noah Misch <noah@leadboat.com>)
List: pgsql-bugs

On Tue, Jan 9, 2024 at 6:16 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Jan 9, 2024 at 4:44 PM Noah Misch <noah@leadboat.com> wrote:
> > On Tue, Jan 09, 2024 at 03:59:19PM -0500, Peter Geoghegan wrote:
> > > Did the affected system that you investigated happen to have an
> > > atypically high number of databases? The 15.4 system that I saw
> > > the problem on had almost 3,000 databases.
> >
> > No, single-digit database count here.
>
> My suspicion was that this factor might increase the propensity of
> calls to GetOldestNonRemovableTransactionId (used to establish
> VACUUM's OldestXmin) to not agree with the GlobalVis* based state used
> by pruneheap.c, in the way that we need to worry about here (i.e.
> inconsistencies that lead to VACUUM getting stuck inside
> lazy_scan_prune's loop).
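
A toy model of the suspected failure mode in the quoted paragraph above may help make it concrete. This is a sketch in Python, not PostgreSQL code: the names `HeapTuple`, `prune_page`, `is_dead_to_vacuum`, and `scan_page` are hypothetical, XIDs are plain integers with no wraparound, and the retry cap exists only so the demonstration terminates. It assumes that pruning removes a deleted tuple only when its deleter precedes the pruning horizon, while VACUUM separately treats the tuple as dead when its deleter precedes OldestXmin, and retries the page whenever a dead tuple survives pruning.

```python
from dataclasses import dataclass


@dataclass
class HeapTuple:
    xmax: int  # XID of the (committed) deleting transaction, toy integer


def prune_page(page, prune_horizon):
    """Toy pruning: keep only tuples whose deleter is NOT older than the
    pruning horizon (stand-in for the GlobalVis*-based state)."""
    return [t for t in page if not (t.xmax < prune_horizon)]


def is_dead_to_vacuum(t, oldest_xmin):
    """Toy visibility check against VACUUM's own cutoff (OldestXmin)."""
    return t.xmax < oldest_xmin


def scan_page(page, oldest_xmin, prune_horizon, max_retries=5):
    """Prune, then recheck each survivor against OldestXmin.

    Returns (converged, retries_used). The cap on retries is an
    assumption of this sketch, added so the stuck case can be observed.
    """
    for attempt in range(max_retries):
        page = prune_page(page, prune_horizon)
        if not any(is_dead_to_vacuum(t, oldest_xmin) for t in page):
            return True, attempt
        # A tuple that pruning kept is nonetheless dead to VACUUM: retry.
    return False, max_retries


if __name__ == "__main__":
    # Horizons agree: pruning removes the tuple, no retry needed.
    print(scan_page([HeapTuple(xmax=100)], oldest_xmin=150, prune_horizon=150))
    # Pruning horizon lags behind OldestXmin: every retry sees the same page.
    print(scan_page([HeapTuple(xmax=100)], oldest_xmin=150, prune_horizon=90))
```

In this model the loop converges whenever the pruning horizon is at or ahead of OldestXmin, and never converges when a tuple's deleter falls between the two, because retrying changes nothing about either horizon.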

Another question about your database/system: does VACUUM get stuck
while scanning a page some time after it has already completed a round
of index vacuuming? And if so, does an nbtree bulk delete end up
deleting and then recycling many index leaf pages (e.g., due to bulk
range deletions)?

That's what I see here -- I don't think that pruning leaves behind
even a single live heap tuple, despite scanning thousands of pages
before reaching the page that it gets stuck on. Could be another red
herring. But it doesn't seem impossible that some of the nbtree calls
to procarray.c routines performed by code added by my commit
9dd963ae25, "Recycle nbtree pages deleted during same VACUUM", are
somehow related. That is, that code could be part of the chain of
events that cause the problem (whether or not the code itself is
technically at fault).

I'm referring to calls such as the
"GetOldestNonRemovableTransactionId(NULL)" and
"GlobalVisCheckRemovableFullXid()" calls that take place inside
_bt_pendingfsm_finalize(). It's not like we do stuff like that in very
many other places.

--
Peter Geoghegan


