Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From	Noah Misch
Subject	Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()
Date	January 10, 2024, 19:38:51
Msg-id	20240110193851.f0.nmisch@google.com
In response to	Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune() (Peter Geoghegan <pg@bowt.ie>)
Responses	Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()
List	pgsql-bugs

On Wed, Jan 10, 2024 at 02:06:42PM -0500, Peter Geoghegan wrote:
> On Tue, Jan 9, 2024 at 6:16 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > On Tue, Jan 9, 2024 at 4:44 PM Noah Misch <noah@leadboat.com> wrote:
> > > On Tue, Jan 09, 2024 at 03:59:19PM -0500, Peter Geoghegan wrote:
> > > > Did the affected system that you investigated happen to have an
> > > > atypically high number of databases? The 15.4 system that I saw
> > > > the problem on had almost 3,000 databases.
> > >
> > > No, single-digit database count here.
> >
> > My suspicion was that this factor might increase the propensity of
> > calls to GetOldestNonRemovableTransactionId (used to establish
> > VACUUM's OldestXmin) to not agree with the GlobalVis* based state used
> > by pruneheap.c, in the way that we need to worry about here (i.e.
> > inconsistencies that lead to VACUUM getting stuck inside
> > lazy_scan_prune's loop).
> 
> Another question about your database/system: does VACUUM get stuck
> while scanning a page some time after it has already completed a round
> of index vacuuming?

I don't know.  That particular system experienced the infinite loop only once.

> That's what I see here -- I don't think that pruning leaves behind
> even a single live heap tuple, despite scanning thousands of pages
> before reaching the page that it gets stuck on. Could be another red
> herring. But it doesn't seem impossible that some of the nbtree calls
> to procarray.c routines performed by code added by my commit
> 9dd963ae25, "Recycle nbtree pages deleted during same VACUUM", are
> somehow related. That is, that code could be part of the chain of
> events that cause the problem (whether or not the code itself is
> technically at fault).
> 
> I'm referring to calls such as the
> "GetOldestNonRemovableTransactionId(NULL)" and
> "GlobalVisCheckRemovableFullXid()" calls that take place inside
> _bt_pendingfsm_finalize(). It's not like we do stuff like that in very
> many other places.

I see what you mean about the rarity and potential importance of
"GetOldestNonRemovableTransactionId(NULL)".  There's just one other caller,
vac_update_datfrozenxid(), which calls it for an unrelated reason.
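
For readers following the thread, the suspected failure mode quoted above (VACUUM's OldestXmin disagreeing with the GlobalVis* state used by pruning) can be illustrated with a toy model. This is a sketch under stated assumptions, not PostgreSQL source: DeadTuple, prune() and scan_prune() are hypothetical names, XIDs are plain integers with no wraparound, every deleting transaction is assumed committed, and the retry cap exists only so the sketch terminates.

```python
from dataclasses import dataclass


@dataclass
class DeadTuple:
    # XID of the (committed) deleting transaction; toy model only.
    xmax: int


def prune(page, prune_horizon):
    """Stand-in for pruning: drop tuples deleted before the pruning horizon."""
    return [t for t in page if t.xmax >= prune_horizon]


def scan_prune(page, oldest_xmin, prune_horizon, max_retries=5):
    """Stand-in for the retry loop: prune, then recheck against oldest_xmin.

    If any surviving tuple still looks dead relative to oldest_xmin, prune
    again. Returns (remaining tuples, retries used). The cap is an addition
    of this sketch; without it the inconsistent case would never return.
    """
    for retries in range(max_retries + 1):
        page = prune(page, prune_horizon)
        if not any(t.xmax < oldest_xmin for t in page):
            return page, retries
    raise RuntimeError("no progress: pruning horizon is behind oldest_xmin")


if __name__ == "__main__":
    page = [DeadTuple(xmax=100), DeadTuple(xmax=150)]
    # Horizons agree: one pruning pass removes everything.
    print(scan_prune(page, oldest_xmin=200, prune_horizon=200))
    # Horizons disagree: the tuple with xmax=150 is never pruned but always
    # looks dead, so the loop cannot make progress.
    try:
        scan_prune(page, oldest_xmin=200, prune_horizon=120)
    except RuntimeError as exc:
        print("stuck:", exc)
```

In this model the loop converges whenever the pruning horizon is at least as new as oldest_xmin; a tuple whose xmax falls between an older pruning horizon and oldest_xmin is the case that never resolves.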
