Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()
From | Peter Geoghegan |
---|---|
Subject | Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune() |
Date | |
Msg-id | CAH2-Wzn57T=d7eB90m0wr+AiAXetk-NWA=ntS89R2mOcDimNsQ@mail.gmail.com |
In reply to | Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune() (Noah Misch <noah@leadboat.com>) |
Responses | Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune() |
List | pgsql-bugs |
On Tue, Jan 9, 2024 at 4:44 PM Noah Misch <noah@leadboat.com> wrote:
> On Tue, Jan 09, 2024 at 03:59:19PM -0500, Peter Geoghegan wrote:
> > Did the affected system that you investigated happen to have an
> > atypically high number of databases? The system 15.4 system that I saw
> > the problem on had almost 3,000 databases.
>
> No, single-digit database count here.

My suspicion was that this factor might increase the propensity of calls to GetOldestNonRemovableTransactionId (used to establish VACUUM's OldestXmin) to not agree with the GlobalVis* based state used by pruneheap.c, in the way that we need to worry about here (i.e. inconsistencies that lead to VACUUM getting stuck inside lazy_scan_prune's loop).

Using gdb I was able to determine that ComputeXidHorizonsResultLastXmin == RecentXmin at some point long after the system gets stuck (when I actually looked). So GlobalVisTestShouldUpdate() doesn't return true at that point. And, I see that VACUUM's OldestXmin value is between GlobalVisDataRels.maybe_needed and GlobalVisDataRels.definitely_needed.

The deleted tuple's xmax is committed according to OldestXmin (i.e. it's a value < OldestXmin), and is < GlobalVisDataRels.definitely_needed, too. The same tuple xmax is > GlobalVisDataRels.maybe_needed. As for this tuple's xmin, it was already frozen by a previous VACUUM operation. The tuple infomask flags indicate that it's a pretty standard deleted tuple.

Overall, there aren't a lot of details here that seem like they might be out of the ordinary, hinting at a specific underlying cause. It looks more like the assumptions that we make about OldestXmin agreeing with GlobalVis* state just aren't quite robust, in general. Ideally I'd be able to point to some specific assumption that has been violated -- and we might yet tie the problem to some specific detail that I've yet to identify.
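To make the state described above concrete, here is a toy model of the two removability tests disagreeing. It is only a sketch, not PostgreSQL source: every name (ModelXid, ModelGlobalVis, the model_* functions) is hypothetical, XIDs are plain 64-bit integers with no wraparound, the deleter is assumed to have committed, and the pruning-side test is assumed to answer "not removable" for an XID between the two horizons when the horizons are not recomputed (the case where GlobalVisTestShouldUpdate() doesn't return true).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: 64-bit XIDs, no wraparound, hypothetical names. */
typedef uint64_t ModelXid;

typedef struct ModelGlobalVis
{
    ModelXid maybe_needed;      /* XIDs below this are certainly removable */
    ModelXid definitely_needed; /* XIDs at or above this are certainly needed */
} ModelGlobalVis;

/*
 * VACUUM-side view: a committed deleter whose xmax precedes OldestXmin
 * makes the tuple dead.
 */
bool
model_dead_per_oldest_xmin(ModelXid xmax, ModelXid oldest_xmin)
{
    return xmax < oldest_xmin;
}

/*
 * Pruning-side view, assuming the horizons are NOT recomputed: an xmax
 * that falls between the two bounds is conservatively treated as still
 * needed.
 */
bool
model_removable_per_globalvis(const ModelGlobalVis *vis, ModelXid xmax)
{
    assert(vis->maybe_needed <= vis->definitely_needed);

    if (xmax < vis->maybe_needed)
        return true;
    if (xmax >= vis->definitely_needed)
        return false;
    return false;               /* in between, and no recomputation */
}

/*
 * The problematic combination: pruning leaves the tuple in place, yet the
 * OldestXmin-based check still calls it dead, so a loop that retries until
 * the two agree would make no progress.
 */
bool
model_would_retry(const ModelGlobalVis *vis, ModelXid xmax,
                  ModelXid oldest_xmin)
{
    return model_dead_per_oldest_xmin(xmax, oldest_xmin) &&
        !model_removable_per_globalvis(vis, xmax);
}
```

With made-up horizons maybe_needed = 100 and definitely_needed = 200, and OldestXmin = 150, an xmax of 120 satisfies every condition reported above (> maybe_needed, < OldestXmin, < definitely_needed), and model_would_retry() returns true for it. Nothing in the model changes between iterations, which is the sense in which the retry cannot terminate unless the horizons get recomputed.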
As I said upthread, I'm concerned that code in places like procarray.c is rather loose about how the horizons are recomputed, in a way that doesn't sit well with me. GlobalVisTestShouldUpdate() thinks that it's okay to use ComputeXidHorizonsResultLastXmin-based heuristics to decide when to recompute horizons. It is more or less treated as a matter of weighing costs against benefits -- not as a potential correctness issue.

--
Peter Geoghegan