On Tue, Apr 16, 2024 at 11:01:08AM -0700, Andres Freund wrote:
> On 2024-04-15 20:58:25 -0700, Noah Misch wrote:
> > On Mon, Apr 15, 2024 at 02:10:20PM -0700, Andres Freund wrote:
> > > On 2024-04-15 13:52:04 -0700, Noah Misch wrote:
> > > > I have observed the infinite loop in production with v15.5, so that
> > > > non-reproduce outcome is a limitation in the test procedure. (v14.2 added
> > > > those two commits.)
> > >
> > > How closely have you analyzed those production occurrences? It's not too hard
> > > to imagine some form of corruption that leads to such a loop, but which isn't
> > > related to the horizon going backwards? E.g. a corrupted HOT chain can lead
> > > to heap_page_prune() not acting on a DEAD tuple, but lazy_scan_prune() would
> > > then encounter a DEAD tuple.

I've not seen this recur for any one table, so I think we can rule out
corruption modes that would reach the loop every time. (If a hypothesized
loop explanation calls for both corruption and horizon movement, that could
still apply.)

> > One occurrence had these facts:
> >
> > HeapTupleHeaderGetXmin = 95271613
> > HeapTupleHeaderGetUpdateXid = 95280147
> > vacrel->OldestXmin = 95317451
> > vacrel->vistest->definitely_needed = 95318928
> > vacrel->vistest->maybe_needed = 93624425
> >
> > How compatible are those with the corruption vectors you have in view?
>
> Do you have more information about the page this was on? E.g. pageinspect
> output? Or at least the infomasks of that tuple?

No, unfortunately.

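For reference, here is a minimal sketch of how I read the values quoted
above. It is a simplified model, not the PostgreSQL source: the type Xid,
the enum PruneVerdict, and the functions prune_verdict() and
dead_per_oldest_xmin() are hypothetical names for illustration. It uses
plain 32-bit comparisons, ignores XID wraparound and any horizon
recomputation, and assumes the updating transaction committed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a transaction ID; wraparound is ignored. */
typedef uint32_t Xid;

/* Hypothetical three-way outcome of a test against a fuzzy horizon. */
typedef enum
{
	REMOVABLE,			/* xid is older than maybe_needed */
	NOT_REMOVABLE,		/* xid is at or past definitely_needed */
	UNDECIDED			/* xid falls between the two boundaries */
} PruneVerdict;

/*
 * Simplified model of a check against the two vistest boundaries.
 * Assumption: the in-between case needs a more accurate horizon
 * before a decision can be made; that step is not modeled here.
 */
static PruneVerdict
prune_verdict(Xid xid, Xid maybe_needed, Xid definitely_needed)
{
	if (xid < maybe_needed)
		return REMOVABLE;
	if (xid >= definitely_needed)
		return NOT_REMOVABLE;
	return UNDECIDED;
}

/*
 * Simplified model of a check against a single fixed cutoff.
 * Assumption: the updater committed, so only the XID ordering matters.
 */
static bool
dead_per_oldest_xmin(Xid update_xid, Xid oldest_xmin)
{
	return update_xid < oldest_xmin;
}
```

With the quoted values, prune_verdict(95280147, 93624425, 95318928) gives
UNDECIDED, while dead_per_oldest_xmin(95280147, 95317451) gives true. If
pruning ends up keeping the tuple in that in-between case while the
OldestXmin check calls it DEAD, that is the kind of disagreement that could
drive the retry loop. Note that maybe_needed is about 1.7 million XIDs
behind OldestXmin here.
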
> I assume this was a normal
> data table (i.e. not a [shared|user] catalog table or temp table)?

Yes, a normal data table.

> Do you know what ComputeXidHorizonsResultLastXmin, RecentXmin were set to?

No.

> > I tried briefly to understand
> > https://postgr.es/m/flat/20240415173913.4zyyrwaftujxthf2@awork3.anarazel.de
> > but I felt verifying its argument was going to be a big job for me. Would
> > those errors happen transiently, like the infinite loop, or would they
> > persist until something resets the tuple fields (e.g. ATRewriteTables())?
>
> I think they'd be transient, because the visibility information during the
> next vacuum would presumably not be "skewed" anymore?

That is good.

> Of course it's possible
> you'd re-encounter the problem, if you constantly have horizons going back and
> forth. But I'd still classify that as transient.

Certainly.