Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae
From | Noah Misch |
---|---|
Subject | Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae |
Date | |
Msg-id | 20240416035825.8e.nmisch@google.com |
In response to | Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae (Andres Freund <andres@anarazel.de>) |
Responses |
Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae
|
List | pgsql-bugs |
On Mon, Apr 15, 2024 at 02:10:20PM -0700, Andres Freund wrote:
> On 2024-04-15 13:52:04 -0700, Noah Misch wrote:
> > On Mon, Apr 15, 2024 at 12:35:59PM -0400, Robert Haas wrote:
> > > I propose to remove this open item from
> > > https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items
> > >
> > > On the original thread (BUG #17257), Alexander Lakhin says that he
> > > can't reproduce this after dad1539ae/18b87b201. Based on my analysis
> >
> > I have observed the infinite loop in production with v15.5, so that
> > non-reproduce outcome is a limitation in the test procedure. (v14.2 added
> > those two commits.)
>
> How closely have you analyzed those production occurences? It's not too hard
> to imagine some form of corruption that leads to such a loop, but which isn't
> related to the horizon going backwards? E.g. a corrupted HOT chain can lead
> to heap_page_prune() not acting on a DEAD tuple, but lazy_scan_prune() would
> then encounter a DEAD tuple.

One occurrence had these facts:

HeapTupleHeaderGetXmin = 95271613
HeapTupleHeaderGetUpdateXid = 95280147
vacrel->OldestXmin = 95317451
vacrel->vistest->definitely_needed = 95318928
vacrel->vistest->maybe_needed = 93624425

How compatible are those with the corruption vectors you have in view?

> > > of the code, I suspect that there is a residual bug, or at least that
> > > there was one prior to 6f47f6883151366c031cd6fd4011e66d2c702a90.
> >
> > Can you say more about how 6f47f6883151366c031cd6fd4011e66d2c702a90 mitigated
> > the regression that 1ccc1e05ae introduced? Thanks for discovering that.
>
> Which regression has 1ccc1e05ae actually introduced? As I pointed out
> upthread, the proposed path to corruption doesn't seem to actually lead to
> corruption, "just" an error? Which actually seems considerably better than an
> endless retry loop that cannot be cancelled.

A transient, spurious error is far better than an uninterruptible infinite
loop with a buffer content lock held. If a transient error is the consistent
outcome, I would agree 1ccc1e05ae improved the situation and didn't regress
it. That would close the open item.

I tried briefly to understand
https://postgr.es/m/flat/20240415173913.4zyyrwaftujxthf2@awork3.anarazel.de
but I felt verifying its argument was going to be a big job for me. Would
those errors happen transiently, like the infinite loop, or would they persist
until something resets the tuple fields (e.g. ATRewriteTables())?

Thanks,
nm
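
The XID facts quoted in the message can be checked with a few integer comparisons. The Python sketch below is only an illustration, not PostgreSQL code. The function names are hypothetical, XIDs are compared as plain integers with no wraparound handling, the updating transaction is assumed to have committed, and the two tests are simplified readings of the OldestXmin cutoff and of the maybe_needed/definitely_needed boundaries (ignoring any horizon recomputation).

```python
# Simplified model of two removability tests applied to the same tuple.
# Assumptions (not taken from PostgreSQL source): plain integer XID
# comparison, no wraparound, the updating transaction committed, and no
# recomputation of the visibility horizons between the two tests.

def classify_by_oldest_xmin(update_xid, oldest_xmin):
    """Hypothetical stand-in for the OldestXmin-based cutoff:
    an update XID that precedes OldestXmin is treated as dead."""
    return "dead" if update_xid < oldest_xmin else "recently_dead"


def classify_by_vistest(update_xid, maybe_needed, definitely_needed):
    """Hypothetical stand-in for the two-boundary visibility test:
    below maybe_needed is removable, at or above definitely_needed is
    not removable, and anything in between is undecided."""
    if update_xid < maybe_needed:
        return "removable"
    if update_xid >= definitely_needed:
        return "not_removable"
    return "uncertain"


# Values reported in the message for one production occurrence.
facts = {
    "xmin": 95271613,
    "update_xid": 95280147,
    "oldest_xmin": 95317451,
    "definitely_needed": 95318928,
    "maybe_needed": 93624425,
}

by_cutoff = classify_by_oldest_xmin(facts["update_xid"], facts["oldest_xmin"])
by_vistest = classify_by_vistest(
    facts["update_xid"], facts["maybe_needed"], facts["definitely_needed"]
)

print("OldestXmin test:", by_cutoff)
print("vistest boundaries:", by_vistest)
```

Under these assumptions the update XID precedes OldestXmin yet falls between maybe_needed and definitely_needed, so the two simplified tests do not agree on whether the tuple is removable.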