Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum
From | Peter Geoghegan |
---|---|
Subject | Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum |
Date | |
Msg-id | CAH2-WzmqBtFwdmBW=AG8ZZW_uF5a4entmv4QmqhsXOT-Fj4L-g@mail.gmail.com |
In reply to | Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum (Andres Freund <andres@anarazel.de>) |
Responses |
Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum
Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum |
List | pgsql-bugs |
On Wed, Nov 10, 2021 at 11:20 AM Andres Freund <andres@anarazel.de> wrote:
> The way this definitely breaks - I have been able to reproduce this in
> isolation - is when one tuple is processed twice by heap_prune_chain(), and
> the result of HeapTupleSatisfiesVacuum() changes from
> HEAPTUPLE_DELETE_IN_PROGRESS to DEAD.

I had no idea that that was now possible. I really think that this ought to be documented centrally.

As you know, I don't like the way that vacuumlazy.c doesn't explain anything about the relationship between OldestXmin (which still exists, but isn't used for pruning), and the similar GlobalVisState state (used only during pruning). Surely this deserves to be addressed, because we expect these two things to agree in certain specific ways. But not necessarily in others.

> Note that there are several paths < 14, that cause HTSV()'s answer to change
> for the same xid. E.g. when the transaction inserting a tuple version aborts,
> we go from HEAPTUPLE_INSERT_IN_PROGRESS to DEAD.

Right -- that one I certainly knew about. After all, the tupgone-ectomy work from my commit 8523492d specifically targeted this case.

> But I haven't quite found a
> path to trigger problems with that, because there won't be redirects to a
> tuple version that is HEAPTUPLE_INSERT_IN_PROGRESS (but there can be redirects
> to a HEAPTUPLE_DELETE_IN_PROGRESS or RECENTLY_DEAD).

That explains why the snapshot scalability either made these problems possible for the first time, or at the very least made them far far more likely in practice. The relevant code in pruneheap.c was always incredibly fragile -- no question. Even still, there is really no good reason to believe that that was actually a problem before commit dc7420c2. Even if we assume that there's a problem before 14, the surface area is vastly smaller than on 14 -- the relevant pruneheap.c code hasn't really ever changed since HOT went in.
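To make the failure mode described above concrete, here is a toy model in C of a tuple being examined twice while the answer of the visibility check changes in between. It is only a sketch under stated assumptions: every name in it (`ToyStatus`, `ToyHorizon`, `ToyStatusCache`, `toy_visibility()`, and so on) is hypothetical and does not exist in PostgreSQL, and the per-tuple cache is shown purely to illustrate one way of making repeated visits agree, not as the fix discussed in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for the two HTSV answers mentioned above (not PostgreSQL types). */
typedef enum
{
    TOY_DELETE_IN_PROGRESS,
    TOY_DEAD
} ToyStatus;

/* Hypothetical shared state that can change while a page is being pruned. */
typedef struct
{
    bool deleter_done;
} ToyHorizon;

#define TOY_NTUPLES 4

/* Hypothetical per-page cache: one remembered answer per tuple. */
typedef struct
{
    bool      valid[TOY_NTUPLES];
    ToyStatus status[TOY_NTUPLES];
} ToyStatusCache;

/* Recomputes the answer from the current shared state on every call. */
static ToyStatus
toy_visibility(const ToyHorizon *h)
{
    return h->deleter_done ? TOY_DEAD : TOY_DELETE_IN_PROGRESS;
}

/* Computes the answer once per tuple and returns the remembered value afterwards. */
static ToyStatus
toy_visibility_cached(ToyStatusCache *c, int tup, const ToyHorizon *h)
{
    assert(tup >= 0 && tup < TOY_NTUPLES);
    if (!c->valid[tup])
    {
        c->status[tup] = toy_visibility(h);
        c->valid[tup] = true;
    }
    return c->status[tup];
}

/* Visits one tuple twice without caching; true if the two answers differ. */
bool
toy_uncached_visits_disagree(void)
{
    ToyHorizon h = { .deleter_done = false };
    ToyStatus  first = toy_visibility(&h);

    h.deleter_done = true;      /* simulated concurrent change between visits */
    return first != toy_visibility(&h);
}

/* Same two visits, but through the cache; true if the two answers differ. */
bool
toy_cached_visits_disagree(void)
{
    ToyHorizon     h = { .deleter_done = false };
    ToyStatusCache c = { .valid = { false } };
    ToyStatus      first = toy_visibility_cached(&c, 2, &h);

    h.deleter_done = true;      /* same simulated change as above */
    return first != toy_visibility_cached(&c, 2, &h);
}
```

In this model `toy_uncached_visits_disagree()` returns true and `toy_cached_visits_disagree()` returns false: the second visit only sees a different status when the answer is recomputed rather than remembered.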

And so I think that the most sensible course of action here is this: commit a fix to Postgres 14 + HEAD only -- no backpatch to earlier versions. We could go back further than that, but ISTM that the risk of causing new problems far outweighs the benefits. Whereas I feel pretty confident that we need to do something on 14.

--
Peter Geoghegan