Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum

Поиск

Список

Период

Сортировка

От	Peter Geoghegan
Тема	Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum
Дата	10 ноября 2021 г. 21:04:43
Msg-id	CAH2-WzmqBtFwdmBW=AG8ZZW_uF5a4entmv4QmqhsXOT-Fj4L-g@mail.gmail.com обсуждение исходный текст
Ответ на	Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum (Andres Freund <andres@anarazel.de>)
Ответы	Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum
Список	pgsql-bugs

Дерево обсуждения

On Wed, Nov 10, 2021 at 11:20 AM Andres Freund <andres@anarazel.de> wrote:
> The way this definitely breaks - I have been able to reproduce this in
> isolation - is when one tuple is processed twice by heap_prune_chain(), and
> the result of HeapTupleSatisfiesVacuum() changes from
> HEAPTUPLE_DELETE_IN_PROGRESS to DEAD.

I had no idea that that was now possible. I really think that this
ought to be documented centrally.

As you know, I don't like the way that vacuumlazy.c doesn't explain
anything about the relationship between OldestXmin (which still
exists, but isn't used for pruning), and the similar GlobalVisState
state (used only during pruning). Surely this deserves to be
addressed, because we expect these two things to agree in certain
specific ways. But not necessarily in others.

> Note that there are several paths < 14, that cause HTSV()'s answer to change
> for the same xid. E.g. when the transaction inserting a tuple version aborts,
> we go from HEAPTUPLE_INSERT_IN_PROGRESS to DEAD.

Right -- that one I certainly knew about.  After all, the
tupgone-ectomy work from my commit 8523492d specifically targeted this
case.

> But I haven't quite found a
> path to trigger problems with that, because there won't be redirects to a
> tuple version that is HEAPTUPLE_INSERT_IN_PROGRESS (but there can be redirects
> to a HEAPTUPLE_DELETE_IN_PROGRESS or RECENTLY_DEAD).

That explains why the snapshot scalability either made these problems
possible for the first time, or at the very least made them far far
more likely in practice.

The relevant code in pruneheap.c was always incredibly fragile -- no
question. Even still, there is really no good reason to believe that
that was actually a problem before commit dc7420c2. Even if we assume
that there's a problem before 14, the surface area is vastly smaller
than on 14 -- the relevant pruneheap.c code hasn't really ever changed
since HOT went in. And so I think that the most sensible course of
action here is this: commit a fix to Postgres 14 + HEAD only -- no
backpatch to earlier versions.

We could go back further than that, but ISTM that the risk of causing
new problems far outweighs the benefits. Whereas I feel pretty
confident that we need to do something on 14.

-- 
Peter Geoghegan

В списке pgsql-bugs по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum