Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum
From | Peter Geoghegan |
---|---|
Subject | Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum |
Date | |
Msg-id | CAH2-WznNKY6ydUczuTXutVmb_dj3MnAcoaVYc8xyignWfNQ=FQ@mail.gmail.com |
In reply to | Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum (Peter Geoghegan <pg@bowt.ie>) |
Responses |
Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum
Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum |
List | pgsql-bugs |
On Tue, Nov 9, 2021 at 9:51 AM Peter Geoghegan <pg@bowt.ie> wrote:
> I've discussed this privately with Andres -- expect more from him
> soon. I came up with more sophisticated instrumentation (better
> assertions, really) that shows that the problem begins in VACUUM, not
> opportunistic pruning (certainly with the test case we have).

Attached is a WIP fix for the bug. The idea here is to follow all HOT chains in an initial pass over the page, following even LIVE heap-only tuples. Any heap-only tuples that we cannot determine to be part of some valid HOT chain (following an initial pass over the whole heap page) will now be processed in a second pass over the page. We expect (and assert) that these "disconnected" heap-only tuples will all be either DEAD or RECENTLY_DEAD. We treat them as DEAD either way, on the grounds that they must be from an aborted xact in any case. Note that we sometimes do something very similar already -- we can sometimes consider some tuples from a HOT chain DEAD, even though they're RECENTLY_DEAD (provided a later tuple from the chain really is DEAD).

The patch also has more detailed assertions inside heap_page_prune(). These should catch any HOT chain invariant violations at just about the earliest opportunity, at least when assertions are enabled, especially because we now follow every HOT chain from beginning to end, even when we already know that there are no more DEAD/RECENTLY_DEAD tuples in the chain to be found.

I'm not sure why this seems to have become more of a problem following the snapshot scalability work from Andres -- Alexander mentioned that commit dc7420c2 looked like it was the source of the problem here, but I can't see any reason why that might be true (even though I accept that it might well *appear* to be true). I believe Andres has some theory on that, but I don't know the details myself.

AFAICT, this is a live bug on all supported versions. We simply weren't being careful enough about breaking the invariant that an LP_REDIRECT can only point to a valid heap-only tuple. The really surprising thing here is that it took this long for it to visibly break.

--
Peter Geoghegan
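
For readers who want the two-pass idea from the message in concrete form, here is a toy sketch in C. It is not the attached patch and does not use PostgreSQL's real structures: `ToyItem`, `Visibility`, `TOY_MAX_ITEMS`, and `toy_prune_two_pass` are hypothetical names, and chain membership is reduced to following a `next` index with none of the checks the real code performs. Where the message says the patch asserts that disconnected heap-only tuples are DEAD or RECENTLY_DEAD, this sketch returns -1 instead so the condition can be observed.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: these types are assumptions, not PostgreSQL definitions. */
typedef enum { ITEM_UNUSED, ITEM_REDIRECT, ITEM_NORMAL } ItemKind;
typedef enum { VIS_LIVE, VIS_RECENTLY_DEAD, VIS_DEAD } Visibility;

typedef struct {
    ItemKind   kind;
    bool       heap_only; /* reachable only through a HOT chain */
    int        next;      /* next chain member or redirect target, -1 if none */
    Visibility vis;       /* ignored for redirects and unused items */
} ToyItem;

#define TOY_MAX_ITEMS 32

/*
 * Sets prune[i] for every tuple the toy model would remove.
 * Returns the number of disconnected heap-only tuples found in pass 2,
 * or -1 if a disconnected heap-only tuple is LIVE (invariant violation).
 */
int toy_prune_two_pass(const ToyItem *items, int nitems, bool *prune)
{
    bool visited[TOY_MAX_ITEMS] = {false};
    int  disconnected = 0;

    assert(nitems >= 0 && nitems <= TOY_MAX_ITEMS);
    for (int i = 0; i < nitems; i++)
        prune[i] = false;

    /* Pass 1: follow every chain from its root to its end, LIVE members too. */
    for (int root = 0; root < nitems; root++) {
        int chain[TOY_MAX_ITEMS];
        int nchain = 0;
        int last_dead = -1; /* chain position of the latest DEAD member */
        bool is_root = items[root].kind == ITEM_REDIRECT ||
            (items[root].kind == ITEM_NORMAL && !items[root].heap_only);

        if (!is_root)
            continue;

        int cur = (items[root].kind == ITEM_REDIRECT) ? items[root].next : root;
        while (cur >= 0 && cur < nitems &&
               items[cur].kind == ITEM_NORMAL && !visited[cur] &&
               (items[cur].heap_only || cur == root)) {
            visited[cur] = true;
            if (items[cur].vis == VIS_DEAD)
                last_dead = nchain;
            chain[nchain++] = cur;
            cur = items[cur].next;
        }

        /* Members up to the latest DEAD one go, even if only RECENTLY_DEAD. */
        for (int k = 0; k <= last_dead; k++)
            prune[chain[k]] = true;
    }

    /* Pass 2: heap-only tuples that no chain reached are "disconnected". */
    for (int i = 0; i < nitems; i++) {
        if (items[i].kind != ITEM_NORMAL || !items[i].heap_only || visited[i])
            continue;
        if (items[i].vis == VIS_LIVE)
            return -1;      /* the message says the patch asserts here */
        prune[i] = true;    /* treated as DEAD either way */
        disconnected++;
    }
    return disconnected;
}
```

In this model a caller fills an array of `ToyItem`, passes a `bool` array of the same length, and reads back which slots were marked. The only point of the sketch is the ordering: connectivity is established for the whole page first, and only then are the leftover heap-only tuples classified.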