Обсуждение: Adding further hardening to nbtree page deletion

Поиск
Список
Период
Сортировка

Adding further hardening to nbtree page deletion

От
Peter Geoghegan
Дата:
Attached patch adds additional hardening to nbtree page deletion. It
makes nbtree VACUUM tolerate a certain sort of cross-page
inconsistencies in the structure of an index (corruption). VACUUM can
press on, avoiding an eventual wraparound/xidStopLimit failure in
environments where nobody notices the problem for an extended period.

This is very similar to my recent commit 5abff197 (though it's even
closer to commit 5b861baa). Once again we're demoting an ERROR to a
LOG message, and pressing on with vacuuming. I propose that this patch
be backpatched all the way, too. The hardening added by the patch
seems equally well targeted and low risk. It's a parent/child
inconsistency, as opposed to a sibling inconsistency. Very familiar
stuff, overall.

I have seen an internal report of the ERROR causing issues for a
production instance, so this definitely can fail in the field on
modern Postgres versions. Though this particular inconsistency ("right
sibling is not next child...") has a long history. It has definitely
been spotted in the field several times over many years. This 2006
thread about problems with a Wisconsin courts database is one example
of that:

https://www.postgresql.org/message-id/flat/3355.1144873721%40sss.pgh.pa.us#b0a89b2d9e7f6a3c818fdf723b8fa29b

At the time the ERROR was a PANIC. A few years later (in 2010), it was
demoted to an ERROR (see commit 8fa30f90). And now I want to demote it
to a LOG -- which is much easier now that we have a robust approach to
page deletion (after 2014 commit efada2b8e9).

-- 
Peter Geoghegan

Вложения

Re: Adding further hardening to nbtree page deletion

От
Peter Geoghegan
Дата:
On Fri, Jun 16, 2023 at 2:15 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Attached patch adds additional hardening to nbtree page deletion. It
> makes nbtree VACUUM tolerate a certain sort of cross-page
> inconsistencies in the structure of an index (corruption). VACUUM can
> press on, avoiding an eventual wraparound/xidStopLimit failure in
> environments where nobody notices the problem for an extended period.

My current plan is to commit this in the next couple of days. I'll
backpatch all the way, like last time.

--
Peter Geoghegan



Re: Adding further hardening to nbtree page deletion

От
Andres Freund
Дата:
Hi,

On 2023-06-16 14:15:08 -0700, Peter Geoghegan wrote:
> Attached patch adds additional hardening to nbtree page deletion. It
> makes nbtree VACUUM tolerate a certain sort of cross-page
> inconsistencies in the structure of an index (corruption). VACUUM can
> press on, avoiding an eventual wraparound/xidStopLimit failure in
> environments where nobody notices the problem for an extended period.
>
> This is very similar to my recent commit 5abff197 (though it's even
> closer to commit 5b861baa). Once again we're demoting an ERROR to a
> LOG message, and pressing on with vacuuming. I propose that this patch
> be backpatched all the way, too. The hardening added by the patch
> seems equally well targeted and low risk. It's a parent/child
> inconsistency, as opposed to a sibling inconsistency. Very familiar
> stuff, overall.
>
> [...]
>
> At the time the ERROR was a PANIC. A few years later (in 2010), it was
> demoted to an ERROR (see commit 8fa30f90). And now I want to demote it
> to a LOG -- which is much easier now that we have a robust approach to
> page deletion (after 2014 commit efada2b8e9).

I have no objection to this concrete change (nor have I reviewed it
carefully).


But the further we go down this path, the more important it is that we provide
some way to monitor stuff like this. IME it's not particularly practical to
rely on scanning logs to find such issues at scale. I suspect we ought to add
at least something that makes such "ignored errors" visible from stats.

Greetings,

Andres Freund



Re: Adding further hardening to nbtree page deletion

От
Peter Geoghegan
Дата:
On Tue, Jun 20, 2023 at 10:39 PM Andres Freund <andres@anarazel.de> wrote:
> But the further we go down this path, the more important it is that we provide
> some way to monitor stuff like this. IME it's not particularly practical to
> rely on scanning logs to find such issues at scale. I suspect we ought to add
> at least something that makes such "ignored errors" visible from stats.

I'm in favor of that, of course. We do at least use
ERRCODE_INDEX_CORRUPTED for all of the ERRORs that became LOG messages
in the past several years. That's a start.

FWIW, I'm almost certain that I'll completely run out of ERRORs to
demote to LOGs before too long. In fact, this might very well be the
last ERROR that I ever have to demote to a LOG to harden nbtree
VACUUM. There just aren't that many ERRORs that would benefit from
similar treatment. And most of the individual cases that I've
addressed come up very infrequently in practice.

--
Peter Geoghegan



Re: Adding further hardening to nbtree page deletion

От
Peter Geoghegan
Дата:
On Tue, Jun 20, 2023 at 11:13 PM Peter Geoghegan <pg@bowt.ie> wrote:
> FWIW, I'm almost certain that I'll completely run out of ERRORs to
> demote to LOGs before too long. In fact, this might very well be the
> last ERROR that I ever have to demote to a LOG to harden nbtree
> VACUUM.

Pushed this just now, backpatching all the way.

--
Peter Geoghegan