Обсуждение: post-recovery amcheck expectations

Поиск
Список
Период
Сортировка

post-recovery amcheck expectations

От
Noah Misch
Дата:
Suppose we start with this nbtree (subset of a diagram from verify_nbtree.c):

 *               1
 *           /       \
 *        2     <->     3

We're deleting 2, the leftmost leaf under a leftmost internal page.  After the
MARK_PAGE_HALFDEAD record, the first downlink from 1 will lead to 3, which
still has a btpo_prev pointing to 2.  bt_index_parent_check() complains here:

        /* The first page we visit at the level should be leftmost */
        if (first && !BlockNumberIsValid(state->prevrightlink) && !P_LEFTMOST(opaque))
            ereport(ERROR,
                    (errcode(ERRCODE_INDEX_CORRUPTED),
                     errmsg("the first child of leftmost target page is not leftmost of its level in index \"%s\"",
                            RelationGetRelationName(state->rel)),
                     errdetail_internal("Target block=%u child block=%u target page lsn=%X/%X.",
                                        state->targetblock, blkno,
                                        LSN_FORMAT_ARGS(state->targetlsn))));

One can encounter this if recovery ends between a MARK_PAGE_HALFDEAD record
and its corresponding UNLINK_PAGE record.  See the attached test case.  The
index is actually fine in such a state, right?  I lean toward fixing this by
having amcheck scan left; if left links reach only half-dead or deleted pages,
that's as good as the present child block being P_LEFTMOST.  There's a
different error from bt_index_check(), and I've not yet studied how to fix
that:

  ERROR:  left link/right link pair in index "not_leftmost_pk" not in agreement
  DETAIL:  Block=0 left block=0 left link from block=4294967295.

Alternatively, one could view this as a need for the user to VACUUM between
recovery and amcheck.  The documentation could direct users to "VACUUM
(DISABLE_PAGE_SKIPPING off, INDEX_CLEANUP on, TRUNCATE off)" if not done since
last recovery.  Does anyone prefer that or some other alternative?

For some other amcheck expectations, the comments suggest reliance on the
bt_index_parent_check() ShareLock.  I haven't tried to make test cases for
them, but perhaps recovery can trick them the same way.  Examples:

  errmsg("downlink or sibling link points to deleted block in index \"%s\"",
  errmsg("block %u is not leftmost in index \"%s\"",
  errmsg("block %u is not true root in index \"%s\"",

Thanks,
nm

Вложения

Re: post-recovery amcheck expectations

От
Peter Geoghegan
Дата:
On Wed, Oct 4, 2023 at 7:52 PM Noah Misch <noah@leadboat.com> wrote:
> Suppose we start with this nbtree (subset of a diagram from verify_nbtree.c):
>
>  *               1
>  *           /       \
>  *        2     <->     3
>
> We're deleting 2, the leftmost leaf under a leftmost internal page.  After the
> MARK_PAGE_HALFDEAD record, the first downlink from 1 will lead to 3, which
> still has a btpo_prev pointing to 2.  bt_index_parent_check() complains here:

Thanks for working on this. Your analysis seems sound to me.

When I reviewed the patch that became commit d114cc53, I presented
Alexander with a test case that demonstrated false positive reports of
corruption involving interrupted page splits (not interrupted page
deletions). Obviously I didn't give sufficient thought to this case,
which is analogous.

Might make sense to test the fix for this issue using a similar
approach: by adding custom code that randomly throws errors at a point
that stresses the implementation. I'm referring to the point at which
VACUUM is between the first and second phase of page deletion: right
before (or directly after) _bt_unlink_halfdead_page() is called. That
isn't fundamentally different to the approach from your TAP test, but
seems like it might add some interesting variation.

> One can encounter this if recovery ends between a MARK_PAGE_HALFDEAD record
> and its corresponding UNLINK_PAGE record.  See the attached test case.  The
> index is actually fine in such a state, right?

Yes, it is fine.

FWIW, this feels like it might be related to the fact that (unlike
Lanin & Shasha), we don't make the key space move left; we make it
move right instead (just like page splits). In other words, page
deletion isn't the precise opposite of a page split, which is a bit
awkward.

Note, in particular, that _bt_mark_page_halfdead() doesn't do a
straight delete of the pivot tuple in the parent page that points to
the target page, as you might expect. It actually deletes the right
sibling of the target page's pivot, and then performs an in-place overwrite of
the downlink from the pivot tuple that originally pointed to the
target page. Perhaps this isn't worth going into now, but thought you
might appreciate the context.

Terminology note: we sometimes use "downlink" as a synonym of "pivot
tuple" or even "separator key", which is misleading.

> I lean toward fixing this by
> having amcheck scan left; if left links reach only half-dead or deleted pages,
> that's as good as the present child block being P_LEFTMOST.

Also my preference.

> There's a
> different error from bt_index_check(), and I've not yet studied how to fix
> that:
>
>   ERROR:  left link/right link pair in index "not_leftmost_pk" not in agreement
>   DETAIL:  Block=0 left block=0 left link from block=4294967295.

This looks like this might be a straightforward case of confusing
P_NONE for InvalidBlockNumber (or vice-versa). They're both used to
indicate "no such block" here.

> Alternatively, one could view this as a need for the user to VACUUM between
> recovery and amcheck.  The documentation could direct users to "VACUUM
> (DISABLE_PAGE_SKIPPING off, INDEX_CLEANUP on, TRUNCATE off)" if not done since
> last recovery.  Does anyone prefer that or some other alternative?

I'd rather not go that route. That strikes me as defining the problem
out of existence.

> For some other amcheck expectations, the comments suggest reliance on the
> bt_index_parent_check() ShareLock.  I haven't tried to make test cases for
> them, but perhaps recovery can trick them the same way.  Examples:
>
>   errmsg("downlink or sibling link points to deleted block in index \"%s\"",
>   errmsg("block %u is not leftmost in index \"%s\"",
>   errmsg("block %u is not true root in index \"%s\"",

These are all much older. They're certainly all from before the
relevant checks were first added (by commit d114cc53), and seem much
less likely to be buggy.

These older cases are all cases where we descend directly from an
internal page to one of its child pages. Whereas the problem you've
demonstrated involves traversal across levels *and* across siblings in
newer code. That's quite a bit more complicated, since it requires
that we worry about both phases of page deletion -- not just the
first. That in itself necessitates that we deal with various edge
cases. (The really prominent edge-case is the interrupted page
deletion case, which requires significant handling, but evidently
missed a subtlety with leftmost pages).


--
Peter Geoghegan



Re: post-recovery amcheck expectations

От
Noah Misch
Дата:
On Mon, Oct 09, 2023 at 04:46:26PM -0700, Peter Geoghegan wrote:
> On Wed, Oct 4, 2023 at 7:52 PM Noah Misch <noah@leadboat.com> wrote:

> Might make sense to test the fix for this issue using a similar
> approach: by adding custom code that randomly throws errors at a point
> that stresses the implementation. I'm referring to the point at which
> VACUUM is between the first and second phase of page deletion: right
> before (or directly after) _bt_unlink_halfdead_page() is called. That
> isn't fundamentally different to the approach from your TAP test, but
> seems like it might add some interesting variation.

My initial manual test was like that, actually.

> > I lean toward fixing this by
> > having amcheck scan left; if left links reach only half-dead or deleted pages,
> > that's as good as the present child block being P_LEFTMOST.
> 
> Also my preference.

Done mostly that way, except I didn't accept deleted pages.  Making this work
on !readonly would take more than that, and readonly shouldn't need that.

> > There's a
> > different error from bt_index_check(), and I've not yet studied how to fix
> > that:
> >
> >   ERROR:  left link/right link pair in index "not_leftmost_pk" not in agreement
> >   DETAIL:  Block=0 left block=0 left link from block=4294967295.
> 
> This looks like this might be a straightforward case of confusing
> P_NONE for InvalidBlockNumber (or vice-versa). They're both used to
> indicate "no such block" here.

Roughly so.  It turned out this scenario was passing leftcurrent=P_NONE to
bt_recheck_sibling_links(), causing that function to use BTPageGetOpaque() on
the metapage.

> > For some other amcheck expectations, the comments suggest reliance on the
> > bt_index_parent_check() ShareLock.  I haven't tried to make test cases for
> > them, but perhaps recovery can trick them the same way.  Examples:
> >
> >   errmsg("downlink or sibling link points to deleted block in index \"%s\"",
> >   errmsg("block %u is not leftmost in index \"%s\"",
> >   errmsg("block %u is not true root in index \"%s\"",
> 
> These are all much older. They're certainly all from before the
> relevant checks were first added (by commit d114cc53), and seem much
> less likely to be buggy.

After I fixed the original error, the "block %u is not leftmost" surfaced
next.  The attached patch fixes that, too.  I didn't investigate the others.
The original test was flaky in response to WAL flush timing, but this one
survives thousands of runs.

Вложения

Re: post-recovery amcheck expectations

От
Peter Geoghegan
Дата:
On Fri, Oct 20, 2023 at 8:55 PM Noah Misch <noah@leadboat.com> wrote:
> > > I lean toward fixing this by
> > > having amcheck scan left; if left links reach only half-dead or deleted pages,
> > > that's as good as the present child block being P_LEFTMOST.
> >
> > Also my preference.
>
> Done mostly that way, except I didn't accept deleted pages.  Making this work
> on !readonly would take more than that, and readonly shouldn't need that.

That makes sense to me. I believe that it's not possible to have a
string of consecutive sibling pages that are all half-dead (regardless
of the BlockNumber order of sibling pages, even). But I'd probably
have written the fix in roughly the same way. Although...maybe you
should try to detect a string of half-dead pages? Hard to say if it's
worth the trouble.

Suggest adding a CHECK_FOR_INTERRUPTS() call to the loop, too, just
for good luck.

> After I fixed the original error, the "block %u is not leftmost" surfaced
> next.  The attached patch fixes that, too.  I didn't investigate the others.
> The original test was flaky in response to WAL flush timing, but this one
> survives thousands of runs.

Hmm. Can't argue with that. Your fix seems sound.

--
Peter Geoghegan



Re: post-recovery amcheck expectations

От
Noah Misch
Дата:
On Mon, Oct 23, 2023 at 04:46:23PM -0700, Peter Geoghegan wrote:
> On Fri, Oct 20, 2023 at 8:55 PM Noah Misch <noah@leadboat.com> wrote:
> > > > I lean toward fixing this by
> > > > having amcheck scan left; if left links reach only half-dead or deleted pages,
> > > > that's as good as the present child block being P_LEFTMOST.
> > >
> > > Also my preference.
> >
> > Done mostly that way, except I didn't accept deleted pages.  Making this work
> > on !readonly would take more than that, and readonly shouldn't need that.
> 
> That makes sense to me. I believe that it's not possible to have a
> string of consecutive sibling pages that are all half-dead (regardless
> of the BlockNumber order of sibling pages, even). But I'd probably
> have written the fix in roughly the same way. Although...maybe you
> should try to detect a string of half-dead pages? Hard to say if it's
> worth the trouble.

I imagined a string of half-dead siblings could arise in structure like this:

 *               1
                                                        
 
 *          /    |    \
                                                         
 
 *         4 <-> 2 <-> 3

With events like this:

- DELETE renders blk 4 deletable.
- Crash with concurrent VACUUM, leaving 4 half-dead after having visited 1-4.
- DELETE renders blk 2 deletable.
- Crash with concurrent VACUUM, leaving 2 half-dead after having visited 1-2.

I didn't try to reproduce that, and something may well prevent it.

> Suggest adding a CHECK_FOR_INTERRUPTS() call to the loop, too, just
> for good luck.

Added.  That gave me the idea to check for circular links, like other parts of
amcheck do.  Net diff:

--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -949,11 +949,16 @@ bt_leftmost_ignoring_half_dead(BtreeCheckState *state,
         Page        page = palloc_btree_page(state, reached);
         BTPageOpaque reached_opaque = BTPageGetOpaque(page);
 
+        CHECK_FOR_INTERRUPTS();
+
         /*
-         * _bt_unlink_halfdead_page() writes that side-links will continue to
-         * point to the siblings.  We can easily check btpo_next.
+         * Try to detect btpo_prev circular links.  _bt_unlink_halfdead_page()
+         * writes that side-links will continue to point to the siblings.
+         * Check btpo_next for that property.
          */
-        all_half_dead = P_ISHALFDEAD(reached_opaque) &&
+        all_half_dead = P_ISHALFDEAD(reached_opaque);
+            reached != start &&
+            reached != reached_from &&
             reached_opaque->btpo_next == reached_from;
         if (all_half_dead)
         {



Re: post-recovery amcheck expectations

От
Peter Geoghegan
Дата:
On Mon, Oct 23, 2023 at 7:28 PM Noah Misch <noah@leadboat.com> wrote:
> > That makes sense to me. I believe that it's not possible to have a
> > string of consecutive sibling pages that are all half-dead (regardless
> > of the BlockNumber order of sibling pages, even). But I'd probably
> > have written the fix in roughly the same way. Although...maybe you
> > should try to detect a string of half-dead pages? Hard to say if it's
> > worth the trouble.
>
> I imagined a string of half-dead siblings could arise in structure like this:
>
>  *               1
>  *          /    |    \
>  *         4 <-> 2 <-> 3
>
> With events like this:
>
> - DELETE renders blk 4 deletable.
> - Crash with concurrent VACUUM, leaving 4 half-dead after having visited 1-4.
> - DELETE renders blk 2 deletable.
> - Crash with concurrent VACUUM, leaving 2 half-dead after having visited 1-2.
>
> I didn't try to reproduce that, and something may well prevent it.

FWIW a couple of factors prevent it (in the absence of corruption). These are:

1. Only VACUUM can delete pages, and in general the only possible
source of half-dead pages is an unfortunately timed crash/error within
VACUUM. Each interrupted VACUUM can leave behind at most one half-dead
page.

2. One thing that makes VACUUM back out of deleting an empty page is
the presence of a half-dead right sibling leaf page left behind by
some VACUUM that was interrupted at some point in the past -- see
_bt_rightsib_halfdeadflag() for details.

Obviously, factors 1 and 2 together make three consecutive half-dead
sibling pages impossible. I'm not quite prepared to say that even two
neighboring half-dead sibling pages are an impossibility right now,
but I think that it might well be. Possibly for reasons that are more
accidental than anything else (is the _bt_rightsib_halfdeadflag thing
a "real invariant" or just something we do because we don't want to
add additional complicated handling to nbtpage.c?), so I'll avoid
going into further detail for now.

I'm pointing this out because it argues for softening the wording
about "accept[ing] an arbitrarily-long chain of half-dead,
sibling-linked pages to the left" from your patch.

I was also wondering (mostly to myself) about the relationship (if
any) between the _bt_rightsib_halfdeadflag/_bt_leftsib_splitflag
"invariants" and the bt_child_highkey_check() check. But I don't think
that it makes sense to put that in scope -- your fix seems like a
strict improvement. This relationship is perhaps in scope here to the
limited extent that talking about strings of consecutive half-dead
pages might make it even harder to understand the design of the
bt_child_highkey_check() check. On the other hand...I'm not sure that
I understand every nuance of it myself.

> > Suggest adding a CHECK_FOR_INTERRUPTS() call to the loop, too, just
> > for good luck.
>
> Added.  That gave me the idea to check for circular links, like other parts of
> amcheck do.

Good idea.

--
Peter Geoghegan



Re: post-recovery amcheck expectations

От
Noah Misch
Дата:
On Tue, Oct 24, 2023 at 07:03:34PM -0700, Peter Geoghegan wrote:
> On Mon, Oct 23, 2023 at 7:28 PM Noah Misch <noah@leadboat.com> wrote:
> > > That makes sense to me. I believe that it's not possible to have a
> > > string of consecutive sibling pages that are all half-dead (regardless
> > > of the BlockNumber order of sibling pages, even). But I'd probably
> > > have written the fix in roughly the same way. Although...maybe you
> > > should try to detect a string of half-dead pages? Hard to say if it's
> > > worth the trouble.
> >
> > I imagined a string of half-dead siblings could arise in structure like this:
> >
> >  *               1
> >  *          /    |    \
> >  *         4 <-> 2 <-> 3
> >
> > With events like this:
> >
> > - DELETE renders blk 4 deletable.
> > - Crash with concurrent VACUUM, leaving 4 half-dead after having visited 1-4.
> > - DELETE renders blk 2 deletable.
> > - Crash with concurrent VACUUM, leaving 2 half-dead after having visited 1-2.
> >
> > I didn't try to reproduce that, and something may well prevent it.
> 
> FWIW a couple of factors prevent it (in the absence of corruption). These are:
> 
> 1. Only VACUUM can delete pages, and in general the only possible
> source of half-dead pages is an unfortunately timed crash/error within
> VACUUM. Each interrupted VACUUM can leave behind at most one half-dead
> page.

Agreed.

> 2. One thing that makes VACUUM back out of deleting an empty page is
> the presence of a half-dead right sibling leaf page left behind by
> some VACUUM that was interrupted at some point in the past -- see
> _bt_rightsib_halfdeadflag() for details.
> 
> Obviously, factors 1 and 2 together make three consecutive half-dead
> sibling pages impossible.

Can't it still happen if the sequence of unfortunately timed crashes causes
deletions from left to right?  Take this example, expanding the one above.
Half-kill 4, crash, half-kill 3, crash, half-kill 2 in:

 *                  1
 *           /    / | \    \
 *         4 <-> 3 <-> 2 <-> 1

(That's not to say it has ever happened outside of a test.)



Re: post-recovery amcheck expectations

От
Peter Geoghegan
Дата:
On Tue, Oct 24, 2023 at 8:05 PM Noah Misch <noah@leadboat.com> wrote:
> Can't it still happen if the sequence of unfortunately timed crashes causes
> deletions from left to right?  Take this example, expanding the one above.
> Half-kill 4, crash, half-kill 3, crash, half-kill 2 in:
>
>  *                  1
>  *           /    / | \    \
>  *         4 <-> 3 <-> 2 <-> 1
>
> (That's not to say it has ever happened outside of a test.)

Hmm. Perhaps you're right. I thought that this wasn't possible in part
due to the fact that you'd have to access all of these leaf pages in
the same order each time, without ever passing over a previous
half-dead page. But I suppose that there's nothing stopping the index
tuples from being deleted from each page in an order that leaves open
the possibility of something like this. (It's extremely unlikely, of
course, but that wasn't ever in question.)

I withdraw my suggestion about the wording from your patch. It seems
committable.

Thanks
--
Peter Geoghegan