Обсуждение: _bt_split(), and the risk of OOM before its critical section

Поиск

Список

Период

Сортировка

_bt_split(), and the risk of OOM before its critical section

От

Peter Geoghegan

Дата:

06 мая 2019 г., 19:48:41

Commit 8fa30f906be reduced the elevel of a number of "can't happen"
errors from PANIC to ERROR. These were all critical-section-adjacent
errors involved in nbtree page splits, and nbtree page deletion. It
also established the following convention within _bt_split(), which
allowed Tom to keep the length of the critical section just as short
as it had always been:

/*
 * origpage is the original page to be split.  leftpage is a temporary
 * buffer that receives the left-sibling data, which will be copied back
 * into origpage on success.  rightpage is the new page that receives the
 * right-sibling data.  If we fail before reaching the critical section,
 * origpage hasn't been modified and leftpage is only workspace. In
 * principle we shouldn't need to worry about rightpage either, because it
 * hasn't been linked into the btree page structure; but to avoid leaving
 * possibly-confusing junk behind, we are careful to rewrite rightpage as
 * zeroes before throwing any error.
 */

The INCLUDE indexes work looks like it subtly broke this, because it
allocated memory after the initialization of the right page --
allocating memory can always fail. On the other hand, even when
8fa30f906be went in back in 2010 this "rule" was arguably broken,
because we were already calling PageGetTempPage() after the right
sibling page is initialized, which palloc()s a full BLCKSZ, which is
far more than truncation is every likely to allocate.

On the other other hand, it seems to me that the PageGetTempPage()
thing might have been okay, because it happens before the high key is
inserted on the new right buffer page. The same cannot be said for the
way we generate a new high key for the left/old page via suffix
truncation, which happens to occur after the right buffer page is
first modified by inserted its high key (the original/left page's
original high key). I think that there may be a risk that VACUUM's
page deletion code will get confused by finding an errant right
sibling page from a failed page split when there is a high key. If so,
that would be a risk that was introduced in Postgres 11, and made much
more likely in practice in Postgres 12. (I haven't got as far as doing
an analysis of the risks to page deletion, though. The "fastpath"
rightmost page insertion optimization that was also added to Postgres
11 seems like it also might need to be considered here.)

-- 
Peter Geoghegan

Re: _bt_split(), and the risk of OOM before its critical section

От

Peter Geoghegan

Дата:

06 мая 2019 г., 21:51:38

On Mon, May 6, 2019 at 12:48 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On the other other hand, it seems to me that the PageGetTempPage()
> thing might have been okay, because it happens before the high key is
> inserted on the new right buffer page. The same cannot be said for the
> way we generate a new high key for the left/old page via suffix
> truncation, which happens to occur after the right buffer page is
> first modified by inserted its high key (the original/left page's
> original high key). I think that there may be a risk that VACUUM's
> page deletion code will get confused by finding an errant right
> sibling page from a failed page split when there is a high key. If so,
> that would be a risk that was introduced in Postgres 11, and made much
> more likely in practice in Postgres 12. (I haven't got as far as doing
> an analysis of the risks to page deletion, though. The "fastpath"
> rightmost page insertion optimization that was also added to Postgres
> 11 seems like it also might need to be considered here.)

It seems like my fears about page deletion were well-founded, at least
if you assume that the risk of an OOM at the wrong time is greater
than negligible.

If I simulate an OOM error during suffix truncation, then
non-rightmost page splits leave the tree in a state that confuses
VACUUM/page deletion. When I simulate an OOM on page 42, we will later
see the dreaded "failed to re-find parent key in index "foo" for
deletion target page 42" error message from a VACUUM. That's not good.

It doesn't matter if the same things happens when splitting a
rightmost page, which naturally doesn't insert a new high key on the
new right half. This confirms my theory that the PageGetTempPage()
memory allocation can fail without confusing VACUUM, since that
allocation occurs before the critical-but-not-critical point (the
point that we really start to modify the new right half of the split).

Fortunately, this bug seems easy enough to fix: we can simply move the
"insert new high key on right page" code so that it comes after suffix
truncation. This makes it safe for suffix truncation to have an OOM,
or at least as safe as the PageGetTempPage() allocation that seems
safe to me.

-- 
Peter Geoghegan

Re: _bt_split(), and the risk of OOM before its critical section

От

Tom Lane

Дата:

06 мая 2019 г., 22:29:11

Peter Geoghegan <pg@bowt.ie> writes:
> Commit 8fa30f906be reduced the elevel of a number of "can't happen"
> errors from PANIC to ERROR. These were all critical-section-adjacent
> errors involved in nbtree page splits, and nbtree page deletion. It
> also established the following convention within _bt_split(), which
> allowed Tom to keep the length of the critical section just as short
> as it had always been:

> /*
>  * origpage is the original page to be split.  leftpage is a temporary
>  * buffer that receives the left-sibling data, which will be copied back
>  * into origpage on success.  rightpage is the new page that receives the
>  * right-sibling data.  If we fail before reaching the critical section,
>  * origpage hasn't been modified and leftpage is only workspace. In
>  * principle we shouldn't need to worry about rightpage either, because it
>  * hasn't been linked into the btree page structure; but to avoid leaving
>  * possibly-confusing junk behind, we are careful to rewrite rightpage as
>  * zeroes before throwing any error.
>  */

> The INCLUDE indexes work looks like it subtly broke this, because it
> allocated memory after the initialization of the right page --
> allocating memory can always fail.

Yeah, as _bt_split is currently coded, _bt_truncate has to be a "no
errors" function, which it isn't.  The pfree for its result is being
done in an ill-chosen place, too.

Another problem now that I look at it is that the _bt_getbuf for the right
sibling is probably not too safe.  And the _bt_vacuum_cycleid() call seems
a bit dangerous from this standpoint as well.

> On the other hand, even when
> 8fa30f906be went in back in 2010 this "rule" was arguably broken,
> because we were already calling PageGetTempPage() after the right
> sibling page is initialized, which palloc()s a full BLCKSZ, which is
> far more than truncation is every likely to allocate.

I'm not really concerned about that one because at that point the
right page is still in a freshly-pageinit'd state.  It's perhaps
not quite as nice as having it be zeroes, but it won't look like
it has any interesting data.  (But, having said that, we could
certainly reorder the code to construct the temp page first.)

In any case, once we've started to fill the ropaque area, it would really
be better if we don't call anything that could throw errors.

Maybe we should bite the bullet and use two temp pages, so that none
of the data ends up in the shared buffer arena until we reach the
critical section?  The extra copying is slightly annoying, but
it certainly seems like enforcing this invariant over such a
long stretch of code is not very maintainable.

            regards, tom lane

Re: _bt_split(), and the risk of OOM before its critical section

От

Peter Geoghegan

Дата:

06 мая 2019 г., 23:11:55

On Mon, May 6, 2019 at 3:29 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yeah, as _bt_split is currently coded, _bt_truncate has to be a "no
> errors" function, which it isn't.  The pfree for its result is being
> done in an ill-chosen place, too.

I am tempted to move the call to _bt_truncate() out of _bt_split()
entirely on HEAD, possibly relocating it to
nbtsplitloc.c/_bt_findsplitloc(). That way, there is a clearer
separation between how split points are chosen, suffix truncation, and
the mechanical process of executing a legal page split.

> Another problem now that I look at it is that the _bt_getbuf for the right
> sibling is probably not too safe.  And the _bt_vacuum_cycleid() call seems
> a bit dangerous from this standpoint as well.

Yeah, we can tighten those up without much difficulty.

> I'm not really concerned about that one because at that point the
> right page is still in a freshly-pageinit'd state.  It's perhaps
> not quite as nice as having it be zeroes, but it won't look like
> it has any interesting data.

The important question is how VACUUM will recognize it. It's clearly
not as bad as something that causes "failed to re-find parent key"
errors, but I think that VACUUM might not be reclaiming it for the FSM
(haven't checked). Note that _bt_unlink_halfdead_page() is perfectly
happy to ignore the fact that the left sibling of a half-dead page has
a rightlink that doesn't point back to the target. Because, uh, there
might have been a concurrent page deletion, somehow.

We have heard a lot about "failed to re-find parent key" errors from
VACUUM before now because that is about the only strong cross-check
that it does. (Not that I'm arguing that we need more of that.)

> In any case, once we've started to fill the ropaque area, it would really
> be better if we don't call anything that could throw errors.
>
> Maybe we should bite the bullet and use two temp pages, so that none
> of the data ends up in the shared buffer arena until we reach the
> critical section?  The extra copying is slightly annoying, but
> it certainly seems like enforcing this invariant over such a
> long stretch of code is not very maintainable.

While I think that the smarts we have around deciding a split point
will probably improve in future releases, and that we'll probably make
_bt_truncate() itself do more, the actual business of performing a
split has no reason to change that I can think of. I would like to
keep _bt_split() as simple as possible anyway -- it should only be
copying tuples using simple primitives like the bufpage.c routines.
Living with what we have now (not using a temp buffer for the right
page) seems fine.

-- 
Peter Geoghegan

Re: _bt_split(), and the risk of OOM before its critical section

От

Peter Geoghegan

Дата:

07 мая 2019 г., 00:15:30

On Mon, May 6, 2019 at 4:11 PM Peter Geoghegan <pg@bowt.ie> wrote:
> The important question is how VACUUM will recognize it. It's clearly
> not as bad as something that causes "failed to re-find parent key"
> errors, but I think that VACUUM might not be reclaiming it for the FSM
> (haven't checked). Note that _bt_unlink_halfdead_page() is perfectly
> happy to ignore the fact that the left sibling of a half-dead page has
> a rightlink that doesn't point back to the target. Because, uh, there
> might have been a concurrent page deletion, somehow.

VACUUM asserts P_FIRSTDATAKEY(opaque) > PageGetMaxOffsetNumber(page)
within _bt_mark_page_halfdead(), but doesn't test that condition in
release builds. This means that the earliest modifications of the
right page, before the high key PageAddItem(), are enough to cause a
subsequent "failed to re-find parent key" failure in VACUUM. Merely
setting the sibling blocks in the right page special area is enough to
cause VACUUM to refuse to run.

Of course, the problem goes away if you restart the database, because
the right page buffer is never marked dirty, and never can be. That
factor would probably make the problem appear to be an intermittent
issue in the kinds of environments where it is most likely to be seen.

-- 
Peter Geoghegan

Re: _bt_split(), and the risk of OOM before its critical section

От

Peter Geoghegan

Дата:

07 мая 2019 г., 00:26:09

On Mon, May 6, 2019 at 5:15 PM Peter Geoghegan <pg@bowt.ie> wrote:
> VACUUM asserts P_FIRSTDATAKEY(opaque) > PageGetMaxOffsetNumber(page)
> within _bt_mark_page_halfdead(), but doesn't test that condition in
> release builds. This means that the earliest modifications of the
> right page, before the high key PageAddItem(), are enough to cause a
> subsequent "failed to re-find parent key" failure in VACUUM. Merely
> setting the sibling blocks in the right page special area is enough to
> cause VACUUM to refuse to run.

To be clear, my point here was that this confirms what you said about
PageGetTempPage() failing after _bt_getbuf() has initialized the
buffer for the new right page -- that is not in itself a problem.
However, practically any other change to the right page that might
occur before an error is raised within _bt_split() is a problem -- not
just adding a new item. (You were right about that, too.)

-- 
Peter Geoghegan

Re: _bt_split(), and the risk of OOM before its critical section

От

Peter Geoghegan

Дата:

08 мая 2019 г., 01:15:01

On Mon, May 6, 2019 at 4:11 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I am tempted to move the call to _bt_truncate() out of _bt_split()
> entirely on HEAD, possibly relocating it to
> nbtsplitloc.c/_bt_findsplitloc(). That way, there is a clearer
> separation between how split points are chosen, suffix truncation, and
> the mechanical process of executing a legal page split.

I decided against that -- better to make it clear how truncation deals
with space overhead within _bt_split(). Besides, the resource
management around sharing a maybe-palloc()'d high key across module
boundaries seems complicated, and best avoided.

Attached draft patch for HEAD fixes the bug by organizing _bt_split()
into clear phases. _bt_split() now works as follows, which is a little
different:

*  An initial phase that is entirely concerned with the left page temp
buffer itself -- initializes its special area.

* Suffix truncation to get left page's new high key, and then add it
to left page.

* A phase that is mostly concerned with initializing the right page
special area, but also finishes off one or two details about the left
page that needed to be delayed. This is also where the "shadow
critical section" begins. Note also that this is where
_bt_vacuum_cycleid() is called, because its contract actually
*requires* that caller has a buffer lock on both pages at once. This
should not be changed on the grounds that _bt_vacuum_cycleid() might
fail (nor for any other reason).

* Add new high key to right page if needed. (No change, other than the
fact that it happens later now.)

* Add other items to both leftpage and rightpage. Critical section
that copies leftpage into origpage buffer. (No changes here.)

I suppose I'm biased, but I prefer the new approach anyway. Adding the
left high key first, and then the right high key seems simpler and
more logical. It emphasizes the similarities and differences between
leftpage and rightpage. Furthermore, this approach fixes the
theoretical risk of leaving behind a minimally-initialized nbtree page
that has existed since 2010. We don't allocated *any* memory after the
point that a new rightpage buffer is acquired.

I suppose that this will need to be backpatched.

Thoughts?
--
Peter Geoghegan

Вложения

v1-0001-Don-t-leave-behind-junk-nbtree-pages-during-split.patch

Re: _bt_split(), and the risk of OOM before its critical section

От

Peter Geoghegan

Дата:

08 мая 2019 г., 22:37:50

On Tue, May 7, 2019 at 6:15 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I suppose I'm biased, but I prefer the new approach anyway. Adding the
> left high key first, and then the right high key seems simpler and
> more logical. It emphasizes the similarities and differences between
> leftpage and rightpage.

I came up with a better way of doing it in the attached revision. Now,
_bt_split() calls _bt_findsplitloc() directly. This makes it possible
to significantly simplify the signature of _bt_split().

It makes perfect sense for _bt_split() to call _bt_findsplitloc()
directly, since _bt_findsplitloc() is already aware of almost every
_bt_split() implementation detail, whereas those same details are not
of interest anywhere else. _bt_findsplitloc() also knows all about
suffix truncation. It's also nice that the actual _bt_truncate() call
is closely tied to the _bt_findsplitloc() call.

-- 
Peter Geoghegan

Вложения

v2-0001-Don-t-leave-behind-junk-nbtree-pages-during-split.patch

Re: _bt_split(), and the risk of OOM before its critical section

От

Peter Geoghegan

Дата:

09 мая 2019 г., 22:32:52

On Wed, May 8, 2019 at 3:37 PM Peter Geoghegan <pg@bowt.ie> wrote:
> It makes perfect sense for _bt_split() to call _bt_findsplitloc()
> directly, since _bt_findsplitloc() is already aware of almost every
> _bt_split() implementation detail, whereas those same details are not
> of interest anywhere else.

I discovered that it even used to work like that until 1997, when
commit 71b3e93c505 added handling of duplicate index tuples. Tom
ripped the duplicate handling stuff out a couple of years later, for
what seemed to me to be very good reasons, but _bt_findsplitloc()
remained outside of _bt_split() until now.

I intend to push ahead with the fix for both v11 and HEAD on Monday.
-- 
Peter Geoghegan

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: _bt_split(), and the risk of OOM before its critical section

Вложения

Вложения