Обсуждение: _bt_split(), and the risk of OOM before its critical section
Commit 8fa30f906be reduced the elevel of a number of "can't happen" errors from PANIC to ERROR. These were all critical-section-adjacent errors involved in nbtree page splits, and nbtree page deletion. It also established the following convention within _bt_split(), which allowed Tom to keep the length of the critical section just as short as it had always been: /* * origpage is the original page to be split. leftpage is a temporary * buffer that receives the left-sibling data, which will be copied back * into origpage on success. rightpage is the new page that receives the * right-sibling data. If we fail before reaching the critical section, * origpage hasn't been modified and leftpage is only workspace. In * principle we shouldn't need to worry about rightpage either, because it * hasn't been linked into the btree page structure; but to avoid leaving * possibly-confusing junk behind, we are careful to rewrite rightpage as * zeroes before throwing any error. */ The INCLUDE indexes work looks like it subtly broke this, because it allocated memory after the initialization of the right page -- allocating memory can always fail. On the other hand, even when 8fa30f906be went in back in 2010 this "rule" was arguably broken, because we were already calling PageGetTempPage() after the right sibling page is initialized, which palloc()s a full BLCKSZ, which is far more than truncation is every likely to allocate. On the other other hand, it seems to me that the PageGetTempPage() thing might have been okay, because it happens before the high key is inserted on the new right buffer page. The same cannot be said for the way we generate a new high key for the left/old page via suffix truncation, which happens to occur after the right buffer page is first modified by inserted its high key (the original/left page's original high key). I think that there may be a risk that VACUUM's page deletion code will get confused by finding an errant right sibling page from a failed page split when there is a high key. If so, that would be a risk that was introduced in Postgres 11, and made much more likely in practice in Postgres 12. (I haven't got as far as doing an analysis of the risks to page deletion, though. The "fastpath" rightmost page insertion optimization that was also added to Postgres 11 seems like it also might need to be considered here.) -- Peter Geoghegan
On Mon, May 6, 2019 at 12:48 PM Peter Geoghegan <pg@bowt.ie> wrote: > On the other other hand, it seems to me that the PageGetTempPage() > thing might have been okay, because it happens before the high key is > inserted on the new right buffer page. The same cannot be said for the > way we generate a new high key for the left/old page via suffix > truncation, which happens to occur after the right buffer page is > first modified by inserted its high key (the original/left page's > original high key). I think that there may be a risk that VACUUM's > page deletion code will get confused by finding an errant right > sibling page from a failed page split when there is a high key. If so, > that would be a risk that was introduced in Postgres 11, and made much > more likely in practice in Postgres 12. (I haven't got as far as doing > an analysis of the risks to page deletion, though. The "fastpath" > rightmost page insertion optimization that was also added to Postgres > 11 seems like it also might need to be considered here.) It seems like my fears about page deletion were well-founded, at least if you assume that the risk of an OOM at the wrong time is greater than negligible. If I simulate an OOM error during suffix truncation, then non-rightmost page splits leave the tree in a state that confuses VACUUM/page deletion. When I simulate an OOM on page 42, we will later see the dreaded "failed to re-find parent key in index "foo" for deletion target page 42" error message from a VACUUM. That's not good. It doesn't matter if the same things happens when splitting a rightmost page, which naturally doesn't insert a new high key on the new right half. This confirms my theory that the PageGetTempPage() memory allocation can fail without confusing VACUUM, since that allocation occurs before the critical-but-not-critical point (the point that we really start to modify the new right half of the split). Fortunately, this bug seems easy enough to fix: we can simply move the "insert new high key on right page" code so that it comes after suffix truncation. This makes it safe for suffix truncation to have an OOM, or at least as safe as the PageGetTempPage() allocation that seems safe to me. -- Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes: > Commit 8fa30f906be reduced the elevel of a number of "can't happen" > errors from PANIC to ERROR. These were all critical-section-adjacent > errors involved in nbtree page splits, and nbtree page deletion. It > also established the following convention within _bt_split(), which > allowed Tom to keep the length of the critical section just as short > as it had always been: > /* > * origpage is the original page to be split. leftpage is a temporary > * buffer that receives the left-sibling data, which will be copied back > * into origpage on success. rightpage is the new page that receives the > * right-sibling data. If we fail before reaching the critical section, > * origpage hasn't been modified and leftpage is only workspace. In > * principle we shouldn't need to worry about rightpage either, because it > * hasn't been linked into the btree page structure; but to avoid leaving > * possibly-confusing junk behind, we are careful to rewrite rightpage as > * zeroes before throwing any error. > */ > The INCLUDE indexes work looks like it subtly broke this, because it > allocated memory after the initialization of the right page -- > allocating memory can always fail. Yeah, as _bt_split is currently coded, _bt_truncate has to be a "no errors" function, which it isn't. The pfree for its result is being done in an ill-chosen place, too. Another problem now that I look at it is that the _bt_getbuf for the right sibling is probably not too safe. And the _bt_vacuum_cycleid() call seems a bit dangerous from this standpoint as well. > On the other hand, even when > 8fa30f906be went in back in 2010 this "rule" was arguably broken, > because we were already calling PageGetTempPage() after the right > sibling page is initialized, which palloc()s a full BLCKSZ, which is > far more than truncation is every likely to allocate. I'm not really concerned about that one because at that point the right page is still in a freshly-pageinit'd state. It's perhaps not quite as nice as having it be zeroes, but it won't look like it has any interesting data. (But, having said that, we could certainly reorder the code to construct the temp page first.) In any case, once we've started to fill the ropaque area, it would really be better if we don't call anything that could throw errors. Maybe we should bite the bullet and use two temp pages, so that none of the data ends up in the shared buffer arena until we reach the critical section? The extra copying is slightly annoying, but it certainly seems like enforcing this invariant over such a long stretch of code is not very maintainable. regards, tom lane
On Mon, May 6, 2019 at 3:29 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Yeah, as _bt_split is currently coded, _bt_truncate has to be a "no > errors" function, which it isn't. The pfree for its result is being > done in an ill-chosen place, too. I am tempted to move the call to _bt_truncate() out of _bt_split() entirely on HEAD, possibly relocating it to nbtsplitloc.c/_bt_findsplitloc(). That way, there is a clearer separation between how split points are chosen, suffix truncation, and the mechanical process of executing a legal page split. > Another problem now that I look at it is that the _bt_getbuf for the right > sibling is probably not too safe. And the _bt_vacuum_cycleid() call seems > a bit dangerous from this standpoint as well. Yeah, we can tighten those up without much difficulty. > I'm not really concerned about that one because at that point the > right page is still in a freshly-pageinit'd state. It's perhaps > not quite as nice as having it be zeroes, but it won't look like > it has any interesting data. The important question is how VACUUM will recognize it. It's clearly not as bad as something that causes "failed to re-find parent key" errors, but I think that VACUUM might not be reclaiming it for the FSM (haven't checked). Note that _bt_unlink_halfdead_page() is perfectly happy to ignore the fact that the left sibling of a half-dead page has a rightlink that doesn't point back to the target. Because, uh, there might have been a concurrent page deletion, somehow. We have heard a lot about "failed to re-find parent key" errors from VACUUM before now because that is about the only strong cross-check that it does. (Not that I'm arguing that we need more of that.) > In any case, once we've started to fill the ropaque area, it would really > be better if we don't call anything that could throw errors. > > Maybe we should bite the bullet and use two temp pages, so that none > of the data ends up in the shared buffer arena until we reach the > critical section? The extra copying is slightly annoying, but > it certainly seems like enforcing this invariant over such a > long stretch of code is not very maintainable. While I think that the smarts we have around deciding a split point will probably improve in future releases, and that we'll probably make _bt_truncate() itself do more, the actual business of performing a split has no reason to change that I can think of. I would like to keep _bt_split() as simple as possible anyway -- it should only be copying tuples using simple primitives like the bufpage.c routines. Living with what we have now (not using a temp buffer for the right page) seems fine. -- Peter Geoghegan
On Mon, May 6, 2019 at 4:11 PM Peter Geoghegan <pg@bowt.ie> wrote: > The important question is how VACUUM will recognize it. It's clearly > not as bad as something that causes "failed to re-find parent key" > errors, but I think that VACUUM might not be reclaiming it for the FSM > (haven't checked). Note that _bt_unlink_halfdead_page() is perfectly > happy to ignore the fact that the left sibling of a half-dead page has > a rightlink that doesn't point back to the target. Because, uh, there > might have been a concurrent page deletion, somehow. VACUUM asserts P_FIRSTDATAKEY(opaque) > PageGetMaxOffsetNumber(page) within _bt_mark_page_halfdead(), but doesn't test that condition in release builds. This means that the earliest modifications of the right page, before the high key PageAddItem(), are enough to cause a subsequent "failed to re-find parent key" failure in VACUUM. Merely setting the sibling blocks in the right page special area is enough to cause VACUUM to refuse to run. Of course, the problem goes away if you restart the database, because the right page buffer is never marked dirty, and never can be. That factor would probably make the problem appear to be an intermittent issue in the kinds of environments where it is most likely to be seen. -- Peter Geoghegan
On Mon, May 6, 2019 at 5:15 PM Peter Geoghegan <pg@bowt.ie> wrote: > VACUUM asserts P_FIRSTDATAKEY(opaque) > PageGetMaxOffsetNumber(page) > within _bt_mark_page_halfdead(), but doesn't test that condition in > release builds. This means that the earliest modifications of the > right page, before the high key PageAddItem(), are enough to cause a > subsequent "failed to re-find parent key" failure in VACUUM. Merely > setting the sibling blocks in the right page special area is enough to > cause VACUUM to refuse to run. To be clear, my point here was that this confirms what you said about PageGetTempPage() failing after _bt_getbuf() has initialized the buffer for the new right page -- that is not in itself a problem. However, practically any other change to the right page that might occur before an error is raised within _bt_split() is a problem -- not just adding a new item. (You were right about that, too.) -- Peter Geoghegan
On Mon, May 6, 2019 at 4:11 PM Peter Geoghegan <pg@bowt.ie> wrote: > I am tempted to move the call to _bt_truncate() out of _bt_split() > entirely on HEAD, possibly relocating it to > nbtsplitloc.c/_bt_findsplitloc(). That way, there is a clearer > separation between how split points are chosen, suffix truncation, and > the mechanical process of executing a legal page split. I decided against that -- better to make it clear how truncation deals with space overhead within _bt_split(). Besides, the resource management around sharing a maybe-palloc()'d high key across module boundaries seems complicated, and best avoided. Attached draft patch for HEAD fixes the bug by organizing _bt_split() into clear phases. _bt_split() now works as follows, which is a little different: * An initial phase that is entirely concerned with the left page temp buffer itself -- initializes its special area. * Suffix truncation to get left page's new high key, and then add it to left page. * A phase that is mostly concerned with initializing the right page special area, but also finishes off one or two details about the left page that needed to be delayed. This is also where the "shadow critical section" begins. Note also that this is where _bt_vacuum_cycleid() is called, because its contract actually *requires* that caller has a buffer lock on both pages at once. This should not be changed on the grounds that _bt_vacuum_cycleid() might fail (nor for any other reason). * Add new high key to right page if needed. (No change, other than the fact that it happens later now.) * Add other items to both leftpage and rightpage. Critical section that copies leftpage into origpage buffer. (No changes here.) I suppose I'm biased, but I prefer the new approach anyway. Adding the left high key first, and then the right high key seems simpler and more logical. It emphasizes the similarities and differences between leftpage and rightpage. Furthermore, this approach fixes the theoretical risk of leaving behind a minimally-initialized nbtree page that has existed since 2010. We don't allocated *any* memory after the point that a new rightpage buffer is acquired. I suppose that this will need to be backpatched. Thoughts? -- Peter Geoghegan
Вложения
On Tue, May 7, 2019 at 6:15 PM Peter Geoghegan <pg@bowt.ie> wrote: > I suppose I'm biased, but I prefer the new approach anyway. Adding the > left high key first, and then the right high key seems simpler and > more logical. It emphasizes the similarities and differences between > leftpage and rightpage. I came up with a better way of doing it in the attached revision. Now, _bt_split() calls _bt_findsplitloc() directly. This makes it possible to significantly simplify the signature of _bt_split(). It makes perfect sense for _bt_split() to call _bt_findsplitloc() directly, since _bt_findsplitloc() is already aware of almost every _bt_split() implementation detail, whereas those same details are not of interest anywhere else. _bt_findsplitloc() also knows all about suffix truncation. It's also nice that the actual _bt_truncate() call is closely tied to the _bt_findsplitloc() call. -- Peter Geoghegan
Вложения
On Wed, May 8, 2019 at 3:37 PM Peter Geoghegan <pg@bowt.ie> wrote: > It makes perfect sense for _bt_split() to call _bt_findsplitloc() > directly, since _bt_findsplitloc() is already aware of almost every > _bt_split() implementation detail, whereas those same details are not > of interest anywhere else. I discovered that it even used to work like that until 1997, when commit 71b3e93c505 added handling of duplicate index tuples. Tom ripped the duplicate handling stuff out a couple of years later, for what seemed to me to be very good reasons, but _bt_findsplitloc() remained outside of _bt_split() until now. I intend to push ahead with the fix for both v11 and HEAD on Monday. -- Peter Geoghegan