Re: Failure while inserting parent tuple to B-tree is not fun
From | Heikki Linnakangas |
---|---|
Subject | Re: Failure while inserting parent tuple to B-tree is not fun |
Date | |
Msg-id | 5266C3F9.4020803@vmware.com |
In response to | Re: Failure while inserting parent tuple to B-tree is not fun (Andres Freund <andres@2ndquadrant.com>) |
Responses | Re: Failure while inserting parent tuple to B-tree is not fun |
List | pgsql-hackers |
On 22.10.2013 21:25, Andres Freund wrote:
> On 2013-10-22 19:55:09 +0300, Heikki Linnakangas wrote:
>> Splitting a B-tree page is a two-stage process: First, the page is split,
>> and then a downlink for the new right page is inserted into the parent
>> (which might recurse to split the parent page, too). What happens if
>> inserting the downlink fails for some reason? I tried that out, and it turns
>> out that it's not nice.
>>
>> I used this to cause a failure:
>>
>>> --- a/src/backend/access/nbtree/nbtinsert.c
>>> +++ b/src/backend/access/nbtree/nbtinsert.c
>>> @@ -1669,6 +1669,8 @@ _bt_insert_parent(Relation rel,
>>>          _bt_relbuf(rel, pbuf);
>>>      }
>>>
>>> +    elog(ERROR, "fail!");
>>> +
>>>      /* get high key from left page == lowest key on new right page */
>>>      ritem = (IndexTuple) PageGetItem(page,
>>>                                       PageGetItemId(page, P_HIKEY));
>>
>> postgres=# create table foo (i int4 primary key);
>> CREATE TABLE
>> postgres=# insert into foo select generate_series(1, 10000);
>> ERROR: fail!
>>
>> That's not surprising. But when I removed that elog again and restarted the
>> server, I still can't insert. The index is permanently broken:
>>
>> postgres=# insert into foo select generate_series(1, 10000);
>> ERROR: failed to re-find parent key in index "foo_pkey" for split pages 4/5
>>
>> In real life, you would get a failure like this e.g. if you run out of memory
>> or disk space while inserting the downlink to the parent. Although rare in
>> practice, it's no fun if it happens.
>
> Why doesn't the incomplete split mechanism prevent this? Because we do
> not delay checkpoints on the primary and a checkpoint happened just
> before your elog(ERROR) above?

Because there's no recovery involved. The failure I injected (or an out-of-memory or out-of-disk-space in the real world) doesn't cause a PANIC, just an ERROR that rolls back the current transaction, nothing more.
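
To make the failure mode concrete, here is a minimal toy model in C of the two-stage split. It is only a sketch and not nbtree code: `ToyIndex`, `toy_init`, `toy_split` and `toy_has_downlink` are hypothetical names, the "parent" is just an array of page numbers, and the `fail_before_downlink` flag stands in for the injected `elog(ERROR, "fail!")`. The toy assumes that the stage-1 page changes stay in place when the error is raised, which matches the broken index seen after the restart.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_PAGES 8
#define TOY_NO_PAGE   (-1)

/* Hypothetical toy structure; this is not PostgreSQL's page layout. */
typedef struct ToyIndex
{
    int right_sibling[TOY_MAX_PAGES]; /* per-page right-link, or TOY_NO_PAGE */
    int npages;                       /* pages allocated so far */
    int downlinks[TOY_MAX_PAGES];     /* page numbers the parent points to */
    int ndownlinks;
} ToyIndex;

/* Start with a single leaf page (page 0) that the parent points to. */
void
toy_init(ToyIndex *idx)
{
    idx->right_sibling[0] = TOY_NO_PAGE;
    idx->npages = 1;
    idx->downlinks[0] = 0;
    idx->ndownlinks = 1;
}

/* Does the parent hold a downlink to this page? */
bool
toy_has_downlink(const ToyIndex *idx, int pageno)
{
    for (int i = 0; i < idx->ndownlinks; i++)
    {
        if (idx->downlinks[i] == pageno)
            return true;
    }
    return false;
}

/*
 * Split page "left". Returns true if both stages completed, false if the
 * injected failure stopped the split after stage 1.
 */
bool
toy_split(ToyIndex *idx, int left, bool fail_before_downlink)
{
    int right;

    assert(left >= 0 && left < idx->npages);
    assert(idx->npages < TOY_MAX_PAGES);

    /* Stage 1: allocate the right page and link it in as a sibling. */
    right = idx->npages++;
    idx->right_sibling[right] = idx->right_sibling[left];
    idx->right_sibling[left] = right;

    /* Injected failure: return before stage 2; stage 1 is not undone. */
    if (fail_before_downlink)
        return false;

    /* Stage 2: insert the downlink for the right page into the parent. */
    idx->downlinks[idx->ndownlinks++] = right;
    return true;
}
```

Calling `toy_split(&idx, 0, true)` on a freshly initialized index leaves page 1 reachable through the right-link of page 0 while `toy_has_downlink(&idx, 1)` is false. That is the toy counterpart of a split page whose parent key cannot be re-found later.
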
We could put a critical section around the whole recursion that inserts the downlinks, so that you would get a PANIC and the incomplete split mechanism would fix it at recovery. But that would hardly be an improvement.

- Heikki