Discussion: backup server core when redo btree_xlog_insert that type is XLOG_BTREE_INSERT_POST
My database version is PostgreSQL 13.2. The backup (standby) server core dumps while redoing a btree_xlog_insert record of type XLOG_BTREE_INSERT_POST.
#0  0x00002aab66695d86 in __memmove_ssse3 () from /lib64/libc.so.6
#1  0x00000000004f5574 in _bt_swap_posting (newitem=0x125d998, oposting=0x2aabcd55dcc8, postingoff=13) at nbtdedup.c:796
#2  0x000000000050cf08 in btree_xlog_insert (isleaf=true, ismeta=false, posting=true, record=0x122d7b8) at nbtxlog.c:224
#3  0x000000000050ed53 in btree_redo (record=0x122d7b8) at nbtxlog.c:969
#4  0x00000000005487b5 in StartupXLOG () at xlog.c:7320
#5  0x00000000008292ab in StartupProcessMain () at startup.c:204
#6  0x000000000055c916 in AuxiliaryProcessMain (argc=2, argv=0x7fff071f8a00) at bootstrap.c:443
#7  0x000000000082820d in StartChildProcess (type=StartupProcess) at postmaster.c:5492
#8  0x00000000008236ce in PostmasterMain (argc=1, argv=0x11fb2e0) at postmaster.c:1404
#9  0x00000000007398f7 in main (argc=1, argv=0x11fb2e0) at main.c:210
(gdb) f 1
#1  0x00000000004f5574 in _bt_swap_posting (newitem=0x125d998, oposting=0x2aabcd55dcc8, postingoff=13) at nbtdedup.c:796
796             memmove(replaceposright, replacepos, nmovebytes);
(gdb) p ((Size) ((oposting)->t_info & 0x1FFF))    // oposting index tuple size
$12 = 16
(gdb) p nhtids
$13 = 17
(gdb) p postingoff
$14 = 13
(gdb) p nmovebytes
$15 = 18
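These values are mutually consistent with the formula used in _bt_swap_posting() below: nmovebytes = (nhtids - postingoff - 1) * sizeof(ItemPointerData). A minimal standalone check of the arithmetic (the hard-coded 6 stands in for sizeof(ItemPointerData), which is 6 bytes in PostgreSQL; the other values are taken from the gdb output above):

#include <stdio.h>

int
main(void)
{
	int		nhtids = 17;		/* gdb: $13 */
	int		postingoff = 13;	/* gdb: $14 */
	int		tid_size = 6;		/* stand-in for sizeof(ItemPointerData) */

	/* same formula as _bt_swap_posting(): bytes of TIDs shifted right */
	int		nmovebytes = (nhtids - postingoff - 1) * tid_size;

	/* prints 18, matching gdb's $15 -- yet IndexTupleSize(oposting) is 16 */
	printf("nmovebytes = %d\n", nmovebytes);
	return 0;
}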
IndexTuple
_bt_swap_posting(IndexTuple newitem, IndexTuple oposting, int postingoff)
{
	int			nhtids;
	char	   *replacepos;
	char	   *replaceposright;
	Size		nmovebytes;
	IndexTuple	nposting;

	nhtids = BTreeTupleGetNPosting(oposting);
	/* === observed in gdb: nhtids = 17, postingoff = 13 === */
	Assert(_bt_posting_valid(oposting));
	Assert(postingoff > 0 && postingoff < nhtids);

	/*
	 * Move item pointers in posting list to make a gap for the new item's
	 * heap TID.  We shift TIDs one place to the right, losing original
	 * rightmost TID. (nmovebytes must not include TIDs to the left of
	 * postingoff, nor the existing rightmost/max TID that gets overwritten.)
	 */
	nposting = CopyIndexTuple(oposting);
	/* === IndexTupleSize(oposting) = 16, so nposting is only 16 bytes === */
	replacepos = (char *) BTreeTupleGetPostingN(nposting, postingoff);
	replaceposright = (char *) BTreeTupleGetPostingN(nposting, postingoff + 1);
	nmovebytes = (nhtids - postingoff - 1) * sizeof(ItemPointerData);
	memmove(replaceposright, replacepos, nmovebytes);
	/*
	 * === crash here: nmovebytes = 18, but nposting is only 16 bytes, so the
	 * memmove writes past the end of the copied tuple ===
	 */

	/* ... remainder of function omitted ... */
}
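A 16-byte tuple that claims to hold 17 heap TIDs is internally inconsistent: even with no key data at all, a posting tuple with nhtids TIDs needs at least sizeof(IndexTupleData) + nhtids * sizeof(ItemPointerData) = 8 + 17 * 6 = 110 bytes. A hypothetical defensive check along those lines is sketched below; this is illustration only, not actual PostgreSQL code:

#include "postgres.h"
#include "access/nbtree.h"

/*
 * Hypothetical sanity check (illustration only): reject a posting list
 * tuple whose claimed TID count cannot possibly fit inside its own
 * IndexTupleSize().  With the values from this crash (size 16, nhtids 17)
 * it returns false immediately.
 */
static bool
posting_tuple_size_is_sane(IndexTuple oposting)
{
	Size		tupsize = IndexTupleSize(oposting);
	Size		minsize;
	int			nhtids;

	if (!BTreeTupleIsPosting(oposting))
		return false;

	nhtids = BTreeTupleGetNPosting(oposting);
	minsize = sizeof(IndexTupleData) + nhtids * sizeof(ItemPointerData);

	return tupsize >= minsize;
}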
Re: backup server core when redo btree_xlog_insert that type is XLOG_BTREE_INSERT_POST
On Thu, Nov 21, 2024 at 10:03 AM yuansong <yyuansong@126.com> wrote:
> Should nhtids be less than or equal to IndexTupleSize(oposting)?
> Why is nhtids larger than IndexTupleSize(oposting)? I think there should be an error in the master host writing the wal log.
> Does anyone know when this will happen?

It'll happen whenever there is a certain kind of data corruption.

There were complaints about issues like this in the past. But those complaints seem to have gone away when more hardening was added to the code that runs during original execution (not the REDO routine code, which can only do what it is told to do by the WAL record).

You're using PostgreSQL 13.2, which is a very old point release that lacks this hardening -- the current 13 point release is 13.18, so you're missing a lot. Had you been on a later point release you'd very probably have still had the issue with corruption (which could be from bad hardware), but you likely would have avoided the problem with the REDO routine crashing like this.

--
Peter Geoghegan
There may be something wrong with my previous description: in "Should nhtids be less than or equal to IndexTupleSize(oposting)? Why is nhtids larger than IndexTupleSize(oposting)?", nhtids should have been nmovebytes.
It is normal for nhtids to be either larger or smaller than IndexTupleSize(oposting).
Shouldn't nmovebytes always be smaller than IndexTupleSize(oposting)?
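For a well-formed posting tuple it seems the answer must be yes, which is what makes the values above contradictory. A rough bound, assuming the usual 8-byte IndexTupleData header and 6-byte ItemPointerData:

/*
 * For a well-formed posting tuple:
 *
 *     IndexTupleSize(oposting) >= sizeof(IndexTupleData) + nhtids * sizeof(ItemPointerData)
 *                               = 8 + nhtids * 6
 *
 * while _bt_swap_posting() computes:
 *
 *     nmovebytes = (nhtids - postingoff - 1) * 6  <  nhtids * 6
 *
 * so nmovebytes < IndexTupleSize(oposting) must hold -- unless the tuple's
 * size or its posting-list TID count is corrupt, as here (size 16 with a
 * claimed 17 TIDs).
 */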
The _bt_binsrch_insert function always returns low, but during the posting list search, are there cases where low and mid are unequal? If so, this could lead to a wrong offset in the subsequent _bt_insertonpg call.

Maybe we could fix it like this?
OffsetNumber
_bt_binsrch_insert(Relation rel, BTInsertState insertstate)
{
	......

	while (high > low)
	{
		OffsetNumber mid = low + ((high - low) / 2);

		/* We have low <= mid < high, so mid points at a real slot */
		result = _bt_compare(rel, key, page, mid);

		if (result >= cmpval)
			low = mid + 1;
		else
		{
			high = mid;
			if (result != 0)
				stricthigh = high;
		}

		/*
		 * If tuple at offset located by binary search is a posting list whose
		 * TID range overlaps with caller's scantid, perform posting list
		 * binary search to set postingoff for caller. Caller must split the
		 * posting list when postingoff is set. This should happen
		 * infrequently.
		 */
		if (unlikely(result == 0 && key->scantid != NULL))
		{
			/*
			 * postingoff should never be set more than once per leaf page
			 * binary search. That would mean that there are duplicate table
			 * TIDs in the index, which is never okay. Check for that here.
			 */
			if (insertstate->postingoff != 0)
				ereport(ERROR,
						(errcode(ERRCODE_INDEX_CORRUPTED),
						 errmsg_internal("table tid from new index tuple (%u,%u) cannot find insert offset between offsets %u and %u of block %u in index \"%s\"",
										 ItemPointerGetBlockNumber(key->scantid),
										 ItemPointerGetOffsetNumber(key->scantid),
										 low, stricthigh,
										 BufferGetBlockNumber(insertstate->buf),
										 RelationGetRelationName(rel))));

			insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
			/*
			 * Here, will low and mid ever be unequal?  If low is returned in
			 * such cases, would it result in an error?  Maybe we fix it like
			 * this:
			 */
			/* low = mid; */
			/* break; */
		}
	}

	........

	return low;
}
At 2024-11-27 18:53:20, "yuansong" <yyuansong@126.com> wrote:
We found the crash reason.
We have identified the cause of the crash: the XLOG_BTREE_INSERT_POST WAL record carried an OffsetNumber offnum that was one less than what was stored in the index. I experimented with adding +1, and the index data remained normal in both cases. This issue is likely caused by concurrent operations on the B-tree; reviewing the corresponding WAL records, we found SPLIT_L and INSERT_LEAF operations on the same block before the crash. This might be a bug; I'm not sure if there's a related fix.
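For context on where offnum feeds in on the standby, here is a lightly abridged sketch of the posting branch of btree_xlog_insert() as it looks in PostgreSQL 13 (variables such as page, xlrec and record come from the surrounding function; consult nbtxlog.c for the exact code). The posting tuple to split is looked up at OffsetNumberPrev(xlrec->offnum), so an offnum that is off by one hands _bt_swap_posting() whatever tuple happens to sit at the neighbouring offset:

/* Abridged from the posting==true branch of btree_xlog_insert() */
Size		datalen;
char	   *datapos = XLogRecGetBlockData(record, 0, &datalen);
uint16		postingoff;
ItemId		itemid;
IndexTuple	oposting;
IndexTuple	newitem;
IndexTuple	nposting;

/* WAL record data starts with the posting list split point */
memcpy(&postingoff, datapos, sizeof(uint16));
datapos += sizeof(uint16);
datalen -= sizeof(uint16);

/* the posting tuple to split sits just before the new item's offset */
itemid = PageGetItemId(page, OffsetNumberPrev(xlrec->offnum));
oposting = (IndexTuple) PageGetItem(page, itemid);

/* repeat the posting list split performed on the primary */
newitem = CopyIndexTuple((IndexTuple) datapos);
nposting = _bt_swap_posting(newitem, oposting, postingoff);

/* replace the existing posting list with the post-split version */
memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));

/* insert the final new item at xlrec->offnum */
if (PageAddItem(page, (Item) newitem, datalen, xlrec->offnum,
				false, false) == InvalidOffsetNumber)
	elog(PANIC, "failed to add posting split new item");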
Re: Re:Re:Re: backup server core when redo btree_xlog_insert that type is XLOG_BTREE_INSERT_POST
On Sun, Dec 1, 2024 at 8:11 AM yuansong <yyuansong@126.com> wrote:
> the _bt_binsrch_insert function always returns low, but during the post list search, are there cases where low and mid are unequal?
>
> If so, this could potentially cause an offset in the subsequent _bt_insertonpg function.
>
> maybe we fix it like this ?

Why?

The "if (unlikely(result == 0 && key->scantid != NULL))" block is only hit when we have an "exactly-equal" index tuple -- which means an index tuple with a posting list covering a range of heap TIDs that our scantid is covered by. There should be at most one index tuple like this on any leaf page -- otherwise the index is corrupt.

The hardening that I added for the posting list split stuff was added to 13.4 and 13.5. You're running a very old point release (13.2), so you weren't running a version of the server with that hardening.

--
Peter Geoghegan
Re: Re: Re:Re:Re: backup server core when redo btree_xlog_insert that type is XLOG_BTREE_INSERT_POST
On Sun, Dec 1, 2024 at 10:15 AM yuansong <yyuansong@126.com> wrote:
> I confirmed that there's an issue with prematurely ending the loop. When I enter the if (unlikely(result == 0 && key->scantid != NULL)) block, low = 2, mid = 4, and high = 4. At this point, the offset of insertstate->postingoff is obtained from the mid value, which is 4.

I don't understand why you believe that there is a bug. What bug? Your patch seems to just add a micro-optimization.

Your patch makes the robustness situation worse, not better: it will make it impossible to hit the "if (insertstate->postingoff != 0)" ereport() when (for whatever reason) there is index corruption that somehow leads to a leaf page with more than one "result == 0 && key->scantid != NULL" tuple. In other words, your patch actually *removes* the thing that I added to ameliorate/avoid problems such as your problem with a btree_xlog_insert record containing a spurious offset number during REDO.

Of course there shouldn't be more than one single "if (insertstate->postingoff != 0)" index tuple on a leaf page that we're inserting onto, but, assuming that that invariant somehow becomes violated (due to corruption), why should we then just accept the first such overlapping index tuple within _bt_binsrch_insert? We should be (and are) more careful than that: we make sure that there isn't another "if (insertstate->postingoff != 0)" tuple, in passing -- even though doing so is theoretically unnecessary (in theory the database never has corruption, in practice it might, for many different reasons).

--
Peter Geoghegan
Thank you very much, Peter Geoghegan, for your help. The issue has been resolved for some time now. It was indeed caused by corrupted index data, which resulted from collation differences between the hosts before and after a migration: although the collation names were the same, the differing glibc versions on the servers produced different sort orders, which left the index incorrectly sorted.
However, I noticed that the possible values of insertstate->postingoff are a bit muddled. In _bt_binsrch_posting, when the "if (!BTreeTupleIsPosting(itup)) return 0;" branch is taken, the return value is 0, yet the normal binary search's return value low also starts from 0, so in theory the search itself could return 0 as well. It might therefore be worth representing the non-posting-list case as -1 and the ItemIdIsDead case as -2; any other valid value would then be greater than or equal to 0 and unambiguous.
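A minimal sketch of that convention (the names and numbers below are made up for illustration; they are not part of PostgreSQL):

/* Hypothetical sentinel values for _bt_binsrch_posting()'s return value */
#define POSTINGOFF_NOT_POSTING	(-1)	/* !BTreeTupleIsPosting(itup) */
#define POSTINGOFF_DEAD_ITEM	(-2)	/* ItemIdIsDead(itemid) */

/* Any non-negative return value would then unambiguously be a real posting-list offset. */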
Additionally, I don't think it is a good design for _bt_binsrch_insert to keep searching after it has already found the correct postingoff via the posting-list binary search. The current approach is confusing and inefficient. Moreover, continuing the search just to detect a second postingoff and raise an error seems to have minimal benefit, since it only catches certain cases, and it leaves the function doing unrelated work. If such index anomalies need to be checked, it would be better to do so with external tools rather than in the core search path; a function should have a single responsibility. If you agree with this proposal, I can try to implement the related fix later.
Re: Re: Re:Re:Re: backup server core when redo btree_xlog_insert that type is XLOG_BTREE_INSERT_POST
On Sat, Jan 11, 2025 at 4:40 AM yuansong <yyuansong@126.com> wrote:
> If such index anomalies need to be checked, it would be better to do so using external tools rather than checking during the core search process.

Why? Did you use an external tool?

> If you agree with this proposal, I can try to implement the related fix later.

I strongly disagree.

If the hardening in this function (and similar hardening elsewhere) had been in place on the production server that you were using (if you'd kept up with point releases, rather than staying on 13.2) then you almost certainly wouldn't have had the problem of btree_xlog_insert crashing in the first place. The problem would have been contained, and you'd probably only have needed to REINDEX to fix everything.

But you now propose to *remove* that hardening, because it is "confusing and inefficient". This seems illogical.

--
Peter Geoghegan