
BUG #17893: Assert failed in heap_update()/_delete() when FK modified by RI trigger in non-read-committed xact

From: PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      17893
Logged by:          Alexander Lakhin
Email address:      exclusion@gmail.com
PostgreSQL version: 15.2
Operating system:   Ubuntu 22.04
Description:

The following script:
echo "
CREATE TABLE pkt(i int, PRIMARY KEY(i));
CREATE TABLE fkt(i int REFERENCES pkt ON DELETE SET NULL);
" | psql

for ((i=1; i<=50; i++)); do
  echo "INSERT INTO pkt VALUES (1); INSERT INTO fkt VALUES (1);" | psql &
  echo "BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ; DELETE FROM pkt WHERE i = 1; COMMIT;" | psql &
  wait
  grep 'TRAP:' server.log && break
done

triggers an assertion failure:
...
iteration 5
BEGIN
INSERT 0 1
INSERT 0 1
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
connection to server was lost
TRAP: FailedAssertion("!ItemPointerEquals(&oldtup.t_self, &oldtup.t_data->t_ctid)", File: "heapam.c", Line: 3492, PID: 2563451)

with the following stack trace:
Core was generated by `postgres: law regression [local] DELETE
                        '.
Program terminated with signal SIGABRT, Aborted.

warning: Section `.reg-xstate/2521461' in core file too small.
#0  __pthread_kill_implementation (no_tid=0, signo=6,
threadid=139782217279296) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6,
threadid=139782217279296) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=139782217279296) at
./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=139782217279296, signo=signo@entry=6) at
./nptl/pthread_kill.c:89
#3  0x00007f219521a476 in __GI_raise (sig=sig@entry=6) at
../sysdeps/posix/raise.c:26
#4  0x00007f21952007f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x000056068b9e4e2c in ExceptionalCondition (
    conditionName=0x56068ba7f088 "!ItemPointerEquals(&oldtup.t_self,
&oldtup.t_data->t_ctid)", 
    errorType=0x56068ba7e3c9 "FailedAssertion", fileName=0x56068ba7e4e8
"heapam.c", lineNumber=3492)
    at assert.c:69
#6  0x000056068b313dd3 in heap_update (relation=0x7f2195671cb0,
otid=0x7ffe2a4bf2c2, newtup=0x56068d987bd0, 
    cid=1, crosscheck=0x56068d8a0748, wait=true, tmfd=0x7ffe2a4bf368,
lockmode=0x7ffe2a4bf284) at heapam.c:3492
#7  0x000056068b324f02 in heapam_tuple_update (relation=0x7f2195671cb0,
otid=0x7ffe2a4bf2c2, 
    slot=0x56068d98a130, cid=1, snapshot=0x56068d8a0bb8,
crosscheck=0x56068d8a0748, wait=true, 
    tmfd=0x7ffe2a4bf368, lockmode=0x7ffe2a4bf284,
update_indexes=0x7ffe2a4bf281) at heapam_handler.c:327
#8  0x000056068b5ee748 in table_tuple_update (rel=0x7f2195671cb0,
otid=0x7ffe2a4bf2c2, slot=0x56068d98a130, 
    cid=1, snapshot=0x56068d8a0bb8, crosscheck=0x56068d8a0748, wait=true,
tmfd=0x7ffe2a4bf368, 
    lockmode=0x7ffe2a4bf284, update_indexes=0x7ffe2a4bf281) at
../../../src/include/access/tableam.h:1511
#9  0x000056068b5f1cf8 in ExecUpdateAct (context=0x7ffe2a4bf340,
resultRelInfo=0x56068d9861c8, 
    tupleid=0x7ffe2a4bf2c2, oldtuple=0x0, slot=0x56068d98a130,
canSetTag=true, updateCxt=0x7ffe2a4bf280)
    at nodeModifyTable.c:2077
#10 0x000056068b5f2267 in ExecUpdate (context=0x7ffe2a4bf340,
resultRelInfo=0x56068d9861c8, 
    tupleid=0x7ffe2a4bf2c2, oldtuple=0x0, slot=0x56068d98a130,
canSetTag=true) at nodeModifyTable.c:2297
#11 0x000056068b5f4e65 in ExecModifyTable (pstate=0x56068d985fb0) at
nodeModifyTable.c:3798
#12 0x000056068b5b1731 in ExecProcNodeFirst (node=0x56068d985fb0) at
execProcnode.c:464
#13 0x000056068b5a49ad in ExecProcNode (node=0x56068d985fb0) at
../../../src/include/executor/executor.h:259
#14 0x000056068b5a78b4 in ExecutePlan (estate=0x56068d985d40,
planstate=0x56068d985fb0, use_parallel_mode=false, 
    operation=CMD_UPDATE, sendTuples=false, numberTuples=0,
direction=ForwardScanDirection, 
    dest=0x56068bd34d20 <spi_printtupDR>, execute_once=true) at
execMain.c:1636
#15 0x000056068b5a50e4 in standard_ExecutorRun (queryDesc=0x56068d9839f8,
direction=ForwardScanDirection, 
    count=0, execute_once=true) at execMain.c:363
#16 0x000056068b5a4ecf in ExecutorRun (queryDesc=0x56068d9839f8,
direction=ForwardScanDirection, count=0, 
    execute_once=true) at execMain.c:307
#17 0x000056068b610e2d in _SPI_pquery (queryDesc=0x56068d9839f8,
fire_triggers=false, tcount=0) at spi.c:2908
#18 0x000056068b610830 in _SPI_execute_plan (plan=0x56068d97ad00,
options=0x7ffe2a4bf690, 
    snapshot=0x56068bd69440 <SecondarySnapshotData>,
crosscheck_snapshot=0x56068d8a0748, fire_triggers=false)
    at spi.c:2677
#19 0x000056068b60d39b in SPI_execute_snapshot (plan=0x56068d97ad00,
Values=0x7ffe2a4bf760, 
    Nulls=0x7ffe2a4bf960 " \320\225\215\006V", snapshot=0x56068bd69440
<SecondarySnapshotData>, 
    crosscheck_snapshot=0x56068d8a0748, read_only=false,
fire_triggers=false, tcount=0) at spi.c:795
#20 0x000056068b94654a in ri_PerformCheck (riinfo=0x56068d96f890,
qkey=0x7ffe2a4bfa58, qplan=0x56068d97ad00, 
    fk_rel=0x7f2195671cb0, pk_rel=0x7f2195665788, oldslot=0x56068d957510,
newslot=0x0, detectNewRows=true, 
    expect_OK=9) at ri_triggers.c:2408
#21 0x000056068b943b14 in ri_set (trigdata=0x7ffe2a4bff40, is_set_null=true,
tgkind=3) at ri_triggers.c:1185
#22 0x000056068b943398 in RI_FKey_setnull_del (fcinfo=0x7ffe2a4bfd80) at
ri_triggers.c:980
#23 0x000056068b5681a3 in ExecCallTriggerFunc (trigdata=0x7ffe2a4bff40,
tgindx=0, finfo=0x56068d955fd0, 
    instr=0x0, per_tuple_context=0x56068d973b80) at trigger.c:2430
#24 0x000056068b56beaf in AfterTriggerExecute (estate=0x56068d955750,
event=0x56068d971cb0, 
    relInfo=0x56068d955bd8, src_relInfo=0x56068d955bd8,
dst_relInfo=0x56068d955bd8, trigdesc=0x56068d955df0, 
    finfo=0x56068d955fd0, instr=0x0, per_tuple_context=0x56068d973b80,
trig_tuple_slot1=0x0, 
    trig_tuple_slot2=0x0) at trigger.c:4541
#25 0x000056068b56c4c6 in afterTriggerInvokeEvents (events=0x56068d8a09a0,
firing_id=1, estate=0x56068d955750, 
    delete_ok=false) at trigger.c:4780
#26 0x000056068b56ce12 in AfterTriggerEndQuery (estate=0x56068d955750) at
trigger.c:5135
#27 0x000056068b5a52f2 in standard_ExecutorFinish (queryDesc=0x56068d955340)
at execMain.c:438
#28 0x000056068b5a51a8 in ExecutorFinish (queryDesc=0x56068d955340) at
execMain.c:406
#29 0x000056068b81d50a in ProcessQuery (plan=0x56068d942e70, 
    sourceText=0x56068d8797a0 "DELETE FROM pkt WHERE i = 1;", params=0x0,
queryEnv=0x0, dest=0x56068d943850, 
    qc=0x7ffe2a4c03d0) at pquery.c:193
#30 0x000056068b81efe3 in PortalRunMulti (portal=0x56068d8e8030,
isTopLevel=true, setHoldSnapshot=false, 
    dest=0x56068d943850, altdest=0x56068d943850, qc=0x7ffe2a4c03d0) at
pquery.c:1277
#31 0x000056068b81e4c6 in PortalRun (portal=0x56068d8e8030,
count=9223372036854775807, isTopLevel=true, 
    run_once=true, dest=0x56068d943850, altdest=0x56068d943850,
qc=0x7ffe2a4c03d0) at pquery.c:791
#32 0x000056068b8172a1 in exec_simple_query (query_string=0x56068d8797a0
"DELETE FROM pkt WHERE i = 1;")
    at postgres.c:1250
#33 0x000056068b81c1c0 in PostgresMain (dbname=0x56068d8a5ea8 "regression",
username=0x56068d8a5e88 "law")
    at postgres.c:4593
#34 0x000056068b7414e1 in BackendRun (port=0x56068d89e980) at
postmaster.c:4511
#35 0x000056068b740d68 in BackendStartup (port=0x56068d89e980) at
postmaster.c:4239
#36 0x000056068b73ccf5 in ServerLoop () at postmaster.c:1806
#37 0x000056068b73c452 in PostmasterMain (argc=3, argv=0x56068d873a00) at
postmaster.c:1478
#38 0x000056068b6301d3 in main (argc=3, argv=0x56068d873a00) at main.c:202

With the cascading delete:
CREATE TABLE fkt(i int REFERENCES pkt ON DELETE CASCADE);
I get an assertion failure inside heap_delete():
iteration 10
BEGIN
INSERT 0 1
INSERT 0 1
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
connection to server was lost
TRAP: FailedAssertion("!(tp.t_data->t_infomask & HEAP_XMAX_INVALID)", File: "heapam.c", Line: 2872, PID: 2564424)

The first assert appeared with 5db6df0c0, so this failure reproduces on
REL_12_STABLE .. master. On REL_11_STABLE, "ON DELETE SET NULL" triggers an
assert similar to the one present in heap_delete().
The same failure is observed with the serializable isolation level.

Initial analysis of the first case shows that inside heap_update(), for an
assert-triggering tuple, HeapTupleSatisfiesUpdate() returns TM_Ok (because
the tuple's xmin is committed and xmax is invalid), but
HeapTupleSatisfiesVisibility() returns false (because
XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot) == true).
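
To illustrate the difference, here is a toy standalone model (a deliberate
simplification, not PostgreSQL's actual visibility code; for one thing, the
real XidInMVCCSnapshot() also consults the snapshot's in-progress XID array,
which is omitted here):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t ToyXid;

    /* Toy snapshot: transactions with xid >= xmin were still treated as
       running when the snapshot was taken. */
    typedef struct { ToyXid xmin; ToyXid xmax; } ToySnapshot;

    /* HeapTupleSatisfiesUpdate-style check: asks only whether the inserter
       has committed by now (hint bits / clog), ignoring any snapshot. */
    static bool toy_satisfies_update(bool xmin_committed)
    {
        return xmin_committed;      /* corresponds to TM_Ok */
    }

    /* Snapshot visibility additionally requires the inserter to predate
       the snapshot. */
    static bool toy_visible_in_snapshot(ToyXid tuple_xmin, bool xmin_committed,
                                        const ToySnapshot *snap)
    {
        return xmin_committed && tuple_xmin < snap->xmin;
    }

    int main(void)
    {
        /* Values from the second log entry below: the inserter (xid 736)
           has committed, but the crosscheck snapshot was taken while it
           was still running. */
        ToySnapshot crosscheck = { 736, 736 };

        printf("satisfies_update:    %d\n", toy_satisfies_update(true));  /* 1 */
        printf("visible_in_snapshot: %d\n",
               toy_visible_in_snapshot(736, true, &crosscheck));          /* 0 */
        return 0;
    }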

Debug logging added to heap_update() shows:
2023-04-12 12:53:40.923 MSK|law|regression|64367fa4.259930|LOG:
heap_update() HeapTupleSatisfiesUpdate(&oldtup, cid, buffer): 0, crosscheck:
0x56476f5dc698, HeapTupleSatisfiesVisibility(&oldtup, crosscheck, buffer):
1, HeapTupleHeaderXminCommitted(oldtup.t_data): 1,
oldtup.t_data->t_infomask: 00000900,
HeapTupleHeaderXminFrozen(oldtup.t_data): 0,
XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(oldtup.t_data), crosscheck)): 0,
HeapTupleHeaderGetRawXmin(oldtup.t_data): 731, crosscheck->xmin: 732,
crosscheck->xmax: 732
^^^ a valid situation

2023-04-12 12:53:40.931 MSK|law|regression|64367fa4.25993a|LOG:
heap_update() HeapTupleSatisfiesUpdate(&oldtup, cid, buffer): 0, crosscheck:
0x56476f5dc698, HeapTupleSatisfiesVisibility(&oldtup, crosscheck, buffer):
0, HeapTupleHeaderXminCommitted(oldtup.t_data): 1,
oldtup.t_data->t_infomask: 00000900,
HeapTupleHeaderXminFrozen(oldtup.t_data): 0,
XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(oldtup.t_data), crosscheck)): 1,
HeapTupleHeaderGetRawXmin(oldtup.t_data): 736, crosscheck->xmin: 736,
crosscheck->xmax: 736
^^^ leads to the assertion failure

So in the latter case the tuple's xmin is not committed according to the
crosscheck snapshot. Meanwhile, a comment in nodeModifyTable.c says:
     * Note: if es_crosscheck_snapshot isn't InvalidSnapshot, we check that
     * the row to be updated is visible to that snapshot, and throw a
     * can't-serialize error if not. ...

And with a non-assert build I indeed get:
ERROR:  could not serialize access due to concurrent update
in these cases.


12.04.2023 15:00, PG Bug reporting form wrote:
> The following bug has been logged on the website:
>
> Bug reference:      17893
>
> The following script:
> ...
>
> triggers an assertion failure:

I've come across this assertion failure again, this time from another angle,
so I have one more reason to move forward with fixing the issue.

First, I've found that the two instances of the "additional check for
transaction-snapshot mode RI updates" in heap_update() and heap_delete()
are not covered by the existing regression tests at all, so I would like to
propose an addition to the fk-snapshot test (see attached). It increases
coverage and triggers the assertion failures in both functions.

Second, it looks like Assert(!(tp.t_data->t_infomask & HEAP_XMAX_INVALID))
was added to make sure that tmfd->xmax = HeapTupleHeaderGetUpdateXid(tp.t_data);
(which ends up just returning t_xmax) yields a valid xmax. Indeed, in the
problematic case we get zero in tmfd->xmax. But ExecDelete() and ExecUpdate() do:
...
        switch (result)
...
            case TM_Updated:
...
                if (IsolationUsesXactSnapshot())
                    ereport(ERROR,
                            (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
                             errmsg("could not serialize access due to concurrent update")));

before any processing of tmfd. So it seems like the invalid xmax is not
an issue in this case.
Thus, maybe
         Assert(!(tp.t_data->t_infomask & HEAP_XMAX_INVALID))
should be changed to
         Assert(!(tp.t_data->t_infomask & HEAP_XMAX_INVALID) ||
                (result == TM_Updated && crosscheck != InvalidSnapshot));
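
For context, here is a condensed sketch of the failure-reporting tail of
heap_delete() with the relaxed assertion in place (a simplification for
illustration, not an exact excerpt of the source):

    /*
     * result is TM_Updated here either because of a real concurrent
     * updater or because the crosscheck snapshot rejected the tuple;
     * only in the latter case may xmax legitimately still be invalid.
     */
    Assert(!(tp.t_data->t_infomask & HEAP_XMAX_INVALID) ||
           (result == TM_Updated && crosscheck != InvalidSnapshot));
    tmfd->ctid = tp.t_data->t_ctid;
    tmfd->xmax = HeapTupleHeaderGetUpdateXid(tp.t_data); /* zero for invalid xmax */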

Third, with this change applied I immediately got a failure of the next
assert in heap_delete():
             Assert(!ItemPointerEquals(&oldtup.t_self, &oldtup.t_data->t_ctid));
which was introduced by 5db6df0c0.

(By the way, I've noticed a typo introduced by that commit:
"t_xmax, and, if possible, and, if possible, t_cmax".)

It's not clear to me yet which anomalous condition that Assert is supposed to
detect, specifically in heap_update(), inside this branch:
         /* Perform additional check for transaction-snapshot mode RI updates */
         if (!HeapTupleSatisfiesVisibility(&oldtup, crosscheck, buffer))
         {
             result = TM_Updated;
             Assert(!ItemPointerEquals(&oldtup.t_self, &oldtup.t_data->t_ctid));
         }

Andres, could you shed some light on this, please?

Best regards,
Alexander
02.09.2023 16:00, Alexander Lakhin wrote:
>
> Third, with this change applied I immediately got a failure of the next
> assert in heap_delete():
>             Assert(!ItemPointerEquals(&oldtup.t_self, &oldtup.t_data->t_ctid));
> which was introduced by 5db6df0c0.
Sorry for my mistake here. I had cited the wrong line. It should be:
         Assert(result != TM_Updated ||
                !ItemPointerEquals(&tp.t_self, &tp.t_data->t_ctid));

As I still can't see which cases those asserts are intended for, but do see
cases where they are falsified, I propose removing them.
Please look at the complete patch attached.

Best regards,
Alexander
On 23/10/2023 09:00, Alexander Lakhin wrote:
> Third, with this change applied I immediately got a failure of the next
> assert in heap_delete():
> [added more context to show the code better]
>     if (crosscheck != InvalidSnapshot && result == TM_Ok)
>     {
>         /* Perform additional check for transaction-snapshot mode RI updates */
>         if (!HeapTupleSatisfiesVisibility(&oldtup, crosscheck, buffer))
>         {
>             result = TM_Updated;
>             Assert(!ItemPointerEquals(&oldtup.t_self, &oldtup.t_data->t_ctid));
>         }
>     }

Yeah, that Assert is wrong and should be removed. I think it's trying to 
assert that if we are deleting a tuple and the visibility check fails, 
it must be because it was concurrently updated, not because the 
inserting XID is not visible. But that doesn't hold for this additional 
check with RI updates; the inserting XID might well be invisible to the 
crosscheck snapshot.
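
To make the intended invariant concrete, here's a toy standalone model
(not PostgreSQL's actual structures; real item pointers and heap tuple
headers are more involved):

    #include <stdbool.h>
    #include <stdio.h>

    /* Toy TID: block and offset identifying a tuple version. */
    typedef struct { unsigned blk, off; } ToyTid;

    /* Toy header: t_ctid points at the newer version after an UPDATE,
       or back at the tuple itself if it was never updated. */
    typedef struct { ToyTid t_self, t_ctid; } ToyTuple;

    static bool toy_tid_equals(ToyTid a, ToyTid b)
    {
        return a.blk == b.blk && a.off == b.off;
    }

    int main(void)
    {
        ToyTuple updated   = { {0, 1}, {0, 2} };  /* successor at (0,2) */
        ToyTuple untouched = { {0, 3}, {0, 3} };  /* never updated */

        /* The Assert expects the "updated" shape whenever result is
           TM_Updated; the crosscheck path produces TM_Updated for the
           "untouched" shape too, which is what fires it. */
        printf("updated:   t_self != t_ctid -> %d\n",
               !toy_tid_equals(updated.t_self, updated.t_ctid));      /* 1 */
        printf("untouched: t_self != t_ctid -> %d\n",
               !toy_tid_equals(untouched.t_self, untouched.t_ctid));  /* 0 */
        return 0;
    }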

>> which was introduced by 5db6df0c0.
> Sorry for my mistake here. I had cited a wrong line. It should be:
>           Assert(result != TM_Updated ||
>                  !ItemPointerEquals(&tp.t_self, &tp.t_data->t_ctid));
> As I still can't see which cases those asserts are intended for, but do see cases
> where they are falsified, I propose removing them.
> Please look at the complete patch attached.

I propose the attached slight variation, which moves the assertions 
before the crosscheck snapshot check. The assertions are correct as they 
are if you don't need to take the crosscheck into account.
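
Roughly, for heap_delete(), the reordering looks like this (a sketch of the
idea only, not the patch text):

    /* Check the invariant against the plain MVCC outcome first... */
    Assert(result != TM_Updated ||
           !ItemPointerEquals(&tp.t_self, &tp.t_data->t_ctid));

    /* ...then let the crosscheck snapshot downgrade TM_Ok. */
    if (crosscheck != InvalidSnapshot && result == TM_Ok)
    {
        /* Perform additional check for transaction-snapshot mode RI updates */
        if (!HeapTupleSatisfiesVisibility(&tp, crosscheck, buffer))
            result = TM_Updated;
    }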

I'm a little worried that the callers of tuple_update() might also get 
confused when tuple_update() returns TM_Updated but xmax is invalid. As 
you said:

> So in the latter case the tuple's xmin is not committed according to the
> crosscheck snapshot. Meanwhile, a comment in nodeModifyTable.c says:
>      * Note: if es_crosscheck_snapshot isn't InvalidSnapshot, we check that
>      * the row to be updated is visible to that snapshot, and throw a
>      * can't-serialize error if not. ...
> 
> 
> And with a non-assert build I really get:
> ERROR:  could not serialize access due to concurrent update
> in these cases.

I think you only get that error at the REPEATABLE READ or SERIALIZABLE 
isolation levels. I don't quite understand how this works. At READ 
COMMITTED level, ISTM that ExecUpdate() or ExecDelete() will proceed 
with EvalPlanQualSlot() and lock the row with table_tuple_lock(). Or 
do we never use the cross-check snapshot in READ COMMITTED mode?
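
For reference, the caller logic I have in mind, condensed from the
ExecDelete()/ExecUpdate() excerpt quoted earlier in the thread (a sketch,
not exact source):

    case TM_Updated:
        if (IsolationUsesXactSnapshot())
            ereport(ERROR,
                    (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
                     errmsg("could not serialize access due to concurrent update")));
        /*
         * READ COMMITTED: re-fetch the row through the EvalPlanQual
         * machinery and lock it with table_tuple_lock(); this path
         * inspects tmfd, which is where an invalid xmax could confuse
         * things if the crosscheck snapshot were ever set here.
         */
        ...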

-- 
Heikki Linnakangas
Neon (https://neon.tech)
Hello Heikki,

27.11.2023 22:23, Heikki Linnakangas wrote:
> I propose the attached slight variation, which moves the assertions before the
> crosscheck snapshot check. The assertions are correct as they are if you don't
> need to take the crosscheck into account.
>

Thank you for paying attention to this!
I agree with your variation, it's more accurate.

> I'm a little worried that the callers of tuple_update() might also get confused
> when tuple_update() returns TM_Updated but xmax is invalid. As you said:
> ...
> I think you only get that error at the REPEATABLE READ or SERIALIZABLE isolation
> levels. I don't quite understand how this works. At READ COMMITTED level, ISTM
> that ExecUpdate() or ExecDelete() will proceed with EvalPlanQualSlot() and lock
> the row with table_tuple_lock(). Or do we never use the cross-check snapshot in
> READ COMMITTED mode?
>

I think so, as I see the following in ri_PerformCheck():
     /*
      * In READ COMMITTED mode, we just need to use an up-to-date regular
      * snapshot, and we will see all rows that could be interesting. But in
      * transaction-snapshot mode, we can't change the transaction snapshot. If
      * the caller passes detectNewRows == false then it's okay to do the query
      * with the transaction snapshot; otherwise we use a current snapshot, and
      * tell the executor to error out if it finds any rows under the current
      * snapshot that wouldn't be visible per the transaction snapshot.  Note
      * that SPI_execute_snapshot will register the snapshots, so we don't need
      * to bother here.
      */
     if (IsolationUsesXactSnapshot() && detectNewRows)
     {
         CommandCounterIncrement();  /* be sure all my own work is visible */
         test_snapshot = GetLatestSnapshot();
         crosscheck_snapshot = GetTransactionSnapshot();
     }

IIUC, that's the only place where crosscheck_snapshot (and in turn
es_crosscheck_snapshot) can get a value other than InvalidSnapshot.
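
Both snapshots then flow into the executor via SPI, as frame #19 of the
backtrace above shows (a sketch of the call; the local variable names are
approximate):

    /*
     * test_snapshot (the current snapshot) drives the query; the
     * crosscheck snapshot only makes the executor raise a serialization
     * error for rows invisible to the transaction snapshot.
     */
    spi_result = SPI_execute_snapshot(qplan, vals, nulls,
                                      test_snapshot, crosscheck_snapshot,
                                      false,    /* read_only */
                                      false,    /* fire_triggers */
                                      limit);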

Best regards,
Alexander



On 28/11/2023 11:00, Alexander Lakhin wrote:
> 27.11.2023 22:23, Heikki Linnakangas wrote:
>> I propose the attached slight variation, which moves the assertions before the crosscheck snapshot check. The
>> assertions are correct as they are if you don't need to take the crosscheck into account.
> 
> Thank you for paying attention to this!
> I agree with your variation, it's more accurate.

Committed and backpatched. Thank you!

-- 
Heikki Linnakangas
Neon (https://neon.tech)