Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

Поиск
Список
Период
Сортировка
От Tomas Vondra
Тема Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)
Дата
Msg-id b3d74bba-743f-27fe-d098-841c8eb64562@enterprisedb.com
обсуждение исходный текст
Ответ на Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)  (Amit Kapila <amit.kapila16@gmail.com>)
Ответы Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Список pgsql-hackers

On 9/4/22 13:49, Amit Kapila wrote:
> On Sun, Sep 4, 2022 at 4:34 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> I've been running some valgrind tests on rpi4/aarch64, and I get a crash
>> in test_decoding ddl test in ~50% runs. I don't see the same failure
>> without valgrind or on 32-bit system (hundreds of runs, no crashes), so
>> I suspect this is a race condition, and with valgrind the timing changes
>> in a way to make it more likely.
>>
>> The crash always happens in the "ddl" test. The backtrace always looks
>> like this:
>>
>>     (ExceptionalCondition+0x98)[0x8f6f7c]
>>     (+0x57a7ec)[0x6827ec]
>>     (+0x579edc)[0x681edc]
>>     (ReorderBufferAddNewTupleCids+0x60)[0x686758]
>>     (SnapBuildProcessNewCid+0x94)[0x68b920]
>>     (heap2_decode+0x17c)[0x671584]
>>     (LogicalDecodingProcessRecord+0xbc)[0x670cd0]
>>     (+0x570f88)[0x678f88]
>>     (pg_logical_slot_get_changes+0x1c)[0x6790fc]
>>     (ExecMakeTableFunctionResult+0x29c)[0x4a92c0]
>>     (+0x3be638)[0x4c6638]
>>     (+0x3a2c14)[0x4aac14]
>>     (ExecScan+0x8c)[0x4aaca8]
>>     (+0x3bea14)[0x4c6a14]
>>     (+0x39ea60)[0x4a6a60]
>>     (+0x392378)[0x49a378]
>>     (+0x39520c)[0x49d20c]
>>     (standard_ExecutorRun+0x214)[0x49aad8]
>>     (ExecutorRun+0x64)[0x49a8b8]
>>     (+0x62f53c)[0x73753c]
>>     (PortalRun+0x27c)[0x737198]
>>     (+0x627e78)[0x72fe78]
>>     (PostgresMain+0x9a0)[0x73512c]
>>     (+0x547be8)[0x64fbe8]
>>     (+0x547540)[0x64f540]
>>     (+0x542d30)[0x64ad30]
>>     (PostmasterMain+0x1460)[0x64a574]
>>     (+0x418888)[0x520888]
>>
>> I'm unable to get a better backtrace from the valgrind-produces core
>> usign gdb, for some reason.
>>
>> However, I've modified AssertTXNLsnOrder() - which is where the assert
>> is checked - to also dump toplevel_by_lsn instead of just triggering the
>> assert, and the result is always like this:
>>
>>     WARNING:  ==============================================
>>     WARNING:  txn xid 849 top 0 first 30264752 0/1CDCDB0 final 0 0/0
>>     WARNING:  txn xid 848 top 0 first 30264752 0/1CDCDB0 final 0 0/0
>>     WARNING:  ==============================================
>>
>> The LSNs change a bit between the runs, but the failing transactions are
>> always 848 and 849. Also, both transactions have exactly the same info.
>>
>> But the very first WAL record for 849 is
>>
>>     ASSIGNMENT xtop 848: subxacts: 849
>>
>> so it's strange 849 is in the toplevel_by_lsn list at all, because it
>> clearly is a subxact of 848.
>>
> 
> There is no guarantee that toplevel_by_lsn won't have subxact.

I don't think that's quite true - toplevel_by_lsn should not contain any
*known* subxacts. Yes, we may initially add a subxact to the list, but
as soon as we get assignment record, it should be removed. See what
ReorderBufferAssignChild does.

And in this case the ASSIGNMENT is the first WAL record we get for 849
(in fact, isn't that guaranteed since 7259736a6e?), so we know from the
very beginning 849 is a subxact.


> As per
> my understanding, the problem I reported in the email [1] is the same
> and we have seen this in BF failures as well. I posted a way to
> reproduce it in that email. It seems this is possible if the decoding
> gets XLOG_HEAP2_NEW_CID as the first record (belonging to a
> subtransaction) after XLOG_RUNNING_XACTS.
> 

Interesting. That's certainly true for WAL in the crashing case:

rmgr: Standby     len (rec/tot):     58/    58, tx:          0, lsn:
0/01CDCD70, prev 0/01CDCD10, desc: RUNNING_XACTS nextXid 850
latestCompletedXid 847 oldestRunningXid 848; 1 xacts: 848
rmgr: Heap2       len (rec/tot):     60/    60, tx:        849, lsn:
0/01CDCDB0, prev 0/01CDCD70, desc: NEW_CID rel 1663/16384/1249; tid
58/38; cmin: 1, cmax: 14, combo: 6


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Amit Kapila
Дата:
Сообщение: Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)
Следующее
От: Ranier Vilela
Дата:
Сообщение: Re: Fix typo function circle_same (src/backend/utils/adt/geo_ops.c)