Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)
| From | Tomas Vondra |
|---|---|
| Subject | Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639) |
| Date | |
| Msg-id | c5624ffb-5c15-9b23-c4de-39e8bc384622@enterprisedb.com |
| In reply to | Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639) (Tomas Vondra <tomas.vondra@enterprisedb.com>) |
| Replies | Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639) |
| List | pgsql-hackers |
On 9/4/22 16:08, Tomas Vondra wrote:
> ...
>
> so in fact we *know* 849 is a subxact of 848, but we don't call
> ReorderBufferAssignChild in this case. In fact we can't even do the
> assignment easily in this case, because we create the subxact first, so
> that the crash happens right when we attempt to create the toplevel one,
> and we never even get a chance to do the assignment:
>
> 1) process the NEW_CID record, logged for 849 (subxact)
> 2) process CIDs in the WAL record, which has toplevel_xid 848
>
>
> So IMHO we need to figure out what to do for WAL records that create
> both the toplevel and subxact - either we need to skip them, or rethink
> how we create the ReorderBufferTXN structs.
>

This fixes the crash for me, by adding a ReorderBufferAssignChild call
to SnapBuildProcessNewCid, and tweaking ReorderBufferAssignChild to
ensure we don't try to create the top xact before updating the subxact
and removing it from the toplevel_by_lsn list.

Essentially, what's happening is this:

1) We read the NEW_CID record, which is logged with XID 849, i.e. the
subxact. But we don't know it's a subxact, so we create it as a
top-level xact with the LSN.

2) We start processing the contents of the NEW_CID record, which however
has the info that 849 is a subxact of 848. This calls
ReorderBufferAddNewTupleCids, which promptly does ReorderBufferTXNByXid()
with the top-level XID, which creates it with the same LSN, and crashes
because of the assert.

I'm not sure what's the right/proper way to fix this ...

The problem is that ReorderBufferAssignChild was coded in a way that did
not expect the subxact to be created first (as a top-level xact). And
indeed, if I add Assert(false) to the (!new_sub) branch that converts a
top-level xact to a subxact, check-world still passes. So we never test
this case, but NEW_CID breaks this assumption and creates them in the
opposite order (i.e. subxact first).
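
The two steps above, and the reordering inside ReorderBufferAssignChild, can be modelled with a small self-contained C sketch. This is a toy model under stated assumptions, not the PostgreSQL code or the patch itself: every `Toy*`/`toy_*` name is hypothetical, the top-level list is approximated by an array kept in creation order, and "removing from toplevel_by_lsn" is approximated by setting a flag. It only shows why creating 849 and then 848 at the same LSN violates the strict ordering checked by the assert, and why demoting the subxact first avoids that.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: hypothetical names, not PostgreSQL's real structs or API. */
#define TOY_MAX_TXNS 8

typedef struct ToyTxn
{
    uint32_t xid;
    uint64_t first_lsn;
    bool     is_subxact;    /* true => no longer in the top-level list */
    uint32_t toplevel_xid;  /* valid only when is_subxact is true */
} ToyTxn;

typedef struct ToyBuffer
{
    ToyTxn txns[TOY_MAX_TXNS];  /* kept in creation order */
    size_t ntxns;
} ToyBuffer;

/* Look up a transaction by XID; NULL if it has not been created yet. */
static ToyTxn *
toy_find(ToyBuffer *buf, uint32_t xid)
{
    for (size_t i = 0; i < buf->ntxns; i++)
        if (buf->txns[i].xid == xid)
            return &buf->txns[i];
    return NULL;
}

/* Return the existing entry, or create it as a top-level xact at 'lsn'. */
static ToyTxn *
toy_get_or_create(ToyBuffer *buf, uint32_t xid, uint64_t lsn)
{
    ToyTxn *txn = toy_find(buf, xid);

    if (txn == NULL)
    {
        assert(buf->ntxns < TOY_MAX_TXNS);
        txn = &buf->txns[buf->ntxns++];
        *txn = (ToyTxn) {.xid = xid, .first_lsn = lsn};
    }
    return txn;
}

/* The invariant behind the assert: top-level first_lsn strictly increases. */
static bool
toy_toplevel_order_ok(const ToyBuffer *buf)
{
    bool     have_prev = false;
    uint64_t prev = 0;

    for (size_t i = 0; i < buf->ntxns; i++)
    {
        const ToyTxn *txn = &buf->txns[i];

        if (txn->is_subxact)
            continue;           /* known subxacts are not in the list */
        if (have_prev && !(prev < txn->first_lsn))
            return false;
        prev = txn->first_lsn;
        have_prev = true;
    }
    return true;
}

/* Unpatched order: both XIDs end up top-level with the same LSN. */
static bool
toy_new_cid_unpatched(ToyBuffer *buf, uint32_t xid, uint32_t top_xid,
                      uint64_t lsn)
{
    toy_get_or_create(buf, xid, lsn);       /* step 1: the record's own XID */
    toy_get_or_create(buf, top_xid, lsn);   /* step 2: the top-level XID */
    return toy_toplevel_order_ok(buf);
}

/* Patched order: demote the subxact before the top-level xact is created. */
static bool
toy_new_cid_patched(ToyBuffer *buf, uint32_t xid, uint32_t top_xid,
                    uint64_t lsn)
{
    ToyTxn *sub = toy_get_or_create(buf, xid, lsn);

    if (xid != top_xid)
    {
        sub->is_subxact = true;             /* drop from the top-level list */
        sub->toplevel_xid = top_xid;
    }
    toy_get_or_create(buf, top_xid, lsn);
    return toy_toplevel_order_ok(buf);
}
```

Starting from a zero-initialized `ToyBuffer`, `toy_new_cid_unpatched(&buf, 849, 848, lsn)` reports a violated invariant, because 849 and 848 both sit in the top-level list with the same LSN. `toy_new_cid_patched` with the same arguments leaves only 848 in the top-level list and records 849 as its subxact.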
So the patch "fixes" this by

(a) tweaking ReorderBufferAssignChild to first remove the subxact from
the list of top-level transactions

(b) calling ReorderBufferAssignChild when processing NEW_CID

However, I wonder whether we even have to process these records? If the
restart_lsn is half-way through the xact, can we even decode it? Maybe
we can just skip all of this, somehow? We'd still need to remember that
849 is a subxact of 848, at least, so that we know to skip it too.

Thread [1] suggested relaxing the assert to allow the same LSN, provided
it's an xact and its subxact. That goes directly against the expectation
that the toplevel_by_lsn list contains no known subxacts, and I don't
think we should be relaxing that. After all, just tweaking the LSN does
not really fix the issue, because not remembering that it's xact+subxact
is part of the issue. In principle, I think the issue is exactly the
opposite, i.e. that we don't realize 849 is a subxact, and leave it in
the list.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company