Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

Поиск
Список
Период
Сортировка
От Tomas Vondra
Тема Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)
Дата
Msg-id 45567656-fb0e-c844-48fb-f4c2a54dc2af@enterprisedb.com
обсуждение исходный текст
Ответ на Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)  (Amit Kapila <amit.kapila16@gmail.com>)
Ответы Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)  (Amit Kapila <amit.kapila16@gmail.com>)
Список pgsql-hackers

On 6/7/23 07:18, Amit Kapila wrote:
> On Wed, Jun 7, 2023 at 6:18 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 6/6/23 17:42, Tomas Vondra wrote:
>>>
>>
>> In investigated this a bit more, and the problem actually seems to be
>> more like this:
>>
>> 1) we create a new logical replication slot
>>
>> 2) while building the initial snapshot, we start with current insert
>> location, and then process records
>>
>> 3) for RUNNING_XACTS we call SnapBuildProcessRunningXacts, which calls
>> SnapBuildFindSnapshot
>>
>> 4) SnapBuildFindSnapshot does this:
>>
>>     else if (!builder->building_full_snapshot &&
>>              SnapBuildRestore(builder, lsn))
>>     {
>>         /* there won't be any state to cleanup */
>>         return false;
>>     }
>>
>> 5) because create_logical_replication_slot and get_changes both call
>> CreateInitDecodingContext with needs_full_snapshot=false, we end up
>> calling SnapBuildRestore()
>>
>> 6) once in a while this likely hits a snapshot created by a concurrent
>> session (for another logical slot) with SNAPBUILD_CONSISTENT state
>>
> 
> I think this analysis doesn't seem to match what you mentioned in the
> previous email which is as follows:
>>> However, we skip the assignment, because the log for this call of
>>> get_changes says this:
>>>
>>>    LOG:  logical decoding found consistent point at 0/6462E5D8
>>>
>>> so we fail to realize the 26662 is a subxact.
> 
> This is because the above LOG is printed when
> "running->oldestRunningXid == running->nextXid" not when we restore
> the snapshot. Am, I missing something?
> 

There are multiple places in snapbuild.c with the same message. Two in
SnapBuildFindSnapshot (one of them being the one you mentioned) and one
in SnapBuildRestore (which is the one actually triggered).

>>
>> I don't know what's the correct fix for this. Maybe we should set
>> needs_full_snapshot=true in create_logical_replication_slot when
>> creating the initial snapshot? Maybe we should use true even in
>> pg_logical_slot_get_changes_guts? This seems to fix the crashes ...
>>
> 
> I don't think that is advisable because setting "needs_full_snapshot"
> to true for decoding means the snapshot will start tracking
> non-catalog committed xacts as well which is costly and is not
> required for this case.
> 

True. TBH I managed to forget most of these details, so I meant it more
like a data point that it seems to fix the issue for me.

>> That'll prevent reading the serialized snapshots like this, but how
>> could that ever work? It seems pretty much guaranteed to ignore any
>> assignments that happened right before the snapshot?
>>
> 
> This part needs some analysis/thoughts. BTW, do you mean that it skips
> the assignment (a) because the assignment record is before we reach a
> consistent point, or (b) because we start reading WAL after the
> assignment, or (c) something else? If you intend to say (a) then can
> you please point me to the code you are referring to for the same?
> 

Well, I think the issue is pretty clear - we end up with an initial
snapshot that's in between the ASSIGNMENT and NEW_CID, and because
NEW_CID has both xact and subxact XID it fails because we add two TXNs
with the same LSN, not realizing one of them is subxact.

That's obviously wrong, although somewhat benign in production because
it only fails because of hitting an assert. Regular builds are likely to
just ignore it, although I haven't checked if the COMMIT cleanup (I
wonder if we remove the subxact from the toplevel list on commit).

I think the problem is we just grab an existing snapshot, before all
running xacts complete. Maybe we should fix that, and leave the
needs_full_snapshot alone. Haven't tried that, though.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



В списке pgsql-hackers по дате отправления:

Предыдущее
От: "Daniel Verite"
Дата:
Сообщение: Re: Inconsistent results with libc sorting on Windows
Следующее
От: Joe Conway
Дата:
Сообщение: Re: Inconsistent results with libc sorting on Windows