Обсуждение: Assertion failure when streaming logical changes

Поиск
Список
Период
Сортировка

Assertion failure when streaming logical changes

От
Michael Paquier
Дата:
Hi all,

Using test_decoding on HEAD (cc761b1) I am seeing the following assertion failure:
TRAP: FailedAssertion("!(!((&RegisteredSnapshots)->ph_root == ((void*)0)))", File: "snapmgr.c", Line: 677)

(lldb) bt
* thread #1: tid = 0x0000, 0x00007fff8b246d46 libsystem_kernel.dylib`__kill + 10, stop reason = signal SIGSTOP
  * frame #0: 0x00007fff8b246d46 libsystem_kernel.dylib`__kill + 10
    frame #1: 0x00007fff8861bf83 libsystem_c.dylib`abort + 177
    frame #2: 0x000000010b556dd9 postgres`ExceptionalCondition(conditionName=0x000000010b66ffcb, errorType=0x000000010b5c105d, fileName=0x000000010b66fe68, lineNumber=677) + 137 at assert.c:54
    frame #3: 0x000000010b5af952 postgres`UnregisterSnapshotFromOwner(snapshot=0x00007faea38c22d8, owner=0x00007faea38a6838) + 146 at snapmgr.c:677
    frame #4: 0x000000010b5af8b2 postgres`UnregisterSnapshot(snapshot=0x00007faea38c22d8) + 50 at snapmgr.c:663
    frame #5: 0x000000010b0166ce postgres`systable_endscan(sysscan=0x00007faea38cf090) + 110 at genam.c:504
    frame #6: 0x000000010b5474b8 postgres`RelationBuildTupleDesc(relation=0x0000000114932e68) + 952 at relcache.c:568
    frame #7: 0x000000010b53e45c postgres`RelationBuildDesc(targetRelId=3455, insertIt='\x01') + 604 at relcache.c:1035
    frame #8: 0x000000010b53d564 postgres`RelationIdGetRelation(relationId=3455) + 324 at relcache.c:1777
    frame #9: 0x000000010aff093c postgres`relation_open(relationId=3455, lockmode=1) + 108 at heapam.c:1047
    frame #10: 0x000000010b016ac9 postgres`index_open(relationId=3455, lockmode=1) + 25 at indexam.c:167
    frame #11: 0x000000010b01603d postgres`systable_beginscan(heapRelation=0x00000001149214d0, indexId=3455, indexOK='\x01', snapshot=0x0000000000000000, nkeys=2, key=0x00007fff54c6a990) + 93 at genam.c:334
    frame #12: 0x000000010b54a976 postgres`RelidByRelfilenode(reltablespace=0, relfilenode=12709) + 742 at relfilenodemap.c:204
    frame #13: 0x000000010b3353d9 postgres`ReorderBufferCommit(rb=0x00007faea38c1038, xid=1001, commit_lsn=23817712, end_lsn=23818128, commit_time=476842122105685) + 665 at reorderbuffer.c:1338
    frame #14: 0x000000010b330ccb postgres`DecodeCommit(ctx=0x00007faea38b0838, buf=0x00007fff54c6ad58, xid=1001, dboid=16384, commit_time=476842122105685, nsubxacts=0, sub_xids=0x00007faea3885530, ninval_msgs=22, msgs=0x00007faea3885530) + 443 at decode.c:548
    frame #15: 0x000000010b32f663 postgres`DecodeXactOp(ctx=0x00007faea38b0838, buf=0x00007fff54c6ad58) + 547 at decode.c:210
    frame #16: 0x000000010b32f10e postgres`LogicalDecodingProcessRecord(ctx=0x00007faea38b0838, record=0x00007faea38b0af8) + 142 at decode.c:103
    frame #17: 0x000000010b3433f5 postgres`XLogSendLogical + 165 at walsender.c:2425
    frame #18: 0x000000010b3431ad postgres`WalSndLoop(send_data=0x000000010b343350) + 269 at walsender.c:1834
    frame #19: 0x000000010b341938 postgres`StartLogicalReplication(cmd=0x00007faea38846a8) + 568 at walsender.c:997
    frame #20: 0x000000010b34021c postgres`exec_replication_command(cmd_string=0x00007faea3833a38) + 524 at walsender.c:1326
    frame #21: 0x000000010b3abaab postgres`PostgresMain(argc=1, argv=0x00007faea3803fc8, dbname=0x00007faea3803ec0, username=0x00007faea3803ea0) + 2475 at postgres.c:4022
    frame #22: 0x000000010b312a2e postgres`BackendRun(port=0x00007faea3405d60) + 686 at postmaster.c:4141
    frame #23: 0x000000010b311ff0 postgres`BackendStartup(port=0x00007faea3405d60) + 384 at postmaster.c:3826
    frame #24: 0x000000010b30e7f7 postgres`ServerLoop + 663 at postmaster.c:1594
    frame #25: 0x000000010b30c017 postgres`PostmasterMain(argc=3, argv=0x00007faea34044d0) + 5623 at postmaster.c:1241
    frame #26: 0x000000010b24e11d postgres`main(argc=3, argv=0x00007faea34044d0) + 749 at main.c:221

The problem can be easily reproduced using pg_recvlogical after creating a logical slot plugged with test_decoding.
Regards,
--
Michael

Re: Assertion failure when streaming logical changes

От
Andres Freund
Дата:
Hi,

On 2015-02-10 21:20:37 +0900, Michael Paquier wrote:
> Using test_decoding on HEAD (cc761b1) I am seeing the following assertion
> failure:
> TRAP: FailedAssertion("!(!((&RegisteredSnapshots)->ph_root ==
> ((void*)0)))", File: "snapmgr.c", Line: 677)

Ick. I guess a revert of 94028691609f8e148bd4ce72c46163f018832a5b fixes
it?

> (lldb) bt
> * thread #1: tid = 0x0000, 0x00007fff8b246d46 libsystem_kernel.dylib`__kill
> + 10, stop reason = signal SIGSTOP
>   * frame #0: 0x00007fff8b246d46 libsystem_kernel.dylib`__kill + 10
>     frame #1: 0x00007fff8861bf83 libsystem_c.dylib`abort + 177
>     frame #2: 0x000000010b556dd9
> postgres`ExceptionalCondition(conditionName=0x000000010b66ffcb,
> errorType=0x000000010b5c105d, fileName=0x000000010b66fe68, lineNumber=677)
> + 137 at assert.c:54
>     frame #3: 0x000000010b5af952
> postgres`UnregisterSnapshotFromOwner(snapshot=0x00007faea38c22d8,
> owner=0x00007faea38a6838) + 146 at snapmgr.c:677
>     frame #4: 0x000000010b5af8b2
> postgres`UnregisterSnapshot(snapshot=0x00007faea38c22d8) + 50 at
> snapmgr.c:663
>     frame #5: 0x000000010b0166ce
> postgres`systable_endscan(sysscan=0x00007faea38cf090) + 110 at genam.c:504
>     frame #6: 0x000000010b5474b8
> postgres`RelationBuildTupleDesc(relation=0x0000000114932e68) + 952 at
> relcache.c:568
>     frame #7: 0x000000010b53e45c
> postgres`RelationBuildDesc(targetRelId=3455, insertIt='\x01') + 604 at
> relcache.c:1035
>     frame #8: 0x000000010b53d564
> postgres`RelationIdGetRelation(relationId=3455) + 324 at relcache.c:1777
>     frame #9: 0x000000010aff093c postgres`relation_open(relationId=3455,
> lockmode=1) + 108 at heapam.c:1047
>     frame #10: 0x000000010b016ac9 postgres`index_open(relationId=3455,
> lockmode=1) + 25 at indexam.c:167
>     frame #11: 0x000000010b01603d
> postgres`systable_beginscan(heapRelation=0x00000001149214d0, indexId=3455,
> indexOK='\x01', snapshot=0x0000000000000000, nkeys=2,
> key=0x00007fff54c6a990) + 93 at genam.c:334
>     frame #12: 0x000000010b54a976

Yea, it really looks like the above commit is to "blame". The new xmin
tracking infrastructure doesn't know about the historical snapshot...

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: Assertion failure when streaming logical changes

От
Michael Paquier
Дата:
<div dir="ltr"><br /><div class="gmail_extra"><br /><div class="gmail_quote">On Tue, Feb 10, 2015 at 9:46 PM, Andres
Freund<span dir="ltr"><<a href="mailto:andres@2ndquadrant.com" target="_blank">andres@2ndquadrant.com</a>></span>
wrote:<br/><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Yea,
itreally looks like the above commit is to "blame". The new xmin<br /> tracking infrastructure doesn't know about the
historicalsnapshot...<span class="HOEnZb"><font color="#888888"><br /></font></span></blockquote></div><br /></div><div
class="gmail_extra">Ithink that we need a better regression coverage here... For example, we could add some tap tests
intest_decoding to test streaming of logical changes. This would help in the future to detect such problems via the
buildfarm.<br/>-- <br /><div class="gmail_signature">Michael<br /></div></div></div> 

Re: Assertion failure when streaming logical changes

От
Andres Freund
Дата:
On 2015-02-10 22:06:34 +0900, Michael Paquier wrote:
> On Tue, Feb 10, 2015 at 9:46 PM, Andres Freund <andres@2ndquadrant.com>
> wrote:
> 
> > Yea, it really looks like the above commit is to "blame". The new xmin
> > tracking infrastructure doesn't know about the historical snapshot...
> >
> 
> I think that we need a better regression coverage here... For example, we
> could add some tap tests in test_decoding to test streaming of logical
> changes. This would help in the future to detect such problems via the
> buildfarm.

I agree. It's more or less a accident that the assert - which just
should be moved in the regd_count == 0 branch - didn't trigger for the
SQL interface. The snapshot acquired by the SELECT statement prevents it
there.

It's not entirely trivial to add tests for receivelogical though. You
need to stop it programatically after a while.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: Assertion failure when streaming logical changes

От
Heikki Linnakangas
Дата:
On 02/10/2015 02:46 PM, Andres Freund wrote:
> On 2015-02-10 21:20:37 +0900, Michael Paquier wrote:
>> Using test_decoding on HEAD (cc761b1) I am seeing the following assertion
>> failure:
>> TRAP: FailedAssertion("!(!((&RegisteredSnapshots)->ph_root ==
>> ((void*)0)))", File: "snapmgr.c", Line: 677)
>
> Ick. I guess a revert of 94028691609f8e148bd4ce72c46163f018832a5b fixes
> it?
> ...
>
> Yea, it really looks like the above commit is to "blame". The new xmin
> tracking infrastructure doesn't know about the historical snapshot...

The new xmin tracking code assumes that if a snapshots's regd_count > 0,
it has already been pushed to the RegisteredSnapshots heap. That
assumption doesn't hold because the logical decoding stuff creates its
own snapshots and sets regd_count=1 to prevent snapmgr.c from freeing
them, even though they haven't been registered with RegisterSnapshot.

The second paragraph in this comment in snapmgr.c needs fixing:

>  * Likewise, any snapshots that have been exported by pg_export_snapshot
>  * have regd_count = 1 and are counted in RegisteredSnapshots, but are not
>  * tracked by any resource owner.
>  *
>  * The same is true for historic snapshots used during logical decoding,
>  * their lifetime is managed separately (as they life longer as one xact.c
>  * transaction).

Besides the spelling, that's incorrect: historic snapshots are *not*
counted in RegisteredSnapshots. That was harmless before commit
94028691, but it was always wrong.


I think setting regd_count=1 outside snapmgr.c is a pretty ugly hack.
snapbuild.c also abuses active_count as a reference counter, which is
similarly ugly. I think regd_count and active_count should both be
treated as private to snapmgr.c, and initialized to zero elsewhere.

As a minimal fix, we could change the logical decoding code to not use
regd_count to prevent snapmgr.c from freeing its snapshots, but use
active_count for that too. Setting regd_count to 1 in
SnapBuildBuildSnapshot() seems unnecessary in the first place, as the
callers also call SnapBuildSnapIncRefcount(). Patch attached.

But could we get rid of those active_count manipulations too? Could you
replace the SnapBuildSnap[Inc|Dec]Refcount calls with
[Un]RegisterSnapshot()?

- Heikki


Вложения

Re: Assertion failure when streaming logical changes

От
Craig Ringer
Дата:


On 10 February 2015 at 21:43, Andres Freund <andres@2ndquadrant.com> wrote:
On 2015-02-10 22:06:34 +0900, Michael Paquier wrote:
> On Tue, Feb 10, 2015 at 9:46 PM, Andres Freund <andres@2ndquadrant.com>
> wrote:
>
> > Yea, it really looks like the above commit is to "blame". The new xmin
> > tracking infrastructure doesn't know about the historical snapshot...
> >
>
> I think that we need a better regression coverage here... For example, we
> could add some tap tests in test_decoding to test streaming of logical
> changes. This would help in the future to detect such problems via the
> buildfarm.

I agree. It's more or less a accident that the assert - which just
should be moved in the regd_count == 0 branch - didn't trigger for the
SQL interface. The snapshot acquired by the SELECT statement prevents it
there.

It's not entirely trivial to add tests for receivelogical though. You
need to stop it programatically after a while.


I reckon we should improve that then.

What about an idle timeout for pg_recvlogical, so we can get it to time out after (say) 5 seconds of no data? Time-based facilities aren't great in tests though.

The SQL interface can stop after a given number of changes received or a target LSN. Is there a practical reason pg_recvlogical doesn't? Or just that there wasn't a reason to implement it? I figured I'd ask before jumping in and trying to add it in case I'd be wasting my time trying.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Assertion failure when streaming logical changes

От
Craig Ringer
Дата:


On 11 February 2015 at 08:51, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

The new xmin tracking code assumes that if a snapshots's regd_count > 0, it has already been pushed to the RegisteredSnapshots heap. That assumption doesn't hold because the logical decoding stuff creates its own snapshots and sets regd_count=1 to prevent snapmgr.c from freeing them, even though they haven't been registered with RegisterSnapshot.

The second paragraph in this comment in snapmgr.c needs fixing:

 * Likewise, any snapshots that have been exported by pg_export_snapshot
 * have regd_count = 1 and are counted in RegisteredSnapshots, but are not
 * tracked by any resource owner.
 *
 * The same is true for historic snapshots used during logical decoding,
 * their lifetime is managed separately (as they life longer as one xact.c
 * transaction).

Besides the spelling, that's incorrect: historic snapshots are *not* counted in RegisteredSnapshots. That was harmless before commit 94028691, but it was always wrong.


I think setting regd_count=1 outside snapmgr.c is a pretty ugly hack. snapbuild.c also abuses active_count as a reference counter, which is similarly ugly. I think regd_count and active_count should both be treated as private to snapmgr.c, and initialized to zero elsewhere.

As a minimal fix, we could change the logical decoding code to not use regd_count to prevent snapmgr.c from freeing its snapshots, but use active_count for that too. Setting regd_count to 1 in SnapBuildBuildSnapshot() seems unnecessary in the first place, as the callers also call SnapBuildSnapIncRefcount(). Patch attached.

It might be a good idea to apply this if nothing better is forthcoming. Logical decoding in WALsenders is broken at the moment.

Otherwise it needs to go on the 9.5 blockers list.

But could we get rid of those active_count manipulations too? Could you replace the SnapBuildSnap[Inc|Dec]Refcount calls with [Un]RegisterSnapshot()?

It would be interesting to know why it was done that way in the first place, rather than using the snapshot management infrastructure. I presume it needed to do something not directly offered by the snapshot manager but I haven't managed to grasp what, exactly.


--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Assertion failure when streaming logical changes

От
Andres Freund
Дата:
On 2015-04-07 17:22:12 +0800, Craig Ringer wrote:
> It might be a good idea to apply this if nothing better is forthcoming.
> Logical decoding in WALsenders is broken at the moment.

Yes.

> Otherwise it needs to go on the 9.5 blockers list.
> 
> But could we get rid of those active_count manipulations too? Could you
> > replace the SnapBuildSnap[Inc|Dec]Refcount calls with
> > [Un]RegisterSnapshot()?
> >
> 
> It would be interesting to know why it was done that way in the first
> place, rather than using the snapshot management infrastructure.

Because the snapshot tracking infrastructure isn't particularly
suitable. These snapshots are much longer lived; they span transactions
and such. Nothing snapmgr.c is made for. It seemed more
complicated/bug-prone to change snapmgr.c to support a foreign use case.

Greetings,

Andres Freund



Re: Assertion failure when streaming logical changes

От
Heikki Linnakangas
Дата:
On 04/07/2015 03:54 PM, Andres Freund wrote:
> On 2015-04-07 17:22:12 +0800, Craig Ringer wrote:
>> It might be a good idea to apply this if nothing better is forthcoming.
>> Logical decoding in WALsenders is broken at the moment.
>
> Yes.

Committed.
- Heikki