Обсуждение: Assertion failure when streaming logical changes
Hi all,
Using test_decoding on HEAD (cc761b1) I am seeing the following assertion failure:
TRAP: FailedAssertion("!(!((&RegisteredSnapshots)->ph_root == ((void*)0)))", File: "snapmgr.c", Line: 677)
(lldb) bt
* thread #1: tid = 0x0000, 0x00007fff8b246d46 libsystem_kernel.dylib`__kill + 10, stop reason = signal SIGSTOP
* frame #0: 0x00007fff8b246d46 libsystem_kernel.dylib`__kill + 10
frame #1: 0x00007fff8861bf83 libsystem_c.dylib`abort + 177
frame #2: 0x000000010b556dd9 postgres`ExceptionalCondition(conditionName=0x000000010b66ffcb, errorType=0x000000010b5c105d, fileName=0x000000010b66fe68, lineNumber=677) + 137 at assert.c:54
frame #3: 0x000000010b5af952 postgres`UnregisterSnapshotFromOwner(snapshot=0x00007faea38c22d8, owner=0x00007faea38a6838) + 146 at snapmgr.c:677
frame #4: 0x000000010b5af8b2 postgres`UnregisterSnapshot(snapshot=0x00007faea38c22d8) + 50 at snapmgr.c:663
frame #5: 0x000000010b0166ce postgres`systable_endscan(sysscan=0x00007faea38cf090) + 110 at genam.c:504
frame #6: 0x000000010b5474b8 postgres`RelationBuildTupleDesc(relation=0x0000000114932e68) + 952 at relcache.c:568
frame #7: 0x000000010b53e45c postgres`RelationBuildDesc(targetRelId=3455, insertIt='\x01') + 604 at relcache.c:1035
frame #8: 0x000000010b53d564 postgres`RelationIdGetRelation(relationId=3455) + 324 at relcache.c:1777
frame #9: 0x000000010aff093c postgres`relation_open(relationId=3455, lockmode=1) + 108 at heapam.c:1047
frame #10: 0x000000010b016ac9 postgres`index_open(relationId=3455, lockmode=1) + 25 at indexam.c:167
frame #11: 0x000000010b01603d postgres`systable_beginscan(heapRelation=0x00000001149214d0, indexId=3455, indexOK='\x01', snapshot=0x0000000000000000, nkeys=2, key=0x00007fff54c6a990) + 93 at genam.c:334
frame #12: 0x000000010b54a976 postgres`RelidByRelfilenode(reltablespace=0, relfilenode=12709) + 742 at relfilenodemap.c:204
frame #13: 0x000000010b3353d9 postgres`ReorderBufferCommit(rb=0x00007faea38c1038, xid=1001, commit_lsn=23817712, end_lsn=23818128, commit_time=476842122105685) + 665 at reorderbuffer.c:1338
frame #14: 0x000000010b330ccb postgres`DecodeCommit(ctx=0x00007faea38b0838, buf=0x00007fff54c6ad58, xid=1001, dboid=16384, commit_time=476842122105685, nsubxacts=0, sub_xids=0x00007faea3885530, ninval_msgs=22, msgs=0x00007faea3885530) + 443 at decode.c:548
frame #15: 0x000000010b32f663 postgres`DecodeXactOp(ctx=0x00007faea38b0838, buf=0x00007fff54c6ad58) + 547 at decode.c:210
frame #16: 0x000000010b32f10e postgres`LogicalDecodingProcessRecord(ctx=0x00007faea38b0838, record=0x00007faea38b0af8) + 142 at decode.c:103
frame #17: 0x000000010b3433f5 postgres`XLogSendLogical + 165 at walsender.c:2425
frame #18: 0x000000010b3431ad postgres`WalSndLoop(send_data=0x000000010b343350) + 269 at walsender.c:1834
frame #19: 0x000000010b341938 postgres`StartLogicalReplication(cmd=0x00007faea38846a8) + 568 at walsender.c:997
frame #20: 0x000000010b34021c postgres`exec_replication_command(cmd_string=0x00007faea3833a38) + 524 at walsender.c:1326
frame #21: 0x000000010b3abaab postgres`PostgresMain(argc=1, argv=0x00007faea3803fc8, dbname=0x00007faea3803ec0, username=0x00007faea3803ea0) + 2475 at postgres.c:4022
frame #22: 0x000000010b312a2e postgres`BackendRun(port=0x00007faea3405d60) + 686 at postmaster.c:4141
frame #23: 0x000000010b311ff0 postgres`BackendStartup(port=0x00007faea3405d60) + 384 at postmaster.c:3826
frame #24: 0x000000010b30e7f7 postgres`ServerLoop + 663 at postmaster.c:1594
frame #25: 0x000000010b30c017 postgres`PostmasterMain(argc=3, argv=0x00007faea34044d0) + 5623 at postmaster.c:1241
frame #26: 0x000000010b24e11d postgres`main(argc=3, argv=0x00007faea34044d0) + 749 at main.c:221
Using test_decoding on HEAD (cc761b1) I am seeing the following assertion failure:
TRAP: FailedAssertion("!(!((&RegisteredSnapshots)->ph_root == ((void*)0)))", File: "snapmgr.c", Line: 677)
(lldb) bt
* thread #1: tid = 0x0000, 0x00007fff8b246d46 libsystem_kernel.dylib`__kill + 10, stop reason = signal SIGSTOP
* frame #0: 0x00007fff8b246d46 libsystem_kernel.dylib`__kill + 10
frame #1: 0x00007fff8861bf83 libsystem_c.dylib`abort + 177
frame #2: 0x000000010b556dd9 postgres`ExceptionalCondition(conditionName=0x000000010b66ffcb, errorType=0x000000010b5c105d, fileName=0x000000010b66fe68, lineNumber=677) + 137 at assert.c:54
frame #3: 0x000000010b5af952 postgres`UnregisterSnapshotFromOwner(snapshot=0x00007faea38c22d8, owner=0x00007faea38a6838) + 146 at snapmgr.c:677
frame #4: 0x000000010b5af8b2 postgres`UnregisterSnapshot(snapshot=0x00007faea38c22d8) + 50 at snapmgr.c:663
frame #5: 0x000000010b0166ce postgres`systable_endscan(sysscan=0x00007faea38cf090) + 110 at genam.c:504
frame #6: 0x000000010b5474b8 postgres`RelationBuildTupleDesc(relation=0x0000000114932e68) + 952 at relcache.c:568
frame #7: 0x000000010b53e45c postgres`RelationBuildDesc(targetRelId=3455, insertIt='\x01') + 604 at relcache.c:1035
frame #8: 0x000000010b53d564 postgres`RelationIdGetRelation(relationId=3455) + 324 at relcache.c:1777
frame #9: 0x000000010aff093c postgres`relation_open(relationId=3455, lockmode=1) + 108 at heapam.c:1047
frame #10: 0x000000010b016ac9 postgres`index_open(relationId=3455, lockmode=1) + 25 at indexam.c:167
frame #11: 0x000000010b01603d postgres`systable_beginscan(heapRelation=0x00000001149214d0, indexId=3455, indexOK='\x01', snapshot=0x0000000000000000, nkeys=2, key=0x00007fff54c6a990) + 93 at genam.c:334
frame #12: 0x000000010b54a976 postgres`RelidByRelfilenode(reltablespace=0, relfilenode=12709) + 742 at relfilenodemap.c:204
frame #13: 0x000000010b3353d9 postgres`ReorderBufferCommit(rb=0x00007faea38c1038, xid=1001, commit_lsn=23817712, end_lsn=23818128, commit_time=476842122105685) + 665 at reorderbuffer.c:1338
frame #14: 0x000000010b330ccb postgres`DecodeCommit(ctx=0x00007faea38b0838, buf=0x00007fff54c6ad58, xid=1001, dboid=16384, commit_time=476842122105685, nsubxacts=0, sub_xids=0x00007faea3885530, ninval_msgs=22, msgs=0x00007faea3885530) + 443 at decode.c:548
frame #15: 0x000000010b32f663 postgres`DecodeXactOp(ctx=0x00007faea38b0838, buf=0x00007fff54c6ad58) + 547 at decode.c:210
frame #16: 0x000000010b32f10e postgres`LogicalDecodingProcessRecord(ctx=0x00007faea38b0838, record=0x00007faea38b0af8) + 142 at decode.c:103
frame #17: 0x000000010b3433f5 postgres`XLogSendLogical + 165 at walsender.c:2425
frame #18: 0x000000010b3431ad postgres`WalSndLoop(send_data=0x000000010b343350) + 269 at walsender.c:1834
frame #19: 0x000000010b341938 postgres`StartLogicalReplication(cmd=0x00007faea38846a8) + 568 at walsender.c:997
frame #20: 0x000000010b34021c postgres`exec_replication_command(cmd_string=0x00007faea3833a38) + 524 at walsender.c:1326
frame #21: 0x000000010b3abaab postgres`PostgresMain(argc=1, argv=0x00007faea3803fc8, dbname=0x00007faea3803ec0, username=0x00007faea3803ea0) + 2475 at postgres.c:4022
frame #22: 0x000000010b312a2e postgres`BackendRun(port=0x00007faea3405d60) + 686 at postmaster.c:4141
frame #23: 0x000000010b311ff0 postgres`BackendStartup(port=0x00007faea3405d60) + 384 at postmaster.c:3826
frame #24: 0x000000010b30e7f7 postgres`ServerLoop + 663 at postmaster.c:1594
frame #25: 0x000000010b30c017 postgres`PostmasterMain(argc=3, argv=0x00007faea34044d0) + 5623 at postmaster.c:1241
frame #26: 0x000000010b24e11d postgres`main(argc=3, argv=0x00007faea34044d0) + 749 at main.c:221
The problem can be easily reproduced using pg_recvlogical after creating a logical slot plugged with test_decoding.
Regards,
--
Michael
Hi, On 2015-02-10 21:20:37 +0900, Michael Paquier wrote: > Using test_decoding on HEAD (cc761b1) I am seeing the following assertion > failure: > TRAP: FailedAssertion("!(!((&RegisteredSnapshots)->ph_root == > ((void*)0)))", File: "snapmgr.c", Line: 677) Ick. I guess a revert of 94028691609f8e148bd4ce72c46163f018832a5b fixes it? > (lldb) bt > * thread #1: tid = 0x0000, 0x00007fff8b246d46 libsystem_kernel.dylib`__kill > + 10, stop reason = signal SIGSTOP > * frame #0: 0x00007fff8b246d46 libsystem_kernel.dylib`__kill + 10 > frame #1: 0x00007fff8861bf83 libsystem_c.dylib`abort + 177 > frame #2: 0x000000010b556dd9 > postgres`ExceptionalCondition(conditionName=0x000000010b66ffcb, > errorType=0x000000010b5c105d, fileName=0x000000010b66fe68, lineNumber=677) > + 137 at assert.c:54 > frame #3: 0x000000010b5af952 > postgres`UnregisterSnapshotFromOwner(snapshot=0x00007faea38c22d8, > owner=0x00007faea38a6838) + 146 at snapmgr.c:677 > frame #4: 0x000000010b5af8b2 > postgres`UnregisterSnapshot(snapshot=0x00007faea38c22d8) + 50 at > snapmgr.c:663 > frame #5: 0x000000010b0166ce > postgres`systable_endscan(sysscan=0x00007faea38cf090) + 110 at genam.c:504 > frame #6: 0x000000010b5474b8 > postgres`RelationBuildTupleDesc(relation=0x0000000114932e68) + 952 at > relcache.c:568 > frame #7: 0x000000010b53e45c > postgres`RelationBuildDesc(targetRelId=3455, insertIt='\x01') + 604 at > relcache.c:1035 > frame #8: 0x000000010b53d564 > postgres`RelationIdGetRelation(relationId=3455) + 324 at relcache.c:1777 > frame #9: 0x000000010aff093c postgres`relation_open(relationId=3455, > lockmode=1) + 108 at heapam.c:1047 > frame #10: 0x000000010b016ac9 postgres`index_open(relationId=3455, > lockmode=1) + 25 at indexam.c:167 > frame #11: 0x000000010b01603d > postgres`systable_beginscan(heapRelation=0x00000001149214d0, indexId=3455, > indexOK='\x01', snapshot=0x0000000000000000, nkeys=2, > key=0x00007fff54c6a990) + 93 at genam.c:334 > frame #12: 0x000000010b54a976 Yea, it really looks like the above commit is to "blame". The new xmin tracking infrastructure doesn't know about the historical snapshot... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
<div dir="ltr"><br /><div class="gmail_extra"><br /><div class="gmail_quote">On Tue, Feb 10, 2015 at 9:46 PM, Andres Freund<span dir="ltr"><<a href="mailto:andres@2ndquadrant.com" target="_blank">andres@2ndquadrant.com</a>></span> wrote:<br/><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Yea, itreally looks like the above commit is to "blame". The new xmin<br /> tracking infrastructure doesn't know about the historicalsnapshot...<span class="HOEnZb"><font color="#888888"><br /></font></span></blockquote></div><br /></div><div class="gmail_extra">Ithink that we need a better regression coverage here... For example, we could add some tap tests intest_decoding to test streaming of logical changes. This would help in the future to detect such problems via the buildfarm.<br/>-- <br /><div class="gmail_signature">Michael<br /></div></div></div>
On 2015-02-10 22:06:34 +0900, Michael Paquier wrote: > On Tue, Feb 10, 2015 at 9:46 PM, Andres Freund <andres@2ndquadrant.com> > wrote: > > > Yea, it really looks like the above commit is to "blame". The new xmin > > tracking infrastructure doesn't know about the historical snapshot... > > > > I think that we need a better regression coverage here... For example, we > could add some tap tests in test_decoding to test streaming of logical > changes. This would help in the future to detect such problems via the > buildfarm. I agree. It's more or less a accident that the assert - which just should be moved in the regd_count == 0 branch - didn't trigger for the SQL interface. The snapshot acquired by the SELECT statement prevents it there. It's not entirely trivial to add tests for receivelogical though. You need to stop it programatically after a while. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 02/10/2015 02:46 PM, Andres Freund wrote: > On 2015-02-10 21:20:37 +0900, Michael Paquier wrote: >> Using test_decoding on HEAD (cc761b1) I am seeing the following assertion >> failure: >> TRAP: FailedAssertion("!(!((&RegisteredSnapshots)->ph_root == >> ((void*)0)))", File: "snapmgr.c", Line: 677) > > Ick. I guess a revert of 94028691609f8e148bd4ce72c46163f018832a5b fixes > it? > ... > > Yea, it really looks like the above commit is to "blame". The new xmin > tracking infrastructure doesn't know about the historical snapshot... The new xmin tracking code assumes that if a snapshots's regd_count > 0, it has already been pushed to the RegisteredSnapshots heap. That assumption doesn't hold because the logical decoding stuff creates its own snapshots and sets regd_count=1 to prevent snapmgr.c from freeing them, even though they haven't been registered with RegisterSnapshot. The second paragraph in this comment in snapmgr.c needs fixing: > * Likewise, any snapshots that have been exported by pg_export_snapshot > * have regd_count = 1 and are counted in RegisteredSnapshots, but are not > * tracked by any resource owner. > * > * The same is true for historic snapshots used during logical decoding, > * their lifetime is managed separately (as they life longer as one xact.c > * transaction). Besides the spelling, that's incorrect: historic snapshots are *not* counted in RegisteredSnapshots. That was harmless before commit 94028691, but it was always wrong. I think setting regd_count=1 outside snapmgr.c is a pretty ugly hack. snapbuild.c also abuses active_count as a reference counter, which is similarly ugly. I think regd_count and active_count should both be treated as private to snapmgr.c, and initialized to zero elsewhere. As a minimal fix, we could change the logical decoding code to not use regd_count to prevent snapmgr.c from freeing its snapshots, but use active_count for that too. Setting regd_count to 1 in SnapBuildBuildSnapshot() seems unnecessary in the first place, as the callers also call SnapBuildSnapIncRefcount(). Patch attached. But could we get rid of those active_count manipulations too? Could you replace the SnapBuildSnap[Inc|Dec]Refcount calls with [Un]RegisterSnapshot()? - Heikki
Вложения
On 10 February 2015 at 21:43, Andres Freund <andres@2ndquadrant.com> wrote:
I agree. It's more or less a accident that the assert - which justOn 2015-02-10 22:06:34 +0900, Michael Paquier wrote:
> On Tue, Feb 10, 2015 at 9:46 PM, Andres Freund <andres@2ndquadrant.com>
> wrote:
>
> > Yea, it really looks like the above commit is to "blame". The new xmin
> > tracking infrastructure doesn't know about the historical snapshot...
> >
>
> I think that we need a better regression coverage here... For example, we
> could add some tap tests in test_decoding to test streaming of logical
> changes. This would help in the future to detect such problems via the
> buildfarm.
should be moved in the regd_count == 0 branch - didn't trigger for the
SQL interface. The snapshot acquired by the SELECT statement prevents it
there.
It's not entirely trivial to add tests for receivelogical though. You
need to stop it programatically after a while.
I reckon we should improve that then.
What about an idle timeout for pg_recvlogical, so we can get it to time out after (say) 5 seconds of no data? Time-based facilities aren't great in tests though.
The SQL interface can stop after a given number of changes received or a target LSN. Is there a practical reason pg_recvlogical doesn't? Or just that there wasn't a reason to implement it? I figured I'd ask before jumping in and trying to add it in case I'd be wasting my time trying.
On 11 February 2015 at 08:51, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
It might be a good idea to apply this if nothing better is forthcoming. Logical decoding in WALsenders is broken at the moment.
The new xmin tracking code assumes that if a snapshots's regd_count > 0, it has already been pushed to the RegisteredSnapshots heap. That assumption doesn't hold because the logical decoding stuff creates its own snapshots and sets regd_count=1 to prevent snapmgr.c from freeing them, even though they haven't been registered with RegisterSnapshot.
The second paragraph in this comment in snapmgr.c needs fixing:* Likewise, any snapshots that have been exported by pg_export_snapshot
* have regd_count = 1 and are counted in RegisteredSnapshots, but are not
* tracked by any resource owner.
*
* The same is true for historic snapshots used during logical decoding,
* their lifetime is managed separately (as they life longer as one xact.c
* transaction).
Besides the spelling, that's incorrect: historic snapshots are *not* counted in RegisteredSnapshots. That was harmless before commit 94028691, but it was always wrong.
I think setting regd_count=1 outside snapmgr.c is a pretty ugly hack. snapbuild.c also abuses active_count as a reference counter, which is similarly ugly. I think regd_count and active_count should both be treated as private to snapmgr.c, and initialized to zero elsewhere.
As a minimal fix, we could change the logical decoding code to not use regd_count to prevent snapmgr.c from freeing its snapshots, but use active_count for that too. Setting regd_count to 1 in SnapBuildBuildSnapshot() seems unnecessary in the first place, as the callers also call SnapBuildSnapIncRefcount(). Patch attached.
It might be a good idea to apply this if nothing better is forthcoming. Logical decoding in WALsenders is broken at the moment.
Otherwise it needs to go on the 9.5 blockers list.
But could we get rid of those active_count manipulations too? Could you replace the SnapBuildSnap[Inc|Dec]Refcount calls with [Un]RegisterSnapshot()?
It would be interesting to know why it was done that way in the first place, rather than using the snapshot management infrastructure. I presume it needed to do something not directly offered by the snapshot manager but I haven't managed to grasp what, exactly.
On 2015-04-07 17:22:12 +0800, Craig Ringer wrote: > It might be a good idea to apply this if nothing better is forthcoming. > Logical decoding in WALsenders is broken at the moment. Yes. > Otherwise it needs to go on the 9.5 blockers list. > > But could we get rid of those active_count manipulations too? Could you > > replace the SnapBuildSnap[Inc|Dec]Refcount calls with > > [Un]RegisterSnapshot()? > > > > It would be interesting to know why it was done that way in the first > place, rather than using the snapshot management infrastructure. Because the snapshot tracking infrastructure isn't particularly suitable. These snapshots are much longer lived; they span transactions and such. Nothing snapmgr.c is made for. It seemed more complicated/bug-prone to change snapmgr.c to support a foreign use case. Greetings, Andres Freund
On 04/07/2015 03:54 PM, Andres Freund wrote: > On 2015-04-07 17:22:12 +0800, Craig Ringer wrote: >> It might be a good idea to apply this if nothing better is forthcoming. >> Logical decoding in WALsenders is broken at the moment. > > Yes. Committed. - Heikki