Обсуждение: rare crash - FailedAssertion snapbuild.c Line: 580
Hello,
To test postgres 11, I still regularly run series of short sessions of
pgbench-over-logical-replication (basically the same thing that I used
last year [1] - now in a perl incarnation). Most of the time the
replication is stable and finishes correctly but sometimes (rarely) I
get:
TRAP: FailedAssertion("!(TransactionIdPrecedesOrEquals(safeXid,
snap->xmin))", File: "snapbuild.c", Line: 580)
This will probably be difficult to reproduce and to act upon but I
wanted to report it anyway as in the course of the last few months I
have seen it several times, on several machines. Always rarely, always
postgres 11 (I did not try other versions).
Erik Rijkers
[1]
https://www.postgresql.org/message-id/3897361c7010c4ac03f358173adbcd60%40xs4all.nl
source/version: bf2d0462cd73 / REL_11_STABLE (compiled 20180828_1629)
1 primary, 4 replicas (on 1 host).
Basically: pgbench --port=6515 --quiet --initialize --scale=5 postgres
then: pgbench -M prepared -c 93 -j 8 -T 10 -P 2 -p 6515 postgres
then: wait for log-repl sync.
(I also added extra data type columns to the basic pgbench tables to
test all the different data types)
Below is a grep for the assert message, with 15 surrounding lines from
the latest occurrence.
$ p=REL_11_STABLE; cd /var/data1/pg_stuff/tmp/cascade/$p ; grep -A 15 -B
15 -Ei 'trap|assert' 65*/logfile.65* | less
6516JTq_E8/logfile.6516-2018-08-29 14:28:13.747 CEST [139409] LOG:
logical decoding found consistent point at 0/89E038E0
6516JTq_E8/logfile.6516-2018-08-29 14:28:13.747 CEST [139409] DETAIL:
There are no running transactions.
6516JTq_E8/logfile.6516-2018-08-29 14:28:14.522 CEST [139539] LOG:
received replication command: CREATE_REPLICATION_SLOT
"sub2_6517_6517_18834_sync_18804" TEMPORARY LOGICAL pgoutput
USE_SNAPSHOT
6516JTq_E8/logfile.6516-2018-08-29 14:28:15.681 CEST [139539] LOG:
logical decoding found initial starting point at 0/8AF5D590
6516JTq_E8/logfile.6516-2018-08-29 14:28:15.681 CEST [139539] DETAIL:
Waiting for transactions (approximately 1) older than 886297 to end.
6516JTq_E8/logfile.6516-2018-08-29 14:28:15.720 CEST [139539] LOG:
logical decoding found consistent point at 0/8B018310
6516JTq_E8/logfile.6516-2018-08-29 14:28:15.720 CEST [139539] DETAIL:
There are no running transactions.
6516JTq_E8/logfile.6516-2018-08-29 14:28:15.860 CEST [139407] LOG:
received replication command: START_REPLICATION SLOT
"sub2_6517_6517_18834_sync_18793" LOGICAL 0/89E03918 (proto_version '1',
publication_names '"pub1_6516"')
6516JTq_E8/logfile.6516-2018-08-29 14:28:15.861 CEST [139407] LOG:
starting logical decoding for slot "sub2_6517_6517_18834_sync_18793"
6516JTq_E8/logfile.6516-2018-08-29 14:28:15.861 CEST [139407] DETAIL:
Streaming transactions committing after 0/89E03918, reading WAL from
0/841EE960.
6516JTq_E8/logfile.6516-2018-08-29 14:28:16.064 CEST [139407] LOG:
logical decoding found consistent point at 0/841EF3E0
6516JTq_E8/logfile.6516-2018-08-29 14:28:16.064 CEST [139407] DETAIL:
Logical decoding will begin using saved snapshot.
6516JTq_E8/logfile.6516-2018-08-29 14:28:18.953 CEST [139541] LOG:
received replication command: CREATE_REPLICATION_SLOT
"sub2_6517_6517_18834_sync_18814" TEMPORARY LOGICAL pgoutput
USE_SNAPSHOT
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.002 CEST [139541] LOG:
logical decoding found consistent point at 0/8C6BA408
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.002 CEST [139541] DETAIL:
There are no running transactions.
6516JTq_E8/logfile.6516:TRAP:
FailedAssertion("!(TransactionIdPrecedesOrEquals(safeXid, snap->xmin))",
File: "snapbuild.c", Line: 580)
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130398] LOG:
server process (PID 139541) was terminated by signal 6: Aborted
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130398] DETAIL:
Failed process was running: BEGIN READ ONLY ISOLATION LEVEL REPEATABLE
READ
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130398] LOG:
terminating any other active server processes
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [139405] WARNING:
terminating connection because of crash of another server process
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [139405] DETAIL:
The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [139405] HINT: In
a moment you should be able to reconnect to the database and repeat your
command.
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [139405] CONTEXT:
slot "sub2_6517_6517", output plugin "pgoutput", in the change callback,
associated LSN 0/8B11CF00
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130411] WARNING:
terminating connection because of crash of another server process
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130411] DETAIL:
The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130411] HINT: In
a moment you should be able to reconnect to the database and repeat your
command.
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130405] WARNING:
terminating connection because of crash of another server process
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130405] DETAIL:
The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130405] HINT: In
a moment you should be able to reconnect to the database and repeat your
command.
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.116 CEST [130398] LOG: all
server processes terminated; reinitializing
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.515 CEST [139548] LOG:
database system was interrupted; last known up at 2018-08-29 14:26:14
CEST
Hi,
On 2018-08-29 17:43:17 +0200, Erik Rijkers wrote:
> To test postgres 11, I still regularly run series of short sessions of
> pgbench-over-logical-replication (basically the same thing that I used last
> year [1] - now in a perl incarnation). Most of the time the replication is
> stable and finishes correctly but sometimes (rarely) I get:
>
> TRAP: FailedAssertion("!(TransactionIdPrecedesOrEquals(safeXid,
> snap->xmin))", File: "snapbuild.c", Line: 580)
>
> This will probably be difficult to reproduce and to act upon but I wanted to
> report it anyway as in the course of the last few months I have seen it
> several times, on several machines. Always rarely, always postgres 11 (I did
> not try other versions).
Thanks for testing! Could you possibly run the tests with core files
enabled, so we at get a backtrace in case of trouble? Knowing what the
values here are would be tremendously helpful...
Greetings,
Andres Freund
On 2018-08-29 21:15, Andres Freund wrote:
> Hi,
>
> On 2018-08-29 17:43:17 +0200, Erik Rijkers wrote:
>> To test postgres 11, I still regularly run series of short sessions of
>> pgbench-over-logical-replication (basically the same thing that I used
>> last
>> year [1] - now in a perl incarnation). Most of the time the
>> replication is
>> stable and finishes correctly but sometimes (rarely) I get:
>>
>> TRAP: FailedAssertion("!(TransactionIdPrecedesOrEquals(safeXid,
>> snap->xmin))", File: "snapbuild.c", Line: 580)
>>
>> This will probably be difficult to reproduce and to act upon but I
>> wanted to
>> report it anyway as in the course of the last few months I have seen
>> it
>> several times, on several machines. Always rarely, always postgres 11
>> (I did
>> not try other versions).
>
> Thanks for testing! Could you possibly run the tests with core files
> enabled, so we at get a backtrace in case of trouble? Knowing what the
> values here are would be tremendously helpful...
ok, is this any use?
$ gdb --quiet
/var/data1/pg_stuff/pg_installations/pgsql.REL_11_STABLE/bin/postgres
/var/data1/pg_stuff/tmp/cascade/REL_11_STABLE/6516_gW1Cl/data/core
Reading symbols from
/var/data1/pg_stuff/pg_installations/pgsql.REL_11_STABLE/bin/postgres...done.
[New LWP 147484]
[Thread debugging using libthread_db enabled]
Using host libthread_db library
"/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: walsender rijkers [local] idle in
transaction '.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007f0fd20e7067 in __GI_raise (sig=sig@entry=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:56
56 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or
directory.
(gdb) bt
#0 0x00007f0fd20e7067 in __GI_raise (sig=sig@entry=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007f0fd20e8448 in __GI_abort () at abort.c:89
#2 0x00000000008880bf in ExceptionalCondition
(conditionName=conditionName@entry=0xa417f8
"!(TransactionIdPrecedesOrEquals(safeXid, snap->xmin))",
errorType=errorType@entry=0x8d365d "FailedAssertion",
fileName=fileName@entry=0xa41223 "snapbuild.c",
lineNumber=lineNumber@entry=580) at assert.c:54
#3 0x000000000072676e in SnapBuildInitialSnapshot () at snapbuild.c:580
#4 0x000000000072ed54 in CreateReplicationSlot (cmd=0x223bae0) at
walsender.c:951
#5 exec_replication_command (cmd_string=cmd_string@entry=0x21a1cf8
"CREATE_REPLICATION_SLOT \"sub2_6517_6517_18748_sync_18728\" TEMPORARY
LOGICAL pgoutput USE_SNAPSHOT") at walsender.c:1527
#6 0x000000000077e8ee in PostgresMain (argc=<optimized out>,
argv=argv@entry=0x21cbbc8, dbname=<optimized out>, username=<optimized
out>) at postgres.c:4155
#7 0x0000000000704fde in BackendRun (port=0x21c4520) at
postmaster.c:4361
#8 BackendStartup (port=0x21c4520) at postmaster.c:4033
#9 ServerLoop () at postmaster.c:1706
#10 0x0000000000705e0f in PostmasterMain (argc=argc@entry=12,
argv=argv@entry=0x219c470) at postmaster.c:1379
#11 0x0000000000478d80 in main (argc=12, argv=0x219c470) at main.c:228
(gdb)
On 2018-Aug-30, Erik Rijkers wrote: > ok, is this any use? Seems mostly good, but the Xids are not printed. Could you please do "bt full"? Also: frame 3 print *snap Thanks, -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018-08-30 16:44, Alvaro Herrera wrote: > On 2018-Aug-30, Erik Rijkers wrote: > >> ok, is this any use? > > Seems mostly good, but the Xids are not printed. Could you please do > "bt full"? Also: > > frame 3 > print *snap See the attached.