Discussion: BUG #18076: Consistently receiving Signal 7 and Signal 11 errors


BUG #18076: Consistently receiving Signal 7 and Signal 11 errors

From:
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      18076
Logged by:          Robert Sanford
Email address:      wobbet@wobbet.com
PostgreSQL version: 15.4
Operating system:   Linux 5.15.0-1042-azure #49~20.04.1-Ubuntu aarch64
Description:

Installed v15 via Docker on my local dev machine (Intel x64). Created the
database schema, ran a bunch of tests. All very good.

I pg_dump'ed the database and then imported it into a new instance running on
an ARM64 VM in Azure (8 cores, 64 GB RAM, Ubuntu 20.04.1). This is a full VM,
not a Docker instance. The ARM64 CPUs are, according to Azure, physical CPUs
and not virtual.

Logs from the Azure machine show Signal 7 and Signal 11 errors consistently
at the same point in the processing every time (log entries below). It's
always executing one specific function. The server restarts and has to catch
up from the WAL. None of the other functions show this behavior. Running the
stored procedure from DBeaver does not cause the same issues.

My ETL app is a .NET 6 console application running on my local machine and
connecting to either the DB on my local machine or the VM in Azure. It
connects via Npgsql. It doesn't seem to be horribly expensive in terms of
CPU usage and generally finishes in < 5s. That said, there can be as many as
32 simultaneous connections running the specific stored procedure.


After reading other bug reports involving Signal 7 I set huge_pages=off and
disabled background features such as autoanalyze. Other non-default config
entries include:
- max_connections = 100
- shared_buffers = 8GB
- work_mem = 8GB
- maintenance_work_mem = 1GB
- max_wal_size = 4GB
- min_wal_size = 500MB
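
For reference, a quick way to confirm which settings actually differ from
their defaults is sketched below (the -U/-d values are placeholders for your
own connection):

# list every setting whose value did not come from the built-in default
psql -U postgres -d mydb -c "
  SELECT name, setting, unit, source
    FROM pg_settings
   WHERE source <> 'default'
   ORDER BY name;"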

I am hopeful that someone will be able to help me out with this. Let me know
if there are any other details I need to provide.

Thanks!

rjsjr



Log entries...
2023-08-29 01:50:57.225 UTC [432307] LOG:  server process (PID 433701) was
terminated by signal 7: Bus error
2023-08-29 01:50:57.225 UTC [432307] DETAIL:  Failed process was running:
select * from process_route_compliance($1, $2, $3)
2023-08-29 01:50:57.225 UTC [432307] LOG:  terminating any other active
server processes
2023-08-29 01:50:57.231 UTC [432307] LOG:  all server processes terminated;
reinitializing
2023-08-29 01:50:57.441 UTC [438947] LOG:  database system was interrupted;
last known up at 2023-08-29 01:44:22 UTC
2023-08-29 01:50:57.547 UTC [438947] LOG:  database system was not properly
shut down; automatic recovery in progress
2023-08-29 01:50:57.553 UTC [438947] LOG:  redo starts at F/3CD6C068
2023-08-29 01:50:57.896 UTC [438950] FATAL:  the database system is not yet
accepting connections
2023-08-29 01:50:57.896 UTC [438950] DETAIL:  Consistent recovery state has
not been yet reached.
2023-08-29 01:50:58.612 UTC [438947] LOG:  unexpected pageaddr E/99864000 in
log segment 000000010000000F00000050, offset 8798208
2023-08-29 01:50:58.612 UTC [438947] LOG:  redo done at F/50863FB0 system
usage: CPU: user: 0.66 s, system: 0.32 s, elapsed: 1.05 s
2023-08-29 01:50:58.629 UTC [438948] LOG:  checkpoint starting:
end-of-recovery immediate wait


2023-08-29 01:51:07.996 UTC [438948] LOG:  checkpoint complete: wrote 35042
buffers (3.3%); 0 WAL file(s) added, 20 removed, 0 recycled; write=8.989 s,
sync=0.197 s, total=9.372 s; sync files=254, longest=0.193 s, average=0.001
s; distance=322528 kB, estimate=322528 kB
2023-08-29 01:51:08.027 UTC [432307] LOG:  database system is ready to
accept connections
2023-08-29 01:52:07.472 UTC [432307] LOG:  server process (PID 440110) was
terminated by signal 7: Bus error
2023-08-29 01:52:07.472 UTC [432307] DETAIL:  Failed process was running:
select * from process_route_compliance($1, $2, $3)
2023-08-29 01:52:07.472 UTC [432307] LOG:  terminating any other active
server processes
2023-08-29 01:52:07.500 UTC [432307] LOG:  all server processes terminated;
reinitializing
2023-08-29 01:52:07.689 UTC [440184] LOG:  database system was interrupted;
last known up at 2023-08-29 01:51:07 UTC
2023-08-29 01:52:07.797 UTC [440184] LOG:  database system was not properly
shut down; automatic recovery in progress
2023-08-29 01:52:07.803 UTC [440184] LOG:  redo starts at F/508640C0
2023-08-29 01:52:07.892 UTC [440184] LOG:  unexpected pageaddr E/C8C98000 in
log segment 000000010000000F00000051, offset 13205504
2023-08-29 01:52:07.892 UTC [440184] LOG:  redo done at F/51C97F28 system
usage: CPU: user: 0.06 s, system: 0.02 s, elapsed: 0.08 s

2023-08-29 01:52:07.908 UTC [440185] LOG:  checkpoint starting:
end-of-recovery immediate wait
2023-08-29 01:52:08.085 UTC [440185] LOG:  checkpoint complete: wrote 1216
buffers (0.1%); 0 WAL file(s) added, 1 removed, 0 recycled; write=0.140 s,
sync=0.011 s, total=0.181 s; sync files=39, longest=0.005 s, average=0.001
s; distance=20688 kB, estimate=20688 kB
2023-08-29 01:52:08.104 UTC [432307] LOG:  database system is ready to
accept connections
2023-08-29 01:52:08.153 UTC [440187] FATAL:  the database system is not yet
accepting connections
2023-08-29 01:52:08.153 UTC [440187] DETAIL:  Consistent recovery state has
not been yet reached.
2023-08-29 01:52:09.716 UTC [432307] LOG:  server process (PID 440192) was
terminated by signal 11: Segmentation fault
2023-08-29 01:52:09.716 UTC [432307] DETAIL:  Failed process was running:
select * from process_route_compliance($1, $2, $3)
2023-08-29 01:52:09.716 UTC [432307] LOG:  terminating any other active
server processes
2023-08-29 01:52:09.720 UTC [432307] LOG:  all server processes terminated;
reinitializing
2023-08-29 01:52:09.873 UTC [440196] LOG:  database system was interrupted;
last known up at 2023-08-29 01:52:08 UTC
2023-08-29 01:52:09.984 UTC [440196] LOG:  database system was not properly
shut down; automatic recovery in progress
2023-08-29 01:52:09.998 UTC [440196] LOG:  redo starts at F/51C980C0
2023-08-29 01:52:10.001 UTC [440196] LOG:  unexpected pageaddr E/C8CEA000 in
log segment 000000010000000F00000051, offset 13541376
2023-08-29 01:52:10.001 UTC [440196] LOG:  redo done at F/51CE9AB0 system
usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2023-08-29 01:52:10.024 UTC [440197] LOG:  checkpoint starting:
end-of-recovery immediate wait
2023-08-29 01:52:10.132 UTC [440197] LOG:  checkpoint complete: wrote 58
buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.077 s,
sync=0.014 s, total=0.113 s; sync files=26, longest=0.005 s, average=0.001
s; distance=328 kB, estimate=328 kB
2023-08-29 01:52:10.141 UTC [432307] LOG:  database system is ready to
accept connections
2023-08-29 01:52:11.501 UTC [432307] LOG:  server process (PID 440272) was
terminated by signal 11: Segmentation fault
2023-08-29 01:52:11.501 UTC [432307] DETAIL:  Failed process was running:
select * from process_route_compliance($1, $2, $3)
2023-08-29 01:52:11.501 UTC [432307] LOG:  terminating any other active
server processes
2023-08-29 01:52:11.504 UTC [432307] LOG:  all server processes terminated;
reinitializing
2023-08-29 01:52:11.648 UTC [440279] LOG:  database system was interrupted;
last known up at 2023-08-29 01:52:10 UTC
2023-08-29 01:52:11.748 UTC [440279] LOG:  database system was not properly
shut down; automatic recovery in progress
2023-08-29 01:52:11.754 UTC [440279] LOG:  redo starts at F/51CEA0C0
2023-08-29 01:52:11.759 UTC [440279] LOG:  unexpected pageaddr E/C8D86000 in
log segment 000000010000000F00000051, offset 14180352
2023-08-29 01:52:11.759 UTC [440279] LOG:  redo done at F/51D85F90 system
usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2023-08-29 01:52:11.778 UTC [440280] LOG:  checkpoint starting:
end-of-recovery immediate wait
2023-08-29 01:52:11.874 UTC [440280] LOG:  checkpoint complete: wrote 89
buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.062 s,
sync=0.016 s, total=0.103 s; sync files=31, longest=0.008 s, average=0.001
s; distance=624 kB, estimate=624 kB
2023-08-29 01:52:11.883 UTC [432307] LOG:  database system is ready to
accept connections
2023-08-29 01:53:16.516 UTC [432307] LOG:  server process (PID 440285) was
terminated by signal 11: Segmentation fault
2023-08-29 01:53:16.516 UTC [432307] DETAIL:  Failed process was running:
select * from process_route_compliance($1, $2, $3)


Re: BUG #18076: Consistently receiving Signal 7 and Signal 11 errors

From:
Robert Sanford
Date:
Note: one thing I read is that work_mem is a per-connection setting (in fact it applies per sort/hash operation), so 8GB * 32 active connections = far more RAM than the machine has.

I decreased that to 128MB and re-ran the tests. It got further than it had previously, but failed in the same place. Then I decreased it to 32MB and the same thing happened.

Using the Azure VM monitor I can see that available memory has never gone below 50%, so work_mem may not be the issue.
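
For completeness, a minimal sketch of changing and verifying work_mem
cluster-wide without a restart (the value is just the 128MB tried above;
connection options are omitted):

# lower work_mem for new sessions and reload the configuration
psql -c "ALTER SYSTEM SET work_mem = '128MB';"
psql -c "SELECT pg_reload_conf();"
# confirm what a fresh session actually sees
psql -c "SHOW work_mem;"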

rjsjr


Re: BUG #18076: Consistently receiving Signal 7 and Signal 11 errors

From:
Tom Lane
Date:
Robert Sanford <wobbet@gmail.com> writes:
> Using the Azure VM monitor I see that available memory has not ever gone
> below 50% so work_mem may not be the issue.

There was never any reason to think that.

>> 2023-08-29 01:50:57.225 UTC [432307] LOG:  server process (PID 433701) was
>> terminated by signal 7: Bus error

The usual cause of SIGBUS is a misaligned memory access on hardware
that's picky about that.  (Intel chips aren't, so that it's often
possible for such bugs to slip through developer testing.  But your
ARM system might be.)

Unfortunately, there's about zero chance of locating the bug from
the information you've provided.  What'd be helpful is to capture
stack traces from a few of the failed processes.  See

https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend

for directions.  Please try to get traces from both the SIG7 and
SIG11 cases, as it's not very clear whether they have the same
root.
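
A rough sketch of the gdb route from that page, adapted to an Ubuntu box like
this one (the debug-symbol package name is a guess and depends on how
PostgreSQL was installed):

# install gdb and the server's debug symbols
sudo apt-get install gdb postgresql-15-dbgsym    # package name may differ
# find the backend that is running the failing call
psql -c "SELECT pid, query FROM pg_stat_activity
          WHERE query LIKE '%process_route_compliance%';"
# attach to that PID, let it continue, and collect the trace when it faults
sudo gdb -p <backend pid>
(gdb) cont
(gdb) bt full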

            regards, tom lane



Re: BUG #18076: Consistently receiving Signal 7 and Signal 11 errors

From:
Robert Sanford
Date:
Tom,

Thanks for that! The symlink was the issue. I recreated the symlink and it found everything. I started seeing some weird errors when running a simple test but realized that they were actually bugs in my code!

Now I'm rerunning the full suite and am no longer getting the Signal 7 or Signal 11 errors. I'm hoping the OS upgrade is what made things work better, but that also worries me.

If (when?!?!) I am able to reproduce the bug I'll get a stack trace.

rjsjr





On Wed, Aug 30, 2023 at 9:19 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Sanford <wobbet@gmail.com> writes:
> The update dropped the mount point where the database was stored. I've
> restored the mount point but now PostgreSQL can't find it.
> It's looking for pg_tblspc/16392/PG_15_202209061 when that folder is
> located instead on /datadrive/pgsql/PG_15_202209061.

pg_tblspc/16392/PG_15_202209061 should be a symlink to wherever the
tablespace really is.  You could make that link by hand if it
disappeared somehow, but I think you are probably misdiagnosing
the problem.  Shouldn't you be adding something to the system's
list of mounts, instead?
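
If the link really did vanish, a minimal sketch of recreating it by hand
(paths assume the stock Ubuntu data directory and the /datadrive/pgsql
location described above; re-adding the mount to the system's fstab is
probably the better first step, as noted):

# stop the server before touching pg_tblspc
sudo systemctl stop postgresql
cd /var/lib/postgresql/15/main/pg_tblspc    # adjust to the real data directory
# the link target is the tablespace directory itself; PostgreSQL creates
# the PG_15_* subdirectory inside it
sudo ln -s /datadrive/pgsql 16392
sudo chown -h postgres:postgres 16392
sudo systemctl start postgresql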

                        regards, tom lane