Discussion: strange hung processes
We're running 7.4.9 and have run into something strange. A group of processes seems to have hung and cannot be killed. netstat showed that there were no active TCP connections at the time. Sending SIGTERM to the parent process caused PG to begin its shutdown, but it never finished. We then ran kill -9 on the postmaster, which caused it to die, but these child processes remain, even after a clean restart. Our next step is to reboot the box, but before we do, I'm curious whether anyone else has seen this, why it happened, and whether anyone knows a way to prevent it from happening again.

thanks
-- 
jeremy ashcraft
operations/development
EDucation GATEways
jashcraft@edgate.com
I forgot to include the process list:

postgres 13214 5142 0 Jan09 ? 00:00:00 postgres: postgre edgate 10.1.1.3 authentication
postgres 13215 5142 0 Jan09 ? 00:00:00 postgres: postgre edgate 10.1.1.1 authentication
postgres 13216 5142 0 Jan09 ? 00:00:00 postgres: postgre sn_master 10.1.1.3 authentication
postgres 13217 5142 0 Jan09 ? 00:00:00 postgres: snuser sn_master 10.1.1.3 authentication
postgres 13218 5142 0 Jan09 ? 00:00:00 postgres: postgre edgate 10.1.1.1 authentication
postgres 13219 5142 0 Jan09 ? 00:00:00 postgres: snuser sn_master 10.1.1.3 authentication
postgres 13220 5142 0 Jan09 ? 00:00:00 postgres: snuser sn_master 10.1.1.3 authentication
postgres 13221 5142 0 Jan09 ? 00:00:00 postgres: snuser sn_master 10.1.1.3 authentication
postgres 13222 5142 0 Jan09 ? 00:00:00 postgres: postgre sn_master 10.1.1.3 authentication
postgres 13223 5142 0 Jan09 ? 00:00:00 postgres: snuser sn_master 10.1.1.3 authentication
postgres 13224 5142 0 Jan09 ? 00:00:00 postgres: postgre sn_master 10.1.1.3 authentication
postgres 13225 5142 0 Jan09 ? 00:00:00 postgres: postgre sn_master 10.1.1.3 authentication
postgres 13226 5142 0 Jan09 ? 00:00:00 postgres: snuser sn_master 10.1.1.3 authentication
postgres 13227 5142 0 Jan09 ? 00:00:00 postgres: snuser sn_master 10.1.1.3 authentication
postgres 13228 5142 0 Jan09 ? 00:00:00 postgres: postgre sn_master 10.1.1.3 authentication
postgres 13229 5142 0 Jan09 ? 00:00:00 postgres: postgre sn_master 10.1.1.3 authentication

-- 
jeremy ashcraft
operations/development
EDucation GATEways
jashcraft@edgate.com
If it's not sensitive information, what does this show?

lsof | grep 'pid of hung process'

Regards,
Ben Kim
Developer
http://benix.tamu.edu
Ben Kim wrote:
> If it's not sensitive information, what does this show?
>
> lsof | grep 'pid of hung process'

from the save-a-process committee:

lsof -p 'pid of hung process'

will give the same info :)

LER
-- 
Larry Rosenman
Database Support Engineer
PERVASIVE SOFTWARE. INC.
12365B RIATA TRACE PKWY 3015 AUSTIN TX 78727-6531
Tel: 512.231.6173 Fax: 512.231.6597
Email: Larry.Rosenman@pervasive.com
Web: www.pervasive.com
>> If it's not sensitive information, what does this show?
>>
>> lsof | grep 'pid of hung process'

From my systems guy:

in /proc
codd 13262 # cat status {proc "filesystem"}
Name: postmaster
State: D (disk sleep)

Because of the "D" state, it can't be killed, as it is not interruptible (waiting on I/O?).

codd opt # lsof -p 13263
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
postmaste 13263 postgres cwd DIR 8,17 464 4030 /app1/user/postgres
postmaste 13263 postgres rtd DIR 8,3 488 2 /
postmaste 13263 postgres txt REG 8,17 2552859 5012 /app1/pg/7.4.9/bin/postgres
postmaste 13263 postgres mem REG 0,0 0 [heap] (stat: No such file or directory)
postmaste 13263 postgres DEL REG 0,7 0 /SYSV0052e2c1
postmaste 13263 postgres mem REG 8,3 35428 5729 /lib/libnss_files-2.3.4.so
postmaste 13263 postgres mem REG 8,5 18852 6895 /usr/lib/libgpm.so.1.19.0
postmaste 13263 postgres mem REG 8,3 266284 5965 /lib/libncurses.so.5.4
postmaste 13263 postgres mem REG 8,3 1167808 5977 /lib/libc-2.3.4.so
postmaste 13263 postgres mem REG 8,3 151896 5755 /lib/libm-2.3.4.so
postmaste 13263 postgres mem REG 8,3 10620 5987 /lib/libdl-2.3.4.so
postmaste 13263 postgres mem REG 8,3 75848 6003 /lib/libnsl-2.3.4.so
postmaste 13263 postgres mem REG 8,3 60884 5998 /lib/libresolv-2.3.4.so
postmaste 13263 postgres mem REG 8,3 18424 6006 /lib/libcrypt-2.3.4.so
postmaste 13263 postgres mem REG 8,3 173996 6007 /lib/libreadline.so.4.3
postmaste 13263 postgres mem REG 8,3 63204 5976 /lib/libz.so.1.2.2
postmaste 13263 postgres mem REG 8,3 95392 5915 /lib/ld-2.3.4.so
postmaste 13263 postgres 0r CHR 1,3 2767 /dev/null
postmaste 13263 postgres 1w REG 8,7 7513385 3649 /var/log/pg/ruby.log.bak1
postmaste 13263 postgres 2w REG 8,7 7513385 3649 /var/log/pg/ruby.log.bak1
postmaste 13263 postgres 5u sock 0,4 11790654 can't identify protocol
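
The manual check above (reading the State line of /proc/<pid>/status) could be repeated for every process along these lines. This is only a rough sketch, assuming a Linux-style /proc layout; the function names proc_state and list_d_state are made up for illustration and are not part of PostgreSQL or any standard tool.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not from any existing tool).

# Print the one-letter state code from a /proc-style status file.
# $1: path to a status file, e.g. /proc/13262/status
proc_state() {
    awk '$1 == "State:" { print $2; exit }' "$1"
}

# Print the PIDs whose state is D (uninterruptible sleep).
# $1: root of a /proc-style tree; defaults to /proc
list_d_state() {
    root="${1:-/proc}"
    for status in "$root"/[0-9]*/status; do
        # Processes can exit while we iterate, so skip unreadable entries.
        [ -r "$status" ] || continue
        if [ "$(proc_state "$status" 2>/dev/null)" = "D" ]; then
            pid="${status%/status}"   # strip the trailing /status
            echo "${pid##*/}"         # keep only the numeric directory name
        fi
    done
}
```

Calling list_d_state with no argument scans the live /proc. A process that shows up on every run over several minutes, like the backends listed earlier in the thread, is a likelier sign of stuck I/O than one that appears only once, since brief D states are normal during disk access.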
Jeremy Ashcraft <jashcraft@edgate.com> writes:
> in /proc
> codd 13262 # cat status {proc "filesystem"}
> Name: postmaster
> State: D (disk sleep)
> Because of the "D" state, it can't be killed as it is not interruptible
> (waiting on IO ?).

If the process is stuck in D state then it's not Postgres' fault. You're looking at a hardware problem, or if the database is mounted via NFS then it might be an NFS-protocol-level problem. In any case you need to call out the kernel and hardware troops, not us database weenies ...

regards, tom lane
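
One way to follow up on the NFS question is to look up which mount holds the database directory and what filesystem type it has. The sketch below is a guess at how that check might be scripted, assuming a Linux /proc/mounts table; fs_type_for and is_nfs are hypothetical names, and the thread never states where the data directory lives, so any path passed in is the reader's own assumption.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not from any existing tool).

# Print the filesystem type of the mount that holds a path, using a
# /proc/mounts-style table (fields: device mountpoint fstype options ...).
# $1: absolute path to check, e.g. the PostgreSQL data directory
# $2: mounts table to read; defaults to /proc/mounts
fs_type_for() {
    awk -v path="$1" '
        {
            mp = $2
            # A mount point applies if it is "/", equals the path,
            # or is a parent directory of the path.
            if (mp == "/" || path == mp || index(path, mp "/") == 1) {
                # Prefer the longest mount point; on ties the later entry wins.
                if (length(mp) >= best_len) { best_len = length(mp); best = $3 }
            }
        }
        END { if (best != "") print best }
    ' "${2:-/proc/mounts}"
}

# Succeed if the path lives on an NFS mount (type nfs or nfs4).
is_nfs() {
    case "$(fs_type_for "$1" "$2")" in
        nfs|nfs4) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `is_nfs /app1/user/postgres && echo "on NFS"` would test the working directory shown in the lsof output, on the assumption that the database files sit under it. If the result is a local filesystem, the kernel log (for instance via dmesg) is the usual next place to look for disk or controller errors.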