Обсуждение: rebellious pg stats collector (reopened case)
PostgreSQL 8.3.5, the system is now stable (uptime > 10 days). PostgreSQL stats collector uses 100% CPU forever: On Thursday: last pid: 29509; load averages: 2.36, 2.01, 2.03 up 5+17:28:56 04:02:53 196 processes: 3 running, 184 sleeping, 9 zombie CPU states: 5.3% user, 0.0% nice, 15.3% system, 0.0% interrupt, 79.4% idle Mem: 1009M Active, 5995M Inact, 528M Wired, 354M Cache, 214M Buf, 12M Free Swap: 16G Total, 500K Used, 16G Free PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND 26709 pgsql 1 106 0 22400K 6832K CPU6 6 973:52 99.02% postgres Today: last pid: 71326; load averages: 2.69, 3.15, 2.92 up 10+20:13:44 06:47:41 176 processes: 3 running, 166 sleeping, 7 zombie CPU states: % user, % nice, % system, % interrupt, % idle Mem: 928M Active, 5868M Inact, 557M Wired, 380M Cache, 214M Buf, 172M Free Swap: 16G Total, 620K Used, 16G Free PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND 44689 pgsql 1 107 0 22400K 7060K CPU6 6 748:07 99.02% postgres 64221 www 1 96 0 158M 27628K select 5 0:18 4.20% httpd 68567 www 1 20 0 151M 23092K lockf 0 0:03 2.39% httpd ... shopzeus# uptime 6:48AM up 10 days, 20:14, 1 user, load averages: 2.21, 3.01, 2.87 More than 10 hours on a dual-quad core Xeon 5420??? We have two databases, total database size is about 15GB. (The stats collector also uses significant disk I/O.) Thursday: # date Thu Dec 11 04:05:00 EST 2008 # ls -l ~pgsql/data/ # ls -l ~pgsql/data/global/pgstat.stat -rw------- 1 pgsql pgsql 231673 Dec 10 12:01 /usr/local/pgsql/data/global/pgstat.stat Today: #date Tue Dec 16 06:48:27 EST 2008 # cd ~pgsql/data # ls -l global/pgstat.stat -rw------- 1 pgsql pgsql 232358 Dec 15 18:45 global/pgstat.stat Looks like the pgstat.stat was not updated since the pg stats collector (re)started. #uname -a FreeBSD shopzeus.com 7.0-RELEASE-p5 FreeBSD 7.0-RELEASE-p5 #0: Mon Nov 17 21:37:25 EST 2008 root@shopzeus.chello.hu:/usr/obj/usr/src/sys/SHOPZEUS amd64 After restarting the postmaster, the process disappeares for a while (some hours, sometimes for one day), then it start updating the stat file correctly. Please advise. Thanks, Laszlo
Laszlo Nagy wrote: > PostgreSQL 8.3.5, the system is now stable (uptime > 10 days). > PostgreSQL stats collector uses 100% CPU forever: Could you grab a few stack traces from it and post them? Also possibly useful, leave strace running on the pgstat process for a while and post the output somewhere. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera wrote: > Laszlo Nagy wrote: > >> PostgreSQL 8.3.5, the system is now stable (uptime > 10 days). >> PostgreSQL stats collector uses 100% CPU forever: >> > > Could you grab a few stack traces from it and post them? Also possibly > useful, leave strace running on the pgstat process for a while and post > the output somewhere. > Very interesting results. Before I used 'truss' on the process, it was: 78816 pgsql 1 106 0 22400K 7100K CPU6 6 24.2H 99.02% postgres After I started "truss -p 78815" the first message I got was: SIGNAL 17 (SIGSTOP) and the it waited for some seconds. Then my screen was filled with messages like this: poll({7/POLLIN|POLLERR},1,2000) = 1 (0x1) recvfrom(7,"\^A\0\0\0\M-X\^C\0\0\M^T\f\r\0\n"...,1000,0x0,NULL,0x0) = 984 (0x3d8) poll({7/POLLIN|POLLERR},1,2000) = 1 (0x1) recvfrom(7,"\^A\0\0\0\M-X\^C\0\0\M^T\f\r\0\n"...,1000,0x0,NULL,0x0) = 984 (0x3d8) poll({7/POLLIN|POLLERR},1,2000) = 1 (0x1) recvfrom(7,"\^A\0\0\0\M^X\^A\0\0\M^T\f\r\0"...,1000,0x0,NULL,0x0) = 408 (0x198) poll({7/POLLIN|POLLERR},1,2000) = 1 (0x1) recvfrom(7,"\b\0\0\0@\0\0\0\0\0\0\0\0\0\0\0"...,1000,0x0,NULL,0x0) = 64 (0x40) poll({7/POLLIN|POLLERR},1,2000) = 1 (0x1) recvfrom(7,"\b\0\0\0@\0\0\0\0\0\0\0\0\0\0\0"...,1000,0x0,NULL,0x0) = 64 (0x40) poll({7/POLLIN|POLLERR},1,2000) = 1 (0x1) recvfrom(7,"\^A\0\0\0\M-X\0\0\0\M^O\v\b\0\^B"...,1000,0x0,NULL,0x0) = 216 (0xd8) poll({7/POLLIN|POLLERR},1,2000) = 1 (0x1) recvfrom(7,"\^A\0\0\0\M-x\^A\0\0\0\0\0\0\^E"...,1000,0x0,NULL,0x0) = 504 (0x1f8) poll({7/POLLIN|POLLERR},1,2000) = 1 (0x1) recvfrom(7,"\^A\0\0\0\M-X\0\0\0\M^T\f\r\0\^B"...,1000,0x0,NULL,0x0) = 216 (0xd8) poll({7/POLLIN|POLLERR},1,2000) = 1 (0x1) recvfrom(7,"\^A\0\0\0\M-x\^A\0\0\0\0\0\0\^E"...,1000,0x0,NULL,0x0) = 504 (0x1f8) poll({7/POLLIN|POLLERR},1,2000) = 1 (0x1) recvfrom(7,"\^A\0\0\0\M-X\0\0\0\M^T\f\r\0\^B"...,1000,0x0,NULL,0x0) = 216 (0xd8) poll({7/POLLIN|POLLERR},1,2000) = 1 (0x1) recvfrom(7,"\^A\0\0\0\M-x\^A\0\0\0\0\0\0\^E"...,1000,0x0,NULL,0x0) = 504 (0x1f8) poll({7/POLLIN|POLLERR},1,2000) = 1 (0x1) recvfrom(7,"\b\0\0\0@\0\0\0\0\0\0\0\0\0\0\0"...,1000,0x0,NULL,0x0) = 64 (0x40) poll({7/POLLIN|POLLERR},1,2000) = 1 (0x1) I terminated the debugger with Ctrl+C. Then I started it again: #truss -p 78815 SIGNAL 17 (SIGSTOP) poll({7/POLLIN|POLLERR},1,2000) = 0 (0x0) getppid(0x1,0x1,0x7d0,0x8014606cc,0xffffffff80a579c0,0x7fffffffceb8) = 95860 (0x17674) open("global/pgstat.tmp",O_WRONLY|O_CREAT|O_TRUNC,0666) = 4 (0x4) fstat(4,{mode=-rw------- ,inode=73223196,size=0,blksize=4096}) = 0 (0x0) write(4,"\M^W\M-<\M-%\^A\^F\n\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\M-Z\^C\0\0\0\0\0\0\M-d\M^E\M-0u"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0+\0\0\0\0\0\0\0\^N\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\^C\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0T\^T\^N\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\^R\M-(\M^[\v\M-^\M-0A\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\^N\0\0\0\0\0\0\0\\n\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0T\M-A\n\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0~\M-d\M-fp\n\M-^\M-0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\^E\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"$\0\0\0\0\0\0+\^V\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0Z\M-F%"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0|\^B\0\0\0\0\0\0<\M-I\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0T\M-f\t\^V\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\M-H'\M-4\M-N\n\M-^\M-0A"...,4096) = 4096 (0x1000) write(4,"\0\0\^N\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\b\0\0\0\0\0L\^B\0\0\0\0\0\0\M^^"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0dD\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\09\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0>\M-h\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0T%;\r\0\0\0\0\0 \0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\M-I\\j\n\M-^\M-0A\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0Y\240\^A\0\0\0\0\0\^C\M-3"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\^N\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\^A\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0P\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\^A\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\^P\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\M-P\^R\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,3119) = 3119 (0xc2f) close(4) = 0 (0x0) rename("global/pgstat.tmp","global/pgstat.stat") = 0 (0x0) process exit, rval = 0 I did NOT send the SIGSTOP signal to the process. The pgstat collector was running at 100%CPU for 24 hours, when I attached the debugger, and attaching the debugger somehow caused the process to get a SIGSTOP, but the process did not exit. When I started the debugger for the second time, it got a SIGSTOP again, and this made the collector exiting the endless loop (instead of stopping it). What is the next step? Laszlo
Laszlo Nagy wrote: > Alvaro Herrera wrote: >> Laszlo Nagy wrote: >> >>> PostgreSQL 8.3.5, the system is now stable (uptime > 10 days). >>> PostgreSQL stats collector uses 100% CPU forever: >>> >> >> Could you grab a few stack traces from it and post them? Also possibly >> useful, leave strace running on the pgstat process for a while and post >> the output somewhere. >> > Very interesting results. Before I used 'truss' on the process, it was: > > 78816 pgsql 1 106 0 22400K 7100K CPU6 6 24.2H 99.02% > postgres > > After I started "truss -p 78815" the first message I got was: It was 78816 and you traced 78815? Are you sure the process with 24h of CPU was pgstat? -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
> It was 78816 and you traced 78815? Are you sure the process with 24h of > CPU was pgstat? > I'm sorry that was a typo. Of course I traced the good process (proof is that at the end it renamed a file to "global/pgstat.stat". And yes, "top" showed 24H in the TIME column and 99% in the WCPU column. Also this process used 50-90% of disk I/O continuously, slowing down the computer. I'm sure that it will happen again, then I'm going to post another trace (and make sure that I trace the good process). Best, Laszlo
OK, here is the new test. shopzeus# ps l -p 39766 -p 39767 -p 39769 UID PID PPID CPU PRI NI VSZ RSS MWCHAN STAT TT TIME COMMAND 70 39766 78806 0 96 0 451960 423896 select Ss ?? 8:41.85 postgres: writer process (postgres) 70 39767 78806 0 96 0 451960 7184 select Ss ?? 0:58.75 postgres: wal writer process (postgres) 70 39769 78806 0 96 0 22400 7008 select Ss ?? 14:42.39 postgres: stats collector process (postgres) top -otime 10 last pid: 7984; load averages: 1.38, 1.51, 1.49 up 16+18:18:22 05:02:20 170 processes: 3 running, 161 sleeping, 1 stopped, 5 zombie CPU states: % user, % nice, % system, % interrupt, % idle Mem: 959M Active, 5947M Inact, 562M Wired, 275M Cache, 214M Buf, 164M Free Swap: 16G Total, 1088K Used, 16G Free PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND 78806 pgsql 1 98 0 441M 6960K select 4 67:14 0.39% postgres 39769 pgsql 1 96 0 22400K 7008K select 7 14:43 0.00% postgres 39766 pgsql 1 96 0 441M 414M select 3 8:42 0.00% postgres 10602 root 1 96 0 149M 18916K select 4 5:30 0.00% httpd 67535 root 1 4 0 3720K 1336K kqread 0 3:54 0.00% dovecot 795 root 1 4 0 8188K 1452K kqread 4 3:21 0.00% master 95791 shopzeus 3 116 20 59892K 17440K select 3 2:22 0.00% python 32397 postfix 1 4 0 8280K 1980K kqread 6 1:23 0.00% qmgr 67536 root 1 4 0 5996K 1448K kqread 4 1:07 0.00% dovecot-auth 525 root 1 96 0 4684K 1156K select 3 1:03 0.00% syslogd In other words, the stats collector is running since 14 hours. shopzeus# date Mon Dec 22 05:03:24 EST 2008 shopzeus# ls -l ~pgsql/data/global/pgstat.stat -rw------- 1 pgsql pgsql 232084 Dec 22 05:03 /usr/local/pgsql/data/global/pgstat.stat Looks like it is updating the stats file continuously. Now I try to trace: shopzeus# truss -p 39769 poll({7/POLLIN|POLLERR},1,2000) = 1 (0x1) recvfrom(7,"\^A\0\0\0\M-X\^C\0\0\M^T\f\r\0\n"...,1000,0x0,NULL,0x0) = 984 (0x3d8) poll({7/POLLIN|POLLERR},1,2000) = 1 (0x1) recvfrom(7,"\^A\0\0\0\M-X\^C\0\0\M^T\f\r\0\n"...,1000,0x0,NULL,0x0) = 984 (0x3d8) poll({7/POLLIN|POLLERR},1,2000) = 1 (0x1) recvfrom(7,"\^A\0\0\0\M-X\^C\0\0\M^T\f\r\0\n"...,1000,0x0,NULL,0x0) = 984 (0x3d8) poll({7/POLLIN|POLLERR},1,2000) = 1 (0x1) recvfrom(7,"\^A\0\0\0X\^B\0\0\M^T\f\r\0\^F\0"...,1000,0x0,NULL,0x0) = 600 (0x258) poll({7/POLLIN|POLLERR},1,2000) ERR#4 'Interrupted system call' SIGNAL 14 (SIGALRM) sigreturn(0x7fffffffca70,0x0,0x7fffffffca70,0xffffffff809fb5c0,0x7fffffffffc0,0x7fffffffca60) ERR#4 'Interrupted system call' getppid(0x1,0x1,0x801901120,0x4,0x7fffffffd39f,0x6) = 11204 (0x2bc4) open("global/pgstat.tmp",O_WRONLY|O_CREAT|O_TRUNC,0666) = 4 (0x4) fstat(4,{mode=-rw------- ,inode=73223197,size=0,blksize=4096}) = 0 (0x0) write(4,"\M^W\M-<\M-%\^A`\^B\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"l\0\0\0\0\0\0\0\M-T\M-s.\M-`\M-r"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\t\0\0\0\0\0\0\0\^C"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\^C\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0T\^T\^N\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\M-aKo\M-s\M-a\M-0A\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\^C\0\0\0\0\0\0\0\M-&"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0T\M-A\n\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\M-_o\M-T\M-u\M-p"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\^E\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\^A\0\0\0\0\0\0\^?\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0X\M-<"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\b\^D@f\0\0\0\0\0\0\0F \0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0T\M-2\f\b\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0;\0\0\0\0\0\0\0\^R\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\n\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\r\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0T_\n\0\0\0\0\0\0\M-*R\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"r\M^W\M-`\M-p\M-a\M-0A\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\M-a\M-p\M-a\M-0A\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0c\^Q\^A\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\^F\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0Z\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\^D\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000) write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,2708) = 2708 (0xa94) close(4) = 0 (0x0) rename("global/pgstat.tmp","global/pgstat.stat") = 0 (0x0) process exit, rval = 0 And the stats collector exited! It cannot be by accident. Last time I checked, it had the same problem: tried to poll something. It was in an inifinte loop, polling something but always got POLERR. Then because I started to trace it, it exited with rval=0 or rval=1. I do not understand why it exited after I started to trace it? Is this some kind of bug? Thanks, Laszlo
Laszlo Nagy <gandalf@shopzeus.com> writes: > And the stats collector exited! It cannot be by accident. Last time I > checked, it had the same problem: tried to poll something. It was in an > inifinte loop, polling something but always got POLERR. Then because I > started to trace it, it exited with rval=0 or rval=1. > I do not understand why it exited after I started to trace it? Is this > some kind of bug? AFAICS, the only ways to reach the exit(0) in PgstatCollectorMain are (1) need_exit becomes set (and the only way for that to happen is to receive a SIGQUIT signal); or (2) PostmasterIsAlive(true) fails, which implies that getppid() is returning something different than it used to. I wonder whether your tracing tool is affecting the result of getppid(). Most people would consider that a bug in the tracing tool. regards, tom lane
Tom Lane wrote: > I wonder whether your tracing tool is affecting the result of > getppid(). Most people would consider that a bug in the tracing tool. Probably having a close look at the PPID column in ps or top during the truss run would prove this. Still, the actual problem being debugged is something else. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera <alvherre@commandprompt.com> writes: > Tom Lane wrote: >> I wonder whether your tracing tool is affecting the result of >> getppid(). Most people would consider that a bug in the tracing tool. > Probably having a close look at the PPID column in ps or top during the > truss run would prove this. It's at least conceivable that the result being returned inside the process is different from what the rest of the world sees. What'd prove this one way or the other is to write a trivial test program along the lines of while(1) { sleep(1); printf("ppid = %d\n", getppid()); } and see if its output changes when you start to trace it. > Still, the actual problem being debugged is something else. Agreed, but we need to understand what the tools being used to investigate the problem are doing ... regards, tom lane
> and see if its output changes when you start to trace it. > %cat test.c #include <stdio.h> int main() { while(1) { sleep(5); printf("ppid = %d\n", getppid()); } } %gcc -o test test.c %./test ppid = 47653 ppid = 47653 ppid = 47653 # Started "truss -p 48864" here! ppid = 49073 ppid = 49073 ppid = 49073 > Agreed, but we need to understand what the tools being used to > investigate the problem are doing ... > Unfortunately, I'm not able to install strace: # pwd /usr/ports/devel/strace # make ===> strace-4.5.7 is only for i386, while you are running amd64. *** Error code 1 Stop in /usr/ports/devel/strace. I'll happily install any trace tool, but have no clue which one would help.
Tom Lane wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: > >> Tom Lane wrote: >> >>> I wonder whether your tracing tool is affecting the result of >>> getppid(). Most people would consider that a bug in the tracing tool. >>> > > I wrote to an official the FreeBSD list about this getppid() problem but got no answer other than that "this behaviour is documented". :-( The problem is still there: PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND 11205 pgsql 1 104 0 22400K 7112K CPU5 5 159.7H 99.02% postgres 100% CPU since 159 hours! What can I do? Instead of tracing system calls, is there a way to start the stats collector in debug mode? Or maybe is it possible to change the source code, and disable the "is postmaster alive" check for testing? Thanks
Laszlo Nagy <gandalf@shopzeus.com> writes: > maybe is it possible to change the source code, and disable the "is > postmaster alive" check for testing? Rather than disabling it, it'd probably be more convenient to make any getppid value except 1 (the init process) be treated as "it's alive". Otherwise you'll have trouble with the collector not stopping when you want it to. Look into src/backend/storage/ipc/pmsignal.c: - return (getppid() == PostmasterPid); + return (getppid() != 1); (Obviously this is a hack, but it should work on any Unixish system) regards, tom lane
> alive". Otherwise you'll have trouble with the collector not stopping > when you want it to. Look into src/backend/storage/ipc/pmsignal.c: > > - return (getppid() == PostmasterPid); > + return (getppid() != 1); > > (Obviously this is a hack, but it should work on any Unixish system) > Thanks. I'm going to try this next weekend (when nobody works on the database) and come back with the results.
Laszlo Nagy írta: > >> alive". Otherwise you'll have trouble with the collector not stopping >> when you want it to. Look into src/backend/storage/ipc/pmsignal.c: >> >> - return (getppid() == PostmasterPid); >> + return (getppid() != 1); >> >> (Obviously this is a hack, but it should work on any Unixish system) >> Sorry for coming back so late. I changed this, and reinstalled postgresql but it won't start. Some messages from the log: Jan 8 00:40:29 shopzeus postgres[80522]: [3814-1] FATAL: the database system is in recovery mode Jan 8 00:40:30 shopzeus postgres[80559]: [3849-1] FATAL: the database system is in recovery mode Jan 8 00:40:30 shopzeus postgres[80632]: [3917-1] FATAL: the database system is in recovery mode Jan 8 00:40:31 shopzeus postgres[80662]: [3946-1] FATAL: the database system is in recovery mode Jan 8 00:40:31 shopzeus postgres[80700]: [3981-1] FATAL: the database system is in recovery mode Jan 8 00:40:31 shopzeus postgres[80734]: [4014-1] FATAL: the database system is in recovery mode Jan 8 00:40:31 shopzeus postgres[80738]: [4017-1] FATAL: the database system is in recovery mode Jan 8 00:40:32 shopzeus postgres[80768]: [4046-1] FATAL: the database system is in recovery mode Jan 8 00:40:32 shopzeus postgres[80802]: [4078-1] FATAL: the database system is in recovery mode Jan 8 00:40:32 shopzeus postgres[80806]: [4079-1] WARNING: terminating connection because of crash of another server process Jan 8 00:40:32 shopzeus postgres[80806]: [4079-2] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server Jan 8 00:40:32 shopzeus postgres[80806]: [4079-3] process exited abnormally and possibly corrupted shared memory. Jan 8 00:40:32 shopzeus postgres[80806]: [4079-4] HINT: In a moment you should be able to reconnect to the database and repeat your command. Jan 8 00:40:32 shopzeus postgres[80835]: [4109-1] FATAL: the database system is in recovery mode Jan 8 00:40:32 shopzeus postgres[80837]: [4110-1] FATAL: the database system is in recovery mode Jan 8 00:40:33 shopzeus postgres[80870]: [4141-1] FATAL: the database system is in recovery mode Jan 8 00:40:33 shopzeus postgres[80900]: [4170-1] FATAL: the database system is in recovery mode Jan 8 00:40:34 shopzeus postgres[80938]: [4205-1] FATAL: the database system is in recovery mode Jan 8 00:40:34 shopzeus postgres[80967]: [4233-1] FATAL: the database system is in recovery mode (I had to reinstall the original version because it will be used today.)