Обсуждение: pg_stat_activity showing non-existent processes
Hello; Briefly, we've been fighting an "old idle transaction" problem on our Pg 8.0.3 Solaris 2.9 production system for a long time. This is due to some quirks in our app server code (to be fixed ASAP <groan>). Hourly we run a script that SIGTERMs all backends reported as being in idle transaction state for 6 hours or better. Many, many runs of this procedure during past several weeks; no problem. Suddently, we have a condition where the pg_stat_activity view is reporting several such backends and there is no such PID in the system any more. These were either successfully terminated by an earlier script run, died off abnormally or even by normal client disconnect. (Unknown). At any rate; I'm wondering what possible causes might be responsible for pg_stat_activity's underlying functions to lose track of the valid process list? Great big thanks! -- ------------------------------------------------------------------------------- Jerry Sievers 305 854-3001 (home) WWW ECommerce Consultant 305 321-1144 (mobile http://www.JerrySievers.com/
Jerry Sievers <jerry@jerrysievers.com> writes: > At any rate; I'm wondering what possible causes might be responsible > for pg_stat_activity's underlying functions to lose track of the valid > process list? It sounds like the stats collector missed a few "backend quit" messages. This isn't real surprising: the stats messaging mechanism is intentionally designed to drop messages under severe load, rather than slow down backends. We recently put in a filter that prevents reporting pg_stat_activity lines for backends that are dead according to the up-to-date list in shared memory. I don't think that's in 8.0.3 though. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > Jerry Sievers <jerry@jerrysievers.com> writes: > > At any rate; I'm wondering what possible causes might be responsible > > for pg_stat_activity's underlying functions to lose track of the valid > > process list? > > It sounds like the stats collector missed a few "backend quit" > messages. This isn't real surprising: the stats messaging mechanism is > intentionally designed to drop messages under severe load, rather than > slow down backends. Tom, I appreciate your comments on this. > We recently put in a filter that prevents reporting pg_stat_activity > lines for backends that are dead according to the up-to-date list in > shared memory. I don't think that's in 8.0.3 though. This machine is going to be upgraded to 8.1 ASAP. In the meantime, I may adjust the cron run time for the batch job that terminates these sessions in case it's running concurrently with other batch traffic. I believe the thing runs hourly at the 00 minute and this may likely be a time that we're kicking off all sorts of other jobs too. > ---------------------------(end of broadcast)--------------------------- > TIP 4: Have you searched our list archives? > > http://archives.postgresql.org > -- ------------------------------------------------------------------------------- Jerry Sievers 305 854-3001 (home) WWW ECommerce Consultant 305 321-1144 (mobile http://www.JerrySievers.com/
>>> On Sat, Mar 25, 2006 at 8:40 pm, in message <22874.1143340808@sss.pgh.pa.us>, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Jerry Sievers <jerry@jerrysievers.com> writes: >> At any rate; I'm wondering what possible causes might be responsible >> for pg_stat_activity's underlying functions to lose track of the valid >> process list? > > It sounds like the stats collector missed a few "backend quit" > messages. This isn't real surprising: the stats messaging mechanism is > intentionally designed to drop messages under severe load, rather than > slow down backends. Is there any way to tweak this in favor of more accurate information, even if has a performance cost? We're finding that during normal operations we're not seeing most connections added to the pg_stat_activity table. We would like to be able to count on accurate information there. We've been considering adding functions to get at the underlying structures to be able to retrieve it, but it would make a lot of sense (for us, anyway) to make this table accurate instead. What would be involved in that? Would it improve the accuracy of the other statistics, as well? Would anyone else be interested in something like this (probably controlled by a configuration option), or are we unique in this regard? -Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Is there any way to tweak this in favor of more accurate information, > even if has a performance cost? We're finding that during normal > operations we're not seeing most connections added to the > pg_stat_activity table. We would like to be able to count on accurate > information there. That's basically a non-starter because of the delay in reporting from the stats collector process (ie, even if the information was "completely accurate" it'd still be stale by the time that your code gets its hands on it). I think you'd be talking about a complete redesign of the stats subsystem to be able to use it that way. Having said that, though, I'd be pretty surprised if the stats subsystem was dropping more than a small fraction of messages --- I would think that could only occur under very heavy load, and if that's your normal operating state then it's time to upgrade your hardware ;-). Maybe you should investigate a bit more closely to find out why it's dropping so much. regards, tom lane
>>> On Mon, Apr 3, 2006 at 11:52 am, in message <14779.1144083156@sss.pgh.pa.us>, Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: >> Is there any way to tweak this in favor of more accurate information, >> even if has a performance cost? We're finding that during normal >> operations we're not seeing most connections added to the >> pg_stat_activity table. We would like to be able to count on accurate >> information there. > > That's basically a non- starter because of the delay in reporting from > the stats collector process (ie, even if the information was "completely > accurate" it'd still be stale by the time that your code gets its hands > on it). I think you'd be talking about a complete redesign of the stats > subsystem to be able to use it that way. We want this for our monitoring software, to raise an alert when the connection pool diverges from its nominal configuration beyond prescribed limits or in excess of a prescribed duration. What we're looking for is not necessarily a table which is accurate immediately, but one which won't entirely miss a connection. Even then, if it only misbehaves under extreme load, that would be OK; such extreme usage might be worthy of note in and of itself. Since we have converted to PostgreSQL we have not had this monitoring, and folks are nervous that we will not detect a struggling middle tier before it fails. (Not something that happens often, but we really hate having users tell us that something is broken, versus spotting the impending failure and correcting it before it fails.) > Having said that, though, I'd be pretty surprised if the stats subsystem > was dropping more than a small fraction of messages --- I would think > that could only occur under very heavy load, and if that's your normal > operating state then it's time to upgrade your hardware ;- ). We have a pair of database servers for our transaction repository. Each has four Xeon processors. One of these is Windows, one is Linux. On the Windows machine, I see 10% CPU utilization. On the Linux machine I see a load average of 0.30. The Linux machine seems to be very reliable about showing the connections. The Windows machine, when I refresh a 20-connection pool, I either get no connections showing, or only a few. > Maybe you > should investigate a bit more closely to find out why it's dropping so > much. It is probably related to something we've been seeing in the PostgreSQL logs on the Windows servers: [2006-04-03 08:28:25.990 ] 2072 FATAL: could not read from statistics collector pipe: No error [2006-04-03 08:28:26.068 ] 2012 LOG: statistics collector process (PID 3268) was terminated by signal 1 We're going to patch to try to capture more info from WinSock. In src/port/pipe.c we plan to add before return ret in piperead(): if (ret == SOCKET_ERROR) { ereport(LOG, (errmsg_internal("SOCKET ERROR: %ui", WSAGetLastError()))); } I hope to post more info, and possibly a patch, tomorrow. -Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > It is probably related to something we've been seeing in the PostgreSQL > logs on the Windows servers: > [2006-04-03 08:28:25.990 ] 2072 FATAL: could not read from statistics > collector pipe: No error > [2006-04-03 08:28:26.068 ] 2012 LOG: statistics collector process (PID > 3268) was terminated by signal 1 We've heard reports of instability in the stats collector on Windows before, though I'm not sure if this is exactly the symptom --- check the list archives. Nobody's been able to track it down yet. regards, tom lane
"Lane Van Ingen" <lvaningen@ESNCC.com> writes: > Don't understand the 'target machine' message, either; in this case, we are > running the application and the database server on the same box. > 2006-04-04 03:12:05 FATAL: could not read from statistics collector pipe: > No error 2006-04-04 03:12:06 FATAL: could not write to statistics collector > pipe: No connection could be made because the target machine actively > refused it. I think that's Microsoftese for ECONNRESET, ie, the kernel bounced a packet for lack of any listening process to deliver it to. The real question is what's causing the collector to fail (the "could not read"). While it'd be easy to make it retry read attempts, the reason it considers that FATAL is that it really should never happen. I'd like to find out exactly what's happening before we try to fix it. As Kevin mentioned, adding some more debug printout would be helpful. regards, tom lane
Perhaps I might be able to help you track this problem down, but I could use some help. The limited amount of discussion on this I Googled up didn't help. We have never seen these messages before; we are assuming that their appearance now (we have been running OK since last August) is related to two recent changes we made to our Windows 2003 Server (SvcPk 1), PostgreSQL 8.0.4 installation: (1) started capturing statistics with the following config parms; all other parms were installation defaults: debug_pretty_print "on" log_min_duration_statement "60" log_min_error_statement "debug1" log_statement "ddl" log_truncate_on_rotation "on" stats_block_level "on" stats_command_string "on" stats_reset_on_server_start "on" stats_row_level "on" stats_start_collector "on" (2) added additional application workload to this server Don't understand the 'target machine' message, either; in this case, we are running the application and the database server on the same box. 2006-04-04 03:12:05 FATAL: could not read from statistics collector pipe: No error 2006-04-04 03:12:06 FATAL: could not write to statistics collector pipe: No connection could be made because the target machine actively refused it. 2006-04-04 04:16:58 FATAL: could not read from statistics collector pipe: No error 2006-04-04 04:16:58 FATAL: could not write to statistics collector pipe: No connection could be made because the target machine actively refused it. 2006-04-04 05:47:26 FATAL: could not read from statistics collector pipe: No error 2006-04-04 05:47:27 LOG: statistics collector process (PID 1776) was terminated by signal 1 -----Original Message----- From: pgsql-admin-owner@postgresql.org [mailto:pgsql-admin-owner@postgresql.org]On Behalf Of Tom Lane Sent: Tuesday, April 04, 2006 1:29 AM To: Kevin Grittner Cc: Jerry Sievers; pgsql-admin@postgresql.org; Peter Brant Subject: Re: [ADMIN] pg_stat_activity showing non-existent processes "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > It is probably related to something we've been seeing in the PostgreSQL > logs on the Windows servers: > [2006-04-03 08:28:25.990 ] 2072 FATAL: could not read from statistics > collector pipe: No error > [2006-04-03 08:28:26.068 ] 2012 LOG: statistics collector process (PID > 3268) was terminated by signal 1 We've heard reports of instability in the stats collector on Windows before, though I'm not sure if this is exactly the symptom --- check the list archives. Nobody's been able to track it down yet. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend
Does anyone out there have a Microsoft installation, know how to get the dump(s) desired to diagnose the problems below, know where dumps end up being placed, and know how to interpret dumps? I am willing to try to help solve these problems, but I need some help ... -----Original Message----- From: Tom Lane [mailto:tgl@sss.pgh.pa.us] Sent: Tuesday, April 04, 2006 3:52 PM To: Lane Van Ingen Cc: Kevin Grittner; Jerry Sievers; pgsql-admin@postgresql.org; Peter Brant Subject: Re: [ADMIN] pg_stat_activity showing non-existent processes "Lane Van Ingen" <lvaningen@ESNCC.com> writes: > Don't understand the 'target machine' message, either; in this case, we are > running the application and the database server on the same box. > 2006-04-04 03:12:05 FATAL: could not read from statistics collector pipe: > No error 2006-04-04 03:12:06 FATAL: could not write to statistics collector > pipe: No connection could be made because the target machine actively > refused it. I think that's Microsoftese for ECONNRESET, ie, the kernel bounced a packet for lack of any listening process to deliver it to. The real question is what's causing the collector to fail (the "could not read"). While it'd be easy to make it retry read attempts, the reason it considers that FATAL is that it really should never happen. I'd like to find out exactly what's happening before we try to fix it. As Kevin mentioned, adding some more debug printout would be helpful. regards, tom lane -----Original Message----- From: pgsql-admin-owner@postgresql.org [mailto:pgsql-admin-owner@postgresql.org]On Behalf Of Lane Van Ingen Sent: Tuesday, April 04, 2006 3:40 PM To: Tom Lane; Kevin Grittner Cc: Jerry Sievers; pgsql-admin@postgresql.org; Peter Brant Subject: Re: [ADMIN] pg_stat_activity showing non-existent processes Perhaps I might be able to help you track this problem down, but I could use some help. The limited amount of discussion on this I Googled up didn't help. We have never seen these messages before; we are assuming that their appearance now (we have been running OK since last August) is related to two recent changes we made to our Windows 2003 Server (SvcPk 1), PostgreSQL 8.0.4 installation: (1) started capturing statistics with the following config parms; all other parms were installation defaults: debug_pretty_print "on" log_min_duration_statement "60" log_min_error_statement "debug1" log_statement "ddl" log_truncate_on_rotation "on" stats_block_level "on" stats_command_string "on" stats_reset_on_server_start "on" stats_row_level "on" stats_start_collector "on" (2) added additional application workload to this server Don't understand the 'target machine' message, either; in this case, we are running the application and the database server on the same box. 2006-04-04 03:12:05 FATAL: could not read from statistics collector pipe: No error 2006-04-04 03:12:06 FATAL: could not write to statistics collector pipe: No connection could be made because the target machine actively refused it. 2006-04-04 04:16:58 FATAL: could not read from statistics collector pipe: No error 2006-04-04 04:16:58 FATAL: could not write to statistics collector pipe: No connection could be made because the target machine actively refused it. 2006-04-04 05:47:26 FATAL: could not read from statistics collector pipe: No error 2006-04-04 05:47:27 LOG: statistics collector process (PID 1776) was terminated by signal 1 -----Original Message----- From: pgsql-admin-owner@postgresql.org [mailto:pgsql-admin-owner@postgresql.org]On Behalf Of Tom Lane Sent: Tuesday, April 04, 2006 1:29 AM To: Kevin Grittner Cc: Jerry Sievers; pgsql-admin@postgresql.org; Peter Brant Subject: Re: [ADMIN] pg_stat_activity showing non-existent processes "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > It is probably related to something we've been seeing in the PostgreSQL > logs on the Windows servers: > [2006-04-03 08:28:25.990 ] 2072 FATAL: could not read from statistics > collector pipe: No error > [2006-04-03 08:28:26.068 ] 2012 LOG: statistics collector process (PID > 3268) was terminated by signal 1 We've heard reports of instability in the stats collector on Windows before, though I'm not sure if this is exactly the symptom --- check the list archives. Nobody's been able to track it down yet. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
On Apr 4, 2006, at 1:29 AM, Tom Lane wrote: > "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: >> It is probably related to something we've been seeing in the >> PostgreSQL >> logs on the Windows servers: > >> [2006-04-03 08:28:25.990 ] 2072 FATAL: could not read from >> statistics >> collector pipe: No error >> [2006-04-03 08:28:26.068 ] 2012 LOG: statistics collector process >> (PID >> 3268) was terminated by signal 1 > > We've heard reports of instability in the stats collector on Windows > before, though I'm not sure if this is exactly the symptom --- check > the list archives. Nobody's been able to track it down yet. BTW, I was (attempting) to do some performance testing on windows Sunday with pgbench and I was getting an error similar to that (sorry for being so vague, but I'm ~2000 miles from that server right now). This was on a database created with a scale factor of 40 running on a dual Xeon with plain SCSI drives (base tables on a seperate drive from pg_xlog) running on XP with SP1. Any attempt to run pgbench with 40 connections for more than about 50 transactions would result in that error (as well as some other ugliness). Based just on that limited experience it looks like the windows port could use a lot more shaking out with pgbench. Perhaps it would be a good idea to add pgbench to the buildfarm test. -- Jim C. Nasby, Database Architect decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?" -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Apr 3, 2006, at 12:52 PM, Tom Lane wrote: > "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: >> Is there any way to tweak this in favor of more accurate information, >> even if has a performance cost? We're finding that during normal >> operations we're not seeing most connections added to the >> pg_stat_activity table. We would like to be able to count on >> accurate >> information there. > > That's basically a non-starter because of the delay in reporting from > the stats collector process (ie, even if the information was > "completely > accurate" it'd still be stale by the time that your code gets its > hands > on it). I think you'd be talking about a complete redesign of the > stats > subsystem to be able to use it that way. BTW, there's some effort going into adding monitoring probes such as dtrace to PostgreSQL. These would likely be ideal for what you're trying to do. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461