Discussion: Problem with pgstat timeouts
About a month ago, I started receiving quite a few pgstat timeouts on my production database.
PostgreSQL 9.0.3 on amd64-portbld-freebsd8.1, compiled by GCC cc (GCC) 4.2.1 20070719 [FreeBSD], 64-bit
I am not sure where to start looking for the cause.
The database has been up since March without any restart.
db1# ps -aux | grep stat
pgsql 1287 1.0 0.0 22832 4724 ?? Ss 11Mar11 4265:13.87 postgres: stats collector process (postgres)
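For what it's worth, the uptime can also be confirmed from inside the database itself; a minimal check, nothing specific to this setup:

SELECT pg_postmaster_start_time(),
       now() - pg_postmaster_start_time() AS uptime;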
It appears that these timeouts are affecting the performance of the application – sometimes it hangs for a few seconds before data gets returned, which was not the case before this.
2011-12-22 03:45:04 MST WARNING: pgstat wait timeout
2011-12-22 03:45:04 MST WARNING: pgstat wait timeout
2011-12-22 03:45:04 MST WARNING: pgstat wait timeout
2011-12-22 03:45:04 MST WARNING: pgstat wait timeout
2011-12-22 03:45:41 MST WARNING: pgstat wait timeout
2011-12-22 03:45:41 MST WARNING: pgstat wait timeout
2011-12-22 03:45:41 MST WARNING: pgstat wait timeout
2011-12-22 03:46:18 MST WARNING: pgstat wait timeout
2011-12-22 03:46:18 MST WARNING: pgstat wait timeout
2011-12-22 03:47:32 MST WARNING: pgstat wait timeout
2011-12-22 03:47:32 MST WARNING: pgstat wait timeout
2011-12-22 03:47:32 MST WARNING: pgstat wait timeout
2011-12-22 03:48:09 MST WARNING: pgstat wait timeout
2011-12-22 03:48:09 MST WARNING: pgstat wait timeout
2011-12-22 03:48:09 MST WARNING: pgstat wait timeout
2011-12-22 03:49:26 MST WARNING: pgstat wait timeout
Runtime statistics settings for the database are as follows:
#------------------------------------------------------------------------------
# RUNTIME STATISTICS
#------------------------------------------------------------------------------
# - Query/Index Statistics Collector -
#track_activities = on
#track_counts = on
#track_functions = none # none, pl, all
#track_activity_query_size = 1024 # (change requires restart)
# On FreeBSD, this is a performance hog, so keep it off if you need speed
update_process_title = off
#update_process_title = on
#stats_temp_directory = 'pg_stat_tmp'
# - Statistics Monitoring -
#log_parser_stats = off
#log_planner_stats = off
#log_executor_stats = off
#log_statement_stats = off
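As a quick sanity check (a sketch that only reads the pg_settings view), the values the server is actually running with can be confirmed at runtime:

SELECT name, setting
FROM pg_settings
WHERE name IN ('track_activities', 'track_counts', 'track_functions',
               'update_process_title', 'stats_temp_directory');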
"Benjamin Krajmalnik" <kraj@servoyant.com> writes: > About a month ago, I started receiving quite a few pgstat timeouts on my > production database. > PostgreSQL 9.0.3 on amd64-portbld-freebsd8.1, compiled by GCC cc (GCC) > 4.2.1 20070719 [FreeBSD], 64-bit > I am not sure where to start looking for the cause. Yeah, so are the rest of us :-(. People have been reporting sometimes seeing these messages without any clear cause. We've been unable to find any bug in Postgres that would explain them, and the obvious explanation of "your system is so overloaded that the stats collector is too slow to respond" isn't very satisfying if response time seems snappy otherwise. I think we have seen one case where the problem was provably caused by significant (multi-second) clock skew between different processors, but that's likely not applicable to many people. Can you reconstruct what changed in your system around the time you started seeing these? > It appears that these timeouts are affecting the performance of the > application - sometimes it hangs for a few seconds before data gets > returned, which was not the case before this. Usually these timeouts are reported by autovacuum, and so should not be affecting the performance of foreground queries, at least not queries that are not explicitly requesting pgstats data. The fact that your log messages don't include any query context reinforces the idea that they are coming from autovacuum for you too. So if you are seeing query response time problems, maybe the pgstat timeouts are another symptom, not the cause, of an underlying too-much-load problem. You might try turning on query duration logging and see if you can correlate response-time glitches and pgstat timeouts to anything else observable, such as peaks of system load average. regards, tom lane
Thanks, Tom.

The system is running 16 cores (dual E5620's) and 96GB RAM.  Nothing changed on the system when this started happening - at least nothing in the configuration - it has not been touched since it was set up.  Load averages are between 6 and 8 (all three of them).  Memory-wise, I do not see a problem - plenty of free memory, and only 120K of swap has been used.

The only thing which I see when I run iostat is load on the drives every so often.  mfid0 is the database, mfid1 is the pg_xlog.

db1# iostat -x 5
                      extended device statistics
device     r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0       0.0     0    1.6   0
ad6        0.0      0.0      0.0       0.0     0    0.8   0
mfid0     11.6    172.2    171.7    2988.5     0  178.1   6
mfid1      0.1    734.0      1.1    5898.8     0    0.2   4
ar0        0.3      0.7      1.0      10.3     0  114.7   0
                      extended device statistics
device     r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0       0.0     0    0.0   0
ad6        0.0      0.0      0.0       0.0     0    0.0   0
mfid0      2.8     73.4     22.4    1961.9     0    0.2   2
mfid1      0.0   1784.9      0.0   14303.2     0    0.0   5
ar0        0.0      1.0      0.0       6.8     0    0.8   0
                      extended device statistics
device     r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0       0.0     0    0.0   0
ad6        0.0      0.0      0.0       0.0     0    0.0   0
mfid0      2.2    652.7     44.8    7671.1  2053   99.6  10
mfid1      0.0   1525.9      0.0   12207.6     1    0.1  12
ar0        0.0      0.0      0.0       0.0     0    0.0   0
                      extended device statistics
device     r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0       0.0     0    0.0   0
ad6        0.0      0.0      0.0       0.0     0    0.0   0
mfid0      1.0   2202.4      8.0   19776.3     0  632.0  68
mfid1      0.0    574.3      0.0    4666.4     0    0.5  23
ar0        0.0      0.0      0.0       0.0     0    0.0   0
                      extended device statistics
device     r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0       0.0     0    0.0   0
ad6        0.0      0.0      0.0       0.0     0    0.0   0
mfid0      1.8     75.8     14.4    2242.8     0    0.3   2
mfid1      0.0   1722.5      0.0   13802.0     0    0.0   5
ar0        0.0      0.0      0.0       0.0     0    0.0   0
                      extended device statistics
device     r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0       0.0     0    0.0   0
ad6        0.0      0.0      0.0       0.0     0    0.0   0
mfid0      1.4     71.4     11.2    1991.6     0    0.2   1
mfid1      0.0   1708.9      0.0   13671.4     0    0.0   4
ar0        0.2      2.4      3.2      38.4     0    2.5   0
                      extended device statistics
device     r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0       0.0     0    0.0   0
ad6        0.0      0.0      0.0       0.0     0    0.0   0
mfid0      1.0     78.0      8.0    2222.5     0    0.1   1
mfid1      0.0   1541.0      0.0   12352.1     0    0.0   5
ar0        0.0      0.6      0.0       9.6     0    0.6   0
                      extended device statistics
device     r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0       0.0     0    0.0   0
ad6        0.0      0.0      0.0       0.0     0    0.0   0
mfid0      1.4     56.6     11.2    1629.1     0    0.3   2
mfid1      0.0   1558.3      0.0   12468.3     0    0.0   6
ar0        0.0      0.6      0.0       3.2     0    0.5   0
                      extended device statistics
device     r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0       0.0     0    0.0   0
ad6        0.0      0.0      0.0       0.0     0    0.0   0
mfid0      1.8     72.2     14.4    1706.7     0    0.2   1
mfid1      0.0   1696.5      0.0   13596.2     0    0.0   5
ar0        0.0      0.0      0.0       0.0     0    0.0   0
                      extended device statistics
device     r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0       0.0     0    0.0   0
ad6        0.0      0.0      0.0       0.0     0    0.0   0
mfid0      0.0   1757.7      0.0   16093.5   351  648.0  48
mfid1      0.0    503.9      0.0    4031.3     0    0.2  10
ar0        0.0      0.0      0.0       0.0     0    0.0   0
                      extended device statistics
device     r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0       0.0     0    0.0   0
ad6        0.0      0.0      0.0       0.0     0    0.0   0
mfid0      0.0    125.0      0.0    2542.5     0  404.4   1
mfid1      0.0   1743.9      0.0   13951.1     0    0.1   5
ar0        0.0      0.0      0.0       0.0     0    0.0   0
                      extended device statistics
device     r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0       0.0     0    0.0   0
ad6        0.0      0.0      0.0       0.0     0    0.0   0
mfid0      1.2     73.4      9.6    2122.6     0    0.1   1
mfid1      0.0   1624.3      0.0   13020.0     0    0.0   4
ar0        0.0      2.0      0.0      32.0     0    1.8   0
                      extended device statistics
device     r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0       0.0     0    0.0   0
ad6        0.0      0.0      0.0       0.0     0    0.0   0
mfid0      0.2     67.4      1.6    1955.8     0    0.1   0
mfid1      0.0   1334.5      0.0   10699.7     0    0.0   5
ar0        0.0      0.4      0.0       6.4     0    0.5   0
                      extended device statistics
device     r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0       0.0     0    0.0   0
ad6        0.0      0.0      0.0       0.0     0    0.0   0
mfid0      1.0     64.6      8.0    1789.1     0    0.2   1
mfid1      0.0   1541.0      0.0   12352.1     0    0.0   5
ar0        0.0      0.0      0.0       0.0     0    0.0   0
                      extended device statistics
device     r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0       0.0     0    0.0   0
ad6        0.0      0.0      0.0       0.0     0    0.0   0
mfid0      1.0     78.4      8.0    1855.3     0    0.2   1
mfid1      0.0   1652.1      0.0   13217.0     0    0.0   5
ar0        0.0      0.8      0.0       6.4     0    0.7   0
                      extended device statistics
device     r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0       0.0     0    0.0   0
ad6        0.0      0.0      0.0       0.0     0    0.0   0
mfid0      0.6    691.6      4.8    7919.5  3461   82.4  11
mfid1      0.0   1477.7      0.0   11821.7     1    0.3  15
ar0        0.0      0.0      0.0       0.0     0    0.0   0
                      extended device statistics
device     r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0       0.0     0    0.0   0
ad6        0.0      0.0      0.0       0.0     0    0.0   0
mfid0      0.4   1346.2      3.2   12558.1     0  855.1  41
mfid1      0.0   1020.8      0.0    8190.6     0    0.5  39
ar0        0.0      0.0      0.0       0.0     0    0.0   0
                      extended device statistics
device     r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0       0.0     0    0.0   0
ad6        0.0      0.0      0.0       0.0     0    0.0   0
mfid0      1.0     62.2      8.0    1792.6     0    0.3   2
mfid1      0.0   1636.3      0.0   13112.4     0    0.0   4
ar0        0.0      0.8      0.0      12.8     0    0.8   0
                      extended device statistics
device     r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0       0.0     0    0.0   0
ad6        0.0      0.0      0.0       0.0     0    0.0   0
mfid0      0.2     63.0      1.6    1818.1     0    0.1   0
mfid1      0.0   1710.6      0.0   13684.6     0    0.0   5
ar0        0.0      1.2      0.0      19.2     0    1.2   0
                      extended device statistics
device     r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0       0.0     0    0.0   0
ad6        0.0      0.0      0.0       0.0     0    0.0   0
mfid0      0.8     83.8      6.4    2330.6     0    0.1   1
mfid1      0.0   1218.8      0.0    9750.2     0    0.0   4
ar0        0.0      0.4      0.0       6.4     0    0.5   0
                      extended device statistics
device     r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0       0.0     0    0.0   0
ad6        0.0      0.0      0.0       0.0     0    0.0   0
mfid0      0.0     68.6      0.0    1655.7     0    0.1   0
mfid1      0.0   1708.9      0.0   13671.4     0    0.0   5
ar0        0.0      0.6      0.0       3.2     0    0.5   0
                      extended device statistics
device     r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0       0.0     0    0.0   0
ad6        0.0      0.0      0.0       0.0     0    0.0   0
mfid0      0.0     71.6      0.0    1999.4     0    0.0   0
mfid1      0.0   1071.3      0.0    8594.1     0    0.1   3
ar0        0.0      0.0      0.0       0.0     0    0.0   0
                      extended device statistics
device     r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0       0.0     0    0.0   0
ad6        0.0      0.0      0.0       0.0     0    0.0   0
mfid0      0.4   1373.6      3.2   12956.5     0  550.7  35
mfid1      0.0    999.8      0.0    8022.7     0    0.8  37
ar0        0.0      0.0      0.0       0.0     0    0.0   0
                      extended device statistics
device     r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0       0.0     0    0.0   0
ad6        0.0      0.0      0.0       0.0     0    0.0   0
mfid0      0.8     68.8      8.0    1908.5     0    0.2   1
mfid1      0.0   1370.5      0.0   10962.3     0    0.0   3
ar0        0.0      0.6      0.0       9.6     0    0.6   0
                      extended device statistics
device     r/s      w/s     kr/s      kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0       0.0     0    0.0   0
ad6        0.0      0.0      0.0       0.0     0    0.0   0
mfid0      1.2     63.8      9.6    1912.1     0    0.3   2
mfid1      0.0   1355.8      0.0   10846.2     0    0.0   3
ar0        0.0      1.0      0.0      16.0     0    1.1   0

Looking at /var/log/messages, everything appears to be ok.  The only thing I found, but it was from over a month ago, was the following:

Nov 17 06:35:03 db1 kernel: MCA: Bank 8, Status 0x8c0000400001009f
Nov 17 06:35:03 db1 kernel: MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
Nov 17 06:35:03 db1 kernel: MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 0
Nov 17 06:35:03 db1 kernel: MCA: CPU 0 COR (1) RD channel ?? memory error
Nov 17 06:35:03 db1 kernel: MCA: Address 0x84aa98000
Nov 17 06:35:03 db1 kernel: MCA: Misc 0x4701041000041280
Nov 17 07:35:03 db1 kernel: MCA: Bank 8, Status 0x8c0000400001009f
Nov 17 07:35:03 db1 kernel: MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
Nov 17 07:35:03 db1 kernel: MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 0
Nov 17 07:35:03 db1 kernel: MCA: CPU 0 COR (1) RD channel ?? memory error
Nov 17 07:35:03 db1 kernel: MCA: Address 0x84aa98000
Nov 17 07:35:03 db1 kernel: MCA: Misc 0x4701041000046141
Nov 17 09:35:03 db1 kernel: MCA: Bank 8, Status 0x88000040000200cf
Nov 17 09:35:03 db1 kernel: MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
Nov 17 09:35:03 db1 kernel: MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 0
Nov 17 09:35:03 db1 kernel: MCA: CPU 0 COR (1) MS channel ?? memory error
Nov 17 09:35:03 db1 kernel: MCA: Misc 0x4701041000046141
Nov 18 13:34:59 db1 kernel: MCA: Bank 8, Status 0xcc0000800001009f
Nov 18 13:34:59 db1 kernel: MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
Nov 18 13:34:59 db1 kernel: MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 0
Nov 18 13:34:59 db1 kernel: MCA: CPU 0 COR (2) OVER RD channel ?? memory error
Nov 18 13:34:59 db1 kernel: MCA: Address 0x84aa98000
Nov 18 13:34:59 db1 kernel: MCA: Misc 0x4701041000044707

But it has not recurred.
"Benjamin Krajmalnik" <kraj@servoyant.com> writes: > The only thing which I see when I run iostat is load on the drives every > so often. Mifd0 is the database, mifd1 is the pg_xlog. Hmm. Maybe checkpoints or something else saturating your drives for a little while? You might want to be more aggressive about smoothing out checkpoints, cf http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm At the very least I'd enable checkpoint logging for awhile and see if there seems to be any correlation to the response-time glitches. regards, tom lane
checkpoint_segments = 1024              # in logfile segments, min 1, 16MB each
checkpoint_timeout = 60min              # range 30s-1h
checkpoint_completion_target = 0.85     # checkpoint target duration, 0.0 - 1.0
#checkpoint_warning = 30s               # 0 disables

select * from pg_stat_bgwriter gives me the following:
checkpoints_timed - 6887
checkpoints_req - 36

A snapshot from the log (I am logging checkpoints):

2011-12-22 01:04:33 MST LOG: checkpoint starting: time
2011-12-22 01:12:32 MST WARNING: pgstat wait timeout
2011-12-22 01:12:32 MST WARNING: pgstat wait timeout
2011-12-22 01:13:12 MST WARNING: pgstat wait timeout
2011-12-22 01:13:12 MST WARNING: pgstat wait timeout
2011-12-22 01:13:12 MST WARNING: pgstat wait timeout
2011-12-22 01:13:18 MST WARNING: pgstat wait timeout
2011-12-22 01:13:18 MST WARNING: pgstat wait timeout
2011-12-22 01:35:35 MST WARNING: pgstat wait timeout
2011-12-22 01:35:35 MST WARNING: pgstat wait timeout
2011-12-22 01:38:32 MST WARNING: pgstat wait timeout
2011-12-22 01:38:32 MST WARNING: pgstat wait timeout
2011-12-22 01:41:28 MST LOG: checkpoint complete: wrote 329593 buffers (41.9%); 0 transaction log file(s) added, 0 removed, 731 rec

2011-12-22 09:04:33 MST LOG: checkpoint starting: time
2011-12-22 09:06:54 MST WARNING: pgstat wait timeout
2011-12-22 09:06:54 MST WARNING: pgstat wait timeout
2011-12-22 09:06:54 MST WARNING: pgstat wait timeout
2011-12-22 09:07:33 MST WARNING: pgstat wait timeout
2011-12-22 09:07:33 MST WARNING: pgstat wait timeout
2011-12-22 09:07:33 MST WARNING: pgstat wait timeout
2011-12-22 09:13:41 MST WARNING: pgstat wait timeout
2011-12-22 09:22:07 MST ERROR: canceling autovacuum task
2011-12-22 09:22:07 MST CONTEXT: automatic vacuum of table "ishield.public.tbltmptests"
2011-12-22 09:28:41 MST WARNING: pgstat wait timeout
2011-12-22 09:28:41 MST WARNING: pgstat wait timeout
2011-12-22 09:31:35 MST WARNING: pgstat wait timeout
2011-12-22 09:31:35 MST WARNING: pgstat wait timeout
2011-12-22 09:33:16 MST WARNING: pgstat wait timeout
2011-12-22 09:33:16 MST WARNING: pgstat wait timeout
2011-12-22 09:33:55 MST WARNING: pgstat wait timeout
2011-12-22 09:33:55 MST WARNING: pgstat wait timeout
2011-12-22 09:33:55 MST WARNING: pgstat wait timeout
2011-12-22 09:35:10 MST WARNING: pgstat wait timeout
2011-12-22 09:35:10 MST WARNING: pgstat wait timeout
2011-12-22 09:35:10 MST ERROR: canceling autovacuum task
2011-12-22 09:35:10 MST CONTEXT: automatic vacuum of table "ishield.public.tbltmptests"
2011-12-22 09:36:21 MST WARNING: pgstat wait timeout
2011-12-22 09:36:21 MST WARNING: pgstat wait timeout
2011-12-22 09:37:02 MST WARNING: pgstat wait timeout
2011-12-22 09:37:02 MST WARNING: pgstat wait timeout
2011-12-22 09:37:02 MST WARNING: pgstat wait timeout
2011-12-22 09:37:43 MST WARNING: pgstat wait timeout
2011-12-22 09:37:43 MST WARNING: pgstat wait timeout
2011-12-22 09:38:20 MST WARNING: pgstat wait timeout
2011-12-22 09:38:20 MST WARNING: pgstat wait timeout
2011-12-22 09:38:59 MST WARNING: pgstat wait timeout
2011-12-22 09:38:59 MST WARNING: pgstat wait timeout
2011-12-22 09:38:59 MST WARNING: pgstat wait timeout
2011-12-22 09:42:33 MST LOG: checkpoint complete: wrote 362856 buffers (46.1%); 0 transaction log file(s) added, 0 removed, 908 rec

I do have an aggressive autovacuum because I have some tables that get updated thousands of times per minute and otherwise they bloat terribly, but it is very seldom that I see cancellation of autovacuum, and only on this table, which is a temporary table used for the insertion of the data, in which records are inserted and deleted - never updated.
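A related sketch: the remaining pg_stat_bgwriter columns (as they exist in 9.0) show whether most buffer writes happen at checkpoint time or are being pushed out by backends, which helps judge how bursty the checkpoints are:

SELECT checkpoints_timed, checkpoints_req,
       buffers_checkpoint, buffers_clean, buffers_backend,
       maxwritten_clean
FROM pg_stat_bgwriter;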
Just a WAG - is it possible this is being caused by the connection pooler (pgbouncer in my case)?
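If the pooler is suspected, its admin console can show whether clients are queueing; a sketch that assumes pgbouncer's default port 6432 and an admin user named pgbouncer:

psql -p 6432 -U pgbouncer pgbouncer -c 'SHOW POOLS;'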
I have proceeded to do some more checking, and I see in iostat that the pg_xlog drive has a significantly higher busy state than before.  Whereas it was barely busy when we first spun up the server (total %busy since we started the server is about 6%), it is now in the 80's almost steady state.  We have a set of partitioned tables which are continuously updated, and based on their size they no longer fit in the shared memory which was allocated.  pg_xlog is on a SAS RAID 1.  The server is set up with streaming replication to an identical server.

The one thing which I just checked is the RAID mode on the server:

db1# megacli64 -LDInfo -lAll -aAll

Adapter 0 -- Virtual Drive Information:

Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 2.452 TB
State               : Optimal
Strip Size          : 256 KB
Number Of Drives per span : 2
Span Depth          : 6
Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
Access Policy       : Read/Write
Disk Cache Policy   : Disk's Default
Encryption Type     : None

Virtual Drive: 1 (Target Id: 1)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 418.656 GB
State               : Optimal
Strip Size          : 256 KB
Number Of Drives    : 2
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
Access Policy       : Read/Write
Disk Cache Policy   : Disk's Default
Encryption Type     : None

Is it possible that the WriteThrough is what is causing the high io (and maybe the pgstat wait timeouts as well)?  If this is the case, would it be safe to change the cache to WriteBack?

Additionally, and somewhat unrelated, is there anything special which I need to do when restarting the primary server vis-à-vis the streaming replication server?  In other words, if I were to restart the main server, will the streaming replication server reconnect and pick up once the primary comes back online?
"Benjamin Krajmalnik" <kraj@servoyant.com> wrote: > Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache > if Bad BBU > Current Cache Policy: WriteThrough, ReadAhead, Direct, No Write > Cache if Bad BBU > Is it possible that the WriteThrough is what is causing the high > io (and maybe the pgstat wait timeouts as well)? Yes. > If this is the case, would it be safe to change the cache to Write > back? Maybe. How did it get into this (non-default) state? Are batteries installed and healthy? > Additionally, and somewhat unrelated, is there anything special > which I need to do when restarting the primary server vis-à-vis > the streaming replication server? In other words, if I were to > restart the main server, will the streaming replication server > reconnect and pick up once the primary comes online? I think that should be pretty automatic as long as you haven't promoted the standby to be the new master. -Kevin
Batteries are installed and healthy.  I wonder if it did it during one of the charge/discharge cycles and just did not revert (interestingly, I have one more server that did just that).  I am trying to get the info from LSI as to what may have caused it, whether it is safe to revert the cache mode while it is running, and, of course, the exact command so I don't foobar things up.
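For what it's worth, the write-back switch is usually along these lines; this is an assumption about MegaCLI syntax rather than a verified command, so check it against the LSI documentation before running it on a production array:

megacli64 -LDSetProp WB -LAll -aAll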
Just a followup - thanks to Tom, Kevin, and the rest of the user group, as usual, for their great support.

It appears that I have hit a nice firmware bug, so I am going to have to upgrade the controller firmware.  Apparently, it has some issues communicating with the BBU, so although the BBU is showing as good, trying to bring it to writeback says it is "below threshold".  Since I am going to be taking down 4 servers (our entire infrastructure) to perform the firmware upgrades, I will end up replacing the BBUs for good measure.

Once again, thanks to everyone for their assistance, and I guess this pgstat timeout can be attributed to the controller :)