Обсуждение: Who is causing all this i/o?
Our development server (PG 8.4.4 on Ubuntu server) is constantly doing something, and I can't figure out what. The two productionservers, which are essentially identical, don't show these symptoms. In a nutshell, it's showing 10K blocks persecond of data going out, all the time, and essentially zero blocks per second of input. To start with, if I turn off Postgres, everything is quiet: $ /etc/init.d/postgresql stop Stopping PostgreSQL: server stopped ok [radon:root] /emi/logs/londiste # vmstat 2 procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 0 0 126140 1115944 112692 10466452 0 0 132 829 1 3 2 1 97 1 0 0 126140 1116068 112692 10466452 0 0 0 0 150 175 0 0 100 0 0 0 126140 1116184 112708 10466456 0 0 0 94 123 153 0 0 100 0 0 0 126140 1116432 112708 10466456 0 0 0 0 96 126 0 0 100 0 0 0 126140 1116928 112708 10466456 0 0 0 0 94 111 0 0 100 0 0 0 126140 1116688 112708 10466456 0 0 0 0 114 144 0 0 100 0 0 0 126140 1116364 112708 10466456 0 0 18 74 229 266 1 0 99 0 0 0 126140 1116488 112708 10466492 0 0 0 0 111 138 0 0 100 0 1 0 126140 1116868 112716 10466492 0 0 0 20 96 121 0 0 100 0 But when I restart Postgres, it looks like this, and runs this way 24/7. Notice the "bo" column: $ /etc/init.d/postgresql start Starting PostgreSQL: ok $ vmstat 2 procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 0 0 126140 1078020 112744 10480652 0 0 132 829 1 3 2 1 97 1 0 1 126140 1054320 112744 10487792 0 0 52 10714 900 976 2 1 97 0 0 1 126140 1015988 112760 10511060 0 0 1828 9620 1491 3527 5 2 87 6 0 1 126140 1008316 112768 10517936 0 0 672 10994 982 1209 1 2 86 11 0 0 126140 1017780 112776 10528824 0 0 1466 10746 1003 1428 2 1 89 8 0 0 126140 1013324 112784 10533680 0 0 0 9718 766 858 1 0 99 0 1 0 126140 991600 112784 10539220 0 0 0 10714 962 1127 2 1 97 0 0 0 126140 999924 112788 10544380 0 0 6 10726 918 1227 2 1 97 0 0 0 126140 995700 112800 10548816 0 0 0 9744 823 889 1 1 99 0 2 0 126140 964560 112804 10554048 0 0 2 10718 903 1196 3 1 96 0 That seems to be a lot of output data going somewhere, but there doesn't seem to be anything going on that should cause it: postgres=# select datname, procpid, usename, substr(current_query,0,25) current_query, waiting from pg_stat_activity; datname | procpid | usename | current_query | waiting ------------+---------+----------+--------------------------+--------- postgres | 12927 | postgres | select datname, procpid, | f global | 12185 | postgres | <IDLE> | f accounting | 12367 | postgres | <IDLE> | f orders | 12225 | orders | <IDLE> | f archive | 12267 | postgres | <IDLE> | f (5 rows) After a lot of digging around, I found this in the /postgres/pg_stat_tmp directory. If I list the directory including thei-nodes once every second, I find that a new 2MB file is being created roughly once every two seconds: # ls -li total 2384 69731459 -rw------- 1 postgres postgres 2437957 2011-05-20 14:07 pgstat.stat # ls -li total 2384 69731258 -rw------- 1 postgres postgres 2437957 2011-05-20 14:07 pgstat.stat # ls -li total 2384 69731459 -rw------- 1 postgres postgres 2437957 2011-05-20 14:07 pgstat.stat # ls -li total 2384 69731459 -rw------- 1 postgres postgres 2437957 2011-05-20 14:07 pgstat.stat # ls -li total 3332 69731258 -rw------- 1 postgres postgres 2437957 2011-05-20 14:07 pgstat.stat 69731459 -rw------- 1 postgres postgres 970752 2011-05-20 14:07 pgstat.tmp # ls -li total 2384 69731459 -rw------- 1 postgres postgres 2437957 2011-05-20 14:07 pgstat.stat # ls -li total 2384 69731258 -rw------- 1 postgres postgres 2437957 2011-05-20 14:07 pgstat.stat # ls -li total 2384 69731258 -rw------- 1 postgres postgres 2437957 2011-05-20 14:07 pgstat.stat # ls -li total 2384 69731258 -rw------- 1 postgres postgres 2437957 2011-05-20 14:07 pgstat.stat This is a development server, so there are no queries coming in from the web or other random activities. Nothing is goingon. The other two servers are configured identically. If I diff the configuration files, the only difference is the IP addressesfor the "listen" section. Can anyone tell me what's going on? Why is pgstat.stat being rewritten on this server constantly and not on the other twoservers? Thanks, Craig
> The other two servers are configured identically. If I diff the > configuration files, the only difference is the IP addresses for the > "listen" section. > > Can anyone tell me what's going on? Why is pgstat.stat being rewritten > on this server constantly and not on the other two servers? Do you have something monitoring the stats of the system aggressively? (like pg_stat_user_tables?) JD > > Thanks, > Craig > > -- Command Prompt, Inc. - http://www.commandprompt.com/ PostgreSQL Support, Training, Professional Services and Development The PostgreSQL Conference - http://www.postgresqlconference.org/ @cmdpromptinc - @postgresconf - 509-416-6579
On 5/20/11 3:38 PM, Joshua D. Drake wrote: > >> The other two servers are configured identically. If I diff the >> configuration files, the only difference is the IP addresses for the >> "listen" section. >> >> Can anyone tell me what's going on? Why is pgstat.stat being rewritten >> on this server constantly and not on the other two servers? > > Do you have something monitoring the stats of the system aggressively? (like pg_stat_user_tables?) No, there is nothing like that running on this server. Just Apache, and Apache can only be accessed from inside the company,and its logs showed no activity at all. Thanks, Craig > > JD > > >> >> Thanks, >> Craig >> >> > >
On 05/20/2011 02:14 PM, Craig James wrote: > Our development server (PG 8.4.4 on Ubuntu server) is constantly doing > something, and I can't figure out what. The two production servers, > which are essentially identical, don't show these symptoms. In a > nutshell, it's showing 10K blocks per second of data going out, all > the time, and essentially zero blocks per second of input. ... There are people on this list better equipped than I to get into the guts of diagnosis, but a quick Google shows you aren't the only person to see high IO with disk activity. There are suggestions that a db with lots of temp-table creation and nothing triggering a vacuum could be related. Another thread suggested that bloated system tables could be to blame. Others look at the stats collectors. Groping a bit in the dark, I would try... Checking PG logs for info - especially anything indicating that vacuum isn't running or isn't being triggered. Manual vacuum. Vacuum full of system tables (pg_catalog - especially pg_class and pg_attribute) - possibly followed by a reindex all followed by a manual vacuum. Note: IIRC, that file is shared across databases so you will need to clean each one. Reset stats collector at startup. Check system after each change and report back if anything improves or degrades the situation. Cheers, Steve
On Fri, May 20, 2011 at 3:14 PM, Craig James <craig_james@emolecules.com> wrote: > Our development server (PG 8.4.4 on Ubuntu server) is constantly doing > something, and I can't figure out what. The two production servers, which > are essentially identical, don't show these symptoms. In a nutshell, it's > showing 10K blocks per second of data going out, all the time, and > essentially zero blocks per second of input. > After a lot of digging around, I found this in the /postgres/pg_stat_tmp > directory. If I list the directory including the i-nodes once every second, > I find that a new 2MB file is being created roughly once every two seconds: Have you got a lot of databases in your development environment? I think that can sometimes cause a lot of pg_stat writes.
On 5/20/11 4:25 PM, Scott Marlowe wrote: > On Fri, May 20, 2011 at 3:14 PM, Craig James<craig_james@emolecules.com> wrote: >> Our development server (PG 8.4.4 on Ubuntu server) is constantly doing >> something, and I can't figure out what. The two production servers, which >> are essentially identical, don't show these symptoms. In a nutshell, it's >> showing 10K blocks per second of data going out, all the time, and >> essentially zero blocks per second of input. >> After a lot of digging around, I found this in the /postgres/pg_stat_tmp >> directory. If I list the directory including the i-nodes once every second, >> I find that a new 2MB file is being created roughly once every two seconds: > Have you got a lot of databases in your development environment? I > think that can sometimes cause a lot of pg_stat writes. Yes. The production servers have a dozen or so databases, but the development server has a couple hundred databases. Doesthat count as "a lot of databases"? We've had even more databases in the past (>500) and didn't see this sort of I/O activity. The odd thing is that this activity is gradually growing, so slowly that you don't notice it right away. A month or twoago, nothing. A couple weeks ago, a constant stream of "grass" in xload's graph. Today, xload is showing a 50% all thetime. I can only guess that the load will continue to increase. Is there a way to tell Postgres not to do this, or to do it less? It's not a big disaster (yet), but it's annoying to havethe system spewing a couple megabytes of pg_stat_tmp data every few seconds. Thanks, Craig
I am tracking down possibly similar behavior on a production server and a colleague introduced me to Dstat. I am using it right now to try and track down the root of my problem. May or may not be helpful in figuring out exactly whats causing this and then on to figuring out to fix it.
http://dag.wieers.com/home-made/dstat/#
then run something like:
dstat -t -l --cpu24 --sys --tcp --vm --nocolor --top-io --top-latency-avg --output /tmp/dstat.csv
sincerely
http://dag.wieers.com/home-made/dstat/#
then run something like:
dstat -t -l --cpu24 --sys --tcp --vm --nocolor --top-io --top-latency-avg --output /tmp/dstat.csv
sincerely
Craig James <craig_james@emolecules.com> writes: > On 5/20/11 4:25 PM, Scott Marlowe wrote: >> On Fri, May 20, 2011 at 3:14 PM, Craig James<craig_james@emolecules.com> wrote: >>> Our development server (PG 8.4.4 on Ubuntu server) is constantly doing >>> something, and I can't figure out what. The two production servers, which >>> are essentially identical, don't show these symptoms. In a nutshell, it's >>> showing 10K blocks per second of data going out, all the time, and >>> essentially zero blocks per second of input. >>> After a lot of digging around, I found this in the /postgres/pg_stat_tmp >>> directory. If I list the directory including the i-nodes once every second, >>> I find that a new 2MB file is being created roughly once every two seconds: >> Have you got a lot of databases in your development environment? I >> think that can sometimes cause a lot of pg_stat writes. > Yes. The production servers have a dozen or so databases, but the development server has a couple hundred databases. Doesthat count as "a lot of databases"? Yeah. I think what is happening is that the autovacuum launcher is visiting every database, doing accesses to the system catalogs (and not much more than that), which results in access-count updates in the stats collector, which have to get written to disk. What's not apparent however is why the stats collector is writing disk so much. 8.4 does have the logic change to not write stats out unless something is asking to see them. So either it's really pre-8.4, or you have a monitoring task that is constantly asking to see stats. One possible band-aid solution is to increase autovacuum_naptime. This is defined as the AV cycle time *for each database*, so AV wakes up and touches another database every naptime/#databases seconds. If your number of databases has been growing over time, this would probably explain why the problem is getting worse. regards, tom lane
On 5/21/11 8:11 AM, Tom Lane wrote: > Craig James<craig_james@emolecules.com> writes: >> On 5/20/11 4:25 PM, Scott Marlowe wrote: >>> On Fri, May 20, 2011 at 3:14 PM, Craig James<craig_james@emolecules.com> wrote: >>>> Our development server (PG 8.4.4 on Ubuntu server) is constantly doing >>>> something, and I can't figure out what. The two production servers, which >>>> are essentially identical, don't show these symptoms. In a nutshell, it's >>>> showing 10K blocks per second of data going out, all the time, and >>>> essentially zero blocks per second of input. >>>> After a lot of digging around, I found this in the /postgres/pg_stat_tmp >>>> directory. If I list the directory including the i-nodes once every second, >>>> I find that a new 2MB file is being created roughly once every two seconds: >>> Have you got a lot of databases in your development environment? I >>> think that can sometimes cause a lot of pg_stat writes. >> Yes. The production servers have a dozen or so databases, but the development server has a couple hundred databases.Does that count as "a lot of databases"? > Yeah. I think what is happening is that the autovacuum launcher is > visiting every database, doing accesses to the system catalogs (and not > much more than that), which results in access-count updates in the stats > collector, which have to get written to disk. > > What's not apparent however is why the stats collector is writing disk > so much. 8.4 does have the logic change to not write stats out unless > something is asking to see them. So either it's really pre-8.4, $ ./postgres --version postgres (PostgreSQL) 8.4.4 > or you > have a monitoring task that is constantly asking to see stats. Nothing like that. There's virtually nothing on this server but Postgres and Apache. There are some fast-cgi perl programsthat stay connected to Postgres all the time, but the constant I/O starts up even if I kill Apache. > One possible band-aid solution is to increase autovacuum_naptime. This > is defined as the AV cycle time *for each database*, so AV wakes up and > touches another database every naptime/#databases seconds. I'll try this. Since it's a development system, it tends to get large bursts of loading followed by a bit of searching. Cutting down on vacuums should be fine. > If your > number of databases has been growing over time, this would probably > explain why the problem is getting worse. For the last few months the number of databases has been almost static. Maybe five added (of 200) in the last three months,and probably a few others dropped. Thanks, Craig > regards, tom lane >
Tom Lane-2 wrote: > > What's not apparent however is why the stats collector is writing disk > so much. 8.4 does have the logic change to not write stats out unless > something is asking to see them. So either it's really pre-8.4, or you > have a monitoring task that is constantly asking to see stats. > We have a PostgreSQL 9.0.4 on CentOs for performance testing and we are seeing the similar issue. we have a "crazy" setup it has 1 database with 1000 identical schemas. There are occasional I/O write storm of over 100 MB/sec without any disk reads, and it could last for a couple of minutes when the schemas/data are aggressively populated by pg_restore. All the io writes seem to be on pgstat.tmp. The I/O write storm seemed to be trigger by Vacuum. The way we get around that is to point stats_temp_directory to ram disk which eliminates the I/O write storm effectively. Is it an issue of specific usage scenario or a general issue to be resolved? -- View this message in context: http://postgresql.1045698.n5.nabble.com/Who-is-causing-all-this-i-o-tp4413558p4499692.html Sent from the PostgreSQL - admin mailing list archive at Nabble.com.
On 6/17/11 11:51 AM, Shianmiin wrote: > Tom Lane-2 wrote: >> What's not apparent however is why the stats collector is writing disk >> so much. 8.4 does have the logic change to not write stats out unless >> something is asking to see them. So either it's really pre-8.4, or you >> have a monitoring task that is constantly asking to see stats. >> > We have a PostgreSQL 9.0.4 on CentOs for performance testing and we are > seeing the similar issue. > we have a "crazy" setup it has 1 database with 1000 identical schemas. There > are occasional I/O write storm > of over 100 MB/sec without any disk reads, and it could last for a couple of > minutes when the schemas/data are aggressively populated by pg_restore. All > the io writes seem to be on pgstat.tmp. > > The I/O write storm seemed to be trigger by Vacuum. Based on the advice I got from my original question, I changed autovacuum_naptime to "5min", and the problem completely disappeared. (I know that's a long interval, but this particular server gets maybe 5-10 heavy updates per week and is idlethe rest of the time.) select count(1) from pg_database ; count ------- 267 It seems like there's a problem somewhere. Autovacuum has improved enormously in the last couple of years, but some changeto its algorithm causes a lot of I/O thrashing when there are more than a few hundred separate databases. Craig
Craig James <craig_james@emolecules.com> writes: > On 6/17/11 11:51 AM, Shianmiin wrote: >> We have a PostgreSQL 9.0.4 on CentOs for performance testing and we are >> seeing the similar issue. >> we have a "crazy" setup it has 1 database with 1000 identical schemas. There >> are occasional I/O write storm >> of over 100 MB/sec without any disk reads, and it could last for a couple of >> minutes when the schemas/data are aggressively populated by pg_restore. All >> the io writes seem to be on pgstat.tmp. > Based on the advice I got from my original question, I changed autovacuum_naptime to "5min", and the problem completelydisappeared. (I know that's a long interval, but this particular server gets maybe 5-10 heavy updates per weekand is idle the rest of the time.) > select count(1) from pg_database ; > count > ------- > 267 > It seems like there's a problem somewhere. Autovacuum has improved enormously in the last couple of years, but some changeto its algorithm causes a lot of I/O thrashing when there are more than a few hundred separate databases. Well, if you have a lot of databases then you definitely need to increase the naptime to keep autovac's demands for stats within bounds. I don't find that surprising, though I do wonder if we ought to redefine the way that GUC works: right now, you get one autovac wakeup every naptime/databases seconds. The OP seems to have some other issue, though, since he says he's only got 1 database. regards, tom lane