Discussion: Sudden connection and load average spikes with postgresql 9.3
Hello all,

This is my very first message to the PostgreSQL community, and I really hope
you can help me solve the trouble I'm facing.

I have an 80-core server (multithreaded) with close to 500GB of RAM.

My configuration is:
max_connections: 1500 (was 850)
shared_buffers: 188GB
work_mem: 110MB (was 220MB)
maintenance_work_mem: 256MB
effective_cache_size: 340GB

The database is running PostgreSQL 9.3.9 on Ubuntu Server 14.04 LTS
(kernel 3.13.0-55-generic).

For the past two days I've been seeing that, at random times, the connections
climb until they reach max_connections, and the server's load average goes to
around 300~400, making every command issued on the server take forever. When
this happens, RAM usage is relatively low (about 70GB used), core activity is
lower than usual, and the machine sometimes swaps (swappiness is set to 10).

I've been trying to find the cause of this underperformance, even logging all
queries in debug mode, but I have found nothing strange so far. I really don't
know what my next step should be to isolate the problem, and that's why I'm
writing to you. Have you ever seen this behaviour before? Could you kindly
suggest any steps to follow?

Thanks and best regards!
E.v.
eudald_v <reaven.galaeindael@gmail.com> writes:
> This is my very first message to the PostgreSQL community, and I really
> hope you can help me solve the trouble I'm facing.
> I have an 80-core server (multithreaded) with close to 500GB of RAM.
> My configuration is:
> max_connections: 1500 (was 850)
> shared_buffers: 188GB
> work_mem: 110MB (was 220MB)
> maintenance_work_mem: 256MB
> effective_cache_size: 340GB
> The database is running PostgreSQL 9.3.9 on Ubuntu Server 14.04 LTS
> (kernel 3.13.0-55-generic)
> For the past two days I've been seeing that, at random times, the
> connections climb until they reach max_connections, and the server's load
> average goes to around 300~400, making every command issued on the server
> take forever. When this happens, RAM usage is relatively low (about 70GB
> used), core activity is lower than usual, and the machine sometimes swaps
> (swappiness is set to 10).
You haven't described why you would suddenly be getting more connections,
but if that's just an expected byproduct of transactions taking too long
to finish, then a plausible theory is that something is acquiring a strong
lock on some heavily-used table, causing other transactions to back up
behind it. The pg_stat_activity and pg_locks views may help identify
the culprit.
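For example, something along these lines (a sketch, not from this exchange;
"yourdb" is a placeholder) can show who is waiting on whom on 9.3, which does
not yet have pg_blocking_pids():

  # List backends waiting on a lock and the backends holding the conflicting
  # lock, by joining pg_locks to itself plus pg_stat_activity.
  psql -d yourdb <<'SQL'
  SELECT w.pid    AS waiting_pid,
         wa.query AS waiting_query,
         b.pid    AS blocking_pid,
         ba.query AS blocking_query,
         ba.state AS blocking_state
  FROM pg_locks w
  JOIN pg_stat_activity wa ON wa.pid = w.pid
  JOIN pg_locks b
    ON  b.locktype = w.locktype
    AND b.database      IS NOT DISTINCT FROM w.database
    AND b.relation      IS NOT DISTINCT FROM w.relation
    AND b.page          IS NOT DISTINCT FROM w.page
    AND b.tuple         IS NOT DISTINCT FROM w.tuple
    AND b.transactionid IS NOT DISTINCT FROM w.transactionid
    AND b.pid <> w.pid
  JOIN pg_stat_activity ba ON ba.pid = b.pid
  WHERE NOT w.granted
    AND b.granted;
  SQL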
In any case, a good mitigation plan would be to back off your
max_connections setting, and instead use a connection pooler to manage
large numbers of clients. The usual rule of thumb is that you don't want
number of backends much more than the number of cores. 1500 active
sessions is WAY more than your hardware can support, and even approaching
that will exacerbate whatever the root performance problem is.
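One common way to do that is pgbouncer (which also comes up later in this
thread). A minimal sketch, where the database name, paths, and pool sizes are
placeholders rather than anything recommended in the original messages:

  # Illustrative pgbouncer setup; point the application at port 6432.
  cat > /etc/pgbouncer/pgbouncer.ini <<'EOF'
  [databases]
  appdb = host=127.0.0.1 port=5432 dbname=appdb

  [pgbouncer]
  listen_addr = 127.0.0.1
  listen_port = 6432
  auth_type = md5
  auth_file = /etc/pgbouncer/userlist.txt
  ; transaction pooling: server connections are shared between transactions
  pool_mode = transaction
  ; server connections kept to PostgreSQL, roughly in line with the core count
  default_pool_size = 80
  ; clients up to this limit queue in the pooler instead of in PostgreSQL
  max_client_conn = 1500
  EOF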
I'm also a bit worried by your focus on CPU capacity without any mention
of the disk subsystem. It may well be that the actual bottleneck is disk
throughput, and that you can't even sustain 80 sessions unless they're
read-mostly. It would be a good idea to watch the i/o stats the next
time you see one of these high-load episodes.
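For example (a sketch; the 5-second interval is arbitrary):

  # Run in separate terminals during the next high-load episode.
  iostat -xk 5     # per-device throughput, await, queue depth, %util
  vmstat 5         # run queue, swap in/out, and the us/sy/wa/id split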
It might also be a good idea to decrease the shared_buffers setting and
rely more on OS-level disk caching. Some have found that very large
shared buffer pools tend to increase contention without much payback.
regards, tom lane
Dear Tom,

Thanks for your quick reply.

First of all, yes, queries seem to take more time to process and they appear
to be queued up (you can even see INSERTs with status "waiting" in top/htop).

I didn't know about that connection tip, and I will absolutely find a moment
to add a pooler (pg_pooler) to reduce the number of active/idle connections
shown in the database.

I'm monitoring pg_stat_activity, and when connections rise there is no query
being processed that could block a table and cause the other queries to stack
up waiting. I'm checking pg_locks too, but the results are not conclusive yet.

For the moment, it has spiked once again (it's happening more frequently),
and I was able to capture the I/O log below. These are 24 300GB SAS disks in
RAID 10 with a hot spare.
Device:            tps   kB_read/s   kB_wrtn/s   kB_read   kB_wrtn
sda              42,00      136,00       96,00       136        96

avg-cpu:  %user   %nice  %system  %iowait  %steal   %idle
          14,82    0,00    52,75     0,61    0,00   31,82

Device:            tps   kB_read/s   kB_wrtn/s   kB_read   kB_wrtn
sda             598,00     1764,00    15188,00      1764     15188

avg-cpu:  %user   %nice  %system  %iowait  %steal   %idle
          17,79    0,00    82,19     0,01    0,00    0,01

Device:            tps   kB_read/s   kB_wrtn/s   kB_read   kB_wrtn
sda             230,00       28,00     6172,00        28      6172

avg-cpu:  %user   %nice  %system  %iowait  %steal   %idle
          26,79    0,00    72,49     0,12    0,00    0,60

Device:            tps   kB_read/s   kB_wrtn/s   kB_read   kB_wrtn
sda             200,00       96,00     8364,00        96      8364

avg-cpu:  %user   %nice  %system  %iowait  %steal   %idle
          17,88    0,00    81,95     0,01    0,00    0,15

Device:            tps   kB_read/s   kB_wrtn/s   kB_read   kB_wrtn
sda              87,00       48,00     4164,00        48      4164

avg-cpu:  %user   %nice  %system  %iowait  %steal   %idle
          26,09    0,00    66,86     1,69    0,00    5,36

Device:            tps   kB_read/s   kB_wrtn/s   kB_read   kB_wrtn
sda             563,00     1932,00     9004,00      1932      9004

avg-cpu:  %user   %nice  %system  %iowait  %steal   %idle
          19,30    0,00    80,38     0,10    0,00    0,21

Device:            tps   kB_read/s   kB_wrtn/s   kB_read   kB_wrtn
sda             159,00      348,00     5292,00       348      5292

avg-cpu:  %user   %nice  %system  %iowait  %steal   %idle
          28,15    0,00    68,03     0,19    0,00    3,63

Device:            tps   kB_read/s   kB_wrtn/s   kB_read   kB_wrtn
sda             250,00      196,00     7388,00       196      7388

avg-cpu:  %user   %nice  %system  %iowait  %steal   %idle
          11,03    0,00    38,68     0,14    0,00   50,16

Device:            tps   kB_read/s   kB_wrtn/s   kB_read   kB_wrtn
sda             294,00     1508,00     2916,00      1508      2916
Hope it helps. If you feel we should reduce shared_buffers, what would be a
nice value?
Thanks!
On 06/30/2015 07:52 AM, eudald_v wrote:
> For the past two days I've been seeing that, at random times, the
> connections climb until they reach max_connections, and the server's load
> average goes to around 300~400, making every command issued on the server
> take forever. When this happens, RAM usage is relatively low (about 70GB
> used), core activity is lower than usual, and the machine sometimes swaps
> (swappiness is set to 10).

As Tom said, the most likely reason for this is application behavior and
blocking locks. Try some of the queries on our scripts page:

https://github.com/pgexperts/pgx_scripts/tree/master/locks

However, I have seen some other things cause these kinds of stalls:

* Runaway connection generation by the application, due to either a
  programming bug or an irresponsible web crawler
  (see https://www.pgexperts.com/blog/quinn_weaver/).

* Issues evicting blocks from shared_buffers: what is your shared_buffers
  set to? How large is your database?

* Checkpoint stalls: what filesystem are you on? What are your transaction
  log settings for PostgreSQL?

* Issues with the software/hardware stack around your storage, causing
  periodic total IO stalls. What does IO throughput look like
  before/during/after the stalls?

The last one was the cause the last time I dealt with a situation like
yours; it turned out to be bad RAID card firmware, where the card would lock
up whenever the write-through buffer came under too much pressure.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
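For reference, the transaction-log (checkpoint/WAL) settings Josh asks about
can be read straight off a 9.3 server; a sketch, with "yourdb" as a
placeholder:

  # Report the settings most relevant to checkpoint stalls on 9.3.
  psql -d yourdb -c "SELECT name, setting, unit
                       FROM pg_settings
                      WHERE name IN ('checkpoint_segments',
                                     'checkpoint_timeout',
                                     'checkpoint_completion_target',
                                     'wal_buffers',
                                     'synchronous_commit');"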
Dear Josh,

I'm sorry I didn't write back sooner, but we have been very busy with this
issue and, you know, when something goes wrong, the apocalypse comes with it.

I've been working on everything you suggested.

I used your tables and scripts, and I can give you a sample of the output
from locked_query_start:
2015-07-02 14:49:45.972129+02 | 15314 |  | 4001 |  | "TABLE_Z" | tuple
 | ExclusiveLock | 24018:24 | relation | ShareUpdateExclusiveLock |  |
 | YYY.YYY.YYY.YYY/32 | 2015-07-02 14:49:26.635599+02
 | 2015-07-02 14:49:26.635599+02 | 2015-07-02 14:49:26.635601+02
 | INSERT INTO "TABLE_X" ("field1", "field2", "field3", "field4", "field5",
   "field6", "field7") VALUES (22359509, 92, 5, 88713, 'XXX.XXX.XXX.XXX', 199, 10)
 |  |  | 2015-07-02 14:11:45.368709+02 | 2015-07-02 14:11:45.368709+02
 | active | 2015-07-02 14:11:45.36871+02
 | autovacuum: VACUUM ANALYZE public.TABLE_Z

2015-07-02 14:49:45.972129+02 | 15857 |  | 4001 |  | "TABLE_Z" | tuple
 | ExclusiveLock | 24018:24 | relation | ShareUpdateExclusiveLock |  |
 | YYY.YYY.YYY.YYY/32 | 2015-07-02 14:49:22.79166+02
 | 2015-07-02 14:49:22.79166+02 | 2015-07-02 14:49:22.791665+02
 | INSERT INTO "TABLE_X" ("field1", "field2", "field3", "field4", "field5",
   "field6", "field7") VALUES (14515978, 92, 5, 88713, 'XXX.XXX.XXX.XXX', 199, 10)
 |  |  | 2015-07-02 14:11:45.368709+02 | 2015-07-02 14:11:45.368709+02
 | active | 2015-07-02 14:11:45.36871+02
 | autovacuum: VACUUM ANALYZE public.TABLE_Z

2015-07-02 14:49:45.972129+02 | 15314 |  | 14712 |  | "TABLE_Z" | tuple
 | ExclusiveLock | 24018:24 | relation | AccessShareLock |  |
 | 1YYY.YYY.YYY.YYY/32 | 2015-07-02 14:49:26.635599+02
 | 2015-07-02 14:49:26.635599+02 | 2015-07-02 14:49:26.635601+02
 | INSERT INTO "TABLE_X" ("field1", "field2", "field3", "field4", "field5",
   "field6", "field7") VALUES (22359509, 92, 5, 88713, 'XXX.XXX.XXX.XXX', 199, 10)
 |  | 185.10.253.72/32 | 2015-07-02 14:48:48.841375+02
 | 2015-07-02 14:48:48.841375+02 | active | 2015-07-02 14:48:48.841384+02
 | INSERT INTO "TABLE_Y" ("email_id", "sendout_id", "feed_id", "isp_id")
   VALUES (46015879, 75471, 419, 0)
All that was recorded during a spike. From this log I have to point out
something: tables TABLE_X and TABLE_Y both have a TRIGGER that does an INSERT
into TABLE_Z.

As you can see, TABLE_Z was being VACUUM ANALYZEd at the time. I wonder
whether triggers and VACUUM play well together, just to check another angle.

We have also looked carefully at our scripts and performed some code
optimizations (like closing DB connections earlier), but the spikes continue
to happen.

The filesystem is ext4, and I don't know how to check the transaction log
configuration.

This is how I/O looks before and after any problem happens:
Device:            tps   kB_read/s   kB_wrtn/s   kB_read   kB_wrtn
sda             357,00     7468,00     8840,00      7468      8840

avg-cpu:  %user   %nice  %system  %iowait  %steal   %idle
           5,02    0,00     2,44     0,06    0,00   92,47

Device:            tps   kB_read/s   kB_wrtn/s   kB_read   kB_wrtn
sda             471,00     7032,00    13760,00      7032     13760

avg-cpu:  %user   %nice  %system  %iowait  %steal   %idle
           5,14    0,00     2,92     0,03    0,00   91,92

Device:            tps   kB_read/s   kB_wrtn/s   kB_read   kB_wrtn
sda             376,00     7192,00     8048,00      7192      8048

avg-cpu:  %user   %nice  %system  %iowait  %steal   %idle
           4,77    0,00     2,57     0,03    0,00   92,63

Device:            tps   kB_read/s   kB_wrtn/s   kB_read   kB_wrtn
sda             304,00     7280,00     8252,00      7280      8252
And this is how it looks when the spike happens:
http://pastebin.com/2hAYuDZ5

Hope it helps in determining what's happening.
Thanks for all your efforts and collaboration!
Eudald
On 07/02/2015 08:41 AM, eudald_v wrote:
> All that was recorded during a spike. From this log I have to point out
> something: tables TABLE_X and TABLE_Y both have a TRIGGER that does an
> INSERT into TABLE_Z.
> As you can see, TABLE_Z was being VACUUM ANALYZEd at the time. I wonder
> whether triggers and VACUUM play well together, just to check another
> angle.

Well, it's not triggers in particular, but vacuum does create some contention
and possible sluggishness. Questions:

* What kind of writes do the triggers do?

* Can they conflict between sessions? That is, can different writes on X
  and/or Y end up overwriting the same rows on Z?

* Is that autovacuum a regular autovacuum, or is it "to prevent wraparound"?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
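That last question can be answered directly from pg_stat_activity; a sketch
("yourdb" is a placeholder):

  # List running autovacuum workers; on 9.3 an anti-wraparound vacuum shows
  # "(to prevent wraparound)" in its query text.
  psql -d yourdb -c "SELECT pid, query_start, query
                       FROM pg_stat_activity
                      WHERE query LIKE 'autovacuum:%';"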
On 07/02/2015 08:41 AM, eudald_v wrote:
> And this is how it looks when the spike happens:
> http://pastebin.com/2hAYuDZ5

Hmm, those incredibly high system % figures indicate that there's something
wrong with your system. If you're not using software RAID or ZFS, you should
never see that.

I think you have a driver, kernel, Linux memory management, or IO stack
issue.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Josh Berkus <josh@agliodbs.com> wrote:
> On 07/02/2015 08:41 AM, eudald_v wrote:
>> And this is how it looks when the spike happens:
>> [system CPU staying over 90%]

> I think you have a driver, kernel, Linux memory management, or IO
> stack issue.

In my experience this is usually caused by failure to disable transparent
huge page support. The larger shared_buffers is configured, the bigger the
problem.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
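Checking and turning off THP on a kernel of this era looks roughly like this
(a sketch using the standard sysfs paths; making it persist across reboots is
distro-specific, e.g. via rc.local or the transparent_hugepage=never boot
option):

  # Check the current setting -- the value in [brackets] is the active one.
  cat /sys/kernel/mm/transparent_hugepage/enabled
  cat /sys/kernel/mm/transparent_hugepage/defrag

  # Turn it off for the running kernel (as root).
  echo never > /sys/kernel/mm/transparent_hugepage/enabled
  echo never > /sys/kernel/mm/transparent_hugepage/defrag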
On Tue, Jun 30, 2015 at 8:52 AM, eudald_v <reaven.galaeindael@gmail.com> wrote:
> Hello all,
> This is my very first message to the PostgreSQL community, and I really
> hope you can help me solve the trouble I'm facing.
>
> I have an 80-core server (multithreaded) with close to 500GB of RAM.
>
> My configuration is:
> max_connections: 1500 (was 850)

Drop this to 80 to 160, and use pgbouncer for pooling. pg_pooler is nice, but
it's a lot harder to configure. pgbouncer takes literally a couple of minutes
and you're up and running.

> shared_buffers: 188GB

I have yet to see shared_buffers this big be a big help. I'd drop it down to
1 to 16GB or so, but that's just me.

> work_mem: 110MB (was 220MB)

110MB * 1500 connections * 1, 2, or 3 sorts per query == disaster. Drop this
down as well. If you have ONE query that needs a lot, create a user for that
query, alter that user for bigger work_mem, and then use it only for that big
query.

> maintenance_work_mem: 256MB
> effective_cache_size: 340GB
>
> The database is running PostgreSQL 9.3.9 on Ubuntu Server 14.04 LTS
> (kernel 3.13.0-55-generic)

Good kernel.

OK so you've got 80 cores. Have you checked zone_reclaim_mode? Under no
circumstances should that ever be turned on on a database server. If
"sysctl -a | grep zone" returns:

vm.zone_reclaim_mode = 1

then use /etc/sysctl.conf to set it to 0. It's a silent killer, and it will
make your machine run REALLY slow for periods of 10 to 30 minutes for no
apparent reason.

But first and foremost, GET A POOLER in the loop NOW. Not having one is
making your machine extremely vulnerable to overload, and making it
impossible for you to manage it.
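Spelled out as commands, the zone_reclaim_mode check and fix described above
look roughly like this (a sketch; run as root):

  # Check the current value; it should be 0 on a database server.
  sysctl vm.zone_reclaim_mode

  # Turn it off now, and persist it across reboots via /etc/sysctl.conf.
  sysctl -w vm.zone_reclaim_mode=0
  echo 'vm.zone_reclaim_mode = 0' >> /etc/sysctl.conf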
Hello guys!

I finally got rid of it. It looks like, in the end, it was all due to the
transparent_hugepage setting.

I disabled it and the CPU spikes disappeared. I'm sorry, because it's
something I usually disable on PostgreSQL servers, but I forgot to do so on
this one and never thought about it.

Thanks a lot for all your helpful messages!

Eudald
Note that if you still have the settings you showed in your original post,
you're just moving the goal posts a few feet further back. Any heavy load can
still trigger this kind of behaviour.

On Tue, Jul 7, 2015 at 5:29 AM, eudald_v <reaven.galaeindael@gmail.com> wrote:
> Hello guys!
>
> I finally got rid of it. It looks like, in the end, it was all due to the
> transparent_hugepage setting.
>
> I disabled it and the CPU spikes disappeared. I'm sorry, because it's
> something I usually disable on PostgreSQL servers, but I forgot to do so
> on this one and never thought about it.
>
> Thanks a lot for all your helpful messages!
>
> Eudald

--
To understand recursion, one must first understand recursion.
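Pulling the thread's suggestions together, the relevant postgresql.conf lines
might end up looking something like the sketch below. The work_mem value is
only a placeholder (the thread just says to lower it), and the path is the
stock Ubuntu location; adjust both for your installation.

  # Illustrative only; later entries override earlier ones in postgresql.conf.
  cat >> /etc/postgresql/9.3/main/postgresql.conf <<'EOF'
  max_connections = 160
  shared_buffers = 16GB
  work_mem = 32MB
  maintenance_work_mem = 256MB
  effective_cache_size = 340GB
  EOF
  # Restart PostgreSQL afterwards: max_connections and shared_buffers only
  # take effect on restart, not on reload.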