Discussion: PG12 autovac issues
We have a database that isn't overly large (~20G), but gets incredibly
frequent updates. Here's an example table:

feedi=# select * from pg_stat_all_tables where schemaname = 'production' and relname = 'tita';

relid               = 16786
schemaname          = production
relname             = tita
seq_scan            = 23365
seq_tup_read        = 403862091
idx_scan            = 26612759248
idx_tup_fetch       = 19415752701
n_tup_ins           = 24608806
n_tup_upd           = 4207076934
n_tup_del           = 24566916
n_tup_hot_upd       = 4073821079
n_live_tup          = 79942
n_dead_tup          = 71969
n_mod_since_analyze = 12020
last_vacuum         = 2020-03-17 15:35:19.588859+00
last_autovacuum     = 2020-03-17 21:31:08.248598+00
last_analyze        = 2020-03-17 15:35:20.372875+00
last_autoanalyze    = 2020-03-17 22:04:41.76743+00
vacuum_count        = 9
autovacuum_count    = 135693
analyze_count       = 9
autoanalyze_count   = 495877

As you can see, this table has only ~80K live rows but billions of
updates. What we have observed is that the xid age (relative to
datfrozenxid) reaches the 200M autovacuum_freeze_max_age mark fairly
quickly because of the amount of activity. What is interesting is that
this happens with the 'postgres' and 'template1' databases as well,
even though there is absolutely no activity in those databases.

When the 'postgres' and/or 'template1' databases hit freeze_max_age,
there are cases where an aggressive autovacuum of their tables kicks
off, and that seems to prevent autovacuums from running elsewhere.
Oddly, this is not consistent, but that condition seems to be required.
We have observed this across multiple PG12 servers (dev, test, staging,
production), all with similar workloads.

$ grep -i vacuum /var/log/postgresql/postgres.log | cut -b 1-9 | uniq -c
     17 Mar 17 06
     34 Mar 17 07
     31 Mar 17 08
     31 Mar 17 09
     30 Mar 17 10
     34 Mar 17 11
     33 Mar 17 12
     19 Mar 17 13
     40 Mar 17 15
     31 Mar 17 16
     36 Mar 17 17
     34 Mar 17 18
     35 Mar 17 19
     35 Mar 17 20
     33 Mar 17 21

As you can see above, we end up with around ~33 autovacuums/hr, and at
about 13:30 today they stopped entirely until we ran "vacuum freeze
verbose analyze;" against the 'postgres' database (around 15:30), which
then caused autovacuums to resume running against the "feedi" database.

I'm completely perplexed as to what is happening and why it suddenly
started when we moved from PG10 to PG12. The configs and workload are
essentially the same between versions. We realize we could simply
increase autovacuum_freeze_max_age, but that doesn't seem to actually
resolve anything -- it just pushes the problem out.

Has anyone seen anything similar to this?

Thanks very much for the consideration.

Justin King
http://flightaware.com/
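For reference, the per-database xid age that drives these wraparound
autovacuums can be watched with a query along the following lines -- a
generic monitoring sketch, where the comparison value is simply whatever
autovacuum_freeze_max_age is set to (200M by default):

-- Distance of each database's datfrozenxid from autovacuum_freeze_max_age.
SELECT datname,
       age(datfrozenxid) AS xid_age,
       current_setting('autovacuum_freeze_max_age')::int AS freeze_max_age
FROM pg_database
ORDER BY xid_age DESC;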
On Tue, Mar 17, 2020 at 6:19 PM Justin King <kingpin867@gmail.com> wrote:
> We have a database that isn't overly large (~20G), but gets incredibly
> frequent updates. Here's an example table:
>
> feedi=# select * from pg_stat_all_tables where schemaname =
> 'production' and relname = 'tita';
>
> n_tup_upd = 4207076934
> n_tup_hot_upd = 4073821079
> n_live_tup = 79942
> n_dead_tup = 71969
> n_mod_since_analyze = 12020
>
> As you can see in this table, there are only ~80K rows, but billions
> of updates.
But how long a period were those billions of updates spread over? You need to look at deltas, not absolute values. And note that almost all of those updates were HOT updates, which don't generate "vacuum debt".
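One way to turn those cumulative counters into deltas is to snapshot
pg_stat_all_tables periodically and diff consecutive snapshots; a minimal
sketch (the stat_snap table name is just a placeholder):

-- Create an empty snapshot table with the columns we care about.
CREATE TABLE IF NOT EXISTS stat_snap AS
    SELECT now() AS snap_time, relid, n_tup_upd, n_tup_hot_upd, n_dead_tup
    FROM pg_stat_all_tables
    WHERE false;

-- Run this periodically (cron, psql \watch, etc.) to record a snapshot.
INSERT INTO stat_snap
    SELECT now(), relid, n_tup_upd, n_tup_hot_upd, n_dead_tup
    FROM pg_stat_all_tables
    WHERE schemaname = 'production' AND relname = 'tita';

-- Per-interval deltas between consecutive snapshots.
SELECT snap_time,
       n_tup_upd     - lag(n_tup_upd)     OVER (ORDER BY snap_time) AS upd_delta,
       n_tup_hot_upd - lag(n_tup_hot_upd) OVER (ORDER BY snap_time) AS hot_upd_delta
FROM stat_snap
ORDER BY snap_time;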
> What we have observed is that the frozenxid reaches the
> 200M mark fairly quickly because of the amount of activity. What is
> interesting is that this happens with the 'postgres' and 'template1'
> databases as well and there is absolutely no activity in those
> databases.
>
> When the 'postgres' and/or 'template1' databases hit the
> freeze_max_age, there are cases where it kicks off an aggressive
> autovac of those tables which seems to prevent autovacs from running
> elsewhere.
Yes, it is a long-standing known bug (or misfeature) that one database reaching autovacuum_freeze_max_age will starve all other databases of autovacuum attention. But since the introduction of the "freeze map" in 9.6, it is hard to see how this starvation caused by an inactive database hitting autovacuum_freeze_max_age could last for any meaningful amount of time. Maybe a shared catalog is involved?
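If a shared catalog is what keeps hitting the threshold, a query along
these lines should show it (a sketch; shared relations are visible from
any database):

-- Xid and multixact age of the shared catalogs (pg_authid, pg_database, ...).
SELECT relname,
       age(relfrozenxid)    AS xid_age,
       mxid_age(relminmxid) AS multixact_age
FROM pg_class
WHERE relisshared AND relkind = 'r'
ORDER BY xid_age DESC;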
> Oddly, this is not consistent, but that condition seems to
> be required. We have observed this across multiple PG12 servers (dev,
> test, staging, production) all with similar workloads.
>
> $ grep -i vacuum /var/log/postgresql/postgres.log | cut -b 1-9 | uniq -c
It is hard to know what the occurrences of the word 'vacuum' in the log file signify without being intimately familiar with your log format. Could you interpret this a bit more for us? How many of those lines are for 'tita'? How many are for databases other than your active one?
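One way to break the counts down by table, rather than grepping the logs,
is to look at the per-table autovacuum statistics (a sketch; it only
covers the current database, so it would need to be run in 'feedi',
'postgres', etc. separately):

-- Most recently autovacuumed tables in the current database.
SELECT schemaname, relname, last_autovacuum, autovacuum_count
FROM pg_stat_all_tables
WHERE last_autovacuum IS NOT NULL
ORDER BY last_autovacuum DESC
LIMIT 20;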
Cheers,
Jeff
Hi,

On 2020-03-17 17:18:57 -0500, Justin King wrote:
> As you can see in this table, there are only ~80K rows, but billions
> of updates. What we have observed is that the frozenxid reaches the
> 200M mark fairly quickly because of the amount of activity.

And each of those updates is in a separate transaction? Is that
required? I.e., is there any chance to perform multiple of those
updates in one transaction?

Have you considered just increasing the vacuum limit
(autovacuum_freeze_max_age)? It's mostly there because it can increase
space usage a bit, but given today's systems that's not usually a
problem unless you have hundreds of postgres instances on a single
system. It's pretty common to just increase it to 1.8 billion or so.

From a single stats snapshot we can't actually understand the rate of
xid consumption -- is it actually xid usage that triggers the vacuums?

What makes you think it is a problem that you have all these vacuums?
If you actually update that much, and you have indexes, you're going to
want a lot of vacuums.

> What is interesting is that this happens with the 'postgres' and
> 'template1' databases as well and there is absolutely no activity in
> those databases.

That's normal. They should be pretty darn quick in v12, though.

Greetings,

Andres Freund
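For reference, raising that limit is a one-line change, although it only
takes effect after a postmaster restart (a sketch using the 1.8 billion
figure mentioned above; adjust to taste):

-- Raise the wraparound autovacuum threshold cluster-wide.
ALTER SYSTEM SET autovacuum_freeze_max_age = 1800000000;
-- Restart the server (e.g. pg_ctl restart), then confirm the new value.
SHOW autovacuum_freeze_max_age;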