Discussion: hot_standby_feedback parameter doesn't work
We have a few database clusters (1 master, 2 hot standbys) with the
hot_standby_feedback setting on, but I am constantly seeing query
conflicts on the standbys because dead rows on the masters are deleted
by VACUUM. Usually I notice it shortly after VACUUM ends, because the
replication lag on the replicas starts to increase.

vacuum_defer_cleanup_age = 0. PostgreSQL version is 9.5.13. Cluster
load is about 1500 transactions per second.

Any ideas?

--
With best regards, Andrey Zhidenkov
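[A quick way to sanity-check the settings this thread discusses — a sketch only; these are all standard GUC names, and the expected values are simply the ones stated in the thread:]

-- Run on each node; hot_standby_feedback matters on the standbys,
-- vacuum_defer_cleanup_age on the master.
SELECT name, setting
  FROM pg_settings
 WHERE name IN ('hot_standby_feedback'        -- expected: on (standbys)
              , 'vacuum_defer_cleanup_age'    -- expected: 0 (master)
              , 'max_standby_streaming_delay'
              , 'max_standby_archive_delay');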
> On Nov 12, 2018, at 10:28 PM, Andrey Zhidenkov <andrey.zhidenkov@gmail.com> wrote:
>
> We have a few database clusters (1 master, 2 hot standbys) with the
> hot_standby_feedback setting on, but I am constantly seeing query
> conflicts on the standbys because dead rows on the masters are deleted
> by VACUUM. Usually I notice it shortly after VACUUM ends, because the
> replication lag on the replicas starts to increase.
>
> vacuum_defer_cleanup_age = 0. PostgreSQL version is 9.5.13. Cluster
> load is about 1500 transactions per second.
>
> Any ideas?

Are these streaming replicas? The WAL sender process on the master should show the xmin of the replicas; you can check the value from pg_stat_activity to make sure the feedback is occurring.

Are the replicas disconnecting from the replication stream? That would mean the xmin from the replica would no longer be in play.
Hi,

On 2018-11-13 10:28:20 +0700, Andrey Zhidenkov wrote:
> We have a few database clusters (1 master, 2 hot standbys) with the
> hot_standby_feedback setting on, but I am constantly seeing query
> conflicts on the standbys because dead rows on the masters are deleted
> by VACUUM. Usually I notice it shortly after VACUUM ends, because the
> replication lag on the replicas starts to increase.
>
> vacuum_defer_cleanup_age = 0. PostgreSQL version is 9.5.13. Cluster
> load is about 1500 transactions per second.

What's the precise conflict you get? I'd assume it's because of exclusive locks, not row version conflicts. Is that right?

If so, that'd likely be because vacuum gets a chance to truncate the relations at the end of vacuum. Before theorizing further, it'd be good to get confirmation that that's the right theory.

Greetings,

Andres Freund
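[One way to confirm which kind of conflict this is would be to check the standby's conflict counters — a sketch; pg_stat_database_conflicts is a standard view, but note it only counts queries that were actually cancelled, which, as pointed out further down in the thread, does not happen with an infinite standby delay:]

-- Run on the standby: cancelled queries per database, broken down by cause.
-- confl_lock points at lock conflicts such as the AccessExclusiveLock taken
-- when vacuum truncates a relation; confl_snapshot at row-cleanup conflicts;
-- confl_bufferpin at buffer-pin conflicts.
SELECT datname, confl_lock, confl_snapshot, confl_bufferpin, confl_deadlock
  FROM pg_stat_database_conflicts;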
I have increasing replication lag on the hot standby. When I grep the processes on the standby, I see that the startup (WAL replay) process is in status "waiting". This always happens after an (auto)vacuum of a table used by the problem query. Once I kill the problem query on the standby, the WAL segments are applied and the replication lag goes away.

Forgot to mention that max_standby_streaming_delay = -1 and max_standby_archive_delay = -1.
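[The lag itself can be watched from SQL on the standby — a sketch using the 9.5 function names; in PostgreSQL 10+ these were renamed to pg_last_wal_receive_lsn() and friends:]

-- Run on the standby. While replay is blocked, the received position keeps
-- advancing but the replayed position stalls, so the delay keeps growing.
SELECT pg_last_xlog_receive_location() AS received
     , pg_last_xlog_replay_location()  AS replayed
     , now() - pg_last_xact_replay_timestamp() AS replay_delay;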
On Tue, Nov 13, 2018 at 2:34 PM Rui DeSousa <rui@crazybean.net> wrote:
>
> Are these streaming replicas? The WAL sender process on the master should
> show the xmin of the replicas; you can check the value from
> pg_stat_activity to make sure the feedback is occurring.

Yes, they are streaming replicas. Could you please tell me how to check the xmin of the replica in pg_stat_activity? I didn't get the point.

> Are the replicas disconnecting from the replication stream? That would
> mean the xmin from the replica would no longer be in play.

No, the replicas are not disconnecting.

--
With best regards, Andrey Zhidenkov
>>>>> "Andrey" == Andrey Zhidenkov <andrey.zhidenkov@gmail.com> writes: Andrey> We have a few database clusters (1 master, 2 hot standbys) with Andrey> hot_standby_feedback setting on. But I am constantly seeing Andrey> query conflicts on standbys because dead rows on the masters Andrey> are deleted by VACUUM. Usually I notice it shortly after VACUUM Andrey> ends, because the replication lag on replicas is starting to Andrey> increase. Feedback can only try and avoid one of the approximately five possible causes of conflicts (albeit the most common one). Next most likely is a pin conflict, especially if you have any tables involved which are both small and with a relatively high update frequency. (Unfortunately with a max delay of 0 the standby queries won't be getting cancelled which in turn means that there are no statistics about the causes of conflicts.) The most likely scenario for a pin conflict is if you have queries which are (a) long-running, and (b) contain a sequential scan of a small (one or a few blocks) table that is _not_ underneath a Hash node or similar; for example, if the Seq Scan appears as the outer path of a Nestloop join at or near the top of the plan. In this case, vacuum of the small table may block because the query is holding pin for an extended period of time on the block that vacuum wants to clean up. (It can happen with index scans too, but is less likely.) -- Andrew (irc:RhodiumToad)
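[To illustrate the plan shape Andrew describes — hypothetical table names; small_lookup is assumed to be a one-page, frequently updated table and big a large one:]

-- The Seq Scan on small_lookup is the outer side of the Nested Loop and is
-- not buffered behind a Hash node, so the executor can keep that table's
-- single page pinned for the life of the query: the shape that can make
-- vacuum wait on the buffer pin.
EXPLAIN
SELECT b.payload
  FROM small_lookup s
  JOIN big b ON b.lookup_id = s.id;

--            QUERY PLAN (abridged, illustrative)
-- Nested Loop
--   ->  Seq Scan on small_lookup s
--   ->  Index Scan using big_lookup_id_idx on big b
--         Index Cond: (lookup_id = s.id)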
The problem is related to large tables (billions of rows) on a database running about 1500 transactions per second. I will continue to investigate. Thank you for your tips.

On Tue, Nov 13, 2018 at 5:04 PM Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
>
> Feedback can only try to avoid one of the approximately five possible
> causes of conflicts (albeit the most common one).
>
> [...]

--
With best regards, Andrey Zhidenkov
> On Nov 13, 2018, at 4:18 AM, Andrey Zhidenkov <andrey.zhidenkov@gmail.com> wrote:
>
> Yes, they are streaming replicas. Could you please tell me how to
> check the xmin of the replica in pg_stat_activity? I didn't get the point.

select pid
     , usename
     , application_name
     , backend_start
     , backend_xmin
     , state
  from pg_stat_replication
;

select pid
     , state
     , backend_xid
     , backend_xmin
     , backend_type  -- note: this column only exists in PostgreSQL 10+; drop it on 9.5
     , backend_start
  from pg_stat_activity
 where backend_xmin is not null
;

Vacuum will use the xmin to determine whether a record is still needed or not, thus I just wanted to make sure the replicas are sending that information to the master.

What is the exact error message?
It turned out that the problem is not caused by dead-row removal. The problem is that the autovacuum process truncates empty pages at the end of the relation, which takes an AccessExclusiveLock on the relation. The startup (WAL replay) process on the standby, in turn, tries to take this lock while replaying the corresponding standby_redo record and blocks behind a long-running query holding an AccessShareLock. Since max_standby_streaming_delay is set to -1, replay waits until the query finishes, and as a result the replication lag grows.

There is a patch proposed in the 2018-11 commit fest, but it has not even been reviewed yet: https://commitfest.postgresql.org/20/1683/

On Tue, Nov 13, 2018 at 10:53 PM Rui DeSousa <rui@crazybean.net> wrote:
>
> [...]
>
> Vacuum will use the xmin to determine whether a record is still needed
> or not, thus I just wanted to make sure the replicas are sending that
> information to the master.
>
> What is the exact error message?

--
With best regards, Andrey Zhidenkov
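[This situation can be seen in pg_locks on the standby while replay is stuck — a sketch against the standard catalog view; the expected output shape is illustrative:]

-- Run on the standby. Expect one backend holding AccessShareLock
-- (granted = t, the long-running query) and the startup process waiting
-- for AccessExclusiveLock (granted = f, the replayed truncation record).
SELECT locktype
     , relation::regclass AS relation
     , pid
     , mode
     , granted
  FROM pg_locks
 WHERE relation IS NOT NULL
 ORDER BY relation, granted DESC;

[For what it's worth, 9.5 has no per-table switch to suppress the truncation itself; the vacuum_truncate storage option only arrives in PostgreSQL 12.]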