Обсуждение: What causes streaming replication delay?
Hi, I have a problem that's driving me insane. I have two PostgreSQL 9.1 installations (yes, I know... very old.) One is configured as a master and the other is a streaming-replication hot-standby that supports read-only queries. Every now and then, although the standby claims to be streaming transactions if you look with "ps", it seems to stall and the replication delay grows by one second for each second of elapsed time. If I restart the hot-standby, it consumes WAL files and very quickly catches up. And then a few minutes later, it happens again. So, what are the possible causes of this behaviour? There are no long transactions on the standby, nor on the master. And how can I find out exactly what query is causing the problem, assuming it's a query? The relevant settings in the hot-standby postgresql.conf are: hot_standby = on max_standby_archive_delay = 10s max_standby_streaming_delay = 10s hot_standby_feedback = on and on the master: wal_level = hot_standby archive_mode = on archive_command = (a command to send WAL files to the standby) max_wal_senders = 8 and all other parameters are at their defaults. Regards, Dianne.
On Wed, Aug 31, 2016 at 11:44 AM, Dianne Skoll <dfs@roaringpenguin.com> wrote:
Hi,
I have a problem that's driving me insane. I have two PostgreSQL 9.1
installations (yes, I know... very old.) One is configured as a master
and the other is a streaming-replication hot-standby that supports
read-only queries.
Every now and then, although the standby claims to be streaming transactions
if you look with "ps", it seems to stall and the replication delay grows
by one second for each second of elapsed time.
If I restart the hot-standby, it consumes WAL files and very quickly
catches up. And then a few minutes later, it happens again.
So, what are the possible causes of this behaviour? There are no
long transactions on the standby, nor on the master. And how can
I find out exactly what query is causing the problem, assuming
it's a query?
The relevant settings in the hot-standby postgresql.conf are:
hot_standby = on
max_standby_archive_delay = 10s
max_standby_streaming_delay = 10s
hot_standby_feedback = on
and on the master:
wal_level = hot_standby
archive_mode = on
archive_command = (a command to send WAL files to the standby)
max_wal_senders = 8
and all other parameters are at their defaults.
Regards,
Dianne.
--
Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin
All depends on how you're monitoring your slave lag. Check this blog post and if you're seeing actual byte lag, then you could have cause for concern.
https://www.keithf4.com/monitoring_streaming_slave_lag/
https://www.keithf4.com/monitoring_streaming_slave_lag/
Keith
On Wed, 31 Aug 2016 12:04:23 -0400 Keith <keith@keithf4.com> wrote: > All depends on how you're monitoring your slave lag. Check this blog > post and if you're seeing actual byte lag, then you could have cause > for concern. It's not because the master is idle... there is tons of write traffic. And the way I measure the lag is about as directly as you can: I periodically insert a timestamp into a specific table on the master, and then poll the hot-standby until the inserted timestamp has appeared. Regards, Dianne.
Hi, I think I solved it. This is so ridiculous! The primary and standby databases are 200km apart and the path between them has a bandwidth of only 100 Mb/s. We had bursts of writes that created huge amounts of replication traffic. In addition to streaming replication, we're still copying WAL files over for safety (and also because this is 9.1...) We were hitting our bandwidth limit. By adding the --bwlimit parameter to the rsync command that copies the WAL files, we can limit the bandwidth usage during the bursts while still keeping it high enough for the WAL-file archiving to keep up on average. Keeping an eye on it for 24h before I completely declare it solved. Regards, Dianne.
Also, for the stupid-system-impaired-metric-didn't-put-us-on-the-moon, 200km is like 120 miles.
-------- Original message --------
From: Dianne Skoll <dfs@roaringpenguin.com>
Date: 08/31/2016 3:24 PM (GMT-06:00)
To: pgsql-admin <pgsql-admin@postgresql.org>
Subject: Re: [ADMIN] What causes streaming replication delay? [SOLVED, I think]
Or, for the LOCAL network administrator something like 650k feet.
So, we're not talking a direct cable...
;)
-------- Original message --------
From: Dianne Skoll <dfs@roaringpenguin.com>
Date: 08/31/2016 3:24 PM (GMT-06:00)
To: pgsql-admin <pgsql-admin@postgresql.org>
Subject: Re: [ADMIN] What causes streaming replication delay? [SOLVED, I think]
Hi,
I think I solved it. This is so ridiculous!
The primary and standby databases are 200km apart and the path between them
has a bandwidth of only 100 Mb/s. We had bursts of writes that created
huge amounts of replication traffic. In addition to streaming
replication, we're still copying WAL files over for safety (and also
because this is 9.1...)
We were hitting our bandwidth limit.
By adding the --bwlimit parameter to the rsync command that copies the WAL
files, we can limit the bandwidth usage during the bursts while still
keeping it high enough for the WAL-file archiving to keep up on average.
Keeping an eye on it for 24h before I completely declare it solved.
Regards,
Dianne.
--
Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin
I think I solved it. This is so ridiculous!
The primary and standby databases are 200km apart and the path between them
has a bandwidth of only 100 Mb/s. We had bursts of writes that created
huge amounts of replication traffic. In addition to streaming
replication, we're still copying WAL files over for safety (and also
because this is 9.1...)
We were hitting our bandwidth limit.
By adding the --bwlimit parameter to the rsync command that copies the WAL
files, we can limit the bandwidth usage during the bursts while still
keeping it high enough for the WAL-file archiving to keep up on average.
Keeping an eye on it for 24h before I completely declare it solved.
Regards,
Dianne.
--
Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin
Journyx, Inc.
7600 Burnet Road #300
Austin, TX 78757
www.journyx.com
Austin, TX 78757
www.journyx.com
p 512.834.8888
f 512-834-8858
Do you receive our promotional emails? You can subscribe or unsubscribe to those emails at http://go.journyx.com/emailPreference/e/4932/714/
On Wed, 31 Aug 2016 16:22:22 -0400 Dianne Skoll <dfs@roaringpenguin.com> wrote: > Hi, > > I think I solved it. This is so ridiculous! > > The primary and standby databases are 200km apart and the path > between them has a bandwidth of only 100 Mb/s. We had bursts of > writes that created huge amounts of replication traffic. In addition > to streaming replication, we're still copying WAL files over for > safety (and also because this is 9.1...) > > We were hitting our bandwidth limit. > > By adding the --bwlimit parameter to the rsync command that copies > the WAL files, we can limit the bandwidth usage during the bursts > while still keeping it high enough for the WAL-file archiving to keep > up on average. > > Keeping an eye on it for 24h before I completely declare it solved. You were bitten by the RFC 1925 points 1, 2 and 9. A solution as is point 3 is "a bad idea". https://tools.ietf.org/html/rfc1925 > Regards, > > Dianne. > > > -- > Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-admin --- --- Eduardo Morras <emorrasg@yahoo.es>