Обсуждение: pg_xlog-files not deleted on a standby after accidental server crash
Hi there,
I am currently encountering a weird problem:
Three days ago we had an accidental server crash on one of our database servers (the system simply restarted).
After it was up again, I restarted the databases (we are running 4 instances of Postgres on a 512 GB / 64-processor server). All databases came up fine again.
Afterwards I restarted the slave Postgres instances (running on another physical server); they recovered correctly and went on receiving WAL via streaming replication. All fine so far.
Then I realized that on only 1 of the 4 slave instances the WAL file count under data/pg_xlog kept rising and rising; it seems like WAL files are not being reused / deleted by Postgres.
Afterwards I tried different things:
1. Restarted the slave multiple times, without any effect on the xlog behaviour
2. Set up the slave again from a fresh base backup of the master instance
3. Checked file permissions, and even tried manually deleting a WAL file under data/pg_xlog (which worked)
Now I'm running out of ideas as to why these files are not reused/deleted anymore.
I don't see any errors in the log files (neither on the slave nor on the master).
The slave is only showing normal "restartpoint" logs:
Mar 24 10:51:10 ct-db1b ct-db1-1[65432]: [3-1] db=, user=, host= LOG: restartpoint starting: time
Mar 24 10:53:42 ct-db1b ct-db1-1[65432]: [4-1] db=, user=, host= LOG: restartpoint complete: wrote 21264 buffers (0.4%); 0 transaction log file(s) added, 0 removed, 16 recycled; write=149.345 s, sync=2.217 s, total=151.682 s; sync files=1156, longest=0.213 s, average=0.001 s
Mar 24 10:53:42 ct-db1b ct-db1-1[65432]: [5-1] db=, user=, host= LOG: recovery restart point at C3/E9AD3258
Mar 24 10:53:42 ct-db1b ct-db1-1[65432]: [5-2] db=, user=, host= DETAIL: last completed transaction was at log time 2015-03-24 09:53:42.247769+00
Mar 24 10:56:10 ct-db1b ct-db1-1[65432]: [6-1] db=, user=, host= LOG: restartpoint starting: time
Mar 24 10:58:41 ct-db1b ct-db1-1[65432]: [7-1] db=, user=, host= LOG: restartpoint complete: wrote 6402 buffers (0.1%); 0 transaction log file(s) added, 0 removed, 3 recycled; write=149.516 s, sync=1.101 s, total=150.757 s; sync files=729, longest=0.160 s, average=0.001 s
Mar 24 10:58:41 ct-db1b ct-db1-1[65432]: [8-1] db=, user=, host= LOG: recovery restart point at C3/EDA836D8
Mar 24 10:58:41 ct-db1b ct-db1-1[65432]: [8-2] db=, user=, host= DETAIL: last completed transaction was at log time 2015-03-24 09:58:40.611415+00
Mar 24 11:01:10 ct-db1b ct-db1-1[65432]: [9-1] db=, user=, host= LOG: restartpoint starting: time
Mar 24 11:03:41 ct-db1b ct-db1-1[65432]: [10-1] db=, user=, host= LOG: restartpoint complete: wrote 11746 buffers (0.2%); 0 transaction log file(s) added, 0 removed, 5 recycled; write=149.652 s, sync=0.936 s, total=150.725 s; sync files=672, longest=0.147 s, average=0.001 s
Mar 24 11:03:41 ct-db1b ct-db1-1[65432]: [11-1] db=, user=, host= LOG: recovery restart point at C3/F046DA18
Mar 24 11:03:41 ct-db1b ct-db1-1[65432]: [11-2] db=, user=, host= DETAIL: last completed transaction was at log time 2015-03-24 10:03:40.914148+00
Mar 24 11:06:10 ct-db1b ct-db1-1[65432]: [12-1] db=, user=, host= LOG: restartpoint starting: time
Mar 24 11:08:41 ct-db1b ct-db1-1[65432]: [13-1] db=, user=, host= LOG: restartpoint complete: wrote 11155 buffers (0.2%); 0 transaction log file(s) added, 0 removed, 4 recycled; write=149.780 s, sync=1.184 s, total=151.080 s; sync files=758, longest=0.184 s, average=0.001 s
Mar 24 11:08:41 ct-db1b ct-db1-1[65432]: [14-1] db=, user=, host= LOG: recovery restart point at C3/F3C51B40
Mar 24 11:08:41 ct-db1b ct-db1-1[65432]: [14-2] db=, user=, host= DETAIL: last completed transaction was at log time 2015-03-24 10:08:41.087379+00
Mar 24 11:11:10 ct-db1b ct-db1-1[65432]: [15-1] db=, user=, host= LOG: restartpoint starting: time
Mar 24 11:13:41 ct-db1b ct-db1-1[65432]: [16-1] db=, user=, host= LOG: restartpoint complete: wrote 12602 buffers (0.2%); 0 transaction log file(s) added, 0 removed, 2 recycled; write=149.632 s, sync=1.309 s, total=151.067 s; sync files=728, longest=0.203 s, average=0.001 s
Mar 24 11:13:41 ct-db1b ct-db1-1[65432]: [17-1] db=, user=, host= LOG: recovery restart point at C3/F93AC8F0
Mar 24 11:13:41 ct-db1b ct-db1-1[65432]: [17-2] db=, user=, host= DETAIL: last completed transaction was at log time 2015-03-24 10:13:41.168975+00
Mar 24 11:16:10 ct-db1b ct-db1-1[65432]: [18-1] db=, user=, host= LOG: restartpoint starting: time
Mar 24 11:18:41 ct-db1b ct-db1-1[65432]: [19-1] db=, user=, host= LOG: restartpoint complete: wrote 10234 buffers (0.2%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=149.411 s, sync=1.175 s, total=150.703 s; sync files=821, longest=0.142 s, average=0.001 s
Mar 24 11:18:41 ct-db1b ct-db1-1[65432]: [20-1] db=, user=, host= LOG: recovery restart point at C3/FDD61320
Mar 24 11:18:41 ct-db1b ct-db1-1[65432]: [20-2] db=, user=, host= DETAIL: last completed transaction was at log time 2015-03-24 10:18:41.052593+00
Mar 24 11:21:10 ct-db1b ct-db1-1[65432]: [21-1] db=, user=, host= LOG: restartpoint starting: time
Mar 24 11:23:40 ct-db1b ct-db1-1[65432]: [22-1] db=, user=, host= LOG: restartpoint complete: wrote 9105 buffers (0.2%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=149.598 s, sync=0.842 s, total=150.579 s; sync files=687, longest=0.061 s, average=0.001 s
Mar 24 11:23:40 ct-db1b ct-db1-1[65432]: [23-1] db=, user=, host= LOG: recovery restart point at C4/290A078
Mar 24 11:23:40 ct-db1b ct-db1-1[65432]: [23-2] db=, user=, host= DETAIL: last completed transaction was at log time 2015-03-24 10:23:40.446308+00
Postgres version:
postgres=# select version();
version
--------------------------------------------------------------------------------------------------------------
PostgreSQL 9.4.1 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4), 64-bit
(1 row)
I would really appreciate somebody's advice on this.
Thanks in advance.
--
Mit freundlichen Grüßen / Best regards
Paul Dunkler
-----------------------------------------------------
XYRALITY GmbH • Friedensallee 290 • 22763 Hamburg
Paul Dunkler • Head of IT-Operations
Mail: paul.dunkler@xyrality.com
Tel: +49 (0) 40 35 73 001 41
Mobil: +49 (0) 151 252 228 42
Fax: +49 (0) 40 35 73 001 99
Web: http://www.xyrality.com/
Registergericht: Hamburg HRB 115332
Geschäftsführer: Sven Ossenbrüggen
-----------------------------------------------------
Hi Paul,
Greetings! I guess your slave is running properly; I think applying all the WALs is just taking time.
The commands below are useful for checking replication status.
1. pg_stat_replication view on the master/primary server.
postgres=# select * from pg_stat_replication ;
2. pg_is_in_recovery(): function which tells whether the standby is still in recovery mode or not.
postgres=# select pg_is_in_recovery();
3. pg_last_xlog_receive_location(): function which tells the location of the last transaction log that was streamed by the standby and also written to the standby's disk.
postgres=# select pg_last_xlog_receive_location();
4. pg_last_xlog_replay_location(): function which tells the last transaction log location replayed during the recovery process.
postgres=# select pg_last_xlog_replay_location();
5. pg_last_xact_replay_timestamp(): function which tells the timestamp of the last transaction replayed during recovery. Below is an example:
postgres=# select pg_last_xact_replay_timestamp();
6. Lag in bytes: to calculate the lag in bytes, you can use the pg_stat_replication view on the master together with the pg_xlog_location_diff() function.
postgres=# select pg_xlog_location_diff(sent_location, replay_location) from pg_stat_replication;
7. Calculating the lag in seconds. The following is the SQL most people use to find the lag in seconds:
SELECT CASE WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location() THEN 0
ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp())
END AS log_delay;
8. If the slave is up in hot standby mode, you can tell the delay (in seconds) of the transactions applied on the slave with this query:
postgres=# select now() - pg_last_xact_replay_timestamp() AS replication_delay;
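If it is more convenient, checks 2 through 5 and 8 can also be run as a single combined query on the standby (just a convenience sketch using the same 9.4 functions):
postgres=# SELECT pg_is_in_recovery() AS in_recovery,
                  pg_last_xlog_receive_location() AS receive_location,
                  pg_last_xlog_replay_location() AS replay_location,
                  now() - pg_last_xact_replay_timestamp() AS replay_delay;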
Thanks & Regards,
M Naveen Kuamr,
PostgreSQL Database Administrator,
Mobile Number: +91 7755929449.
archive_cleanup_command = 'pg_archivecleanup <archive_location> %r'
Refer to this; it will help you.
http://www.postgresql.org/docs/9.3/static/pgarchivecleanup.html
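For context, on the standby this would go into recovery.conf roughly as follows (only a sketch with placeholder paths; the archive_cleanup_command part applies only if the standby restores WAL from an archive):
# recovery.conf (sketch; /path/to/archive is a placeholder)
standby_mode = 'on'
restore_command = 'cp /path/to/archive/%f %p'
archive_cleanup_command = 'pg_archivecleanup /path/to/archive %r'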
Thanks & Regards,
M Naveen Kuamr,
PostgreSQL Database Administrator,
Mobile Number: +91 7755929449.
On Wed, Mar 25, 2015 at 3:40 PM, Paul Dunkler <paul.dunkler@xyrality.com> wrote:
Hi there,
Thanks for your answer. We already have some "replay delay" checks implemented in Nagios, which show fine for this instance.
Anyway, I have tested all your commands (thanks, I didn't know some of them), but the result is still the same. Replication is fine and is not lagging... But the pg_xlog content is growing and growing and growing.
[root@ct-db1b pg_xlog]# ls -la | wc -l
1479
> I might be wrong, because Im one of the growing up guy in the PostgreSQL world. But you can try this.
>
> I guess your salve is running properly, I think applying all WALs is taking time.
>
> Bellow commands are use full check replication status.
>
> 1. pg_stat_replication view on master/primary server.
>
> postgres=# select * from pg_stat_replication ;
postgres=# select * from pg_stat_replication;
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state
-------+----------+----------+------------------+-------------+-----------------+-------------+------------------------------+--------------+-----------+---------------+----------------+----------------+-----------------+---------------+------------
15919 | 10 | postgres | walreceiver | 172.16.0.2 | | 50495 | 2015-03-24 09:46:12.59108+00 | | streaming | CA/CB1345F8 | CA/CB1345F8 | CA/CB1345F8 | CA/CB1345B8 | 0 | async
(1 row)
> 2. pg_is_in_recovery() : Function which tells whether standby is still in recovery mode or not.
>
> postgres=# select pg_is_in_recovery();
postgres=# select pg_is_in_recovery();
pg_is_in_recovery
-------------------
t
(1 row)
> 3. pg_last_xlog_receive_location: Function which tells location of last transaction log which was streamed by Standby and also written on standby disk.
>
> postgres=# select pg_last_xlog_receive_location();
> 4. pg_last_xlog_replay_location: Function which tells last transaction replayed during recovery process.
>
> postgres=# select pg_last_xlog_replay_location();
>
> 5. pg_last_xact_replay_timestamp: This function tells about the time stamp of last transaction which was replayed during recovery. Below is an example:
>
> postgres=# select pg_last_xact_replay_timestamp();
postgres=# select pg_last_xlog_receive_location(); select pg_last_xlog_replay_location(); select pg_last_xact_replay_timestamp();
pg_last_xlog_receive_location
-------------------------------
CA/CAF27218
(1 row)
pg_last_xlog_replay_location
------------------------------
CA/CAF27218
(1 row)
pg_last_xact_replay_timestamp
-------------------------------
2015-03-25 09:54:54.766285+00
(1 row)
> 6. Lags in Bytes: For calculating lags in bytes, users can use the pg_stat_replication view on the master with the function pg_xlog_location_diff function.
>
> postgres=# select pg_xlog_location_diff(pg_stat_replication.sent_location, pg_stat_replication.replay_location);
postgres=# SELECT client_hostname, client_addr, pg_xlog_location_diff(pg_stat_replication.sent_location, pg_stat_replication.replay_location) AS byte_lag FROM pg_stat_replication;
client_hostname | client_addr | byte_lag
-----------------+-------------+----------
| 172.16.0.2 | 13024
(1 row)
> 7.Calculating lags in Seconds. The following is SQL, which most people uses to find the lag in seconds:
>
> SELECT CASE WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location()THEN 0 ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp())
> END AS log_delay;
postgres=# SELECT CASE WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location()THEN 0 ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp())
END AS log_delay;
log_delay
-----------
0
(1 row)
> 8.If the slave is up in hot standby mode, you can tell the time in seconds the delay of transactions applied on the slave with this query:
>
> postgres=# select now() - pg_last_xact_replay_timestamp() AS replication_delay;
postgres=# select now() - pg_last_xact_replay_timestamp() AS replication_delay;
replication_delay
-------------------
00:00:00.296506
(1 row)
> And please verify archive_cleanup_command in recovery.conf on slave.
I don't use this command on any slaves...
--
Paul Dunkler
Hi there,
> archive_cleanup_command = 'pg_archivecleanup <archive_location> %r'
>
> Refer to this; it will help you.
> http://www.postgresql.org/docs/9.3/static/pgarchivecleanup.html
No, that can't help me - sorry, then it seems you didn't get my problem. The "data/pg_xlog" folder is filling up and the files are not getting deleted on the slave. I don't have any problem with the "archive" on the slave. I don't even have an "archive" on the slave...
By the way, my restore_command on the slave is configured to fetch the xlog archives via rsync from the master's xlog archive if it needs to. But that is only needed during startup recovery after creating the slave from a base backup - and that works as expected...
--
Mit freundlichen Grüßen / Kind regards
Paul Dunkler
On Wed, Mar 25, 2015 at 3:17 AM, Paul Dunkler <paul.dunkler@xyrality.com> wrote:
> Then I realized that on only 1 of the 4 slave instances the WAL file count under data/pg_xlog kept rising and rising; it seems like WAL files are not being reused / deleted by Postgres.
This is weird. Do you see this behavior on the master too?
How many files do you see in the pg_xlog location at any point in time?
Is the number of files much bigger than the configured checkpoint_segments?
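A quick way to gather those two numbers on the standby (just a sketch; pg_ls_dir() needs a superuser):
postgres=# SHOW checkpoint_segments;
postgres=# SELECT count(*) FROM pg_ls_dir('pg_xlog') AS t(f) WHERE f ~ '^[0-9A-F]{24}$';  -- WAL segment files only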
> The slave is only showing normal "restartpoint" logs:
> Mar 24 10:51:10 ct-db1b ct-db1-1[65432]: [3-1] db=, user=, host= LOG: restartpoint starting: time
> Mar 24 10:53:42 ct-db1b ct-db1-1[65432]: [4-1] db=, user=, host= LOG: restartpoint complete: wrote 21264 buffers (0.4%); 0 transaction log file(s) added, 0 removed, 16 recycled; write=149.345 s, sync=2.217 s, total=151.682 s; sync files=1156, longest=0.213 s, average=0.001 s
Looking at your logs, it does make sense why the WAL files are not getting deleted. If you observe your logfile, the checkpoint log message always says "0 removed and # recycled"; this precisely means the files are getting reused and are not getting deleted. I am curious how you came to the conclusion that WAL files are not getting reused.
If you are concerned about the number of files in pg_xlog, then it must be due to a large checkpoint_segments or a long checkpoint_timeout, or you may have wal_keep_segments configured (I am not completely sure about that parameter's effect on a standby).
Ideally, the CHECKPOINT process is responsible for periodically deleting WAL files from the pg_xlog location, and it takes wal_keep_segments into account. If you see any big transactions generating huge amounts of WAL, then you should notice the same behavior on the master as well.
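If it helps, the retention-related settings (and, in case anything replicates from this standby, any replication slots holding WAL back) can be checked like this (a sketch using 9.4 names):
postgres=# SHOW wal_keep_segments;
postgres=# SHOW checkpoint_timeout;
postgres=# SELECT slot_name, active, restart_lsn FROM pg_replication_slots;  -- only relevant if cascading standbys connect here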
Regards,
Venkata Balaji N
> This is weird. Do you see this behavior on the master too?

No - that's the strange thing. I don't have the problem on the master side.
No long-running transactions or anything like that (we even have a script which takes care of those).

> How many files do you see in the pg_xlog location at any point in time?

The number is just growing and growing and growing... We let it grow for about 4 days and then had roughly 3000 files in pg_xlog on the slave...

> The number of files is much bigger than the configured checkpoint_segments?

checkpoint_segments is set to 40...

> Looking at your logs, it does make sense why the WAL files are not getting deleted. If you observe your logfile, the checkpoint log message always says "0 removed and # recycled", which precisely means the files are getting reused and not deleted. I am curious how you came to the conclusion that the WAL files are not getting reused.

Because I knew the behaviour from before, and I see the expected behaviour on the other instances... It can't be that the number just keeps growing and growing and growing...

> If you are concerned about the number of files in pg_xlog, then it must be due to a large checkpoint_segments or a long checkpoint_timeout, or you may have wal_keep_segments configured (I am not completely sure about that parameter's effect on a standby).

As I said, checkpoint_segments is 40, checkpoint_timeout is 5 min and wal_keep_segments is set to 32.

> Ideally, the CHECKPOINT process is responsible for periodically deleting WAL files from the pg_xlog location, and it takes wal_keep_segments into account. If you had big transactions generating huge amounts of WAL, you would notice the same behavior on the master as well.

For now we have set archive_cleanup_command on the slave to delete the xlogs under data/pg_xlog - but it shouldn't be like that - this should happen automatically...

Still out of ideas...

--
Mit freundlichen Grüßen / Kind regards

Paul Dunkler
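For context, the PostgreSQL 9.x documentation gives a rough upper bound on how many segments should normally sit in pg_xlog. Plugging in the settings quoted above - and assuming the default 16 MB segment size and checkpoint_completion_target = 0.5, neither of which is stated in the thread - the roughly 3000 files observed are far beyond anything the configuration explains:

    upper bound on segments = max( (2 + checkpoint_completion_target) * checkpoint_segments + 1,
                                   checkpoint_segments + wal_keep_segments + 1 )
                            = max( (2 + 0.5) * 40 + 1,  40 + 32 + 1 )
                            = max( 101, 73 ) = 101 segments   (about 1.6 GB at 16 MB each)

    observed                ≈ 3000 segments                   (about 47 GB at 16 MB each)

(The documented bound is for a primary; restartpoints on a standby recycle segments in much the same way, so this is only an order-of-magnitude sanity check.)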
Hi Paul,
May I know how you set up replication? Are you using cascading replication?

Thanks & Regards,
M Naveen Kuamr,
PostgreSQL Database Administrator,
Mobile Number: +91 7755929449.
On Mon, Mar 30, 2015 at 6:13 PM, Paul Dunkler <paul.dunkler@xyrality.com> wrote:
>
> i am currently encountering a weird problem:
>
> 3 Days ago we had an accidental server crash on one of our database servers (system just restarted).
> After it was up again, i restarted databases (we are running 4 instances of postgres on a 512GB / 64 Processor server). All databases came up fine again.
> Afterwards i restarted the slave postgres instances (running on another physical server) and they could recover correctly and went ahead with receiving WAL via streaming replication. All fine so far.
>
> Then i realized that only on 1 of the 4 slave instances, the WAL-File count under data/pg_xlog was raising and raising, seems like WAL files are not reused / deleted by postgres.
>
> This is weird. Do you see this behavior on master too ?
No - that's the strange thing. I don't have the problem on the master side.
No long-running transactions or anything like that (we even have a script which takes care of those).
> How many files do you see in pg_xlog location at any point-of-time ?
The number is just growing and growing and growing... We let it grow for like 4 days and then we had like 3000 files in pg_xlog on the slave...
> The number of files is much bigger than the configured checkpoint_segments?
checkpoint_segments is set to 40...
> Afterwards i tried different things:
> 1. Restarted the slave multiple times without any effect on the xlog behaviour
> 2. Setting up the slave again from a fresh base backup of the master instance
> 3. Checking file permissions, even tried manually deleting a WAL-File under data/pg_xlog (which worked)
>
> Now i'm running out of ideas why these files are not reused/deleted anymore.
> I don't see any errors in the logfiles (not on slave, not on master).
>
> Slave is only showing normal "restartpoint" logs:
>> Mar 24 10:51:10 ct-db1b ct-db1-1[65432]: [3-1] db=, user=, host= LOG: restartpoint starting: time
>> Mar 24 10:53:42 ct-db1b ct-db1-1[65432]: [4-1] db=, user=, host= LOG: restartpoint complete: wrote 21264 buffers (0.4%); 0 transaction log file(s) added, 0 removed, 16 recycled; write=149.345 s, sync=2.217 s, total=151.682 s; sync files=1156, longest=0.213 s, average=0.001 s
>> Mar 24 10:53:42 ct-db1b ct-db1-1[65432]: [5-1] db=, user=, host= LOG: recovery restart point at C3/E9AD3258
>> Mar 24 10:53:42 ct-db1b ct-db1-1[65432]: [5-2] db=, user=, host= DETAIL: last completed transaction was at log time 2015-03-24 09:53:42.247769+00
>> Mar 24 10:56:10 ct-db1b ct-db1-1[65432]: [6-1] db=, user=, host= LOG: restartpoint starting: time
>> Mar 24 10:58:41 ct-db1b ct-db1-1[65432]: [7-1] db=, user=, host= LOG: restartpoint complete: wrote 6402 buffers (0.1%); 0 transaction log file(s) added, 0 removed, 3 recycled; write=149.516 s, sync=1.101 s, total=150.757 s; sync files=729, longest=0.160 s, average=0.001 s
>> Mar 24 10:58:41 ct-db1b ct-db1-1[65432]: [8-1] db=, user=, host= LOG: recovery restart point at C3/EDA836D8
>> Mar 24 10:58:41 ct-db1b ct-db1-1[65432]: [8-2] db=, user=, host= DETAIL: last completed transaction was at log time 2015-03-24 09:58:40.611415+00
>> Mar 24 11:01:10 ct-db1b ct-db1-1[65432]: [9-1] db=, user=, host= LOG: restartpoint starting: time
>> Mar 24 11:03:41 ct-db1b ct-db1-1[65432]: [10-1] db=, user=, host= LOG: restartpoint complete: wrote 11746 buffers (0.2%); 0 transaction log file(s) added, 0 removed, 5 recycled; write=149.652 s, sync=0.936 s, total=150.725 s; sync files=672, longest=0.147 s, average=0.001 s
>> Mar 24 11:03:41 ct-db1b ct-db1-1[65432]: [11-1] db=, user=, host= LOG: recovery restart point at C3/F046DA18
>> Mar 24 11:03:41 ct-db1b ct-db1-1[65432]: [11-2] db=, user=, host= DETAIL: last completed transaction was at log time 2015-03-24 10:03:40.914148+00
>> Mar 24 11:06:10 ct-db1b ct-db1-1[65432]: [12-1] db=, user=, host= LOG: restartpoint starting: time
>> Mar 24 11:08:41 ct-db1b ct-db1-1[65432]: [13-1] db=, user=, host= LOG: restartpoint complete: wrote 11155 buffers (0.2%); 0 transaction log file(s) added, 0 removed, 4 recycled; write=149.780 s, sync=1.184 s, total=151.080 s; sync files=758, longest=0.184 s, average=0.001 s
>> Mar 24 11:08:41 ct-db1b ct-db1-1[65432]: [14-1] db=, user=, host= LOG: recovery restart point at C3/F3C51B40
>> Mar 24 11:08:41 ct-db1b ct-db1-1[65432]: [14-2] db=, user=, host= DETAIL: last completed transaction was at log time 2015-03-24 10:08:41.087379+00
>> Mar 24 11:11:10 ct-db1b ct-db1-1[65432]: [15-1] db=, user=, host= LOG: restartpoint starting: time
>> Mar 24 11:13:41 ct-db1b ct-db1-1[65432]: [16-1] db=, user=, host= LOG: restartpoint complete: wrote 12602 buffers (0.2%); 0 transaction log file(s) added, 0 removed, 2 recycled; write=149.632 s, sync=1.309 s, total=151.067 s; sync files=728, longest=0.203 s, average=0.001 s
>> Mar 24 11:13:41 ct-db1b ct-db1-1[65432]: [17-1] db=, user=, host= LOG: recovery restart point at C3/F93AC8F0
>> Mar 24 11:13:41 ct-db1b ct-db1-1[65432]: [17-2] db=, user=, host= DETAIL: last completed transaction was at log time 2015-03-24 10:13:41.168975+00
>> Mar 24 11:16:10 ct-db1b ct-db1-1[65432]: [18-1] db=, user=, host= LOG: restartpoint starting: time
>> Mar 24 11:18:41 ct-db1b ct-db1-1[65432]: [19-1] db=, user=, host= LOG: restartpoint complete: wrote 10234 buffers (0.2%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=149.411 s, sync=1.175 s, total=150.703 s; sync files=821, longest=0.142 s, average=0.001 s
>> Mar 24 11:18:41 ct-db1b ct-db1-1[65432]: [20-1] db=, user=, host= LOG: recovery restart point at C3/FDD61320
>> Mar 24 11:18:41 ct-db1b ct-db1-1[65432]: [20-2] db=, user=, host= DETAIL: last completed transaction was at log time 2015-03-24 10:18:41.052593+00
>> Mar 24 11:21:10 ct-db1b ct-db1-1[65432]: [21-1] db=, user=, host= LOG: restartpoint starting: time
>> Mar 24 11:23:40 ct-db1b ct-db1-1[65432]: [22-1] db=, user=, host= LOG: restartpoint complete: wrote 9105 buffers (0.2%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=149.598 s, sync=0.842 s, total=150.579 s; sync files=687, longest=0.061 s, average=0.001 s
>> Mar 24 11:23:40 ct-db1b ct-db1-1[65432]: [23-1] db=, user=, host= LOG: recovery restart point at C4/290A078
>> Mar 24 11:23:40 ct-db1b ct-db1-1[65432]: [23-2] db=, user=, host= DETAIL: last completed transaction was at log time 2015-03-24 10:23:40.446308+00
> Looking at your logs, it does make sense why the WAL files are not getting deleted. If you observe your logfile, the checkpoint log message always says "0 removed and # recycled", which precisely means the files are getting reused and not deleted. I am curious how you came to the conclusion that the WAL files are not getting reused.
Because I knew the behaviour from before, and I see the expected behaviour on the other instances... It can't be that the number just keeps growing and growing and growing...
>
> If you are concerned about the number of files in pg_xlog, then it must be due to a large checkpoint_segments or a long checkpoint_timeout, or you may have wal_keep_segments configured (I am not completely sure about that parameter's effect on a standby).
As I said, checkpoint_segments is 40, checkpoint_timeout is 5 min and wal_keep_segments is set to 32.
> Ideally, the CHECKPOINT process is responsible for periodically deleting WAL files from the pg_xlog location, and it takes wal_keep_segments into account. If you had big transactions generating huge amounts of WAL, you would notice the same behavior on the master as well.
For now we have set archive_cleanup_command on the slave to delete the xlogs under data/pg_xlog - but it shouldn't be like that - this should happen automatically...
Still out of ideas...
--
Mit freundlichen Grüßen / Kind regards
Paul Dunkler
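One way to check whether the standby is actually recycling segments (as the "0 removed, N recycled" restartpoint lines suggest) rather than just accumulating them is to track the file count and the range of segment names over time. A minimal shell sketch, assuming PGDATA points at the affected standby's data directory (the path handling is hypothetical):

    # number of plain WAL segments currently in pg_xlog
    ls "$PGDATA/pg_xlog" | grep -cE '^[0-9A-F]{24}$'

    # oldest and newest segment names still present; on a healthy standby the
    # oldest name should advance from one restartpoint to the next
    ls "$PGDATA/pg_xlog" | grep -E '^[0-9A-F]{24}$' | sort | head -1
    ls "$PGDATA/pg_xlog" | grep -E '^[0-9A-F]{24}$' | sort | tail -1

If the count keeps climbing while the oldest segment name never moves, the files really are being kept around rather than recycled.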
Hi there,

> May I know how you set up replication? Are you using cascading replication?

Sure - we don't use cascading replication. We just have one master and one slave server. Both servers are directly connected via a 1 Gbps LAN interconnect. The slave is connected to the master via the "primary_conninfo" parameter and uses asynchronous streaming replication.

--
Mit freundlichen Grüßen / Kind regards

Paul Dunkler
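For reference, both pieces mentioned in this thread - the primary_conninfo streaming setup and the archive_cleanup_command workaround - live in the standby's recovery.conf. A minimal sketch; host, user and paths are placeholders, not the actual values from this setup:

    # recovery.conf on the standby (PostgreSQL 9.x style)
    standby_mode = 'on'
    primary_conninfo = 'host=master.example port=5432 user=replication'

    # workaround described above: at every restartpoint, prune segments older
    # than the last restart point (%r) from the standby's own pg_xlog directory
    archive_cleanup_command = 'pg_archivecleanup /path/to/data/pg_xlog %r'

Note that archive_cleanup_command is normally meant to prune a WAL archive used by restore_command, not pg_xlog itself; pointing it at pg_xlog is the stop-gap described above, not the intended use.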