Hello all,
I'm having an issue with a PostgreSQL 9.2 cluster during failover and hope you all can help. I have been attempting
to follow the guide provided at ClusterLabs(1), but I'm not having much luck and I don't quite understand where the
issue is. I'm running on Debian Wheezy.
I have my crm_mon output below. One server is PRI and operating normally after taking over. I have pg set up to do
the WAL archiving via rsync to the opposite node:
  archive_command = 'rsync -a %p test-node2:/db/data/postgresql/9.2/pg_archive/%f'
The rsync is working, and I do see WAL files going to the other host appropriately.
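For completeness, the relevant postgresql.conf lines on the primary look roughly like this (only archive_command is a verbatim quote from my config; the other settings are what I'd expect to need for archiving plus a hot standby on 9.2, not a paste):

```
# postgresql.conf on the current primary (test-node1)
wal_level = hot_standby          # needed so the standby can serve read-only queries
archive_mode = on                # needed for archive_command to run at all
archive_command = 'rsync -a %p test-node2:/db/data/postgresql/9.2/pg_archive/%f'
max_wal_senders = 5              # at least 1 for the standby's walreceiver
hot_standby = on                 # on the standby side, allows read-only connections
```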
Node2 was the PRI. Last night node1, which had previously been in HS:sync, was promoted to PRI, and node2 is now
stopped. The WAL files from node1 are arriving on node2. I cleaned up the /tmp/PGSQL.lock file and proceeded with a
pg_basebackup restore from node1. This all went well, without error in the node1 postgresql log.
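In case it matters, the resync steps I ran on node2 were roughly the following. Treat this as a sketch from memory rather than a verbatim transcript: the data directory path here is illustrative, and the exact pg_basebackup flags are what I believe I used (9.2 supports -X stream):

```
# on test-node2, as postgres, with the pgsql resource stopped on this node
rm /tmp/PGSQL.lock                      # stale lock left by the pgsql RA after failover
rm -rf /db/data/postgresql/9.2/main     # discard the old timeline's data directory
pg_basebackup -h test-node1 -D /db/data/postgresql/9.2/main -X stream -P

# then clear the failed state so pacemaker restarts the resource
crm resource cleanup msPostgresql
```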
After running a crm cleanup on the msPostgresql resource, node2 keeps showing 'LATEST' but gets hung up at HS:alone.
Plus, I don't understand why node2's xlog-loc shows 0000001EB9053DD8, which is farther ahead than node1's
master-baseline of 0000001EB2000080. I saw the 'cannot stat ... 000000010000001E000000BB' error, but that seems to
always happen for the current xlog filename.
And if I wasn't confused enough, the pg log on node2 says "streaming replication successfully connected to primary",
and the pg_stat_replication query on node1 shows it connected, but ASYNC.
Any ideas?
Very much appreciated!
-With kind regards,
Peter Brunnengräber
References:
(1) http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster#after_fail-over
###
============
Last updated: Wed Jul 13 14:51:53 2016
Last change: Wed Jul 13 14:49:17 2016 via crmd on test-node2
Stack: openais
Current DC: test-node1 - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
2 Nodes configured, 2 expected votes
4 Resources configured.
============
Online: [ test-node1 test-node2 ]
Full list of resources:
Resource Group: g_master
ClusterIP-Net1 (ocf::heartbeat:IPaddr2): Started test-node1
ReplicationIP-Net2 (ocf::heartbeat:IPaddr2): Started test-node1
Master/Slave Set: msPostgresql [pgsql]
Masters: [ test-node1 ]
Slaves: [ test-node2 ]
Node Attributes:
* Node test-node1:
+ master-pgsql:0 : 1000
+ master-pgsql:1 : 1000
+ pgsql-data-status : LATEST
+ pgsql-master-baseline : 0000001EB2000080
+ pgsql-status : PRI
* Node test-node2:
+ master-pgsql:0 : -INFINITY
+ master-pgsql:1 : -INFINITY
+ pgsql-data-status : LATEST
+ pgsql-status : HS:alone
+ pgsql-xlog-loc : 0000001EB9053DD8
Migration summary:
* Node test-node2:
* Node test-node1:
#### Node2
2016-07-13 14:55:09 UTC LOG: database system was interrupted; last known up at 2016-07-13 14:54:27 UTC
2016-07-13 14:55:09 UTC LOG: creating missing WAL directory "pg_xlog/archive_status"
cp: cannot stat `/db/data/postgresql/9.2/pg_archive/00000002.history': No such file or directory
2016-07-13 14:55:09 UTC LOG: entering standby mode
2016-07-13 14:55:09 UTC LOG: restored log file "000000010000001E000000BA" from archive
2016-07-13 14:55:09 UTC FATAL: the database system is starting up
2016-07-13 14:55:09 UTC LOG: redo starts at 1E/BA000020
2016-07-13 14:55:09 UTC LOG: consistent recovery state reached at 1E/BA05FED8
2016-07-13 14:55:09 UTC LOG: database system is ready to accept read only connections
cp: cannot stat `/db/data/postgresql/9.2/pg_archive/000000010000001E000000BB': No such file or directory
cp: cannot stat `/db/data/postgresql/9.2/pg_archive/00000002.history': No such file or directory
2016-07-13 14:55:09 UTC LOG: streaming replication successfully connected to primary
#### Node1
postgres=# select application_name,upper(state),upper(sync_state) from pg_stat_replication;
+------------------+-----------+-------+
| application_name | upper | upper |
+------------------+-----------+-------+
| test-node2 | STREAMING | ASYNC |
+------------------+-----------+-------+
(1 row)