Thread: Pacemaker dynamic membership

Pacemaker dynamic membership

From

Nikolay Popov

Date:

October 7, 2015, 07:44:26

Hello.

We were looking for ways to use the Corosync/Pacemaker stack to create a high-availability cluster of PostgreSQL servers with automatic failover.

We are using Corosync (2.3.4) as the messaging layer and a stateful master/slave Resource Agent (pgsql) with Pacemaker (1.1.12) on CentOS 7.1.

Things work pretty well for a static cluster, where membership is defined up front. However, we need to be able to seamlessly add new machines (nodes) to the cluster and remove existing ones from it without service interruption, and this is where we ran into a problem.

Is it possible to add a new node dynamically without interruption?

Here are the steps we are using to add a node:

# pcs property show
Cluster Properties:
cluster-infrastructure: corosync
cluster-name: mycluster1
dc-version: 1.1.13-a14efad
have-watchdog: false
last-lrm-refresh: 1444042099
no-quorum-policy: stop
stonith-action: reboot
stonith-enabled: true
Node Attributes:
pi01: pgsql-data-status=STREAMING|SYNC
pi02: pgsql-data-status=STREAMING|POTENTIAL
pi03: pgsql-data-status=LATEST

# pcs resource show --full

Group: master-group

Resource: vip-master (class=ocf provider=heartbeat type=IPaddr2)

Attributes: ip=192.168.242.100 nic=eth0 cidr_netmask=24

Operations: start interval=0s timeout=60s on-fail=restart (vip-master-start-interval-0s)

monitor interval=10s timeout=60s on-fail=restart (vip-master-monitor-interval-10s)

stop interval=0s timeout=60s on-fail=block (vip-master-stop-interval-0s)

Resource: vip-rep (class=ocf provider=heartbeat type=IPaddr2)

Attributes: ip=192.168.242.101 nic=eth0 cidr_netmask=24

Meta Attrs: migration-threshold=0

Operations: start interval=0s timeout=60s on-fail=stop (vip-rep-start-interval-0s)

monitor interval=10s timeout=60s on-fail=restart (vip-rep-monitor-interval-10s)

stop interval=0s timeout=60s on-fail=ignore (vip-rep-stop-interval-0s)

Master: msPostgresql

Meta Attrs: master-max=1 master-node-max=1 clone-max=3 clone-node-max=1 notify=true

Resource: pgsql (class=ocf provider=heartbeat type=pgsql)

Attributes: pgctl=/usr/pgsql-9.5/bin/pg_ctl psql=/usr/pgsql-9.5/bin/psql pgdata=/var/lib/pgsql/9.5/data/ rep_mode=sync node_list="pi01 pi02 pi03" restore_command="cp /var/lib/pgsql/9.5/data/wal_archive/%f %p" primary_conninfo_opt="user=repl password=super-pass-for-repl keepalives_idle=60 keepalives_interval=5 keepalives_count=5" master_ip=192.168.242.100 restart_on_promote=true check_wal_receiver=true

Operations: start interval=0s timeout=60s on-fail=restart (pgsql-start-interval-0s)

monitor interval=4s timeout=60s on-fail=restart (pgsql-monitor-interval-4s)

monitor role=Master timeout=60s on-fail=restart interval=3s (pgsql-monitor-interval-3s-role-Master)

promote interval=0s timeout=60s on-fail=restart (pgsql-promote-interval-0s)

demote interval=0s timeout=60s on-fail=stop (pgsql-demote-interval-0s)

stop interval=0s timeout=60s on-fail=block (pgsql-stop-interval-0s)

notify interval=0s timeout=60s (pgsql-notify-interval-0s)

# pcs cluster auth pi01 pi02 pi03 pi05 -u hacluster -p hacluster

pi01: Authorized

pi02: Authorized

pi03: Authorized

pi05: Authorized

# pcs cluster node add pi05 --start

pi01: Corosync updated

pi02: Corosync updated

pi03: Corosync updated

pi05: Succeeded

pi05: Starting Cluster...

# crm_mon -Afr1

Last updated: Fri Oct 2 16:59:54 2015 Last change: Fri Oct 2 16:59:23 2015 by hacluster via crmd on pi02

Stack: corosync

Current DC: pi02 (version 1.1.13-a14efad) - partition with quorum

4 nodes and 8 resources configured

Online: [ pi01 pi02 pi03 pi05 ]

Full list of resources:

Resource Group: master-group

vip-master (ocf::heartbeat:IPaddr2): Started pi02

vip-rep (ocf::heartbeat:IPaddr2): Started pi02

Master/Slave Set: msPostgresql [pgsql]

Masters: [ pi02 ]

Slaves: [ pi01 pi03 ]

fence-pi01 (stonith:fence_ssh): Started pi02

fence-pi02 (stonith:fence_ssh): Started pi01

fence-pi03 (stonith:fence_ssh): Started pi01

Node Attributes:

* Node pi01:

+ master-pgsql : 100

+ pgsql-data-status : STREAMING|SYNC

+ pgsql-receiver-status : normal

+ pgsql-status : HS:sync

* Node pi02:

+ master-pgsql : 1000

+ pgsql-data-status : LATEST

+ pgsql-master-baseline : 0000000008000098

+ pgsql-receiver-status : ERROR

+ pgsql-status : PRI

* Node pi03:

+ master-pgsql : -INFINITY

+ pgsql-data-status : STREAMING|POTENTIAL

+ pgsql-receiver-status : normal

+ pgsql-status : HS:potential

* Node pi05:

Migration Summary:

* Node pi01:

* Node pi03:

* Node pi02:

* Node pi05:

# pcs resource update msPostgresql pgsql master-max=1 master-node-max=1 clone-max=4 clone-node-max=1 notify=true

# crm_mon -Afr1

Last updated: Fri Oct 2 17:04:36 2015 Last change: Fri Oct 2 17:04:07 2015 by root via

cibadmin on pi01

Stack: corosync

Current DC: pi02 (version 1.1.13-a14efad) - partition with quorum

4 nodes and 9 resources configured

Online: [ pi01 pi02 pi03 pi05 ]

Full list of resources:

Resource Group: master-group

vip-master (ocf::heartbeat:IPaddr2): Started pi02

vip-rep (ocf::heartbeat:IPaddr2): Started pi02

Master/Slave Set: msPostgresql [pgsql]

Masters: [ pi02 ]

Slaves: [ pi01 pi03 ]

Stopped: [ pi05 ]

fence-pi01 (stonith:fence_ssh): Started pi02

fence-pi02 (stonith:fence_ssh): Started pi01

fence-pi03 (stonith:fence_ssh): Started pi01

Node Attributes:

* Node pi01:

+ master-pgsql : 100

+ pgsql-data-status : STREAMING|SYNC

+ pgsql-receiver-status : normal

+ pgsql-status : HS:sync

* Node pi02:

+ master-pgsql : 1000

+ pgsql-data-status : LATEST

+ pgsql-master-baseline : 0000000008000098

+ pgsql-receiver-status : ERROR

+ pgsql-status : PRI

* Node pi03:

+ master-pgsql : -INFINITY

+ pgsql-data-status : STREAMING|POTENTIAL

+ pgsql-receiver-status : normal

+ pgsql-status : HS:potential

* Node pi05:

+ master-pgsql : -INFINITY

+ pgsql-status : STOP

Migration Summary:

* Node pi01:

* Node pi03:

* Node pi02:

* Node pi05:

pgsql: migration-threshold=1 fail-count=1000000 last-failure='Fri Oct 2 17:04:13 2015'

Failed Actions:

* pgsql_start_0 on pi05 'unknown error' (1): call=27, status=complete, exitreason='My data may be

inconsistent. You have to remove /var/lib/pgsql/tmp/PGSQL.lock file to force start.',

last-rc-change='Fri Oct 2 17:04:10 2015', queued=0ms, exec=2553ms

# pcs resource update pgsql pgsql node_list="pi01 pi02 pi03 pi05"

And here we run into trouble: pgsql-status is now STOP!

# crm_mon -Afr1

Last updated: Fri Oct 2 17:07:05 2015 Last change: Fri Oct 2 17:06:37 2015

by root via cibadmin on pi01

Stack: corosync

Current DC: pi02 (version 1.1.13-a14efad) - partition with quorum

4 nodes and 9 resources configured

Online: [ pi01 pi02 pi03 pi05 ]

Full list of resources:

Resource Group: master-group

vip-master (ocf::heartbeat:IPaddr2): Stopped

vip-rep (ocf::heartbeat:IPaddr2): Stopped

Master/Slave Set: msPostgresql [pgsql]

Slaves: [ pi02 ]

Stopped: [ pi01 pi03 pi05 ]

fence-pi01 (stonith:fence_ssh): Started pi02

fence-pi02 (stonith:fence_ssh): Started pi01

fence-pi03 (stonith:fence_ssh): Started pi01

Node Attributes:

* Node pi01:

+ master-pgsql : -INFINITY

+ pgsql-data-status : STREAMING|SYNC

+ pgsql-status : STOP

* Node pi02:

+ master-pgsql : -INFINITY

+ pgsql-data-status : LATEST

+ pgsql-status : STOP

* Node pi03:

+ master-pgsql : -INFINITY

+ pgsql-data-status : STREAMING|POTENTIAL

+ pgsql-status : STOP

* Node pi05:

+ master-pgsql : -INFINITY

+ pgsql-status : STOP

Migration Summary:

* Node pi01:

* Node pi03:

* Node pi02:

* Node pi05:
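
To spot this state quickly, the per-node pgsql-status values can be pulled out of the one-shot crm_mon output. The following is a minimal sketch, assuming the `crm_mon -Afr1` layout shown above ("* Node NAME:" followed by "+ attribute : value" lines). The function name `pgsql_status_by_node` is hypothetical, and the sample input is abridged from the output in this message.

```shell
#!/bin/sh
# Sketch: print "<node> <pgsql-status>" for each node found in crm_mon -Afr1
# output read from stdin. Assumes the "* Node NAME:" / "+ attr : value"
# layout shown in this thread. pgsql_status_by_node is a hypothetical helper.
pgsql_status_by_node() {
    awk '
        /^[ \t]*\* Node / {
            # Remember the current node name, dropping the trailing colon.
            node = $3
            sub(/:$/, "", node)
            next
        }
        /^[ \t]*\+ pgsql-status[ \t]*:/ {
            # The status value is the last field on the line.
            print node, $NF
        }
    '
}

# Sample input abridged from the output above. On a live cluster one would
# pipe the real command instead: crm_mon -Afr1 | pgsql_status_by_node
sample='Node Attributes:
* Node pi01:
+ master-pgsql : -INFINITY
+ pgsql-status : STOP
* Node pi02:
+ master-pgsql : -INFINITY
+ pgsql-status : STOP
* Node pi05:
+ pgsql-status : STOP'

printf '%s\n' "$sample" | pgsql_status_by_node
```

Nodes listed only under "Migration Summary" produce no output, because no pgsql-status line follows them there.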

Do you know of a way to add a new node to the cluster without this disruption? Maybe there is some command or another approach?

-- 
Nikolay Popov
n.popov@postgrespro.ru
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Pacemaker dynamic membership

From

张文升

Date:

October 10, 2015, 08:04:21

Hi,
I am not very experienced with this either, but I can offer some advice. We can discuss it.

First:
pcs resource update postgresql node_list="pi01 pi02 pi03........"

Then on pi05:
1. Stop pacemaker:
/etc/init.d/pacemaker stop

2. Find the "PGSQL.lock" file and remove it, e.g.:

rm -f /tmp/PGSQL.lock

rm -f /var/lib/pgsql/tmp/PGSQL.lock

3. Start corosync and pacemaker.
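
The steps above for pi05 could be sketched as a small script. This is only a sketch under stated assumptions: it prints the commands instead of running them unless DRY_RUN=0 is set, it uses systemctl because the cluster in this thread runs CentOS 7.1 (the init script above would be the equivalent on older systems), and the lock file paths are the two examples given above. The names `run` and `recover_node` are hypothetical.

```shell
#!/bin/sh
# Sketch of the steps above for the new node (pi05 in this thread).
# DRY_RUN defaults to 1, so commands are only printed; set DRY_RUN=0 to run.
DRY_RUN=${DRY_RUN:-1}

# run: hypothetical helper that prints a command, or executes it when DRY_RUN=0.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

recover_node() {
    # 1. Stop pacemaker (systemctl is an assumption for CentOS 7).
    run systemctl stop pacemaker
    # 2. Remove the lock file left behind by the pgsql resource agent.
    for lock in /tmp/PGSQL.lock /var/lib/pgsql/tmp/PGSQL.lock; do
        run rm -f "$lock"
    done
    # 3. Start corosync and pacemaker again.
    run systemctl start corosync
    run systemctl start pacemaker
}

recover_node
```

The resource agent's own error message warns that the data may be inconsistent, so removing the lock is a deliberate override and is worth doing only once the data directory on that node is known to be good.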

On October 7, 2015, 15:44, Nikolay Popov wrote:

pgsql_start_0 on pi05 'unknown error' (1): call=27, status=complete, exitreason='My data may be
inconsistent. You have to remove /var/lib/pgsql/tmp/PGSQL.lock file to force start.',
last-rc-change='Fri Oct 2 17:04:10 2015', queued=0ms, exec=2553ms

-- 
----------------------
张文升 | PostgreSQL DBA 
----------------------
PG development guide http://wiki.corp.qunar.com/pages/viewpage.action?pageId=58058230
PG release process http://wiki.corp.qunar.com/pages/viewpage.action?pageId=56215301
PG on-call list http://wiki.corp.qunar.com/pages/viewpage.action?pageId=50508626
PG machine list http://wiki.corp.qunar.com/pages/viewpage.action?pageId=36438672

Re: Pacemaker dynamic membership

From

Nikolay Popov

Date:

October 12, 2015, 07:56:27

Hi, Wensheng.

Thanks for the advice, but it did not work.

-- 
Nikolay Popov
n.popov@postgrespro.ru
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Pacemaker dynamic membership

From

张文升

Date:

October 12, 2015, 16:47:03

I'm sorry that did not help.
Can you show the following:

1. cat /etc/cluster/cluster.conf
2. ls /etc/corosync/service.d/ (if the directory is not empty, show the config files in it)
3. pcs cluster cib /tmp/cib.cfg && cat /tmp/cib.cfg
4. cat /etc/corosync/corosync.conf
5. tailf -n 100 /var/log/cluster/corosync.log
6. /var/log/messages
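
The plain files in the list above could be gathered into one directory before posting them. A minimal sketch, assuming a POSIX shell: the function name `collect_file` and the default output directory are hypothetical, whole files are copied rather than only the last 100 lines, and missing or unreadable files are reported rather than treated as errors.

```shell
#!/bin/sh
# Sketch: copy the requested files into one directory.
# collect_file and OUT_DIR are hypothetical names used for illustration.
OUT_DIR=${OUT_DIR:-/tmp/cluster-diag}

# collect_file SRC: copy SRC into OUT_DIR, or note that it is missing.
collect_file() {
    src=$1
    # Flatten the path so files from different directories do not collide.
    dest="$OUT_DIR/$(printf '%s' "$src" | sed 's|^/||; s|/|_|g')"
    if [ -f "$src" ] && [ -r "$src" ]; then
        cp "$src" "$dest" && echo "collected $src"
    else
        echo "missing $src"
    fi
}

mkdir -p "$OUT_DIR"
for f in /etc/cluster/cluster.conf \
         /etc/corosync/corosync.conf \
         /var/log/cluster/corosync.log \
         /var/log/messages; do
    collect_file "$f"
done
# The CIB is not a plain file; it would be dumped separately with the
# command from the list above: pcs cluster cib /tmp/cib.cfg
```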

-- 
----------------------
张文升 | PostgreSQL DBA 
----------------------
