Re: repmgr cannot bring up the standby database after switchover manually

From: Pavan Kumar
Subject: Re: repmgr cannot bring up the standby database after switchover manually
Msg-id: CA+M0sHE6-bvgr=pMHtWHvLm711OxCeno9pAC1CchZ+=MSehWrw@mail.gmail.com
In reply to: Re: repmgr cannot bring up the standby database after switchover manually (Tayyab Fayyaz <tayyab.humayl@gmail.com>)
List: pgsql-admin
Hello Tayyab Fayyaz 

> As I understand, automatically re-adding the old primary as a standby is not an out-of-the-box feature and needs to be handled manually. Is that correct?
Yes, that is correct. By default, after a failover repmgr does not take the failed primary, clean it up, rewind it, and reattach it as a standby.

On all nodes (primary & standbys):
===================================
wal_level = replica
max_wal_senders = 10 (depends on the number of nodes in the cluster)
max_replication_slots = 10 (depends on the number of nodes in the cluster)
hot_standby = on (standbys)
wal_keep_size = 512MB (or sized for your network/WAL shipping risk)
archive_mode = on (recommended)
archive_command = 'test ! -f /pgarchive/%f && cp %p /pgarchive/%f' (example; adapt)
hot_standby_feedback = on (optional; helps reduce vacuum conflicts)
shared_preload_libraries = 'repmgr' (required if you run repmgrd; the repmgr command-line tool itself does not need it)
wal_log_hints = on (required by pg_rewind unless the cluster was initialized with data checksums)
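To compare these values with an existing server, a minimal Python sketch could check the output of SELECT name, setting FROM pg_settings, collected into a dict by whatever client you use. The function name check_settings and the threshold of 2 senders/slots are illustrative assumptions, not repmgr requirements.

```python
"""Sketch: compare collected pg_settings values with the recommendations above."""


def check_settings(settings: dict[str, str], uses_repmgrd: bool = True) -> list[str]:
    """Return human-readable problems; an empty list means nothing was flagged.

    `settings` maps parameter names to the text values of pg_settings.setting
    (e.g. "on", "replica", "10"). data_checksums is read as well, because
    pg_rewind accepts either wal_log_hints=on or data checksums.
    """
    problems = []
    if settings.get("wal_level") not in ("replica", "logical"):
        problems.append("wal_level must be 'replica' (or 'logical')")
    for name in ("max_wal_senders", "max_replication_slots"):
        # Illustrative threshold only; size these to the number of nodes.
        if int(settings.get(name, "0")) < 2:
            problems.append(f"{name} looks too low for a replication cluster")
    if settings.get("wal_log_hints") != "on" and settings.get("data_checksums") != "on":
        problems.append("pg_rewind needs wal_log_hints=on or data checksums")
    if uses_repmgrd and "repmgr" not in settings.get("shared_preload_libraries", ""):
        problems.append("repmgrd needs shared_preload_libraries to include 'repmgr'")
    return problems


if __name__ == "__main__":
    # Hypothetical values, as they might come back from pg_settings.
    current = {
        "wal_level": "replica",
        "max_wal_senders": "10",
        "max_replication_slots": "10",
        "wal_log_hints": "off",
        "data_checksums": "off",
        "shared_preload_libraries": "repmgr",
    }
    for problem in check_settings(current):
        print("WARNING:", problem)
```

Remember that wal_log_hints and shared_preload_libraries only take effect after a restart.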

repmgr configuration file
==========================
primary node

node_id=1                 # unique per node
node_name='node_a'
conninfo='host=node_a dbname=repmgr user=repmgr port=5432'
data_directory='/pgdata/15'
use_replication_slots=yes           # if you want slots managed
failover=automatic                  # if using repmgrd for auto-failover
promote_command='repmgr standby promote -f /etc/repmgr.conf --log-to-file'   # you can use a shell script instead
follow_command='repmgr standby follow -f /etc/repmgr.conf --log-to-file --upstream-node-id=%n'
log_file='/var/log/repmgr/repmgr.log'

standby node

node_id=2                 # unique per node
node_name='node_b'
conninfo='host=node_b dbname=repmgr user=repmgr port=5432'
data_directory='/pgdata/15'
use_replication_slots=yes           # if you want slots managed
failover=automatic                  # if using repmgrd for auto-failover
promote_command='repmgr standby promote -f /etc/repmgr.conf --log-to-file'   # you can use a shell script instead
follow_command='repmgr standby follow -f /etc/repmgr.conf --log-to-file --upstream-node-id=%n'
log_file='/var/log/repmgr/repmgr.log'
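Because every node has its own repmgr.conf, a common mistake is copying the file and forgetting to change node_id or node_name. A minimal sketch of a cross-node check follows; it assumes the simple key=value style shown above (optionally single-quoted values, # comments) and is not a full implementation of repmgr's own configuration parser. The names parse_repmgr_conf and validate_cluster are hypothetical.

```python
"""Sketch: sanity-check repmgr.conf files across the nodes of a cluster."""
import shlex

# Parameters every repmgr.conf needs.
REQUIRED_KEYS = ("node_id", "node_name", "conninfo", "data_directory")


def parse_repmgr_conf(text: str) -> dict[str, str]:
    """Parse key=value lines, dropping '#' comments and surrounding quotes."""
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # shlex removes a trailing '# ...' comment outside quotes and
        # strips the single quotes around values such as conninfo.
        conf[key.strip()] = " ".join(shlex.split(value, comments=True))
    return conf


def validate_cluster(confs: list[dict[str, str]]) -> list[str]:
    """Flag missing required keys and node_id/node_name values reused across nodes."""
    problems = []
    seen_ids, seen_names = set(), set()
    for conf in confs:
        label = conf.get("node_name", "<unnamed>")
        for key in REQUIRED_KEYS:
            if key not in conf:
                problems.append(f"{label}: missing {key}")
        node_id, name = conf.get("node_id"), conf.get("node_name")
        if node_id is not None:
            if node_id in seen_ids:
                problems.append(f"{label}: duplicate node_id {node_id}")
            seen_ids.add(node_id)
        if name is not None:
            if name in seen_names:
                problems.append(f"{label}: duplicate node_name")
            seen_names.add(name)
    return problems
```

Usage: read each node's file, pass the parsed dicts to validate_cluster, and treat any returned line as something to fix before registering the node.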


during switchover
====================
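1. Make sure the repmgrd daemons are running on all nodes and are not paused:
   repmgr -f /etc/repmgr.conf daemon status
2. Make sure there is no replication lag.
3. Run CHECKPOINT on the primary.
4. Run the switchover command on the standby you want to promote (repmgr -f /etc/repmgr.conf standby switchover); add --dry-run first to check the prerequisites.

This promotes the standby to primary and demotes the old primary to a standby.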
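The checklist above can be sketched as an ordered list of commands. This assumes it runs on the standby being promoted, that psql can reach the primary as a role allowed to run CHECKPOINT, and that replication lag was measured beforehand; the function names are hypothetical. Note that daemon status still has to be read by a person (or parsed) to spot a paused daemon, since this sketch only relies on exit codes.

```python
"""Sketch: the switchover checklist as ordered commands (hypothetical helpers)."""
import subprocess


def switchover_commands(conf: str, primary_host: str) -> list[list[str]]:
    """Commands in the order described above, ending with the real switchover."""
    return [
        ["repmgr", "-f", conf, "daemon", "status"],
        ["psql", "-h", primary_host, "-d", "repmgr", "-c", "CHECKPOINT"],
        ["repmgr", "-f", conf, "standby", "switchover", "--dry-run"],
        ["repmgr", "-f", conf, "standby", "switchover"],
    ]


def max_lag_ok(lag_bytes: list[int], limit: int = 0) -> bool:
    """True if every standby's replay lag (in bytes) is within `limit`.

    The lag could be read on the primary with, for example:
    SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) FROM pg_stat_replication;
    """
    return all(lag <= limit for lag in lag_bytes)


def run_switchover(conf: str, primary_host: str, lag_bytes: list[int]) -> None:
    if not max_lag_ok(lag_bytes):
        raise RuntimeError("standby is lagging; not switching over")
    for cmd in switchover_commands(conf, primary_host):
        # check=True stops at the first command that exits non-zero,
        # so a failed dry run prevents the real switchover.
        subprocess.run(cmd, check=True)
```

Running the dry run immediately before the real command keeps the two in sync: if the dry run reports a problem, nothing has been changed yet.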

during failover
================
The standby is promoted to primary. If you have other standbys, they will follow the new primary once follow_command has been executed on them (repmgrd does this automatically when failover=automatic).

To bring the old primary back as a standby, run the node rejoin command on the old primary while its PostgreSQL instance is stopped, with -d pointing at the new primary. Example syntax:
repmgr -f /etc/repmgr.conf node rejoin -d "host=node_b dbname=repmgr user=repmgr port=5432" --force-rewind
(you can run it with --dry-run first)
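Running the dry run first and only then the real rejoin is easy to script. A minimal sketch, assuming it runs on the old primary with PostgreSQL stopped; rejoin_command and rejoin are hypothetical names.

```python
"""Sketch: rejoin the old primary, running --dry-run first (hypothetical helpers)."""
import subprocess


def rejoin_command(conf: str, new_primary_conninfo: str, dry_run: bool) -> list[str]:
    cmd = ["repmgr", "-f", conf, "node", "rejoin",
           "-d", new_primary_conninfo, "--force-rewind"]
    if dry_run:
        cmd.append("--dry-run")
    return cmd


def rejoin(conf: str, new_primary_conninfo: str) -> int:
    """Return the exit status of the real rejoin, or of the dry run if it failed."""
    dry = subprocess.run(rejoin_command(conf, new_primary_conninfo, dry_run=True))
    if dry.returncode != 0:
        return dry.returncode
    real = subprocess.run(rejoin_command(conf, new_primary_conninfo, dry_run=False))
    return real.returncode
```

Passing the conninfo string as a single argument list element avoids the shell-quoting problems that the spaces inside it would otherwise cause.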
--force-rewind will fail in the following cases:
================================================
Prerequisites missing: you didn't enable wal_log_hints=on or initialize the cluster with data checksums.
Corrupted data directory: if the server crashed hard and the data directory is damaged, pg_rewind can't make sense of it.
Missing control files: critical control files (such as global/pg_control) are missing or inconsistent.
Required WAL not available: pg_rewind reads the old primary's WAL back to the last checkpoint before the timelines diverged, and after the rewind the node needs WAL from the new primary (or the archive) to catch up. If those segments were already removed (due to a low wal_keep_size, no archive, or aggressive retention), the rewind or the subsequent recovery cannot proceed.

Note that divergent transactions on their own are not a failure case: pg_rewind compares the timelines of the new and old primary, finds the point where they diverged, and discards everything the old primary wrote after that point. For example, if the old primary accepted transactions after a network partition and a standby was then promoted, those transactions are lost after the rewind; they are not merged into the new primary.
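To check the "Required WAL not available" case before attempting a rewind, you can work out which WAL file contains a given LSN, such as the divergence point (pg_rewind's output includes a line like "servers diverged at WAL location ... on timeline ...", and the new primary's timeline history file, e.g. 00000002.history, records the switch point). The sketch below applies PostgreSQL's WAL file naming scheme, assuming the default 16MB wal_segment_size. Finding this one segment is necessary but not sufficient, since pg_rewind starts reading from the last checkpoint before that point.

```python
"""Sketch: name of the WAL segment file holding an LSN, and whether it is present."""


def parse_lsn(lsn: str) -> int:
    """Convert an LSN such as '0/3000060' to a 64-bit integer."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def walfile_name(timeline: int, lsn: str, segment_size: int = 16 * 1024 * 1024) -> str:
    """24-hex-digit file name (timeline, log id, segment) containing `lsn`."""
    segno = parse_lsn(lsn) // segment_size
    segments_per_xlogid = 0x100000000 // segment_size
    return "%08X%08X%08X" % (timeline, segno // segments_per_xlogid,
                             segno % segments_per_xlogid)


def wal_available(names_present: set[str], timeline: int, lsn: str) -> bool:
    """True if the segment holding `lsn` is among the given file names,
    e.g. a listing of pg_wal on the old primary or of the archive directory."""
    return walfile_name(timeline, lsn) in names_present
```

For example, LSN 0/3000060 on timeline 1 lives in 000000010000000000000003.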


On Fri, Oct 3, 2025 at 8:51 AM Tayyab Fayyaz <tayyab.humayl@gmail.com> wrote:
Hello Pavan,

Please share the required PostgreSQL parameters; I will compare them with my existing configuration.

As I understand, automatically re-adding the old primary as a standby is not an out-of-the-box feature and needs to be handled manually. Is that correct?

Tayyab


On Fri, 3 Oct 2025, 6:20 pm Pavan Kumar, <pavan.dba27@gmail.com> wrote:
Hello Chris,

I hope you have configured the required PostgreSQL parameters. I have noticed the same issue when the primary is idle (no activity).
Before the switchover, run a CHECKPOINT on the primary, then run the switchover command.
Review the output of repmgr -f repmgr.conf cluster event; it shows what happened during the switchover.

Note: make sure the repmgrd daemons are running and not paused before the switchover.
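The events shown by cluster event are stored in repmgr's repmgr.events table, so they can also be filtered programmatically. A minimal sketch, assuming rows have been fetched with a query such as SELECT node_id, event, successful, details FROM repmgr.events ORDER BY event_timestamp (column names as in repmgr 5.x); the Event type and function name are hypothetical.

```python
"""Sketch: pick out unsuccessful switchover-related events from repmgr.events rows."""
from typing import NamedTuple


class Event(NamedTuple):
    node_id: int
    event: str
    successful: bool
    details: str


# Event types involved in promoting a standby and re-attaching the old primary.
SWITCHOVER_EVENTS = {"standby_switchover", "standby_promote", "node_rejoin", "standby_follow"}


def failed_switchover_events(events: list[Event]) -> list[Event]:
    """Keep only switchover-related events that were recorded as unsuccessful."""
    return [e for e in events if e.event in SWITCHOVER_EVENTS and not e.successful]
```

The details column of a failed event usually points at the step that went wrong, which narrows down which node's logs to read.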




On Wed, Oct 1, 2025 at 3:03 PM Fernando Hevia <fhevia@gmail.com> wrote:
 
> In my recent experience, there was no issue starting the old primary; it came up normally. However, it resulted in a split-brain situation where the old primary continued to accept both read and write operations while still assuming the other two nodes were replicas.

Hi Tayyab,

A split-brain is definitely unexpected behavior. After issuing a failover or switchover command, always check the exit code to make sure it succeeded. If it did not, the command output or the PostgreSQL logs should indicate what went wrong.
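Checking the exit code is simple to build into any wrapper script. A minimal sketch (run_checked is a hypothetical name) that returns stdout on success and raises with both output streams on failure:

```python
"""Sketch: run a command such as a repmgr switchover and fail loudly on a non-zero exit."""
import subprocess


def run_checked(cmd: list[str]) -> str:
    """Run `cmd`, return its stdout, and raise with stdout/stderr on failure."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"{' '.join(cmd)} exited with {result.returncode}\n"
            f"stdout:\n{result.stdout}\nstderr:\n{result.stderr}"
        )
    return result.stdout
```

Usage: run_checked(["repmgr", "-f", "/etc/repmgr.conf", "standby", "switchover"]). If log_file is set in repmgr.conf, the detailed messages go to that file rather than to the captured output.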

It seems that either the previous primary could not be shut down, or repmgr somehow failed to convert it to a standby. repmgr sets the node's role by creating the standby.signal file in the data directory. On startup, if PostgreSQL finds the signal file, it assumes the standby role (provided its replication configuration is also correct). I can only theorize here, but perhaps repmgr failed to write the signal file in $PGDATA due to a lack of permissions or a network failure.
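Both theories can be checked quickly: whether standby.signal exists in the old primary's data directory, and how many nodes report pg_is_in_recovery() = false. A minimal sketch, assuming the recovery flags have already been collected per node; the function names are hypothetical.

```python
"""Sketch: two checks for the split-brain scenario (hypothetical helpers)."""
from pathlib import Path


def has_standby_signal(pgdata: str) -> bool:
    """True if standby.signal exists in the data directory (PostgreSQL 12+)."""
    return (Path(pgdata) / "standby.signal").exists()


def writable_primaries(in_recovery: dict[str, bool]) -> list[str]:
    """Nodes whose pg_is_in_recovery() returned false, i.e. accepting writes.

    More than one entry means a split-brain.
    """
    return sorted(node for node, recovering in in_recovery.items() if not recovering)
```

If the old primary shows up in writable_primaries alongside the new one, stop it before doing anything else, then rejoin it.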

The exact output would help in figuring out what went wrong.

Regards,
Fernando





On Wed, Oct 1, 2025 at 4:04 PM, Tayyab Fayyaz (tayyab.humayl@gmail.com) wrote:
Hello Fernando,

In my recent experience, there was no issue starting the old primary—it came up normally. However, it resulted in a split-brain situation where the old primary continued to accept both read and write operations while still assuming the other two nodes were replicas.

This issue occurred with the following environment:

  • OS version: RHEL 8.10

  • Postgres DB version: 14.9

  • repmgr version: 5.5.0

Tayyab

On Wed, Oct 1, 2025 at 11:52 AM Fernando Hevia <fhevia@gmail.com> wrote:

> I have 2 PostgreSQL servers. One is the primary and the other is the standby. I am trying to set up repmgr to do the switchover manually. Passwordless SSH has been set up for the postgres user on both servers.
>
> I use this command: "repmgr standby switchover --log-level=DEBUG --verbose". The standby database is promoted to primary successfully. The previous primary database was shut down, but repmgr was not able to bring it up as a standby.

In a switchover, the primary server is shut down and restarted as a standby after the newly promoted primary (the former standby) has been started.
If the old primary did not start, something must have gone wrong, since that is not the standard behavior of a switchover command.

Have you checked the Postgres log file for the previous primary? You should find the startup failure cause in the log.
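The startup failure is usually logged at FATAL or PANIC severity. A minimal sketch that pulls those lines out of a log file's text; for the PGDG packages on RHEL/CentOS the server log typically lives under $PGDATA/log, but check log_directory on your system. The function name is hypothetical.

```python
"""Sketch: extract FATAL/PANIC lines from PostgreSQL log text."""
import re

# PostgreSQL prefixes messages with their severity, e.g. "FATAL:  could not ...".
SEVERE = re.compile(r"\b(FATAL|PANIC):")


def startup_errors(log_text: str) -> list[str]:
    """Return the log lines reporting FATAL or PANIC errors, in order."""
    return [line for line in log_text.splitlines() if SEVERE.search(line)]
```

Usage: startup_errors(Path(logfile).read_text()) on the most recent log of the old primary, where logfile is the path to that log.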

Regards,
Fernando

 

On Wed, Oct 1, 2025 at 7:30 AM, Chris Lee (clee.hk@gmail.com) wrote:
Hi Tayyab,

Thanks for the information. I also want to find out whether that is the default behavior, or whether I have not configured repmgr correctly.

Regards,
Chris

On Wed, 1 Oct 2025, 18:12 Imran Khan, <imran.k.23@gmail.com> wrote:
Hi Tayyab,

Is this the default behavior? We have a 4-node cluster but have never had an issue with switchovers.

Thanks, 
Imran

On Wed, Oct 1, 2025, 1:10 PM Tayyab Fayyaz <tayyab.humayl@gmail.com> wrote:
Hello Chris,

I faced this issue: the old primary is not added back as a standby automatically; you have to add it manually.

But I wrote a script that adds the old primary back as a standby once it is online again.
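Tayyab's script is not shown in the thread, so the following is only a hypothetical sketch of the decision such a script has to make when the old primary comes back. repmgr node rejoin expects the node's PostgreSQL to be stopped (it starts it itself), and an old primary that is running and writable is exactly the split-brain case described elsewhere in this thread.

```python
"""Sketch: decide what to do with a returning old primary (hypothetical helper)."""
from typing import Optional


def next_action(pg_running: bool, in_recovery: Optional[bool]) -> str:
    """Return the next step for the old primary.

    pg_running:  whether PostgreSQL is up on the old primary.
    in_recovery: result of pg_is_in_recovery() if it is running, else None.
    """
    if not pg_running:
        # Stopped node: safe to run repmgr node rejoin (with --force-rewind).
        return "rejoin"
    if in_recovery:
        return "nothing: already running as a standby"
    # Running and writable while another node is primary: split-brain risk.
    return "stop, then rejoin"
```

The "stop" step could use the same service_stop_command configured in repmgr.conf (for example sudo systemctl stop postgresql-15).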

Tayyab


On Wed, 1 Oct 2025, 3:02 pm Chris Lee, <clee.hk@gmail.com> wrote:
Hi all,

I have 2 PostgreSQL servers. One is the primary and the other is the standby. I am trying to set up repmgr to do the switchover manually. Passwordless SSH has been set up for the postgres user on both servers.

I use this command: "repmgr standby switchover --log-level=DEBUG --verbose". The standby database is promoted to primary successfully. The previous primary database was shut down, but repmgr was not able to bring it up as a standby.

Has anyone encountered this issue before? Thanks a lot for any suggestions.

Here are my OS and DB versions:

OS version: CentOS Stream release 8
Postgres DB version: 15.12
repmgr version: 5.5.0

Here are the repmgr configuration files:
>>>>>
node_id=1  # Use 2 on standby
node_name='primary'
conninfo='host=centos804 user=repmgr dbname=repmgr password=xxx connect_timeout=15'
use_primary_conninfo_password=true
data_directory='/var/lib/pgsql/15/data'  # Adjust for your setup
pg_bindir='/usr/pgsql-15/bin'
service_start_command = 'sudo systemctl start postgresql-15'
service_stop_command  = 'sudo systemctl stop postgresql-15'
<<<<<

>>>>>
node_id=2  # Use 2 on standby
node_name='standby'
conninfo='host=centos803 user=repmgr dbname=repmgr password=xxx connect_timeout=15'
use_primary_conninfo_password=true
data_directory='/var/lib/pgsql/15/data'  # Adjust for your setup
pg_bindir='/usr/pgsql-15/bin'
service_start_command = 'sudo systemctl start postgresql-15'
service_stop_command  = 'sudo systemctl stop postgresql-15'
<<<<<

Regards,
Chris


--
Regards,

#!  Pavan Kumar
----------------------------------------------
-
Sr. Database Administrator..!

NEXT GENERATION PROFESSIONALS, LLC
Cell    #  267-799-3182 #  pavan.dba27 (Gtalk)  
India   # 9000459083

Take Risks; if you win, you will be very happy. If you lose you will be Wise  

