Discussion: Temporarily disabling a replica in a Patroni cluster
Dear Colleagues,

Do you perchance know the correct procedure for temporarily taking down a replica in a Patroni cluster, e.g. for 5-10 minutes of hardware maintenance?

The problem is that after stopping the patroni process (service) on a replica, Patroni removes the corresponding physical replication slot from the leader, and unless the wal_keep_size value is insanely high, the replica, once it is up again, cannot resume streaming because the WAL segments are already gone from the leader.

Well, you all know:

<%%%>LOG: started streaming WAL from primary at B4A0/E2000000 on timeline 8
<%%%>FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000080000B4A0000000E2 has already been removed
<%%%>LOG: waiting for WAL to become available at B4A0/E2002000

Do you think there is a way to tell Patroni that a replica is down temporarily and its replication slot should not be removed?

Or, what am I missing?

-- 
Victor Sudakov VAS4-RIPE http://vas.tomsk.ru/ 2:5005/49@fidonet
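For readers decoding the log above: the segment name in the FATAL line follows from the LSN and the timeline in the first LOG line. The sketch below shows that arithmetic. It assumes the default 16 MB WAL segment size, and the function name `wal_segment_name` is made up for illustration and is not part of PostgreSQL or Patroni.

```python
# Assumption: the cluster uses the default 16 MB WAL segment size.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def wal_segment_name(lsn: str, timeline: int,
                     segment_size: int = WAL_SEGMENT_SIZE) -> str:
    """Return the name of the WAL file that contains the given LSN.

    `lsn` is written the way PostgreSQL logs it: 'HIGH/LOW' in hexadecimal.
    """
    high, low = lsn.split("/")
    position = (int(high, 16) << 32) | int(low, 16)

    # Segments are numbered from the start of the WAL stream.
    segment_number = position // segment_size
    # Number of segments per 4 GB "xlog id" (256 with 16 MB segments).
    segments_per_xlog_id = 0x100000000 // segment_size

    return "%08X%08X%08X" % (
        timeline,
        segment_number // segments_per_xlog_id,
        segment_number % segments_per_xlog_id,
    )


print(wal_segment_name("B4A0/E2000000", 8))  # 000000080000B4A0000000E2
```

Both LSNs in the log (B4A0/E2000000 and B4A0/E2002000) fall into this same segment, which is the file the leader had already recycled.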
Hello Victor,

On 25.08.2023 at 13:18, Victor Sudakov wrote:
> Do you perchance know what is the correct procedure of temporarily
> taking down a replica in a Patroni cluster, e.g. for 5-10 minutes of
> hardware maintenance?
>
> [...]
>
> Do you think there is a way to tell Patroni that a replica is down
> temporarily and its replication slot should not be removed?
>
> Or, what am I missing?

You may use patronictl pause + resume.

Keep in mind to set wal_keep_size (or wal_keep_segments, depending on your PG version) high enough.

Regards,
Georg
Georg H. wrote:
> > [...]
> >
> > Do you think there is a way to tell Patroni that a replica is down
> > temporarily and its replication slot should not be removed?
> >
> > Or, what am I missing?
>
> you may use patronictl pause + resume

I would like to do the maintenance on one node only and keep the rest of the cluster functioning normally.

> keep in mind to set wal_keep_size (or wal_keep_segments, depending on
> your PG version) high enough

I wrote above: "unless the wal_keep_size value is insanely high" :-)

Keeping wal_keep_size very high is a waste of disk space and still provides no real guarantee, unfortunately. Why does Patroni use slots at all then?

-- 
Victor Sudakov VAS4-RIPE http://vas.tomsk.ru/ 2:5005/49@fidonet
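To put a number on the disk-space concern, here is a rough sizing sketch. The WAL generation rate, the safety factor, and the function name `wal_keep_for_outage` are illustrative assumptions and not measurements from this cluster. It also assumes 16 MB segments, which is what relates wal_keep_size (in MB) to the older wal_keep_segments (a segment count).

```python
import math

# Assumption: default 16 MB WAL segments.
WAL_SEGMENT_MB = 16


def wal_keep_for_outage(wal_rate_mb_per_s: float, outage_minutes: float,
                        safety_factor: float = 2.0) -> tuple[int, int]:
    """Estimate the WAL to retain so a replica can catch up after an outage.

    Returns (wal_keep_size in MB, equivalent wal_keep_segments).
    The estimate only holds while the real WAL rate stays below the
    assumed one, which is why it is not a guarantee.
    """
    needed_mb = wal_rate_mb_per_s * outage_minutes * 60 * safety_factor
    segments = math.ceil(needed_mb / WAL_SEGMENT_MB)
    return segments * WAL_SEGMENT_MB, segments


# Hypothetical workload: 5 MB/s of WAL, 10 minutes of maintenance.
size_mb, segments = wal_keep_for_outage(wal_rate_mb_per_s=5.0, outage_minutes=10)
print(f"wal_keep_size = {size_mb}MB (wal_keep_segments = {segments})")
```

With these made-up figures the leader would have to keep about 6 GB of WAL around permanently, just to cover an occasional ten-minute outage, and a burst of writes above the assumed rate would still break the replica.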
Victor Sudakov wrote:
> [...]
>
> Do you think there is a way to tell Patroni that a replica is down
> temporarily and its replication slot should not be removed?
>
> Or, what am I missing?

As WAL archiving (wal-g) is enabled in this cluster anyway, do you think adding "postgresql.parameters.restore_command" to the Patroni config will help in this situation?

restore_command works very well in regular Postgres clusters when a replica has to catch up after a long replication delay, and it allows running with wal_keep_size=0. However, does anyone know of any Patroni-specific reasons not to use restore_command under Patroni?

-- 
Victor Sudakov VAS4-RIPE http://vas.tomsk.ru/ 2:5005/49@fidonet