Discussion: Read Replica termination occurs when its max_active_replication_origins setting is lower than the primary's
The parameter max_active_replication_origins should be added to the list of mandatory settings that must match between primary and replica during creation
In our three-node setup:
The Publisher (Primary) is the source database
The Subscriber receives data via logical replication from the Publisher
The Read Replica is created as a hot standby of the Subscriber node
Issue: When the Read Replica's max_active_replication_origins setting is lower than the Publisher's, replica termination occurs.
The max_active_replication_origins parameter should be added to mandatory configuration checks
Repro
Publisher Node
wal_level logical
max_active_replication_origins 10
[postgres@ip-~]$ psql
psql (19devel)
Type "help" for help.
postgres=# select * from test_table;
 id |        data         |         created_at
----+---------------------+----------------------------
  7 | Initial test data 1 | 2025-09-16 19:51:24.658967
  8 | Initial test data 2 | 2025-09-16 19:51:24.658967
  9 | Initial test data 3 | 2025-09-16 19:51:24.658967
 10 | Initial test data 4 | 2025-09-16 20:32:51.567525
 11 | Initial test data 5 | 2025-09-16 20:32:51.567525
 12 | Initial test data 6 | 2025-09-16 20:32:51.567525
(6 rows)
Subscriber Node
wal_level logical
max_active_replication_origins 10
We have 3 subscriptions
postgres=# select * from pg_stat_subscription;
 subid | subname | worker_type |  pid  | leader_pid | relid | received_lsn |      last_msg_send_time       |     last_msg_receipt_time     | latest_end_lsn |        latest_end_time
-------+---------+-------------+-------+------------+-------+--------------+-------------------------------+-------------------------------+----------------+-------------------------------
 16413 | sub1    | apply       | 67783 |            |       | 0/0485A188   | 2025-09-16 20:37:57.948905+00 | 2025-09-16 20:37:57.949141+00 | 0/0485A188     | 2025-09-16 20:37:57.948905+00
 16414 | sub2    | apply       | 67785 |            |       | 0/0485A188   | 2025-09-16 20:37:57.948896+00 | 2025-09-16 20:37:57.949146+00 | 0/0485A188     | 2025-09-16 20:37:57.948896+00
 16415 | sub3    | apply       | 67788 |            |       | 0/0485A188   | 2025-09-16 20:37:57.948929+00 | 2025-09-16 20:37:57.949178+00 | 0/0485A188     | 2025-09-16 20:37:57.948929+00
(3 rows)
postgres=# select * from test_table;
 id |        data         |         created_at
----+---------------------+----------------------------
  7 | Initial test data 1 | 2025-09-16 19:51:24.658967
  8 | Initial test data 2 | 2025-09-16 19:51:24.658967
  9 | Initial test data 3 | 2025-09-16 19:51:24.658967
 10 | Initial test data 4 | 2025-09-16 20:32:51.567525
 11 | Initial test data 5 | 2025-09-16 20:32:51.567525
 12 | Initial test data 6 | 2025-09-16 20:32:51.567525
(6 rows)
Hot_standby (Read replica) node
max_active_replication_origins 10
postgres@ip-]$ psql
psql (19devel)
Type "help" for help.
postgres=#
postgres=# select * from test_table;
 id |        data         |         created_at
----+---------------------+----------------------------
  7 | Initial test data 1 | 2025-09-16 19:51:24.658967
  8 | Initial test data 2 | 2025-09-16 19:51:24.658967
  9 | Initial test data 3 | 2025-09-16 19:51:24.658967
 10 | Initial test data 4 | 2025-09-16 20:32:51.567525
 11 | Initial test data 5 | 2025-09-16 20:32:51.567525
 12 | Initial test data 6 | 2025-09-16 20:32:51.567525
(6 rows)
Now, when we lower max_active_replication_origins to 1 (less than the number of subscriptions) on the replica and attempt a reload, the replica aborts and fails to start:
2025-09-16 20:38:26.457 UTC [59579] PANIC: could not find free replication state, increase "max_active_replication_origins"
2025-09-16 20:38:34.329 UTC [59563] LOG: startup process (PID 59579) was terminated by signal 6: Aborted
2025-09-16 20:38:34.329 UTC [59563] LOG: aborting startup due to startup process failure
2025-09-16 20:38:34.330 UTC [59563] LOG: database system is shut down
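The failure in the log above can be pictured with a small Python model. This is only a sketch under stated assumptions, not PostgreSQL source: it assumes that startup restores each checkpointed replication origin into a fixed-size array whose length is max_active_replication_origins, and gives up when the array is full. The names restore_origins and OriginSlotsExhausted are hypothetical.

```python
class OriginSlotsExhausted(RuntimeError):
    """Hypothetical stand-in for the PANIC seen in the log."""


def restore_origins(checkpointed_origins, max_active_replication_origins):
    """Place each checkpointed origin into a fixed-size slot array.

    Simplified model (an assumption): one slot per origin, no reuse.
    """
    slots = [None] * max_active_replication_origins
    for index, origin in enumerate(checkpointed_origins):
        if index >= len(slots):
            # No free slot is left for this origin.
            raise OriginSlotsExhausted(
                "could not find free replication state, "
                'increase "max_active_replication_origins"'
            )
        slots[index] = origin
    return slots


# Three subscriptions on the subscriber mean three origins to restore.
origins = ["origin for sub1", "origin for sub2", "origin for sub3"]

restore_origins(origins, 10)       # enough slots: succeeds
try:
    restore_origins(origins, 1)    # one slot for three origins: fails
except OriginSlotsExhausted as error:
    print("startup aborted:", error)
```

In this model the limit only matters relative to the number of origins that have to be restored, which matches the report: the replica ran with 10 and failed once the value dropped to 1.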
On Tue, Sep 16, 2025 at 3:52 PM Nazneen Jafri <jafrinazneen@gmail.com> wrote:
>
> The parameter max_active_replication_origins should be added to the list of mandatory settings that must match between primary and replica during creation
>
> [...]

Thank you for the report!

As reported, the standby could not continue the recovery (especially replaying XLOG_REPLORIGIN_ records) if its max_active_replication_origins is less than the primary's setting. One idea to fix this issue is to require for standbys to have at least the same max_active_replication_origins value as the primary as we do for other GUC parameters such as max_worker_processes and max_wal_senders. It needs to add max_active_replication_origins to the control file and bumps the PG_CONTROL_VERSION. Given that we've released 18RC1 and probably are close to 18 release, I'd like to hear opinions whether such a fix is acceptable or not.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
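The idea of requiring the standby's value to be at least the primary's can be sketched as a simple comparison. The Python below is an illustrative model only, not the server's implementation: find_low_settings and REQUIRED_ON_STANDBY are hypothetical names, and every value except the two max_active_replication_origins settings from the report is made up for the example.

```python
# Settings named in the thread; the last one is the proposed addition.
REQUIRED_ON_STANDBY = [
    "max_worker_processes",
    "max_wal_senders",
    "max_active_replication_origins",
]


def find_low_settings(primary, standby, names=REQUIRED_ON_STANDBY):
    """Return one message per setting where the standby is below the primary."""
    messages = []
    for name in names:
        if standby[name] < primary[name]:
            messages.append(
                f"{name} = {standby[name]} is lower than on the primary, "
                f"where its value was {primary[name]}"
            )
    return messages


# max_active_replication_origins values come from the report;
# the other values are illustrative only.
primary = {
    "max_worker_processes": 8,
    "max_wal_senders": 10,
    "max_active_replication_origins": 10,
}
standby = dict(primary, max_active_replication_origins=1)

for message in find_low_settings(primary, standby):
    print(message)
```

A check of this shape reports the mismatch by name before recovery proceeds, instead of leaving it to surface later as a failure inside the replay code.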
On Tue, Sep 16, 2025 at 04:42:39PM -0700, Masahiko Sawada wrote:
> On Tue, Sep 16, 2025 at 3:52 PM Nazneen Jafri <jafrinazneen@gmail.com> wrote:
>> The parameter max_active_replication_origins should be added to the list
>> of mandatory settings that must match between primary and replica during
>> creation
>>
>> [...]
>
> Thank you for the report!
>
> As reported, the standby could not continue the recovery (especially
> replaying XLOG_REPLORIGIN_ records) if its
> max_active_replication_origins is less than the primary's setting. One
> idea to fix this issue is to require for standbys to have at least the
> same max_active_replication_origins value as the primary as we do for
> other GUC parameters such as max_worker_processes and max_wal_senders.
> It needs to add max_active_replication_origins to the control file and
> bumps the PG_CONTROL_VERSION. Given that we've released 18RC1 and
> probably are close to 18 release, I'd like to hear opinions whether
> such a fix is acceptable or not.

I haven't tried reproducing it on older versions (with max_replication_slots instead of max_active_replication_origins), but after looking at the code for a bit, I'm growing skeptical that this is new to v18.

In any case, the PANIC provides a clear error message, which is roughly the same as what we'd say with the control file approach, right?

--
nathan
On Tue, Sep 16, 2025 at 7:45 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Tue, Sep 16, 2025 at 04:42:39PM -0700, Masahiko Sawada wrote:
> > On Tue, Sep 16, 2025 at 3:52 PM Nazneen Jafri <jafrinazneen@gmail.com> wrote:
> >> The parameter max_active_replication_origins should be added to the list
> >> of mandatory settings that must match between primary and replica during
> >> creation
> >>
> >> [...]
> >
> > Thank you for the report!
> >
> > As reported, the standby could not continue the recovery (especially
> > replaying XLOG_REPLORIGIN_ records) if its
> > max_active_replication_origins is less than the primary's setting. One
> > idea to fix this issue is to require for standbys to have at least the
> > same max_active_replication_origins value as the primary as we do for
> > other GUC parameters such as max_worker_processes and max_wal_senders.
> > It needs to add max_active_replication_origins to the control file and
> > bumps the PG_CONTROL_VERSION. Given that we've released 18RC1 and
> > probably are close to 18 release, I'd like to hear opinions whether
> > such a fix is acceptable or not.
>
> I haven't tried reproducing it on older versions (with
> max_replication_slots instead of max_active_replication_origins), but after
> looking at the code for a bit, I'm growing skeptical that this is new to
> v18.

Right, it's actually not a new behavior to v18 as we can reproduce it with max_replication_slots. I guess that the reason why we didn't require standbys to set max_replication_slots no smaller than the primary's value is that in principle the maximum number of replication slots is not related to the recovery work. max_replication_slots just used to be re-used for the maximum number of active replication origins for the sake of simplicity. Now that we have separated the maximum number of active replication origins from max_replication_slots, it seems to me that max_active_replication_origins is now clearly related to the recovery.

> In any case, the PANIC provides a clear error message, which is
> roughly the same as what we'd say with the control file approach, right?

Yes. With the control file approach, we raise a FATAL (or pause the recovery with a WARNING) instead of PANIC.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Tue, Sep 16, 2025 at 09:05:33PM -0700, Masahiko Sawada wrote:
> On Tue, Sep 16, 2025 at 7:45 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> I haven't tried reproducing it on older versions (with
>> max_replication_slots instead of max_active_replication_origins), but after
>> looking at the code for a bit, I'm growing skeptical that this is new to
>> v18.
>
> Right, it's actually not a new behavior to v18 as we can reproduce it
> with max_replication_slots. I guess that the reason why we didn't
> require standbys to set max_replication_slots no smaller than the
> primary's value is that in principle the maximum number of replication
> slots is not related to the recovery work. max_replication_slots just
> used to be re-used for the maximum number of active replication
> origins for the sake of simplicity. Now that we have separated the
> maximum number of active replication origins from
> max_replication_slots, it seems to me that
> max_active_replication_origins is now clearly related to the recovery.

Given that it's existing behavior, I'm not seeing a strong reason to try to do anything about this for v18. But I could be misunderstanding the nuance here.

>> In any case, the PANIC provides a clear error message, which is
>> roughly the same as what we'd say with the control file approach, right?
>
> Yes. With the control file approach, we raise a FATAL (or pause the
> recovery with a WARNING) instead of PANIC.

I drafted up what that would look like. One very small nitpick is that it messes up the alignment of the pg_controldata output. Otherwise, it seems pretty straightforward.

--
nathan
On Tue, Sep 16, 2025 at 9:23 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Tue, Sep 16, 2025 at 09:05:33PM -0700, Masahiko Sawada wrote:
> > On Tue, Sep 16, 2025 at 7:45 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> >> I haven't tried reproducing it on older versions (with
> >> max_replication_slots instead of max_active_replication_origins), but after
> >> looking at the code for a bit, I'm growing skeptical that this is new to
> >> v18.
> >
> > Right, it's actually not a new behavior to v18 as we can reproduce it
> > with max_replication_slots. I guess that the reason why we didn't
> > require standbys to set max_replication_slots no smaller than the
> > primary's value is that in principle the maximum number of replication
> > slots is not related to the recovery work. max_replication_slots just
> > used to be re-used for the maximum number of active replication
> > origins for the sake of simplicity. Now that we have separated the
> > maximum number of active replication origins from
> > max_replication_slots, it seems to me that
> > max_active_replication_origins is now clearly related to the recovery.
>
> Given that it's existing behavior, I'm not seeing a strong reason to try to
> do anything about this for v18. But I could be misunderstanding the nuance
> here.
>
> >> In any case, the PANIC provides a clear error message, which is
> >> roughly the same as what we'd say with the control file approach, right?
> >
> > Yes. With the control file approach, we raise a FATAL (or pause the
> > recovery with a WARNING) instead of PANIC.
>
> I drafted up what that would look like. One very small nitpick is that it
> messes up the alignment of the pg_controldata output. Otherwise, it seems
> pretty straightforward.

Thank you for drafting the patch. It looks good to me.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Tue, Sep 16, 2025 at 9:23 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Tue, Sep 16, 2025 at 09:05:33PM -0700, Masahiko Sawada wrote:
> > On Tue, Sep 16, 2025 at 7:45 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> >> I haven't tried reproducing it on older versions (with
> >> max_replication_slots instead of max_active_replication_origins), but after
> >> looking at the code for a bit, I'm growing skeptical that this is new to
> >> v18.
> >
> > Right, it's actually not a new behavior to v18 as we can reproduce it
> > with max_replication_slots. I guess that the reason why we didn't
> > require standbys to set max_replication_slots no smaller than the
> > primary's value is that in principle the maximum number of replication
> > slots is not related to the recovery work. max_replication_slots just
> > used to be re-used for the maximum number of active replication
> > origins for the sake of simplicity. Now that we have separated the
> > maximum number of active replication origins from
> > max_replication_slots, it seems to me that
> > max_active_replication_origins is now clearly related to the recovery.
>
> Given that it's existing behavior, I'm not seeing a strong reason to try to
> do anything about this for v18. But I could be misunderstanding the nuance
> here.
>

After reviewing the issue again, I agree that we don't have a strong reason to have such a change for v18. While it would probably be safer to require standbys to set max_active_replication_origins no smaller than the primary, it's not an item for v18. We can discuss it separately for v19 or later.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Wed, Sep 17, 2025, at 1:23 AM, Nathan Bossart wrote:
> I drafted up what that would look like. One very small nitpick is that it
> messes up the alignment of the pg_controldata output. Otherwise, it seems
> pretty straightforward.
>

As Masahiko said, this behavior is not new in v18, so I wouldn't consider fixing it in v18.

I don't think your patch works, because of one detail: StartupReplicationOrigin() is called earlier than CheckRequiredParameterValues(). StartupReplicationOrigin() uses max_active_replication_origins to emit the PANIC message. I didn't check the implications of postponing StartupReplicationOrigin(). Maybe an ugly solution is to check max_active_replication_origins inside, or even before calling, StartupReplicationOrigin().

I'm attaching a script that I used to simulate the issue. Change the GUC name to simulate the issue on a pre-v18 release.

--
Euler Taveira
EDB https://www.enterprisedb.com/
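The ordering concern can be made concrete with a toy model. The Python below is a sketch under assumptions, not PostgreSQL code: the two step names are borrowed from the message above purely as labels, the bodies are simplified stand-ins, and first_failure, make_steps and StartupFailure are hypothetical helpers. It shows that whichever step runs first is the one whose error the user sees.

```python
class StartupFailure(RuntimeError):
    """Hypothetical stand-in for a PANIC or FATAL raised during startup."""


def first_failure(steps):
    """Run (name, callable) steps in order; return the name of the first to fail."""
    for name, step in steps:
        try:
            step()
        except StartupFailure:
            return name
    return None


def make_steps(origins_to_restore, standby_limit, primary_limit):
    def startup_replication_origin():
        # Modeled on the PANIC in the report: too few slots for the origins.
        if origins_to_restore > standby_limit:
            raise StartupFailure("PANIC: could not find free replication state")

    def check_required_parameter_values():
        # Modeled on the proposed check; the message text is a placeholder.
        if standby_limit < primary_limit:
            raise StartupFailure("FATAL: placeholder message")

    return [
        ("StartupReplicationOrigin", startup_replication_origin),
        ("CheckRequiredParameterValues", check_required_parameter_values),
    ]


steps = make_steps(origins_to_restore=3, standby_limit=1, primary_limit=10)
print(first_failure(steps))                  # order described in the message
print(first_failure(list(reversed(steps))))  # parameter check moved first
```

With the order described in the message, the model stops in the origin-restore step, so a parameter check placed after it never gets the chance to report anything; running the check first, or inside the restore step, changes which error is seen.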