Thread: Read Replica termination occurs when its max_active_replication_origins setting is lower than the primary

Read Replica termination occurs when its max_active_replication_origins setting is lower than the primary

From: Nazneen Jafri
Date: September 17, 01:52:23

The parameter max_active_replication_origins should be added to the list of mandatory settings that must match between primary and replica during creation.

In our three-node setup:

The Publisher (Primary) is the source database
The Subscriber receives data via logical replication from the Publisher
The Read Replica is created as a hot standby of the Subscriber node
Issue: When the Read Replica's max_active_replication_origins setting is lower than the Publisher's, replica termination occurs.
The max_active_replication_origins parameter should be added to mandatory configuration checks

Repro

Publisher Node
wal_level logical
max_active_replication_origins 10

[postgres@ip-~]$ psql
psql (19devel)
Type "help" for help.

postgres=# select * from test_table;
 id |        data         |         created_at
----+---------------------+----------------------------
  7 | Initial test data 1 | 2025-09-16 19:51:24.658967
  8 | Initial test data 2 | 2025-09-16 19:51:24.658967
  9 | Initial test data 3 | 2025-09-16 19:51:24.658967
 10 | Initial test data 4 | 2025-09-16 20:32:51.567525
 11 | Initial test data 5 | 2025-09-16 20:32:51.567525
 12 | Initial test data 6 | 2025-09-16 20:32:51.567525
(6 rows)

Subscriber Node

wal_level logical
max_active_replication_origins 10

We have 2 subscriptions

postgres=# select * from pg_stat_subscription;
 subid | subname | worker_type |  pid  | leader_pid | relid | received_lsn |      last_msg_send_time       |     last_msg_receipt_time     | latest_end_lsn |        latest_end_time
-------+---------+-------------+-------+------------+-------+--------------+-------------------------------+-------------------------------+----------------+-------------------------------
 16413 | sub1    | apply       | 67783 |            |       | 0/0485A188   | 2025-09-16 20:37:57.948905+00 | 2025-09-16 20:37:57.949141+00 | 0/0485A188     | 2025-09-16 20:37:57.948905+00
 16414 | sub2    | apply       | 67785 |            |       | 0/0485A188   | 2025-09-16 20:37:57.948896+00 | 2025-09-16 20:37:57.949146+00 | 0/0485A188     | 2025-09-16 20:37:57.948896+00
 16415 | sub3    | apply       | 67788 |            |       | 0/0485A188   | 2025-09-16 20:37:57.948929+00 | 2025-09-16 20:37:57.949178+00 | 0/0485A188     | 2025-09-16 20:37:57.948929+00
(3 rows)

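postgres=# select * from test_table;
 id |        data         |         created_at
----+---------------------+----------------------------
  7 | Initial test data 1 | 2025-09-16 19:51:24.658967
  8 | Initial test data 2 | 2025-09-16 19:51:24.658967
  9 | Initial test data 3 | 2025-09-16 19:51:24.658967
 10 | Initial test data 4 | 2025-09-16 20:32:51.567525
 11 | Initial test data 5 | 2025-09-16 20:32:51.567525
 12 | Initial test data 6 | 2025-09-16 20:32:51.567525
(6 rows)

Hot_standby (Read replica) node
max_active_replication_origins 10

postgres@ip-]$ psql
psql (19devel)
Type "help" for help.

postgres=#
postgres=# select * from test_table;
 id |        data         |         created_at
----+---------------------+----------------------------
  7 | Initial test data 1 | 2025-09-16 19:51:24.658967
  8 | Initial test data 2 | 2025-09-16 19:51:24.658967
  9 | Initial test data 3 | 2025-09-16 19:51:24.658967
 10 | Initial test data 4 | 2025-09-16 20:32:51.567525
 11 | Initial test data 5 | 2025-09-16 20:32:51.567525
 12 | Initial test data 6 | 2025-09-16 20:32:51.567525
(6 rows)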

Now, when we lower max_active_replication_origins to 1 (less than the number of subscriptions) on the replica and attempt a reload, the replica aborts and fails to start:

2025-09-16 20:38:26.457 UTC [59579] PANIC: could not find free replication state, increase "max_active_replication_origins"
2025-09-16 20:38:34.329 UTC [59563] LOG: startup process (PID 59579) was terminated by signal 6: Aborted
2025-09-16 20:38:34.329 UTC [59563] LOG: aborting startup due to startup process failure
2025-09-16 20:38:34.330 UTC [59563] LOG: database system is shut down
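
The failure in the log above can be illustrated with a toy model. This is only a sketch and not PostgreSQL code: `restore_origins` and `OutOfReplicationStates` are hypothetical names, and the model assumes nothing more than that the standby has to fit every replication origin it inherits into a fixed array sized by max_active_replication_origins.

```python
class OutOfReplicationStates(RuntimeError):
    """Toy stand-in for the PANIC seen in the log (hypothetical name)."""


def restore_origins(origin_ids, max_active_replication_origins):
    """Place each origin into a fixed-size state array, failing when it is full."""
    states = [None] * max_active_replication_origins
    used = 0
    for origin_id in origin_ids:
        if used == max_active_replication_origins:
            raise OutOfReplicationStates(
                'could not find free replication state, '
                'increase "max_active_replication_origins"'
            )
        states[used] = origin_id
        used += 1
    return states


# One origin per row shown in pg_stat_subscription (sub1, sub2, sub3).
origins = [1, 2, 3]

# With the original setting of 10, all three origins fit.
print(restore_origins(origins, 10)[:3])

# With the lowered setting of 1, the second origin has nowhere to go.
try:
    restore_origins(origins, 1)
except OutOfReplicationStates as exc:
    print("startup would abort:", exc)
```

In this model the array size is fixed before any origin is loaded, which is why lowering the setting below the number of existing origins fails immediately rather than degrading gradually.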

Re: Read Replica termination occurs when its max_active_replication_origins setting is lower than the primary

From: Masahiko Sawada
Date: September 17, 02:42:39

On Tue, Sep 16, 2025 at 3:52 PM Nazneen Jafri <jafrinazneen@gmail.com> wrote:
>
> The parameter max_active_replication_origins should be added to the list of mandatory settings that must match between primary and replica during creation
>
>  In our three-node setup:
>
> The Publisher (Primary) is the source database
> The Subscriber receives data via logical replication from the Publisher
> The Read Replica is created as a hot standby of the Subscriber node
> Issue: When the Read Replica's max_active_replication_origins setting is lower than the Publisher's, replica termination occurs.
> The max_active_replication_origins parameter should be added to mandatory configuration checks
>
> Repro
>
> Publisher Node
> wal_level logical
> max_active_replication_origins 10
>
> [postgres@ip-~]$ psql
> psql (19devel)
> Type "help" for help.
>
> postgres=# select * from test_table;
>  id |        data         |         created_at
> ----+---------------------+----------------------------
>   7 | Initial test data 1 | 2025-09-16 19:51:24.658967
>   8 | Initial test data 2 | 2025-09-16 19:51:24.658967
>   9 | Initial test data 3 | 2025-09-16 19:51:24.658967
>  10 | Initial test data 4 | 2025-09-16 20:32:51.567525
>  11 | Initial test data 5 | 2025-09-16 20:32:51.567525
>  12 | Initial test data 6 | 2025-09-16 20:32:51.567525
> (6 rows)
>
> Subscriber Node
>
> wal_level logical
> max_active_replication_origins 10
>
>  We have  2 subscriptions
>
> postgres=# select * from pg_stat_subscription;
>  subid | subname | worker_type |  pid  | leader_pid | relid | received_lsn |      last_msg_send_time       |     last_msg_receipt_time     | latest_end_lsn |        latest_end_time
> -------+---------+-------------+-------+------------+-------+--------------+-------------------------------+-------------------------------+----------------+-------------------------------
>  16413 | sub1    | apply       | 67783 |            |       | 0/0485A188   | 2025-09-16 20:37:57.948905+00 | 2025-09-16 20:37:57.949141+00 | 0/0485A188     | 2025-09-16 20:37:57.948905+00
>  16414 | sub2    | apply       | 67785 |            |       | 0/0485A188   | 2025-09-16 20:37:57.948896+00 | 2025-09-16 20:37:57.949146+00 | 0/0485A188     | 2025-09-16 20:37:57.948896+00
>  16415 | sub3    | apply       | 67788 |            |       | 0/0485A188   | 2025-09-16 20:37:57.948929+00 | 2025-09-16 20:37:57.949178+00 | 0/0485A188     | 2025-09-16 20:37:57.948929+00
> (3 rows)
>
> postgres=# select * from test_table;
>  id |        data         |         created_at
> ----+---------------------+----------------------------
>   7 | Initial test data 1 | 2025-09-16 19:51:24.658967
>   8 | Initial test data 2 | 2025-09-16 19:51:24.658967
>   9 | Initial test data 3 | 2025-09-16 19:51:24.658967
>  10 | Initial test data 4 | 2025-09-16 20:32:51.567525
>  11 | Initial test data 5 | 2025-09-16 20:32:51.567525
>  12 | Initial test data 6 | 2025-09-16 20:32:51.567525
> (6 rows)
>
> Hot_standby (Read replica) node
>  max_active_replication_origins 10
>
> postgres@ip-]$ psql
> psql (19devel)
> Type "help" for help.
>
>
> postgres=#
> postgres=# select * from test_table;
>  id |        data         |         created_at
> ----+---------------------+----------------------------
>   7 | Initial test data 1 | 2025-09-16 19:51:24.658967
>   8 | Initial test data 2 | 2025-09-16 19:51:24.658967
>   9 | Initial test data 3 | 2025-09-16 19:51:24.658967
>  10 | Initial test data 4 | 2025-09-16 20:32:51.567525
>  11 | Initial test data 5 | 2025-09-16 20:32:51.567525
>  12 | Initial test data 6 | 2025-09-16 20:32:51.567525
> (6 rows)
>
>
>
> Now, when we lower the setting for max_active_replication_origins to 1 (less than the number of subscriptions) on the replica and attempting a reload, the replica aborts and fails to start
>
> 2025-09-16 20:38:26.457 UTC [59579] PANIC:  could not find free replication state, increase "max_active_replication_origins"
> 2025-09-16 20:38:34.329 UTC [59563] LOG:  startup process (PID 59579) was terminated by signal 6: Aborted
> 2025-09-16 20:38:34.329 UTC [59563] LOG:  aborting startup due to startup process failure
> 2025-09-16 20:38:34.330 UTC [59563] LOG:  database system is shut down
>

Thank you for the report!

As reported, the standby could not continue the recovery (especially
replaying XLOG_REPLORIGIN_ records) if its
max_active_replication_origins is less than the primary's setting. One
idea to fix this issue is to require standbys to have at least the
same max_active_replication_origins value as the primary, as we do for
other GUC parameters such as max_worker_processes and max_wal_senders.
That requires adding max_active_replication_origins to the control file
and bumping PG_CONTROL_VERSION. Given that we've released 18RC1 and
probably are close to 18 release, I'd like to hear opinions whether
such a fix is acceptable or not.
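
The kind of comparison proposed here could look roughly like the following sketch. It is a toy model rather than the server's implementation: `check_required_parameters` is a hypothetical name, the numeric values are made up for illustration, and the dictionaries only stand in for the primary's recorded settings and the standby's configuration.

```python
def check_required_parameters(standby_settings, primary_settings):
    """Return the names of settings where the standby value is below the primary's."""
    return [
        name
        for name, primary_value in primary_settings.items()
        if standby_settings.get(name, 0) < primary_value
    ]


# Illustrative values only; the primary's values are what the standby
# would compare against (in the proposal, read from the control file).
primary = {
    "max_worker_processes": 8,
    "max_wal_senders": 10,
    "max_active_replication_origins": 10,
}

# A standby that matches the primary except for the lowered setting.
standby = dict(primary, max_active_replication_origins=1)

for name in check_required_parameters(standby, primary):
    print(
        f"insufficient setting on standby: {name} = {standby[name]} "
        f"(primary has {primary[name]})"
    )
```

The point of such a check is that the mismatch is reported by name before recovery proceeds, instead of surfacing later as a failure inside the code that needs the resource.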

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: Read Replica termination occurs when its max_active_replication_origins setting is lower than the primary

From: Nathan Bossart
Date: September 17, 05:45:39

On Tue, Sep 16, 2025 at 04:42:39PM -0700, Masahiko Sawada wrote:
> On Tue, Sep 16, 2025 at 3:52 PM Nazneen Jafri <jafrinazneen@gmail.com> wrote:
>> The parameter max_active_replication_origins should be added to the list
>> of mandatory settings that must match between primary and replica during
>> creation
>>
>> [...]
> 
> Thank you for the report!
> 
> As reported, the standby could not continue the recovery (especially
> replaying XLOG_REPLORIGIN_ records) if its
> max_active_replication_origins is less than the primary's setting. One
> idea to fix this issue is to require for standbys to have at least the
> same  max_active_replication_origins value as the primary as we do for
> other GUC parameters such as max_worker_processes and max_wal_senders.
> It needs to add max_active_replication_origins to the control file and
> bumps the PG_CONTROL_VERSION. Given that we've released 18RC1 and
> probably are close to 18 release, I'd like to hear opinions whether
> such a fix is acceptable or not.

I haven't tried reproducing it on older versions (with
max_replication_slots instead of max_active_replication_origins), but after
looking at the code for a bit, I'm growing skeptical that this is new to
v18.  In any case, the PANIC provides a clear error message, which is
roughly the same as what we'd say with the control file approach, right?

-- 
nathan

Re: Read Replica termination occurs when its max_active_replication_origins setting is lower than the primary

From: Masahiko Sawada
Date: September 17, 07:05:33

On Tue, Sep 16, 2025 at 7:45 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Tue, Sep 16, 2025 at 04:42:39PM -0700, Masahiko Sawada wrote:
> > On Tue, Sep 16, 2025 at 3:52 PM Nazneen Jafri <jafrinazneen@gmail.com> wrote:
> >> The parameter max_active_replication_origins should be added to the list
> >> of mandatory settings that must match between primary and replica during
> >> creation
> >>
> >> [...]
> >
> > Thank you for the report!
> >
> > As reported, the standby could not continue the recovery (especially
> > replaying XLOG_REPLORIGIN_ records) if its
> > max_active_replication_origins is less than the primary's setting. One
> > idea to fix this issue is to require for standbys to have at least the
> > same  max_active_replication_origins value as the primary as we do for
> > other GUC parameters such as max_worker_processes and max_wal_senders.
> > It needs to add max_active_replication_origins to the control file and
> > bumps the PG_CONTROL_VERSION. Given that we've released 18RC1 and
> > probably are close to 18 release, I'd like to hear opinions whether
> > such a fix is acceptable or not.
>
> I haven't tried reproducing it on older versions (with
> max_replication_slots instead of max_active_replication_origins), but after
> looking at the code for a bit, I'm growing skeptical that this is new to
> v18.

Right, it's actually not a new behavior to v18 as we can reproduce it
with max_replication_slots. I guess that the reason why we didn't
require standbys to set max_replication_slots no smaller than the
primary's value is that in principle the maximum number of replication
slots is not related to the recovery work. max_replication_slots just
used to be re-used for the maximum number of active replication
origins for the sake of simplicity. Now that we have separated the
maximum number of active replication origins from
max_replication_slots, it seems to me that
max_active_replication_origins is now clearly related to the recovery.

> In any case, the PANIC provides a clear error message, which is
> roughly the same as what we'd say with the control file approach, right?

Yes. With the control file approach, we raise a FATAL (or pause the
recovery with a WARNING) instead of PANIC.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: Read Replica termination occurs when its max_active_replication_origins setting is lower than the primary

From: Nathan Bossart
Date: September 17, 07:23:18

On Tue, Sep 16, 2025 at 09:05:33PM -0700, Masahiko Sawada wrote:
> On Tue, Sep 16, 2025 at 7:45 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> I haven't tried reproducing it on older versions (with
>> max_replication_slots instead of max_active_replication_origins), but after
>> looking at the code for a bit, I'm growing skeptical that this is new to
>> v18.
> 
> Right, it's actually not a new behavior to v18 as we can reproduce it
> with max_replication_slots. I guess that the reason why we didn't
> require standbys to set max_replication_slots no smaller than the
> primary's value is that in principle the maximum number of replication
> slots is not related to the recovery work. max_replication_slots just
> used to be re-used for the maximum number of active replication
> origins for the sake of simplicity. Now that we have separated the
> maximum number of active replication origins from
> max_replication_slots, it seems to me that
> max_active_replication_origins is now clearly related to the recovery.

Given that it's existing behavior, I'm not seeing a strong reason to try to
do anything about this for v18.  But I could be misunderstanding the nuance
here.

>> In any case, the PANIC provides a clear error message, which is
>> roughly the same as what we'd say with the control file approach, right?
> 
> Yes. With the control file approach, we raise a FATAL (or pause the
> recovery with a WARNING) instead of PANIC.

I drafted up what that would look like.  One very small nitpick is that it
messes up the alignment of the pg_controldata output.  Otherwise, it seems
pretty straightforward.

-- 
nathan

Attachments

v1-0001-max_active_replication_origins-fix.patch

Re: Read Replica termination occurs when its max_active_replication_origins setting is lower than the primary

From: Masahiko Sawada
Date: September 17, 07:32:59

On Tue, Sep 16, 2025 at 9:23 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Tue, Sep 16, 2025 at 09:05:33PM -0700, Masahiko Sawada wrote:
> > On Tue, Sep 16, 2025 at 7:45 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> >> I haven't tried reproducing it on older versions (with
> >> max_replication_slots instead of max_active_replication_origins), but after
> >> looking at the code for a bit, I'm growing skeptical that this is new to
> >> v18.
> >
> > Right, it's actually not a new behavior to v18 as we can reproduce it
> > with max_replication_slots. I guess that the reason why we didn't
> > require standbys to set max_replication_slots no smaller than the
> > primary's value is that in principle the maximum number of replication
> > slots is not related to the recovery work. max_replication_slots just
> > used to be re-used for the maximum number of active replication
> > origins for the sake of simplicity. Now that we have separated the
> > maximum number of active replication origins from
> > max_replication_slots, it seems to me that
> > max_active_replication_origins is now clearly related to the recovery.
>
> Given that it's existing behavior, I'm not seeing a strong reason to try to
> do anything about this for v18.  But I could be misunderstanding the nuance
> here.
>
> >> In any case, the PANIC provides a clear error message, which is
> >> roughly the same as what we'd say with the control file approach, right?
> >
> > Yes. With the control file approach, we raise a FATAL (or pause the
> > recovery with a WARNING) instead of PANIC.
>
> I drafted up what that would look like.  One very small nitpick is that it
> messes up the alignment of the pg_controldata output.  Otherwise, it seems
> pretty straightforward.

Thank you for drafting the patch. It looks good to me.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: Read Replica termination occurs when its max_active_replication_origins setting is lower than the primary

From: Masahiko Sawada
Date: September 17, 19:00:36

On Tue, Sep 16, 2025 at 9:23 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Tue, Sep 16, 2025 at 09:05:33PM -0700, Masahiko Sawada wrote:
> > On Tue, Sep 16, 2025 at 7:45 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> >> I haven't tried reproducing it on older versions (with
> >> max_replication_slots instead of max_active_replication_origins), but after
> >> looking at the code for a bit, I'm growing skeptical that this is new to
> >> v18.
> >
> > Right, it's actually not a new behavior to v18 as we can reproduce it
> > with max_replication_slots. I guess that the reason why we didn't
> > require standbys to set max_replication_slots no smaller than the
> > primary's value is that in principle the maximum number of replication
> > slots is not related to the recovery work. max_replication_slots just
> > used to be re-used for the maximum number of active replication
> > origins for the sake of simplicity. Now that we have separated the
> > maximum number of active replication origins from
> > max_replication_slots, it seems to me that
> > max_active_replication_origins is now clearly related to the recovery.
>
> Given that it's existing behavior, I'm not seeing a strong reason to try to
> do anything about this for v18.  But I could be misunderstanding the nuance
> here.
>

After reviewing the issue again, I agree that we don't have a strong
reason to have such a change for v18. While it would probably be safer
to require standbys to set max_active_replication_origins no smaller
than the primary, it's not an item for v18. We can discuss it
separately for v19 or later.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: Read Replica termination occurs when its max_active_replication_origins setting is lower than the primary

From: Euler Taveira
Date: September 18, 01:34:56

On Wed, Sep 17, 2025, at 1:23 AM, Nathan Bossart wrote:
> I drafted up what that would look like.  One very small nitpick is that it
> messes up the alignment of the pg_controldata output.  Otherwise, it seems
> pretty straightforward.
>

As Masahiko said, this behavior is not new in v18, so I wouldn't consider
fixing it in v18.

I don't think your patch works, because of one detail:
StartupReplicationOrigin() is called earlier than
CheckRequiredParameterValues(), and StartupReplicationOrigin() is what
checks max_active_replication_origins and emits the PANIC message. I
didn't check the implications of postponing StartupReplicationOrigin().
Maybe an ugly solution is to check max_active_replication_origins inside,
or even before calling, StartupReplicationOrigin().
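
The ordering concern can be shown with a small sketch. Only the call order comes from the message above; everything else is a toy model with hypothetical names (`Panic`, `Fatal`, `run_startup`, and the two stand-in functions), and the counts reuse the figures from the original report.

```python
from functools import partial


class Panic(Exception):
    """Toy stand-in for a PANIC-level failure."""


class Fatal(Exception):
    """Toy stand-in for a FATAL-level failure."""


def startup_replication_origin(origin_count, max_origins):
    # Hypothetical model of the early step: abort if the origins do not fit.
    if origin_count > max_origins:
        raise Panic("PANIC: could not find free replication state")


def check_required_parameter_values(standby_max, primary_max):
    # Hypothetical model of the later comparison against the primary's value.
    if standby_max < primary_max:
        raise Fatal("FATAL: max_active_replication_origins is lower than on the primary")


def run_startup(steps):
    """Run the steps in order and return the first failure message, or 'started'."""
    for step in steps:
        try:
            step()
        except (Panic, Fatal) as exc:
            return str(exc)
    return "started"


load = partial(startup_replication_origin, origin_count=3, max_origins=1)
check = partial(check_required_parameter_values, standby_max=1, primary_max=10)

print(run_startup([load, check]))  # order described in the message above
print(run_startup([check, load]))  # comparison moved ahead of the load step
```

In the first ordering the comparison never runs, because the load step fails first. That is why adding the comparison alone would not change what the user sees unless the order of the steps (or the place where the value is checked) also changes.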

I'm attaching a script that I used to simulate the issue. Change the GUC name
to simulate the issue on a pre-v18 release.

-- 
Euler Taveira
EDB   https://www.enterprisedb.com/

Attachments

test-maro.sh
