High CPU consumption in cascade replication with large number of walsenders
From
Alexey Makhmutov
Date:
Hello hackers,

This is a continuation of the thread https://www.postgresql.org/message-id/flat/076eb7bd-52e6-4a51-ba00-c744d027b15c%40postgrespro.ru, with focus only on the patch related to improving performance in the case of a large number of cascaded walsenders.

We’ve faced an interesting situation on a standby environment with configured cascade replication and a large number (~100) of configured walsenders. We’ve noticed very high CPU consumption in such an environment, with the most time-consuming operation being signal delivery from the startup recovery process to walsenders via WalSndWakeup invocations from ApplyWalRecord in xlogrecovery.c.

The standby's startup process notifies walsenders for downstream systems using ConditionVariableBroadcast (CV), so only processes waiting on this CV need to be contacted. However, in case of high load we seem to be hitting a bottleneck here anyway. The current implementation tries to send a notification after processing each WAL record (i.e. during each invocation of ApplyWalRecord), so this implies a high rate of WalSndWakeup invocations. At the same time, this also provides each walsender with a very small chunk of data to process, so almost every process will be present in the CV wait list for the next iteration. As a result, the wait list will always be fully packed in such a case, which additionally reduces performance of WAL record processing by the standby instance.

To reproduce such behavior we could use a simple environment with three servers: a primary instance, an attached physical standby, and its downstream server with a large number of logical replication subscriptions. Attached is a synthetic test case (test_scenario.zip) to reproduce this behavior: the ‘test_prepare.sh’ script could be used to create the required environment with test data, and the ‘test_execute.sh’ script executes the ‘pgbench’ tool with simple updates against the primary instance to trigger replication to the other servers. With just about 6 clients I could observe high CPU consumption by the 'startup recovering' process (and it may be sufficient to completely saturate the CPU on a smaller machine). Please check the environment properties at the top of these scripts before running them, as they need to be updated to specify the location of the installed PG build, the target location for database instance creation, and the ports to use.

After thinking about possible ways to improve such a case, we've decided to implement batching for notification delivery. We try to slightly postpone sending the notification until recovery has applied some number of messages. This reduces the rate of CV notifications and also gives receivers more data to process, so they may not need to enter the CV wait state so often. Counting applied records is not difficult, but the tricky part here is to ensure that we do not postpone notifications for too long in case of low load. To reduce such delay we use a timer handler, which sets a timeout flag that is checked in ProcessStartupProcInterrupts. This allows us to send the signal on timeout if the startup process is waiting for the arrival of new WAL records (in ReadRecord). WalSndWakeup will be invoked either after applying a certain number of messages or after expiration of a timeout since the last notification. The notification, however, may be delayed while a record is being applied (during redo handler invocation from ApplyWalRecord). This could increase the delay in some corner cases with non-trivial WAL records like ‘drop database’, but this should be a rare case and the walsender process has its own limit on the wait time, so the delay won’t be indefinite even in this case.

The patch introduces two GUCs to control the batching behavior. The first one controls the size of batched messages ('cascade_replication_batch_size') and is set to 0 by default, so the functionality is effectively disabled. The second one controls the timed delay during batching ('cascade_replication_batch_delay'), which is by default set to 500ms. The delay is used only if batching is enabled.

With this patch applied we’ve noticed a significant reduction in CPU consumption while using the synthetic test program mentioned above. It would be great to hear any thoughts on these observations and fixing approaches, as well as possible pitfalls of the proposed changes.

Thanks,
Alexey
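P.S. To illustrate the shape of the batching logic, here is a rough sketch (the names appliedRecords, notifyTimeoutPending and the two functions are illustrative only, not the actual patch code):

/* Sketch only: batch walsender wakeups instead of signaling per record. */
static int appliedRecords = 0;                 /* records since last wakeup */
static volatile sig_atomic_t notifyTimeoutPending = false;

/* Timer handler: request a flush even if the batch is not yet full */
static void
CascadeReplicationTimeoutHandler(void)
{
    notifyTimeoutPending = true;
}

/* Called from ApplyWalRecord after each record is applied; the timeout
 * flag is also checked from ProcessStartupProcInterrupts while the
 * startup process waits for new WAL in ReadRecord. */
static void
CascadeReplicationRecordApplied(void)
{
    appliedRecords++;
    if (appliedRecords >= cascadeReplicationMaxBatchSize ||
        notifyTimeoutPending)
    {
        WalSndWakeup(false, true);             /* wake logical walsenders */
        appliedRecords = 0;
        notifyTimeoutPending = false;
    }
}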
Attachments
Re: High CPU consumption in cascade replication with large number of walsenders
From
Alexander Korotkov
Date:
Hi, Alexey!
Thank you for spotting this problem, and thank you for working on it.
On Sun, Aug 31, 2025 at 2:47 AM Alexey Makhmutov <a.makhmutov@postgrespro.ru> wrote:
> This is a continuation of the thread
> https://www.postgresql.org/message-id/flat/076eb7bd-52e6-4a51-ba00-c744d027b15c%40postgrespro.ru,
> with focus only on the patch related to improving performance in case of
> large number of cascaded walsenders.
>
> We’ve faced an interesting situation on a standby environment with
> configured cascade replication and large number (~100) of configured
> walsenders. We’ve noticed very high CPU consumption in such an
> environment, with the most time-consuming operation being signal delivery
> from startup recovery process to walsenders via WalSndWakeup invocations
> from ApplyWalRecord in xlogrecovery.c.
>
> The startup standby process notifies walsenders for downstream systems
> using ConditionVariableBroadcast (CV), so only processes waiting on this
> CV need to be contacted. However, in case of high load we seem to be
> hitting a bottleneck here anyway. The current implementation tries to
> send notification after processing of each WAL record (i.e. during each
> invocation of ApplyWalRecord), so this implies high rate of WalSndWakeup
> invocations. At the same time, this also provides each walsender with
> very small chunk of data to process, so almost every process will be
> present in the CV wait list for the next iteration. As a result, the
> wait list will always be fully packed in such a case, which additionally
> reduces performance of WAL record processing by the standby instance.
>
> To reproduce such behavior we could use a simple environment with three
> servers: primary instance, attached physical standby and its downstream
> server with large number of logical replication subscriptions. Attached
> is the synthetic test case (test_scenario.zip) to reproduce this
> behavior: script ‘test_prepare.sh’ could be used to create required
> environment with test data and ‘test_execute.sh’ script executes
> ‘pgbench’ tool with simple updates against primary instance to trigger
> replication to other servers. With just about 6 clients I could observe
> high CPU consumption by the 'startup recovering process' (and it may be
> sufficient to completely saturate the CPU on a smaller machine). Please
> check the environment properties at the top of these scripts before
> running them, as they need to be updated in order to specify location
> for installed PG build, target location for database instances creation
> and used ports.
>
> After thinking about possible ways to improve such case, we've decided
> to implement batching for notification delivery. We try to slightly
> postpone sending notification until recovery has applied some number of
> messages. This reduces the rate of CV notifications and also gives receivers
> more data to process, so they may not need to enter the CV wait state so
> often. Counting applied records is not difficult, but the tricky part
> here is to ensure that we do not postpone notifications for too long in
> case of low load. To reduce such delay we use a timer handler, which
> sets a timeout flag, which is checked in ProcessStartupProcInterrupts.
> This allows us to send the signal on timeout if the startup process is
> waiting for the arrival of new WAL records (in ReadRecord). The
> WalSndWakeup will be invoked either after applying certain number of
> messages or after expiration of timeout since last notification. The
> notification however may be delayed while record is being applied
> (during redo handler invocation from ApplyWalRecord). This could
> increase delay for some corner cases with non-trivial WAL records like
> ‘drop database’, but this should be a rare case and the walsender process
> has its own limit on the wait time, so the delay won’t be indefinite
> even in this case.
This approach makes sense to me. Do you think it might have corner cases? I suggest the test scenario might include some delay between "UPDATE" queries. Then we can see how changing this delay interacts with cascade_replication_batch_delay.
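For example, the update script could add a short pgbench sleep between statements, something like this (table and column names taken from the attached test scenario; the id range is a placeholder):

\set id random(1, 100000)
UPDATE test_repli_test_t1 SET c0 = c0 + 1 WHERE id = :id;
\sleep 100 ms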
/*
* If time line has switched, then we do not want to delay the
* notification, otherwise we will wait until we apply specified
* number of records before notifying downstream logical
* walsenders.
*/
This comment talks about logical walsenders, but the same will apply to physical walsenders, right?
> The patch introduces two GUCs to control the batching behavior. The
> first one controls size of batched messages
> ('cascade_replication_batch_size') and is set to 0 by default, so the
> functionality is effectively disabled. The second one controls timed
> delay during batching ('cascade_replication_batch_delay'), which is by
> default set to 500ms. The delay is used only if batching is enabled.
I see these two GUCs are both PGC_POSTMASTER. Could they be PGC_SIGHUP? Also, I think there is a typo in the description of cascade_replication_batch_size; it must say "0 disables".
I also think these GUCs should be in the sample file, possibly disabled by default, because it only makes sense to set them up with a high number of cascaded walsenders.
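For example, a PGC_SIGHUP entry in guc_tables.c could look roughly like this (only a sketch using the names from this thread, not the actual patch code):

/* Sketch only: a possible config_int entry for the batch-size GUC. */
{
    {"cascade_replication_batch_size", PGC_SIGHUP, REPLICATION_STANDBY,
        gettext_noop("Sets the number of applied WAL records after which "
                     "cascading walsenders are notified."),
        gettext_noop("0 disables batching.")
    },
    &cascadeReplicationMaxBatchSize,
    0, 0, INT_MAX,
    NULL, assign_cascade_replication_batch_values, NULL
},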
> With this patch applied we’ve noticed a significant reduction in CPU
> consumption while using the synthetic test program mentioned above. It
> would be great to hear any thoughts on these observations and fixing
> approaches, as well as possible pitfalls of proposed changes.
Great!
------
Regards,
Alexander Korotkov
Supabase
Re: High CPU consumption in cascade replication with large number of walsenders
From
Alexey Makhmutov
Date:
Hi, Alexander!
Thank you very much for looking at the patch and providing valuable
feedback!
> This approach makes sense to me. Do you think it might have corner
> cases? I suggest the test scenario might include some delay between
> "UPDATE" queries. Then we can see how changing this delay interacts
> with cascade_replication_batch_delay.
The effect of the 'cascade_replication_batch_delay' setting could be more
easily observed by manually changing a single row in the primary
database ('A' instance in the test) and then observing the delay before
the change becomes visible on the 'C' instance. Something like the following:
On C instance:
select c0 from test_repli_test_t1 where id=0 \watch 1
On A instance, first set the initial value:
update test_repli_test_t1 set c0=0 where id=0;
... and then update the row and wait for it to become visible on the C instance:
update test_repli_test_t1 set c0=c0+1 where id=0;
In my tests with batching enabled and the delay limit disabled (i.e.
with 'cascade_replication_batch_delay' set to 0), the change became
visible in about 5-6 seconds (as the walsender on the B instance seems
to wake up by itself anyway). With 'cascade_replication_batch_delay'
set to 500 (ms) the value became visible almost immediately.
> This comment talks about logical walsenders, but the same will apply
> to physical walsenders, right?
Yes, this item probably needs some clarification. In this code path we
are dealing with logical walsenders, as physical walsenders are notified
in XLogWalRcvFlush. However, when TLI changes, this code will notify
both physical and logical walsenders. So, I've changed the comment now
to describe this behavior more clearly.
Another question is whether we really need to notify physical walsenders
at this point. This was the logic of the original code, so I kept it
when adding batching support. However, it seems that a physical walsender
should not be very interested in knowing that logical decoding has
discovered a change in the timeline ID, as it should either have already
been notified by the walreceiver or discover it by itself in the stored
WAL data if recovery was invoked at startup. So, maybe the better approach
here is just to keep notifications for logical walsenders only.
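To illustrate, the wakeup logic in ApplyWalRecord could then look roughly like this (a sketch using names from this patch; WalSndWakeup(bool physical, bool logical) is the existing function):

/* Sketch: flush the batch and wake physical walsenders too on a
 * timeline switch; otherwise wake logical walsenders only once the
 * batch is full. */
if (StandbyWithCascadeReplication())
{
    if (switchedTLI)
    {
        WalSndWakeup(true, true);       /* physical + logical */
        appliedRecords = 0;
    }
    else if (++appliedRecords >= cascadeReplicationMaxBatchSize)
    {
        WalSndWakeup(false, true);      /* logical walsenders only */
        appliedRecords = 0;
    }
}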
> I see these two GUCs are both PGC_POSTMASTER. Could they be PGC_SIGHUP?
This is a good suggestion. I've tried to implement support for the
PGC_SIGHUP context in the new patch version. Now the current batch
should be flushed immediately when the parameters are changed, and the
new values will then be used for processing once the next WAL record is
applied. This also makes testing a little simpler: if we run the test
script for a longer interval (i.e. 300 seconds instead of 60), it's
possible to see how the CPU load changes on the fly as batching is
enabled or disabled.
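A minimal sketch of how the flush could be requested from the assign hook (simplified; the actual patch code may differ):

/* Sketch only: request an immediate flush of the current batch when the
 * GUCs change; the flag is consumed in ProcessStartupProcInterrupts. */
static void
assign_cascade_replication_batch_values(int newval, void *extra)
{
    replicationNotificationPending = true;
}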
> Also I think there is a typo in the description of
> cascade_replication_batch_size, it must say "0 disables".
Sure, thanks for catching this!
> I also think these GUCs should be in the sample file, possibly
> disabled by default, because it only makes sense to set them up with a
> high number of cascaded walsenders.
Yes, my intention was to have 'cascade_replication_batch_size' disabled
by default, as described in the mail message, but I forgot to actually
set it to '0' in the previous patch version. Thank you for noticing
this! The 'cascade_replication_batch_delay' setting only takes effect
if batching is enabled (i.e. the batch size is set to a value greater
than 1), so a value of 500 (ms) seems to be a reasonable default setting.
I've also added both values to the sample configuration in the new patch
version, as suggested.
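For illustration, the sample entries could look something like this (my paraphrase, reflecting the defaults as of this patch version, not the exact text):

#cascade_replication_batch_size = 0         # applied WAL records per batch before
                                            # cascading walsenders are notified;
                                            # 0 disables batching
#cascade_replication_batch_delay = 500ms    # maximum delay before a batch is flushed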
The new patch version, with the changes described above and rebased on
top of current master, is attached.
Thank you again for looking into this proposal!
Thanks,
Alexey
Attachments
Re: High CPU consumption in cascade replication with large number of walsenders
From
Alexander Korotkov
Date:
Hi, Alexey!
Thank you for your comments and the patch revision. I have some further questions for you.
> Thank you very much for looking at the patch and providing valuable
> feedback!
>
> > This approach makes sense to me. Do you think it might have corner
> > cases? I suggest the test scenario might include some delay between
> > "UPDATE" queries. Then we can see how changing this delay interacts
> > with cascade_replication_batch_delay.
>
> The effect of the 'cascade_replication_batch_delay' setting could be more
> easily observed by manually changing a single row in the primary
> database ('A' instance in the test) and then observing the delay before
> the change becomes visible on the 'C' instance. Something like the following:
> On C instance:
> select c0 from test_repli_test_t1 where id=0 \watch 1
> On A instance, first set the initial value:
> update test_repli_test_t1 set c0=0 where id=0;
> ... and then update the row and wait for it to become visible on the C instance:
> update test_repli_test_t1 set c0=c0+1 where id=0;
>
> In my tests with batching enabled and the delay limit disabled (i.e.
> with 'cascade_replication_batch_delay' set to 0), the change became
> visible in about 5-6 seconds (as the walsender on the B instance seems
> to wake up by itself anyway). With 'cascade_replication_batch_delay'
> set to 500 (ms) the value became visible almost immediately.
>
> > This comment talks about logical walsenders, but the same will apply
> > to physical walsenders, right?
>
> Yes, this item probably needs some clarification. In this code path we
> are dealing with logical walsenders, as physical walsenders are notified
> in XLogWalRcvFlush. However, when TLI changes, this code will notify
> both physical and logical walsenders. So, I've changed the comment now
> to describe this behavior more clearly.
>
> Another question is whether we really need to notify physical walsenders
> at this point. This was the logic of the original code, so I kept it
> when adding batching support. However, it seems that a physical walsender
> should not be very interested in knowing that logical decoding has
> discovered a change in the timeline ID, as it should either have already
> been notified by the walreceiver or discover it by itself in the stored
> WAL data if recovery was invoked at startup. So, maybe the better approach
> here is just to keep notifications for logical walsenders only.
Could you please also comment on the change from the AllowCascadeReplication() check to StandbyWithCascadeReplication()? Do you think this is beneficial and saves us from sending notifications when they are useless?
Also, could you comment on this condition:
if (cascadeReplicationMaxBatchSize <= 1 && appliedRecords == 0)
Does this mean that if batching is disabled in the config and then enforced by SIGHUP, we will still wait for the current batch to complete? Would it be better to stop batching immediately?
Also, this patch lacks documentation. I would especially like to see the combinations of GUCs described (cascade_replication_batch_size enabled but cascade_replication_batch_delay disabled, and vice versa).
------
Regards,
Alexander Korotkov
Supabase
Re: High CPU consumption in cascade replication with large number of walsenders
From
Alexey Makhmutov
Date:
Hi, Alexander!

Thank you again for the attention to this patch!

> Could you please also comment on the change from the
> AllowCascadeReplication() check to StandbyWithCascadeReplication()? Do
> you think this is beneficial and saves us from sending notifications
> when they are useless?

The original intention of this check was to avoid enabling all the
timer-based machinery in cases where we definitely don't need to send
notifications. So, we try to avoid any potential impact of the patch on
such cases even if the new options are enabled. For example, this may
happen if we restore a server from a backup and apply a WAL archive on
it (as PerformWalRecovery will be invoked in this case as well). Both
the 'hot_standby' and 'max_wal_senders' parameters are enabled by
default, so even a primary server may pass the 'AllowCascadeReplication'
condition in such a case. So, we want to be sure that 'StandbyMode' is
actually set and that we are part of the Startup process. However, the
change may also be reasonable by itself, to avoid calling WalSndWakeup
at all in such cases (thus avoiding acquiring/releasing the CV mutex),
although I do not think that it will provide measurable improvements.

> Also, could you comment on this condition:
> if (cascadeReplicationMaxBatchSize <= 1 && appliedRecords == 0)
> Does this mean that if batching is disabled in the config and then
> enforced by SIGHUP, we will still wait for the current batch to
> complete? Would it be better to stop batching immediately?

Sure. If either 'cascade_replication_batch_size' or
'cascade_replication_batch_delay' is changed, then we need to flush the
current batch (send the notification) and then decide whether we need
to perform batching for subsequent records. This is why we set the
'replicationNotificationPending' flag in
'assign_cascade_replication_batch_values' (and disable the timer if it
is set), so it can be processed in the final part of
'ProcessStartupProcInterrupts'. So, technically we should never find
ourselves in a situation where 'cascadeReplicationMaxBatchSize' is set
to 1 while 'appliedRecords' has a non-zero value. This check of the
'appliedRecords' value seems to be a remnant of an intermediate patch
version of mine, in which these values were already runtime-modifiable
but not yet processed by 'assign_cascade_replication_batch_values'. I
think it could be safely removed to avoid confusion. Thank you for
noticing this!

> Also, this patch lacks documentation. I would especially like to see
> the combinations of GUCs described (cascade_replication_batch_size
> enabled but cascade_replication_batch_delay disabled, and vice versa).

I've added documentation for these two parameters in the new version of
the patch (config.sgml). The new patch version also changes the minimal
value of 'cascade_replication_batch_size': the minimal value is now 1,
indicating disabled batching. In the previous version both '0' and '1'
were valid options to disable batching, but this looked ambiguous in
the documentation, so I've decided to leave only '1' as the valid value
to make it simpler to describe. I've also rebased the patch on top of
current master to fix failures during the build checks.

Thanks,
Alexey
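P.S. For completeness, a condensed sketch of the stricter check discussed above (AllowCascadeReplication() and StandbyMode are existing symbols; the actual patch code may differ):

/* Sketch only: restrict wakeups to the startup process of a standby
 * running in standby mode, so that e.g. archive recovery on a restored
 * primary does not enable the batching machinery. */
static inline bool
StandbyWithCascadeReplication(void)
{
    return AmStartupProcess() && StandbyMode && AllowCascadeReplication();
}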