RE: Synchronizing slots from primary to standby

From: Zhijie Hou (Fujitsu)
Subject: RE: Synchronizing slots from primary to standby
Msg-id: OS0PR01MB571671A91E602BEFD3083EEE942A2@OS0PR01MB5716.jpnprd01.prod.outlook.com
In reply to: Re: Synchronizing slots from primary to standby  (shveta malik <shveta.malik@gmail.com>)
Responses: RE: Synchronizing slots from primary to standby  ("Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com>)
List: pgsql-hackers
On Friday, March 8, 2024 1:09 PM shveta malik <shveta.malik@gmail.com> wrote:
> On Fri, Mar 8, 2024 at 9:56 AM Ajin Cherian <itsajin@gmail.com> wrote:
> >
> >> Pushed with minor modifications. I'll keep an eye on BF.
> >>
> >> BTW, one thing that we should try to evaluate a bit more is the
> >> traversal of slots in StandbySlotsHaveCaughtup() where we verify if
> >> all the slots mentioned in standby_slot_names have received the
> >> required WAL. Even if the standby_slot_names list is short the total
> >> number of slots can be much larger which can lead to an increase in
> >> CPU usage during traversal. There is an optimization that allows to
> >> cache ss_oldest_flush_lsn and ensures that we don't need to traverse
> >> the slots each time so it may not hit frequently but still there is a
> >> chance. I see it is possible to further optimize this area by caching
> >> the position of each slot mentioned in standby_slot_names in
> >> replication_slots array but not sure whether it is worth.
> >>
> >>
> >
> > I tried to test this by configuring a large number of logical slots while making
> > sure the standby slots are at the end of the array and checking if there was any
> > performance hit in logical replication from these searches.
> >
> 

Thanks Nisha for conducting some additional tests and discussing with me
internally. We have collected the performance data on HEAD. Basically, we don't
see a noticeable difference in the performance data and StandbySlotsHaveCaughtup
also does not stand out in the profile.

Here are the details:

> 1) Redoing XLogSendLogical time-log related test with
>    'sync_replication_slots' enabled.

Setup:
- one primary + 3 standbys + one subscriber with one active subscription
- ran 15 min pgbench for all cases
- hot_standby_feedback=ON and sync_replication_slots=TRUE

(To make the impact of SearchNamedReplicationSlot as visible as possible, the
standby slot is at the end of the ReplicationSlotCtl->replication_slots array
in each test)

Case1 - 1 slot:     895.305565 secs
Case2 - 100 slots:  894.936039 secs
Case3 - 500 slots:  895.256412 secs


> 2) pg_recvlogical test to monitor lag in StandbySlotsHaveCaughtup() for a
>    large number of slots.

We reran the XLogSendLogical() wait time analysis tests.
Setup:
- One primary node and 3 standby nodes
- Created logical slots using "test_decoding" and activated one walsender by running pg_recvlogical on one slot.
- hot_standby_feedback=ON and sync_replication_slots=TRUE
- Did one run for each case with pgbench for 15 min

(To make the impact of SearchNamedReplicationSlot as visible as possible, the
standby slot is at the end of the ReplicationSlotCtl->replication_slots array
in each test)

Case1 - 1 slot:     894.83775 secs
Case2 - 100 slots:  894.449356 secs
Case3 - 500 slots:  894.98479 secs

There is no noticeable regression when the number of replication slots increases.


> 3) Profiling to see if StandbySlotsHaveCaughtup() is noticeable in the report
>    when there are a large number of slots to traverse.

The setup is the same as 2). To make the impact of SearchNamedReplicationSlot
as visible as possible, the standby slot is at the end of the
ReplicationSlotCtl->replication_slots array.

StandbySlotsHaveCaughtup is not noticeable in the profile:

0.03%     0.00%  postgres  postgres            [.] StandbySlotsHaveCaughtup

After some investigation, it appears that the cached 'ss_oldest_flush_lsn'
plays a crucial role in optimizing this workload, effectively reducing the need
for frequent strcmp operations within the loop.
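
To illustrate why the cached value matters, here is a minimal, simplified
model of the fast path. It is only a sketch and not the actual PostgreSQL
code: the demo_* names, the DemoSlot struct, and the strcmp counter are
hypothetical stand-ins, and locking, slot validity checks, and missing-slot
warnings are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t DemoLsn;           /* hypothetical stand-in for XLogRecPtr */

typedef struct DemoSlot
{
    char    name[64];
    DemoLsn restart_lsn;
} DemoSlot;

DemoLsn demo_oldest_flush_lsn = 0;  /* models the cached ss_oldest_flush_lsn; 0 = invalid */
int     demo_strcmp_calls = 0;      /* instrumentation for this sketch only */

/* Linear search by name, similar in spirit to SearchNamedReplicationSlot. */
const DemoSlot *
demo_search_slot(const DemoSlot *slots, int nslots, const char *name)
{
    for (int i = 0; i < nslots; i++)
    {
        demo_strcmp_calls++;
        if (strcmp(slots[i].name, name) == 0)
            return &slots[i];
    }
    return NULL;
}

/* Returns true if every named standby slot has reached wait_for_lsn. */
bool
demo_standbys_caught_up(const DemoSlot *slots, int nslots,
                        const char *const *standby_names, int nnames,
                        DemoLsn wait_for_lsn)
{
    DemoLsn min_lsn = UINT64_MAX;

    assert(nslots >= 0 && nnames >= 0);

    /* Fast path: the cached oldest flush LSN already covers this request. */
    if (demo_oldest_flush_lsn != 0 && demo_oldest_flush_lsn >= wait_for_lsn)
        return true;

    /* Slow path: one name search per standby slot. */
    for (int i = 0; i < nnames; i++)
    {
        const DemoSlot *slot = demo_search_slot(slots, nslots, standby_names[i]);

        if (slot == NULL || slot->restart_lsn < wait_for_lsn)
            return false;
        if (slot->restart_lsn < min_lsn)
            min_lsn = slot->restart_lsn;
    }

    /* Remember the oldest position so later, older requests skip the loop. */
    if (nnames > 0)
        demo_oldest_flush_lsn = min_lsn;
    return true;
}
```

In this model, once a scan succeeds, any later call with a wait_for_lsn at or
below the cached value returns without a single strcmp, which is consistent
with the low profile share shown above.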

To test the impact of frequent strcmp calls, we conducted a test by removing
the 'ss_oldest_flush_lsn' check and re-evaluating the profile. This time, although the
profile indicated a small increase in the StandbySlotsHaveCaughtup metric,
it still does not raise significant concerns.

--1.47%--NeedToWaitForWal
|        NeedToWaitForStandbys
|        StandbySlotsHaveCaughtup
|        |          
|         --0.96%--SearchNamedReplicationSlot


The scripts that were used to set up the test environment for all of the above tests are attached.
The machine configuration for the above tests is as follows:
CPU : E7-4890v2(2.8GHz/15core)×4
MEM : 768GB
HDD : 600GB×2
OS : RHEL 7.9


While no noticeable overhead was observed in the SearchNamedReplicationSlot
operation, we explored a strategy to enhance efficiency by minimizing the
search for standby slots within the loop. The idea is to cache the
position of each standby slot within ReplicationSlotCtl->replication_slots. We
will reference the slot directly through
ReplicationSlotCtl->replication_slots[index]. If the slot name matches, we will
perform other checks including the restart_lsn; otherwise,
SearchNamedReplicationSlot is invoked to update the index cache accordingly.
This optimization can reduce the cost from O(n*m) to O(n).
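
The idea can be sketched roughly as follows. This is not the attached patch:
the Sketch* types, function names, and the search counter are hypothetical,
and locking is omitted. Reading n as the number of slots listed in
standby_slot_names and m as the total number of replication slots is also an
assumption made for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef struct SketchSlot
{
    char     name[64];
    bool     in_use;
    uint64_t restart_lsn;
} SketchSlot;

typedef struct SketchStandby
{
    const char *name;           /* entry from the standby slot name list */
    int         cached_index;   /* last known array position, -1 = unknown */
} SketchStandby;

int sketch_full_searches = 0;   /* instrumentation for this sketch only */

/* O(m) fallback: scan the whole array for a slot with the given name. */
int
sketch_search_by_name(const SketchSlot *slots, int nslots, const char *name)
{
    sketch_full_searches++;
    for (int i = 0; i < nslots; i++)
    {
        if (slots[i].in_use && strcmp(slots[i].name, name) == 0)
            return i;
    }
    return -1;
}

/*
 * O(1) in the common case: check the cached position first and fall back
 * to a full search (refreshing the cache) only when the name no longer matches.
 */
const SketchSlot *
sketch_lookup(const SketchSlot *slots, int nslots, SketchStandby *sb)
{
    int idx = sb->cached_index;

    assert(nslots >= 0);

    if (idx >= 0 && idx < nslots &&
        slots[idx].in_use && strcmp(slots[idx].name, sb->name) == 0)
        return &slots[idx];     /* cache hit: caller goes on to check restart_lsn */

    idx = sketch_search_by_name(slots, nslots, sb->name);
    sb->cached_index = idx;     /* may be -1 if the slot does not exist */
    return idx >= 0 ? &slots[idx] : NULL;
}
```

The name comparison on the cached entry is what keeps the cache safe if a slot
is dropped and its array position is reused: a mismatch simply triggers one
full search and refreshes the index.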

Note that since we didn't see the overhead in the tests, I am not proposing to
push this patch now. I am just sharing the idea and a small patch in case anyone
comes across a workload where the performance impact of
SearchNamedReplicationSlot becomes noticeable.

Best Regards,
Hou zj
