RE: Synchronizing slots from primary to standby
From | Zhijie Hou (Fujitsu) |
---|---|
Subject | RE: Synchronizing slots from primary to standby |
Date | |
Msg-id | OS0PR01MB571671A91E602BEFD3083EEE942A2@OS0PR01MB5716.jpnprd01.prod.outlook.com |
In response to | Re: Synchronizing slots from primary to standby (shveta malik <shveta.malik@gmail.com>) |
Responses | RE: Synchronizing slots from primary to standby ("Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com>) |
List | pgsql-hackers |
On Friday, March 8, 2024 1:09 PM shveta malik <shveta.malik@gmail.com> wrote:
> On Fri, Mar 8, 2024 at 9:56 AM Ajin Cherian <itsajin@gmail.com> wrote:
> >
> >> Pushed with minor modifications. I'll keep an eye on BF.
> >>
> >> BTW, one thing that we should try to evaluate a bit more is the
> >> traversal of slots in StandbySlotsHaveCaughtup() where we verify if
> >> all the slots mentioned in standby_slot_names have received the
> >> required WAL. Even if the standby_slot_names list is short the total
> >> number of slots can be much larger which can lead to an increase in
> >> CPU usage during traversal. There is an optimization that allows to
> >> cache ss_oldest_flush_lsn and ensures that we don't need to traverse
> >> the slots each time so it may not hit frequently but still there is a
> >> chance. I see it is possible to further optimize this area by caching
> >> the position of each slot mentioned in standby_slot_names in
> >> replication_slots array but not sure whether it is worth.
> >
> > I tried to test this by configuring a large number of logical slots while
> > making sure the standby slots are at the end of the array and checking if
> > there was any performance hit in logical replication from these searches.
>

Thanks to Nisha for conducting some additional tests and discussing them with me internally. We have collected the performance data on HEAD. Basically, we don't see a noticeable difference in the performance data, and StandbySlotsHaveCaughtup also does not stand out in the profile.

Here are the details:

> 1) Redoing XLogSendLogical time-log related test with
> 'sync_replication_slots' enabled.

Setup:
- one primary + 3 standbys + one subscriber with one active subscription
- ran pgbench for 15 min for all cases
- hot_standby_feedback=ON and sync_replication_slots=TRUE

(To maximize the impact of SearchNamedReplicationSlot, the standby slot is at the end of the ReplicationSlotCtl->replication_slots array in each test.)

Case1 - 1 slot: 895.305565 secs
Case2 - 100 slots: 894.936039 secs
Case3 - 500 slots: 895.256412 secs

> 2) pg_recvlogical test to monitor lag in StandbySlotsHaveCaughtup() for a
> large number of slots.

We reran the XLogSendLogical() wait time analysis tests.

Setup:
- one primary node and 3 standby nodes
- created logical slots using "test_decoding" and activated one walsender by running pg_recvlogical on one slot
- hot_standby_feedback=ON and sync_replication_slots=TRUE
- did one run for each case with pgbench for 15 min

(To maximize the impact of SearchNamedReplicationSlot, the standby slot is at the end of the ReplicationSlotCtl->replication_slots array in each test.)

Case1 - 1 slot: 894.83775 secs
Case2 - 100 slots: 894.449356 secs
Case3 - 500 slots: 894.98479 secs

There is no noticeable regression when the number of replication slots increases.

> 3) Profiling to see if StandbySlotsHaveCaughtup() is noticeable in the report
> when there are a large number of slots to traverse.

The setup is the same as in 2). To maximize the impact of SearchNamedReplicationSlot, the standby slot is at the end of the ReplicationSlotCtl->replication_slots array.

StandbySlotsHaveCaughtup is not noticeable in the profile:

    0.03%  0.00%  postgres  postgres  [.] StandbySlotsHaveCaughtup

After some investigation, it appears that the cached 'ss_oldest_flush_lsn' plays a crucial role in optimizing this workload: it lets most calls return before entering the loop, which avoids the frequent strcmp operations inside it. To test the impact of frequent strcmp calls, we removed the 'ss_oldest_flush_lsn' check and re-evaluated the profile. This time, although the profile showed a small increase for StandbySlotsHaveCaughtup, it still does not raise significant concerns:

    --1.47%--NeedToWaitForWal
              |
              NeedToWaitForStandbys
              |
              StandbySlotsHaveCaughtup
              |
              |
              |
               --0.96%--SearchNamedReplicationSlot

The scripts that were used to set up the test environment for all of the above tests are attached.

The machine configuration for the above tests is as follows:
CPU : E7-4890v2 (2.8 GHz/15 core) × 4
MEM : 768GB
HDD : 600GB × 2
OS : RHEL 7.9

While no noticeable overhead was observed in SearchNamedReplicationSlot, we explored a way to improve efficiency by minimizing the search for standby slots within the loop. The idea is to cache the position of each standby slot within ReplicationSlotCtl->replication_slots and to reference the slot directly through ReplicationSlotCtl->replication_slots[index]. If the slot name at that index matches, we perform the other checks, including the restart_lsn check; otherwise, SearchNamedReplicationSlot is invoked and the index cache is updated accordingly. This optimization can reduce the cost from O(n*m) to O(n).

Note that since we didn't see the overhead in the tests, I am not proposing to push this patch now. I am just sharing the idea and a small patch in case anyone comes across a workload where the performance impact of SearchNamedReplicationSlot becomes noticeable.

Best Regards,
Hou zj
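
For readers without the attachment, the index-cache idea described in the message can be illustrated with a minimal standalone sketch. This is not the attached patch and not PostgreSQL code: MockSlot, the slots array, search_named_slot, lookup_cached, standby_slots_caught_up, and full_searches are all hypothetical stand-ins invented for illustration, and all locking (which the real slot array would require) is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_SLOTS 8   /* hypothetical stand-in for the size of the slot array */
#define NAME_LEN  64  /* hypothetical stand-in for the slot name length limit */

typedef uint64_t Lsn; /* hypothetical stand-in for an LSN type */

/* Hypothetical, simplified slot; the real structure has many more fields. */
typedef struct MockSlot
{
    bool in_use;
    char name[NAME_LEN];
    Lsn  restart_lsn;
} MockSlot;

/* Stand-in for ReplicationSlotCtl->replication_slots (no locking here). */
MockSlot slots[MAX_SLOTS];

/* Counts full scans so the effect of the cache can be observed. */
int full_searches = 0;

/* Linear scan by name, in the spirit of SearchNamedReplicationSlot. */
int search_named_slot(const char *name)
{
    full_searches++;
    for (int i = 0; i < MAX_SLOTS; i++)
    {
        if (slots[i].in_use && strcmp(slots[i].name, name) == 0)
            return i;
    }
    return -1;
}

/*
 * Try the cached index first. If the slot there is still the one we want,
 * use it directly; otherwise fall back to a full scan and refresh the cache.
 */
MockSlot *lookup_cached(const char *name, int *cached_index)
{
    assert(name != NULL && cached_index != NULL);

    int idx = *cached_index;

    if (idx >= 0 && idx < MAX_SLOTS &&
        slots[idx].in_use && strcmp(slots[idx].name, name) == 0)
        return &slots[idx];

    idx = search_named_slot(name);
    *cached_index = idx;
    return idx >= 0 ? &slots[idx] : NULL;
}

/*
 * Return true only if every named standby slot exists and has a
 * restart_lsn at or beyond wait_for_lsn. index_cache[i] remembers the
 * last known array position of names[i]; initialize entries to -1.
 */
bool standby_slots_caught_up(const char *names[], int index_cache[],
                             int n, Lsn wait_for_lsn)
{
    for (int i = 0; i < n; i++)
    {
        MockSlot *slot = lookup_cached(names[i], &index_cache[i]);

        if (slot == NULL || slot->restart_lsn < wait_for_lsn)
            return false;
    }
    return true;
}
```

With a warm cache, each standby slot costs one strcmp at a known position instead of a scan over the whole array, which is where the O(n*m) to O(n) reduction mentioned in the message would come from. The name comparison on the cached position is what keeps the cache safe if a slot is dropped and its array entry is reused.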