Re: Synchronizing slots from primary to standby

From: Ajin Cherian
Subject: Re: Synchronizing slots from primary to standby
Date:
Msg-id: CAFPTHDbERkhu3NoP+rYKwro+epng2eXyQqts+-tOYXi3J-ng=g@mail.gmail.com
In reply to: RE: Synchronizing slots from primary to standby  ("Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com>)
List: pgsql-hackers


On Thu, Feb 29, 2024 at 12:34 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
On Monday, February 26, 2024 7:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Feb 26, 2024 at 7:49 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> wrote:
> >
> > Attach the V98 patch set which addressed above comments.
> >
>
> Few comments:
> =============
> 1.
>  WalSndWaitForWal(XLogRecPtr loc)
>  {
>   int wakeEvents;
> + bool wait_for_standby = false;
> + uint32 wait_event;
> + List    *standby_slots = NIL;
>   static XLogRecPtr RecentFlushPtr = InvalidXLogRecPtr;
>
> + if (MyReplicationSlot->data.failover && replication_active)
> + standby_slots = GetStandbySlotList(true);
> +
>   /*
> - * Fast path to avoid acquiring the spinlock in case we already know we
> - * have enough WAL available. This is particularly interesting if we're
> - * far behind.
> + * Check if all the standby servers have confirmed receipt of WAL up to
> + * RecentFlushPtr even when we already know we have enough WAL available.
> + *
> + * Note that we cannot directly return without checking the status of
> + * standby servers because the standby_slot_names may have changed,
> + which
> + * means there could be new standby slots in the list that have not yet
> + * caught up to the RecentFlushPtr.
>   */
> - if (RecentFlushPtr != InvalidXLogRecPtr &&
> - loc <= RecentFlushPtr)
> - return RecentFlushPtr;
> + if (!XLogRecPtrIsInvalid(RecentFlushPtr) && loc <= RecentFlushPtr) {
> + FilterStandbySlots(RecentFlushPtr, &standby_slots);
>
> I think even if the slot list is not changed, we will always process each slot
> mentioned in standby_slot_names once. Can't we cache the previous list of
> slots for we have already waited for? In that case, we won't even need to copy
> the list via GetStandbySlotList() unless we need to wait.
>
> 2.
>  WalSndWaitForWal(XLogRecPtr loc)
>  {
> + /*
> + * Update the standby slots that have not yet caught up to the flushed
> + * position. It is good to wait up to RecentFlushPtr and then let it
> + * send the changes to logical subscribers one by one which are
> + * already covered in RecentFlushPtr without needing to wait on every
> + * change for standby confirmation.
> + */
> + if (wait_for_standby)
> + FilterStandbySlots(RecentFlushPtr, &standby_slots);
> +
>   /* Update our idea of the currently flushed position. */
> - if (!RecoveryInProgress())
> + else if (!RecoveryInProgress())
>   RecentFlushPtr = GetFlushRecPtr(NULL);
>   else
>   RecentFlushPtr = GetXLogReplayRecPtr(NULL); ...
> /*
> * If postmaster asked us to stop, don't wait anymore.
> *
> * It's important to do this check after the recomputation of
> * RecentFlushPtr, so we can send all remaining data before shutting
> * down.
> */
> if (got_STOPPING)
> break;
>
> I think because 'wait_for_standby' may not be set in the first or consecutive
> cycles we may send the WAL to the logical subscriber before sending it to the
> physical subscriber during shutdown.

Here is the v101 patch set, which addresses the above comments.

This version caches the oldest standby slot's LSN each time we wait for the
standbys to catch up. The cached LSN is invalidated when the GUC configuration
is reloaded. In the WalSndWaitForWal function, instead of traversing the entire
standby list each time, we check the cached LSN to quickly determine whether the
standbys have caught up. When a shutdown signal is received, we continue to wait
for the standby slots to catch up; if any slot is dropped, invalidated, or
inactive while we are waiting after the shutdown signal, an ERROR is reported.
This measure is taken to prevent the walsender from waiting indefinitely.


Thanks for the patch.
I ran some performance tests on patch v101 with synchronous_commit turned on, to check how much the logical replication changes affect transaction speed on the primary compared to HEAD. In all configurations there is a primary, a logical subscriber, and a physical standby, and the logical subscriber is listed in synchronous_standby_names. This means no transaction on the primary commits until the logical subscriber has confirmed receipt of that transaction.


Machine details:
Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz, 800GB RAM

My additional configuration on each instance is:
shared_buffers = 40GB
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off

All tests are done using pgbench running for 15 minutes:
Creating tables: pgbench -p 6972 postgres -qis 2
Running benchmark: pgbench postgres -p 6972 -c 10 -j 3 -T 900 -P 5

HEAD code -
Primary with synchronous_commit=on, physical standby with hot_standby_feedback=off
RUN1 (TPS)   RUN2 (TPS)   AVERAGE (TPS)
8.226658     8.17815      8.202404

HEAD code -
Primary with synchronous_commit=on, physical standby with hot_standby_feedback=on
RUN1 (TPS)   RUN2 (TPS)   AVERAGE (TPS)
8.134901     8.229066     8.1819835 -- degradation from first config: -0.25%

PATCHED code - (v101-0001)
Primary with synchronous_commit=on, physical standby with hot_standby_feedback=on, standby_slot_names not configured, logical subscriber not failover enabled, physical standby not configured for sync
RUN1 (TPS)   RUN2 (TPS)   AVERAGE (TPS)
8.18839      8.18839      8.18839 -- degradation from first config: -0.17%

PATCHED code - (v98-0001)
synchronous_commit=on, hot_standby_feedback=on, standby_slot_names configured to physical standby, logical subscriber failover enabled, physical standby configured for sync
RUN1 (TPS)   RUN2 (TPS)   AVERAGE (TPS)
8.173062     8.068536     8.120799 -- degradation from first config: -0.99%

Overall, I do not see any significant performance degradation with the patch and slot synchronization enabled, with one logical subscriber and one physical standby.
Attaching script for my final test configuration for reference.

regards,
Ajin Cherian
Fujitsu Australia
Attachments
