Re: Synchronizing slots from primary to standby
От | Drouvot, Bertrand |
---|---|
Тема | Re: Synchronizing slots from primary to standby |
Дата | |
Msg-id | 5a858c63-cd59-4622-a9ca-eebbb7bd59cc@gmail.com обсуждение исходный текст |
Ответ на | Re: Synchronizing slots from primary to standby (Amit Kapila <amit.kapila16@gmail.com>) |
Ответы |
Re: Synchronizing slots from primary to standby
(Amit Kapila <amit.kapila16@gmail.com>)
|
Список | pgsql-hackers |
Hi, On 11/7/23 11:55 AM, Amit Kapila wrote: > On Tue, Nov 7, 2023 at 3:51 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> >> On 10/31/23 10:37 AM, shveta malik wrote: >>> On Fri, Oct 27, 2023 at 8:43 PM Drouvot, Bertrand >>>> >>>> Agree with your test case, but in my case I was not using pub/sub. >>>> >>>> I was not clear, so when I said: >>>> >>>>>> - create logical_slot1 on the primary (and don't start using it) >>>> >>>> I meant don't start decoding from it (like using pg_recvlogical() or >>>> pg_logical_slot_get_changes()). >>>> >>>> By using pub/sub the "don't start using it" is not satisfied. >>>> >>>> My test case is: >>>> >>>> " >>>> SELECT * FROM pg_create_logical_replication_slot('logical_slot1', 'test_decoding', false, true, true); >>>> SELECT * FROM pg_create_logical_replication_slot('logical_slot2', 'test_decoding', false, true, true); >>>> pg_recvlogical -d postgres -S logical_slot2 --no-loop --start -f - >>>> " >>>> >>> >>> Okay, I am able to reproduce it now. Thanks for clarification. I have >>> tried to change the algorithm as per suggestion by Amit in [1] >>> >>> [1]: https://www.postgresql.org/message-id/CAA4eK1KBL0110gamQfc62X%3D5JV8-Qjd0dw0Mq0o07cq6kE%2Bq%3Dg%40mail.gmail.com >> >> Thanks! >> >>> >>> This is not full proof solution but optimization over first one. Now >>> in any sync-cycle, we take 2 attempts for slots-creation (if any slots >>> are available to be created). In first attempt, we do not wait >>> indefinitely on inactive slots, we wait only for a fixed amount of >>> time and if remote-slot is still behind, then we add that to the >>> pending list and move to the next slot. Once we are done with first >>> attempt, in second attempt, we go for the pending ones and now we wait >>> on each of them until the primary catches up. >> >> Aren't we "just" postponing the "issue"? I mean if there is really no activity >> on, say, the first created slot, then once we move to the second attempt then any newly >> created slot from that time would wait to be synced forever, no? >> > > We have to wait at some point in time for such inactive slots and the > same is true even for manually created slots on standby. Do you have > any better ideas to deal with it? > What about: - get rid of the second attempt and the pending_slot_list - keep the wait_count and PrimaryCatchupWaitAttempt logic so basically, get rid of: /* * Now sync the pending slots which were failed to be created in first * attempt. */ foreach(cell, pending_slot_list) { RemoteSlot *remote_slot = (RemoteSlot *) lfirst(cell); /* Wait until the primary server catches up */ PrimaryCatchupWaitAttempt = 0; synchronize_one_slot(wrconn, remote_slot, NULL); } and the pending_slot_list list. That way, for each slot that have not been created and synced yet: - it will be created on the standby - we will wait up to PrimaryCatchupWaitAttempt attempts - the slot will be synced or removed on/from the standby That way an inactive slot on the primary would not "block" any other slots on the standby. By "created" here I mean calling ReplicationSlotCreate() (not to be confused with emitting "ereport(LOG, errmsg("created slot \"%s\" locally", remote_slot->name)); " which is confusing as mentioned up-thread). The problem I can see with this proposal is that the "sync" window waiting for slot activity on the primary is "only" during the PrimaryCatchupWaitAttempt attempts (as the slot will be dropped/recreated). If we think this window is too short we could: - increase it or - don't drop the slot once created (even if there is no activity on the primary during PrimaryCatchupWaitAttempt attempts) so that the next loop of attempts will compare with "older" LSN/xmin (as compare to dropping and re-creating the slot). That way the window would be since the initial slot creation. Thoughts? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
В списке pgsql-hackers по дате отправления: