Re: Improve pg_sync_replication_slots() to wait for primary to advance
| От | Ajin Cherian | 
|---|---|
| Тема | Re: Improve pg_sync_replication_slots() to wait for primary to advance | 
| Дата | |
| Msg-id | CAFPTHDZ+A5cXUTPz=nS225Ks1uVsmawX5disLHrRCtVr2ssCZQ@mail.gmail.com обсуждение исходный текст  | 
		
| Ответ на | Re: Improve pg_sync_replication_slots() to wait for primary to advance (shveta malik <shveta.malik@gmail.com>) | 
| Ответы | 
                	
            		Re: Improve pg_sync_replication_slots() to wait for primary to advance
            		
            		 | 
		
| Список | pgsql-hackers | 
On Thu, Oct 30, 2025 at 10:16 PM Chao Li <li.evan.chao@gmail.com> wrote:
>
> Hi Ajin,
>
> I have reviewed v20 and got a few comments:
>
> > On Oct 30, 2025, at 18:18, Ajin Cherian <itsajin@gmail.com> wrote:
> >
> > <v20-0001-Improve-initial-slot-synchronization-in-pg_sync_.patch>
>
> 1 - slotsync.c
> ```
> +               if (slot_names)
> +                       list_free_deep(slot_names);
>
>                 /* Cleanup the synced temporary slots */
>                 ReplicationSlotCleanup(true);
> @@ -1762,5 +2026,5 @@ SyncReplicationSlots(WalReceiverConn *wrconn)
>                 /* We are done with sync, so reset sync flag */
>                 reset_syncing_flag();
>         }
> -       PG_END_ENSURE_ERROR_CLEANUP(slotsync_failure_callback, PointerGetDatum(wrconn));
> +       PG_END_ENSURE_ERROR_CLEANUP(slotsync_failure_callback, PointerGetDatum(&fparams));
> ```
>
> I am afraid there is a risk of double memory free. Slot_names has been assigned to fparams.slot_names within the  for
loop,and it’s freed after the loop. If something gets wrong and slotsync_failure_callback() is called, the function
willfree fparams.slot_names again. 
>
Yes, good catch. I have changed to set fparams.slot_names to NIL after
freeing it, so that it isn't freed in slotsync_failure_callback().
> 2 - slotsync.c
> ```
> +                       /*
> +                        * Fetch remote slot info for the given slot_names. If slot_names is NIL,
> +                        * fetch all failover-enabled slots. Note that we reuse slot_names from
> +                        * the first iteration; re-fetching all failover slots each time could
> +                        * cause an endless loop. Instead of reprocessing only the pending slots
> +                        * in each iteration, it's better to process all the slots received in
> +                        * the first iteration. This ensures that by the time we're done, all
> +                        * slots reflect the latest values.
> +                        */
> +                       remote_slots = fetch_remote_slots(wrconn, slot_names);
> +
> +                       /* Attempt to synchronize slots */
> +                       some_slot_updated = synchronize_slots(wrconn, remote_slots,
> +
&slot_persistence_pending);
> +
> +                       /*
> +                        * If slot_persistence_pending is true, extract slot names
> +                        * for future iterations (only needed if we haven't done it yet)
> +                        */
> +                       if (slot_names == NIL && slot_persistence_pending)
> +                       {
> +                               slot_names = extract_slot_names(remote_slots);
> +
> +                               /* Update the failure structure so that it can be freed on error */
> +                               fparams.slot_names = slot_names;
> +                       }
> ```
>
> I am thinking if that could be a problem. As you now extract_slot_names() only in the first iteration, if a slot is
dropped,and a new slot comes with the same name, will the new slot be incorrectly synced? 
It doesn't matter, because the new slot will anyway have a later
restart_lsn and xmin, and all other attributes of the slot are also
updated as part of the sync. So, the old slot on the standby will
resemble the new slot on the primary.
On Fri, Oct 31, 2025 at 3:42 PM Japin Li <japinli@hotmail.com> wrote:
>
> Thanks for updating the patch.  Here are some comments on v20.
>
> 1. Since the content is unchanged, no modification is needed here.
>
> -                * We do not drop the slot because the restart_lsn can be ahead of the
> -                * current location when recreating the slot in the next cycle. It may
> -                * take more time to create such a slot. Therefore, we keep this slot
> -                * and attempt the synchronization in the next cycle.
> +                * We do not drop the slot because the restart_lsn can be
> +                * ahead of the current location when recreating the slot in
> +                * the next cycle. It may take more time to create such a
> +                * slot. Therefore, we keep this slot and attempt the
> +                * synchronization in the next cycle.
>
Changed.
> 2. Could we align the parameter comment style for synchronize_slots() and
> fetch_remote_slots() for better consistency?
>
Fixed.
> 3. Is this redundant? It was already initialized to false during declaration.
>
> +                       /* Reset flag before every iteration */
> +                       slot_persistence_pending = false;
>
Removed.
> 4. A minor nitpick.  The opening brace should be on a new line for style
> consistency.
>
> +                       if (!IsTransactionState()) {
> +                               StartTransactionCommand();
> +                               started_tx = true;
> +                       }
>
Fixed.
> 5. Given that fparams.slot_names is a list, I suggest we replace NULL with NIL
> for type consistency.
>
> +       fparams.slot_names = NULL;
>
Changed.
On Fri, Oct 31, 2025 at 4:34 PM shveta malik <shveta.malik@gmail.com> wrote:
>
> On Thu, Oct 30, 2025 at 3:48 PM Ajin Cherian <itsajin@gmail.com> wrote:
> >
> >
> > Thanks for your review, Japin. Here's patch v20 addressing the comments.
> >
>
> Thank You for the patch. Please find a few comment son test:
>
>
> 1)
> +# until the slot becomes sync-ready (when the standby catches up to the
> +# slot's restart_lsn).
>
> I think it should be 'when the primary server catches up' or 'when the
> remote slot catches up with the locally reserved position.'
>
Changed.
> 2)
> +# Attempt to synchronize slots using API. This will initially fail because
> +# the slot is not yet sync-ready (standby hasn't caught up to slot's
> restart_lsn),
> +# but the API will wait and retry. Call the API in a background process.
>
> a)
> 'This will initially fail ' seems like the API will give an error,
> which is not the case
>
> b) 'standby hasn't caught up to slot's restart_lsn' is not correct.
>
> We can rephrase to:
> # Attempt to synchronize slots using the API. The API will continue
> retrying synchronization until the remote slot catches up with the
> locally reserved position.
>
changed accordingly.
> 3)
> +# Enable the Subscription, so that the slot catches up
>
> slot --> remote slot
>
> 4)
> +# Create xl_running_xacts records on the primary for which the
> standby is waiting
>
> Shall we rephrase to below or anything better if you have?:
> Create xl_running_xacts on the primary to speed up restart_lsn advancement.
>
> 5)
> +# Confirm that the logical failover slot is created on the standby and is
> +# flagged as 'synced'
>
> Suggestion:
> Verify that the logical failover slot is created on the standby,
> marked as 'synced', and persisted.
>
> (It is important to mention persisted because even temporary slot is
> marked as synced)
>
changed as recommended.
I have addressed the above comments in patch v21.
regards,
Ajin Cherian
Fujitsu Australia
		
	Вложения
В списке pgsql-hackers по дате отправления: