Обсуждение: Re: Improve pg_sync_replication_slots() to wait for primary to advance
On Thu, 06 Nov 2025 at 18:53, Ajin Cherian <itsajin@gmail.com> wrote:
> On Tue, Nov 4, 2025 at 5:23 PM shveta malik <shveta.malik@gmail.com> wrote:
>>
>> >
>> > I have addressed the above comments in patch v21.
>> >
>>
>> Thank You. Please find a few comments:
>>
>> 1)
>> + fparams.slot_names = slot_names = NIL;
>>
>> I think it is not needed to set slot_names to NIL.
>>
>> 2)
>> - WAIT_EVENT_REPLICATION_SLOTSYNC_MAIN);
>> + WAIT_EVENT_REPLICATION_SLOTSYNC_PRIMARY_CATCHUP);
>>
>> The new name does not seem appropriate. For the slotsync-worker case,
>> even when the primary is not behind, the worker still waits but it is
>> not waiting for primary to catch-up. I could not find a better name
>> except the original one 'WAIT_EVENT_REPLICATION_SLOTSYNC_MAIN'. We can
>> change the explanation to :
>>
>> "Waiting in main loop of slot sync worker and slot sync API."
>> Or
>> "Waiting in main loop of slot synchronization."
>>
>> If anyone has any better name suggestions, we can consider changing.
>
> Changed as suggested above.
>
>>
>> 3)
>>
>> +# Attempt to synchronize slots using API. The API will continue retrying
>> +# synchronization until the remote slot catches up with the locally reserved
>> +# position. The API will not return until this happens, to be able to make
>> +# further calls, call the API in a background process.
>>
>> Shall we remove 'with the locally reserved position', it’s already
>> explained in the test header and the comment is good enough even
>> without it.
>>
>
> Changed.
>
>> 4)
>> +# Confirm log that the slot has been synced after becoming sync-ready.
>>
>> Shall we just say:
>> Confirm from the log that the slot is sync-ready now.
>>
>> 5)
>> # Synchronize the primary server slots to the standby.
>> $standby1->safe_psql('postgres', "SELECT pg_sync_replication_slots();");
>> @@ -945,6 +1007,7 @@ $subscriber1->safe_psql('postgres',
>> is( $standby1->safe_psql(
>> 'postgres',
>> q{SELECT count(*) = 2 FROM pg_replication_slots WHERE slot_name IN
>> ('lsub1_slot', 'snap_test_slot') AND synced AND NOT temporary;}
>> +
>> ),
>>
>> Redundant change.
>
> Removed.
>
> Attaching patch v22 addressing the above comments.
@@ -62,8 +62,8 @@ LOGICAL_APPLY_MAIN "Waiting in main loop of logical replication apply process."
LOGICAL_LAUNCHER_MAIN "Waiting in main loop of logical replication launcher process."
LOGICAL_PARALLEL_APPLY_MAIN "Waiting in main loop of logical replication parallel apply process."
RECOVERY_WAL_STREAM "Waiting in main loop of startup process for WAL to arrive, during streaming recovery."
-REPLICATION_SLOTSYNC_MAIN "Waiting in main loop of slot sync worker."
REPLICATION_SLOTSYNC_SHUTDOWN "Waiting for slot sync worker to shut down."
+REPLICATION_SLOTSYNC_MAIN "Waiting in main loop of slot synchronization."
SYSLOGGER_MAIN "Waiting in main loop of syslogger process."
WAL_RECEIVER_MAIN "Waiting in main loop of WAL receiver process."
WAL_SENDER_MAIN "Waiting in main loop of WAL sender process."
I've noticed that all events are sorted alphabetically. I think we should keep
the order of REPLICATION_SLOTSYNC_MAIN unchanged.
--
Regards,
Japin Li
ChengDu WenWu Information Technology Co., Ltd.
On Fri, Nov 7, 2025 at 10:36 AM Japin Li <japinli@hotmail.com> wrote: > > > > > Attaching patch v22 addressing the above comments. > > @@ -62,8 +62,8 @@ LOGICAL_APPLY_MAIN "Waiting in main loop of logical replication apply process." > LOGICAL_LAUNCHER_MAIN "Waiting in main loop of logical replication launcher process." > LOGICAL_PARALLEL_APPLY_MAIN "Waiting in main loop of logical replication parallel apply process." > RECOVERY_WAL_STREAM "Waiting in main loop of startup process for WAL to arrive, during streaming recovery." > -REPLICATION_SLOTSYNC_MAIN "Waiting in main loop of slot sync worker." > REPLICATION_SLOTSYNC_SHUTDOWN "Waiting for slot sync worker to shut down." > +REPLICATION_SLOTSYNC_MAIN "Waiting in main loop of slot synchronization." > SYSLOGGER_MAIN "Waiting in main loop of syslogger process." > WAL_RECEIVER_MAIN "Waiting in main loop of WAL receiver process." > WAL_SENDER_MAIN "Waiting in main loop of WAL sender process." > > I've noticed that all events are sorted alphabetically. I think we should keep > the order of REPLICATION_SLOTSYNC_MAIN unchanged. > +1. Few trivial comments: 1) Since we have always used the term 'SQL function' rather than API in existing code, shall we change all references of API to 'SQL function' in current patch: + * If the pg_sync_replication API is used to sync the slots, and if the slots "If the SQL function pg_sync_replication_slots() is used.." + * the reasons mentioned above, then the API also waits and retries until the API --> SQL function + * persist. It is utilized by the pg_sync_replication_slots() API. pg_sync_replication_slots() API --> SQL function pg_sync_replication_slots() + * the API can retry. API --> SQL function + /* Set this, so that API can retry */ API --> SQL function + * persist. It is utilized by the pg_sync_replication_slots() API. pg_sync_replication_slots() API --> SQL function pg_sync_replication_slots() + * slot_persistence_pending - boolean used by pg_sync_replication_slots + * API to track if any slots could not be pg_sync_replication_slots API --> SQL function pg_sync_replication_slots() + * Interrupt handler for pg_sync_replication_slots() API. pg_sync_replication_slots() API --> SQL function pg_sync_replication_slots() 2) ProcessSlotSyncAPIInterrupts slotsync_api_reread_config -- These also have API in it, but I do not have any better name suggestions here, we can retain the current ones and see what others say. 3) /* * Re-read the config file. * * Exit if any of the slot sync GUCs have changed. The postmaster will * restart it. */ static void slotsync_reread_config(void) Shall we change this existing comment to: Re-read the config file for slot sync worker. 4) +/* + * Re-read the config file and check for critical parameter changes. + * + */ +static void +slotsync_api_reread_config(void) Shall we change comment to: /* * Re-read the config file for SQL function pg_sync_replication_slots() * * Emit error if any of the slot sync GUCs have changed. */ thanks Shveta
On Mon, Nov 10, 2025 at 3:01 PM shveta malik <shveta.malik@gmail.com> wrote:
>
>
>
> 2)
> ProcessSlotSyncAPIInterrupts
> slotsync_api_reread_config
> -- These also have API in it, but I do not have any better name
> suggestions here, we can retain the current ones and see what others
> say.
ProcessSlotSyncInterrupts() handles shutdown waiting,
ProcessSlotSyncAPIInterrupts doesn't. Why is this difference? It will
be good to explain why we need two different functions for worker and
SQL function and also explain the difference between them.
$primary->psql('postgres', "SELECT pg_drop_replication_slot('lsub2_slot');");
+$primary->psql('postgres', "SELECT pg_drop_replication_slot('lsub1_slot');");
I think, the intention behind dropping the slot is to be able to
create it again for the next test. But there is no comment here
explaining that. There is also no comment explaining why we are
dropping both slots here; when the test only needs dropping one.
That's going to create confusion. One might think that all the slots
need to be dropped at this stage, and drop and create any future slots
that are used by prior code, for example. At the end of this test, we
recreate the slot using pg_create_logical_replication_slot(), which is
different method of creating slot than this test does. Though I can
understand the reason, it's not apparent. Generally reusing slot names
across multiple tests (in this file) is a source of confusion. But at
least for the test you are adding, you could use a different slot name
to avoid confusion.
+# Create some DDL on the primary so that the slot lags behind the standby
+$primary->safe_psql('postgres', "CREATE TABLE push_wal (a int);");
+
+# Attempt to synchronize slots using API. The API will continue retrying
+# synchronization until the remote slot catches up.
+# The API will not return until this happens, to be able to make
+# further calls, call the API in a background process.
+my $log_offset = -s $standby1->logfile;
+
+my $h = $standby1->background_psql('postgres', on_error_stop => 0);
+
+$h->query_until(qr//, "SELECT pg_sync_replication_slots();\n");
If the standby does not receive the WAL corresponding to the DDL
before this function is executed, the slot will get synchronized
immediately. I think we have to make sure that the standby has
received the DDL before executing this function.
Also most of the code which uses query_until has pattern like this:
$h->query_until(qr/start/, q{\echo start
SQL command});
But we expect an empty string here. Why this difference?
I think we need a similar test to test promotion while the function is
waiting for the slot to become sync-ready.
SyncReplicationSlots() and the main loop in ReplSlotSyncWorkerMain()
are similar with some functional differences. Some part of their code
needs to be kept in sync in future. How do we achieve that? At least
we need a comment saying so in each of those patches and keep those
two codes in proximity.
--
Best Wishes,
Ashutosh Bapat
On Mon, Nov 10, 2025 at 8:31 PM shveta malik <shveta.malik@gmail.com> wrote:
>
> On Fri, Nov 7, 2025 at 10:36 AM Japin Li <japinli@hotmail.com> wrote:
> >
> > >
> > > Attaching patch v22 addressing the above comments.
> >
> > @@ -62,8 +62,8 @@ LOGICAL_APPLY_MAIN "Waiting in main loop of logical replication apply process."
> > LOGICAL_LAUNCHER_MAIN "Waiting in main loop of logical replication launcher process."
> > LOGICAL_PARALLEL_APPLY_MAIN "Waiting in main loop of logical replication parallel apply process."
> > RECOVERY_WAL_STREAM "Waiting in main loop of startup process for WAL to arrive, during streaming recovery."
> > -REPLICATION_SLOTSYNC_MAIN "Waiting in main loop of slot sync worker."
> > REPLICATION_SLOTSYNC_SHUTDOWN "Waiting for slot sync worker to shut down."
> > +REPLICATION_SLOTSYNC_MAIN "Waiting in main loop of slot synchronization."
> > SYSLOGGER_MAIN "Waiting in main loop of syslogger process."
> > WAL_RECEIVER_MAIN "Waiting in main loop of WAL receiver process."
> > WAL_SENDER_MAIN "Waiting in main loop of WAL sender process."
> >
> > I've noticed that all events are sorted alphabetically. I think we should keep
> > the order of REPLICATION_SLOTSYNC_MAIN unchanged.
> >
>
> +1.
>
Yes, changed.
> Few trivial comments:
>
> 1)
> Since we have always used the term 'SQL function' rather than API in
> existing code, shall we change all references of API to 'SQL function'
> in current patch:
>
> + * If the pg_sync_replication API is used to sync the slots, and if the slots
> "If the SQL function pg_sync_replication_slots() is used.."
>
> + * the reasons mentioned above, then the API also waits and retries until the
> API --> SQL function
>
> + * persist. It is utilized by the pg_sync_replication_slots() API.
> pg_sync_replication_slots() API --> SQL function pg_sync_replication_slots()
>
> + * the API can retry.
> API --> SQL function
>
> + /* Set this, so that API can retry */
> API --> SQL function
>
> + * persist. It is utilized by the pg_sync_replication_slots() API.
> pg_sync_replication_slots() API --> SQL function pg_sync_replication_slots()
>
> + * slot_persistence_pending - boolean used by pg_sync_replication_slots
> + * API to track if any slots could not be
> pg_sync_replication_slots API --> SQL function pg_sync_replication_slots()
>
> + * Interrupt handler for pg_sync_replication_slots() API.
> pg_sync_replication_slots() API --> SQL function pg_sync_replication_slots()
>
>
> 2)
> ProcessSlotSyncAPIInterrupts
> slotsync_api_reread_config
> -- These also have API in it, but I do not have any better name
> suggestions here, we can retain the current ones and see what others
> say.
>
Changed.
> 3)
> /*
> * Re-read the config file.
> *
> * Exit if any of the slot sync GUCs have changed. The postmaster will
> * restart it.
> */
> static void
> slotsync_reread_config(void)
>
> Shall we change this existing comment to: Re-read the config file for
> slot sync worker.
>
> 4)
>
> +/*
> + * Re-read the config file and check for critical parameter changes.
> + *
> + */
> +static void
> +slotsync_api_reread_config(void)
>
> Shall we change comment to:
> /*
> * Re-read the config file for SQL function pg_sync_replication_slots()
> *
> * Emit error if any of the slot sync GUCs have changed.
> */
>
Changed.
On Mon, Nov 10, 2025 at 9:44 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Mon, Nov 10, 2025 at 3:01 PM shveta malik <shveta.malik@gmail.com> wrote:
> >
> >
> >
> > 2)
> > ProcessSlotSyncAPIInterrupts
> > slotsync_api_reread_config
> > -- These also have API in it, but I do not have any better name
> > suggestions here, we can retain the current ones and see what others
> > say.
>
> ProcessSlotSyncInterrupts() handles shutdown waiting,
> ProcessSlotSyncAPIInterrupts doesn't. Why is this difference? It will
> be good to explain why we need two different functions for worker and
> SQL function and also explain the difference between them.
>
I've updated the function header to explain this. The slot sync worker
is a specific background worker while the API runs in the regular
backend, so special handling is not needed.
> $primary->psql('postgres', "SELECT pg_drop_replication_slot('lsub2_slot');");
> +$primary->psql('postgres', "SELECT pg_drop_replication_slot('lsub1_slot');");
>
> I think, the intention behind dropping the slot is to be able to
> create it again for the next test. But there is no comment here
> explaining that. There is also no comment explaining why we are
> dropping both slots here; when the test only needs dropping one.
> That's going to create confusion. One might think that all the slots
> need to be dropped at this stage, and drop and create any future slots
> that are used by prior code, for example. At the end of this test, we
> recreate the slot using pg_create_logical_replication_slot(), which is
> different method of creating slot than this test does. Though I can
> understand the reason, it's not apparent. Generally reusing slot names
> across multiple tests (in this file) is a source of confusion. But at
> least for the test you are adding, you could use a different slot name
> to avoid confusion.
I've added a comment there that dropping both the slots is required
for the next test. Also I cannot change the name of the slot as the
next tests need the same slot synced.
> +# Create some DDL on the primary so that the slot lags behind the standby
> +$primary->safe_psql('postgres', "CREATE TABLE push_wal (a int);");
> +
> +# Attempt to synchronize slots using API. The API will continue retrying
> +# synchronization until the remote slot catches up.
> +# The API will not return until this happens, to be able to make
> +# further calls, call the API in a background process.
> +my $log_offset = -s $standby1->logfile;
> +
> +my $h = $standby1->background_psql('postgres', on_error_stop => 0);
> +
> +$h->query_until(qr//, "SELECT pg_sync_replication_slots();\n");
>
> If the standby does not receive the WAL corresponding to the DDL
> before this function is executed, the slot will get synchronized
> immediately. I think we have to make sure that the standby has
> received the DDL before executing this function.
Yes, I've added a line to make sure that the stanbdy has caught up.
>
> Also most of the code which uses query_until has pattern like this:
> $h->query_until(qr/start/, q{\echo start
> SQL command});
> But we expect an empty string here. Why this difference?
>
I've modified it as suggested.
> I think we need a similar test to test promotion while the function is
> waiting for the slot to become sync-ready.
>
Unfortunately, that will make this test too long if I add one more
wait loop for slot sync.
> SyncReplicationSlots() and the main loop in ReplSlotSyncWorkerMain()
> are similar with some functional differences. Some part of their code
> needs to be kept in sync in future. How do we achieve that? At least
> we need a comment saying so in each of those patches and keep those
> two codes in proximity.
>
I've added a comment in the header of ReplSlotSyncWorkerMain to suggest this.
Attaching patch v23 addressing these comments.
regards,
Ajin Cherian
Fujitsu Australia