Re: Synchronizing slots from primary to standby

Поиск
Список
Период
Сортировка
От Masahiko Sawada
Тема Re: Synchronizing slots from primary to standby
Дата
Msg-id CAD21AoCYXhDYOQDAS-rhGasC2T+tYbV=8Y18o94sB=5AxcW+yA@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Synchronizing slots from primary to standby  (shveta malik <shveta.malik@gmail.com>)
Ответы Re: Synchronizing slots from primary to standby  (shveta malik <shveta.malik@gmail.com>)
Re: Synchronizing slots from primary to standby  (Amit Kapila <amit.kapila16@gmail.com>)
Re: Synchronizing slots from primary to standby  (shveta malik <shveta.malik@gmail.com>)
Список pgsql-hackers
On Mon, Feb 19, 2024 at 9:59 PM shveta malik <shveta.malik@gmail.com> wrote:
>
> On Mon, Feb 19, 2024 at 5:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Few comments on 0001
>
> Thanks for the feedback.
>
> > ====================
> > 1. I think it is better to error out when the valid GUC or option is
> > not set in ensure_valid_slotsync_params() and
> > ensure_valid_remote_info() instead of waiting. And we shouldn't start
> > the worker in the first place if not all required GUCs are set. This
> > will additionally simplify the code a bit.
>
> Sure, removed 'ensure' functions. Moved the validation checks to the
> postmaster before starting the slot sync worker.
>
> > 2.
> > +typedef struct SlotSyncWorkerCtxStruct
> >  {
> > - /* prevents concurrent slot syncs to avoid slot overwrites */
> > + pid_t pid;
> > + bool stopSignaled;
> >   bool syncing;
> > + time_t last_start_time;
> >   slock_t mutex;
> > -} SlotSyncCtxStruct;
> > +} SlotSyncWorkerCtxStruct;
> >
> > I think we don't need to change the name of this struct as this can be
> > used both by the worker and the backend. We can probably add the
> > comment to indicate that all the fields except 'syncing' are used by
> > slotsync worker.
>
> Modified.
>
> > 3. Similar to above, function names like SlotSyncShmemInit() shouldn't
> > be changed to SlotSyncWorkerShmemInit().
>
> Modified.
>
> > 4.
> > +ReplSlotSyncWorkerMain(int argc, char *argv[])
> > {
> > ...
> > + on_shmem_exit(slotsync_worker_onexit, (Datum) 0);
> > ...
> > + before_shmem_exit(slotsync_failure_callback, PointerGetDatum(wrconn));
> > ...
> > }
> >
> > Do we need two separate callbacks? Can't we have just one (say
> > slotsync_failure_callback) that cleans additional things in case of
> > slotsync worker? And, if we need both those callbacks then please add
> > some comments for both and why one is before_shmem_exit() and the
> > other is on_shmem_exit().
>
> I think we can merge these now. Earlier 'on_shmem_exit' was needed to
> avoid race-condition between startup and slot sync worker process to
> drop 'i' slots on promotion.  Now we do not have any such scenario.
> But I need some time to analyze it well. Will do it in the next
> version.
>
> > In addition to the above, I have made a few changes in the comments
> > and code (cosmetic code changes). Please include those in the next
> > version if you find them okay. You need to rename .txt file to remove
> > .txt and apply atop v90-0001*.
>
> Sure, included these.
>
> Please find the patch001 attached.

I've reviewed the v91 patch. Here are random comments:

---
 /*
  * Checks the remote server info.
  *
- * We ensure that the 'primary_slot_name' exists on the remote server and the
- * remote server is not a standby node.
+ * Check whether we are a cascading standby. For non-cascading standbys, it
+ * also ensures that the 'primary_slot_name' exists on the remote server.
  */

IIUC what the validate_remote_info() does doesn't not change by this
patch, so the previous comment seems to be clearer to me.

---
    if (remote_in_recovery)
+   {
+       /*
+        * If we are a cascading standby, no need to check further for
+        * 'primary_slot_name'.
+        */
        ereport(ERROR,
                errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                errmsg("cannot synchronize replication slots from a
standby server"));
+   }
+   else
+   {
+       bool        primary_slot_valid;

-   primary_slot_valid = DatumGetBool(slot_getattr(tupslot, 2, &isnull));
-   Assert(!isnull);
+       primary_slot_valid = DatumGetBool(slot_getattr(tupslot, 2, &isnull));
+       Assert(!isnull);

-   if (!primary_slot_valid)
-       ereport(ERROR,
-               errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-               errmsg("bad configuration for slot synchronization"),
-       /* translator: second %s is a GUC variable name */
-               errdetail("The replication slot \"%s\" specified by
\"%s\" does not exist on the primary server.",
-                         PrimarySlotName, "primary_slot_name"));
+       if (!primary_slot_valid)
+           ereport(ERROR,
+                   errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                   errmsg("bad configuration for slot synchronization"),
+           /* translator: second %s is a GUC variable name */
+                   errdetail("The replication slot \"%s\" specified
by \"%s\" does not exist on the primary server.",
+                             PrimarySlotName, "primary_slot_name"));
+   }

I think it's a refactoring rather than changes required by the
slotsync worker. We can do that in a separate patch but why do we need
this change in the first place?

---
+        ValidateSlotSyncParams(ERROR);
+
         /* Load the libpq-specific functions */
         load_file("libpqwalreceiver", false);

-        ValidateSlotSyncParams();
+        (void) CheckDbnameInConninfo();

Is there any reason why we move where to check the parameters?

Some comments not related to the patch but to the existing code:

---
It might have already been discussed but is the
src/backend/replication/logical the right place for the slocsync.c? If
it's independent of logical decoding/replication, is under
src/backend/replication could be more appropriate?

---
    /* Construct query to fetch slots with failover enabled. */
    appendStringInfo(&s,
                     "SELECT slot_name, plugin, confirmed_flush_lsn,"
                     " restart_lsn, catalog_xmin, two_phase, failover,"
                     " database, conflict_reason"
                     " FROM pg_catalog.pg_replication_slots"
                     " WHERE failover and NOT temporary");

    /* Execute the query */
    res = walrcv_exec(wrconn, s.data, SLOTSYNC_COLUMN_COUNT, slotRow);
    pfree(s.data);

We don't need 's' as the query is constant.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andrei Lepikhov
Дата:
Сообщение: Re: Memory consumed by paths during partitionwise join planning
Следующее
От: Erik Wienhold
Дата:
Сообщение: Re: Patch: Add parse_type Function