RE: [PATCH] Reuse Workers and Replication Slots during Logical Replication

Поиск
Список
Период
Сортировка
От Yu Shi (Fujitsu)
Тема RE: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Дата
Msg-id OSZPR01MB6310133C87996F1182F7920FFD4DA@OSZPR01MB6310.jpnprd01.prod.outlook.com
обсуждение исходный текст
Ответ на Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication  (Melih Mutlu <m.melihmutlu@gmail.com>)
Список pgsql-hackers
On Thu, Jun 1, 2023 6:54 PM Melih Mutlu <m.melihmutlu@gmail.com> wrote:
> 
> Hi,
> 
> I rebased the patch and addressed the following reviews.
> 

Thanks for updating the patch. Here are some comments on 0001 patch.

1.
-    ereport(LOG,
-            (errmsg("logical replication table synchronization worker for subscription \"%s\", table \"%s\" has
finished",
-                    MySubscription->name,
-                    get_rel_name(MyLogicalRepWorker->relid))));

Could we move this to somewhere else instead of removing it?

2.
+    if (!OidIsValid(originid))
+        originid = replorigin_create(originname);
+    replorigin_session_setup(originid, 0);
+    replorigin_session_origin = originid;
+    *origin_startpos = replorigin_session_get_progress(false);
+    CommitTransactionCommand();
+
+    /* Is the use of a password mandatory? */
+    must_use_password = MySubscription->passwordrequired &&
+        !superuser_arg(MySubscription->owner);
+    LogRepWorkerWalRcvConn = walrcv_connect(MySubscription->conninfo, true,
+                                            must_use_password,
+                                            MySubscription->name, &err);

It seems that there is a problem when refactoring.
See commit e7e7da2f8d5.

3.
+    /* Set this to false for safety, in case we're already reusing the worker */
+    MyLogicalRepWorker->ready_to_reuse = false;
+

I am not sure do we need to lock when setting it.

4.
+    /*
+     * Allocate the origin name in long-lived context for error context
+     * message.
+     */
+    StartTransactionCommand();
+    ReplicationOriginNameForLogicalRep(MySubscription->oid,
+                                       MyLogicalRepWorker->relid,
+                                       originname,
+                                       originname_size);
+    CommitTransactionCommand();

Do we need the call to StartTransactionCommand() and CommitTransactionCommand()
here? Besides, the comment here is the same as the comment atop
set_apply_error_context_origin(), do we need it?

5.
I saw a segmentation fault when debugging.

It happened when calling sync_worker_exit() called (see the code below in
LogicalRepSyncTableStart()). In the case that this is not the first table the
worker synchronizes, clean_sync_worker() has been called before (in
TablesyncWorkerMain()), and LogRepWorkerWalRcvConn has been set to NULL. Then, a
segmentation fault happened because LogRepWorkerWalRcvConn is a null pointer.

    switch (relstate)
    {
        case SUBREL_STATE_SYNCDONE:
        case SUBREL_STATE_READY:
        case SUBREL_STATE_UNKNOWN:
            sync_worker_exit();    /* doesn't return */
    }

Here is the backtrace.

#0  0x00007fc8a8ce4c95 in libpqrcv_disconnect (conn=0x0) at libpqwalreceiver.c:757
#1  0x000000000092b8c0 in clean_sync_worker () at tablesync.c:150
#2  0x000000000092b8ed in sync_worker_exit () at tablesync.c:164
#3  0x000000000092d8f6 in LogicalRepSyncTableStart (origin_startpos=0x7ffd50f30f08) at tablesync.c:1293
#4  0x0000000000934f76 in start_table_sync (origin_startpos=0x7ffd50f30f08, myslotname=0x7ffd50f30e80) at
worker.c:4457
#5  0x000000000093513b in run_tablesync_worker (options=0x7ffd50f30ec0, slotname=0x0, originname=0x7ffd50f30f10
"pg_16394_16395",
    originname_size=64, origin_startpos=0x7ffd50f30f08) at worker.c:4532
#6  0x0000000000935a3a in TablesyncWorkerMain (main_arg=1) at worker.c:4853
#7  0x00000000008e97f6 in StartBackgroundWorker () at bgworker.c:864
#8  0x00000000008f350b in do_start_bgworker (rw=0x12fc1a0) at postmaster.c:5762
#9  0x00000000008f38b7 in maybe_start_bgworkers () at postmaster.c:5986
#10 0x00000000008f2975 in process_pm_pmsignal () at postmaster.c:5149
#11 0x00000000008ee98a in ServerLoop () at postmaster.c:1770
#12 0x00000000008ee3bb in PostmasterMain (argc=3, argv=0x12c4af0) at postmaster.c:1463
#13 0x00000000007b6d3a in main (argc=3, argv=0x12c4af0) at main.c:198


The steps to reproduce: 
Worker1, in TablesyncWorkerMain(), the relstate of new table to sync (obtained
by GetSubscriptionRelations) is SUBREL_STATE_INIT, and in the foreach loop,
before the following Check (it needs a breakpoint before locking),

            LWLockAcquire(LogicalRepWorkerLock, LW_EXCLUSIVE);
            if (rstate->state != SUBREL_STATE_SYNCDONE &&
                !logicalrep_worker_find(MySubscription->oid, rstate->relid, false))
            {
                /* Update worker state for the next table */
                MyLogicalRepWorker->relid = rstate->relid;
                MyLogicalRepWorker->relstate = rstate->state;
                MyLogicalRepWorker->relstate_lsn = rstate->lsn;
                LWLockRelease(LogicalRepWorkerLock);
                break;
            }
            LWLockRelease(LogicalRepWorkerLock);

let this table to be synchronized by another table sync worker (Worker2), and
Worker2 has finished before logicalrep_worker_find was called(). Then Worker1
tried to sync a table whose state is SUBREL_STATE_READY and the segmentation
fault happened.

Regards,
Shi Yu

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Ranier Vilela
Дата:
Сообщение: Re: Avoid unncessary always true test (src/backend/storage/buffer/bufmgr.c)
Следующее
От: Ranier Vilela
Дата:
Сообщение: Re: Avoid unncessary always true test (src/backend/storage/buffer/bufmgr.c)