I've been annoyed for a while because, while a parallel check-world
run usually takes a bit over a minute on my machine, sometimes it
takes between three and four minutes. I was finally able to
track down what is happening, and it's this: sometimes one or
another of the src/test/subscription tests takes an extra three
minutes because the logical replication launcher is sleeping
instead of launching the next task. It eventually reaches its
hard-wired maximum wait of DEFAULT_NAPTIME_PER_CYCLE (3min),
wakes up and notices it has something to do, and then we're
on our merry way again. I'm not sure how often this is a problem
in the real world, but it happens often enough to be annoying
during development.
There are two distinct bugs involved:
1. WaitForReplicationWorkerAttach sometimes has to clear a process
latch event so that it can keep waiting for the worker to launch.
It neglects to set the latch again, allowing ApplyLauncherMain
to miss events.  (A condensed sketch of this interaction follows
the list.)
2. ApplyLauncherMain ignores a failure return from
logicalrep_worker_launch, which is bad because (unless it has
another worker launch pending) it will then sleep for
DEFAULT_NAPTIME_PER_CYCLE before reconsidering. What it ought to do
is try again after wal_retrieve_retry_interval. This situation can
arise in resource-exhaustion cases (cf. the early exits in
logicalrep_worker_launch), but what's more likely in the regression
tests is that the worker stops due to some error condition before
WaitForReplicationWorkerAttach sees it attached, which is then duly
reported as a failure.
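To make the first bug's mechanism concrete, here is a heavily
condensed sketch of the interaction.  This is not the actual
launcher.c code; launcher_has_work() and wait_for_attach() are
invented stand-ins for the real logic, but the latch calls and
flags are the real APIs:

    /* ApplyLauncherMain, roughly: it relies on MyLatch to hear of new work */
    for (;;)
    {
        if (launcher_has_work())    /* hypothetical stand-in */
            wait_for_attach();      /* may ResetLatch(MyLatch) internally */

        /*
         * If wait_for_attach() consumed a latch event without restoring
         * it, this can sleep for the full naptime even though more work
         * arrived in the meantime.
         */
        (void) WaitLatch(MyLatch,
                         WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                         DEFAULT_NAPTIME_PER_CYCLE,
                         WAIT_EVENT_LOGICAL_LAUNCHER_MAIN);
        ResetLatch(MyLatch);
    }

The fix below makes the real wait subroutine remember whether it
cleared the latch and do a compensating SetLatch(MyLatch) before
returning, so that the launcher's own WaitLatch wakes up
immediately instead of napping.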
It's possible to make the test slowness extremely reproducible
with this change, which widens the race condition window for
both problems:
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 1c3c051403d..724e82bcdc1 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -214,7 +214,7 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
          */
         rc = WaitLatch(MyLatch,
                        WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
-                       10L, WAIT_EVENT_BGWORKER_STARTUP);
+                       1000L, WAIT_EVENT_BGWORKER_STARTUP);
 
         if (rc & WL_LATCH_SET)
         {
I don't recommend that as a permanent change, but it's helpful
for testing the attached patch.
In the attached, I made two other non-cosmetic changes:
3. In WaitForReplicationWorkerAttach, capture worker->in_use
before, not after, releasing LogicalRepWorkerLock.  Maybe there
is a reason why that's not a dangerous race condition, but
it sure is un-obvious to me.  (A generic sketch of the
capture-before-release idiom follows item 4.)
4. In process_syncing_tables_for_apply (the other caller of
logicalrep_worker_launch), it seems okay to ignore the
result of logicalrep_worker_launch, but I think it should
fill hentry->last_start_time before, not after, the call.
Otherwise we might be changing a hashtable entry that's
no longer relevant to this worker. I'm not sure exactly
where the failed worker will be cleaned-up-after, but
it could very easily be out of the system entirely before
logicalrep_worker_launch returns.
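For what it's worth, the point of #3 is just the usual
capture-before-release idiom.  A generic sketch (not the actual
code; the local variable name is invented for illustration):

    bool    in_use;

    LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
    /* ... */
    /* copy the shared field while the lock still protects the slot */
    in_use = worker->in_use;
    LWLockRelease(LogicalRepWorkerLock);

    /*
     * Once the lock is released, some other backend may recycle or
     * clean up the slot, so a later read of worker->in_use could see
     * something unrelated.
     */
    return in_use;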
Barring objections, I plan to apply and back-patch this.
regards, tom lane
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 1c3c051403d..14d8efbd25b 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -175,12 +175,14 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
                                uint16 generation,
                                BackgroundWorkerHandle *handle)
 {
-    BgwHandleStatus status;
-    int         rc;
+    bool        result = false;
+    bool        dropped_latch = false;
 
     for (;;)
     {
+        BgwHandleStatus status;
         pid_t       pid;
+        int         rc;
 
         CHECK_FOR_INTERRUPTS();
 
@@ -189,8 +191,9 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
         /* Worker either died or has started. Return false if died. */
         if (!worker->in_use || worker->proc)
         {
+            result = worker->in_use;
             LWLockRelease(LogicalRepWorkerLock);
-            return worker->in_use;
+            break;
         }
 
         LWLockRelease(LogicalRepWorkerLock);
@@ -205,7 +208,7 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
             if (generation == worker->generation)
                 logicalrep_worker_cleanup(worker);
             LWLockRelease(LogicalRepWorkerLock);
-            return false;
+            break;              /* result is already false */
         }
 
         /*
@@ -220,8 +223,18 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
         {
             ResetLatch(MyLatch);
             CHECK_FOR_INTERRUPTS();
+            dropped_latch = true;
         }
     }
+
+    /*
+     * If we had to clear a latch event in order to wait, be sure to restore
+     * it before exiting.  Otherwise caller may miss events.
+     */
+    if (dropped_latch)
+        SetLatch(MyLatch);
+
+    return result;
 }
 
 /*
@@ -1194,10 +1207,21 @@ ApplyLauncherMain(Datum main_arg)
                 (elapsed = TimestampDifferenceMilliseconds(last_start, now)) >= wal_retrieve_retry_interval)
             {
                 ApplyLauncherSetWorkerStartTime(sub->oid, now);
-                logicalrep_worker_launch(WORKERTYPE_APPLY,
-                                         sub->dbid, sub->oid, sub->name,
-                                         sub->owner, InvalidOid,
-                                         DSM_HANDLE_INVALID);
+                if (!logicalrep_worker_launch(WORKERTYPE_APPLY,
+                                              sub->dbid, sub->oid, sub->name,
+                                              sub->owner, InvalidOid,
+                                              DSM_HANDLE_INVALID))
+                {
+                    /*
+                     * We get here either if we failed to launch a worker
+                     * (perhaps for resource-exhaustion reasons) or if we
+                     * launched one but it immediately quit.  Either way, it
+                     * seems appropriate to try again after
+                     * wal_retrieve_retry_interval.
+                     */
+                    wait_time = Min(wait_time,
+                                    wal_retrieve_retry_interval);
+                }
             }
             else
             {
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 8e1e8762f62..b679156e3d8 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -603,6 +603,12 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
                     TimestampDifferenceExceeds(hentry->last_start_time, now,
                                                wal_retrieve_retry_interval))
                 {
+                    /*
+                     * Set the last_start_time even if we fail to start
+                     * the worker, so that we won't retry until
+                     * wal_retrieve_retry_interval has elapsed.
+                     */
+                    hentry->last_start_time = now;
                     logicalrep_worker_launch(WORKERTYPE_TABLESYNC,
                                              MyLogicalRepWorker->dbid,
                                              MySubscription->oid,
@@ -610,7 +616,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
                                              MyLogicalRepWorker->userid,
                                              rstate->relid,
                                              DSM_HANDLE_INVALID);
-                    hentry->last_start_time = now;
                 }
             }
         }