Thread: BUG #18155: Logical Apply Worker Timeout During TableSync Causes Either Stuckness or Data Loss


BUG #18155: Logical Apply Worker Timeout During TableSync Causes Either Stuckness or Data Loss

From:
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      18155
Logged by:          Drew Callahan
Email address:      callaan@amazon.com
PostgreSQL version: 16.0
Operating system:   Ubuntu
Description:

Depending on the major version, an untimely timeout or termination of the
main apply worker while the table apply worker is waiting for the
subscription relstate to change to SUBREL_STATE_CATCHUP can lead to one of
two really painful experiences.

On Rel14+, the untimely exit can lead to the main apply worker becoming
indefinitely stuck while it waits for a table apply worker (which has already
exited and won't be launched again) to change the subscription relstate to
SUBREL_STATE_SYNCDONE.  To unwedge the system, a restart is required to
clear the corrupted transient subscription relstate data.

On Rel13-, the untimely exit can instead lead to silent data loss. This will
occur if the table apply worker performed a copy at LSN X. If the main apply
worker is now at LSN Y > X, the system requires the table sync worker to
apply, in a catch-up phase, all changes between X and Y that were skipped by
the main apply worker. Due to the untimely exit, the table apply worker will
assume that the main apply worker was actually behind, skip the catch-up
work, and exit. As a result, all data between X and Y will be lost for that
table.

The cause of both issues is that wait_for_worker_state_change() is handled
like a void function by the table apply worker. However, if the main apply
worker does not currently exist on the system, due to some issue such as a
timeout triggering it to safely exit, then the function will return false
*before* the state change has occurred.

The fix for this issue is relatively simple: we just need to validate that
the state actually did change. If it did not, then we can circle back later
to check again, which is relatively cheap on Rel14+. Pasted a potential
solution below. A solution on Rel13- would instead want to introduce a retry
loop around wait_for_worker_state_change(), to avoid having to redo the
entire COPY, which is very expensive on large tables; a rough sketch of that
idea follows the patch.

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index e2cee92cf2..25be978e15 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -1529,10 +1529,30 @@ copy_table_done:
        SpinLockRelease(&MyLogicalRepWorker->relmutex);
 
        /*
-        * Finally, wait until the leader apply worker tells us to catch up and
-        * then return to let LogicalRepApplyLoop do it.
+        * Finally, wait until the main apply worker tells us to catch up and then
+        * return to let LogicalRepApplyLoop do it. If this wait fails due to the
+        * apply worker not being up, we fail this attempt and force a retry.
         */
-       wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
+       if (!wait_for_worker_state_change(SUBREL_STATE_CATCHUP))
+       {
+               ereport(ERROR,
+                               (errcode(ERRCODE_INTERNAL_ERROR),
+                                errmsg("logical replication table synchronization worker for subscription \"%s\", relid \"%u\" "
+                                               "unable to finish due to main apply worker being unavailable.",
+                                               MySubscription->name, MyLogicalRepWorker->relid),
+                                errhint("this worker will attempt to sync the table again once logical replication has resumed")));
+       }
+
+       if (MyLogicalRepWorker->relstate != SUBREL_STATE_CATCHUP)
+       {
+               ereport(ERROR,
+                               (errcode(ERRCODE_INTERNAL_ERROR),
+                                errmsg("logical replication table synchronization worker for subscription \"%s\", relid \"%u\" "
+                                               "unable to make progress while apply worker is down",
+                                               MySubscription->name, MyLogicalRepWorker->relid),
+                                errhint("this worker will attempt to sync the table again once logical replication has resumed")));
+       }
+
        return slotname;
 }

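For completeness, the Rel13- retry-loop idea mentioned above could look roughly
like the following at the call site. This is only an illustrative sketch,
assuming the same wait_for_worker_state_change() return-value behavior as
above; it is not part of the proposed patch and the backoff value is arbitrary:

        /*
         * Illustrative Rel13-style alternative (sketch only): instead of
         * erroring out and redoing the COPY after a restart, keep waiting
         * for the main apply worker to come back and move us to CATCHUP.
         */
        for (;;)
        {
                if (wait_for_worker_state_change(SUBREL_STATE_CATCHUP))
                        break;          /* main apply worker set SUBREL_STATE_CATCHUP */

                /*
                 * The main apply worker is not running right now, so the wait
                 * returned early.  Sleep briefly and retry rather than throwing
                 * away the (potentially very expensive) COPY.
                 */
                CHECK_FOR_INTERRUPTS();
                pg_usleep(1000L * 1000L);       /* 1 second */
        }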

Re: BUG #18155: Logical Apply Worker Timeout During TableSync Causes Either Stuckness or Data Loss

From:
Amit Kapila
Date:
On Fri, Oct 13, 2023 at 6:43 AM PG Bug reporting form
<noreply@postgresql.org> wrote:
>
> Depending on the major version, an untimely timeout or termination of the
> main apply worker while the table apply worker is waiting for the
> subscription relstate to change to SUBREL_STATE_CATCHUP can lead to one of
> two really painful experiences.
>
> On Rel14+, the untimely exit can lead to the main apply worker becoming
> indefinitely stuck while it waits for a table apply worker (which has already
> exited and won't be launched again) to change the subscription relstate to
> SUBREL_STATE_SYNCDONE.  To unwedge the system, a restart is required to
> clear the corrupted transient subscription relstate data.
>
> On Rel13-, the untimely exit can instead lead to silent data loss. This will
> occur if the table apply worker performed a copy at LSN X. If the main apply
> worker is now at LSN Y > X, the system requires the table sync worker to
> apply, in a catch-up phase, all changes between X and Y that were skipped by
> the main apply worker. Due to the untimely exit, the table apply worker will
> assume that the main apply worker was actually behind, skip the catch-up
> work, and exit. As a result, all data between X and Y will be lost for that
> table.
>
> The cause of both issues is that wait_for_worker_state_change() is handled
> like a void function by the table apply worker. However, if the main apply
> worker does not currently exist on the system, due to some issue such as a
> timeout triggering it to safely exit, then the function will return false
> *before* the state change has occurred.
>

But even if wait_for_worker_state_change() returns false, ideally the
table sync worker shouldn't exit without marking the relstate as
SYNCDONE. The table sync worker should keep looping till the state
changes to SUBREL_STATE_CATCHUP. See process_syncing_tables_for_sync(),
which doesn't allow the table sync worker to exit before that. Your
observation doesn't match this analysis, so can you please share
how you reached the conclusion that the table sync worker will
exit and won't restart?
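
For reference, the relevant check in process_syncing_tables_for_sync() looks
roughly like this (a simplified sketch of tablesync.c, omitting the slot and
origin cleanup details):

        SpinLockAcquire(&MyLogicalRepWorker->relmutex);

        if (MyLogicalRepWorker->relstate == SUBREL_STATE_CATCHUP &&
                current_lsn >= MyLogicalRepWorker->relstate_lsn)
        {
                /* Only here does the sync worker mark SYNCDONE and exit. */
                MyLogicalRepWorker->relstate = SUBREL_STATE_SYNCDONE;
                MyLogicalRepWorker->relstate_lsn = current_lsn;
                SpinLockRelease(&MyLogicalRepWorker->relmutex);

                /* ... update pg_subscription_rel, clean up, ... */
                finish_sync_worker();
        }
        else
                SpinLockRelease(&MyLogicalRepWorker->relmutex);

Until the leader sets SUBREL_STATE_CATCHUP, none of that is reached, so the
sync worker keeps applying changes in LogicalRepApplyLoop().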

--
With Regards,
Amit Kapila.



Re: BUG #18155: Logical Apply Worker Timeout During TableSync Causes Either Stuckness or Data Loss

From:
"Callahan, Drew"
Date:
Hi Amit,

Thanks for commenting, you're correct. I had misremembered the case and upon review found that
I had the situation inverted for Rel14. We saw that none of the main apply workers were launching
and all of the table apply workers were not transitioning out of the syncwait state. As a result, they
kept replaying changes, since the logic only checks whether we have crossed the LSN when in the
CATCHUP state. The replication slot state at the time looked like this:

                slot_name                 |  plugin  | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin |  restart_lsn  | confirmed_flush_lsn | pg_current_wal_lsn | pg_current_wal_flush_lsn | pg_current_wal_insert_lsn | pg_wal_lsn_diff
------------------------------------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+---------------+---------------------+--------------------+--------------------------+---------------------------+-----------------
 main_slot_5                              | pgoutput | logical   |  16396 | users    | f         | f      |            |      |   1512105843 | 22D5/33FC5B60 | 22D5/34C8D910       | 2304/94DF6830      | 2304/94DF6830            | 2304/94E68B18             |    203488955600
 main_slot_3                              | pgoutput | logical   |  16396 | users    | f         | f      |            |      |   1515152530 | 22D5/71A15460 | 22D5/7323FA88       | 2304/94DF6830      | 2304/94DF6830            | 2304/94E68B18             |    202454733776
 main_slot_1                              | pgoutput | logical   |  16396 | users    | f         | f      |            |      |   1515398672 | 22D5/76AB8500 | 22D5/77806A10       | 2304/94DF6830      | 2304/94DF6830            | 2304/94E68B18             |    202370179888
 main_slot_2                              | pgoutput | logical   |  16396 | users    | f         | f      |            |      |   1515398672 | 22D5/76AB8500 | 22D5/77039830       | 2304/94DF6830      | 2304/94DF6830            | 2304/94E68B18             |    202370179888
 main_slot_0                              | pgoutput | logical   |  16396 | users    | f         | f      |            |      |   1515277554 | 22D5/74320CF8 | 22D5/74539970       | 2304/94DF6830      | 2304/94DF6830            | 2304/94E68B18             |    202411694904
 main_slot_4                              | pgoutput | logical   |  16396 | users    | f         | f      |            |      |   1509956699 | 22D5/8643A70  | 22D5/AEFE230        | 2304/94DF6830      | 2304/94DF6830            | 2304/94E68B18             |    204220345792
 pg_21077_sync_20174_7063584474467025194  | pgoutput | logical   |  16396 | users    | f         | t      |      97665 |      |   2116325349 | 2304/94B86518 | 2304/94DF6830       | 2304/94DF6830      | 2304/94DF6830            | 2304/94E68B18             |         2556696
 pg_21173_sync_20205_7063584474467025194  | pgoutput | logical   |  16396 | users    | f         | t      |       8349 |      |   2116325349 | 2304/94B86518 | 2304/94DF6830       | 2304/94DF6830      | 2304/94DF6830            | 2304/94E68B18             |         2556696
 pg_21121_sync_20141_7063584474467025194  | pgoutput | logical   |  16396 | users    | f         | t      |       4319 |      |   2116325349 | 2304/94B86518 | 2304/94DF6830       | 2304/94DF6830      | 2304/94DF6830            | 2304/94E68B18             |         2556696
 pg_21178_sync_20206_7063584474467025194  | pgoutput | logical   |  16396 | users    | f         | t      |       8279 |      |   2116325349 | 2304/94B86518 | 2304/94DF6830       | 2304/94DF6830      | 2304/94DF6830            | 2304/94E68B18             |         2556696
 pg_21081_sync_20101_7063584474467025194  | pgoutput | logical   |  16396 | users    | f         | t      |       4065 |      |   2116325349 | 2304/94B86518 | 2304/94DF6830       | 2304/94DF6830      | 2304/94DF6830            | 2304/94E68B18             |         2556696
 pg_21081_sync_20109_7063584474467025194  | pgoutput | logical   |  16396 | users    | f         | t      |       8302 |      |   2116325349 | 2304/94B86518 | 2304/94DF6830       | 2304/94DF6830      | 2304/94DF6830            | 2304/94E68B18             |         2556696
 pg_21173_sync_20195_7063584474467025194  | pgoutput | logical   |  16396 | users    | f         | t      |       4385 |      |   2116325349 | 2304/94B86518 | 2304/94DF6830       | 2304/94DF6830      | 2304/94DF6830            | 2304/94E68B18             |         2556696
 pg_21077_sync_20183_7063584474467025194  | pgoutput | logical   |  16396 | users    | f         | t      |       6690 |      |   2116325349 | 2304/94B86518 | 2304/94DF6830       | 2304/94DF6830      | 2304/94DF6830            | 2304/94E68B18             |         2556696
 pg_21178_sync_20198_7063584474467025194  | pgoutput | logical   |  16396 | users    | f         | t      |       4548 |      |   2116325349 | 2304/94B86518 | 2304/94DF6830       | 2304/94DF6830      | 2304/94DF6830            | 2304/94E68B18             |         2556696
 pg_21077_sync_20152_7063584474467025194  | pgoutput | logical   |  16396 | users    | f         | t      |      97564 |      |   2116325349 | 2304/94B86518 | 2304/94DF6830       | 2304/94DF6830      | 2304/94DF6830            | 2304/94E68B18             |         2556696
 pg_21121_sync_20149_7063584474467025194  | pgoutput | logical   |  16396 | users    | f         | t      |       8384 |      |   2116325349 | 2304/94B86518 | 2304/94DF6830       | 2304/94DF6830      | 2304/94DF6830            | 2304/94E68B18             |         2556696
 pg_21077_sync_20159_7063584474467025194  | pgoutput | logical   |  16396 | users    | f         | t      |       5559 |      |   2116325349 | 2304/94B86518 | 2304/94DF6830       | 2304/94DF6830      | 2304/94DF6830            | 2304/94E68B18             |         2556696
(18 rows)

On the server side, we did not see evidence of WAL senders being launched. As a result, the gap kept increasing
further and further, since the workers would not transition to the catchup state even after several hours.

Thanks,
Drew



Re: BUG #18155: Logical Apply Worker Timeout During TableSync Causes Either Stuckness or Data Loss

From:
Amit Kapila
Date:
On Tue, Oct 17, 2023 at 4:46 AM Callahan, Drew <callaan@amazon.com> wrote:
>
> On the server side, we did not see evidence of WAL senders being launched. As a result, the gap kept
> increasing further and further, since the workers would not transition to the catchup state even after several hours.
>

One possibility is that the system has reached the
'max_logical_replication_workers' limit, due to which it is not
able to launch the apply worker. If so, then consider increasing
the value of 'max_logical_replication_workers'. You can query
'pg_stat_subscription' to get more information about the workers. See the
description of the subscriber-side parameters [1].

[1] - https://www.postgresql.org/docs/devel/runtime-config-replication.html#RUNTIME-CONFIG-REPLICATION-SUBSCRIBER

--
With Regards,
Amit Kapila.



Re: BUG #18155: Logical Apply Worker Timeout During TableSync Causes Either Stuckness or Data Loss

From:
Michael Paquier
Date:
On Tue, Oct 17, 2023 at 09:50:52AM +0530, Amit Kapila wrote:
> On Tue, Oct 17, 2023 at 4:46 AM Callahan, Drew <callaan@amazon.com> wrote:
>> On the server side, we did not see evidence of WAL senders being launched. As a result, the gap kept
>> increasing further and further, since the workers would not transition to the catchup state even after several hours.
>
> One possibility is that the system has reached the
> 'max_logical_replication_workers' limit, due to which it is not
> able to launch the apply worker. If so, then consider increasing
> the value of 'max_logical_replication_workers'. You can query
> 'pg_stat_subscription' to get more information about the workers. See the
> description of the subscriber-side parameters [1].

Hmm.  So you basically mean that not being able to launch new workers
prevents the existing workers from moving on with their individual sync,
freeing slots for other tables once their sync is done.  Then, this
causes all of the existing workers to remain in a syncwait state,
further increasing the gap in WAL replay.  Am I getting that right?
--
Michael


Re: BUG #18155: Logical Apply Worker Timeout During TableSync Causes Either Stuckness or Data Loss

From:
Michael Paquier
Date:
On Wed, Oct 18, 2023 at 10:22:14AM +0900, Michael Paquier wrote:
> On Tue, Oct 17, 2023 at 09:50:52AM +0530, Amit Kapila wrote:
>> On Tue, Oct 17, 2023 at 4:46 AM Callahan, Drew <callaan@amazon.com> wrote:
>>> On the server side, we did not see evidence of WAL senders being launched. As a result, the gap kept
>>> increasing further and further, since the workers would not transition to the catchup state even after several hours.
>>
>> One possibility is that the system has reached the
>> 'max_logical_replication_workers' limit, due to which it is not
>> able to launch the apply worker. If so, then consider increasing
>> the value of 'max_logical_replication_workers'. You can query
>> 'pg_stat_subscription' to get more information about the workers. See the
>> description of the subscriber-side parameters [1].
>
> Hmm.  So you basically mean that not being able to launch new workers
> prevents the existing workers from moving on with their individual sync,
> freeing slots for other tables once their sync is done.  Then, this
> causes all of the existing workers to remain in a syncwait state,
> further increasing the gap in WAL replay.  Am I getting that right?

I was looking more at what we could do here, played with the apply and
sync workers, and I really doubt that just ERROR-ing in a sync worker
is a good idea on v14~ just for the sake of freeing bgworker slots,
which would have the effect of spawning more apply workers.  I mean,
if your system is under pressure because of a lack of bgworker slots,
assuming that a set of N apply workers are gone, leaving M sync
workers (M >= N) lying around without being able to get updates from
an apply worker, ERRORs in sync workers would give more room to spawn
apply workers, but then the launcher worker may fight for the freed
slots to attempt to spawn sync workers for the fresh apply workers,
no?  It also seems to make the reasoning around the table states
harder to think about, as any future apply workers need to handle more
error cases related to sync workers.

I got a few extra questions about all that:
- How did the apply worker stop?  That can be said to be the origin of
the issue.
- Did logicalrep_worker_launch() provide some WARNINGs because the
cluster is running out of slots, preventing the spawn of a new apply
worker to replace the one that's down, so that the table sync workers
are able to move on with their sync phases?  I guess it did.  The
frequency would offer hints about the pressure put on the system.
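
(For reference, the WARNING in question is emitted by logicalrep_worker_launch()
when no worker slot is free; roughly the following, with the exact wording and
errhint formatting varying a bit across branches:)

        ereport(WARNING,
                        (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
                         errmsg("out of logical replication worker slots"),
                         errhint("You might need to increase max_logical_replication_workers.")));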
--
Michael
