Discussion: BUG #18155: Logical Apply Worker Timeout During TableSync Causes Either Stuckness or Data Loss
BUG #18155: Logical Apply Worker Timeout During TableSync Causes Either Stuckness or Data Loss
From:
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      18155
Logged by:          Drew Callahan
Email address:      callaan@amazon.com
PostgreSQL version: 16.0
Operating system:   Ubuntu
Description:

Depending on the major version, an untimely timeout or termination of the main apply worker while the table apply worker is waiting for the subscription relstate to change to SUBREL_STATE_CATCHUP can lead to one of two really painful experiences.

If on Rel14+, the untimely exit can lead to the main apply worker becoming indefinitely stuck while it waits for a table apply worker, which has already exited and won't be launched again, to change the subscription relstate to SUBREL_STATE_SYNCDONE. In order to unwedge, a system restart is required to clear the corrupted transient subscription relstate data.

If on Rel13+, the untimely exit can lead to silent data loss. This will occur if the table apply worker performed a copy at LSN X. If the main apply worker is now at LSN Y > X, the system requires the table sync worker to apply, in a catch-up phase, all changes between X and Y that were skipped by the main apply worker. Due to the untimely exit, the table apply worker will assume that the main apply worker was actually behind, skip the catch-up work, and exit. As a result, all data between X and Y will be lost for that table.

The cause of both issues is that wait_for_worker_state_change() is handled like a void function by the table apply worker. However, if the main apply worker does not currently exist on the system, due to some issue such as a timeout triggering it to safely exit, then the function will return *before* the state change has occurred and return false.

The fix for this issue is relatively simple: we just need to validate that the state actually did change. If it did not, then we can circle back later to check again, which is relatively cheap on Rel14+. Pasted a potential solution below.
A solution on Rel13- would want to introduce a retry loop in wait_for_worker_state_change() to avoid having to do the entire COPY again first, which is very expensive on large tables.

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index e2cee92cf2..25be978e15 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -1529,10 +1529,30 @@ copy_table_done:
 	SpinLockRelease(&MyLogicalRepWorker->relmutex);
 
 	/*
-	 * Finally, wait until the leader apply worker tells us to catch up and
-	 * then return to let LogicalRepApplyLoop do it.
+	 * Finally, wait until the main apply worker tells us to catch up and then
+	 * return to let LogicalRepApplyLoop do it. If this wait fails due to the
+	 * apply worker not being up, we fail this attempt and force a retry.
 	 */
-	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
+	if (!wait_for_worker_state_change(SUBREL_STATE_CATCHUP))
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_INTERNAL_ERROR),
+				 errmsg("logical replication table synchronization worker for subscription \"%s\", relid \"%u\" "
+						"unable to finish due to main apply worker being unavailable.",
+						MySubscription->name, MyLogicalRepWorker->relid),
+				 errhint("this worker will attempt to sync the table again once logical replication has resumed")));
+	}
+
+	if (MyLogicalRepWorker->relstate != SUBREL_STATE_CATCHUP)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_INTERNAL_ERROR),
+				 errmsg("logical replication table synchronization worker for subscription \"%s\", relid \"%u\" "
+						"unable to make progress while apply worker is down",
+						MySubscription->name, MyLogicalRepWorker->relid),
+				 errhint("this worker will attempt to sync the table again once logical replication has resumed")));
+	}
 
 	return slotname;
 }
On Fri, Oct 13, 2023 at 6:43 AM PG Bug reporting form <noreply@postgresql.org> wrote:
>
> Depending on the Major Version an untimely timeout or termination of the
> main apply worker when the table apply worker is waiting for the
> subscription relstate to change to SUBREL_STATE_CATCHUP can lead to one of
> two really painful experiences.
>
> If on Rel14+, the untimely exit can lead to the main apply worker becoming
> indefinitely stuck while it waits for a table apply worker, that has already
> exited and won't be launched again, to change the subscription relstate to
> SUBREL_STATE_SYNCDONE. In order to unwedge, a system restart is required to
> clear the corrupted transient subscription relstate data.
>
> If on Rel13+, then the untimely exit can lead to silent data loss. This will
> occur if the table apply worker performed a copy at LSN X. If the main apply
> worker is now at LSN Y > X, the system requires the table sync worker to
> apply all changes between X & Y that were skipped by the main apply worker
> in a catch up phase. Due to the untimely exit, the table apply worker will
> assume that the main apply worker was actually behind, skip the catch up
> work, and exit. As a result, all data between X & Y will be lost for that
> table.
>
> The cause of both issues is that wait_for_worker_state_change() is handled
> like a void function by the table apply worker. However, if the main apply
> worker does not currently exist on the system due to some issue such as a
> timeout triggering it to safely exit, then the function will return *before*
> the state change has occurred and return false.
>

But even if wait_for_worker_state_change() returns false, ideally the table sync worker shouldn't exit without marking the relstate as SYNCDONE. The table sync worker should keep looping till the state changes to SUBREL_STATE_CATCHUP. See process_syncing_tables_for_sync(), which doesn't allow the table sync worker to exit.

Your observation doesn't match this analysis, so can you please share how you reached the conclusion that the table sync worker will exit and won't be restarted?

--
With Regards,
Amit Kapila.
Re: BUG #18155: Logical Apply Worker Timeout During TableSync Causes Either Stuckness or Data Loss
From:
"Callahan, Drew"
Date:
Hi Amit,

Thanks for commenting, you're correct. I had misremembered the case and, upon review, found that I had the situation inverted for Rel14. We saw that none of the main apply workers were launching and all of the table apply workers were not transitioning out of the syncwait state. As a result, they kept replaying changes, since the logic only checks whether we have crossed the LSN when in the CATCHUP state.

slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | pg_current_wal_lsn | pg_current_wal_flush_lsn | pg_current_wal_insert_lsn | pg_wal_lsn_diff
-----------------------------------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+---------------+---------------------+--------------------+--------------------------+---------------------------+-----------------
main_slot_5 | pgoutput | logical | 16396 | users | f | f | | | 1512105843 | 22D5/33FC5B60 | 22D5/34C8D910 | 2304/94DF6830 | 2304/94DF6830 | 2304/94E68B18 | 203488955600
main_slot_3 | pgoutput | logical | 16396 | users | f | f | | | 1515152530 | 22D5/71A15460 | 22D5/7323FA88 | 2304/94DF6830 | 2304/94DF6830 | 2304/94E68B18 | 202454733776
main_slot_1 | pgoutput | logical | 16396 | users | f | f | | | 1515398672 | 22D5/76AB8500 | 22D5/77806A10 | 2304/94DF6830 | 2304/94DF6830 | 2304/94E68B18 | 202370179888
main_slot_2 | pgoutput | logical | 16396 | users | f | f | | | 1515398672 | 22D5/76AB8500 | 22D5/77039830 | 2304/94DF6830 | 2304/94DF6830 | 2304/94E68B18 | 202370179888
main_slot_0 | pgoutput | logical | 16396 | users | f | f | | | 1515277554 | 22D5/74320CF8 | 22D5/74539970 | 2304/94DF6830 | 2304/94DF6830 | 2304/94E68B18 | 202411694904
main_slot_4 | pgoutput | logical | 16396 | users | f | f | | | 1509956699 | 22D5/8643A70 | 22D5/AEFE230 | 2304/94DF6830 | 2304/94DF6830 | 2304/94E68B18 | 204220345792
pg_21077_sync_20174_7063584474467025194 | pgoutput | logical | 16396 | users | f | t | 97665 | | 2116325349 | 2304/94B86518 | 2304/94DF6830 | 2304/94DF6830 | 2304/94DF6830 | 2304/94E68B18 | 2556696
pg_21173_sync_20205_7063584474467025194 | pgoutput | logical | 16396 | users | f | t | 8349 | | 2116325349 | 2304/94B86518 | 2304/94DF6830 | 2304/94DF6830 | 2304/94DF6830 | 2304/94E68B18 | 2556696
pg_21121_sync_20141_7063584474467025194 | pgoutput | logical | 16396 | users | f | t | 4319 | | 2116325349 | 2304/94B86518 | 2304/94DF6830 | 2304/94DF6830 | 2304/94DF6830 | 2304/94E68B18 | 2556696
pg_21178_sync_20206_7063584474467025194 | pgoutput | logical | 16396 | users | f | t | 8279 | | 2116325349 | 2304/94B86518 | 2304/94DF6830 | 2304/94DF6830 | 2304/94DF6830 | 2304/94E68B18 | 2556696
pg_21081_sync_20101_7063584474467025194 | pgoutput | logical | 16396 | users | f | t | 4065 | | 2116325349 | 2304/94B86518 | 2304/94DF6830 | 2304/94DF6830 | 2304/94DF6830 | 2304/94E68B18 | 2556696
pg_21081_sync_20109_7063584474467025194 | pgoutput | logical | 16396 | users | f | t | 8302 | | 2116325349 | 2304/94B86518 | 2304/94DF6830 | 2304/94DF6830 | 2304/94DF6830 | 2304/94E68B18 | 2556696
pg_21173_sync_20195_7063584474467025194 | pgoutput | logical | 16396 | users | f | t | 4385 | | 2116325349 | 2304/94B86518 | 2304/94DF6830 | 2304/94DF6830 | 2304/94DF6830 | 2304/94E68B18 | 2556696
pg_21077_sync_20183_7063584474467025194 | pgoutput | logical | 16396 | users | f | t | 6690 | | 2116325349 | 2304/94B86518 | 2304/94DF6830 | 2304/94DF6830 | 2304/94DF6830 | 2304/94E68B18 | 2556696
pg_21178_sync_20198_7063584474467025194 | pgoutput | logical | 16396 | users | f | t | 4548 | | 2116325349 | 2304/94B86518 | 2304/94DF6830 | 2304/94DF6830 | 2304/94DF6830 | 2304/94E68B18 | 2556696
pg_21077_sync_20152_7063584474467025194 | pgoutput | logical | 16396 | users | f | t | 97564 | | 2116325349 | 2304/94B86518 | 2304/94DF6830 | 2304/94DF6830 | 2304/94DF6830 | 2304/94E68B18 | 2556696
pg_21121_sync_20149_7063584474467025194 | pgoutput | logical | 16396 | users | f | t | 8384 | | 2116325349 | 2304/94B86518 | 2304/94DF6830 | 2304/94DF6830 | 2304/94DF6830 | 2304/94E68B18 | 2556696
pg_21077_sync_20159_7063584474467025194 | pgoutput | logical | 16396 | users | f | t | 5559 | | 2116325349 | 2304/94B86518 | 2304/94DF6830 | 2304/94DF6830 | 2304/94DF6830 | 2304/94E68B18 | 2556696
(18 rows)

On the server side, we did not see evidence of WALSenders being launched. As a result, the gap kept increasing further and further, since the workers would not transition to the catchup state even after several hours.

Thanks,
Drew

On Fri, Oct 13, 2023 at 6:43 AM PG Bug reporting form <noreply@postgresql.org> wrote:
>>
>> Depending on the Major Version an untimely timeout or termination of the
>> main apply worker when the table apply worker is waiting for the
>> subscription relstate to change to SUBREL_STATE_CATCHUP can lead to one of
>> two really painful experiences.
>>
>> If on Rel14+, the untimely exit can lead to the main apply worker becoming
>> indefinitely stuck while it waits for a table apply worker, that has already
>> exited and won't be launched again, to change the subscription relstate to
>> SUBREL_STATE_SYNCDONE. In order to unwedge, a system restart is required to
>> clear the corrupted transient subscription relstate data.
>>
>> If on Rel13+, then the untimely exit can lead to silent data loss. This will
>> occur if the table apply worker performed a copy at LSN X. If the main apply
>> worker is now at LSN Y > X, the system requires the table sync worker to
>> apply all changes between X & Y that were skipped by the main apply worker
>> in a catch up phase. Due to the untimely exit, the table apply worker will
>> assume that the main apply worker was actually behind, skip the catch up
>> work, and exit. As a result, all data between X & Y will be lost for that
>> table.
>>
>> The cause of both issues is that wait_for_worker_state_change() is handled
>> like a void function by the table apply worker. However, if the main apply
>> worker does not currently exist on the system due to some issue such as a
>> timeout triggering it to safely exit, then the function will return *before*
>> the state change has occurred and return false.
>
> But even if wait_for_worker_state_change() returns false, ideally the table
> sync worker shouldn't exit without marking the relstate as SYNCDONE. The
> table sync worker should keep looping till the state changes to
> SUBREL_STATE_CATCHUP. See process_syncing_tables_for_sync(), which doesn't
> allow the table sync worker to exit.
>
> Your observation doesn't match this analysis, so can you please share how
> you reached the conclusion that the table sync worker will exit and won't
> be restarted?
>
> --
> With Regards,
> Amit Kapila.
On Tue, Oct 17, 2023 at 4:46 AM Callahan, Drew <callaan@amazon.com> wrote:
>
> On the server side, we did not see evidence of WALSenders being launched. As a result, the gap kept increasing further
> and further since the workers would not transition to the catchup state after several hours due to this.
>

One possibility is that the system has reached the 'max_logical_replication_workers' limit, which prevents it from launching the apply worker. If so, then consider increasing the value of 'max_logical_replication_workers'. You can query 'pg_stat_subscription' for more information about the workers. See the description of the subscriber-side parameters [1].

[1] - https://www.postgresql.org/docs/devel/runtime-config-replication.html#RUNTIME-CONFIG-REPLICATION-SUBSCRIBER

--
With Regards,
Amit Kapila.
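Amit's suggestion can be checked with a query along these lines (illustrative only; the exact column set of pg_stat_subscription varies across major versions, and worker rows are reported with a non-null pid):

```sql
-- Compare the configured worker limit against the workers currently running.
SELECT current_setting('max_logical_replication_workers') AS max_workers,
       count(*) AS workers_in_use
FROM pg_stat_subscription
WHERE pid IS NOT NULL;
```

If workers_in_use is pinned at max_workers, the launcher cannot start a replacement apply worker, which would fit the symptoms reported above.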
Re: BUG #18155: Logical Apply Worker Timeout During TableSync Causes Either Stuckness or Data Loss
From:
Michael Paquier
Date:
On Tue, Oct 17, 2023 at 09:50:52AM +0530, Amit Kapila wrote:
> On Tue, Oct 17, 2023 at 4:46 AM Callahan, Drew <callaan@amazon.com> wrote:
>> On the server side, we did not see evidence of WALSenders being launched. As a result, the gap kept increasing further
>> and further since the workers would not transition to the catchup state after several hours due to this.
>
> One possibility is that the system has reached
> 'max_logical_replication_workers' limit due to which it is not
> allowing to launch the apply worker. If so, then consider increasing
> the value of 'max_logical_replication_workers'. You can query
> 'pg_stat_subscription' to know more information about workers. See the
> description of subscriber-side parameters [1].

Hmm. So you basically mean that not being able to launch new workers prevents the existing workers from moving on with their individual sync, freeing slots once their sync is done for other tables. Then, this causes all of the existing workers to remain in a syncwait state, further increasing the gap in WAL replay. Am I getting that right?
--
Michael
Attachments
Re: BUG #18155: Logical Apply Worker Timeout During TableSync Causes Either Stuckness or Data Loss
From:
Michael Paquier
Date:
On Wed, Oct 18, 2023 at 10:22:14AM +0900, Michael Paquier wrote:
> On Tue, Oct 17, 2023 at 09:50:52AM +0530, Amit Kapila wrote:
>> On Tue, Oct 17, 2023 at 4:46 AM Callahan, Drew <callaan@amazon.com> wrote:
>>> On the server side, we did not see evidence of WALSenders being launched. As a result, the gap kept increasing further
>>> and further since the workers would not transition to the catchup state after several hours due to this.
>>
>> One possibility is that the system has reached
>> 'max_logical_replication_workers' limit due to which it is not
>> allowing to launch the apply worker. If so, then consider increasing
>> the value of 'max_logical_replication_workers'. You can query
>> 'pg_stat_subscription' to know more information about workers. See the
>> description of subscriber-side parameters [1].
>
> Hmm. So you basically mean that not being able to launch new workers
> prevents the existing workers from moving on with their individual sync,
> freeing slots once their sync is done for other tables. Then, this
> causes all of the existing workers to remain in a syncwait state,
> further increasing the gap in WAL replay. Am I getting that right?

I was looking more at what we could do here, played with the apply and sync workers, and I really doubt that just ERROR-ing in a sync worker is a good idea on v14~ just for the sake of freeing bgworker slots, which would have the effect of spawning more apply workers. I mean, if your system is under pressure because of a lack of bgworker slots, assuming that a set of N apply workers are gone, leaving M sync workers (M >= N) lying around without being able to get updates from an apply worker, ERRORs in sync workers would give more room to spawn apply workers, but then the launcher worker may fight for the freed slots to attempt to spawn sync workers for the fresh apply workers, no?

It also seems to make the reasoning around the table states harder to think about, as any future apply workers would need to handle more error cases related to sync workers.

I have a few extra questions about all that:
- How did the apply worker stop? That can be said to be the origin of the issue.
- Did logicalrep_worker_launch() provide some WARNINGs because the cluster is running out of slots, preventing the spawn of a new apply worker to replace the one that's down, so that the table sync workers are able to move on with their sync phases? I guess it did.. The frequency would offer hints about the pressure put on the system.
--
Michael