Re: BUG #18155: Logical Apply Worker Timeout During TableSync Causes Either Stuckness or Data Loss

From	Amit Kapila
Subject	Re: BUG #18155: Logical Apply Worker Timeout During TableSync Causes Either Stuckness or Data Loss
Date	October 13, 2023, 04:51:25
Msg-id	CAA4eK1JRhXZCg46OL--V+4ZqDGtBySqiraWGrJZXkyh2H-M1eg@mail.gmail.com
In reply to	BUG #18155: Logical Apply Worker Timeout During TableSync Causes Either Stuckness or Data Loss (PG Bug reporting form <noreply@postgresql.org>)
Responses	Re: BUG #18155: Logical Apply Worker Timeout During TableSync Causes Either Stuckness or Data Loss
List	pgsql-bugs

On Fri, Oct 13, 2023 at 6:43 AM PG Bug reporting form
<noreply@postgresql.org> wrote:
>
> Depending on the Major Version an untimely timeout or termination of the
> main apply worker when the table apply worker is waiting for the
> subscription relstate to change to SUBREL_STATE_CATCHUP can lead to one of
> two really painful experiences.
>
> If on Rel14+, the untimely exit can lead to the main apply worker becoming
> indefinitely stuck while it waits for a table apply worker, that has already
> exited and won't be launched again, to change the subscription relstate to
> SUBREL_STATE_SYNCDONE.  In order to unwedge, a system restart is required to
> clear the corrupted transient subscription relstate data.
>
> If on Rel13+, then the untimely exit can lead to silent data loss. This will
> occur if the table apply worker performed a copy at LSN X. If the main apply
> worker is now at LSN Y > X, the system requires the table sync worker to
> apply all changes between X & Y that were skipped by the main apply worker
> in a catch up phase. Due to the untimely exit, the table apply worker will
> assume that the main apply worker was actually behind, skip the catch up
> work, and exit. As a result, all data between X & Y will be lost for that
> table.
>
> The cause of both issues is that wait_for_worker_state_change() is handled
> like a void function by the table apply worker. However, if the main apply
> worker does not currently exist on the system due to some issue such as a
> timeout triggering it to safely exit, then the function will return *before*
> the state change has occurred and return false.
>

But even if wait_for_worker_state_change() returns false, ideally the
table sync worker shouldn't exit without marking the relstate as
SYNCDONE. The table sync worker should keep looping until the state
changes to SUBREL_STATE_CATCHUP. See process_syncing_tables_for_sync(),
which doesn't allow the table sync worker to exit. Your observation
doesn't match this analysis, so can you please share how you reached
the conclusion that the table sync worker will exit and won't restart?
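
For readers following the thread, the return-value behavior described in
the report can be modeled with a small toy sketch in C. This is not
PostgreSQL source: SimShared, SimRelState, SimOutcome,
sim_wait_for_state_change() and sim_sync_checking_result() are all
hypothetical names, and the "polls" are a stand-in for real waiting. It
only illustrates a wait that returns false when the apply worker goes away
before the expected state is reached, and a caller that checks that result
instead of treating the function as void.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the relstate values named in the thread. */
typedef enum { SIM_SYNCWAIT, SIM_CATCHUP, SIM_SYNCDONE } SimRelState;

/* Toy model of what a table sync worker observes while waiting. */
typedef struct
{
    SimRelState relstate;           /* current simulated relation state */
    bool        apply_alive;        /* is the main apply worker present? */
    int         apply_exits_after;  /* poll at which it exits, -1 = never */
    int         state_change_after; /* poll at which it sets the expected
                                     * state, -1 = never */
} SimShared;

typedef enum { SIM_PROCEED_TO_CATCHUP, SIM_APPLY_WORKER_GONE } SimOutcome;

/*
 * Poll until relstate equals 'expected'.  Returns true once the state is
 * reached, false if the (simulated) apply worker disappears first.
 */
bool
sim_wait_for_state_change(SimShared *sh, SimRelState expected)
{
    assert(sh != NULL);
    /* Guard against a simulation that could never terminate. */
    assert(!sh->apply_alive ||
           sh->apply_exits_after >= 0 || sh->state_change_after >= 0);

    for (int poll = 0;; poll++)
    {
        /* Simulated progress of the apply worker between polls. */
        if (sh->apply_alive && sh->apply_exits_after >= 0 &&
            poll >= sh->apply_exits_after)
            sh->apply_alive = false;
        if (sh->apply_alive && sh->state_change_after >= 0 &&
            poll >= sh->state_change_after)
            sh->relstate = expected;

        if (sh->relstate == expected)
            return true;        /* state change observed */
        if (!sh->apply_alive)
            return false;       /* gave up: nobody left to change the state */
    }
}

/* A caller that inspects the result rather than ignoring it. */
SimOutcome
sim_sync_checking_result(SimShared *sh)
{
    if (!sim_wait_for_state_change(sh, SIM_CATCHUP))
        return SIM_APPLY_WORKER_GONE;   /* CATCHUP was never reached */
    return SIM_PROCEED_TO_CATCHUP;
}
```

In this model, a caller that discards the boolean cannot tell "CATCHUP was
reached" apart from "the apply worker went away", which is the distinction
the report and the question above hinge on.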

--
With Regards,
Amit Kapila.
