Re: BUG #18155: Logical Apply Worker Timeout During TableSync Causes Either Stuckness or Data Loss

From: Amit Kapila
Subject: Re: BUG #18155: Logical Apply Worker Timeout During TableSync Causes Either Stuckness or Data Loss
Msg-id: CAA4eK1JRhXZCg46OL--V+4ZqDGtBySqiraWGrJZXkyh2H-M1eg@mail.gmail.com
In reply to: BUG #18155: Logical Apply Worker Timeout During TableSync Causes Either Stuckness or Data Loss  (PG Bug reporting form <noreply@postgresql.org>)
Responses: Re: BUG #18155: Logical Apply Worker Timeout During TableSync Causes Either Stuckness or Data Loss  ("Callahan, Drew" <callaan@amazon.com>)
List: pgsql-bugs
On Fri, Oct 13, 2023 at 6:43 AM PG Bug reporting form
<noreply@postgresql.org> wrote:
>
> Depending on the Major Version an untimely timeout or termination of the
> main apply worker when the table apply worker is waiting for the
> subscription relstate to change to SUBREL_STATE_CATCHUP can lead to one of
> two really painful experiences.
>
> If on Rel14+, the untimely exit can lead to the main apply worker becoming
> indefinitely stuck while it waits for a table apply worker, that has already
> exited and won't be launched again, to change the subscription relstate to
> SUBREL_STATE_SYNCDONE.  In order to unwedge, a system restart is required to
> clear the corrupted transient subscription relstate data.
>
> If on Rel13+, then the untimely exit can lead to silent data loss. This will
> occur if the table apply worker performed a copy at LSN X. If the main apply
> worker is now at LSN Y > X, the system requires the table sync worker to
> apply all changes between X & Y that were skipped by the main apply worker
> in a catch up phase. Due to the untimely exit, the table apply worker will
> assume that the main apply worker was actually behind, skip the catch up
> work, and exit. As a result, all data between X & Y will be lost for that
> table.
>
> The cause of both issues is that wait_for_worker_state_change() is handled
> like a void function by the table apply worker. However, if the main apply
> worker does not currently exist on the system due to some issue such as a
> timeout triggering it to safely exit, then the function will return *before*
> the state change has occurred and return false.
>

But even if wait_for_worker_state_change() returns false, ideally the
table sync worker shouldn't exit without marking the relstate as
SYNCDONE. The table sync worker should keep looping until the state
changes to SUBREL_STATE_CATCHUP. See process_syncing_tables_for_sync(),
which doesn't allow the table sync worker to exit. Your
observation doesn't match this analysis, so can you please share
how you reached the conclusion that the table sync worker will
exit and won't restart?
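
For readers following the thread, the difference between the two
readings can be shown with a toy model. This is only a sketch and not
PostgreSQL source: SyncModel, wait_for_state(), sync_ignoring_result()
and sync_checking_result() are hypothetical names, and the enum is a
simplified stand-in for the relstate values mentioned above. It assumes
a wait function that returns false when the apply worker is absent, and
contrasts ignoring that result with retrying until the state changes.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the relstates named in this thread. */
typedef enum { STATE_SYNCWAIT, STATE_CATCHUP, STATE_SYNCDONE } RelState;

/* Hypothetical model of the state shared between the two workers. */
typedef struct {
    RelState state;              /* current relstate of the table */
    int apply_worker_alive_from; /* poll number from which the apply worker exists */
    int polls;                   /* number of waits performed so far */
} SyncModel;

/*
 * Toy model of a wait like wait_for_worker_state_change(): returns true
 * only if the expected state was reached, and false if the apply worker
 * is not running (so no state change happened).
 */
static bool wait_for_state(SyncModel *m, RelState expected)
{
    m->polls++;
    if (m->polls < m->apply_worker_alive_from)
        return false;        /* apply worker missing: give up early */
    m->state = expected;     /* apply worker advanced the relstate */
    return true;
}

/* Pattern described in the report: the return value is discarded. */
static RelState sync_ignoring_result(SyncModel *m)
{
    (void) wait_for_state(m, STATE_CATCHUP);
    return m->state;         /* may still be STATE_SYNCWAIT */
}

/* Pattern expected in the reply: keep waiting until the state changes. */
static RelState sync_checking_result(SyncModel *m, int max_polls)
{
    assert(max_polls > 0);
    while (m->polls < max_polls)
    {
        if (wait_for_state(m, STATE_CATCHUP))
            break;           /* state really changed, safe to proceed */
    }
    return m->state;
}
```

In this model, a caller that ignores the result proceeds while the
state is still STATE_SYNCWAIT, whereas the retrying caller proceeds
only after the state has become STATE_CATCHUP. Which of the two the
real table sync worker does is exactly the question raised above.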

--
With Regards,
Amit Kapila.


