Re: Deadlock between logrep apply worker and tablesync worker

Поиск
Список
Период
Сортировка
От Amit Kapila
Тема Re: Deadlock between logrep apply worker and tablesync worker
Дата
Msg-id CAA4eK1KAO3cUBwim_27DGNCReN8Xm+K9+ktkhh449B=CiiT7aQ@mail.gmail.com
обсуждение исходный текст
Ответ на Deadlock between logrep apply worker and tablesync worker  (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы Re: Deadlock between logrep apply worker and tablesync worker  (vignesh C <vignesh21@gmail.com>)
Список pgsql-hackers
On Mon, Jan 23, 2023 at 1:29 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> On my machine, the src/test/subscription/t/002_types.pl test
> usually takes right about 1.5 seconds:
>
> $ time make check PROVE_FLAGS=--timer PROVE_TESTS=t/002_types.pl
> ...
> [14:22:12] t/002_types.pl .. ok     1550 ms ( 0.00 usr  0.00 sys +  0.70 cusr  0.25 csys =  0.95 CPU)
> [14:22:13]
>
> I noticed however that sometimes (at least one try in ten, for me)
> it takes about 2.5 seconds:
>
> [14:22:16] t/002_types.pl .. ok     2591 ms ( 0.00 usr  0.00 sys +  0.69 cusr  0.28 csys =  0.97 CPU)
> [14:22:18]
>
> and I've even seen 3.5 seconds.  I dug into this and eventually
> identified the cause: it's a deadlock between a subscription's apply
> worker and a tablesync worker that it's spawned.  Sometimes the apply
> worker calls wait_for_relation_state_change (to wait for the tablesync
> worker to finish) while it's holding a lock on pg_replication_origin.
> If that's the case, then when the tablesync worker reaches
> process_syncing_tables_for_sync it is able to perform
> UpdateSubscriptionRelState and reach the transaction commit below
> that; but when it tries to do replorigin_drop_by_name a little further
> down, it blocks on acquiring ExclusiveLock on pg_replication_origin.
> So we have an undetected deadlock.  We escape that because
> wait_for_relation_state_change has a 1-second timeout, after which
> it rechecks GetSubscriptionRelState and is able to see the committed
> relation state change; so it continues, and eventually releases its
> transaction and the lock, permitting the tablesync worker to finish.
>
> I've not tracked down the exact circumstances in which the apply
> worker ends up holding a problematic lock, but it seems likely
> that it corresponds to cases where its main loop has itself called
> replorigin_drop_by_name, a bit further up, for some other concurrent
> tablesync operation.  (In all the cases I've traced through, the apply
> worker is herding multiple tablesync workers when this happens.)
>
> I experimented with having the apply worker release its locks
> before waiting for the tablesync worker, as attached.
>

I don't see any problem with your proposed change but I was wondering
if it would be better to commit the transaction and release locks
immediately after performing the replication origin drop? By doing
that, we will minimize the amount of time the transaction holds the
lock.

>  This passes
> check-world and it seems to eliminate the test runtime instability,
> but I wonder whether it's semantically correct.  This whole business
> of taking table-wide ExclusiveLock on pg_replication_origin looks
> like a horrid kluge that we should try to get rid of, not least
> because I don't see any clear documentation of what hazard it's
> trying to prevent.
>

IIRC, this is done to prevent concurrent drops of origin drop say by
exposed API pg_replication_origin_drop(). See the discussion in [1]
related to it. If we want we can optimize it so that we can acquire
the lock on the specific origin as mentioned in comments
replorigin_drop_by_name() but it was not clear that this operation
would be frequent enough.

[1] - https://www.postgresql.org/message-id/CAHut%2BPuW8DWV5fskkMWWMqzt-x7RPcNQOtJQBp6SdwyRghCk7A%40mail.gmail.com

-- 
With Regards,
Amit Kapila.



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Michael Paquier
Дата:
Сообщение: Re: Add a new pg_walinspect function to extract FPIs from WAL records
Следующее
От: Dilip Kumar
Дата:
Сообщение: Re: Exit walsender before confirming remote flush in logical replication