Re: Lock timeouts and unusual spikes in replication lag with logical parallel transaction streaming

From: Amit Kapila
Subject: Re: Lock timeouts and unusual spikes in replication lag with logical parallel transaction streaming
Msg-id: CAA4eK1Kn_B4UUiFMnZ6G50tRz2PRD7XSX8_B3CKkaMTXqtiwoQ@mail.gmail.com
In reply to: Lock timeouts and unusual spikes in replication lag with logical parallel transaction streaming  (Zane Duffield <duffieldzane@gmail.com>)
List: pgsql-bugs
On Mon, Aug 11, 2025 at 11:51 AM Zane Duffield <duffieldzane@gmail.com> wrote:
>
> In testing out the subscription parameter "streaming" we've noticed some unusual spikes in replay_lsn lag (and also confirmed_flush_lsn lag) when using "streaming = parallel".
> During the periods of unusual lag (when compared to a separate "streaming = off" subscriber) we also see some additional logging from the (replica) postgres server about waiting for / acquiring locks, as well as error messages about logical apply workers exiting.
>
> Then, today we found the logical replica stuck with a lag of about 30GB. The logs revealed a looping lock timeout with the parallel apply workers, the relevant section of which is attached.
>
> While these workers were stuck, we queried the pg_locks catalog table in an attempt to collect more information. The query itself, and the resulting data (run with "\watch 1"), are attached.
>
> What I see in the output is that the lead apply worker (pid 2857187) is blocking the parallel worker (pid 2857523), then later another parallel worker (pid 2861846) is also blocking the lead apply worker (pid 2857187).
> What I can't see is anything blocking pid 2861846, and the logs don't show much about this worker (only that it starts and then is terminated). It's not clear to me whether this is a deadlock, or a case where we need to increase the lock_timeout (we didn't try).
>

It is not a case of deadlock. Increasing lock_timeout would only help
if the parallel worker (2861846) is slow but still processing changes;
if it is stuck in some other kind of wait, such as an LWLock, then it
won't help. The key part of the puzzle is what exactly process
2861846 is busy with. You can either share more server LOGs for that
worker, or, better, increase the log level to DEBUG1/DEBUG2 to see if
we can get some more information about process 2861846.
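
For example, something along these lines on the subscriber would show
what the worker is doing while it appears stuck. This is only a rough
sketch; substitute the pid of the stuck worker in your current run:

    -- Raise logging verbosity so the apply workers log what they are doing.
    -- DEBUG2 is very chatty, so revert once the problem has been captured.
    ALTER SYSTEM SET log_min_messages = 'debug2';
    SELECT pg_reload_conf();

    -- While the workers appear stuck, check what the process is waiting on
    -- (2861846 is the pid from your report; use the current worker's pid):
    SELECT pid, backend_type, state, wait_event_type, wait_event, query
    FROM pg_stat_activity
    WHERE pid = 2861846;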

The other wait you see for worker 2857523 is normal as it is waiting
for the next set of changes from the leader worker after processing
its current set.
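
As a cross-check of your pg_locks output, pg_blocking_pids() reports
heavyweight-lock blockers directly; if it returns an empty array for
2861846 while that worker is stuck, the wait must be elsewhere (e.g.
an LWLock or an IPC wait). A rough sketch:

    -- List blockers for all logical replication workers on the subscriber:
    SELECT pid, pg_blocking_pids(pid) AS blocked_by,
           wait_event_type, wait_event
    FROM pg_stat_activity
    WHERE backend_type LIKE 'logical replication%';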

> At this point, I don't have a description of the upstream load that leads to this issue, and I don't have a reproducer script.
>

Please see if you can get that, because the current set of information
is not sufficient to establish that the problem is in parallel
apply.
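
In case it helps, a reproducer usually needs only a
publication/subscription pair with streaming = parallel and a
transaction large enough to exceed logical_decoding_work_mem, so that
it is streamed while still in progress. A sketch, assuming PostgreSQL
16 or later; all object names here are hypothetical:

    -- On the publisher:
    CREATE PUBLICATION repro_pub FOR ALL TABLES;

    -- On the subscriber:
    CREATE SUBSCRIPTION repro_sub
        CONNECTION 'host=publisher dbname=postgres'
        PUBLICATION repro_pub
        WITH (streaming = parallel);

    -- On the publisher, run a transaction big enough to be streamed
    -- (i.e. larger than logical_decoding_work_mem), concurrently with
    -- the kind of scheduled jobs you mentioned:
    BEGIN;
    INSERT INTO big_table
    SELECT g, repeat('x', 100) FROM generate_series(1, 1000000) g;
    COMMIT;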

> One thing I should mention is that the schemas of the publisher and subscriber are completely identical; the subscriber was created using pg_createsubscriber, and has not had any DDL applied (except for temp tables) since. I know that there is a risk of deadlocks in the case of schema differences, but there certainly isn't an extra index, constraint, or anything like that on the subscriber.
>
> Given the poor performance and issues we are having with the "streaming = parallel" option, we will likely be sticking with the "streaming = on" option, which successfully worked to reduce our logical replication lag.
> I noticed that the "parallel" option is slated to become the default for new subscriptions in PostgreSQL 18, and I wonder whether this is wise, given the potential for locking issues; in our replication cluster, several distinct scheduled jobs lead to the parallel streaming logical replication falling far behind the non-streaming version.
>

It is worth considering, but we should at least first figure out what
is going wrong. It could be some minor bug either in the parallel-apply
infrastructure or a pre-existing one that is exposed by
parallel-apply activity.

--
With Regards,
Amit Kapila.


