RE: Lock timeouts and unusual spikes in replication lag with logical parallel transaction streaming

Поиск
Список
Период
Сортировка
От Zhijie Hou (Fujitsu)
Тема RE: Lock timeouts and unusual spikes in replication lag with logical parallel transaction streaming
Дата
Msg-id OS0PR01MB57160C5CAAF37FF206A7F2029428A@OS0PR01MB5716.jpnprd01.prod.outlook.com
обсуждение исходный текст
Ответ на Lock timeouts and unusual spikes in replication lag with logical parallel transaction streaming  (Zane Duffield <duffieldzane@gmail.com>)
Список pgsql-bugs
On Monday, August 11, 2025 2:21 PM Zane Duffield <duffieldzane@gmail.com>  wrote:

> Hi all,
> 
> In testing out the subscription parameter "streaming" we've noticed some
> unusual spikes in replay_lsn lag (and also confirmed_flush_lsn lag) when using
> "streaming = parallel".
> 
> During the periods of unusual lag (when compared to a separate "streaming =
> off" subscriber) we also see some additional logging from the (replica)
> postgres server about waiting for / acquiring locks, as well as error messages
> about logical apply workers exiting.
> 
> Then, today we found the logical replica stuck with a lag of about 30GB. The
> logs revealed a looping lock timeout with the parallel apply workers, the
> relevant section of which is attached.
> 
> While these workers were stuck, we queried the pg_locks catalog table in an
> attempt to collect more information. The query itself, and the resulting data
> (run with "\watch 1") are attached.
> 
> What I see in the output is that the lead apply worker (pid 2857187) is
> blocking the parallel worker (pid 2857523), then later another parallel worker
> (pid
> 2861846) is also blocking the lead apply worker (pid 2857187).
> 
> What I can't see, is anything blocking pid 2861846, and the logs don't show
> much about this worker (only that it starts and then is terminated). It's not
> clear to me whether this is a deadlock, or a case where we need to increase
> the lock_timeout (we didn't try).
> 
> At this point, I don't have a description of the upstream load that leads to
> this issue, and I don't have a reproducer script
> 
> One thing I should mention is that the schema of the publisher and subscriber
> are completely identical; the subscriber was created using
> pg_createsubscriber, and has not had any DDL applied (except for temp tables)
> since. I know that there is a risk of deadlocks in the case of schema
> differences, but there certainly isn't an extra index, constraint, or anything
> like that on the subscriber.

Thank you for reporting the issue. I reviewed the logs and query results but did
not find any evidence of a suspicious strong lock. Therefore, if possible, could
you please query pg_lock and pg_stat_activity separately? This will allow us to
determine whether there are any other wait events, besides strong locks, that
might be causing blocking.

Additionally, it might be worthwhile to check whether the parallel apply worker
is too slow to process all the changes on time. To investigate this, could you
intermittently query pg_stat_wal and pg_current_wal_insert_lsn() to see if the
if WAL inserts are continuously occurring, possibly initiated by the parallel
apply worker?

Lastly, as a precaution, could you please confirm whether there are any
other backends on the subscriber doing DMLs ?

Best Regards,
Hou zj

В списке pgsql-bugs по дате отправления: