Lock timeouts and unusual spikes in replication lag with logical parallel transaction streaming
От | Zane Duffield |
---|---|
Тема | Lock timeouts and unusual spikes in replication lag with logical parallel transaction streaming |
Дата | |
Msg-id | CACMiCkXyC4au74kvE2g6Y=mCEF8X6r-Ne_ty4r7qWkUjRE4+oQ@mail.gmail.com обсуждение исходный текст |
Ответы |
Re: Lock timeouts and unusual spikes in replication lag with logical parallel transaction streaming
RE: Lock timeouts and unusual spikes in replication lag with logical parallel transaction streaming |
Список | pgsql-bugs |
Hi all,
In testing out the subscription parameter "streaming" we've noticed some unusual spikes in replay_lsn lag (and also confirmed_flush_lsn lag) when using "streaming = parallel".
During the periods of unusual lag (when compared to a separate "streaming = off" subscriber) we also see some additional logging from the (replica) postgres server about waiting for / acquiring locks, as well as error messages about logical apply workers exiting.
Then, today we found the logical replica stuck with a lag of about 30GB. The logs revealed a looping lock timeout with the parallel apply workers, the relevant section of which is attached.
While these workers were stuck, we queried the pg_locks catalog table in an attempt to collect more information. The query itself, and the resulting data (run with "\watch 1") are attached.
What I see in the output is that the lead apply worker (pid 2857187) is blocking the parallel worker (pid 2857523), then later another parallel worker (pid 2861846) is also blocking the lead apply worker (pid 2857187).
What I can't see, is anything blocking pid 2861846, and the logs don't show much about this worker (only that it starts and then is terminated). It's not clear to me whether this is a deadlock, or a case where we need to increase the lock_timeout (we didn't try).
At this point, I don't have a description of the upstream load that leads to this issue, and I don't have a reproducer script
One thing I should mention is that the schema of the publisher and subscriber are completely identical; the subscriber was created using pg_createsubscriber, and has not had any DDL applied (except for temp tables) since. I know that there is a risk of deadlocks in the case of schema differences, but there certainly isn't an extra index, constraint, or anything like that on the subscriber.
Given the poor performance and issues we are having with the "streaming = parallel" option, we will likely be sticking with the "streaming = on" option, which successfully worked to reduce our logical replication lag.
I noticed that the "parallel" option is slated to become the default for new subscriptions in PostgreSQL 18, and I wonder whether this is wise, given the potential for locking issues; in our replication cluster, several distinct scheduled jobs lead to the parallel streaming logical replication falling far behind the non-streaming version.
PostgreSQL version and some relevant configuration information:
version
----------------------------------------------------------------------------------------------------------
PostgreSQL 17.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5), 64-bit
shared_buffers work_mem max_worker_processes max_logical_replication_workers max_parallel_apply_workers_per_subscription Thanks,
----------------
98172MB
----------
6GB
----------------------
36
---------------------------------
8
---------------------------------------------
8
Zane
Вложения
В списке pgsql-bugs по дате отправления: