Re: Lock timeouts and unusual spikes in replication lag with logical parallel transaction streaming
From: Amit Kapila
Subject: Re: Lock timeouts and unusual spikes in replication lag with logical parallel transaction streaming
Date:
Msg-id: CAA4eK1Kn_B4UUiFMnZ6G50tRz2PRD7XSX8_B3CKkaMTXqtiwoQ@mail.gmail.com
In reply to: Lock timeouts and unusual spikes in replication lag with logical parallel transaction streaming (Zane Duffield <duffieldzane@gmail.com>)
List: pgsql-bugs
On Mon, Aug 11, 2025 at 11:51 AM Zane Duffield <duffieldzane@gmail.com> wrote:
>
> In testing out the subscription parameter "streaming" we've noticed some
> unusual spikes in replay_lsn lag (and also confirmed_flush_lsn lag) when
> using "streaming = parallel".
> During the periods of unusual lag (when compared to a separate
> "streaming = off" subscriber) we also see some additional logging from the
> (replica) postgres server about waiting for / acquiring locks, as well as
> error messages about logical apply workers exiting.
>
> Then, today we found the logical replica stuck with a lag of about 30GB.
> The logs revealed a looping lock timeout with the parallel apply workers,
> the relevant section of which is attached.
>
> While these workers were stuck, we queried the pg_locks catalog table in
> an attempt to collect more information. The query itself, and the
> resulting data (run with "\watch 1"), are attached.
>
> What I see in the output is that the lead apply worker (pid 2857187) is
> blocking the parallel worker (pid 2857523), then later another parallel
> worker (pid 2861846) is also blocking the lead apply worker (pid 2857187).
> What I can't see is anything blocking pid 2861846, and the logs don't show
> much about this worker (only that it starts and then is terminated). It's
> not clear to me whether this is a deadlock, or a case where we need to
> increase the lock_timeout (we didn't try).
>

This is not a deadlock. Increasing lock_timeout would only help if the
parallel worker (2861846) is slow but still processing changes; if it is
stuck in some other wait, such as an LWLock, it won't help. The key piece
of the puzzle is what exactly process 2861846 is busy with. You can either
share more server LOGs for that process, or, better, increase the log
level to DEBUG1/DEBUG2 to see if we can get more information about it.
The other wait you see, for worker 2857523, is normal: it is waiting for
the next set of changes from the leader worker after processing its
current set.

> At this point, I don't have a description of the upstream load that leads
> to this issue, and I don't have a reproducer script
>

Please see if you can get that, because the current set of information is
not sufficient to establish that this is a problem with parallel apply.

> One thing I should mention is that the schema of the publisher and
> subscriber are completely identical; the subscriber was created using
> pg_createsubscriber, and has not had any DDL applied (except for temp
> tables) since. I know that there is a risk of deadlocks in the case of
> schema differences, but there certainly isn't an extra index, constraint,
> or anything like that on the subscriber.
>
> Given the poor performance and issues we are having with the
> "streaming = parallel" option, we will likely be sticking with the
> "streaming = on" option, which successfully worked to reduce our logical
> replication lag.
> I noticed that the "parallel" option is slated to become the default for
> new subscriptions in PostgreSQL 18, and I wonder whether this is wise,
> given the potential for locking issues; in our replication cluster,
> several distinct scheduled jobs lead to the parallel streaming logical
> replication falling far behind the non-streaming version.
>

It is worth considering, but we should at least first figure out what is
going wrong. It could be a minor bug, either in the parallel-apply
infrastructure or a pre-existing one that is exposed by parallel-apply
activity.

--
With Regards,
Amit Kapila.
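
The exact pg_locks query attached to the report is not reproduced above.
As a point of reference, a minimal sketch of an equivalent blocked/blocking
diagnostic on the subscriber, using pg_blocking_pids() and suitable for
running under psql's "\watch 1" as the reporter did, could look like this:

    -- For each backend currently blocked on a heavyweight lock, list
    -- the PIDs blocking it along with its wait event and query text.
    SELECT pid,
           pg_blocking_pids(pid) AS blocked_by,
           wait_event_type,
           wait_event,
           backend_type,
           query
    FROM pg_stat_activity
    WHERE cardinality(pg_blocking_pids(pid)) > 0;

Note that pg_blocking_pids() only reports heavyweight-lock waits, so a
worker stuck on an LWLock, as speculated above, would not show up as
blocked here.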
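
Raising the subscriber's log level as suggested is a reloadable change; a
sketch, assuming superuser access:

    -- Temporarily log DEBUG1 (or DEBUG2) detail, reload, reproduce the
    -- stall, then revert once the logs have been collected.
    ALTER SYSTEM SET log_min_messages = 'debug1';
    SELECT pg_reload_conf();
    -- ... after reproducing the problem:
    ALTER SYSTEM RESET log_min_messages;
    SELECT pg_reload_conf();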
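
The fallback the reporter describes is a subscription-level option change;
a sketch, with "sub1" as a hypothetical subscription name:

    -- Apply streamed in-progress transactions serially in the leader
    -- apply worker instead of handing them to parallel workers.
    ALTER SUBSCRIPTION sub1 SET (streaming = on);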