At Thu, 9 Sep 2021 14:52:25 +0900, Abhishek Bhola <abhishek.bhola@japannext.co.jp> wrote in
> I have found some questions about the same error, but didn't find any of
> them answering my problem.
>
> The setup is that I have two Postgres11 clusters (A and B) and they are
> making use of publication and subscription features to copy data from A to
> B.
>
> A (source DB- publication) --------------> B (target DB - subscription)
>
> This works fine, but often (not always) when the data volume being inserted
> on a table in node A increases, it gives the following error.
>
> "terminating walsender process due to replication timeout"
>
> The data volume at the moment being entered is about 30K rows per second
> continuously for hours through COPY command.
>
> Earlier the wal_sender_timeout was set to 5 sec and I would see this error
> much often. I then increased it to 1 min and the frequency of this error
> reduced. But I don't want to keep increasing it without understanding what
> is causing it. I looked at the code of walsender.c and know the exact lines
> where it's coming from.
>
> But I am still not clear which parameter is making the sender assume that
> the receiver node is inactive and therefore it should stop the wal_sender.
>
> Can anyone please suggest what changes I should make to remove this error?
Which minor version is the Postgres server running? PostgreSQL 11
received the following fix in 11.6, which could be related to the
trouble.
https://www.postgresql.org/docs/11/release-11-6.html
> Fix timeout handling in logical replication walreceiver processes
> (Julien Rouhaud)
>
> Erroneous logic prevented wal_receiver_timeout from working in
> logical replication deployments.
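A quick way to tell whether a given 11.x server already carries that
fix is to compare its version number against 11.6. The following is
only a sketch: it assumes you feed it the integer reported by
"SHOW server_version_num" (110006 for 11.6), and the function name is
made up for illustration.

```python
FIXED_IN_11 = 110006  # server_version_num of PostgreSQL 11.6


def has_logical_timeout_fix(server_version_num: int) -> bool:
    """Return True if an 11.x server is at 11.6 or later.

    Since PostgreSQL 10, server_version_num is major * 10000 + minor.
    Other major versions are rejected because this sketch only knows
    about the 11 branch.
    """
    major, _minor = divmod(server_version_num, 10000)
    if major != 11:
        raise ValueError("this check only covers PostgreSQL 11.x")
    return server_version_num >= FIXED_IN_11
```

Since the fix is on the receiving side, the subscriber (node B) is
the one to check first.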
The details of the fix are here.
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=3f60f690fac1bf375b92cf2f8682e8fe8f69098
> Fix timeout handling in logical replication worker
>
> The timestamp tracking the last moment a message is received in a
> logical replication worker was initialized in each loop checking if a
> message was received or not, causing wal_receiver_timeout to be ignored
> in basically any logical replication deployments. This also broke the
> ping sent to the server when reaching half of wal_receiver_timeout.
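To illustrate what the commit message describes, here is a toy model
of such a receive loop. It is not the PostgreSQL code, just a sketch
under my own assumptions: simulate_worker, ticks and the buggy flag
are invented names, and time is a plain integer. With buggy=True the
last-received timestamp is re-initialized on every iteration, so the
idle time is always zero and neither the timeout nor the half-timeout
ping can ever trigger.

```python
def simulate_worker(ticks, timeout, buggy):
    """Simulate a receive loop over (now, received) pairs.

    Returns the time at which the loop gave up (or None) and the
    times at which a ping would have been sent to the server.
    """
    last_recv = ticks[0][0]
    ping_sent = False
    pings = []
    for now, received in ticks:
        if buggy:
            # The bug: the timestamp is reset on every iteration,
            # not only when a message actually arrives.
            last_recv = now
        if received:
            last_recv = now
            ping_sent = False
            continue
        idle = now - last_recv
        if idle >= timeout:
            return {"timed_out_at": now, "pings": pings}
        if not ping_sent and idle >= timeout / 2:
            # Ask the server for a reply once half the timeout passed.
            pings.append(now)
            ping_sent = True
    return {"timed_out_at": None, "pings": pings}


# One message at t=0, then silence until t=10.
ticks = [(0, True)] + [(t, False) for t in range(1, 11)]
fixed = simulate_worker(ticks, timeout=6, buggy=False)
broken = simulate_worker(ticks, timeout=6, buggy=True)
```

In this model the fixed loop pings at t=3 and gives up at t=6, while
the buggy loop does neither.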
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center