Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"
From | Achilleas Mantzios |
---|---|
Subject | Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device" |
Date | |
Msg-id | c39ea7f4-c7a2-2111-6d09-58f25df5b6c5@matrix.gatewaynet.com |
In response to | Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device" (Alvaro Herrera <alvherre@2ndquadrant.com>) |
List | pgsql-admin |
Hi Alvaro!

On 23/11/18 1:10 p.m., Alvaro Herrera wrote:
> On 2018-Nov-14, Rui DeSousa wrote:
>
>>> On Nov 14, 2018, at 3:31 AM, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:
>>>
>>> Our sysadms (seasoned linux/network guys : we have been working here
>>> for more than 10 yrs) were absolute in that we run no firewall or
>>> other traffic shaping system between the two hosts. (if we did the
>>> problem would manifest itself earlier). Can you recommend what to
>>> look for exactly regarding both TCP stacks ? The subscriber node is
>>> a clone of the primary. We have :
>>>
>>> # sysctl -a | grep -i keepaliv
>>> net.ipv4.tcp_keepalive_intvl = 75
>>> net.ipv4.tcp_keepalive_probes = 9
>>> net.ipv4.tcp_keepalive_time = 7200
>> Those keep alive settings are linux’s defaults and work out to be 18
>> hours before the abandon connection is dropped. So, the WAL receiver
>> should have corrected itself after that time. For reference, I run
>> terminating abandon session within 15 mins as they take-up valuable
>> database resources and could potentially hold on to locks, snapshots,
>> etc.
> Where does your 18h figure come from? As I understand it, these numbers
> mean "wait 7200 seconds, then send 9 probes 75 seconds apart", kill the
> connection if not reply is obtained. So that works out to about 131
> minutes (modulo fencepost bug). Certainly not 18 hours ...

Thanks, yes, it sums up to 2 hours 11 minutes. In the moments after the primary crashed I didn't have the nerves/patience/guts to wait that long and actually prove that the subscriber was listening happily to a ghost/stuck connection.

>
> Now ... I have seen Linux kernel code that seemed to me to cause network
> transmission get stuck *in the kernel* without any way out. Now I'm not
> a kernel expert and I don't know if this applies to your case (maybe it
> got fixed already), but it was definitely some process that was stuck
> with "wchan" set to a network kernel call and way beyond TCP keepalives.

It seems we'll have to upgrade our systems/kernels ASAP. Thanks a lot!

--
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt
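
For reference, the keepalive arithmetic discussed in this thread can be sketched as below. This is only a rough illustration: the function name `keepalive_drop_seconds` is made up for the example, and it assumes the simple reading quoted above (idle time plus probes times interval), ignoring the fencepost question.

```python
def keepalive_drop_seconds(idle: int, interval: int, probes: int) -> int:
    """Approximate seconds before a silent TCP peer is declared dead.

    Hypothetical helper: assumes the kernel waits `idle` seconds, then
    sends `probes` keepalive probes spaced `interval` seconds apart.
    """
    return idle + probes * interval


# Values from the sysctl output quoted in this thread (Linux defaults).
total = keepalive_drop_seconds(idle=7200, interval=75, probes=9)

hours, remainder = divmod(total, 3600)
minutes, seconds = divmod(remainder, 60)

print(f"{total} s = {hours} h {minutes} min {seconds} s")
# prints "7875 s = 2 h 11 min 15 s"
```

With these defaults the result is roughly 131 minutes, which matches the figure given in the thread rather than 18 hours.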