Thread: BUG #17345: pg_basebackup stucked for 2 hours before timeout

BUG #17345: pg_basebackup stucked for 2 hours before timeout

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      17345
Logged by:          Bo Chen
Email address:      bchen90@163.com
PostgreSQL version: 11.13
Operating system:   euleros v2r9 x86_64
Description:

Hello experts,
    I am facing an issue with pg_basebackup in a Docker environment. When the
primary VM is restarted while pg_basebackup is running in the standby Docker
container on a VM, it takes 2 hours before pg_basebackup times out.
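
A wait of roughly two hours is consistent with the Linux default TCP keepalive settings (tcp_keepalive_time = 7200 s, tcp_keepalive_intvl = 75 s, tcp_keepalive_probes = 9), although the report does not state which settings were in effect. The following is a minimal sketch, assuming Linux socket options; the helper names keepalive_detection_seconds and apply_keepalive are hypothetical and are not part of PostgreSQL.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Worst-case seconds before the kernel gives up on an idle, dead peer. */
int keepalive_detection_seconds(int idle, int interval, int count)
{
    return idle + interval * count;
}

/*
 * Enable TCP keepalive on fd with explicit timings (Linux socket options).
 * Returns 0 on success, -1 if any setsockopt() call fails.
 */
int apply_keepalive(int fd, int idle, int interval, int count)
{
    int on = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) != 0 ||
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) != 0 ||
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval, sizeof(interval)) != 0 ||
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count)) != 0)
        return -1;
    return 0;
}
```

With the Linux defaults, keepalive_detection_seconds(7200, 75, 9) gives 7875 seconds, a little over two hours. libpq exposes the same knobs as the keepalives_idle, keepalives_interval and keepalives_count connection parameters, so setting them in the pg_basebackup connection string may shorten the wait; this has not been verified against the reported setup.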
    After analyzing and reproducing the problem, I think the cause is that the
parent process, which fetches the data files, is blocked waiting for TCP
keepalive, and it ignores or blocks SIGCHLD while it is inside the poll API.
So we added code that signals the parent when the WAL-fetching child exits
with a non-zero status.

Below is the modified code:
 #include "streamutil.h"
+#include <sys/prctl.h>
 
 #define ERRCODE_DATA_CORRUPTED    "XX001"
 
@@ -565,6 +566,8 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier)
     uint32        hi,
                 lo;
     char        statusdir[MAXPGPATH];
+    pid_t bgpid;
+    int ret;
 
     param = pg_malloc0(sizeof(logstreamer_param));
     param->timeline = timeline;
@@ -662,12 +665,24 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier)
      * a fork(). On Windows, we create a thread.
      */
 #ifndef WIN32
+    bgpid = getpid();
+
     bgchild = fork();
     if (bgchild == 0)
     {
+        (void)prctl(PR_SET_PDEATHSIG, SIGQUIT);
         /* in child process */
-        exit(LogStreamerMain(param));
+        ret = LogStreamerMain(param);
+        if (ret != 0)
+        {
+            kill(bgpid, SIGINT);
+        }
+        exit(ret);
     }
     else if (bgchild < 0)
     {

This is the stack trace while pg_basebackup is stuck:
#0  0xf7f6e039 in __kernel_vsyscall ()
#1  0xf7a1f2ea in poll () from /usr/lib/libc.so.6
#2  0xf7b25ea0 in pqSocketPoll (sock=5, forRead=1, forWrite=0, end_time=-1) at fe-misc.c:1127
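
In frame #2, end_time=-1 suggests that no deadline was passed to pqSocketPoll, in which case the underlying poll() call would block until the socket reports an event or an error. The sketch below shows such a deadline-to-timeout mapping; poll_timeout_ms is a hypothetical helper written for illustration and is not libpq's actual code.

```c
#include <time.h>

/*
 * Convert an absolute deadline into a poll() timeout in milliseconds.
 * (time_t) -1 means "no deadline" and maps to -1, i.e. block indefinitely.
 */
int poll_timeout_ms(time_t end_time, time_t now)
{
    if (end_time == (time_t) -1)
        return -1;
    if (end_time > now)
        return (int) (end_time - now) * 1000;
    return 0;                   /* deadline already passed: do not block */
}
```

With a timeout of -1, only a socket event, a socket error (such as keepalive expiry), or a signal that has a handler can end the wait.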

Below is the same issue, reported by Ninad Shah:
https://www.postgresql.org/message-id/CAOFEiBd9j620TsBZPT0%2BuvdemQqwTrCLohcLjuDfQ2ye-xdswQ%40mail.gmail.com

Regards,
Bo Chenbo


Re: BUG #17345: pg_basebackup stucked for 2 hours before timeout

From
Masahiko Sawada
Date:
Hi,

On Mon, Dec 27, 2021 at 1:23 PM PG Bug reporting form
<noreply@postgresql.org> wrote:
>
> The following bug has been logged on the website:
>
> Bug reference:      17345
> Logged by:          Bo Chen
> Email address:      bchen90@163.com
> PostgreSQL version: 11.13
> Operating system:   euleros v2r9 x86_64
> Description:
>
> Hello experts,
>     I am facing an issue with pg_basebackup in a Docker environment. When the
> primary VM is restarted while pg_basebackup is running in the standby Docker
> container on a VM, it takes 2 hours before pg_basebackup times out.
>     After analyzing and reproducing the problem, I think the cause is that the
> parent process, which fetches the data files, is blocked waiting for TCP
> keepalive, and it ignores or blocks SIGCHLD while it is inside the poll API.
> So we added code that signals the parent when the WAL-fetching child exits
> with a non-zero status.

This seems to be addressed by the patch discussed here[1]. I'm not
sure it's going to be backpatched but is there any chance you could
test this patch?

Regards,

[1] https://www.postgresql.org/message-id/0F69E282-97F9-4DB7-8D6D-F927AA6340C8%40yesql.se

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/