Обсуждение: BUG #16199: pg_restore stuck on interrupts

Поиск
Список
Период
Сортировка

BUG #16199: pg_restore stuck on interrupts

От
PG Bug reporting form
Дата:
The following bug has been logged on the website:

Bug reference:      16199
Logged by:          Raúl Marín
Email address:      git@rmr.ninja
PostgreSQL version: 11.5
Operating system:   Linux
Description:

We are seeing stuck pg_restore processes in several of our CI servers, both
with PG10 (10.2) and PG11 (11.5).
These servers test an external application which uses pg_restore quite often
to set up test environments.

A sample call might look like this:
```
/usr/lib/postgresql/10/bin/pg_restore -e --verbose -j4 --disable-triggers
-Fc
/tmp/user_imports/1113d139-22c6-4dd9-8ae6-836343ad6854/data/org_cd2b2fb0-96f9-476f-9589-b1d5e0b93957.dump
-U postgres -h localhost -p 5432 -d
cartodb_test_user_684a0497-9358-49ae-89b7-5feea5a6b7d7_db
--section=post-data
--use-list=/tmp/user_imports/1113d139-22c6-4dd9-8ae6-836343ad6854/data/user_org44000054-admin.list
```

With the appropriate symbols I can attach to the process and extract the
callstack (not always fully available):
```
(gdb) bt
#0  __lll_lock_wait_private () at
../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1  0x00007f4b03e152bd in _int_free (av=0x7f4b04159b20 <main_arena>,
p=0x55d43590e3b0, have_lock=0) at malloc.c:3962
#2  0x00007f4b03e1953c in __GI___libc_free (mem=<optimized out>) at
malloc.c:2968
#3  0x00007f4b01bbf464 in _gnutls_ext_deinit () at gnutls_extensions.c:405
#4  0x00007f4b01bc6720 in _gnutls_global_deinit (destructor=1) at
gnutls_global.c:389
#5  0x00007f4b047eede7 in _dl_fini () at dl-fini.c:235
#6  0x00007f4b03dceff8 in __run_exit_handlers (status=status@entry=1,
listp=0x7f4b041595f8 <__exit_funcs>,
run_list_atexit=run_list_atexit@entry=true)
    at exit.c:82
#7  0x00007f4b03dcf045 in __GI_exit (status=status@entry=1) at exit.c:104
#8  0x000055d4337dbb5e in sigTermHandler (postgres_signal_arg=<optimized
out>)
    at
/build/postgresql-10-bx1JAx/postgresql-10-10.1.2+carto/build/../src/bin/pg_dump/parallel.c:613
#9  <signal handler called>
#10 _int_free (av=0x7f4b04159b20 <main_arena>, p=<optimized out>,
have_lock=0) at malloc.c:4017
#11 0x00007f4b03e1953c in __GI___libc_free (mem=<optimized out>) at
malloc.c:2968
#12 0x00007f4b03e1054b in __GI__IO_setb (f=f@entry=0x55d435914240,
b=b@entry=0x0, eb=eb@entry=0x0, a=a@entry=0) at genops.c:382
#13 0x00007f4b03e0e8ee in _IO_new_file_close_it (fp=fp@entry=0x55d435914240)
at fileops.c:185
#14 0x00007f4b03e023ef in _IO_new_fclose (fp=0x55d435914240) at
iofclose.c:58
#15 0x000055d4337d798e in _ReopenArchive (AH=0x55d435aa94e0)
    at
/build/postgresql-10-bx1JAx/postgresql-10-10.1.2+carto/build/../src/bin/pg_dump/pg_backup_custom.c:763
#16 0x000055d4337dc3af in RunWorker (slot=0x55d435ada4b8,
AH=0x55d435aa94e0)
    at
/build/postgresql-10-bx1JAx/postgresql-10-10.1.2+carto/build/../src/bin/pg_dump/parallel.c:866
#17 ParallelBackupStart (AH=AH@entry=0x55d43590fe00) at
/build/postgresql-10-bx1JAx/postgresql-10-10.1.2+carto/build/../src/bin/pg_dump/parallel.c:1027
#18 0x000055d4337d4418 in RestoreArchive (AHX=0x55d43590fe00)
    at
/build/postgresql-10-bx1JAx/postgresql-10-10.1.2+carto/build/../src/bin/pg_dump/pg_backup_archiver.c:649
#19 0x000055d4337ce9ca in main (argc=17, argv=<optimized out>)
    at
/build/postgresql-10-bx1JAx/postgresql-10-10.1.2+carto/build/../src/bin/pg_dump/pg_restore.c:429
```

I have a some extra processes with the same issue  (7 full stacks out of 20,
the others are garbage) and, from what I see, they all have in common that
the process has received a signal while it was doing a memory operation,
either a malloc or a free:

* Malloc example:
```
#8  0x00005634a8ebeb5e in sigTermHandler (postgres_signal_arg=<optimized
out>)
    at
/build/postgresql-10-bx1JAx/postgresql-10-10.1.2+carto/build/../src/bin/pg_dump/parallel.c:613
#9  <signal handler called>
#10 _int_malloc (av=av@entry=0x7fce15142b20 <main_arena>,
bytes=bytes@entry=17584) at malloc.c:3644
#11 0x00007fce14e02184 in __GI___libc_malloc (bytes=17584) at
malloc.c:2913
#12 0x00007fce14732e78 in CRYPTO_malloc () from
/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#13 0x00007fce14b3dafd in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0
#14 0x00007fce14b3db68 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0
#15 0x00007fce14b3fb6c in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0
#16 0x00007fce155a4fb5 in open_client_SSL (conn=0x5634a9a1a370)
```

* Free example:
```
#8  0x0000563455b60e7e in sigTermHandler (postgres_signal_arg=<optimized
out>)
    at
/build/postgresql-11-3orpls/postgresql-11-11.5.2+carto/build/../src/bin/pg_dump/parallel.c:615
#9  <signal handler called>
#10 0x00007fbba21bb8e4 in _int_free (av=0x7fbba24ffb20 <main_arena>,
p=0x5634579993b0, have_lock=0) at malloc.c:3976
#11 0x00007fbba21bf53c in __GI___libc_free (mem=<optimized out>) at
malloc.c:2968
#12 0x0000563455b6140e in set_archive_cancel_info
(AH=AH@entry=0x5634579ad350, conn=conn@entry=0x0)
```

After that, the process gives control to `sigTermHandler` which calls
`exit(1)` to die, but that gives control to whoever has setup atexit
functions, in our case `gnutls`. Once gnutls tries to call `free` as part of
its cleanup process, it will try to acquire the lock and block, but since
the lock is being held by the function interrupted by the signal the process
self locks and hangs forever.

I think if would be safer to use a similar approach to other processes, that
is use the handler to only enable a global flag and check that in the main
loop, but I'm having a hard time locating what the proper place to check the
flag would be.

Regards,
Raúl Marín.


Re: BUG #16199: pg_restore stuck on interrupts

От
Tom Lane
Дата:
PG Bug reporting form <noreply@postgresql.org> writes:
> We are seeing stuck pg_restore processes in several of our CI servers, both
> with PG10 (10.2) and PG11 (11.5).

You didn't actually say, but you must be interrupting parallel restores
with SIGINT or the like?

> I have a some extra processes with the same issue  (7 full stacks out of 20,
> the others are garbage) and, from what I see, they all have in common that
> the process has received a signal while it was doing a memory operation,
> either a malloc or a free:

Yeah.  Ugh :-(

> I think if would be safer to use a similar approach to other processes, that
> is use the handler to only enable a global flag and check that in the main
> loop, but I'm having a hard time locating what the proper place to check the
> flag would be.

I think the odds of that being an improvement are minimal --- you'd be
trading a risk of failure during exit for a risk of not exiting (in any
timely fashion) in the first place.

sigTermHandler tries to be safe to run in a signal context, but I'm
afraid we didn't think hard about what exit() might call.  The way
I'd be inclined to fix this is to call _exit() instead of exit(),
and the heck with what any atexit handlers think.  Can you try that
and see if it improves matters for you?

            regards, tom lane



Re: BUG #16199: pg_restore stuck on interrupts

От
Raúl Marín
Дата:
On 8/1/20 18:13, Tom Lane wrote:

> You didn't actually say, but you must be interrupting parallel restores
> with SIGINT or the like?

Yes, CI (Jenkins) is interrupted automatically with new pushes and 
that's supposed to send a SIGTERM to the process group, which includes 
the pg_restore process.


> sigTermHandler tries to be safe to run in a signal context, but I'm
> afraid we didn't think hard about what exit() might call.  The way
> I'd be inclined to fix this is to call _exit() instead of exit(),
> and the heck with what any atexit handlers think.  Can you try that
> and see if it improves matters for you?


Initially I didn't like this idea since that means not cleaning up 
gnutls stuff, and modifying things related to crypto is always scary; 
but following the same reasoning, I trust that any good cryto library 
shouldn't leak anything important due to a fast exit.

I'll set up some of the servers to use _exit() for some days and see if 
that fixes it.

Thanks!
Raúl Marín.



Re: BUG #16199: pg_restore stuck on interrupts

От
Raúl Marín
Дата:
Hi,

After a little more than week with half of the servers in the farm 
patched, the issue has only appeared in the unpatched servers and I 
haven't seen any odd behaviour from the ones patched.


I'm attaching the patch applied for PG11, but it applies cleanly in 
master and PG12 too.

Regards,
Raúl Marín.



On 8/1/20 18:53, Raúl Marín wrote:
> On 8/1/20 18:13, Tom Lane wrote:
> 
>> You didn't actually say, but you must be interrupting parallel restores
>> with SIGINT or the like?
> 
> Yes, CI (Jenkins) is interrupted automatically with new pushes and 
> that's supposed to send a SIGTERM to the process group, which includes 
> the pg_restore process.
> 
> 
>> sigTermHandler tries to be safe to run in a signal context, but I'm
>> afraid we didn't think hard about what exit() might call.  The way
>> I'd be inclined to fix this is to call _exit() instead of exit(),
>> and the heck with what any atexit handlers think.  Can you try that
>> and see if it improves matters for you?
> 
> 
> Initially I didn't like this idea since that means not cleaning up 
> gnutls stuff, and modifying things related to crypto is always scary; 
> but following the same reasoning, I trust that any good cryto library 
> shouldn't leak anything important due to a fast exit.
> 
> I'll set up some of the servers to use _exit() for some days and see if 
> that fixes it.
> 
> Thanks!
> Raúl Marín.


Вложения

Re: BUG #16199: pg_restore stuck on interrupts

От
Tom Lane
Дата:
=?UTF-8?B?UmHDumwgTWFyw61u?= <admin@rmr.ninja> writes:
> After a little more than week with half of the servers in the farm 
> patched, the issue has only appeared in the unpatched servers and I 
> haven't seen any odd behaviour from the ones patched.

Great!  I pushed the fix.

            regards, tom lane



Re: BUG #16199: pg_restore stuck on interrupts

От
Raúl Marín
Дата:
On 20/1/20 18:58, Tom Lane wrote:
> Great!  I pushed the fix.

Thanks a lot!