PG in container w/ pid namespace is init, process exits cause restart

From: Andres Freund
Subject: PG in container w/ pid namespace is init, process exits cause restart
Msg-id 20210503190707.apw4s5jiol4bvndk@alap3.anarazel.de
Replies: Re: PG in container w/ pid namespace is init, process exits cause restart  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
Re: PG in container w/ pid namespace is init, process exits cause restart  (Andrew Dunstan <andrew@dunslane.net>)
List: pgsql-hackers
Hi,

A colleague debugged an issue where their postgres was occasionally
crash-restarting under load.

The cause turned out to be a relatively complex archive_command in
which, in some rare circumstances, a bash subshell pipeline could
fail.  It wasn't at all obvious why that'd cause a crash though - the
archive command handles the error.

The issue turns out to be that postgres was in a container, with pid
namespaces enabled. Because postgres was run directly in the container,
without a parent process inside, it became pid 1. That mostly works
without a problem, until, as was the case here with the archive
command, a sub-sub process exits while it still has a child. That
child then gets re-parented to postmaster (as init).
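
The re-parenting can be reproduced outside a container. The following is a minimal sketch, not taken from postgres: it assumes Linux and uses prctl(PR_SET_CHILD_SUBREAPER) to make an ordinary process adopt orphaned descendants the way pid 1 does in a pid namespace. The function name reap_orphan_exit_code() is made up for illustration.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>

/*
 * Fork a child that forks a grandchild and then exits right away.  The
 * grandchild outlives its parent and exits with grandchild_code.
 *
 * Returns the exit code of the process we reaped but never forked
 * ourselves, or -1 if no such process showed up.
 */
int
reap_orphan_exit_code(int grandchild_code)
{
    pid_t child;
    int   result = -1;

    /* Linux-only assumption: adopt orphaned descendants, as pid 1 would. */
    if (prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0) != 0)
        return -1;

    fflush(NULL);               /* don't duplicate buffered output */
    child = fork();
    if (child < 0)
        return -1;

    if (child == 0)
    {
        pid_t grandchild = fork();

        if (grandchild == 0)
        {
            /* Grandchild: stay alive until the parent is gone. */
            usleep(200 * 1000);
            _exit(grandchild_code);
        }
        /* Intermediate process: exit while its child is still running. */
        _exit(grandchild < 0 ? 1 : 0);
    }

    /* Reap everything; one of the pids will be one we did not fork. */
    for (;;)
    {
        int   status;
        pid_t pid = wait(&status);

        if (pid < 0)
        {
            if (errno == EINTR)
                continue;
            break;              /* ECHILD: nothing left to wait for */
        }
        if (pid != child && WIFEXITED(status))
            result = WEXITSTATUS(status);
    }
    return result;
}
```

Calling reap_orphan_exit_code(3) from a small main() should return 3: the top-level process receives the exit status of a pid it never forked, which is the situation postmaster ends up in as pid 1.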

Such a child is likely to exit with something other than 0 or 1. As
its pid won't match anything in reaper(), we'll go to
CleanupBackend(), where any exit status but 0/1 will unconditionally
trigger a restart:

    if (!EXIT_STATUS_0(exitstatus) && !EXIT_STATUS_1(exitstatus))
    {
        HandleChildCrash(pid, exitstatus, _("server process"));
        return;
    }
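
To make the effect of that condition concrete, here is a small standalone sketch. The helpers status_is_exit_0() and status_is_exit_1() are hypothetical stand-ins modeled on what EXIT_STATUS_0/EXIT_STATUS_1 appear to check (their exact definitions are an assumption here), and wait_status_of_exit() exists only to obtain real wait statuses for experimenting.

```c
#include <stdbool.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical stand-ins for the macros used in the snippet above. */
static bool
status_is_exit_0(int st)
{
    return st == 0;
}

static bool
status_is_exit_1(int st)
{
    return WIFEXITED(st) && WEXITSTATUS(st) == 1;
}

/* Same shape as the condition above: anything but exit 0 or 1 is a "crash". */
bool
status_counts_as_crash(int st)
{
    return !status_is_exit_0(st) && !status_is_exit_1(st);
}

/* Fork a child that exits with 'code' and return the raw wait status. */
int
wait_status_of_exit(int code)
{
    int   status = -1;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return status;
}
```

With these, status_counts_as_crash(wait_status_of_exit(2)) is true while exit codes 0 and 1 are not treated as crashes, so a re-parented shell helper exiting with, say, 2 or 127 would be enough to take the restart path.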


This kind of thing is pretty hard to debug, because it's not easy to
even figure out what the "crashing" pid belonged to.

I wonder if we should work a bit harder to try to identify whether an
exiting process was a "server process" before identifying it as such?
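
One possible shape of such a check, purely as a sketch and not how postmaster tracks its children: remember every pid we fork, and only apply the crash logic to pids found in that table. All names here (track_child, classify_exit, the Verdict enum) are invented for illustration, and the fixed-size array is just to keep the example short.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>

#define MAX_TRACKED 64

static pid_t  tracked[MAX_TRACKED];
static size_t ntracked = 0;

typedef enum
{
    VERDICT_FOREIGN,            /* not one of ours: log it and move on */
    VERDICT_CLEAN,              /* ours, exited with 0 or 1 */
    VERDICT_CRASH               /* ours, anything else */
} Verdict;

/* Record a pid right after a successful fork(). */
bool
track_child(pid_t pid)
{
    if (ntracked == MAX_TRACKED)
        return false;
    tracked[ntracked++] = pid;
    return true;
}

/* Remove a pid from the table; returns false if it was never tracked. */
static bool
untrack_child(pid_t pid)
{
    for (size_t i = 0; i < ntracked; i++)
    {
        if (tracked[i] == pid)
        {
            tracked[i] = tracked[--ntracked];
            return true;
        }
    }
    return false;
}

/* Decide what an exit reported by wait() means for us. */
Verdict
classify_exit(pid_t pid, int status)
{
    if (!untrack_child(pid))
        return VERDICT_FOREIGN;
    if (status == 0 || (WIFEXITED(status) && WEXITSTATUS(status) == 1))
        return VERDICT_CLEAN;
    return VERDICT_CRASH;
}
```

The point of the design is only the ordering: identity is established before the exit status is interpreted, so an adopted orphan can never be reported as a crashed "server process".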

And perhaps we ought to warn about postgres running as "init" unless we
make that robust?
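
Detecting the situation itself is cheap. A minimal sketch, with made-up function names and message text, assuming that being pid 1 in the current pid namespace is the condition worth warning about:

```c
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>

/* Inside a pid namespace, the namespace's init sees itself as pid 1. */
bool
running_as_init(pid_t pid)
{
    return pid == 1;
}

/* Emit a warning at startup if we are init; returns whether we warned. */
bool
warn_if_init(FILE *out)
{
    if (!running_as_init(getpid()))
        return false;

    fprintf(out,
            "WARNING: running as pid 1; orphaned processes will be "
            "re-parented to this process\n");
    return true;
}
```

A startup path could call warn_if_init(stderr) once; whether a warning or making the handling robust is the better fix is left open here.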

Greetings,

Andres Freund


