Re: BUG #9551: Hang in State "authentication" Prevents Vacuum from Freeing Dead Rows

Поиск
Список
Период
Сортировка
От James L. Morton
Тема Re: BUG #9551: Hang in State "authentication" Prevents Vacuum from Freeing Dead Rows
Дата
Msg-id CALzMLzwMd6BqQTyyi9d00BbN7x7LZBRKNVbTpzvm0b9597=0-Q@mail.gmail.com
обсуждение исходный текст
Ответ на Re: BUG #9551: Hang in State "authentication" Prevents Vacuum from Freeing Dead Rows  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-bugs
Our authentication_timeout value is set to the default value (1 minute).
 We've added monitoring for this particular problem and saw this happen
once more today.  We noticed that this seems to correlate with a burst of
log messages from auditd:

Mar 14 18:06:29 XXXX kernel: printk: 2675 messages suppressed.
Mar 14 18:06:29 XXXX kernel: audit: audit_backlog=8193 >
audit_backlog_limit=8192
Mar 14 18:06:29 XXXX kernel: audit: audit_lost=958159 audit_rate_limit=0
audit_backlog_limit=8192

We looked back at old log files, and we see the same error also occurred
the previous times we've encountered this problem.

When auditd exhausts its buffer, by default it begins logging to printk(),
which in turn gets picked up by syslog.  Perhaps this burst of messages is
causing syslog to pause long enough to reach authentication_timeout?  It
seems odd, especially given that we have our Postgres syslog facility set
to async.

Some searching shows that the PgPool project ran into a similar-looking
deadlock while calling syslog in a signal handler:
http://www.pgpool.net/mantisbt/view.php?id=9

Thanks,
James Morton

On Thu, Mar 13, 2014 at 12:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> jmorton@gmail.com writes:
> > (gdb) bt
> > #0  0x00000030820e104e in __lll_lock_wait_private () from
> /lib64/libc.so.6
> > #1  0x00000030820d1bf7 in _L_lock_646 () from /lib64/libc.so.6
> > #2  0x00000030820d168a in __vsyslog_chk () from /lib64/libc.so.6
> > #3  0x00000030820d1a73 in __syslog_chk () from /lib64/libc.so.6
> > #4  0x00000000006dd6f1 in EmitErrorReport ()
> > #5  0x00000000006ddaf5 in errfinish ()
> > #6  0x0000000000623f6a in ProcessInterrupts ()
> > #7  0x000000000062439c in StatementCancelHandler ()
> > #8  <signal handler called>
> > #9  0x00000030820306f7 in kill () from /lib64/libc.so.6
> > #10 0x0000000000616d01 in ?? ()
> > #11 0x0000000000616e3f in handle_sig_alarm ()
> > #12 <signal handler called>
> > #13 0x00000030820d5eb5 in send () from /lib64/libc.so.6
> > #14 0x00000030820d1750 in __vsyslog_chk () from /lib64/libc.so.6
> > #15 0x00000030820d1a73 in __syslog_chk () from /lib64/libc.so.6
> > #16 0x00000000006dd6f1 in EmitErrorReport ()
> > #17 0x00000000006ddaf5 in errfinish ()
> > #18 0x0000000000583c57 in ?? ()
> > #19 0x0000000000583c81 in ?? ()
> > #20 0x00000000005858af in ClientAuthentication ()
> > #21 0x00000000006e9bb6 in InitPostgres ()
> > #22 0x000000000062609d in PostgresMain ()
> > #23 0x00000000005e9dfc in ?? ()
> > #24 0x00000000005eab81 in PostmasterMain ()
> > #25 0x000000000058e38e in main ()
>
> Interesting stack trace.  Evidently, syslog() on your platform has some
> internal mutex.  What seems to have happened is that SIGALRM interrupted
> an attempt to log a client-authentication-related message, and then the
> error message that it attempted to print resulted in recursive entry to
> syslog(), causing a hang since the outer invocation already had the mutex.
>
> This pathway would be impossible if the error level were ERROR, because
> errfinish() clears ImmediateInterruptOK in that case, which would prevent
> the SIGALRM interrupt handler from doing anything much.  A likely bet
> though is that the error level in the outer call was FATAL (ie,
> authentication failure); or it might've been just a LOG message.
>
> I'm thinking the most robust solution to this is to have errfinish()
> do this at entry:
>
>    bool save_ImmediateInterruptOK = ImmediateInterruptOK;
>    ImmediateInterruptOK = false;
>
> and then, *if* it decides to return to caller, do this just before
> returning:
>
>    ImmediateInterruptOK = save_ImmediateInterruptOK;
>    if (save_ImmediateInterruptOK)
>        CHECK_FOR_INTERRUPTS();
>
> This should make the error reporting logic safe against interrupts even
> if invoked while ImmediateInterruptOK is true; which probably shouldn't
> happen during normal backend operation but is definitely possible during
> client authentication.
>
> BTW, are you using an unusually short value for authentication_timeout?
> This case doesn't seem likely to arise with normal values, since the
> timeout isn't expected to occur unless something's totally wedged.
>
>                         regards, tom lane
>

В списке pgsql-bugs по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: BUG #9551: Hang in State "authentication" Prevents Vacuum from Freeing Dead Rows
Следующее
От: Fujii Masao
Дата:
Сообщение: Re: BUG #9118: WAL Sender does not disconnect replication clients during shutdown