Обсуждение: BUG #9551: Hang in State "authentication" Prevents Vacuum from Freeing Dead Rows

Поиск
Список
Период
Сортировка

BUG #9551: Hang in State "authentication" Prevents Vacuum from Freeing Dead Rows

От
jmorton@gmail.com
Дата:
The following bug has been logged on the website:

Bug reference:      9551
Logged by:          James Morton
Email address:      jmorton@gmail.com
PostgreSQL version: 9.1.11
Operating system:   CentOS 5.10
Description:

Hi,

Recently, we've twice encountered a problem where connection will hang in
state "authentication."  While these connections are hung in this state,
they seem to block autovacuum from freeing dead rows, quickly causing
heavily modified tables to bloat.  These connections do not respond to a
SIGQUIT signal.

When this last happened, we captured a backtrace of the hung connection:

(gdb) bt
#0  0x00000030820e104e in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00000030820d1bf7 in _L_lock_646 () from /lib64/libc.so.6
#2  0x00000030820d168a in __vsyslog_chk () from /lib64/libc.so.6
#3  0x00000030820d1a73 in __syslog_chk () from /lib64/libc.so.6
#4  0x00000000006dd6f1 in EmitErrorReport ()
#5  0x00000000006ddaf5 in errfinish ()
#6  0x0000000000623f6a in ProcessInterrupts ()
#7  0x000000000062439c in StatementCancelHandler ()
#8  <signal handler called>
#9  0x00000030820306f7 in kill () from /lib64/libc.so.6
#10 0x0000000000616d01 in ?? ()
#11 0x0000000000616e3f in handle_sig_alarm ()
#12 <signal handler called>
#13 0x00000030820d5eb5 in send () from /lib64/libc.so.6
#14 0x00000030820d1750 in __vsyslog_chk () from /lib64/libc.so.6
#15 0x00000030820d1a73 in __syslog_chk () from /lib64/libc.so.6
#16 0x00000000006dd6f1 in EmitErrorReport ()
#17 0x00000000006ddaf5 in errfinish ()
#18 0x0000000000583c57 in ?? ()
#19 0x0000000000583c81 in ?? ()
#20 0x00000000005858af in ClientAuthentication ()
#21 0x00000000006e9bb6 in InitPostgres ()
#22 0x000000000062609d in PostgresMain ()
#23 0x00000000005e9dfc in ?? ()
#24 0x00000000005eab81 in PostmasterMain ()
#25 0x000000000058e38e in main ()

>From the stack trace, it looked to us like the connections were blocked
writing to syslog.  We attempted restarting syslog, but that had no affect
on the hung connections.

Our authentication method for the hung connection should have been MD5,
though we do have LDAP enabled for other users.

Re: BUG #9551: Hang in State "authentication" Prevents Vacuum from Freeing Dead Rows

От
Tom Lane
Дата:
jmorton@gmail.com writes:
> (gdb) bt
> #0  0x00000030820e104e in __lll_lock_wait_private () from /lib64/libc.so.6
> #1  0x00000030820d1bf7 in _L_lock_646 () from /lib64/libc.so.6
> #2  0x00000030820d168a in __vsyslog_chk () from /lib64/libc.so.6
> #3  0x00000030820d1a73 in __syslog_chk () from /lib64/libc.so.6
> #4  0x00000000006dd6f1 in EmitErrorReport ()
> #5  0x00000000006ddaf5 in errfinish ()
> #6  0x0000000000623f6a in ProcessInterrupts ()
> #7  0x000000000062439c in StatementCancelHandler ()
> #8  <signal handler called>
> #9  0x00000030820306f7 in kill () from /lib64/libc.so.6
> #10 0x0000000000616d01 in ?? ()
> #11 0x0000000000616e3f in handle_sig_alarm ()
> #12 <signal handler called>
> #13 0x00000030820d5eb5 in send () from /lib64/libc.so.6
> #14 0x00000030820d1750 in __vsyslog_chk () from /lib64/libc.so.6
> #15 0x00000030820d1a73 in __syslog_chk () from /lib64/libc.so.6
> #16 0x00000000006dd6f1 in EmitErrorReport ()
> #17 0x00000000006ddaf5 in errfinish ()
> #18 0x0000000000583c57 in ?? ()
> #19 0x0000000000583c81 in ?? ()
> #20 0x00000000005858af in ClientAuthentication ()
> #21 0x00000000006e9bb6 in InitPostgres ()
> #22 0x000000000062609d in PostgresMain ()
> #23 0x00000000005e9dfc in ?? ()
> #24 0x00000000005eab81 in PostmasterMain ()
> #25 0x000000000058e38e in main ()

Interesting stack trace.  Evidently, syslog() on your platform has some
internal mutex.  What seems to have happened is that SIGALRM interrupted
an attempt to log a client-authentication-related message, and then the
error message that it attempted to print resulted in recursive entry to
syslog(), causing a hang since the outer invocation already had the mutex.

This pathway would be impossible if the error level were ERROR, because
errfinish() clears ImmediateInterruptOK in that case, which would prevent
the SIGALRM interrupt handler from doing anything much.  A likely bet
though is that the error level in the outer call was FATAL (ie,
authentication failure); or it might've been just a LOG message.

I'm thinking the most robust solution to this is to have errfinish()
do this at entry:

   bool save_ImmediateInterruptOK = ImmediateInterruptOK;
   ImmediateInterruptOK = false;

and then, *if* it decides to return to caller, do this just before
returning:

   ImmediateInterruptOK = save_ImmediateInterruptOK;
   if (save_ImmediateInterruptOK)
       CHECK_FOR_INTERRUPTS();

This should make the error reporting logic safe against interrupts even
if invoked while ImmediateInterruptOK is true; which probably shouldn't
happen during normal backend operation but is definitely possible during
client authentication.

BTW, are you using an unusually short value for authentication_timeout?
This case doesn't seem likely to arise with normal values, since the
timeout isn't expected to occur unless something's totally wedged.

            regards, tom lane

Re: BUG #9551: Hang in State "authentication" Prevents Vacuum from Freeing Dead Rows

От
Tom Lane
Дата:
"James L. Morton" <jmorton@gmail.com> writes:
>  We've added monitoring for this particular problem and saw this happen
> once more today.  We noticed that this seems to correlate with a burst of
> log messages from auditd:
> ...
> When auditd exhausts its buffer, by default it begins logging to printk(),
> which in turn gets picked up by syslog.  Perhaps this burst of messages is
> causing syslog to pause long enough to reach authentication_timeout?  It
> seems odd, especially given that we have our Postgres syslog facility set
> to async.

Interesting.  Yeah, you'd certainly imagine that it shouldn't wait if
properly configured, but that's what seems to be happening.

I've committed a fix (which will appear in next week's releases) that
prevents the recursive call of syslog() by delaying processing of the
authentication_timeout interrupt until elog() is done.  That should fix
your problem as long as syslog's load-induced pause is transient.
However, if syslog() is getting wedged for some reason, you're still
going to have a stuck backend, because authentication_timeout won't
unwedge it.

            regards, tom lane

Re: BUG #9551: Hang in State "authentication" Prevents Vacuum from Freeing Dead Rows

От
"James L. Morton"
Дата:
Our authentication_timeout value is set to the default value (1 minute).
 We've added monitoring for this particular problem and saw this happen
once more today.  We noticed that this seems to correlate with a burst of
log messages from auditd:

Mar 14 18:06:29 XXXX kernel: printk: 2675 messages suppressed.
Mar 14 18:06:29 XXXX kernel: audit: audit_backlog=8193 >
audit_backlog_limit=8192
Mar 14 18:06:29 XXXX kernel: audit: audit_lost=958159 audit_rate_limit=0
audit_backlog_limit=8192

We looked back at old log files, and we see the same error also occurred
the previous times we've encountered this problem.

When auditd exhausts its buffer, by default it begins logging to printk(),
which in turn gets picked up by syslog.  Perhaps this burst of messages is
causing syslog to pause long enough to reach authentication_timeout?  It
seems odd, especially given that we have our Postgres syslog facility set
to async.

Some searching shows that the PgPool project ran into a similar-looking
deadlock while calling syslog in a signal handler:
http://www.pgpool.net/mantisbt/view.php?id=9

Thanks,
James Morton

On Thu, Mar 13, 2014 at 12:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> jmorton@gmail.com writes:
> > (gdb) bt
> > #0  0x00000030820e104e in __lll_lock_wait_private () from
> /lib64/libc.so.6
> > #1  0x00000030820d1bf7 in _L_lock_646 () from /lib64/libc.so.6
> > #2  0x00000030820d168a in __vsyslog_chk () from /lib64/libc.so.6
> > #3  0x00000030820d1a73 in __syslog_chk () from /lib64/libc.so.6
> > #4  0x00000000006dd6f1 in EmitErrorReport ()
> > #5  0x00000000006ddaf5 in errfinish ()
> > #6  0x0000000000623f6a in ProcessInterrupts ()
> > #7  0x000000000062439c in StatementCancelHandler ()
> > #8  <signal handler called>
> > #9  0x00000030820306f7 in kill () from /lib64/libc.so.6
> > #10 0x0000000000616d01 in ?? ()
> > #11 0x0000000000616e3f in handle_sig_alarm ()
> > #12 <signal handler called>
> > #13 0x00000030820d5eb5 in send () from /lib64/libc.so.6
> > #14 0x00000030820d1750 in __vsyslog_chk () from /lib64/libc.so.6
> > #15 0x00000030820d1a73 in __syslog_chk () from /lib64/libc.so.6
> > #16 0x00000000006dd6f1 in EmitErrorReport ()
> > #17 0x00000000006ddaf5 in errfinish ()
> > #18 0x0000000000583c57 in ?? ()
> > #19 0x0000000000583c81 in ?? ()
> > #20 0x00000000005858af in ClientAuthentication ()
> > #21 0x00000000006e9bb6 in InitPostgres ()
> > #22 0x000000000062609d in PostgresMain ()
> > #23 0x00000000005e9dfc in ?? ()
> > #24 0x00000000005eab81 in PostmasterMain ()
> > #25 0x000000000058e38e in main ()
>
> Interesting stack trace.  Evidently, syslog() on your platform has some
> internal mutex.  What seems to have happened is that SIGALRM interrupted
> an attempt to log a client-authentication-related message, and then the
> error message that it attempted to print resulted in recursive entry to
> syslog(), causing a hang since the outer invocation already had the mutex.
>
> This pathway would be impossible if the error level were ERROR, because
> errfinish() clears ImmediateInterruptOK in that case, which would prevent
> the SIGALRM interrupt handler from doing anything much.  A likely bet
> though is that the error level in the outer call was FATAL (ie,
> authentication failure); or it might've been just a LOG message.
>
> I'm thinking the most robust solution to this is to have errfinish()
> do this at entry:
>
>    bool save_ImmediateInterruptOK = ImmediateInterruptOK;
>    ImmediateInterruptOK = false;
>
> and then, *if* it decides to return to caller, do this just before
> returning:
>
>    ImmediateInterruptOK = save_ImmediateInterruptOK;
>    if (save_ImmediateInterruptOK)
>        CHECK_FOR_INTERRUPTS();
>
> This should make the error reporting logic safe against interrupts even
> if invoked while ImmediateInterruptOK is true; which probably shouldn't
> happen during normal backend operation but is definitely possible during
> client authentication.
>
> BTW, are you using an unusually short value for authentication_timeout?
> This case doesn't seem likely to arise with normal values, since the
> timeout isn't expected to occur unless something's totally wedged.
>
>                         regards, tom lane
>