Обсуждение: Buildfarm failure from overly noisy warning message

Поиск
Список
Период
Сортировка

Buildfarm failure from overly noisy warning message

От
Tom Lane
Дата:
chipmunk failed last night
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2015-07-26%2007%3A36%3A32

like so:

================== pgsql.build/src/test/regress/regression.diffs ===================
*** /home/pgbfarm/buildroot/REL9_3_STABLE/pgsql.build/src/test/regress/expected/create_index.out    Sun Jul 26 10:37:41
2015
--- /home/pgbfarm/buildroot/REL9_3_STABLE/pgsql.build/src/test/regress/results/create_index.out    Sun Jul 26 10:51:48
2015
***************
*** 14,19 ****
--- 14,20 ---- CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops); CREATE INDEX tenk1_thous_tenthous ON
tenk1(thousand, tenthous); CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
 
+ WARNING:  could not send signal to process 30123: No such process CREATE INDEX tenk2_unique2 ON tenk2 USING
btree(unique2int4_ops); CREATE INDEX tenk2_hundred ON tenk2 USING btree(hundred int4_ops); CREATE INDEX rix ON road
USINGbtree (name text_ops);
 

======================================================================

What's evidently happened here is that our session tried to boot an
autovacuum process off a table lock, only that process was gone by the
time we issued the kill() call.  No problem really ... but aside from
causing buildfarm noise, I could see somebody getting really panicky
if this message appeared on a production server.

I'm inclined to reduce the WARNING to LOG, and/or skip it altogether
if the error is ESRCH.  The relevant code in ProcSleep() is:
               ereport(LOG,                     (errmsg("sending cancel to blocking autovacuum PID %d",
           pid),                      errdetail_log("%s", logbuf.data)));
 
               if (kill(pid, SIGINT) < 0)               {                   /* Just a warning to allow multiple callers
*/                  ereport(WARNING,                           (errmsg("could not send signal to process %d: %m",
                           pid)));               }
 

so logging failures at LOG level seems pretty reasonable.  One could
also argue that both of these ereports ought to be downgraded to DEBUG1
or less, since this mechanism is pretty well shaken out by now.

Thoughts?
        regards, tom lane



Re: Buildfarm failure from overly noisy warning message

От
Andres Freund
Дата:
Hi,

On 2015-07-26 10:56:05 -0400, Tom Lane wrote:
>   CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
> + WARNING:  could not send signal to process 30123: No such process

> What's evidently happened here is that our session tried to boot an
> autovacuum process off a table lock, only that process was gone by the
> time we issued the kill() call.  No problem really ... but aside from
> causing buildfarm noise, I could see somebody getting really panicky
> if this message appeared on a production server.

Oops.

> I'm inclined to reduce the WARNING to LOG

Hm, that doesn't seem like a very nice solution, given that LOG is even
more likely to end up in the server log.

> and/or skip it altogether if the error is ESRCH.

Combined with a comment why we're ignoring that case that'd work for me.


- Andres



Re: Buildfarm failure from overly noisy warning message

От
Andrew Dunstan
Дата:
On 07/26/2015 11:00 AM, Andres Freund wrote:
> Hi,
>
> On 2015-07-26 10:56:05 -0400, Tom Lane wrote:
>>    CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
>> + WARNING:  could not send signal to process 30123: No such process
>> What's evidently happened here is that our session tried to boot an
>> autovacuum process off a table lock, only that process was gone by the
>> time we issued the kill() call.  No problem really ... but aside from
>> causing buildfarm noise, I could see somebody getting really panicky
>> if this message appeared on a production server.
> Oops.
>
>> I'm inclined to reduce the WARNING to LOG
> Hm, that doesn't seem like a very nice solution, given that LOG is even
> more likely to end up in the server log.
>
>> and/or skip it altogether if the error is ESRCH.
> Combined with a comment why we're ignoring that case that'd work for me.
>


+1 for this. if the process is gone we shouldn't really have a problem, 
should we?

cheers

andrew



Re: Buildfarm failure from overly noisy warning message

От
Tom Lane
Дата:
Andrew Dunstan <andrew@dunslane.net> writes:
> On 07/26/2015 11:00 AM, Andres Freund wrote:
>> On 2015-07-26 10:56:05 -0400, Tom Lane wrote:
>>> I'm inclined to reduce the WARNING to LOG

>> Hm, that doesn't seem like a very nice solution, given that LOG is even
>> more likely to end up in the server log.

>>> and/or skip it altogether if the error is ESRCH.

>> Combined with a comment why we're ignoring that case that'd work for me.

> +1 for this. if the process is gone we shouldn't really have a problem, 
> should we?

The only real reason for printing anything here is the possibility that
the shared lock table is sufficiently corrupted that we think there's
an autovacuum process blocking us when there isn't.  However, I don't
see any particular reason to think that this is more probable than any
other bad consequence of corrupted shmem, so I don't feel a need to be
in the user's face with a WARNING.  The evidence so far is that the
race condition is real, but I don't recall anyone ever reporting this
in a context where we'd suspect actual shmem corruption.

I do, however, think it might be reasonable to LOG the failure given
that we LOG'd the action.  Is nobody else on board with downgrading
both ereports to DEBUG?
        regards, tom lane



Re: Buildfarm failure from overly noisy warning message

От
Kevin Grittner
Дата:
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> + WARNING:  could not send signal to process 30123: No such process

> What's evidently happened here is that our session tried to boot an
> autovacuum process off a table lock, only that process was gone by the
> time we issued the kill() call.

> I'm inclined to reduce the WARNING to LOG, and/or skip it altogether
> if the error is ESRCH.

> One could also argue that both of these ereports ought to be downgraded to DEBUG1
> or less, since this mechanism is pretty well shaken out by now.

> Thoughts?

I think a LOG entry when an autovacuum process is actually canceled
has value just in case it is happening on a particular table so
frequently that the table starts to bloat.  I see no reason to log
anything if there is an intention to cancel an autovacuum process
but it actually completes before we can do so.  IMV the best
solution would be to proceed straight to the kill() and only log
something if it was found by kill().

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Buildfarm failure from overly noisy warning message

От
Tom Lane
Дата:
Kevin Grittner <kgrittn@ymail.com> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> What's evidently happened here is that our session tried to boot an
>> autovacuum process off a table lock, only that process was gone by the
>> time we issued the kill() call.

> I think a LOG entry when an autovacuum process is actually canceled
> has value just in case it is happening on a particular table so
> frequently that the table starts to bloat.  I see no reason to log
> anything if there is an intention to cancel an autovacuum process
> but it actually completes before we can do so.  IMV the best
> solution would be to proceed straight to the kill() and only log
> something if it was found by kill().

Hm.  By that logic, I'm not sure if we need anything to be logged here,
because the autovacuum process will log something about having received
a query cancel signal.  Also, if we do it like that there's a race
condition: it's entirely possible (maybe even likely, on single-CPU
machines) that the autovacuum process would issue its log entry before
the sender is able to log its request.  That would probably confuse people
no end.

If we're in the business of minimizing log chatter, I'd suggest that
we remove the entirely-routine "sending cancel" log message, and only
log something in the uncommon case where the kill() fails (but, per
original point, reduce that entry to LOG or so; or else print something
only for not-ESRCH cases).
        regards, tom lane



Re: Buildfarm failure from overly noisy warning message

От
Kevin Grittner
Дата:
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Kevin Grittner <kgrittn@ymail.com> writes:

>> I think a LOG entry when an autovacuum process is actually canceled
>> has value just in case it is happening on a particular table so
>> frequently that the table starts to bloat.  I see no reason to log
>> anything if there is an intention to cancel an autovacuum process
>> but it actually completes before we can do so.

> Hm.  By that logic, I'm not sure if we need anything to be logged here,
> because the autovacuum process will log something about having received
> a query cancel signal.

That seems sufficient to me for normal cases.

> If we're in the business of minimizing log chatter, I'd suggest that
> we remove the entirely-routine "sending cancel" log message, and only
> log something in the uncommon case where the kill() fails (but, per
> original point, reduce that entry to LOG or so; or else print something
> only for not-ESRCH cases).

+1 for only printing for the non-ESRCH cases.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Buildfarm failure from overly noisy warning message

От
Tom Lane
Дата:
Kevin Grittner <kgrittn@ymail.com> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Kevin Grittner <kgrittn@ymail.com> writes:
>>> I think a LOG entry when an autovacuum process is actually canceled
>>> has value just in case it is happening on a particular table so
>>> frequently that the table starts to bloat.  I see no reason to log
>>> anything if there is an intention to cancel an autovacuum process
>>> but it actually completes before we can do so.

>> Hm.  By that logic, I'm not sure if we need anything to be logged here,
>> because the autovacuum process will log something about having received
>> a query cancel signal.

> That seems sufficient to me for normal cases.

Rather than remove the "sending signal" elog entirely, I reduced it to
DEBUG1; that will avoid log chatter for normal cases but the info can
still be obtained at need.

>> If we're in the business of minimizing log chatter, I'd suggest that
>> we remove the entirely-routine "sending cancel" log message, and only
>> log something in the uncommon case where the kill() fails (but, per
>> original point, reduce that entry to LOG or so; or else print something
>> only for not-ESRCH cases).

> +1 for only printing for the non-ESRCH cases.

Left that one as a WARNING, but it doesn't print for ESRCH.
        regards, tom lane



Re: Buildfarm failure from overly noisy warning message

От
Jeff Janes
Дата:
On Tue, Jul 28, 2015 at 2:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Kevin Grittner <kgrittn@ymail.com> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Kevin Grittner <kgrittn@ymail.com> writes:
>>> I think a LOG entry when an autovacuum process is actually canceled
>>> has value just in case it is happening on a particular table so
>>> frequently that the table starts to bloat.  I see no reason to log
>>> anything if there is an intention to cancel an autovacuum process
>>> but it actually completes before we can do so.

>> Hm.  By that logic, I'm not sure if we need anything to be logged here,
>> because the autovacuum process will log something about having received
>> a query cancel signal.

> That seems sufficient to me for normal cases.

Rather than remove the "sending signal" elog entirely, I reduced it to
DEBUG1; that will avoid log chatter for normal cases but the info can
still be obtained at need.

This part doesn't seem right to me.  Now the autovac worker logs that it was cancelled, but we have no idea why it was cancelled i.e. which lock request caused it to be cancelled and which query caused the lock request.

(It looks like the build-farm failures were fixed by the other part of the change, which I have no objection to)

This effectively undoes commit d7318d43d891bd63e82dcfc27.  I thought that that commit was a huge improvement and would prefer it not be undone. 

commit d7318d43d891bd63e82dcfc27948113ed7b1db80
Author: Robert Haas <rhaas@postgresql.org>
Date:   Thu Jul 26 09:18:32 2012 -0400

    Log a better message when canceling autovacuum.

    The old message was at DEBUG2, so typically it didn't show up in the
    log at all.  As a result, in most cases where autovacuum was canceled,
    the only information that was logged was the table being vacuumed,
    with no indication as to what problem caused the cancel.  Crank up
    the level to LOG and add some more details to assist with debugging.

Although looking at the code from that patch, it is not clear to me why all the string building needs to be done under the ProcArrayLock.  The string depends only on the local state of the lock being blocked, not on the proc doing the blocking.

Cheers,

Jeff

Re: Buildfarm failure from overly noisy warning message

От
Tom Lane
Дата:
Jeff Janes <jeff.janes@gmail.com> writes:
> On Tue, Jul 28, 2015 at 2:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Rather than remove the "sending signal" elog entirely, I reduced it to
>> DEBUG1; that will avoid log chatter for normal cases but the info can
>> still be obtained at need.

> This part doesn't seem right to me.  Now the autovac worker logs that it
> was cancelled, but we have no idea why it was cancelled i.e. which lock
> request caused it to be cancelled and which query caused the lock request.

The argument for changing it was basically that there's not any very good
reason to care, and if this happens a lot you're just bloating the log.
While I am not going to defend that proposition to the death, I've
certainly heard plenty of complaints that the postmaster log is too chatty
about routine occurrences.  Why is this information important enough
to log by default?

> Although looking at the code from that patch, it is not clear to me why all
> the string building needs to be done under the ProcArrayLock.  The string
> depends only on the local state of the lock being blocked, not on the proc
> doing the blocking.

Actually, I'm not sure I see the point of taking the ProcArrayLock at all
in this stanza.  If the state of the autovac process can change under us
then the lines just in front of the LWLockAcquire are already broken, no?
Worst case, the blocker might not be there at all anymore.
        regards, tom lane