Thread: last_statrequest is in the future
Well, I didn't actually think that this patch
http://archives.postgresql.org/pgsql-committers/2010-03/msg00181.php
would yield much insight, but lookee what we have here:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jaguar&dt=2010-03-24%2004:00:07

[4ba99150.5099:483] LOG: statement: VACUUM ANALYZE num_exp_add;
[4ba99145.5071:1] LOG: last_statrequest is in the future, resetting
[4ba99145.5071:2] LOG: last_statrequest is in the future, resetting
[4ba99145.5071:3] LOG: last_statrequest is in the future, resetting
[4ba99145.5071:4] LOG: last_statrequest is in the future, resetting
[4ba99145.5071:5] LOG: last_statrequest is in the future, resetting
...
[4ba99145.5071:497] LOG: last_statrequest is in the future, resetting
[4ba99145.5071:498] LOG: last_statrequest is in the future, resetting
[4ba99145.5071:499] LOG: last_statrequest is in the future, resetting
[4ba99145.5071:500] LOG: last_statrequest is in the future, resetting
[4ba99150.5099:484] WARNING: pgstat wait timeout

There are multiple occurrences of "pgstat wait timeout" in the
postmaster log (some evidently from autovacuum, because they don't show
up as regression diffs), and every one of them is associated with a
bunch of "last_statrequest is in the future" bleats.

So at least on jaguar, it seems that the reason for this behavior is
that the system clock is significantly skewed between the stats
collector process and the backends, to the point where stats updates
generated by the collector will never appear new enough to satisfy the
requesting backends.  I think I'm going to go back and modify the code
to show the actual numbers involved so we can see just how bad it is
--- but the skew must be more than five seconds or we'd not be seeing
this failure.  That seems to me to put it in the class of "system bug".

Should we redesign the stats signaling logic to work around this, or
just hope we can nag kernel people into fixing it?

			regards, tom lane
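For readers following along: the handshake under discussion has the
backend stamp its stats request with its own gettimeofday() reading
(that's last_statrequest), after which it polls the stats file until
the file carries a collector timestamp at least as new as the request,
giving up after about five seconds with "pgstat wait timeout".  A
collector whose clock runs behind the backend's sees such a request as
coming from the future (hence the "resetting" messages) and can never
produce a file that looks fresh enough.  Below is a minimal standalone
C sketch of that failure mode --- not the actual PostgreSQL code; the
constant names and the 10 ms poll interval are invented for
illustration:

/*
 * Toy model of the backend-side stats handshake.  The "collector"
 * clock runs a fixed number of milliseconds behind the "backend"
 * clock; once that skew exceeds the retry budget, the wait loop can
 * never be satisfied.
 */
#include <stdio.h>
#include <stdbool.h>

#define RETRY_DELAY_MS	10		/* assumed sleep between polls */
#define MAX_WAIT_MS		5000	/* give up after ~5 seconds */

static long backend_now_ms = 0;		/* ms since an arbitrary epoch */
static const long collector_skew_ms = 6000;	/* collector lags > 5 s */

/* The collector stamps the stats file with *its* (lagging) clock. */
static long
stats_file_timestamp(void)
{
	return backend_now_ms - collector_skew_ms;
}

static bool
backend_wait_for_stats(void)
{
	long		request_time = backend_now_ms;	/* last_statrequest */
	long		waited = 0;

	while (waited < MAX_WAIT_MS)
	{
		if (stats_file_timestamp() >= request_time)
			return true;		/* file is fresh enough */
		backend_now_ms += RETRY_DELAY_MS;	/* simulated sleep */
		waited += RETRY_DELAY_MS;
	}
	return false;				/* "pgstat wait timeout" */
}

int
main(void)
{
	if (!backend_wait_for_stats())
		puts("WARNING: pgstat wait timeout");
	return 0;
}

Set collector_skew_ms to anything under MAX_WAIT_MS and the loop
succeeds as soon as the skew is covered; above it, every request times
out, which is exactly the pattern in the log excerpt above.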
I wrote:
> ... lookee what we have here:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jaguar&dt=2010-03-24%2004:00:07

Modifying the code to print the actual timestamps resulted in a wealth
of information that can be dredged from the postmaster log here:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jaguar&dt=2010-03-25%2004:00:04

Just grep for log messages like

[4baafb13.cd2:1] LOG: last_statrequest 2010-03-25 06:58:42.118734+01 is later than collector's time 2010-03-25 06:58:40.817027+01

What this shows is that system timekeeping on jaguar is just utterly
broken.  gettimeofday() sometimes returns answers that are 5 seconds or
more skewed in different processes, and there are episodes where its
value fails to move at all in the stats collector while multiple
seconds elapse in the opinion of another process.

I wonder whether this doesn't also explain jaguar's propensity to show
some other unexplainable failures, like these recent examples:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jaguar&dt=2010-03-20%2004:00:03
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jaguar&dt=2010-03-15%2004:00:04
which are not at all unique in that machine's buildfarm history.  I had
always supposed that these were caused by the extreme overall slowness
induced by CLOBBER_CACHE_ALWAYS, but now I think they are more likely
explained by system timekeeping issues.  In particular the last-cited
regression failure can be explained if now() fails to change value
across pg_sleep(0.1), and that's isomorphic to what seems to be
happening in the stats collector in some of the episodes captured
today.

In short: either this machine has got broken clock hardware, or there's
a very serious bug in the kernel code for reading the clock.

			regards, tom lane
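For anyone wanting to probe a suspect machine for the cross-process
behavior described above, a throwaway test along these lines (sketched
here for illustration; this is not part of the buildfarm client or the
PostgreSQL sources) compares gettimeofday() readings taken in two
processes.  The child streams its clock samples over a pipe and the
parent subtracts its own reading on receipt, so on a sane system the
reported difference is just pipe plus scheduling latency, far below a
second; multi-second offsets would match what jaguar is showing:

#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

int
main(void)
{
	int			fds[2];

	if (pipe(fds) != 0)
	{
		perror("pipe");
		return 1;
	}

	if (fork() == 0)
	{
		/* Child: send 100 clock samples, one every 100 ms. */
		close(fds[0]);
		for (int i = 0; i < 100; i++)
		{
			struct timeval tv;

			gettimeofday(&tv, NULL);
			write(fds[1], &tv, sizeof(tv));
			usleep(100000);
		}
		close(fds[1]);
		_exit(0);
	}

	/* Parent: compare each child sample against our own clock. */
	close(fds[1]);
	for (;;)
	{
		struct timeval child_tv,
					my_tv;
		double		skew;

		if (read(fds[0], &child_tv, sizeof(child_tv)) != sizeof(child_tv))
			break;				/* EOF: child is done */
		gettimeofday(&my_tv, NULL);
		skew = (my_tv.tv_sec - child_tv.tv_sec) +
			(my_tv.tv_usec - child_tv.tv_usec) / 1e6;
		if (skew > 1.0 || skew < 0.0)
			printf("suspicious skew: %+.6f seconds\n", skew);
	}
	return 0;
}

A negative difference here means the parent's later reading came out
earlier than the child's older sample, i.e. time appeared to run
backwards between the two processes --- the same pathology as a
last_statrequest that is "in the future" from the collector's point of
view.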
Tom Lane wrote:
> Should we redesign the stats signaling logic to work around this,
> or just hope we can nag kernel people into fixing it?

Even if there was something to be done in kernel space, how many years
from now would it be before it made this problem go away for the
majority of near-future 9.0 users?  We've been seeing a fairly regular
stream of "pgstat wait timeout" reports come in.  The one I reported
was from recent hardware and a very mainstream Linux setup.  I'm not
real optimistic that this one can get punted toward the OS and get
anything done about it in time to head off problems in the field.

This particularly pathological case with jaguar is great because it's
made it possible to nail down how to report the problem.  I don't think
it's possible to make a strong conclusion about how to resolve this
just from that data, though.  What we probably need is for your
additional logging code to catch this again on some systems that are
not so obviously broken, to get a better idea what a normal (rather
than extreme) manifestation looks like: how much skew is showing up,
whether those cases do in fact correspond with the wait timeouts, that
sort of thing.

-- 
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us
Greg Smith <greg@2ndquadrant.com> writes:
> Even if there was something to be done in kernel space, how many
> years from now would it be before it made this problem go away for
> the majority of near-future 9.0 users?  We've been seeing a fairly
> regular stream of "pgstat wait timeout" reports come in.  The one I
> reported was from recent hardware and a very mainstream Linux setup.

The behavior I'm seeing in jaguar's reports is difficult to
characterize as anything except a seriously broken machine (see the
followup message with further details).  It may be that there is some
not-obviously-broken-hardware case that we need to deal with, but I
have not seen evidence of it yet.  Can you reproduce your test case
with CVS HEAD and send the log messages?

			regards, tom lane