Thread: buildfarm animals and 'snapshot too old'

buildfarm animals and 'snapshot too old'

From:
"Tomas Vondra"
Date:
Hi all,

today I got a few errors like these (this one is from last week, though):

    Status Line: 493 snapshot too old: Wed May  7 04:36:57 2014 GMT
    Content:
    snapshot to old: Wed May  7 04:36:57 2014 GMT

on the new buildfarm animals. I believe it was my mistake (incorrectly
configured local git mirror), but it got me thinking about how this will
behave with the animals running CLOBBER_CACHE_RECURSIVELY.

If I understand the Perl code correctly, it does this (roughly sketched below):

(1) update the repository
(2) run the tests
(3) check that the snapshot is not older than 24 hours (pgstatus.pl:188)
(4) fail if older
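
For illustration, here is roughly what that check amounts to (a sketch in
Perl, not the actual pgstatus.pl code; the variable names are made up):

    use strict;
    use warnings;

    # Sketch only: reject a report whose snapshot timestamp is more than
    # 24 hours older than the reference time (taken here as simply "now").
    sub snapshot_is_current {
        my ($snapshot_ts, $reference_ts) = @_;
        my $max_age = 24 * 60 * 60;    # 24 hours, in seconds
        return ($reference_ts - $snapshot_ts) <= $max_age;
    }

    my $snapshot_ts = time() - 3 * 24 * 60 * 60;    # e.g. a three-day-old snapshot
    print snapshot_is_current($snapshot_ts, time())
        ? "report accepted\n"
        : "493 snapshot too old\n";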

Now, imagine that the test runs for days/weeks. This pretty much means
it's wasted, because the results will be thrown away anyway, no?

regards
Tomas




Re: buildfarm animals and 'snapshot too old'

From:
Andrew Dunstan
Date:
On 05/15/2014 12:43 PM, Tomas Vondra wrote:
> Hi all,
>
> today I got a few of errors like these (this one is from last week, though):
>
>     Status Line: 493 snapshot too old: Wed May  7 04:36:57 2014 GMT
>     Content:
>     snapshot to old: Wed May  7 04:36:57 2014 GMT
>
> on the new buildfarm animals. I believe it was my mistake (incorrectly
> configured local git mirror), but it got me thinking about how this will
> behave with the animals running CLOBBER_CACHE_RECURSIVELY.
>
> If I understand the Perl code correctly, it does this:
>
> (1) update the repository
> (2) run the tests
> (3) check that the snapshot is not older than 24 hours (pgstatus.pl:188)
> (4) fail if older
>
> Now, imagine that the test runs for days/weeks. This pretty much means
> it's wasted, because the results will be thrown away anyway, no?
>


The 24 hours runs from the time of the latest commit on the branch in 
question, not the current time, but basically yes.

We've never had machines with runs that long. The longest in recent 
times has been friarbird, which runs CLOBBER_CACHE_ALWAYS and takes 
around 4.5 hours. But we have had misconfigured machines reporting 
unbelievable snapshot times.  I'll take a look and see if we can tighten 
up the sanity check. It's worth noting that one thing friarbird does is 
skip the install-check stage - it's almost certainly not going to have 
terribly much interesting to tell us from that, given it has already run 
a plain "make check".

How long does a CLOBBER_CACHE_RECURSIVELY run take? days or weeks seems 
kinda nuts.

cheers

andrew



Re: buildfarm animals and 'snapshot too old'

From:
"Tomas Vondra"
Date:
On 15 May 2014, 19:46, Andrew Dunstan wrote:
>
> On 05/15/2014 12:43 PM, Tomas Vondra wrote:
>> Hi all,
>>
>> today I got a few of errors like these (this one is from last week,
>> though):
>>
>>     Status Line: 493 snapshot too old: Wed May  7 04:36:57 2014 GMT
>>     Content:
>>     snapshot to old: Wed May  7 04:36:57 2014 GMT
>>
>> on the new buildfarm animals. I believe it was my mistake (incorrectly
>> configured local git mirror), but it got me thinking about how this will
>> behave with the animals running CLOBBER_CACHE_RECURSIVELY.
>>
>> If I understand the Perl code correctly, it does this:
>>
>> (1) update the repository
>> (2) run the tests
>> (3) check that the snapshot is not older than 24 hours (pgstatus.pl:188)
>> (4) fail if older
>>
>> Now, imagine that the test runs for days/weeks. This pretty much means
>> it's wasted, because the results will be thrown away anyway, no?
>>
>
>
> The 24 hours runs from the time of the latest commit on the branch in
> question, not the current time, but basically yes.
>
> We've never had machines with runs that long. The longest in recent
> times has been friarbird, which runs CLOBBER_CACHE_ALWAYS and takes
> around 4.5 hours. But we have had misconfigured machines reporting
> unbelievable snapshot times.  I'll take a look and see if we can tighten
> up the sanity check. It's worth noting that one thing friarbird does is
> skip the install-check stage - it's almost certainly not going to have
> terribly much interesting to tell us from that, given it has already run
> a plain "make check".
>
> How long does a CLOBBER_CACHE_RECURSIVELY run take? days or weeks seems
> kinda nuts.

I don't know. According to this comment from cache/inval.c, it's expected
to be way slower (~100x) compared to CLOBBER_CACHE_ALWAYS.

/*
 * Test code to force cache flushes anytime a flush could happen.
 *
 * If used with CLOBBER_FREED_MEMORY, CLOBBER_CACHE_ALWAYS provides a
 * fairly thorough test that the system contains no cache-flush hazards.
 * However, it also makes the system unbelievably slow --- the regression
 * tests take about 100 times longer than normal.
 *
 * If you're a glutton for punishment, try CLOBBER_CACHE_RECURSIVELY. This
 * slows things by at least a factor of 10000, so I wouldn't suggest
 * trying to run the entire regression tests that way.  It's useful to try
 * a few simple tests, to make sure that cache reload isn't subject to
 * internal cache-flush hazards, but after you've done a few thousand
 * recursive reloads it's unlikely you'll learn more.
 */

regards
Tomas




Re: buildfarm animals and 'snapshot too old'

From:
Andrew Dunstan
Date:
On 05/15/2014 03:57 PM, Tomas Vondra wrote:
>> How long does a CLOBBER_CACHE_RECURSIVELY run take? days or weeks seems
>> kinda nuts.
> I don't know. According to this comment from cache/inval.c, it's expected
> to be way slower (~100x) compared to CLOBBER_CACHE_ALWAYS.
>
> /*
>   * Test code to force cache flushes anytime a flush could happen.
>   *
>   * If used with CLOBBER_FREED_MEMORY, CLOBBER_CACHE_ALWAYS provides a
>   * fairly thorough test that the system contains no cache-flush hazards.
>   * However, it also makes the system unbelievably slow --- the regression
>   * tests take about 100 times longer than normal.
>   *
>   * If you're a glutton for punishment, try CLOBBER_CACHE_RECURSIVELY. This
>   * slows things by at least a factor of 10000, so I wouldn't suggest
>   * trying to run the entire regression tests that way.  It's useful to try
>   * a few simple tests, to make sure that cache reload isn't subject to
>   * internal cache-flush hazards, but after you've done a few thousand
>   * recursive reloads it's unlikely you'll learn more.
>   */
>

Yes, I've seen that. Frankly, a test that takes something like 500 hours 
is a bit crazy.

If we really want to run this in the buildfarm we should probably try to 
create a massively cut down test schedule for use in this case.

Incidentally, should the CLOBBER_CACHE_ALWAYS machines also be defining 
CLOBBER_FREED_MEMORY?

cheers

andrew




Re: buildfarm animals and 'snapshot too old'

From:
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> Incidentally, should the CLOBBER_CACHE_ALWAYS machines also be defining 
> CLOBBER_FREED_MEMORY?

The former does need the latter or it's not very thorough.  However,
CLOBBER_FREED_MEMORY is defined automatically by --enable-cassert,
so you shouldn't need to use a -D switch for it.
        regards, tom lane



Re: buildfarm animals and 'snapshot too old'

From:
Stefan Kaltenbrunner
Date:
On 05/15/2014 07:46 PM, Andrew Dunstan wrote:
> 
> On 05/15/2014 12:43 PM, Tomas Vondra wrote:
>> Hi all,
>>
>> today I got a few of errors like these (this one is from last week,
>> though):
>>
>>     Status Line: 493 snapshot too old: Wed May  7 04:36:57 2014 GMT
>>     Content:
>>     snapshot to old: Wed May  7 04:36:57 2014 GMT
>>
>> on the new buildfarm animals. I believe it was my mistake (incorrectly
>> configured local git mirror), but it got me thinking about how this will
>> behave with the animals running CLOBBER_CACHE_RECURSIVELY.
>>
>> If I understand the Perl code correctly, it does this:
>>
>> (1) update the repository
>> (2) run the tests
>> (3) check that the snapshot is not older than 24 hours (pgstatus.pl:188)
>> (4) fail if older
>>
>> Now, imagine that the test runs for days/weeks. This pretty much means
>> it's wasted, because the results will be thrown away anyway, no?
>>
> 
> 
> The 24 hours runs from the time of the latest commit on the branch in
> question, not the current time, but basically yes.
> 
> We've never had machines with runs that long. The longest in recent
> times has been friarbird, which runs CLOBBER_CACHE_ALWAYS and takes
> around 4.5 hours. But we have had misconfigured machines reporting
> unbelievable snapshot times.  I'll take a look and see if we can tighten
> up the sanity check. It's worth noting that one thing friarbird does is
> skip the install-check stage - it's almost certainly not going to have
> terribly much interesting to tell us from that, given it has already run
> a plain "make check".

Well, I'm not sure about "misconfigured", but both my personal
buildfarm members and the pginfra-run ones (like gaibasaurus) have
gotten "snapshot too old" errors in the past for long-running tests,
so I'm not sure it is really a case of "we never had machines with runs
that long". So maybe we should not reject those submissions at
submission time but rather mark them clearly on the dashboard and leave
the final interpretation to a human...




Stefan



Re: buildfarm animals and 'snapshot too old'

From:
Andrew Dunstan
Date:
On 05/15/2014 04:30 PM, Stefan Kaltenbrunner wrote:
> On 05/15/2014 07:46 PM, Andrew Dunstan wrote:
>> On 05/15/2014 12:43 PM, Tomas Vondra wrote:
>>> Hi all,
>>>
>>> today I got a few of errors like these (this one is from last week,
>>> though):
>>>
>>>      Status Line: 493 snapshot too old: Wed May  7 04:36:57 2014 GMT
>>>      Content:
>>>      snapshot to old: Wed May  7 04:36:57 2014 GMT
>>>
>>> on the new buildfarm animals. I believe it was my mistake (incorrectly
>>> configured local git mirror), but it got me thinking about how this will
>>> behave with the animals running CLOBBER_CACHE_RECURSIVELY.
>>>
>>> If I understand the Perl code correctly, it does this:
>>>
>>> (1) update the repository
>>> (2) run the tests
>>> (3) check that the snapshot is not older than 24 hours (pgstatus.pl:188)
>>> (4) fail if older
>>>
>>> Now, imagine that the test runs for days/weeks. This pretty much means
>>> it's wasted, because the results will be thrown away anyway, no?
>>>
>>
>> The 24 hours runs from the time of the latest commit on the branch in
>> question, not the current time, but basically yes.
>>
>> We've never had machines with runs that long. The longest in recent
>> times has been friarbird, which runs CLOBBER_CACHE_ALWAYS and takes
>> around 4.5 hours. But we have had misconfigured machines reporting
>> unbelievable snapshot times.  I'll take a look and see if we can tighten
>> up the sanity check. It's worth noting that one thing friarbird does is
>> skip the install-check stage - it's almost certainly not going to have
>> terribly much interesting to tell us from that, given it has already run
>> a plain "make check".
> well I'm not sure about about "misconfigured" but both my personal
> buildfarm members and pginfra run ones (like gaibasaurus) got errors
> complaining about "snapshot too old" in the past for long running tests
> so I'm not sure it is really a "we never had machine with runs that
> long". So maybe we should not reject those submissions at submission
> time but rather mark them clearly on the dashboard and leave the final
> interpretation to a human...
>
>


That's a LOT harder and more work to arrange. Frankly, there are more 
important things to do.

I would like to know the circumstances of these very long runs. I drive 
some of my VMs pretty hard on pretty modest hardware, and they don't 
come close to running 24 hours.

The current behaviour goes back to this commit from December 2011:

   commit a8b5049e64f9cb08f8e165d0737139dab74e3bce
   Author: Andrew Dunstan <andrew@dunslane.net>
   Date:   Wed Dec 14 14:38:44 2011 -0800

        Use git snapshot instead of fixed 10 day timeout.

        The sanity checks made sure that an animal wasn't submitting a
        snapshot that was too old. But sometimes an old branch doesn't
        get any changes for more than 10 days. So accept a snapshot that
        is not more than 1 day older than the last known snapshot. Per
        complaint from Stefan.


I'm prepared to increase the sanity check time if there is a serious 
demand for it, but I'd like to know what to increase it to.

cheers

andrew






Re: buildfarm animals and 'snapshot too old'

From:
Tomas Vondra
Date:
On 15.5.2014 22:56, Andrew Dunstan wrote:
> 
> On 05/15/2014 04:30 PM, Stefan Kaltenbrunner wrote:
>
>> well I'm not sure about about "misconfigured" but both my personal
>> buildfarm members and pginfra run ones (like gaibasaurus) got errors
>> complaining about "snapshot too old" in the past for long running tests
>> so I'm not sure it is really a "we never had machine with runs that
>> long". So maybe we should not reject those submissions at submission
>> time but rather mark them clearly on the dashboard and leave the final
>> interpretation to a human...
>>
> 
> That's a LOT harder and more work to arrange. Frankly, there are more
> important things to do.
> 
> I would like to know the circumstances of these very long runs. I drive
> some of my VMs pretty hard on pretty modest hardware, and they don't
> come close to running 24 hours.
> 
> The current behaviour goes back to this commit from December 2011:
> 
>    commit a8b5049e64f9cb08f8e165d0737139dab74e3bce
>    Author: Andrew Dunstan <andrew@dunslane.net>
>    Date:   Wed Dec 14 14:38:44 2011 -0800
> 
>         Use git snapshot instead of fixed 10 day timeout.
> 
>         The sanity checks made sure that an animal wasn't submitting a
>         snapshot that was too old. But sometimes an old branch doesn't
>         get any changes for more than 10 days. So accept a snapshot that
>         is not more than 1 day older than the last known snapshot. Per
>         complaint from Stefan.
> 
> 
> I'm prepared to increase the sanity check time if there is a serious
> demand for it, but I'd like to know what to increase it to.

I doubt there's a "one size fits all" limit. If the machines running
"recursive clobber" tests need tens of days to complete the tests, then
a limit long enough for them is pretty useless to most regular animals.

So what about keeping the current value for most animals, but allowing
an override for some selected ones? I'd expect this to be much simpler
to implement, and it shouldn't require any human intervention.
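
Purely as a sketch of the idea (pgstatus.pl has no such table today, and
all of the names below are invented):

    use strict;
    use warnings;

    # hypothetical per-animal override of the snapshot age limit
    my %snapshot_age_limit = (
        default          => 24 * 60 * 60,        # 24 hours for ordinary animals
        some_slow_animal => 30 * 24 * 60 * 60,   # e.g. a CLOBBER_CACHE_RECURSIVELY animal
    );

    sub age_limit_for {
        my ($animal) = @_;
        return $snapshot_age_limit{$animal} // $snapshot_age_limit{default};
    }

    printf "limit for %s: %d seconds\n", $_, age_limit_for($_)
        for qw(some_slow_animal leech);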

Tomas



Re: buildfarm animals and 'snapshot too old'

From:
Tomas Vondra
Date:
On 15.5.2014 22:07, Andrew Dunstan wrote:
> 
> Yes, I've seen that. Frankly, a test that takes something like 500 
> hours is a bit crazy.

Maybe. It certainly is not a test people will use during development.
But if it can detect some hard-to-find errors in the code that might
possibly lead to serious problems, then +1 from me to running it on at
least one animal. 500 hours is ~3 weeks, which is not that bad IMHO.

Also, once you know where it fails, the developer can run just that
single test (which might take minutes/hours, but not days).

> If we really want to run this in the buildfarm we should probably
> try to create a massively cut down test schedule for use in this
> case.

If we can cut this down in a meaningful way (i.e. without
sacrificing most of the benefits) then sure - let's do that. But I think
that's what CLOBBER_CACHE_ALWAYS is about.

regards
Tomas



Re: buildfarm animals and 'snapshot too old'

From:
Andrew Dunstan
Date:
On 05/15/2014 07:47 PM, Tomas Vondra wrote:
> On 15.5.2014 22:07, Andrew Dunstan wrote:
>> Yes, I've seen that. Frankly, a test that takes something like 500
>> hours is a bit crazy.
> Maybe. It certainly is not a test people will use during development.
> But if it can detect some hard-to-find errors in the code, that might
> possibly lead to serious problems, then +1 from me to run them at least
> on one animal. 500 hours is ~3 weeks, which is not that bad IMHO.
>
> Also, once you know where it fails the developer can run just that
> single test (which might take minutes/hours, but not days).



I have made a change that omits the snapshot sanity check for 
CLOBBER_CACHE_RECURSIVELY cases, but keeps it for all others. See 
<https://github.com/PGBuildFarm/server-code/commit/abd946918279b7683056a4fc3156415ef31a4675>

cheers

andrew





Re: buildfarm animals and 'snapshot too old'

From:
Tomas Vondra
Date:
On 17.5.2014 19:58, Andrew Dunstan wrote:
> 
> On 05/15/2014 07:47 PM, Tomas Vondra wrote:
>> On 15.5.2014 22:07, Andrew Dunstan wrote:
>>> Yes, I've seen that. Frankly, a test that takes something like 500
>>> hours is a bit crazy.
>> Maybe. It certainly is not a test people will use during development.
>> But if it can detect some hard-to-find errors in the code, that might
>> possibly lead to serious problems, then +1 from me to run them at least
>> on one animal. 500 hours is ~3 weeks, which is not that bad IMHO.
>>
>> Also, once you know where it fails the developer can run just that
>> single test (which might take minutes/hours, but not days).
> 
> 
> 
> I have made a change that omits the snapshot sanity check for
> CLOBBER_CACHE_RECURSIVELY cases, but keeps it for all others. See
> <https://github.com/PGBuildFarm/server-code/commit/abd946918279b7683056a4fc3156415ef31a4675>

OK, thanks. Seems reasonable.

Tomas




Re: buildfarm animals and 'snapshot too old'

From:
Tomas Vondra
Date:
On 17.5.2014 22:35, Tomas Vondra wrote:
> On 17.5.2014 19:58, Andrew Dunstan wrote:
>>
>> On 05/15/2014 07:47 PM, Tomas Vondra wrote:
>>> On 15.5.2014 22:07, Andrew Dunstan wrote:
>>>> Yes, I've seen that. Frankly, a test that takes something like 500
>>>> hours is a bit crazy.
>>> Maybe. It certainly is not a test people will use during development.
>>> But if it can detect some hard-to-find errors in the code, that might
>>> possibly lead to serious problems, then +1 from me to run them at least
>>> on one animal. 500 hours is ~3 weeks, which is not that bad IMHO.
>>>
>>> Also, once you know where it fails the developer can run just that
>>> single test (which might take minutes/hours, but not days).
>>
>>
>>
>> I have made a change that omits the snapshot sanity check for
>> CLOBBER_CACHE_RECURSIVELY cases, but keeps it for all others. See
>> <https://github.com/PGBuildFarm/server-code/commit/abd946918279b7683056a4fc3156415ef31a4675>
> 
> OK, thanks. Seems reasonable.

Seems we're still running into this on the CLOBBER_CACHE_ALWAYS animals.
The problem is that the git mirror is refreshed only at the very
beginning, and while a single branch does not exceed the limit, all the
branches together do.

Could this be solved by keeping a local mirror, without a mirror in the
build root? I mean, something like this:
   git_keep_mirror => 0,
   scmrepo => '/path/to/local/mirror'

And of course a cron script updating the mirror every hour or so.
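
Spelled out a little more, something like this (option names as I
understand build-farm.conf; the cron line is only an example):

    # build-farm.conf: point the client at a shared, pre-fetched local mirror
    scm             => 'git',
    scmrepo         => '/path/to/local/mirror',
    git_keep_mirror => 0,    # don't keep a second mirror in the build root

    # and, outside the client, a cron entry keeping the mirror fresh, e.g.:
    #   0 * * * *  cd /path/to/local/mirror && git fetch --all --quiet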

regards
Tomas



Re: buildfarm animals and 'snapshot too old'

From:
Andrew Dunstan
Date:
On 05/19/2014 03:40 PM, Tomas Vondra wrote:
> On 17.5.2014 22:35, Tomas Vondra wrote:
>> On 17.5.2014 19:58, Andrew Dunstan wrote:
>>> On 05/15/2014 07:47 PM, Tomas Vondra wrote:
>>>> On 15.5.2014 22:07, Andrew Dunstan wrote:
>>>>> Yes, I've seen that. Frankly, a test that takes something like 500
>>>>> hours is a bit crazy.
>>>> Maybe. It certainly is not a test people will use during development.
>>>> But if it can detect some hard-to-find errors in the code, that might
>>>> possibly lead to serious problems, then +1 from me to run them at least
>>>> on one animal. 500 hours is ~3 weeks, which is not that bad IMHO.
>>>>
>>>> Also, once you know where it fails the developer can run just that
>>>> single test (which might take minutes/hours, but not days).
>>>
>>>
>>> I have made a change that omits the snapshot sanity check for
>>> CLOBBER_CACHE_RECURSIVELY cases, but keeps it for all others. See
>>> <https://github.com/PGBuildFarm/server-code/commit/abd946918279b7683056a4fc3156415ef31a4675>
>> OK, thanks. Seems reasonable.
> Seems we're still running into this on the CLOBBER_CACHE_ALWAYS animals.
> The problem is that the git mirror is refreshed only at the very
> beginning, and while a single branch does not exceed the limit, all the
> branches do.
>
> Could this be solved by keeping a local mirror, without a mirror in the
> build root? I mean, something like this:
>
>      git_keep_mirror => 0,
>      scmrepo => '/path/to/local/mirror'
>
> And of course a cron script updating the mirror every hour or so.
>

No, the git mirror should be refreshed at the start of each branch 
build. It's not done at all by run_branches.pl. So your premise is 
false. This should only happen if the actual run takes more than 24 hours.

cheers

andrew




Re: buildfarm animals and 'snapshot too old'

From:
Tomas Vondra
Date:
On 19.5.2014 23:04, Andrew Dunstan wrote:
>
> On 05/19/2014 03:40 PM, Tomas Vondra wrote:
>> On 17.5.2014 22:35, Tomas Vondra wrote:
>>> On 17.5.2014 19:58, Andrew Dunstan wrote:
>>>> On 05/15/2014 07:47 PM, Tomas Vondra wrote:
>>>>> On 15.5.2014 22:07, Andrew Dunstan wrote:
>>>>>> Yes, I've seen that. Frankly, a test that takes something like 500
>>>>>> hours is a bit crazy.
>>>>> Maybe. It certainly is not a test people will use during development.
>>>>> But if it can detect some hard-to-find errors in the code, that might
>>>>> possibly lead to serious problems, then +1 from me to run them at
>>>>> least
>>>>> on one animal. 500 hours is ~3 weeks, which is not that bad IMHO.
>>>>>
>>>>> Also, once you know where it fails the developer can run just that
>>>>> single test (which might take minutes/hours, but not days).
>>>>
>>>>
>>>> I have made a change that omits the snapshot sanity check for
>>>> CLOBBER_CACHE_RECURSIVELY cases, but keeps it for all others. See
>>>> <https://github.com/PGBuildFarm/server-code/commit/abd946918279b7683056a4fc3156415ef31a4675>
>>>>
>>> OK, thanks. Seems reasonable.
>> Seems we're still running into this on the CLOBBER_CACHE_ALWAYS animals.
>> The problem is that the git mirror is refreshed only at the very
>> beginning, and while a single branch does not exceed the limit, all the
>> branches do.
>>
>> Could this be solved by keeping a local mirror, without a mirror in the
>> build root? I mean, something like this:
>>
>>      git_keep_mirror => 0,
>>      scmrepo => '/path/to/local/mirror'
>>
>> And of course a cron script updating the mirror every hour or so.
>>
>
> No, the git mirror should be refreshed at the start of each branch
> build. It's not done at all by run_branches.pl. So your premise is
> false. This should only happen if the actual run takes more than 24 hours.

OK. I think I understand what's wrong. This is a summary of log from
'leech' (attached is the full log):

========================================================================
Sat May 17 23:43:24 2014: buildfarm run for leech:REL8_4_STABLE starting
[23:43:24] checking out source ...
...
[04:32:47] OK
Sun May 18 04:32:51 2014: buildfarm run for leech:REL9_0_STABLE starting
[04:32:51] checking out source ...
...
[08:58:57] OK
Sun May 18 08:59:01 2014: buildfarm run for leech:REL9_1_STABLE starting
[08:59:01] checking out source ...
...
[14:09:08] OK
Sun May 18 14:09:12 2014: buildfarm run for leech:REL9_2_STABLE starting
[14:09:12] checking out source ...
...
[00:13:59] OK
Mon May 19 00:14:04 2014: buildfarm run for leech:REL9_3_STABLE starting
[00:14:04] checking out source ...
...
[14:26:29] OK
Query for: stage=OK&animal=leech&ts=1400451244
Target:
http://www.pgbuildfarm.org/cgi-bin/pgstatus.pl/28116b975a4186275d83f7e0f5c3fc92b1a75e85
Status Line: 493 snapshot too old: Fri May 16 07:05:50 2014 GMT
Content:
snapshot to old: Fri May 16 07:05:50 2014 GMT

Web txn failed with status: 1
Mon May 19 14:26:31 2014: buildfarm run for leech:HEAD starting
[14:26:31] checking out source ...
...
[20:07:23] checking test-decoding
Buildfarm member leech failed on HEAD stage test-decoding-check
========================================================================

So the REL9_3_STABLE starts at 00:14, completes at 14:26 (i.e. 14h
runtime), but fails with 'snapshot too old'.

These are the commits for each branch:

REL8_4_STABLE => 1dd0b3eeccaffd33b9c970a91c53fe42692ce8c2 (May 15, 2014)
REL9_0_STABLE => 0fc94340753f19da8acca5fc53039adbf2fa3632 (May 16, 2014)
REL9_1_STABLE => 39b3739c05688b5cd5d5da8c52fa5476304eff11 (May 16, 2014)
REL9_2_STABLE => 0d4c75f4de68012bb6f3bc52ebb58234334259d2 (May 16, 2014)
REL9_3_STABLE => d6a9767404cfee7f037a58e445b601af5837e4a5 (May 16, 2014)
HEAD          => f097d70b7227c1f9aa2ed0af1d6deb05367ef657 (May 19, 2014)

IMHO the problem is that d6a97674 was the last revision in the
REL9_3_STABLE branch when the test started (00:14), but at 06:06
777d07d7 got committed. So the check at the end failed, because the
tested revision was suddenly ~2 days over the limit.

This seems wrong to me, because even a very fast run that happens to
span the commit (e.g. starting at 06:00 and finishing at 06:10) would
fail in exactly the same way.

This is more probable on the old stable branches, because the commits
are not that frequent (on HEAD the commits are usually only a few hours
apart, so a new one rarely makes the previous one too old). The long
runtime also makes it more likely to hit, because it increases the
probability that something will be committed to the branch while the
run is in progress. And it makes the failure more "expensive", because
it effectively throws all that CPU time into /dev/null.
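
To put rough numbers on it (a sketch of the effect, not the exact
comparison pgstatus.pl makes):

    use strict;
    use warnings;
    use Time::Local qw(timegm);

    # last commit contained in the tested snapshot (d6a97674): May 16 07:05:50 GMT
    my $snapshot_ts   = timegm(50, 5, 7, 16, 4, 2014);
    # 777d07d7, committed May 19 ~06:06 GMT, while the run was still going
    my $newest_commit = timegm(0, 6, 6, 19, 4, 2014);
    my $max_age       = 24 * 60 * 60;

    # by report time the snapshot is ~71 hours behind the newest commit,
    # so it fails even though it was perfectly current when the run started
    print "493 snapshot too old\n" if $newest_commit - $snapshot_ts > $max_age;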

regards
Tomas

Attachments

Re: buildfarm animals and 'snapshot too old'

From:
Andrew Dunstan
Date:
On 05/19/2014 05:37 PM, Tomas Vondra wrote:

> IMHO the problem is that d6a97674 was the last revision in the
> REL9_3_STABLE branch when the test started (00:14), but at 06:06
> 777d07d7 got committed. So the check at the end failed, because the
> tested revision was suddenly ~2 days over the limit.
>
> This seems wrong to me, because even a very fast test including the
> commit (e.g. starting at 06:00, finishing at 06:10) would fail exactly
> like this.
>
> This is more probable on the old stable branches, because the commits
> are not that frequent (on HEAD the commits are usually less than a few
> hours apart, so the new one won't obsolete the previous one). It's also
> made more likely to hit by the long runtime, because it increases the
> probability something will be committed into the branch. And it also
> makes it more "expensive" because it effectively throws all the cpu time
> to /dev/null.
>
>


Well, the original code was put in for a reason, presumably that we were 
getting some stale data and wanted to exclude it. So I'm unwilling to 
throw it out altogether. If someone can propose a reasonable sanity 
check then I'm prepared to implement it.

cheers

andrew



Re: buildfarm animals and 'snapshot too old'

From:
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> Well, the original code was put in for a reason, presumably that we were 
> getting some stale data and wanted to exclude it. So I'm unwilling to 
> throw it out altogether. If someone can propose a reasonable sanity 
> check then I'm prepared to implement it.

Would it be practical to make the timeout configurable on a per-animal
basis?  A small number of hours is certainly sufficient for most of the
critters, but for ones that are particularly slow or are doing CLOBBER
type tests, we could raise the limit.
        regards, tom lane



Re: buildfarm animals and 'snapshot too old'

From:
Robert Haas
Date:
On Mon, May 19, 2014 at 7:58 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
> Well, the original code was put in for a reason, presumably that we were
> getting some stale data and wanted to exclude it. So I'm unwilling to throw
> it out altogether. If someone can propose a reasonable sanity check then I'm
> prepared to implement it.

While I generally agree that long-established code shouldn't be
changed for light or transient causes, I have to admit I'm pretty
skeptical about this particular instance.  I can't think of any
particularly compelling reason why it's BAD for an old result to show
up.  We now show the commit ID on the main page, so if you see 512abc4
in the middle of a bunch of ef9ab5f's, you'll notice.  And if you
don't notice, so what?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: buildfarm animals and 'snapshot too old'

From:
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, May 19, 2014 at 7:58 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
>> Well, the original code was put in for a reason, presumably that we were
>> getting some stale data and wanted to exclude it. So I'm unwilling to throw
>> it out altogether. If someone can propose a reasonable sanity check then I'm
>> prepared to implement it.

> While I generally agree that long-established code shouldn't be
> changed for light or transient causes, I have to admit I'm pretty
> skeptical about this particular instance.  I can't think of any
> particularly compelling reason why it's BAD for an old result to show
> up.  We now show the commit ID on the main page, so if you see 512abc4
> in the middle of a bunch of ef9ab5f's, you'll notice.  And if you
> don't notice, so what?

Robert's got a point here.  In my usage, the annoying thing is not animals
that take a long time to report in; it's the ones that lie about the
snapshot time (which is how you get "512abc4 in the middle of a bunch of
ef9ab5f's").  That is an issue of incorrect system clock, not of how long
it takes to do the run.  I wonder if the buildfarm script could be taught
to get the timestamp from an NTP server somewhere?  Or at least
sanity-check the system clock reading by comparing it to the newest commit
timestamp in the git repo.
        regards, tom lane



Re: buildfarm animals and 'snapshot too old'

From:
Andrew Dunstan
Date:
On 05/20/2014 07:09 AM, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Mon, May 19, 2014 at 7:58 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
>>> Well, the original code was put in for a reason, presumably that we were
>>> getting some stale data and wanted to exclude it. So I'm unwilling to throw
>>> it out altogether. If someone can propose a reasonable sanity check then I'm
>>> prepared to implement it.
>> While I generally agree that long-established code shouldn't be
>> changed for light or transient causes, I have to admit I'm pretty
>> skeptical about this particular instance.  I can't think of any
>> particularly compelling reason why it's BAD for an old result to show
>> up.  We now show the commit ID on the main page, so if you see 512abc4
>> in the middle of a bunch of ef9ab5f's, you'll notice.  And if you
>> don't notice, so what?
> Robert's got a point here.  In my usage, the annoying thing is not animals
> that take a long time to report in; it's the ones that lie about the
> snapshot time (which is how you get "512abc4 in the middle of a bunch of
> ef9ab5f's").  That is an issue of incorrect system clock, not of how long
> it takes to do the run.  I wonder if the buildfarm script could be taught
> to get the timestamp from an NTP server somewhere?  Or at least
> sanity-check the system clock reading by comparing it to the newest commit
> timestamp in the git repo.


Showing the commit id is a relatively recent phenomenon, dating back to 
July 2013. I agree with Robert that it might obsolete this check, so I 
have disabled it for now. I have also disabled the other timestamp 
check, on the time the client actually took the snapshot (as opposed to 
the time of the last commit in the snapshot) for the 
CLOBBER_CACHE_RECURSIVELY case.

Regarding clock skew, I think we can do better than what you suggest. 
The web transaction code in the client adds its own timestamp just 
before running the web transaction. It would be quite reasonable to 
reject reports from machines with skewed clocks based on this value. I'm 
not sure what a reasonable skew might be. Somewhere in the range of 5 to 
15 minutes seems reasonable.
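
As a sketch of what I mean (hypothetical; the real report fields are
probably named differently):

    use strict;
    use warnings;

    # $client_web_ts is the timestamp the client stamps just before the
    # web transaction; compare it with the server clock on receipt
    my $max_skew      = 15 * 60;           # say, 15 minutes
    my $client_web_ts = time() - 3600;     # example: a client whose clock is an hour slow

    if (abs(time() - $client_web_ts) > $max_skew) {
        print "rejecting report: client clock skewed by more than $max_skew seconds\n";
    }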

cheers

andrew



Re: buildfarm animals and 'snapshot too old'

From:
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> On 05/20/2014 07:09 AM, Tom Lane wrote:
>> Robert's got a point here.  In my usage, the annoying thing is not animals
>> that take a long time to report in; it's the ones that lie about the
>> snapshot time (which is how you get "512abc4 in the middle of a bunch of
>> ef9ab5f's").  That is an issue of incorrect system clock, not of how long
>> it takes to do the run.  I wonder if the buildfarm script could be taught
>> to get the timestamp from an NTP server somewhere?  Or at least
>> sanity-check the system clock reading by comparing it to the newest commit
>> timestamp in the git repo.

> Regarding clock skew, I think we can do better then what you suggest. 
> The web transaction code in the client adds its own timestamp just 
> before running the web transaction. It would be quite reasonable to 
> reject reports from machines with skewed clocks based on this value. I'm 
> not sure what a reasonable skew might be. Somewhere in the range of 5 to 
> 15 minutes seems reasonable.

Rather than reject, why not take the result and adjust the claimed start
timestamp by the difference between the web transaction timestamp and the
buildfarm server's time?
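
Roughly, as a sketch with invented field names:

    use strict;
    use warnings;

    # shift the claimed timestamps by the observed client/server clock difference
    my $server_now    = time();
    my $client_web_ts = $server_now - 600;      # example: a client clock 10 minutes slow
    my $skew          = $server_now - $client_web_ts;

    my %report = (
        snapshot_ts => $client_web_ts - 7200,   # claimed snapshot time, on the client's clock
        start_ts    => $client_web_ts - 7000,   # claimed run start, on the client's clock
    );
    $report{$_} += $skew for qw(snapshot_ts start_ts);   # now on the server's clock
    printf "adjusted snapshot time: %s GMT\n", scalar gmtime $report{snapshot_ts};
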
        regards, tom lane



Re: buildfarm animals and 'snapshot too old'

From:
Gavin Flower
Date:
<div class="moz-cite-prefix">On 21/05/14 01:42, Tom Lane wrote:<br /></div><blockquote
cite="mid:24480.1400593335@sss.pgh.pa.us"type="cite"><pre wrap="">Andrew Dunstan <a class="moz-txt-link-rfc2396E"
href="mailto:andrew@dunslane.net"><andrew@dunslane.net></a>writes:
 
</pre><blockquote type="cite"><pre wrap="">On 05/20/2014 07:09 AM, Tom Lane wrote:
</pre><blockquote type="cite"><pre wrap="">Robert's got a point here.  In my usage, the annoying thing is not animals
that take a long time to report in; it's the ones that lie about the
snapshot time (which is how you get "512abc4 in the middle of a bunch of
ef9ab5f's").  That is an issue of incorrect system clock, not of how long
it takes to do the run.  I wonder if the buildfarm script could be taught
to get the timestamp from an NTP server somewhere?  Or at least
sanity-check the system clock reading by comparing it to the newest commit
timestamp in the git repo.
</pre></blockquote></blockquote><pre wrap="">
</pre><blockquote type="cite"><pre wrap="">Regarding clock skew, I think we can do better then what you suggest. 
The web transaction code in the client adds its own timestamp just 
before running the web transaction. It would be quite reasonable to 
reject reports from machines with skewed clocks based on this value. I'm 
not sure what a reasonable skew might be. Somewhere in the range of 5 to 
15 minutes seems reasonable.
</pre></blockquote><pre wrap="">
Rather than reject, why not take the result and adjust the claimed start
timestamp by the difference between the web transaction timestamp and the
buildfarm server's time?
        regards, tom lane


</pre></blockquote> I think, that if possible, any such adjustment should be noted along with the original time, so
that:<br/><ol><li>the timing issue can be remedied<li>it is possible to link the output to any messages in the machines
logetc.<br /></ol><br /> Cheers,<br /> Gavin<br /> 

Re: buildfarm animals and 'snapshot too old'

From:
Andrew Dunstan
Date:
On 05/20/2014 03:59 PM, Gavin Flower wrote:
> On 21/05/14 01:42, Tom Lane wrote:
>> Andrew Dunstan<andrew@dunslane.net>  writes:
>>> On 05/20/2014 07:09 AM, Tom Lane wrote:
>>>> Robert's got a point here.  In my usage, the annoying thing is not animals
>>>> that take a long time to report in; it's the ones that lie about the
>>>> snapshot time (which is how you get "512abc4 in the middle of a bunch of
>>>> ef9ab5f's").  That is an issue of incorrect system clock, not of how long
>>>> it takes to do the run.  I wonder if the buildfarm script could be taught
>>>> to get the timestamp from an NTP server somewhere?  Or at least
>>>> sanity-check the system clock reading by comparing it to the newest commit
>>>> timestamp in the git repo.
>>> Regarding clock skew, I think we can do better then what you suggest.
>>> The web transaction code in the client adds its own timestamp just
>>> before running the web transaction. It would be quite reasonable to
>>> reject reports from machines with skewed clocks based on this value. I'm
>>> not sure what a reasonable skew might be. Somewhere in the range of 5 to
>>> 15 minutes seems reasonable.
>> Rather than reject, why not take the result and adjust the claimed start
>> timestamp by the difference between the web transaction timestamp and the
>> buildfarm server's time?
>>
>>             
>>
> I think, that if possible, any such adjustment should be noted along 
> with the original time, so that:
>
>  1. the timing issue can be remedied
>  2. it is possible to link the output to any messages in the machines
>     log etc.
>


I don't see how that's going to help anyone. Major clock skew is a sign 
of client misconfiguration. And where would we note this adjustment, and 
who would do anything about it?

We seem to be engaging in a sort of PoohBearism* here. More information 
is not always better.

cheers

andrew



*   Rabbit: Would you like *condensed milk*, or *honey* on your bread?
    Winnie the Pooh: Both. But never mind the bread, please.




Re: buildfarm animals and 'snapshot too old'

From:
Andrew Dunstan
Date:
On 05/20/2014 09:42 AM, Tom Lane wrote:

>> Regarding clock skew, I think we can do better then what you suggest.
>> The web transaction code in the client adds its own timestamp just
>> before running the web transaction. It would be quite reasonable to
>> reject reports from machines with skewed clocks based on this value. I'm
>> not sure what a reasonable skew might be. Somewhere in the range of 5 to
>> 15 minutes seems reasonable.
> Rather than reject, why not take the result and adjust the claimed start
> timestamp by the difference between the web transaction timestamp and the
> buildfarm server's time?

Right now I have added some logging of clock skew to see what we're 
actually getting.

It won't be visible in the Web UI, but when I have some data I will post 
more info.
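
The logging amounts to something like this (a sketch, not the committed
code; the log path and field names are invented):

    use strict;
    use warnings;

    # append the observed skew to a log for later analysis
    my $client_web_ts = time() - 42;            # example value taken from a report
    my $skew          = time() - $client_web_ts;

    open my $log, '>>', '/tmp/clock_skew.log' or die "cannot open log: $!";
    printf {$log} "%s animal=%s skew=%ds\n", scalar gmtime, 'someanimal', $skew;
    close $log;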

cheers

andrew