Обсуждение: increased duration of stats_ext tests with -DCLOBBER_CACHE_ALWAYS

Поиск
Список
Период
Сортировка

increased duration of stats_ext tests with -DCLOBBER_CACHE_ALWAYS

От
Tomas Vondra
Дата:
Hi,

I have two buildfarm animals running tests with -DCLOBBER_CACHE_ALWAYS,
and both of them started having issues with timeouts

  Last file mtime in snapshot: Sat Nov 29 21:41:49 2025 GMT
  ===================================================
  timed out after 14400 secs

The animals are avocet and trilobite, and it's actually a single VM,
with one animal for gcc or clang.

It's fine that these tests take quite a bit of time, but I'm puzzled why
it started to fail. And it seems the stats_ext regression test got much
slower at the end of September.

The last couple runs from avocet have these durations:

11:44:42 ago      timedout [713d9a8] (01:54:23)     Details
7 days 23:33:36 ago      OK [bc32a12] (34:19:40)     Config
16 days 12:19:09 ago      OK [* 23792] (34:32:51)     Config
63 days 03:55:34 ago      OK [91df046] (31:14:46)     Config
71 days 10:57:06 ago      OK [9fc7f6a] (31:51:02)     Config
79 days 05:46:56 ago      OK [454c046] (19:35:25)     Config
87 days 00:36:55 ago      OK [06473f5] (19:31:16)     Config
94 days 19:50:32 ago      OK [f727b63] (19:05:05)     Config
102 days 14:43:13 ago      OK [ef03ea0] (19:09:05)     Config
...

So it was ~19h for a while (started at ~14h about 4y back). And then
between September 14 and 22 it jumped to ~32h. Which seems like a lot.

And that seems to be due to stats_ext changing from

  ok 157       + stats_ext                              855865 ms

to

  ok 157       + stats_ext                            15158426 ms

That's ~14300 seconds, i.e. ~4h. Pretty close to the timeout limit. The
machine is running the test with two locales, so there are multiple runs
of this regression test.

Any ideas what might be causing this? I see exactly the same issue on
the other machine (trilobite). There were no upgrades or other changes
around that time, and no other regression test has this issue.


regards

-- 
Tomas Vondra




Re: increased duration of stats_ext tests with -DCLOBBER_CACHE_ALWAYS

От
Tomas Vondra
Дата:
On 12/2/25 17:05, Tomas Vondra wrote:
> ...
> 
> So it was ~19h for a while (started at ~14h about 4y back). And then
> between September 14 and 22 it jumped to ~32h. Which seems like a lot.
> 
> And that seems to be due to stats_ext changing from
> 
>   ok 157       + stats_ext                              855865 ms
> 
> to
> 
>   ok 157       + stats_ext                            15158426 ms
> 
> That's ~14300 seconds, i.e. ~4h. Pretty close to the timeout limit. The
> machine is running the test with two locales, so there are multiple runs
> of this regression test.
> 

After bisecting this, it seems to have changed in this commit:

commit 1eccb93150707acfcc8f24556a15742a6313c8ac (HEAD -> stats-ext)
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Sat Sep 20 12:44:52 2025 -0400

    Re-allow using statistics for bool-valued functions in WHERE.

    Commit a391ff3c3, which added the ability for a function's support
    function to provide a custom selectivity estimate for "WHERE ...

And it seems to be "stuck" on the last few queries doing this:

  SELECT * FROM check_estimated_rows('SELECT * FROM sb_2 WHERE
  extstat_small(y)');

Attached is a perf-report for this, but the interesting part is:

  --99.51%--plpgsql_exec_function
        exec_toplevel_block.constprop.0
        exec_stmt_block
        exec_stmts
        exec_stmt_return (inlined)
        exec_eval_expr
        exec_eval_simple_expr (inlined)
        |
         --99.37%--SPI_plan_get_cached_plan
               GetCachedPlan
               |
               |--96.58%--RevalidateCachedQuery
               |      |
               |      |--96.00%--pg_analyze_and_rewrite_withcb
               |      |      parse_analyze_withcb
               |      |      |
               |      |       --96.00%--transformTopLevelStmt (inlined)

I'm not sure if this a problem in the code, or whether it's just that
the test case happens to be this expensive with cache clobbering. Maybe
I simply need to increase the timeout for this machine, so that it does
not fail after 4h? It'd be unfortunate, as the runs are already long,
and this pushes them over 24h ... :-(

regards

-- 
Tomas Vondra

Вложения

Re: increased duration of stats_ext tests with -DCLOBBER_CACHE_ALWAYS

От
Tom Lane
Дата:
Tomas Vondra <tomas@vondra.me> writes:
> On 12/2/25 17:05, Tomas Vondra wrote:
>> So it was ~19h for a while (started at ~14h about 4y back). And then
>> between September 14 and 22 it jumped to ~32h. Which seems like a lot.

> After bisecting this, it seems to have changed in this commit:

> commit 1eccb93150707acfcc8f24556a15742a6313c8ac (HEAD -> stats-ext)
> Author: Tom Lane <tgl@sss.pgh.pa.us>
> Date:   Sat Sep 20 12:44:52 2025 -0400
>     Re-allow using statistics for bool-valued functions in WHERE.

Ugh.

> Attached is a perf-report for this, but the interesting part is:
>   --99.51%--plpgsql_exec_function

Yeah, I can imagine that constantly flushing the cached plan for
that plpgsql function would be bad.  Let me see if I can reformulate
that test without using a plpgsql function --- right offhand, it's
not obvious why a built-in function wouldn't serve the purpose
just as well.

            regards, tom lane



Re: increased duration of stats_ext tests with -DCLOBBER_CACHE_ALWAYS

От
Tom Lane
Дата:
I wrote:
> Yeah, I can imagine that constantly flushing the cached plan for
> that plpgsql function would be bad.  Let me see if I can reformulate
> that test without using a plpgsql function --- right offhand, it's
> not obvious why a built-in function wouldn't serve the purpose
> just as well.

I pushed a change for this.  On my Mac laptop, it brings the time
for stats_ext with -DCLOBBER_CACHE_ALWAYS down to ~8 minutes, from
I-didn't-have-the-patience-to-wait-but-it-would-have-been-hours.

BTW, I noticed that neither avocet nor trilobite seem to have
'use_installcheck_parallel' enabled in their BF config files.
That results in the installcheck steps taking longer than the
check step, when they should be the same time or shorter.
You could shave several hours off the animals' runtime by
enabling that.

(I'm not really sure why that's not on by default ... if the
check steps are automatically parallelized, why not installcheck?)

            regards, tom lane



Re: increased duration of stats_ext tests with -DCLOBBER_CACHE_ALWAYS

От
Tomas Vondra
Дата:

On 12/3/25 19:33, Tom Lane wrote:
> I wrote:
>> Yeah, I can imagine that constantly flushing the cached plan for
>> that plpgsql function would be bad.  Let me see if I can reformulate
>> that test without using a plpgsql function --- right offhand, it's
>> not obvious why a built-in function wouldn't serve the purpose
>> just as well.
> 
> I pushed a change for this.  On my Mac laptop, it brings the time
> for stats_ext with -DCLOBBER_CACHE_ALWAYS down to ~8 minutes, from
> I-didn't-have-the-patience-to-wait-but-it-would-have-been-hours.
> 

Thanks!

> BTW, I noticed that neither avocet nor trilobite seem to have
> 'use_installcheck_parallel' enabled in their BF config files.
> That results in the installcheck steps taking longer than the
> check step, when they should be the same time or shorter.
> You could shave several hours off the animals' runtime by
> enabling that.
> 

Good idea, I enabled that for both animals.


-- 
Tomas Vondra




Re: increased duration of stats_ext tests with -DCLOBBER_CACHE_ALWAYS

От
Alexander Lakhin
Дата:
Hello Tomas,

03.12.2025 21:23, Tomas Vondra wrote:
On 12/3/25 19:33, Tom Lane wrote:
I wrote:
Yeah, I can imagine that constantly flushing the cached plan for
that plpgsql function would be bad.  Let me see if I can reformulate
that test without using a plpgsql function --- right offhand, it's
not obvious why a built-in function wouldn't serve the purpose
just as well.
I pushed a change for this.  On my Mac laptop, it brings the time
for stats_ext with -DCLOBBER_CACHE_ALWAYS down to ~8 minutes, from
I-didn't-have-the-patience-to-wait-but-it-would-have-been-hours.

Thanks!

That change decreased the test duration on master:
[1] ok 159       + stats_ext                              964154 ms
vs
[2] ok 157       + stats_ext                            16557572 ms

but the test run still failed (and also fails on other branches, where the
duration is moderate), probably because of the overall timeout changed
somehow.

The last good run on avocet, REL_15_STABLE [3] took 12:49:15, however all
the failed runs timed out in 4 hours (14400 seconds):
timedout [7e54eac] (03:59:27)
timedout [7792bdc] (03:59:53)
The difference between the last good run and the first bad one I see is:
'script_version' => 'REL_19_1',
vs
'script_version' => 'REL_20',
though I can't see timeout-related changes in [4], probably something was
changed in the environment during the upgrade...

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=avocet&dt=2025-12-05%2013%3A21%3A32
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=avocet&dt=2025-11-24%2016%3A21%3A44
[3[ https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=avocet&br=REL_15_STABLE
[4] https://github.com/PGBuildFarm/client-code/releases/tag/REL_20

Best regards,
Alexander

Re: increased duration of stats_ext tests with -DCLOBBER_CACHE_ALWAYS

От
Tomas Vondra
Дата:

On 12/6/25 10:00, Alexander Lakhin wrote:
> Hello Tomas,
> 
> 03.12.2025 21:23, Tomas Vondra wrote:
>> On 12/3/25 19:33, Tom Lane wrote:
>>> I wrote:
>>>> Yeah, I can imagine that constantly flushing the cached plan for
>>>> that plpgsql function would be bad.  Let me see if I can reformulate
>>>> that test without using a plpgsql function --- right offhand, it's
>>>> not obvious why a built-in function wouldn't serve the purpose
>>>> just as well.
>>> I pushed a change for this.  On my Mac laptop, it brings the time
>>> for stats_ext with -DCLOBBER_CACHE_ALWAYS down to ~8 minutes, from
>>> I-didn't-have-the-patience-to-wait-but-it-would-have-been-hours.
>>>
>> Thanks!
> 
> That change decreased the test duration on master:
> [1] ok 159       + stats_ext                              964154 ms
> vs
> [2] ok 157       + stats_ext                            16557572 ms
> 
> but the test run still failed (and also fails on other branches, where the
> duration is moderate), probably because of the overall timeout changed
> somehow.
> 
> The last good run on avocet, REL_15_STABLE [3] took 12:49:15, however all
> the failed runs timed out in 4 hours (14400 seconds):
> timedout [7e54eac] (03:59:27)
> timedout [7792bdc] (03:59:53)
> The difference between the last good run and the first bad one I see is:
> 'script_version' => 'REL_19_1',
> vs
> 'script_version' => 'REL_20',
> though I can't see timeout-related changes in [4], probably something was
> changed in the environment during the upgrade...
> 

Yeah, I noticed that too. But I have no idea what could have changed or
why - the only thing I updated is the buildfarm client :-(

I initially thought it might be due to setting

   use_installcheck_parallel => 1

but it still fails after reverting that change. I still have the
build-farm client v19.1 around, so I can try switching back to that.


regards

-- 
Tomas Vondra




Re: increased duration of stats_ext tests with -DCLOBBER_CACHE_ALWAYS

От
Tom Lane
Дата:
Tomas Vondra <tomas@vondra.me> writes:
> On 12/6/25 10:00, Alexander Lakhin wrote:
>> though I can't see timeout-related changes in [4], probably something was
>> changed in the environment during the upgrade...

> Yeah, I noticed that too. But I have no idea what could have changed or
> why - the only thing I updated is the buildfarm client :-(

Comparing the configuration dumps from avocet's last successful run
and its latest, I see that last successful shows

                  'wait_timeout' => 0
 
but the failing run shows

                  'wait_timeout' => undef

Per the comments in the sample config file:

    # max time in seconds allowed for a single branch run
    # undef means default, which is 4 hours. For unlimited specify 0
    wait_timeout => undef,

I'd actually recommend a setting of a couple of days for these
animals, rather than "0".  We've been known to get BF animals
into infinite loops, and if you use "0" then you will need to
clean up such mistakes manually.

            regards, tom lane



Re: increased duration of stats_ext tests with -DCLOBBER_CACHE_ALWAYS

От
Andrew Dunstan
Дата:
On 2025-12-06 Sa 7:04 AM, Tomas Vondra wrote:
>
> On 12/6/25 10:00, Alexander Lakhin wrote:
>> Hello Tomas,
>>
>> 03.12.2025 21:23, Tomas Vondra wrote:
>>> On 12/3/25 19:33, Tom Lane wrote:
>>>> I wrote:
>>>>> Yeah, I can imagine that constantly flushing the cached plan for
>>>>> that plpgsql function would be bad.  Let me see if I can reformulate
>>>>> that test without using a plpgsql function --- right offhand, it's
>>>>> not obvious why a built-in function wouldn't serve the purpose
>>>>> just as well.
>>>> I pushed a change for this.  On my Mac laptop, it brings the time
>>>> for stats_ext with -DCLOBBER_CACHE_ALWAYS down to ~8 minutes, from
>>>> I-didn't-have-the-patience-to-wait-but-it-would-have-been-hours.
>>>>
>>> Thanks!
>> That change decreased the test duration on master:
>> [1] ok 159       + stats_ext                              964154 ms
>> vs
>> [2] ok 157       + stats_ext                            16557572 ms
>>
>> but the test run still failed (and also fails on other branches, where the
>> duration is moderate), probably because of the overall timeout changed
>> somehow.
>>
>> The last good run on avocet, REL_15_STABLE [3] took 12:49:15, however all
>> the failed runs timed out in 4 hours (14400 seconds):
>> timedout [7e54eac] (03:59:27)
>> timedout [7792bdc] (03:59:53)
>> The difference between the last good run and the first bad one I see is:
>> 'script_version' => 'REL_19_1',
>> vs
>> 'script_version' => 'REL_20',
>> though I can't see timeout-related changes in [4], probably something was
>> changed in the environment during the upgrade...
>>
> Yeah, I noticed that too. But I have no idea what could have changed or
> why - the only thing I updated is the buildfarm client :-(
>
> I initially thought it might be due to setting
>
>     use_installcheck_parallel => 1
>
> but it still fails after reverting that change. I still have the
> build-farm client v19.1 around, so I can try switching back to that.
>
>

Nothing of significance to what is run changed between 19.1 and 20.

See https://github.com/PGBuildFarm/client-code/compare/REL_19_1...REL_20


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: increased duration of stats_ext tests with -DCLOBBER_CACHE_ALWAYS

От
Tomas Vondra
Дата:
On 12/6/25 17:08, Tom Lane wrote:
> Tomas Vondra <tomas@vondra.me> writes:
>> On 12/6/25 10:00, Alexander Lakhin wrote:
>>> though I can't see timeout-related changes in [4], probably something was
>>> changed in the environment during the upgrade...
> 
>> Yeah, I noticed that too. But I have no idea what could have changed or
>> why - the only thing I updated is the buildfarm client :-(
> 
> Comparing the configuration dumps from avocet's last successful run
> and its latest, I see that last successful shows
> 
>                   'wait_timeout' => 0
>  
> but the failing run shows
> 
>                   'wait_timeout' => undef
> 
> Per the comments in the sample config file:
> 
>     # max time in seconds allowed for a single branch run
>     # undef means default, which is 4 hours. For unlimited specify 0
>     wait_timeout => undef,
> 
> I'd actually recommend a setting of a couple of days for these
> animals, rather than "0".  We've been known to get BF animals
> into infinite loops, and if you use "0" then you will need to
> clean up such mistakes manually.
> 

I'll try, but that's just the default in v20 of the buildfarm client (I
always get the new .sample and update it with the necessary changes,
keeping all the defaults).

regards

-- 
Tomas Vondra