Discussion: increased duration of stats_ext tests with -DCLOBBER_CACHE_ALWAYS
Hi,

I have two buildfarm animals running tests with -DCLOBBER_CACHE_ALWAYS,
and both of them started having issues with timeouts:

    Last file mtime in snapshot: Sat Nov 29 21:41:49 2025 GMT
    ===================================================
    timed out after 14400 secs

The animals are avocet and trilobite, and it's actually a single VM,
with one animal for gcc and one for clang. It's fine that these tests
take quite a bit of time, but I'm puzzled why they started to fail. It
seems the stats_ext regression test got much slower at the end of
September. The last couple of runs from avocet have these durations:

    11:44:42 ago           timedout  [713d9a8]  (01:54:23)
    7 days 23:33:36 ago    OK        [bc32a12]  (34:19:40)
    16 days 12:19:09 ago   OK        [* 23792]  (34:32:51)
    63 days 03:55:34 ago   OK        [91df046]  (31:14:46)
    71 days 10:57:06 ago   OK        [9fc7f6a]  (31:51:02)
    79 days 05:46:56 ago   OK        [454c046]  (19:35:25)
    87 days 00:36:55 ago   OK        [06473f5]  (19:31:16)
    94 days 19:50:32 ago   OK        [f727b63]  (19:05:05)
    102 days 14:43:13 ago  OK        [ef03ea0]  (19:09:05)
    ...

So it was ~19h for a while (it started at ~14h about 4 years back), and
then between September 14 and 22 it jumped to ~32h, which seems like a
lot. That seems to be due to stats_ext changing from

    ok 157  + stats_ext       855865 ms

to

    ok 157  + stats_ext     15158426 ms

That's an increase of ~14300 seconds, i.e. ~4h, pretty close to the
timeout limit. The machine is running the tests with two locales, so
there are multiple runs of this regression test.

Any ideas what might be causing this? I see exactly the same issue on
the other machine (trilobite). There were no upgrades or other changes
around that time, and no other regression test has this issue.

regards

--
Tomas Vondra
On 12/2/25 17:05, Tomas Vondra wrote:
> ...
>
> So it was ~19h for a while (started at ~14h about 4y back). And then
> between September 14 and 22 it jumped to ~32h. Which seems like a lot.
>
> And that seems to be due to stats_ext changing from
>
> ok 157 + stats_ext 855865 ms
>
> to
>
> ok 157 + stats_ext 15158426 ms
>
> That's ~14300 seconds, i.e. ~4h. Pretty close to the timeout limit. The
> machine is running the test with two locales, so there are multiple runs
> of this regression test.
>
After bisecting this, it seems to have changed in this commit:
commit 1eccb93150707acfcc8f24556a15742a6313c8ac (HEAD -> stats-ext)
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sat Sep 20 12:44:52 2025 -0400
Re-allow using statistics for bool-valued functions in WHERE.
Commit a391ff3c3, which added the ability for a function's support
function to provide a custom selectivity estimate for "WHERE ...
And it seems to be "stuck" on the last few queries doing this:
SELECT * FROM check_estimated_rows('SELECT * FROM sb_2 WHERE extstat_small(y)');
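For readers unfamiliar with the test suite: check_estimated_rows is a
plpgsql helper that runs EXPLAIN ANALYZE on a query string and scrapes
the planner's estimated row count and the actual row count from the top
plan line. A rough sketch of how such a helper can be written (the exact
body in the regression tests may differ; the signature and regex here
are illustrative):

```sql
-- Sketch of an EXPLAIN-scraping helper in the style of the
-- regression suite's check_estimated_rows (not the exact test code).
CREATE FUNCTION check_estimated_rows(query text)
RETURNS TABLE (estimated int, actual int)
LANGUAGE plpgsql AS
$$
DECLARE
    ln  text;
    tmp text[];
BEGIN
    FOR ln IN EXECUTE format('EXPLAIN ANALYZE %s', query)
    LOOP
        -- The top plan node line looks like:
        --   Seq Scan on t  (cost=... rows=123 ...) (actual time=... rows=100 ...)
        tmp := regexp_match(ln, 'rows=(\d*) .* rows=(\d*)');
        IF tmp IS NOT NULL THEN
            RETURN QUERY SELECT tmp[1]::int, tmp[2]::int;
            EXIT;   -- only the first (topmost) plan line matters
        END IF;
    END LOOP;
END;
$$;
```

Because the helper's simple expression gets replanned on every call,
cache-clobbering builds revalidate its cached plan each time, which is
where the profile below spends its time.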
Attached is a perf-report for this, but the interesting part is:
--99.51%--plpgsql_exec_function
exec_toplevel_block.constprop.0
exec_stmt_block
exec_stmts
exec_stmt_return (inlined)
exec_eval_expr
exec_eval_simple_expr (inlined)
|
--99.37%--SPI_plan_get_cached_plan
GetCachedPlan
|
|--96.58%--RevalidateCachedQuery
| |
| |--96.00%--pg_analyze_and_rewrite_withcb
| | parse_analyze_withcb
| | |
| | --96.00%--transformTopLevelStmt (inlined)
I'm not sure if this is a problem in the code, or whether it's just that
the test case happens to be this expensive with cache clobbering. Maybe
I simply need to increase the timeout for this machine, so that it does
not fail after 4h? It'd be unfortunate, as the runs are already long,
and this pushes them over 24h ... :-(
regards
--
Tomas Vondra
Attachments
Tomas Vondra <tomas@vondra.me> writes:
> On 12/2/25 17:05, Tomas Vondra wrote:
>> So it was ~19h for a while (started at ~14h about 4y back). And then
>> between September 14 and 22 it jumped to ~32h. Which seems like a lot.
> After bisecting this, it seems to have changed in this commit:
> commit 1eccb93150707acfcc8f24556a15742a6313c8ac (HEAD -> stats-ext)
> Author: Tom Lane <tgl@sss.pgh.pa.us>
> Date: Sat Sep 20 12:44:52 2025 -0400
> Re-allow using statistics for bool-valued functions in WHERE.
Ugh.
> Attached is a perf-report for this, but the interesting part is:
> --99.51%--plpgsql_exec_function
Yeah, I can imagine that constantly flushing the cached plan for
that plpgsql function would be bad. Let me see if I can reformulate
that test without using a plpgsql function --- right offhand, it's
not obvious why a built-in function wouldn't serve the purpose
just as well.
regards, tom lane
I wrote:
> Yeah, I can imagine that constantly flushing the cached plan for
> that plpgsql function would be bad. Let me see if I can reformulate
> that test without using a plpgsql function --- right offhand, it's
> not obvious why a built-in function wouldn't serve the purpose
> just as well.
I pushed a change for this. On my Mac laptop, it brings the time
for stats_ext with -DCLOBBER_CACHE_ALWAYS down to ~8 minutes, from
I-didn't-have-the-patience-to-wait-but-it-would-have-been-hours.
BTW, I noticed that neither avocet nor trilobite seem to have
'use_installcheck_parallel' enabled in their BF config files.
That results in the installcheck steps taking longer than the
check step, when they should be the same time or shorter.
You could shave several hours off the animals' runtime by
enabling that.
(I'm not really sure why that's not on by default ... if the
check steps are automatically parallelized, why not installcheck?)
regards, tom lane
On 12/3/25 19:33, Tom Lane wrote:
> I wrote:
>> Yeah, I can imagine that constantly flushing the cached plan for
>> that plpgsql function would be bad. Let me see if I can reformulate
>> that test without using a plpgsql function --- right offhand, it's
>> not obvious why a built-in function wouldn't serve the purpose
>> just as well.
>
> I pushed a change for this. On my Mac laptop, it brings the time
> for stats_ext with -DCLOBBER_CACHE_ALWAYS down to ~8 minutes, from
> I-didn't-have-the-patience-to-wait-but-it-would-have-been-hours.
>

Thanks!

> BTW, I noticed that neither avocet nor trilobite seem to have
> 'use_installcheck_parallel' enabled in their BF config files.
> That results in the installcheck steps taking longer than the
> check step, when they should be the same time or shorter.
> You could shave several hours off the animals' runtime by
> enabling that.
>

Good idea, I enabled that for both animals.

--
Tomas Vondra
Hello Tomas,
03.12.2025 21:23, Tomas Vondra wrote:
> On 12/3/25 19:33, Tom Lane wrote:
>> I wrote:
>>> Yeah, I can imagine that constantly flushing the cached plan for
>>> that plpgsql function would be bad. Let me see if I can reformulate
>>> that test without using a plpgsql function --- right offhand, it's
>>> not obvious why a built-in function wouldn't serve the purpose
>>> just as well.
>> I pushed a change for this. On my Mac laptop, it brings the time
>> for stats_ext with -DCLOBBER_CACHE_ALWAYS down to ~8 minutes, from
>> I-didn't-have-the-patience-to-wait-but-it-would-have-been-hours.
> Thanks!
That change decreased the test duration on master:
[1] ok 159 + stats_ext 964154 ms
vs
[2] ok 157 + stats_ext 16557572 ms
but the test run still failed (and it also fails on other branches,
where the duration is moderate), probably because the overall timeout
changed somehow.

The last good run on avocet, REL_15_STABLE [3], took 12:49:15; however,
all the failed runs timed out after 4 hours (14400 seconds):
timedout [7e54eac] (03:59:27)
timedout [7792bdc] (03:59:53)
The difference between the last good run and the first bad one I see is:
'script_version' => 'REL_19_1',
vs
'script_version' => 'REL_20',
though I can't see timeout-related changes in [4], probably something was
changed in the environment during the upgrade...
[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=avocet&dt=2025-12-05%2013%3A21%3A32
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=avocet&dt=2025-11-24%2016%3A21%3A44
[3] https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=avocet&br=REL_15_STABLE
[4] https://github.com/PGBuildFarm/client-code/releases/tag/REL_20
Best regards,
Alexander
On 12/6/25 10:00, Alexander Lakhin wrote:
> Hello Tomas,
>
> 03.12.2025 21:23, Tomas Vondra wrote:
>> On 12/3/25 19:33, Tom Lane wrote:
>>> I wrote:
>>>> Yeah, I can imagine that constantly flushing the cached plan for
>>>> that plpgsql function would be bad. Let me see if I can reformulate
>>>> that test without using a plpgsql function --- right offhand, it's
>>>> not obvious why a built-in function wouldn't serve the purpose
>>>> just as well.
>>> I pushed a change for this. On my Mac laptop, it brings the time
>>> for stats_ext with -DCLOBBER_CACHE_ALWAYS down to ~8 minutes, from
>>> I-didn't-have-the-patience-to-wait-but-it-would-have-been-hours.
>>>
>> Thanks!
>
> That change decreased the test duration on master:
> [1] ok 159 + stats_ext 964154 ms
> vs
> [2] ok 157 + stats_ext 16557572 ms
>
> but the test run still failed (and also fails on other branches, where the
> duration is moderate), probably because of the overall timeout changed
> somehow.
>
> The last good run on avocet, REL_15_STABLE [3] took 12:49:15, however all
> the failed runs timed out in 4 hours (14400 seconds):
> timedout [7e54eac] (03:59:27)
> timedout [7792bdc] (03:59:53)
> The difference between the last good run and the first bad one I see is:
> 'script_version' => 'REL_19_1',
> vs
> 'script_version' => 'REL_20',
> though I can't see timeout-related changes in [4], probably something was
> changed in the environment during the upgrade...
>

Yeah, I noticed that too. But I have no idea what could have changed or
why - the only thing I updated is the buildfarm client :-(

I initially thought it might be due to setting

    use_installcheck_parallel => 1

but it still fails after reverting that change. I still have the
buildfarm client v19.1 around, so I can try switching back to that.

regards

--
Tomas Vondra
Tomas Vondra <tomas@vondra.me> writes:
> On 12/6/25 10:00, Alexander Lakhin wrote:
>> though I can't see timeout-related changes in [4], probably something was
>> changed in the environment during the upgrade...
> Yeah, I noticed that too. But I have no idea what could have changed or
> why - the only thing I updated is the buildfarm client :-(
Comparing the configuration dumps from avocet's last successful run
and its latest, I see that last successful shows
'wait_timeout' => 0
but the failing run shows
'wait_timeout' => undef
Per the comments in the sample config file:
# max time in seconds allowed for a single branch run
# undef means default, which is 4 hours. For unlimited specify 0
wait_timeout => undef,
I'd actually recommend a setting of a couple of days for these
animals, rather than "0". We've been known to get BF animals
into infinite loops, and if you use "0" then you will need to
clean up such mistakes manually.
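Following that recommendation, the fix is a one-line change in
build-farm.conf (the two-day value is an illustration of "a couple of
days", not a documented default):

```perl
# build-farm.conf: cap a single branch run at 2 days, rather than
# unlimited (0) or the 4-hour default (undef)
wait_timeout => 2 * 24 * 60 * 60,   # 172800 seconds
```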
regards, tom lane
On 2025-12-06 Sa 7:04 AM, Tomas Vondra wrote:
> On 12/6/25 10:00, Alexander Lakhin wrote:
>> Hello Tomas,
>>
>> 03.12.2025 21:23, Tomas Vondra wrote:
>>> On 12/3/25 19:33, Tom Lane wrote:
>>>> I wrote:
>>>>> Yeah, I can imagine that constantly flushing the cached plan for
>>>>> that plpgsql function would be bad. Let me see if I can reformulate
>>>>> that test without using a plpgsql function --- right offhand, it's
>>>>> not obvious why a built-in function wouldn't serve the purpose
>>>>> just as well.
>>>> I pushed a change for this. On my Mac laptop, it brings the time
>>>> for stats_ext with -DCLOBBER_CACHE_ALWAYS down to ~8 minutes, from
>>>> I-didn't-have-the-patience-to-wait-but-it-would-have-been-hours.
>>>>
>>> Thanks!
>> That change decreased the test duration on master:
>> [1] ok 159 + stats_ext 964154 ms
>> vs
>> [2] ok 157 + stats_ext 16557572 ms
>>
>> but the test run still failed (and also fails on other branches, where the
>> duration is moderate), probably because of the overall timeout changed
>> somehow.
>>
>> The last good run on avocet, REL_15_STABLE [3] took 12:49:15, however all
>> the failed runs timed out in 4 hours (14400 seconds):
>> timedout [7e54eac] (03:59:27)
>> timedout [7792bdc] (03:59:53)
>> The difference between the last good run and the first bad one I see is:
>> 'script_version' => 'REL_19_1',
>> vs
>> 'script_version' => 'REL_20',
>> though I can't see timeout-related changes in [4], probably something was
>> changed in the environment during the upgrade...
>>
> Yeah, I noticed that too. But I have no idea what could have changed or
> why - the only thing I updated is the buildfarm client :-(
>
> I initially thought it might be due to setting
>
>     use_installcheck_parallel => 1
>
> but it still fails after reverting that change. I still have the
> buildfarm client v19.1 around, so I can try switching back to that.
>

Nothing of significance to what is run changed between 19.1 and 20.
See https://github.com/PGBuildFarm/client-code/compare/REL_19_1...REL_20

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On 12/6/25 17:08, Tom Lane wrote:
> Tomas Vondra <tomas@vondra.me> writes:
>> On 12/6/25 10:00, Alexander Lakhin wrote:
>>> though I can't see timeout-related changes in [4], probably something was
>>> changed in the environment during the upgrade...
>
>> Yeah, I noticed that too. But I have no idea what could have changed or
>> why - the only thing I updated is the buildfarm client :-(
>
> Comparing the configuration dumps from avocet's last successful run
> and its latest, I see that last successful shows
>
> 'wait_timeout' => 0
>
> but the failing run shows
>
> 'wait_timeout' => undef
>
> Per the comments in the sample config file:
>
> # max time in seconds allowed for a single branch run
> # undef means default, which is 4 hours. For unlimited specify 0
> wait_timeout => undef,
>
> I'd actually recommend a setting of a couple of days for these
> animals, rather than "0". We've been known to get BF animals
> into infinite loops, and if you use "0" then you will need to
> clean up such mistakes manually.
>

I'll try, but that's just the default in v20 of the buildfarm client
(I always get the new .sample and update it with the necessary changes,
keeping all the defaults).

regards

--
Tomas Vondra