Re: Recent 027_streaming_regress.pl hangs

Поиск
Список
Период
Сортировка
От Andres Freund
Тема Re: Recent 027_streaming_regress.pl hangs
Дата
Msg-id 20240326035607.grqoyrxjvpyhnkrf@awork3.anarazel.de
обсуждение исходный текст
Ответ на Re: Recent 027_streaming_regress.pl hangs  (Andres Freund <andres@anarazel.de>)
Ответы Re: Recent 027_streaming_regress.pl hangs  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-hackers
Hi,

On 2024-03-20 17:41:45 -0700, Andres Freund wrote:
> On 2024-03-14 16:56:39 -0400, Tom Lane wrote:
> > Also, this is probably not
> > helping anything:
> >
> >                    'extra_config' => {
> >                                                       ...
> >                                                       'fsync = on'
>
> At some point we had practically no test coverage of fsync, so I made my
> animals use fsync. I think we still have little coverage.  I probably could
> reduce the number of animals using it though.

I think there must be some actual regression involved. The frequency of
failures on HEAD vs failures on 16 - both of which run the tests concurrently
via meson - is just vastly different.  I'd expect the absolute number of
failures in 027_stream_regress.pl to differ between branches due to fewer runs
on 16, but there's no explanation for the difference in percentage of
failures. My menagerie had only a single recoveryCheck failure on !HEAD in the
last 30 days, but in the vicinity of 100 on HEAD
https://buildfarm.postgresql.org/cgi-bin/show_failures.pl?max_days=30&stage=recoveryCheck&filter=Submit


If anything the load when testing back branch changes is higher, because
commonly back-branch builds are happening on all branches, so I don't think
that can be the explanation either.

From what I can tell the pattern changed on 2024-02-16 19:39:02 - there was a
rash of recoveryCheck failures in the days before that too, but not
027_stream_regress.pl in that way.


It certainly seems suspicious that one commit before the first observed failure
is
2024-02-16 11:09:11 -0800 [73f0a132660] Pass correct count to WALRead().

Of course the failure rate is low enough that it could have been a day or two
before that, too.

Greetings,

Andres Freund



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: Teach predtest about IS [NOT] proofs
Следующее
От: shveta malik
Дата:
Сообщение: Re: Introduce XID age and inactive timeout based replication slot invalidation