Thread: Recent 027_streaming_regress.pl hangs


Recent 027_streaming_regress.pl hangs

From: Thomas Munro
Date:
Hi,

Several animals are timing out while waiting for catchup,
sporadically.  I don't know why.  The oldest example I have found so
far by clicking around is:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-02-23%2015%3A44%3A35

So perhaps something committed ~3 weeks ago triggered this.

There are many examples since then, showing up as recoveryCheck failures.
Apparently they are all on animals wrangled by Andres.  Hmm.  I think
some or all of them share a physical host, they seem to have quite high
run-time variance, and they're using meson.



Re: Recent 027_streaming_regress.pl hangs

From: Thomas Munro
Date:
On Wed, Mar 13, 2024 at 10:53 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-02-23%2015%3A44%3A35

Assuming it is due to a commit in master, and given the failure
frequency, I think it is very likely to be a change from this 3 day
window of commits, and more likely in the top half dozen or so:

d360e3cc60e Fix compiler warning on typedef redeclaration
8af25652489 Introduce a new smgr bulk loading facility.
e612384fc78 Fix mistake in SQL features list
d13ff82319c Fix BF failure in commit 93db6cbda0.
efa70c15c74 Make GetSlotInvalidationCause() return RS_INVAL_NONE on
unexpected input
93db6cbda03 Add a new slot sync worker to synchronize logical slots.
3d47b75546d pgindent fix
b6df0798a5e Fix the intermittent buildfarm failures in 031_column_list.
fbc93b8b5f5 Remove custom Constraint node read/write implementations
801792e528d Improve ERROR/LOG messages added by commits ddd5f4f54a and
7a424ece48.
011d60c4352 Speed up uuid_out() by not relying on a StringInfo
943f7ae1c86 Add lookup table for replication slot conflict reasons
28f3915b73f Remove superfluous 'pgprocno' field from PGPROC
4989ce72644 MERGE ... DO NOTHING: require SELECT privileges
ed345c2728b Fix typo
690805ca754 doc: Fix link to pg_ident_file_mappings view
ff9e1e764fc Add option force_initdb to PostgreSQL::Test::Cluster:init()
75bcba6cbd2 Remove extra check_stack_depth() from dropconstraint_internal()
fcd210d496d Doc: improve explanation of type interval, especially extract().
489072ab7a9 Replace relids in lateral subquery parse tree during SJE
74563f6b902 Revert "Improve compression and storage support with inheritance"
d2ca9a50b5b Minor corrections for partition pruning
818fefd8fd4 Fix race leading to incorrect conflict cause in
InvalidatePossiblyObsoleteSlot()
01ec4d89b91 doc: Use system-username instead of system-user

I just haven't got a specific theory yet, as the logs are empty.  I
wonder if some kind of failures could start firing signals around to
get us a stack.
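
(A minimal sketch of that idea, purely illustrative and not existing test
code: on a catchup timeout, look up the standby's startup process and send it
SIGABRT so a core file with a stack ends up in the buildfarm logs.  The
$node_standby variable, the backend_type filter and the signal choice are
assumptions.)

    # Hypothetical: run this from the TAP test once the wait has given up.
    # safe_psql() and pg_stat_activity's backend_type column are real;
    # everything else here is just a sketch.
    my $pid = $node_standby->safe_psql('postgres',
        "SELECT pid FROM pg_stat_activity WHERE backend_type = 'startup'");
    # SIGABRT aborts the process; with core dumps enabled the buildfarm can
    # then turn the core into a backtrace.
    kill 'ABRT', $pid if defined $pid && $pid =~ /^\d+$/;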



Re: Recent 027_streaming_regress.pl hangs

From: Michael Paquier
Date:
On Thu, Mar 14, 2024 at 03:00:28PM +1300, Thomas Munro wrote:
> Assuming it is due to a commit in master, and given the failure
> frequency, I think it is very likely to be a change from this 3 day
> window of commits, and more likely in the top half dozen or so:
>
> d360e3cc60e Fix compiler warning on typedef redeclaration
> 8af25652489 Introduce a new smgr bulk loading facility.
> e612384fc78 Fix mistake in SQL features list
> d13ff82319c Fix BF failure in commit 93db6cbda0.
> efa70c15c74 Make GetSlotInvalidationCause() return RS_INVAL_NONE on
> unexpected input
> 93db6cbda03 Add a new slot sync worker to synchronize logical slots.
> 3d47b75546d pgindent fix
> b6df0798a5e Fix the intermittent buildfarm failures in 031_column_list.
> fbc93b8b5f5 Remove custom Constraint node read/write implementations
> 801792e528d Improve ERROR/LOG messages added by commits ddd5f4f54a and
> 7a424ece48.
> 011d60c4352 Speed up uuid_out() by not relying on a StringInfo
> 943f7ae1c86 Add lookup table for replication slot conflict reasons
> 28f3915b73f Remove superfluous 'pgprocno' field from PGPROC
> 4989ce72644 MERGE ... DO NOTHING: require SELECT privileges
> ed345c2728b Fix typo
> 690805ca754 doc: Fix link to pg_ident_file_mappings view
> ff9e1e764fc Add option force_initdb to PostgreSQL::Test::Cluster:init()
> 75bcba6cbd2 Remove extra check_stack_depth() from dropconstraint_internal()
> fcd210d496d Doc: improve explanation of type interval, especially extract().
> 489072ab7a9 Replace relids in lateral subquery parse tree during SJE
> 74563f6b902 Revert "Improve compression and storage support with inheritance"
> d2ca9a50b5b Minor corrections for partition pruning
> 818fefd8fd4 Fix race leading to incorrect conflict cause in
> InvalidatePossiblyObsoleteSlot()
> 01ec4d89b91 doc: Use system-username instead of system-user
>
> I just haven't got a specific theory yet, as the logs are empty.  I
> wonder if some kind of failures could start firing signals around to
> get us a stack.

Thanks for providing this list and an analysis.

Hmm.  Perhaps 8af25652489?  That looks like the closest thing in the
list that could have played with the way WAL is generated, hence
potentially impacting the records that are replayed.

93db6cbda03, efa70c15c74 and 818fefd8fd4 came to mind, but they touch
unrelated territory.
--
Michael


Re: Recent 027_streaming_regress.pl hangs

From: Thomas Munro
Date:
On Thu, Mar 14, 2024 at 3:27 PM Michael Paquier <michael@paquier.xyz> wrote:
> Hmm.  Perhaps 8af25652489?  That looks like the closest thing in the
> list that could have played with the way WAL is generated, hence
> potentially impacting the records that are replayed.

Yeah, I was wondering if its checkpoint delaying logic might have
got the checkpointer jammed or something like that, but I don't
currently see how.  Yeah, the replay of bulk newpages could be
relevant, but it's not exactly new technology.  One thing I wondered
about is whether the Perl "wait for catchup" thing, which generates
large volumes of useless log, could be somehow changed to actually
show the progress after some time.  Something like "I'm still waiting
for this replica to reach LSN X, but it has so far only reported LSN
Y, and here's a dump of the WAL around there"?
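
A rough sketch of what that could look like, assuming it lives in the test
itself rather than in Cluster.pm; $node_primary, the 'standby_1' name and the
30-second reporting interval are illustrative:

    use Time::HiRes qw(time usleep);
    use Test::More;

    # Poll replay progress ourselves and say something every ~30 seconds
    # instead of staying silent until the final timeout.
    my $target_lsn = $node_primary->lsn('write');
    my $last_report = time();
    while (1)
    {
        my $replay_lsn = $node_primary->safe_psql('postgres',
            "SELECT replay_lsn FROM pg_stat_replication WHERE application_name = 'standby_1'");
        last
          if $replay_lsn ne ''
          && $node_primary->safe_psql('postgres',
            "SELECT '$target_lsn'::pg_lsn <= '$replay_lsn'::pg_lsn") eq 't';
        if (time() - $last_report >= 30)
        {
            diag("still waiting for standby_1 to reach $target_lsn, "
                . "so far it has only reported " . ($replay_lsn || '(nothing)'));
            $last_report = time();
        }
        usleep(100_000);
    }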

> 93db6cbda03, efa70c15c74 and 818fefd8fd4 came to mind, but they touch
> unrelated territory.

Hmm.



Re: Recent 027_streaming_regress.pl hangs

From: Alexander Lakhin
Date:
Hello Thomas and Michael,

14.03.2024 06:16, Thomas Munro wrote:
>
> Yeah, I was wondering if its checkpoint delaying logic might have
> got the checkpointer jammed or something like that, but I don't
> currently see how.  Yeah, the replay of bulk newpages could be
> relevant, but it's not exactly new technology.  One thing I wondered
> about is whether the Perl "wait for catchup" thing, which generates
> large volumes of useless log, could be somehow changed to actually
> show the progress after some time.  Something like "I'm still waiting
> for this replica to reach LSN X, but it has so far only reported LSN
> Y, and here's a dump of the WAL around there"?

I have perhaps reproduced the issue here (at least I'm seeing something
similar) and am going to investigate it in the coming days, but what
I'm confused by now is the duration of poll_query_until:
For the failure you referenced:
[15:55:54.740](418.725s) # poll_query_until timed out executing this query:

And a couple of others:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-03-08%2000%3A34%3A06
[00:45:57.747](376.159s) # poll_query_until timed out executing this query:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-03-04%2016%3A32%3A17
[16:45:24.870](407.970s) # poll_query_until timed out executing this query:

Could it be that the timeout (360 sec?) is just not enough for the test
under the current (changed due to switch to meson) conditions?

Best regards,
Alexander



Re: Recent 027_streaming_regress.pl hangs

From: Thomas Munro
Date:
On Fri, Mar 15, 2024 at 7:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> Could it be that the timeout (360 sec?) is just not enough for the test
> under the current (changed due to switch to meson) conditions?

Hmm, well it looks like he switched over to meson around 42 days ago, on
2024-02-01, looking at "calliphoridae" (skink has the extra
complication of valgrind, so let's look at a more 'normal' animal
instead).  The first failure that looks like that on calliphoridae is
from 19 days ago, 2024-02-23, and after that it's happening every ~3 days,
sometimes in clusters.

https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=calliphoridae&br=HEAD

But you're right that under meson the test takes a lot longer, I guess
due to increased concurrency:

287/287 postgresql:recovery / recovery/027_stream_regress        OK              684.50s   6 subtests passed

With make we don't have an individual time per script, but for all
of the recovery tests we had for example:

t/027_stream_regress.pl ............... ok
All tests successful.
Files=39, Tests=542, 65 wallclock secs ( 0.26 usr  0.06 sys + 20.16 cusr 31.65 csys = 52.13 CPU)



Re: Recent 027_streaming_regress.pl hangs

From: Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> On Fri, Mar 15, 2024 at 7:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
>> Could it be that the timeout (360 sec?) is just not enough for the test
>> under the current (changed due to switch to meson) conditions?

> But you're right that under meson the test takes a lot longer, I guess
> due to increased concurrency:

What it seems to be is highly variable.  Looking at calliphoridae's
last half dozen successful runs, I see reported times for
027_stream_regress anywhere from 183 to 704 seconds.  I wonder what
else is happening on that machine.  Also, this is probably not
helping anything:

                   'extra_config' => {
                                                      ...
                                                      'fsync = on'

I would suggest turning that off and raising wait_timeout a good
deal, and then we'll see if calliphoridae gets any more stable.
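
For concreteness, a sketch of what that could look like in the animal's
build-farm.conf, following the excerpt above; the DEFAULT key and the
wait_timeout value are assumptions, not tested settings:

    'extra_config' => {
        'DEFAULT' => [
            # ...
            'fsync = off',              # instead of 'fsync = on'
        ],
    },
    'wait_timeout' => 4 * 60 * 60,      # raise the run timeout "a good deal"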

            regards, tom lane



Re: Recent 027_streaming_regress.pl hangs

From: Alexander Lakhin
Date:
14.03.2024 23:56, Tom Lane wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
>> On Fri, Mar 15, 2024 at 7:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
>>> Could it be that the timeout (360 sec?) is just not enough for the test
>>> under the current (changed due to switch to meson) conditions?
>> But you're right that under meson the test takes a lot longer, I guess
>> due to increased concurrency:
> What it seems to be is highly variable.  Looking at calliphoridae's
> last half dozen successful runs, I see reported times for
> 027_stream_regress anywhere from 183 to 704 seconds.  I wonder what
> else is happening on that machine.  Also, this is probably not
> helping anything:
>
>                     'extra_config' => {
>                                                        ...
>                                                        'fsync = on'
>
> I would suggest turning that off and raising wait_timeout a good
> deal, and then we'll see if calliphoridae gets any more stable.

I could reproduce similar failures with PG_TEST_EXTRA=wal_consistency_checking
alone, running 5 tests in parallel on a slowed-down VM, so that the test
duration increased to ~1900 seconds, but perhaps that buildfarm machine
has a different bottleneck (I/O?) or its concurrent workload is not as
uniform as in my experiments.

Meanwhile, I've analyzed the failed test logs from the buildfarm and calculated
the percentage of WAL replayed before the timeout.
For instance, one of the failures:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-03-18%2022%3A36%3A40
standby_1.log:
2024-03-18 22:38:22.743 UTC [2010896][walreceiver][:0] LOG:  started streaming WAL from primary at 0/3000000 on timeline 1
...
2024-03-18 22:50:02.439 UTC [2004203][checkpointer][:0] LOG:  recovery restart point at 0/E00E030
2024-03-18 22:50:02.439 UTC [2004203][checkpointer][:0] DETAIL:  Last completed transaction was at log time 2024-03-18 22:41:26.647756+00.
2024-03-18 22:50:12.841 UTC [2010896][walreceiver][:0] FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly

primary.log:
2024-03-18 22:38:23.754 UTC [2012240][client backend][3/3:0] LOG:  statement: GRANT ALL ON SCHEMA public TO public;
# ^^^ One of the first records produced by `make check`
...
2024-03-18 22:41:26.647 UTC [2174047][client backend][10/752:0] LOG:  statement: ALTER VIEW my_property_secure SET (security_barrier=false);
# ^^^ A record near the last completed transaction on standby
...
2024-03-18 22:44:13.226 UTC [2305844][client backend][22/3784:0] LOG:  statement: DROP TABLESPACE regress_tblspace_renamed;
# ^^^ One of the last records produced by `make check`

\set t0 '22:38:23.754'
\set t1 '22:44:13.226'
\set tf '22:41:26.647756'
select extract(epoch from (:'tf'::time - :'t0'::time)) / extract(epoch from (:'t1'::time - :'t0'::time));
~52%

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-03-18%2018%3A58%3A58
~48%

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-03-18%2016%3A41%3A13
~43%

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-03-18%2015%3A47%3A09
~36%

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-03-15%2011%3A24%3A38
~87%

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-03-17%2021%3A55%3A41
~36%

So it still looks like a performance-related issue to me. And yes,
fsync = off -> on greatly increases (~3x) the overall test duration in
my environment.

Best regards,
Alexander



Re: Recent 027_streaming_regress.pl hangs

From: Andres Freund
Date:
Hi,

On 2024-03-14 16:56:39 -0400, Tom Lane wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > On Fri, Mar 15, 2024 at 7:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> >> Could it be that the timeout (360 sec?) is just not enough for the test
> >> under the current (changed due to switch to meson) conditions?
> 
> > But you're right that under meson the test takes a lot longer, I guess
> > due to increased concurrency:
> 
> What it seems to be is highly variable.  Looking at calliphoridae's
> last half dozen successful runs, I see reported times for
> 027_stream_regress anywhere from 183 to 704 seconds.  I wonder what
> else is happening on that machine.

There are a lot of other animals on the same machine, however it's rarely fully
loaded (with either CPU or IO).

I don't think the test just being slow is the issue here, e.g. in the last
failing iteration
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-03-20%2022%3A03%3A15

the tests completed

2024-03-20 22:07:50.239 UTC [3937667][client backend][22/3255:0] LOG:  statement: DROP ROLE regress_tablespace_user2;
2024-03-20 22:07:50.251 UTC [3937667][client backend][:0] LOG:  disconnection: session time: 0:00:12.431 user=bf database=regression host=[local]

and we waited to replicate for quite a while:

2024-03-20 22:14:01.904 UTC [56343][client backend][6/1925:0] LOG:  connection authorized: user=bf database=postgres application_name=027_stream_regress.pl
2024-03-20 22:14:01.930 UTC [56343][client backend][6/1926:0] LOG:  statement: SELECT '0/15BA21B0' <= replay_lsn AND state = 'streaming'
             FROM pg_catalog.pg_stat_replication
             WHERE application_name IN ('standby_1', 'walreceiver')
2024-03-20 22:14:01.958 UTC [56343][client backend][:0] LOG:  disconnection: session time: 0:00:00.063 user=bf database=postgres host=[local]
2024-03-20 22:14:02.083 UTC [3729516][postmaster][:0] LOG:  received immediate shutdown request
2024-03-20 22:14:04.970 UTC [3729516][postmaster][:0] LOG:  database system is shut down

There was no activity for 7 minutes.

System statistics show relatively low CPU and IO load for the period from
22:00 to 22:10.


I suspect we have some more fundamental instability on our hands; there have
been failures like this going back a while, and on various machines.



I think at the very least we should make Cluster.pm's wait_for_catchup() print
some information when it times out - right now it's nigh on undebuggable,
because we don't even log what we were waiting for and what the actual
replication position was.
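
One possible shape for that, as a sketch only (not the change that was
eventually made): before giving up, dump pg_stat_replication so the log
records the target LSN and how far the standby actually got.  The variable
names are assumed to mirror wait_for_catchup()'s context.

    # Inside wait_for_catchup(), on the timeout path:
    my $state = $self->safe_psql('postgres',
        qq{SELECT application_name, state, sent_lsn, write_lsn, flush_lsn, replay_lsn
           FROM pg_catalog.pg_stat_replication});
    die "timed out waiting for catchup to $target_lsn; pg_stat_replication was:\n$state";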



> Also, this is probably not
> helping anything:
> 
>                    'extra_config' => {
>                                                       ...
>                                                       'fsync = on'

At some point we had practically no test coverage of fsync, so I made my
animals use fsync. I think we still have little coverage.  I probably could
reduce the number of animals using it though.

Greetings,

Andres Freund



Re: Recent 027_streaming_regress.pl hangs

From: Andres Freund
Date:
Hi,

On 2024-03-20 17:41:47 -0700, Andres Freund wrote:
> There are a lot of other animals on the same machine, however it's rarely fully
> loaded (with either CPU or IO).
>
> I don't think the test just being slow is the issue here, e.g. in the last
> failing iteration
>
> [...]
>
> I suspect we have some more fundamental instability on our hands; there have
> been failures like this going back a while, and on various machines.

I'm somewhat confused by the timestamps in the log:

[22:07:50.263](223.929s) ok 2 - regression tests pass
...
[22:14:02.051](371.788s) # poll_query_until timed out executing this query:

I read this as 371.788s having passed between the messages, which of course is
much higher than PostgreSQL::Test::Utils::timeout_default=180.

Ah.

The way that poll_query_until() implements timeouts seems decidedly
suboptimal. If a psql invocation, including query processing, takes any
appreciable amount of time, poll_query_until() waits much longer than it
should, because it very naively determines a number of waits ahead of time:

    my $max_attempts = 10 * $PostgreSQL::Test::Utils::timeout_default;
    my $attempts = 0;

    while ($attempts < $max_attempts)
    {
...

        # Wait 0.1 second before retrying.
        usleep(100_000);

        $attempts++;
    }

Ick.

What's worse is that if the query takes too long, the timeout afaict never
takes effect.
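
A deadline-based sketch of how the loop could look instead (an assumption
about a possible fix, not a committed change); $self, $dbname, $query and
$expected are taken to mirror poll_query_until()'s existing variables:

    use Time::HiRes qw(time usleep);

    my $deadline = time() + $PostgreSQL::Test::Utils::timeout_default;
    while (1)
    {
        # Cap the psql invocation's own timeout by the time we have left, so
        # neither slow psql startup nor one over-long query can stretch or
        # defeat the overall limit.
        my $remaining = $deadline - time();
        last if $remaining <= 0;
        my ($stdout, $stderr) = ('', '');
        my $timed_out = 0;
        my $ret = $self->psql(
            $dbname, $query,
            stdout => \$stdout, stderr => \$stderr,
            timeout => $remaining, timed_out => \$timed_out);
        last if $timed_out;    # the query itself exhausted the remaining time
        chomp $stdout;
        return 1 if $ret == 0 && $stdout eq $expected;
        usleep(100_000);
    }
    # fell through: report the timeout, as the current code does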

Greetings,

Andres Freund



Re: Recent 027_streaming_regress.pl hangs

From: Andres Freund
Date:
Hi,

On 2024-03-20 17:41:45 -0700, Andres Freund wrote:
> 2024-03-20 22:14:01.904 UTC [56343][client backend][6/1925:0] LOG:  connection authorized: user=bf database=postgres application_name=027_stream_regress.pl
> 2024-03-20 22:14:01.930 UTC [56343][client backend][6/1926:0] LOG:  statement: SELECT '0/15BA21B0' <= replay_lsn AND state = 'streaming'
>              FROM pg_catalog.pg_stat_replication
>              WHERE application_name IN ('standby_1', 'walreceiver')
> 2024-03-20 22:14:01.958 UTC [56343][client backend][:0] LOG:  disconnection: session time: 0:00:00.063 user=bf database=postgres host=[local]
> 2024-03-20 22:14:02.083 UTC [3729516][postmaster][:0] LOG:  received immediate shutdown request
> 2024-03-20 22:14:04.970 UTC [3729516][postmaster][:0] LOG:  database system is shut down
>
> There was no activity for 7 minutes.
>
> System statistics show relatively low CPU and IO load for the period from
> 22:00 to 22:10.
>
>
> I suspect we have some more fundamental instability on our hands; there have
> been failures like this going back a while, and on various machines.

I've reproduced something like this scenario locally, although I am not sure
it is precisely what is happening on the buildfarm.  At least here it looks
like the problem is that apply is lagging substantially:

2024-03-20 22:43:11.024 PDT [1023505][walreceiver][:0][] DEBUG:  sendtime 2024-03-20 22:43:11.024348-07 receipttime 2024-03-20 22:43:11.02437-07 replication apply delay 285322 ms transfer latency 1 ms

Which then means that we'll wait for a long time for apply to finish:

Waiting for replication conn standby_1's replay_lsn to pass 0/14385E20 on primary
[22:41:34.521](0.221s) # state before polling:
# pid                | 1023508
# application_name   | standby_1
# sent_lsn           | 0/14385E20
# primary_wal_lsn    | 0/14385E20
# standby_write_lsn  | 0/14385E20
# primary_flush_lsn  | 0/14385E20
# standby_flush_lsn  | 0/14385E20
# standby_replay_lsn | 0/126D5C58
...
[22:43:16.376](0.161s) # running query, attempt 679/1800
[22:43:16.627](0.251s) # state now:
# pid                | 1023508
# application_name   | standby_1
# sent_lsn           | 0/14778468
# primary_wal_lsn    | 0/14778468
# standby_write_lsn  | 0/14778468
# primary_flush_lsn  | 0/14778468
# standby_flush_lsn  | 0/14778468
# standby_replay_lsn | 0/14778468



I am not sure I have debugged why exactly, but it sure looks like one part is
the startup process being busy unlinking files synchronously. This appears to
be exacerbated by mdunlinkfork() first truncating and then separately
unlinking the file - that looks to trigger a lot of filesystem journal
flushes (on xfs).

We also spend a fair bit of time updating the control file, because we flush
the WAL when replaying a transaction commit with a relation unlink. That also
badly interacts with doing metadata operations...

Thirdly, we flush received WAL extremely granularly at times, which requires
another fsync:
2024-03-20 23:30:21.469 PDT [1525084][walreceiver][:0][] LOG:  flushed received WAL up to 0/13BB0000
2024-03-20 23:30:21.473 PDT [1525084][walreceiver][:0][] LOG:  flushed received WAL up to 0/13BB0170
2024-03-20 23:30:21.479 PDT [1525084][walreceiver][:0][] LOG:  flushed received WAL up to 0/13BB2528
2024-03-20 23:30:21.480 PDT [1525084][walreceiver][:0][] LOG:  flushed received WAL up to 0/13BB58C8
2024-03-20 23:30:21.487 PDT [1525084][walreceiver][:0][] LOG:  flushed received WAL up to 0/13BB7DA0
2024-03-20 23:30:21.490 PDT [1525084][walreceiver][:0][] LOG:  flushed received WAL up to 0/13BB92B0
2024-03-20 23:30:21.494 PDT [1525084][walreceiver][:0][] LOG:  flushed received WAL up to 0/13BBBAC0
2024-03-20 23:30:21.496 PDT [1525084][walreceiver][:0][] LOG:  flushed received WAL up to 0/13BBCCC0
2024-03-20 23:30:21.499 PDT [1525084][walreceiver][:0][] LOG:  flushed received WAL up to 0/13BBCE18

This all when we're quite far behind with apply...

Greetings,

Andres Freund



Re: Recent 027_streaming_regress.pl hangs

From: Andres Freund
Date:
Hi,

On 2024-03-20 17:41:45 -0700, Andres Freund wrote:
> On 2024-03-14 16:56:39 -0400, Tom Lane wrote:
> > Also, this is probably not
> > helping anything:
> >
> >                    'extra_config' => {
> >                                                       ...
> >                                                       'fsync = on'
>
> At some point we had practically no test coverage of fsync, so I made my
> animals use fsync. I think we still have little coverage.  I probably could
> reduce the number of animals using it though.

I think there must be some actual regression involved. The frequency of
failures on HEAD vs failures on 16 - both of which run the tests concurrently
via meson - is just vastly different.  I'd expect the absolute number of
failures in 027_stream_regress.pl to differ between branches due to fewer runs
on 16, but there's no explanation for the difference in percentage of
failures. My menagerie had only a single recoveryCheck failure on !HEAD in the
last 30 days, but in the vicinity of 100 on HEAD
https://buildfarm.postgresql.org/cgi-bin/show_failures.pl?max_days=30&stage=recoveryCheck&filter=Submit


If anything the load when testing back-branch changes is higher, because
back-branch commits commonly trigger builds on all branches, so I don't think
that can be the explanation either.

From what I can tell the pattern changed on 2024-02-16 19:39:02 - there was a
rash of recoveryCheck failures in the days before that too, but not
027_stream_regress.pl in that way.


It certainly seems suspicious that one commit before the first observed failure
is
2024-02-16 11:09:11 -0800 [73f0a132660] Pass correct count to WALRead().

Of course the failure rate is low enough that it could have been a day or two
before that, too.

Greetings,

Andres Freund



Re: Recent 027_streaming_regress.pl hangs

From: Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> I think there must be some actual regression involved. The frequency of
> failures on HEAD vs failures on 16 - both of which run the tests concurrently
> via meson - is just vastly different.

Are you sure it's not just that the total time to run the core
regression tests has grown to a bit more than what the test timeout
allows for?

            regards, tom lane



Re: Recent 027_streaming_regress.pl hangs

From: Andres Freund
Date:
Hi,

On 2024-03-26 00:00:38 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > I think there must be some actual regression involved. The frequency of
> > failures on HEAD vs failures on 16 - both of which run the tests concurrently
> > via meson - is just vastly different.
>
> Are you sure it's not just that the total time to run the core
> regression tests has grown to a bit more than what the test timeout
> allows for?

You're right, that could be it - in a way at least, the issue is replay not
catching up within 180s, so it'd have to be the data volume growing, I think.

But it doesn't look like the regression volume meaningfully grew around that
time?

I guess I'll try to write a buildfarm database query to extract how long that
phase of the test took from all runs on my menagerie, not just the failing
one, and see if there's a visible trend.

Greetings,

Andres Freund



Re: Recent 027_streaming_regress.pl hangs

From: Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2024-03-26 00:00:38 -0400, Tom Lane wrote:
>> Are you sure it's not just that the total time to run the core
>> regression tests has grown to a bit more than what the test timeout
>> allows for?

> You're right, that could be it - in a way at least, the issue is replay not
> catching up within 180s, so it'd have to be the data volume growing, I think.
> But it doesn't look like the regression volume meaningfully grew around that
> time?

No, but my impression is that the failure rate has been getting slowly
worse for a while now.

> I guess I'll try to write a buildfarm database query to extract how long that
> phase of the test took from all runs on my menagerie, not just the failing
> one, and see if there's a visible trend.

+1

            regards, tom lane



Re: Recent 027_streaming_regress.pl hangs

From: Andres Freund
Date:
Hi,

On 2024-03-26 00:54:54 -0400, Tom Lane wrote:
> > I guess I'll try to write a buildfarm database query to extract how long that
> > phase of the test took from all runs on my menagerie, not just the failing
> > one, and see if there's a visible trend.
>
> +1

Only the query for successful runs has finished, and it looks like the error
case is still going to take a while longer, so here are excerpts from a query
showing how long 027_stream_regress.pl took to succeed:

    sysname    |    date    | count | avg_duration
---------------+------------+-------+--------------

 calliphoridae | 2024-01-29 |    10 |          435
 calliphoridae | 2024-02-05 |    25 |          496
 calliphoridae | 2024-02-12 |    36 |          522
 calliphoridae | 2024-02-19 |    25 |          445
 calliphoridae | 2024-02-26 |    35 |          516
 calliphoridae | 2024-03-04 |    53 |          507
 calliphoridae | 2024-03-11 |    51 |          532
 calliphoridae | 2024-03-18 |    53 |          548
 calliphoridae | 2024-03-25 |    13 |          518

 culicidae     | 2024-01-29 |    11 |          420
 culicidae     | 2024-02-05 |    31 |          485
 culicidae     | 2024-02-12 |    35 |          513
 culicidae     | 2024-02-19 |    29 |          489
 culicidae     | 2024-02-26 |    36 |          512
 culicidae     | 2024-03-04 |    63 |          541
 culicidae     | 2024-03-11 |    62 |          597
 culicidae     | 2024-03-18 |    56 |          603
 culicidae     | 2024-03-25 |    16 |          550

 tamandua      | 2024-01-29 |    13 |          420
 tamandua      | 2024-02-05 |    29 |          433
 tamandua      | 2024-02-12 |    34 |          431
 tamandua      | 2024-02-19 |    27 |          382
 tamandua      | 2024-02-26 |    36 |          492
 tamandua      | 2024-03-04 |    60 |          475
 tamandua      | 2024-03-11 |    56 |          533
 tamandua      | 2024-03-18 |    54 |          527
 tamandua      | 2024-03-25 |    21 |          507

Particularly on tamandua it does look like there has been an upwards trend.

Late, will try to look more in the next few days.

Greetings,

Andres Freund



Re: Recent 027_streaming_regress.pl hangs

From: Alexander Lakhin
Date:
Hello Andres,

26.03.2024 10:59, Andres Freund wrote:
> Late, will try to look more in the next few days.
>

AFAICS, last 027_streaming_regress.pl failures on calliphoridae,
culicidae, tamandua occurred before 2024-03-27:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-03-26%2004%3A07%3A30
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2024-03-22%2013%3A26%3A21
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-03-24%2007%3A44%3A27

So it looks like that issue is resolved, but there is another apparently
performance-related issue: deadlock-parallel test failures.

A recent one:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=canebrake&dt=2024-04-02%2022%3A20%3A22
test deadlock-parallel            ... FAILED   345099 ms

+isolationtester: canceling step d2a1 after 300 seconds
  step d2a1: <... completed>
-  sum
------
-10000
-(1 row)
-
...

The server log shows:
2024-04-02 23:56:45.353 UTC [3583878][client backend][5/530:0] LOG:  statement: SET force_parallel_mode = on;
...
                   SELECT lock_share(3,x) FROM bigt LIMIT 1;
2024-04-02 23:56:45.364 UTC [3583876][client backend][3/2732:0] LOG:  execute isolationtester_waiting: SELECT pg_catalog.pg_isolation_test_session_is_blocked($1, '{3583877,3583878,3583879,3583880}')
2024-04-02 23:56:45.364 UTC [3583876][client backend][3/2732:0] DETAIL:  parameters: $1 = '3583878'
...
2024-04-02 23:57:28.967 UTC [3583876][client backend][3/5097:0] LOG:  execute isolationtester_waiting: SELECT pg_catalog.pg_isolation_test_session_is_blocked($1, '{3583877,3583878,3583879,3583880}')
2024-04-02 23:57:28.967 UTC [3583876][client backend][3/5097:0] DETAIL:  parameters: $1 = '3583877'
2024-04-02 23:57:29.016 UTC [3583877][client backend][4/530:0] LOG:  statement: COMMIT;
2024-04-02 23:57:29.039 UTC [3583876][client backend][3/5098:0] LOG:  execute isolationtester_waiting: SELECT pg_catalog.pg_isolation_test_session_is_blocked($1, '{3583877,3583878,3583879,3583880}')
2024-04-02 23:57:29.039 UTC [3583876][client backend][3/5098:0] DETAIL:  parameters: $1 = '3583879'
...
2024-04-03 00:02:29.096 UTC [3583876][client backend][3/9472:0] LOG:  execute isolationtester_waiting: SELECT pg_catalog.pg_isolation_test_session_is_blocked($1, '{3583877,3583878,3583879,3583880}')
2024-04-03 00:02:29.096 UTC [3583876][client backend][3/9472:0] DETAIL:  parameters: $1 = '3583878'
2024-04-03 00:02:29.172 UTC [3905345][not initialized][:0] LOG:  connection received: host=[local]
2024-04-03 00:02:29.240 UTC [3583878][client backend][5/530:0] ERROR:  canceling statement due to user request

The last step duration is 00:02:29.096 - 23:57:29.039 ~ 300 seconds,
the default max_step_wait for REL_15_STABLE and earlier (for REL_16_STABLE
and later the default value was increased to 360 by c99c67fc4).

The average deadlock-parallel duration for REL_15_STABLE on canebrake is
around 128 seconds (for the 140 runs I analyzed), but we can also find:
test deadlock-parallel            ... ok       377895 ms

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=canebrake&dt=2024-03-27%2001%3A06%3A24&stg=isolation-check
test deadlock-parallel            ... ok       302549 ms

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=canebrake&dt=2023-11-06%2012%3A47%3A01&stg=isolation-check
test deadlock-parallel            ... ok       255045 ms

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=canebrake&dt=2023-11-09%2010%3A02%3A59&stg=isolation-check

A similar situation can be seen on phycodurus:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus&dt=2024-02-11%2021:05:41
test deadlock-parallel            ... FAILED   389381 ms

The average deadlock-parallel duration for REL_13_STABLE on phycodurus is
around 78 seconds (for 138 recent runs), but there were also:
test deadlock-parallel            ... ok       441736 ms

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=phycodurus&dt=2024-03-04%2015%3A19%3A04&stg=isolation-check
test deadlock-parallel            ... ok       187844 ms

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=phycodurus&dt=2023-11-03%2016%3A13%3A46&stg=isolation-check

And also pogona, REL_14_STABLE:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=pogona&dt=2024-02-20%2003%3A50%3A49
test deadlock-parallel            ... FAILED   425482 ms

(I could reach similar duration on a slowed-down VM, with JIT enabled as
on these animals.)

So maybe these machines require a larger PGISOLATIONTIMEOUT, or there is
still some OS/environment issue there?
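
For the first option, a sketch of where that could be set, assuming the
buildfarm client's build_env hash is the right place for it; the value is
purely illustrative:

    'build_env' => {
        # ...
        'PGISOLATIONTIMEOUT' => '900',   # seconds isolationtester waits for a step before cancelling it
    },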

Best regards,
Alexander



Re: Recent 027_streaming_regress.pl hangs

From: Andres Freund
Date:
Hi,

On 2024-04-04 19:00:00 +0300, Alexander Lakhin wrote:
> 26.03.2024 10:59, Andres Freund wrote:
> > Late, will try to look more in the next few days.
> > 
> 
> AFAICS, last 027_streaming_regress.pl failures on calliphoridae,
> culicidae, tamandua occurred before 2024-03-27:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-03-26%2004%3A07%3A30
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2024-03-22%2013%3A26%3A21
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-03-24%2007%3A44%3A27
> 
> So it looks like that issue is resolved, but there is another apparently
> performance-related issue: deadlock-parallel test failures.

I reduced test concurrency a bit. I hadn't quite realized how the buildfarm
config and meson test concurrency interact.  There's still something off
with the frequency of fsyncs during replay, but perhaps that doesn't qualify
as a bug.


> A recent one:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=canebrake&dt=2024-04-02%2022%3A20%3A22
> test deadlock-parallel            ... FAILED   345099 ms

> (I could reach similar duration on a slowed-down VM, with JIT enabled as
> on these animals.)
> 
> So maybe these machines require a larger PGISOLATIONTIMEOUT, or there is
> still some OS/environment issue there?

Hm, possible. Forcing every query to be JITed, in a debug build of LLVM, is
absurdly expensive.

Greetings,

Andres Freund