Re: Recent 027_streaming_regress.pl hangs

Поиск
Список
Период
Сортировка
От Alexander Lakhin
Тема Re: Recent 027_streaming_regress.pl hangs
Дата
Msg-id ed767fdd-bc7a-2a0b-d94a-b4a724af50bf@gmail.com
обсуждение исходный текст
Ответ на Re: Recent 027_streaming_regress.pl hangs  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-hackers
14.03.2024 23:56, Tom Lane wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
>> On Fri, Mar 15, 2024 at 7:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
>>> Could it be that the timeout (360 sec?) is just not enough for the test
>>> under the current (changed due to switch to meson) conditions?
>> But you're right that under meson the test takes a lot longer, I guess
>> due to increased concurrency:
> What it seems to be is highly variable.  Looking at calliphoridae's
> last half dozen successful runs, I see reported times for
> 027_stream_regress anywhere from 183 to 704 seconds.  I wonder what
> else is happening on that machine.  Also, this is probably not
> helping anything:
>
>                     'extra_config' => {
>                                                        ...
>                                                        'fsync = on'
>
> I would suggest turning that off and raising wait_timeout a good
> deal, and then we'll see if calliphoridae gets any more stable.

I could reproduce similar failures with
PG_TEST_EXTRA=wal_consistency_checking
only, running 5 tests in parallel on a slowed-down VM, so that test
duration increased to ~1900 seconds, but perhaps that buildfarm machine
has a different bottleneck (I/O?) or it's concurrent workload is not
uniform as in my experiments.

Meanwhile, I've analyzed failed test logs from buildfarm and calculated
the percentage of WAL replayed before timeout.
For instance, one of the failures:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-03-18%2022%3A36%3A40
standby_1.log:
2024-03-18 22:38:22.743 UTC [2010896][walreceiver][:0] LOG:  started streaming WAL from primary at 0/3000000 on
timeline1
 
...
2024-03-18 22:50:02.439 UTC [2004203][checkpointer][:0] LOG: recovery restart point at 0/E00E030
2024-03-18 22:50:02.439 UTC [2004203][checkpointer][:0] DETAIL: Last completed transaction was at log time 2024-03-18 
22:41:26.647756+00.
2024-03-18 22:50:12.841 UTC [2010896][walreceiver][:0] FATAL:  could not receive data from WAL stream: server closed
the
 
connection unexpectedly

primary.log:
2024-03-18 22:38:23.754 UTC [2012240][client backend][3/3:0] LOG: statement: GRANT ALL ON SCHEMA public TO public;
# ^^^ One of the first records produced by `make check`
...
2024-03-18 22:41:26.647 UTC [2174047][client backend][10/752:0] LOG:  statement: ALTER VIEW my_property_secure SET 
(security_barrier=false);
# ^^^ A record near the last completed transaction on standby
...
2024-03-18 22:44:13.226 UTC [2305844][client backend][22/3784:0] LOG:  statement: DROP TABLESPACE
regress_tblspace_renamed;
# ^^^ One of the last records produced by `make check`

\set t0 '22:38:23.754' \set t1 '22:44:13.226' \set tf '22:41:26.647756'
select extract(epoch from (:'tf'::time - :'t0'::time)) / extract(epoch from (:'t1'::time - :'t0'::time));
~52%

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-03-18%2018%3A58%3A58
~48%

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-03-18%2016%3A41%3A13
~43%

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-03-18%2015%3A47%3A09
~36%

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-03-15%2011%3A24%3A38
~87%

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-03-17%2021%3A55%3A41
~36%

So it still looks like a performance-related issue to me. And yes,
fsync = off -> on greatly increases (~3x) the overall test duration in
that my environment.

Best regards,
Alexander



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Amit Kapila
Дата:
Сообщение: Re: Introduce XID age and inactive timeout based replication slot invalidation
Следующее
От: Peter Eisentraut
Дата:
Сообщение: Re: SQL:2011 application time