Обсуждение: Test 026_overwrite_contrecord fails on very slow machines (under Valgrind)

Поиск
Список
Период
Сортировка

Test 026_overwrite_contrecord fails on very slow machines (under Valgrind)

От
Alexander Lakhin
Дата:
Hello hackers,

While investigating the recent skink failure [1], I've reproduced this
failure under Valgrind on a slow machine and found that this happens due to
the last checkpoint recorded in the segment 2, that is removed in the test:
The failure log contains:
2023-10-10 19:10:08.212 UTC [2144251][startup][:0] LOG:  invalid checkpoint record
2023-10-10 19:10:08.214 UTC [2144251][startup][:0] PANIC:  could not locate a valid checkpoint record

The line above:
[19:10:02.701](318.076s) ok 1 - 000000010000000000000001 differs from 000000010000000000000002
tells us about the duration of previous operations (> 5 mins).

src/test/recovery/tmp_check/log/026_overwrite_contrecord_primary.log:
2023-10-10 19:04:50.149 UTC [1845798][postmaster][:0] LOG:  database system is ready to accept connections
...
2023-10-10 19:09:49.131 UTC [1847585][checkpointer][:0] LOG:  checkpoint starting: time
...
2023-10-10 19:10:02.058 UTC [1847585][checkpointer][:0] LOG:  checkpoint complete: ... lsn=0/2093980, redo lsn=0/1F62760

And here is one more instance of this failure [2]:
2022-11-08 02:35:25.826 UTC [1614205][][:0] PANIC:  could not locate a valid checkpoint record
2022-11-08 02:35:26.164 UTC [1612967][][:0] LOG:  startup process (PID 1614205) was terminated by signal 6: Aborted

src/test/recovery/tmp_check/log/026_overwrite_contrecord_primary.log:
2022-11-08 02:29:57.961 UTC [1546469][][:0] LOG:  database system is ready to accept connections
...
2022-11-08 02:35:10.764 UTC [1611737][][2/10:0] LOG:  statement: SELECT pg_walfile_name(pg_current_wal_insert_lsn())
2022-11-08 02:35:11.598 UTC [1546469][][:0] LOG:  received immediate shutdown request

The next successful run after the failure [1] shows the following duration:
[21:34:48.556](180.150s) ok 1 - 000000010000000000000001 differs from 000000010000000000000002
And the last successful run:
[03:03:53.892](126.206s) ok 1 - 000000010000000000000001 differs from 000000010000000000000002

So to fail on the test, skink should perform at least twice slower than
usual, and may be it's an extraordinary condition indeed, but on the other
hand, may be increase checkpoint_timeout as already done in several tests
(015_promotion_pages, 038_save_logical_slots_shutdown, 039_end_of_wal, ...).

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2023-10-10%2017%3A10%3A11
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-11-07%2020%3A27%3A11

Best regards,
Alexander

Re: Test 026_overwrite_contrecord fails on very slow machines (under Valgrind)

От
Michael Paquier
Дата:
On Thu, Oct 12, 2023 at 02:00:00PM +0300, Alexander Lakhin wrote:
> So to fail on the test, skink should perform at least twice slower than
> usual, and may be it's an extraordinary condition indeed, but on the other
> hand, may be increase checkpoint_timeout as already done in several tests
> (015_promotion_pages, 038_save_logical_slots_shutdown, 039_end_of_wal, ...).
>
> [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2023-10-10%2017%3A10%3A11
> [2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-11-07%2020%3A27%3A11

Thanks for the investigation.  Increasing the checkpoint timeout is
not a perfect science but at least it would work until a machine is
able to be slower than the current limit reached, so I would be OK
with your suggestion and raise the bar a bit more to prevent the race
created by these extra checkpoints triggered because of the time.
--
Michael

Вложения

Re: Test 026_overwrite_contrecord fails on very slow machines (under Valgrind)

От
Andres Freund
Дата:
Hi,

On 2023-10-12 14:00:00 +0300, Alexander Lakhin wrote:
> So to fail on the test, skink should perform at least twice slower than
> usual

The machine skink is hosted on runs numerous buildfarm animals (24 I think
right now, about to be 28). While it has plenty resources (16 cores/32
threads, 128GB RAM), test runtime is still pretty variable depending on what
other tests are running at the same time...

Greetings,

Andres Freund



Re: Test 026_overwrite_contrecord fails on very slow machines (under Valgrind)

От
Michael Paquier
Дата:
On Thu, Oct 12, 2023 at 04:46:02PM -0700, Andres Freund wrote:
> The machine skink is hosted on runs numerous buildfarm animals (24 I think
> right now, about to be 28). While it has plenty resources (16 cores/32
> threads, 128GB RAM), test runtime is still pretty variable depending on what
> other tests are running at the same time...

Okay.  It seems to me that just setting checkpoint_timeout to 1h
should leave plenty of room to make sure that the test is not unstable
compared to the default of 5 mins.  So, why not just do that and see
what happens for a few days?
--
Michael

Вложения