Recovery test failure for recovery_min_apply_delay on hamster

From Michael Paquier
Subject Recovery test failure for recovery_min_apply_delay on hamster
Date
Msg-id CAB7nPqSAZ9HnUcMoUa30JO2wJ8MnREm18p2a7McRA-ZrJxj3Vw@mail.gmail.com
Responses Re: Recovery test failure for recovery_min_apply_delay on hamster  (Michael Paquier <michael.paquier@gmail.com>)
List pgsql-hackers
Hi all,

I enabled the recovery test suite on hamster yesterday, and we did not
have to wait long before seeing the first failure on it. The machine
is slow as hell, so it is quite good at catching race conditions:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamster&dt=2016-03-01%2016%3A00%3A06
Honestly, I had run the test suite on this machine before without
seeing this failure, so it is quite sporadic. Yesterday's run worked
fine, for example.

In more detail, the following problem showed up:
### Running SQL command on node "standby": SELECT count(*) FROM tab_int
not ok 1 - check content with delay of 1s

#   Failed test 'check content with delay of 1s'
#   at t/005_replay_delay.pl line 39.
#          got: '20'
#     expected: '10'
### Running SQL command on node "master": SELECT pg_current_xlog_location();
### Running SQL command on node "standby": SELECT count(*) FROM tab_int
ok 2 - check content with delay of 2s

This is a timing issue caused by the use of recovery_min_apply_delay.
The test does the following:
1) Set recovery_min_apply_delay to 2 seconds
2) Start the standby
3) Run an INSERT on master, and save pg_current_xlog_location() from
master
4) Sleep 1s
5) Query the standby, and check that the WAL has not been applied yet
6) Wait until the required LSN from master has been applied
7) Query the standby again, and check that the WAL has been applied
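
Step 6 amounts to polling the standby until its replay position reaches the LSN saved in step 3. Here is a minimal Python sketch of that comparison and polling loop. It is not the Perl code of the test: parse_lsn, wait_for_replay and the get_replay_lsn callback are hypothetical names, and the callback stands in for whatever query returns the standby's replay location.

```python
import time
from typing import Callable


def parse_lsn(lsn: str) -> int:
    """Convert a textual LSN such as '0/3000060' into a 64-bit integer."""
    hi, lo = lsn.split("/")
    # An LSN is printed as two hexadecimal numbers: high and low 32 bits.
    return (int(hi, 16) << 32) | int(lo, 16)


def wait_for_replay(target_lsn: str,
                    get_replay_lsn: Callable[[], str],
                    timeout: float = 90.0,
                    interval: float = 0.1) -> bool:
    """Poll until the standby has replayed up to target_lsn, or time out.

    get_replay_lsn is a hypothetical callback returning the standby's
    current replay location as text.
    """
    target = parse_lsn(target_lsn)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if parse_lsn(get_replay_lsn()) >= target:
            return True
        time.sleep(interval)
    return False
```

The LSNs are compared as integers because a plain string comparison would order "0/FFFFFFFF" after "1/0".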

The problem is that hamster is visibly so slow that more than 2s
elapsed between steps 3 and 5, meaning that the delay had already
expired and the WAL had been applied.
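
The race can be illustrated with a small timing model in Python. This is only a sketch: rows_visible is a hypothetical helper, the overhead values are made up, and the model ignores clock skew and replication lag. The row counts 10 and 20 come from the test output above.

```python
def rows_visible(elapsed_s: float, apply_delay_s: float,
                 rows_before: int = 10, rows_after: int = 20) -> int:
    """Rows the standby shows elapsed_s seconds after the INSERT commits."""
    # The standby holds back the commit until apply_delay_s has passed.
    return rows_after if elapsed_s >= apply_delay_s else rows_before


APPLY_DELAY_S = 2.0  # recovery_min_apply_delay used by the test
SLEEP_S = 1.0        # the sleep of step 4

# Hypothetical extra time spent between step 3 and step 5 on the machine.
for overhead_s in (0.2, 0.9, 1.3):
    seen = rows_visible(SLEEP_S + overhead_s, APPLY_DELAY_S)
    print(f"overhead={overhead_s}s -> standby shows {seen} rows")
```

As long as the sleep plus the overhead stays under 2s the check of step 5 sees 10 rows. Once it goes over, the check sees 20 rows and fails, which is what happened on hamster.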

Here are a few ways to address this problem:
1) Remove the check done before the delay has elapsed
2) Increase recovery_min_apply_delay to a value that allows even slow
machines to see a difference. From experience with the other tests,
30s would be enough. The sleep time needs to be increased as well,
which makes the test take longer to run
3) Remove 005 altogether, because doing either 1) or 2) reduces the
value of the test.
Personally, I'd prefer 1); I still see value in this test.
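
To compare 1) and 2), here is a rough Python sketch of how much slowness the early check tolerates in each case. early_check_passes is a hypothetical helper, and the 15s sleep used for 2) is an assumption, since only the 30s delay is proposed above.

```python
from typing import Optional


def early_check_passes(apply_delay_s: float,
                       sleep_s: Optional[float],
                       overhead_s: float) -> bool:
    """True if the 'WAL not applied yet' check succeeds.

    sleep_s=None means the early check has been removed (option 1).
    """
    if sleep_s is None:
        return True  # no early check, so nothing to race against
    return sleep_s + overhead_s < apply_delay_s


configs = {
    "current (2s delay, 1s sleep)": (2.0, 1.0),
    "option 1 (no early check)": (2.0, None),
    "option 2 (30s delay, assumed 15s sleep)": (30.0, 15.0),
}

for name, (delay_s, sleep_s) in configs.items():
    for overhead_s in (0.5, 1.5, 10.0):  # hypothetical machine overheads
        ok = early_check_passes(delay_s, sleep_s, overhead_s)
        print(f"{name}: overhead={overhead_s}s -> {'ok' if ok else 'FAIL'}")
```

In this model 2) only moves the threshold, and each run has to wait for at least the full delay, while 1) cannot fail on timing but no longer verifies that the WAL is actually held back.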

Thoughts?
-- 
Michael


