Re: [BUG] non archived WAL removed during production crash recovery
From | Michael Paquier |
---|---|
Subject | Re: [BUG] non archived WAL removed during production crash recovery |
Date | |
Msg-id | 20200427074945.GG11369@paquier.xyz |
In reply to | Re: [BUG] non archived WAL removed during production crash recovery (Jehan-Guillaume de Rorthais <jgdr@dalibo.com>) |
Replies | Re: [BUG] non archived WAL removed during production crash recovery |
List | pgsql-bugs |
On Fri, Apr 24, 2020 at 03:03:00PM +0200, Jehan-Guillaume de Rorthais wrote:
> I agree the three tests could be removed as they were not covering the bug we
> were chasing. However, they might still be useful to detect futur non expected
> behavior changes. If you agree with this, please, find in attachment a patch
> proposal against HEAD that recreate these three tests **after** a waiting loop
> on both standby1 and standby2. This waiting loop is inspired from the tests in
> 9.5 -> 10.

FWIW, I would prefer keeping all three tests as well.

So.. I have spent more time on this problem, and mereswine here is a very good
sample because it failed all three tests:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mereswine&dt=2020-04-24%2006%3A03%3A53

For standby2, we get this failure:

ok 11 - .ready file for WAL segment 000000010000000000000001 existing in backup is kept with archive_mode=always on standby
not ok 12 - .ready file for WAL segment 000000010000000000000002 created with archive_mode=always on standby

Then, looking at 020_archive_status_standby2.log, we have the following logs:

2020-04-24 02:08:32.032 PDT [9841:3] 020_archive_status.pl LOG:  statement: CHECKPOINT
[...]
2020-04-24 02:08:32.303 PDT [9821:7] LOG:  restored log file "000000010000000000000002" from archive

In this case, the test forced a checkpoint to test the segment recycling
*before* the extra restored segment we'd like to work on was actually
restored.  So it looks like my initial feeling about the timing issue was
right, and I am also able to reproduce the original set of failures by adding
a manual sleep to delay restores of segments, like that for example:

--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -74,6 +74,8 @@ RestoreArchivedFile(char *path, const char *xlogfname,
 	if (recoveryRestoreCommand == NULL ||
 		strcmp(recoveryRestoreCommand, "") == 0)
 		goto not_available;

+	pg_usleep(10 * 1000000);	/* 10s */
+
 	/*

With your patch the problem does not show up anymore even with the delay
added, so I would like to apply what you have sent and add back those tests.
For now, I would just patch HEAD though, as that's not worth the risk of
destabilizing stable branches in the buildfarm.

> 	$primary->poll_query_until('postgres',
> 		q{SELECT archived_count FROM pg_stat_archiver}, '1')
> -		or die "Timed out while waiting for archiving to finish";
> +	    or die "Timed out while waiting for archiving to finish";

Some noise in the patch.  This may have come from some unfinished business
with pgindent.
--
Michael
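For context, the waiting loop discussed above would look roughly like the
following TAP-test sketch. This is an illustration only, not the attached
patch: the node name ($standby2) and the use of poll_query_until against
pg_stat_archiver are assumptions modeled on the snippets quoted in the thread.

    # Sketch: make the standby wait until it reports one archived segment
    # before forcing the CHECKPOINT that recycles old WAL, so the test no
    # longer races against the restore of the extra segment.
    # $standby2 is a PostgresNode object from the TAP framework.
    $standby2->poll_query_until('postgres',
        q{SELECT archived_count FROM pg_stat_archiver}, '1')
      or die "Timed out while waiting for archiving to finish on standby2";

    # Only after the wait is it safe to trigger segment recycling.
    $standby2->safe_psql('postgres', 'CHECKPOINT');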
Attachments