Re: [BUG] non archived WAL removed during production crash recovery
От | Michael Paquier |
---|---|
Тема | Re: [BUG] non archived WAL removed during production crash recovery |
Дата | |
Msg-id | 20200417065043.GC350229@paquier.xyz обсуждение исходный текст |
Ответ на | Re: [BUG] non archived WAL removed during production crash recovery (Jehan-Guillaume de Rorthais <jgdr@dalibo.com>) |
Ответы |
Re: [BUG] non archived WAL removed during production crash recovery
|
Список | pgsql-bugs |
On Fri, Apr 17, 2020 at 12:07:39AM +0200, Jehan-Guillaume de Rorthais wrote: > On Thu, 16 Apr 2020 17:11:00 +0900 Michael Paquier <michael@paquier.xyz> wrote: >> Removing .done files does not matter as the segments related to them are >> already gone. > > Not necessarily. There's a time windows between the moment the archiver set > the .done file and when the checkpointer removes the associated WAL file. > So, after a PANIC because of lack of space, WAL associated with .done files but > not removed yet will be removed during the crash recovery. Not sure that it is something that matters for this thread though, so if necessary I think that it could be be discussed separately. >> Even with that, why should we need to make the backend smarter about >> the removal of .ready files during crash recovery. It seems to me >> that we should keep them, and an operator could always come by himself >> and do some manual cleanup to free some space in the pg_wal partition. > > We are agree on this. Okay. > Unless I'm wrong, the empty string does not raise an error in pg_stat_archiver, > and I wanted to add a test on this as well. Exactly, it won't raise an error. Instead I switched to use a poll query with pg_stat_file() and .ready files, but this has proved to delay the test considerably if we did not create more segments. And your approach has the merit to be more simple with only two segments manipulated for the whole test. So I have tried first my idea, noticed the mess it introduced, and just kept your approach. > Thanks for your review! Let me know if you want me to add/change/fix some tests. Thanks, I have worked more on the test, refactoring pieces related to the segment names, adjusting some comments and fixing some of the logic. Note that you introduced something incorrect at the creation of $standby2 as you have been updating postgresql.conf.auto for $standby1. I have noticed an extra issue while looking at the backend pieces today: at the beginning of the REDO loop we forgot one place where SharedRecoveryState *has* to be updated to a correct state (around the comment "Update pg_control to show that we are..." in xlog.c) as the startup process may decide to switch the control file state to DB_IN_ARCHIVE_RECOVERY or DB_IN_CRASH_RECOVERY, but we forgot to update the new shared flag at this early stage. It did not matter before because SharedRecoveryInProgress would be only "true" for both, but that's not the case anymore as we need to make the difference between crash recovery and archive recovery in the new flag. There is no actual need to update SharedRecoveryState to RECOVERY_STATE_CRASH as the initial shared memory state is RECOVERY_STATE_CRASH, but updating the flag makes the code more consistent IMO so I updated it anyway in the attached. I have the feeling that I need to work a bit more on this patch, but my impression is that we are getting to something committable here. Thoughts? -- Michael
Вложения
В списке pgsql-bugs по дате отправления: