Re: BUG #17744: Fail Assert while recoverying from pg_basebackup
От | Thomas Munro |
---|---|
Тема | Re: BUG #17744: Fail Assert while recoverying from pg_basebackup |
Дата | |
Msg-id | CA+hUKGJWOsxT-kkMmLwvFn-VVmZ3vnYBeFw7=_Aq5txdFdGUEg@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: BUG #17744: Fail Assert while recoverying from pg_basebackup (Michael Paquier <michael@paquier.xyz>) |
Ответы |
Re: BUG #17744: Fail Assert while recoverying from pg_basebackup
|
Список | pgsql-bugs |
On Sat, Feb 25, 2023 at 4:39 PM Michael Paquier <michael@paquier.xyz> wrote: > By the way, I think that I have put my finger on something.. The > startup process happily sets InArchiveRecovery to true after > read_backup_label() if finding one, with a comment telling that > archive recovery is requested, still ArchiveRecoveryRequested is never > set to true unless there is a recovery.signal or a standby.signal. Yeah, it's all a bit confusing. That particular fact is the reason we can't just rely on RECOVERY_STATE_CRASH to suppress restart points in RecoveryRestartPoint(), which was my first naive idea before I started working through scenarios... Another thing that is a little confusing is that commit 7ff23c6 didn't change the behaviour of hot standbys in this respect: they already started the checkpointer at the start of recovery, not at the start of hot standby. Which leads to Bowen's suggestion to make recovery-from-backup work the same. As for what we're actually going to do about this, and how to test that it's working the way we want, here's my current idea, which I'm planning to work up experimental patches for on Monday: (1) Since it's quite hard to hit the assertion failure without generating a megaton of data, or hacking checkpoint_timeout to allow 1 second etc, I think we should test the startup process's behaviour instead (ie rather than the checkpointer's). We could have RecoveryRestartPoint() log when it's recording a restart point in shared memory (= a potential restart point, whether or not the checkpoint chooses to act on it), and also when it's choosing to skip. At DEBUG2 level. Then we could then have a test 035_restart_points.pl for all the various situations (actual crash, backup, backup + archives, pitr, with standby enabled, etc etc), and make assertions about which messages we log. That gives us observability and nails down the differences between these modes. (2) Then I think we should simply prevent restart points that 14 wouldn't have allowed, just a couple of back-patchable lines, maybe (yet another) new variable. Then for master I think we should (in later patches, for 16 or maybe 17) consider relaxing that. There doesn't seem to be a technical reason why a very long running PITR shouldn't benefit from restart points, and that would be in line with the direction of making recovery modes more similar, but given we/I missed this subtlety in the past, the more conservative choice seems more appropriate for 15.
В списке pgsql-bugs по дате отправления: