Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

Поиск
Список
Период
Сортировка
От Thomas Munro
Тема Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"
Дата
Msg-id CA+hUKG+fYGFzSSuGMXZscuD1gp+6rMbs_XpFxD-YGGBJjP1fOg@mail.gmail.com
обсуждение исходный текст
Ответ на Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"  (Stephen Frost <sfrost@snowman.net>)
Ответы Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"  (Thomas Munro <thomas.munro@gmail.com>)
Список pgsql-hackers
On Tue, Jul 25, 2023 at 6:04 AM Stephen Frost <sfrost@snowman.net> wrote:
> * Thomas Munro (thomas.munro@gmail.com) wrote:
> > Here's a new minimal patch that solves only the bugs in basebackup +
> > the simple SQL-facing functions that read the control file, by simply
> > acquiring ControlFileLock in the obvious places.  This should be
> > simple enough for back-patching?
>
> I don't particularly like fixing this in a way that only works for
> pg_basebackup and means that the users of other backup tools don't have
> a solution to this problem.  What are we supposed to tell users of
> pgBackRest when they see this fix for pg_basebackup in the next set of
> minor releases and they ask us if we've addressed this risk?
>
> We might be able to accept the 're-read on CRC failure' approach, if it
> were being used for pg_controldata and we documented that external
> tools and especially backup tools using the low-level API are required
> to check the CRC and to re-read on failure if accessing a running
> system.

Hi Stephen, David, and thanks for looking.  Alright, let's try that idea out.

0001 + 0002.  Acquire the lock in the obvious places in the backend,
to completely exclude the chance of anything going wrong for the easy
cases, including pg_basebackup.  (This is the v4 from yesterday).
And...

0003.  Document this problem and how to detect it, in the
low-level-backup section.  Better words welcome.  And...

0004.  Retries for front-end programs, using the idea suggested by Tom
(to wit: if it fails, retry until it fails with the same CRC value
twice).  It's theoretically imperfect, but it's highly unlikely to
fail in practice and the best we can do without file locks or a
connection to the server AFAICT.  (This is the first patch I posted,
adjusted to give up after 10 (?) loops, and not to bother at all in
backend code since that takes ControlFileLock with the 0001 patch).

> While it's not ideal, maybe we could get away with changing the contents
> of the backup_label as part of a back-patch?  The documentation, at
> least on a quick look, says to copy the second field from pg_backup_stop
> into a backup_label file but doesn't say anything about what those
> contents are or if they can change.  That would at least address the
> concern of backups ending up with a corrupt pg_control file and not
> being able to be restored, even if tools aren't updated to verify the
> CRC or similar.  Of course, that's a fair bit of code churn for a
> backpatch, which I certainly understand as being risky.  If we can't
> back-patch that, it might still be the way to go moving forward, while
> also telling tools to check the CRC.  (I'm not going to try to figure
> out some back-patchable pretend solution for this for shell scripts that
> pretend to be able to backup running PG databases; this issue is
> probably the least of their issues anyway...)

I think you're probably right that anyone following those instructions
would be OK if we just back-patched such a thing, but it all seems a
little too much to me.  +1 to looking into that for v17 (and I guess
maybe someone could eventually argue for back-patching much later with
experience).  I'm sure other solutions are possible too... other
places to put a safe atomic copy of the control file could include: in
the WAL, or in extra files (pg_control.XXX) in the data directory +
some infallible way for recovery to choose which one to start up from.
Or something.

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: Removing the fixed-size buffer restriction in hba.c
Следующее
От: Michael Paquier
Дата:
Сообщение: Re: Removing the fixed-size buffer restriction in hba.c