Обсуждение: Use of backup_label not noted in log

Поиск
Список
Период
Сортировка

Use of backup_label not noted in log

От
Andres Freund
Дата:
Hi,

I've often had to analyze what caused corruption in PG instances, where the
symptoms match not having had backup_label in place when bringing on the
node. However that's surprisingly hard - the only log messages that indicate
use of backup_label are at DEBUG1.

Given how crucial use of backup_label is and how frequently people do get it
wrong, I think we should add a LOG message - it's not like use of backup_label
is a frequent thing in the life of a postgres instance and is going to swamp
the log.  And I think we should backpatch that addition.

Medium term I think we should go further, and leave evidence in pg_control
about the last use of ControlFile->backupStartPoint, instead of resetting it.

I realize that there's a discussion about removing backup_label - but I think
that's fairly orthogonal. Should we go with the pg_control approach, we should
still emit a useful message when starting in a state that's "equivalent" to
having used the backup_label.

Thoughts?

Greetings,

Andres Freund



Re: Use of backup_label not noted in log

От
Laurenz Albe
Дата:
On Thu, 2023-11-16 at 20:18 -0800, Andres Freund wrote:
> I've often had to analyze what caused corruption in PG instances, where the
> symptoms match not having had backup_label in place when bringing on the
> node. However that's surprisingly hard - the only log messages that indicate
> use of backup_label are at DEBUG1.
>
> Given how crucial use of backup_label is and how frequently people do get it
> wrong, I think we should add a LOG message - it's not like use of backup_label
> is a frequent thing in the life of a postgres instance and is going to swamp
> the log.  And I think we should backpatch that addition.

+1

I am not sure about the backpatch: it is not a bug, and we should not wantonly
introduce new log messages in a minor release.  Some monitoring system may
get confused.

What about adding it to the "redo starts at" message, something like

  redo starts at 12/12345678, taken from control file

or

  redo starts at 12/12345678, taken from backup label

Yours,
Laurenz Albe



Re: Use of backup_label not noted in log

От
David Steele
Дата:
On 11/17/23 00:18, Andres Freund wrote:
> 
> I've often had to analyze what caused corruption in PG instances, where the
> symptoms match not having had backup_label in place when bringing on the
> node. However that's surprisingly hard - the only log messages that indicate
> use of backup_label are at DEBUG1.
> 
> Given how crucial use of backup_label is and how frequently people do get it
> wrong, I think we should add a LOG message - it's not like use of backup_label
> is a frequent thing in the life of a postgres instance and is going to swamp
> the log.  And I think we should backpatch that addition.

+1 for the message and I think a backpatch is fine as long as it is a 
new message. If monitoring systems can't handle an unrecognized message 
then that feels like a problem on their part.

> Medium term I think we should go further, and leave evidence in pg_control
> about the last use of ControlFile->backupStartPoint, instead of resetting it.

Michael also thinks this is a good idea.

> I realize that there's a discussion about removing backup_label - but I think
> that's fairly orthogonal. Should we go with the pg_control approach, we should
> still emit a useful message when starting in a state that's "equivalent" to
> having used the backup_label.

Agreed, this new message could easily be adapted to the recovery in 
pg_control patch.

Regards,
-David



Re: Use of backup_label not noted in log

От
David Steele
Дата:
On 11/17/23 01:41, Laurenz Albe wrote:
> On Thu, 2023-11-16 at 20:18 -0800, Andres Freund wrote:
>> I've often had to analyze what caused corruption in PG instances, where the
>> symptoms match not having had backup_label in place when bringing on the
>> node. However that's surprisingly hard - the only log messages that indicate
>> use of backup_label are at DEBUG1.
>>
>> Given how crucial use of backup_label is and how frequently people do get it
>> wrong, I think we should add a LOG message - it's not like use of backup_label
>> is a frequent thing in the life of a postgres instance and is going to swamp
>> the log.  And I think we should backpatch that addition.
> 
> +1
> 
> I am not sure about the backpatch: it is not a bug, and we should not wantonly
> introduce new log messages in a minor release.  Some monitoring system may
> get confused.
> 
> What about adding it to the "redo starts at" message, something like
> 
>    redo starts at 12/12345678, taken from control file
> 
> or
> 
>    redo starts at 12/12345678, taken from backup label

I think a backpatch is OK as long as it is a separate message, but I 
like your idea of adding to the "redo starts" message going forward.

I know this isn't really a bug, but not being able to tell where 
recovery information came from seems like a major omission in the logging.

Regards,
-David



Re: Use of backup_label not noted in log

От
Andres Freund
Дата:
Hi,

On 2023-11-17 06:41:46 +0100, Laurenz Albe wrote:
> On Thu, 2023-11-16 at 20:18 -0800, Andres Freund wrote:
> > I've often had to analyze what caused corruption in PG instances, where the
> > symptoms match not having had backup_label in place when bringing on the
> > node. However that's surprisingly hard - the only log messages that indicate
> > use of backup_label are at DEBUG1.
> >
> > Given how crucial use of backup_label is and how frequently people do get it
> > wrong, I think we should add a LOG message - it's not like use of backup_label
> > is a frequent thing in the life of a postgres instance and is going to swamp
> > the log.  And I think we should backpatch that addition.
>
> +1
>
> I am not sure about the backpatch: it is not a bug, and we should not wantonly
> introduce new log messages in a minor release.  Some monitoring system may
> get confused.

I think log monitoring need (and do) handle unknown log messages
gracefully. You're constantly encountering them.  If were to change an
existing log message in the back branches it'd be a different story.

The reason for backpatching is that this is by far the most common reason for
corrupted systems in the wild that I have seen. And there's no way to
determine from the logs whether something has gone right or wrong - not really
a bug, but a pretty substantial weakness. And we're going to have to deal with
< 17 releases for 5 years, so making this at least somewhat diagnosable seems
like a good idea.


> What about adding it to the "redo starts at" message, something like
>
>   redo starts at 12/12345678, taken from control file
>
> or
>
>   redo starts at 12/12345678, taken from backup label

I think it'd make sense to log use of backup_label earlier than that - the
locations from backup_label might end up not being available in the archive,
the primary or locally, and we'll error out with "could not locate a valid
checkpoint record".

I'd probably just do it within the if (read_backup_label()) block in
InitWalRecovery(), *before* the ReadCheckpointRecord().

I do like the idea of expanding the "redo starts at" message
though. E.g. including minRecoveryLSN, ControlFile->backupStartPoint,
ControlFile->backupEndPoint would provide information about when the node
might become consistent.

Greetings,

Andres Freund



Re: Use of backup_label not noted in log

От
Andres Freund
Дата:
Hi,

On 2023-11-18 09:30:01 -0400, David Steele wrote:
> I know this isn't really a bug, but not being able to tell where recovery
> information came from seems like a major omission in the logging.

Yea. I was preparing to forecefully suggest that some monitoring tooling
should verify that new standbys and PITRs needs to check that backup_label was
actually used, just to remember that there's nothing they could realistically
use (using DEBUG1 in production imo isn't realistic).

Greetings,

Andres Freund



Re: Use of backup_label not noted in log

От
Andres Freund
Дата:
Hi,

On 2023-11-18 10:01:42 -0800, Andres Freund wrote:
> > What about adding it to the "redo starts at" message, something like
> >
> >   redo starts at 12/12345678, taken from control file
> >
> > or
> >
> >   redo starts at 12/12345678, taken from backup label
> 
> I think it'd make sense to log use of backup_label earlier than that - the
> locations from backup_label might end up not being available in the archive,
> the primary or locally, and we'll error out with "could not locate a valid
> checkpoint record".
> 
> I'd probably just do it within the if (read_backup_label()) block in
> InitWalRecovery(), *before* the ReadCheckpointRecord().

Not enamored with the phrasing of the log messages, but here's a prototype:

When starting up with backup_label present:
LOG:  starting from base backup with redo LSN A/34100028, checkpoint LSN A/34100080 on timeline ID 1

When restarting before reaching the end of the backup, but after backup_label
has been removed:
LOG:  continuing to start from base backup with redo LSN A/34100028
LOG:  entering standby mode
LOG:  redo starts at A/3954B958

Note that the LSN in the "continuing" case is the one the backup started at,
not where recovery will start.


I've wondered whether it's worth also adding an explicit message just after
ReachedEndOfBackup(), but it seems far less urgent due to the existing
"consistent recovery state reached at %X/%X" message.


We are quite inconsistent about how we spell LSNs. Sometimes with LSN
preceding, sometimes not. Sometimes with (LSN). Etc.


> I do like the idea of expanding the "redo starts at" message
> though. E.g. including minRecoveryLSN, ControlFile->backupStartPoint,
> ControlFile->backupEndPoint would provide information about when the node
> might become consistent.

Playing around with this a bit, I'm wondering if we instead should remove that
message, and emit something more informative earlier on. If there's a problem,
you kinda want the information before we finally get to the loop in
PerformWalLRecovery(). If e.g. there's no WAL you'll only get
LOG:  invalid checkpoint record
PANIC:  could not locate a valid checkpoint record

which is delightfully lacking in details.


There also are some other oddities:

If the primary is down when starting up, and we'd need WAL from the primary
for the first record, the "redo start at" message is delayed until that
happens, because we emit the message not before we read the first record, but
after. That's just plain odd.

And sometimes we'll start referencing the LSN at which we are starting
recovery before the "redo starts at" message. If e.g. we shut down
at a restart point, we'll emit

  LOG:  consistent recovery state reached at ...
before
  LOG:  redo starts at ...


But that's all clearly just material for HEAD.

Greetings,

Andres Freund

Вложения

Re: Use of backup_label not noted in log

От
Michael Paquier
Дата:
On Sat, Nov 18, 2023 at 01:49:15PM -0800, Andres Freund wrote:
> Note that the LSN in the "continuing" case is the one the backup started at,
> not where recovery will start.
>
> I've wondered whether it's worth also adding an explicit message just after
> ReachedEndOfBackup(), but it seems far less urgent due to the existing
> "consistent recovery state reached at %X/%X" message.

Upgrading the surrounding DEBUG1 to a LOG is another option, but I
agree that I've seen less that as being an actual problem in the field
compared to the famous I-removed-a-backup-label-and-I-m-still-up,
until this user sees signs of corruption after recovery was finished,
sometimes days after putting back an instance online.

> Playing around with this a bit, I'm wondering if we instead should remove that
> message, and emit something more informative earlier on. If there's a problem,
> you kinda want the information before we finally get to the loop in
> PerformWalLRecovery(). If e.g. there's no WAL you'll only get
> LOG:  invalid checkpoint record
> PANIC:  could not locate a valid checkpoint record

I was looking at this code a few weeks ago and have on my stack of
list to do an item about sending a patch to make this exact message
PANIC more talkative as there are a lot of instances with
log_min_messages > log.

> which is delightfully lacking in details.

With a user panicking as much as the server itself, that's even more
tasty.

+       if (ControlFile->backupStartPoint != InvalidXLogRecPtr)
+           ereport(LOG,
+                   (errmsg("continuing to start from base backup with redo LSN %X/%X",
+                           LSN_FORMAT_ARGS(ControlFile->backupStartPoint))));

"Continuing to start" sounds a bit weird to me, though, considering
that there are a few LOGs that say "starting" when there is a signal
file, but I don't have a better idea on top of my mind.  So that
sounds OK here.
--
Michael

Вложения

Re: Use of backup_label not noted in log

От
Laurenz Albe
Дата:
I can accept that adding log messages to back branches is ok.
Perhaps I am too nervous about things like that, because as an extension
developer I have been bitten too often by ABI breaks in minor releases
in the past.

On Mon, 2023-11-20 at 17:30 +0900, Michael Paquier wrote:
> +       if (ControlFile->backupStartPoint != InvalidXLogRecPtr)
> +           ereport(LOG,
> +                   (errmsg("continuing to start from base backup with redo LSN %X/%X",
> +                           LSN_FORMAT_ARGS(ControlFile->backupStartPoint))));
>
> "Continuing to start" sounds a bit weird to me, though, considering
> that there are a few LOGs that say "starting" when there is a signal
> file, but I don't have a better idea on top of my mind.  So that
> sounds OK here.

We can only reach that message in recovery or standby mode, right?
So why not write "continuing to recover from base backup"?


If we add a message for starting with "backup_label", shouldn't
we also add a corresponding message for starting from a checkpoint
found in the control file?  If you see that in a problem report,
you immediately know what is going on.

Yours,
Laurenz Albe



Re: Use of backup_label not noted in log

От
David Steele
Дата:
[Resending since I accidentally replied off-list]

On 11/18/23 17:49, Andres Freund wrote:
> On 2023-11-18 10:01:42 -0800, Andres Freund wrote:
>>> What about adding it to the "redo starts at" message, something like
>>>
>>>    redo starts at 12/12345678, taken from control file
>>>
>>> or
>>>
>>>    redo starts at 12/12345678, taken from backup label
>>
>> I think it'd make sense to log use of backup_label earlier than that - the
>> locations from backup_label might end up not being available in the archive,
>> the primary or locally, and we'll error out with "could not locate a valid
>> checkpoint record".
>>
>> I'd probably just do it within the if (read_backup_label()) block in
>> InitWalRecovery(), *before* the ReadCheckpointRecord().
> 
> Not enamored with the phrasing of the log messages, but here's a prototype:
> 
> When starting up with backup_label present:
> LOG:  starting from base backup with redo LSN A/34100028, checkpoint LSN A/34100080 on timeline ID 1

I'd prefer something like:

LOG:  starting backup recovery with redo...

> When restarting before reaching the end of the backup, but after backup_label
> has been removed:
> LOG:  continuing to start from base backup with redo LSN A/34100028
> LOG:  entering standby mode
> LOG:  redo starts at A/3954B958

And here:

LOG:  restarting backup recovery with redo...

> Note that the LSN in the "continuing" case is the one the backup started at,
> not where recovery will start.
> 
> I've wondered whether it's worth also adding an explicit message just after
> ReachedEndOfBackup(), but it seems far less urgent due to the existing
> "consistent recovery state reached at %X/%X" message.

I think the current message is sufficient, but what do you have in mind?

> We are quite inconsistent about how we spell LSNs. Sometimes with LSN
> preceding, sometimes not. Sometimes with (LSN). Etc.

Well, this could be improved in HEAD for sure.

>> I do like the idea of expanding the "redo starts at" message
>> though. E.g. including minRecoveryLSN, ControlFile->backupStartPoint,
>> ControlFile->backupEndPoint would provide information about when the node
>> might become consistent.

+1

> Playing around with this a bit, I'm wondering if we instead should remove that
> message, and emit something more informative earlier on. If there's a problem,
> you kinda want the information before we finally get to the loop in
> PerformWalLRecovery(). If e.g. there's no WAL you'll only get
> LOG:  invalid checkpoint record
> PANIC:  could not locate a valid checkpoint record
> 
> which is delightfully lacking in details.

I've been thinking about improving this myself. It would probably also 
help a lot to hint that restore_command may be missing or not returning 
results (but also not erroring). But there are a bunch of ways to get to 
this message so we'd need to be careful.

> There also are some other oddities:
> 
> If the primary is down when starting up, and we'd need WAL from the primary
> for the first record, the "redo start at" message is delayed until that
> happens, because we emit the message not before we read the first record, but
> after. That's just plain odd.

Agreed. Moving it up would be better.

> And sometimes we'll start referencing the LSN at which we are starting
> recovery before the "redo starts at" message. If e.g. we shut down
> at a restart point, we'll emit
> 
>    LOG:  consistent recovery state reached at ...
> before
>    LOG:  redo starts at ...

Huh, I haven't seen that one. Definitely confusing.

> But that's all clearly just material for HEAD.

Absolutely. I've been thinking about some of this as well, but want to 
see if we can remove the backup label first so we don't have to rework a 
bunch of stuff.

Of course, that shouldn't stop you from proceeding. I'm sure anything 
you are thinking of here could be adapted.

Regards,
-David



Re: Use of backup_label not noted in log

От
David Steele
Дата:
On 11/20/23 06:35, Laurenz Albe wrote:
> 
> If we add a message for starting with "backup_label", shouldn't
> we also add a corresponding message for starting from a checkpoint
> found in the control file?  If you see that in a problem report,
> you immediately know what is going on.

+1. It is easier to detect the presence of a message than the absence of 
one.

Regards,
-David



Re: Use of backup_label not noted in log

От
Robert Haas
Дата:
On Mon, Nov 20, 2023 at 5:35 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> I can accept that adding log messages to back branches is ok.
> Perhaps I am too nervous about things like that, because as an extension
> developer I have been bitten too often by ABI breaks in minor releases
> in the past.

I think that adding a log message to the back branches would probably
make my life better not worse, because when people do strange things
and then send me the log messages to figure out what the heck
happened, it would be there, and I'd have a clue. However, the world
doesn't revolve around me. I can imagine users getting spooked if a
new message that they've never seen before, and I think that risk
should be considered. There are good reasons for keeping the
back-branches stable, and as you said before, this isn't a bug fix.

I do also think it is worth considering how this proposal interacts
with the proposal to remove backup_label. If that proposal goes
through, then this proposal is obsolete, I believe. But if this is a
good idea, does that mean that's not a good idea? Or would we try to
make the pg_control which that patch would drop in place have some
internal difference which we could use to drive a similar log message?
Maybe we should, because knowing whether or not the user followed the
backup procedure correctly would indeed be a big help and it would be
regrettable to gain that capability only to lose it again...

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Use of backup_label not noted in log

От
David Steele
Дата:
On 11/20/23 12:24, Robert Haas wrote:
> On Mon, Nov 20, 2023 at 5:35 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
>> I can accept that adding log messages to back branches is ok.
>> Perhaps I am too nervous about things like that, because as an extension
>> developer I have been bitten too often by ABI breaks in minor releases
>> in the past.
> 
> I think that adding a log message to the back branches would probably
> make my life better not worse, because when people do strange things
> and then send me the log messages to figure out what the heck
> happened, it would be there, and I'd have a clue. However, the world
> doesn't revolve around me. I can imagine users getting spooked if a
> new message that they've never seen before, and I think that risk
> should be considered. There are good reasons for keeping the
> back-branches stable, and as you said before, this isn't a bug fix.

Personally I think that the value of the information outweighs the 
weirdness of a new message appearing.

> I do also think it is worth considering how this proposal interacts
> with the proposal to remove backup_label. If that proposal goes
> through, then this proposal is obsolete, I believe. 

Not at all. I don't even think the messages will need to be reworded, or 
not much since they don't mention backup_label.

> But if this is a
> good idea, does that mean that's not a good idea? Or would we try to
> make the pg_control which that patch would drop in place have some
> internal difference which we could use to drive a similar log message?

The recovery in pg_control patch has all the same recovery info stored, 
so similar (or the same) log message would still be appropriate.

> Maybe we should, because knowing whether or not the user followed the
> backup procedure correctly would indeed be a big help and it would be
> regrettable to gain that capability only to lose it again...

The info is certainly valuable and we wouldn't lose it, unless there is 
something I'm not getting.

Regards,
-David



Re: Use of backup_label not noted in log

От
Andres Freund
Дата:
Hi,

On 2023-11-20 11:24:25 -0500, Robert Haas wrote:
> I do also think it is worth considering how this proposal interacts
> with the proposal to remove backup_label. If that proposal goes
> through, then this proposal is obsolete, I believe.

I think it's the opposite, if anything. Today you can at least tell there was
use of a backup_label by looking for backup_label.old and you can verify
fairly easily in a restore script that backup_label is present. If we "just"
use pg_control, neither of those is as easy. I.e. log messages would be more
important, not less.  Depending on how things work out, we might need to
reformulate and/or move them a bit, but that doesn't seem like a big deal.


> But if this is a good idea, does that mean that's not a good idea? Or would
> we try to make the pg_control which that patch would drop in place have some
> internal difference which we could use to drive a similar log message?

I think we absolutely have to. If there's no way to tell whether an "external"
pg_backup_start/stop() procedure actually used the proper pg_control, it'd
make the situation substantially worse compared to today's, already bad,
situation.

Greetings,

Andres Freund



Re: Use of backup_label not noted in log

От
Andres Freund
Дата:
Hi,

On 2023-11-20 17:30:31 +0900, Michael Paquier wrote:
> On Sat, Nov 18, 2023 at 01:49:15PM -0800, Andres Freund wrote:
> > Note that the LSN in the "continuing" case is the one the backup started at,
> > not where recovery will start.
> > 
> > I've wondered whether it's worth also adding an explicit message just after
> > ReachedEndOfBackup(), but it seems far less urgent due to the existing
> > "consistent recovery state reached at %X/%X" message.
> 
> Upgrading the surrounding DEBUG1 to a LOG is another option, but I
> agree that I've seen less that as being an actual problem in the field
> compared to the famous I-removed-a-backup-label-and-I-m-still-up,
> until this user sees signs of corruption after recovery was finished,
> sometimes days after putting back an instance online.

"end of backup reached" could scare users, it doesn't obviously indicate
something "good". "completed backup recovery, started at %X/%X" or such would
be better imo.


> +       if (ControlFile->backupStartPoint != InvalidXLogRecPtr)
> +           ereport(LOG,
> +                   (errmsg("continuing to start from base backup with redo LSN %X/%X",
> +                           LSN_FORMAT_ARGS(ControlFile->backupStartPoint))));
> 
> "Continuing to start" sounds a bit weird to me, though, considering
> that there are a few LOGs that say "starting" when there is a signal
> file, but I don't have a better idea on top of my mind.  So that
> sounds OK here.

I didn't like it much either - but I like David's proposal in his sibling
reply:

LOG:  starting backup recovery with redo LSN A/34100028, checkpoint LSN A/34100080 on timeline ID 1
LOG:  restarting backup recovery with redo LSN A/34100028
and adding the message from above:
LOG:  completing backup recovery with redo LSN A/34100028

Greetings,

Andres Freund



Re: Use of backup_label not noted in log

От
Andres Freund
Дата:
Hi,

On 2023-11-20 11:35:15 +0100, Laurenz Albe wrote:
> On Mon, 2023-11-20 at 17:30 +0900, Michael Paquier wrote:
> > +       if (ControlFile->backupStartPoint != InvalidXLogRecPtr)
> > +           ereport(LOG,
> > +                   (errmsg("continuing to start from base backup with redo LSN %X/%X",
> > +                           LSN_FORMAT_ARGS(ControlFile->backupStartPoint))));
> >
> > "Continuing to start" sounds a bit weird to me, though, considering
> > that there are a few LOGs that say "starting" when there is a signal
> > file, but I don't have a better idea on top of my mind.  So that
> > sounds OK here.
>
> We can only reach that message in recovery or standby mode, right?
> So why not write "continuing to recover from base backup"?

It can be reached without either too, albeit much less commonly.


> If we add a message for starting with "backup_label", shouldn't
> we also add a corresponding message for starting from a checkpoint
> found in the control file?  If you see that in a problem report,
> you immediately know what is going on.

Maybe - the reason I hesitate on that is that emitting an additional log
message when starting from a base backup just adds something "once once the
lifetime of a node". Whereas emitting something every start obviously doesn't
impose any limit.

You also can extrapolate from the messages absence that we started up without
backup_label, it's not like there would be a lot of messages inbetween
  "database system was interrupted; last ..."
and
  "starting backup recovery ..."
(commonly there would be no messages)

We can do more on HEAD of course, but we should be wary of just spamming the
log unnecessarily as well.


I guess we could add this message at the same time, including in the back
branches. Initially I thought that might be unwise, because replacing
        elog(DEBUG1, "end of backup reached");
with a different message could theoretically cause issues, even if unlikely,
given that it's a DEBUG1 message.

But I think we actually want to emit the message a bit later, just *after* we
updated the control file, as that's the actually relevant piece after which we
won't go back to the "backup recovery" state.  I am somewhat agnostic about
whether we should add that in the back branches or not.


Here's the state with my updated patch, when starting up from a base backup:

LOG:  starting PostgreSQL 17devel on x86_64-linux, compiled by gcc-14.0.0, 64-bit
LOG:  listening on IPv6 address "::1", port 5441
LOG:  listening on IPv4 address "127.0.0.1", port 5441
LOG:  listening on Unix socket "/tmp/.s.PGSQL.5441"
LOG:  database system was interrupted; last known up at 2023-11-20 10:55:49 PST
LOG:  starting recovery from base backup with redo LSN E/AFF07F20, checkpoint LSN E/B01B17F0, on timeline ID 1
LOG:  entering standby mode
LOG:  redo starts at E/AFF07F20
LOG:  completed recovery from base backup with redo LSN E/AFF07F20
LOG:  consistent recovery state reached at E/B420FC80


Besides the phrasing and the additional log message (I have no opinion about
whether it should be backpatched or not), I used %u for TimelineID as
appropriate, and added a comma before "on timeline".

Greetings,

Andres Freund

Вложения

Re: Use of backup_label not noted in log

От
David Steele
Дата:
On 11/20/23 14:27, Andres Freund wrote:
> Hi,
> 
> On 2023-11-19 14:28:12 -0400, David Steele wrote:
>> On 11/18/23 17:49, Andres Freund wrote:
>>> On 2023-11-18 10:01:42 -0800, Andres Freund wrote:
>>> Not enamored with the phrasing of the log messages, but here's a prototype:
>>>
>>> When starting up with backup_label present:
>>> LOG:  starting from base backup with redo LSN A/34100028, checkpoint LSN A/34100080 on timeline ID 1
>>
>> I'd prefer something like:
>>
>> LOG:  starting backup recovery with redo...
> 
>>> When restarting before reaching the end of the backup, but after backup_label
>>> has been removed:
>>> LOG:  continuing to start from base backup with redo LSN A/34100028
>>> LOG:  entering standby mode
>>> LOG:  redo starts at A/3954B958
>>
>> And here:
>>
>> LOG:  restarting backup recovery with redo...
> 
> I like it.

Cool.

>>> I've wondered whether it's worth also adding an explicit message just after
>>> ReachedEndOfBackup(), but it seems far less urgent due to the existing
>>> "consistent recovery state reached at %X/%X" message.
>>
>> I think the current message is sufficient, but what do you have in mind?
> 
> Well, the consistency message is emitted after every restart. Whereas a single
> instance only should go through backup recovery once. So it seems worthwhile
> to differentiate the two in log messages.

Ah, right. That works for me, then.

Regards,
-David



Re: Use of backup_label not noted in log

От
David Steele
Дата:
On 11/20/23 15:03, Andres Freund wrote:
> On 2023-11-20 11:35:15 +0100, Laurenz Albe wrote:
> 
>> If we add a message for starting with "backup_label", shouldn't
>> we also add a corresponding message for starting from a checkpoint
>> found in the control file?  If you see that in a problem report,
>> you immediately know what is going on.
> 
> Maybe - the reason I hesitate on that is that emitting an additional log
> message when starting from a base backup just adds something "once once the
> lifetime of a node". Whereas emitting something every start obviously doesn't
> impose any limit.

Hmm, yeah, that would be a bit much.

> Here's the state with my updated patch, when starting up from a base backup:
> 
> LOG:  starting PostgreSQL 17devel on x86_64-linux, compiled by gcc-14.0.0, 64-bit
> LOG:  listening on IPv6 address "::1", port 5441
> LOG:  listening on IPv4 address "127.0.0.1", port 5441
> LOG:  listening on Unix socket "/tmp/.s.PGSQL.5441"
> LOG:  database system was interrupted; last known up at 2023-11-20 10:55:49 PST
> LOG:  starting recovery from base backup with redo LSN E/AFF07F20, checkpoint LSN E/B01B17F0, on timeline ID 1
> LOG:  entering standby mode
> LOG:  redo starts at E/AFF07F20
> LOG:  completed recovery from base backup with redo LSN E/AFF07F20
> LOG:  consistent recovery state reached at E/B420FC80
> 
> Besides the phrasing and the additional log message (I have no opinion about
> whether it should be backpatched or not), I used %u for TimelineID as
> appropriate, and added a comma before "on timeline".

I still wonder if we need "base backup" in the messages? That sort of 
implies (at least to me) you used pg_basebackup but that may not be the 
case.

FWIW, I also prefer "backup recovery" over "recovery from backup". 
"recovery from backup" reads fine here, but if gets more awkward when 
you want to say something like "recovery from backup settings". In that 
case, I think "backup recovery settings" reads better. Not important for 
this patch, maybe, but the recovery in pg_control patch went the other 
way and I definitely think it makes sense to keep them consistent, 
whichever way we go.

Other than that, looks good for HEAD. Whether we back patch or not is 
another question, of course.

Regards,
-David




Re: Use of backup_label not noted in log

От
Michael Paquier
Дата:
On Mon, Nov 20, 2023 at 03:31:20PM -0400, David Steele wrote:
> On 11/20/23 15:03, Andres Freund wrote:
>> Besides the phrasing and the additional log message (I have no opinion about
>> whether it should be backpatched or not), I used %u for TimelineID as
>> appropriate, and added a comma before "on timeline".

The "starting/restarting/completed recovery" line sounds better here,
so I'm OK with your suggestions.

> I still wonder if we need "base backup" in the messages? That sort of
> implies (at least to me) you used pg_basebackup but that may not be the
> case.

Or just s/base backup/backup/?

> Other than that, looks good for HEAD. Whether we back patch or not is
> another question, of course.

I'd rather see more information in the back-branches more quickly, so
count me in the bucket of folks in favor of a backpatch.
--
Michael

Вложения

Re: Use of backup_label not noted in log

От
Laurenz Albe
Дата:
On Mon, 2023-11-20 at 11:03 -0800, Andres Freund wrote:
> > If we add a message for starting with "backup_label", shouldn't
> > we also add a corresponding message for starting from a checkpoint
> > found in the control file?  If you see that in a problem report,
> > you immediately know what is going on.
>
> Maybe - the reason I hesitate on that is that emitting an additional log
> message when starting from a base backup just adds something "once once the
> lifetime of a node". Whereas emitting something every start obviously doesn't
> impose any limit.

The message should only be shown if PostgreSQL replays WAL, that is,
after a crash.  That would (hopefully) make it a rare message too.

Yours,
Laurenz Albe



Re: Use of backup_label not noted in log

От
David Steele
Дата:
On 11/20/23 23:54, Michael Paquier wrote:
> On Mon, Nov 20, 2023 at 03:31:20PM -0400, David Steele wrote:
> 
>> I still wonder if we need "base backup" in the messages? That sort of
>> implies (at least to me) you used pg_basebackup but that may not be the
>> case.
> 
> Or just s/base backup/backup/?

That's what I meant but did not explain very well.

Regards,
-David



Re: Use of backup_label not noted in log

От
David Steele
Дата:
On 11/20/23 15:08, David Steele wrote:
> On 11/20/23 14:27, Andres Freund wrote:
> 
>>>> I've wondered whether it's worth also adding an explicit message 
>>>> just after
>>>> ReachedEndOfBackup(), but it seems far less urgent due to the existing
>>>> "consistent recovery state reached at %X/%X" message.
>>>
>>> I think the current message is sufficient, but what do you have in mind?
>>
>> Well, the consistency message is emitted after every restart. Whereas 
>> a single
>> instance only should go through backup recovery once. So it seems 
>> worthwhile
>> to differentiate the two in log messages.
> 
> Ah, right. That works for me, then.

Any status on this patch? If we do back patch it would be nice to see 
this in the upcoming minor releases. I'm in favor of a back patch, as I 
think this is minimally invasive and would be very useful for debugging 
recovery issues.

I like the phrasing you demonstrated in [1] but doesn't seem like 
there's a new patch for that, so I have attached one.

Happy to do whatever else I can to get this across the line.

Regards,
-David

---

[1] 
https://www.postgresql.org/message-id/20231120183633.c4lhoq4hld4u56dd%40awork3.anarazel.de
Вложения

Re: Use of backup_label not noted in log

От
Michael Paquier
Дата:
On Fri, Jan 19, 2024 at 09:32:26AM -0400, David Steele wrote:
> Any status on this patch? If we do back patch it would be nice to see this
> in the upcoming minor releases. I'm in favor of a back patch, as I think
> this is minimally invasive and would be very useful for debugging recovery
> issues.

I am not sure about the backpatch part, but on a second look I'm OK
with applying it on HEAD for now with the LOG added for the startup of
recovery when the backup_label file is read, for the recovery
completed from a backup, and for the restart from a backup.

> I like the phrasing you demonstrated in [1] but doesn't seem like there's a
> new patch for that, so I have attached one.

+       if (ControlFile->backupStartPoint != InvalidXLogRecPtr)

Nit 1: I would use XLogRecPtrIsInvalid here.

+       ereport(LOG,
+               (errmsg("completed backup recovery with redo LSN %X/%X",
+                       LSN_FORMAT_ARGS(oldBackupStartPoint))));

Nit 2: How about adding backupEndPoint in this LOG?  That would give:
"completed backup recovery with redo LSN %X/%X and end LSN %X/%X".
--
Michael

Вложения

Re: Use of backup_label not noted in log

От
Michael Paquier
Дата:
On Mon, Jan 22, 2024 at 04:36:27PM +0900, Michael Paquier wrote:
> +       if (ControlFile->backupStartPoint != InvalidXLogRecPtr)
>
> Nit 1: I would use XLogRecPtrIsInvalid here.
>
> +       ereport(LOG,
> +               (errmsg("completed backup recovery with redo LSN %X/%X",
> +                       LSN_FORMAT_ARGS(oldBackupStartPoint))));
>
> Nit 2: How about adding backupEndPoint in this LOG?  That would give:
> "completed backup recovery with redo LSN %X/%X and end LSN %X/%X".

Hearing nothing, I've just applied a version of the patch with these
two modifications on HEAD.  If this needs tweaks, just let me know.
--
Michael

Вложения

Re: Use of backup_label not noted in log

От
David Steele
Дата:
On 1/25/24 04:12, Michael Paquier wrote:
> On Mon, Jan 22, 2024 at 04:36:27PM +0900, Michael Paquier wrote:
>> +       if (ControlFile->backupStartPoint != InvalidXLogRecPtr)
>>
>> Nit 1: I would use XLogRecPtrIsInvalid here.
>>
>> +       ereport(LOG,
>> +               (errmsg("completed backup recovery with redo LSN %X/%X",
>> +                       LSN_FORMAT_ARGS(oldBackupStartPoint))));
>>
>> Nit 2: How about adding backupEndPoint in this LOG?  That would give:
>> "completed backup recovery with redo LSN %X/%X and end LSN %X/%X".
> 
> Hearing nothing, I've just applied a version of the patch with these
> two modifications on HEAD.  If this needs tweaks, just let me know.

I had planned to update the patch this morning -- so thanks for doing 
that. I think having the end point in the message makes perfect sense.

I would still advocate for a back patch here. It is frustrating to get 
logs from users that just say:

LOG:  invalid checkpoint record
PANIC:  could not locate a valid checkpoint record

It would be very helpful to know what the checkpoint record LSN was in 
this case.

Regards,
-David



Re: Use of backup_label not noted in log

От
Michael Banck
Дата:
Hi,

On Thu, Jan 25, 2024 at 08:56:52AM -0400, David Steele wrote:
> I would still advocate for a back patch here. It is frustrating to get logs
> from users that just say:
> 
> LOG:  invalid checkpoint record
> PANIC:  could not locate a valid checkpoint record
> 
> It would be very helpful to know what the checkpoint record LSN was in this
> case.

I agree.


Michael



Re: Use of backup_label not noted in log

От
David Steele
Дата:
On 1/25/24 09:29, Michael Banck wrote:
> Hi,
> 
> On Thu, Jan 25, 2024 at 08:56:52AM -0400, David Steele wrote:
>> I would still advocate for a back patch here. It is frustrating to get logs
>> from users that just say:
>>
>> LOG:  invalid checkpoint record
>> PANIC:  could not locate a valid checkpoint record
>>
>> It would be very helpful to know what the checkpoint record LSN was in this
>> case.
> 
> I agree.

Another thing to note here -- knowing the LSN is important but also 
knowing that backup recovery was attempted (i.e. backup_label exists) is 
really crucial. Knowing both just saves so much time in back and forth 
debugging.

It appears the tally for back patching is:

For: Andres, David, Michael B
Not Sure: Robert, Laurenz, Michael P

It seems at least nobody is dead set against it.

Regards,
-David



Re: Use of backup_label not noted in log

От
Tom Lane
Дата:
David Steele <david@pgmasters.net> writes:
> Another thing to note here -- knowing the LSN is important but also 
> knowing that backup recovery was attempted (i.e. backup_label exists) is 
> really crucial. Knowing both just saves so much time in back and forth 
> debugging.

> It appears the tally for back patching is:

> For: Andres, David, Michael B
> Not Sure: Robert, Laurenz, Michael P

> It seems at least nobody is dead set against it.

We're talking about 1d35f705e, right?  That certainly looks harmless
and potentially useful.  I'm +1 for back-patching.

            regards, tom lane



Re: Use of backup_label not noted in log

От
David Steele
Дата:
On 1/25/24 17:42, Tom Lane wrote:
> David Steele <david@pgmasters.net> writes:
>> Another thing to note here -- knowing the LSN is important but also
>> knowing that backup recovery was attempted (i.e. backup_label exists) is
>> really crucial. Knowing both just saves so much time in back and forth
>> debugging.
> 
>> It appears the tally for back patching is:
> 
>> For: Andres, David, Michael B
>> Not Sure: Robert, Laurenz, Michael P
> 
>> It seems at least nobody is dead set against it.
> 
> We're talking about 1d35f705e, right?  That certainly looks harmless
> and potentially useful.  I'm +1 for back-patching.

That's the one. If we were modifying existing messages I would be 
against it, but new, infrequent (but oh so helpful) messages seem fine.

Regards,
-David



Re: Use of backup_label not noted in log

От
Michael Paquier
Дата:
On Thu, Jan 25, 2024 at 08:56:52AM -0400, David Steele wrote:
> I would still advocate for a back patch here. It is frustrating to get logs
> from users that just say:
>
> LOG:  invalid checkpoint record
> PANIC:  could not locate a valid checkpoint record
>
> It would be very helpful to know what the checkpoint record LSN was in this
> case.

Yes, I've pested over this one in the past when debugging corruption
issues.  To me, this would just mean to appens to the PANIC an "at
%X/%X", but perhaps you have more in mind for these code paths?
--
Michael

Вложения

Re: Use of backup_label not noted in log

От
Michael Paquier
Дата:
On Thu, Jan 25, 2024 at 05:44:52PM -0400, David Steele wrote:
> On 1/25/24 17:42, Tom Lane wrote:
>> We're talking about 1d35f705e, right?  That certainly looks harmless
>> and potentially useful.  I'm +1 for back-patching.
>
> That's the one. If we were modifying existing messages I would be against
> it, but new, infrequent (but oh so helpful) messages seem fine.

Well, I'm OK with this consensus on 1d35f705e if folks think this is
useful enough for all the stable branches.
--
Michael

Вложения

Re: Use of backup_label not noted in log

От
David Steele
Дата:

On 1/25/24 20:52, Michael Paquier wrote:
> On Thu, Jan 25, 2024 at 08:56:52AM -0400, David Steele wrote:
>> I would still advocate for a back patch here. It is frustrating to get logs
>> from users that just say:
>>
>> LOG:  invalid checkpoint record
>> PANIC:  could not locate a valid checkpoint record
>>
>> It would be very helpful to know what the checkpoint record LSN was in this
>> case.
> 
> Yes, I've pested over this one in the past when debugging corruption
> issues.  To me, this would just mean to appens to the PANIC an "at
> %X/%X", but perhaps you have more in mind for these code paths?

I think adding the LSN to the panic message would be a good change for HEAD.

However, that still would not take the place of the additional messages 
in 1d35f705e showing that the LSN came from a backup_label.

Regards,
-David



Re: Use of backup_label not noted in log

От
Michael Paquier
Дата:
On Fri, Jan 26, 2024 at 12:08:46PM +0900, Michael Paquier wrote:
> Well, I'm OK with this consensus on 1d35f705e if folks think this is
> useful enough for all the stable branches.

I have done that down to REL_15_STABLE for now as this is able to
apply cleanly there.  Older branches have a lack of information here,
actually, because read_backup_label() does not return the TLI
retrieved from the start WAL segment, so we don't have the whole
package of information.
--
Michael

Вложения

Re: Use of backup_label not noted in log

От
David Steele
Дата:
On 1/28/24 20:09, Michael Paquier wrote:
> On Fri, Jan 26, 2024 at 12:08:46PM +0900, Michael Paquier wrote:
>> Well, I'm OK with this consensus on 1d35f705e if folks think this is
>> useful enough for all the stable branches.
> 
> I have done that down to REL_15_STABLE for now as this is able to
> apply cleanly there.  Older branches have a lack of information here,
> actually, because read_backup_label() does not return the TLI
> retrieved from the start WAL segment, so we don't have the whole
> package of information.

I took a pass at this on PG14 and things definitely look a lot different 
back there. Not only is the timeline missing, but there are two sections 
of code for ending a backup, one for standby backup and one for primary.

I'm satisfied with the back patches as they stand, unless anyone else 
wants to have a look.

Regards,
-David



Re: Use of backup_label not noted in log

От
Michael Paquier
Дата:
On Mon, Jan 29, 2024 at 10:03:19AM -0400, David Steele wrote:
> I took a pass at this on PG14 and things definitely look a lot different
> back there. Not only is the timeline missing, but there are two sections of
> code for ending a backup, one for standby backup and one for primary.

Unfortunately.  The TLI from the start WAL segment lacking from these
APIs is really annoying especially if the backup_label is gone for
some reason..

> I'm satisfied with the back patches as they stand, unless anyone else wants
> to have a look.

Okay, thanks for double-checking!
--
Michael

Вложения