Обсуждение: Add missing error codes to PANIC/FATAL error reports in xlogrecovery

Поиск
Список
Период
Сортировка

Add missing error codes to PANIC/FATAL error reports in xlogrecovery

От
Krishnakumar R
Дата:
Hi,

Please find a patch attached which adds missing sql error code in
error reports which are FATAL or PANIC, in xlogrecovery.
This will help with deducing patterns when looking at error reports
from multiple postgres instances.

--
Thanks and Regards,
Krishnakumar (KK).
[Microsoft]

Вложения

Re: Add missing error codes to PANIC/FATAL error reports in xlogrecovery

От
Andres Freund
Дата:
Hi,

On 2023-11-30 10:54:12 -0800, Krishnakumar R wrote:
> diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
> index c61566666a..2f50928e7e 100644
> --- a/src/backend/access/transam/xlogrecovery.c
> +++ b/src/backend/access/transam/xlogrecovery.c
> @@ -630,7 +630,8 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
>                  if (!ReadRecord(xlogprefetcher, LOG, false,
>                                  checkPoint.ThisTimeLineID))
>                      ereport(FATAL,
> -                            (errmsg("could not find redo location referenced by checkpoint record"),
> +                            (errcode(ERRCODE_DATA_CORRUPTED),
> +                             errmsg("could not find redo location referenced by checkpoint record"),
>                               errhint("If you are restoring from a backup, touch \"%s/recovery.signal\" or
\"%s/standby.signal\"and add required recovery options.\n"
 
>                                       "If you are not restoring from a backup, try removing the file
\"%s/backup_label\".\n"
>                                       "Be careful: removing \"%s/backup_label\" will result in a corrupt cluster if
restoringfrom a backup.",
 

Wondering if we should add a ERRCODE_CLUSTER_CORRUPTED for cases like this. We
have ERRCODE_DATA_CORRUPTED and ERRCODE_INDEX_CORRUPTED, which make
ERRCODE_DATA_CORRUPTED feel a bit too specific in this kind of situation?

OTOH, just having anything other than ERRCODE_INTERNAL_ERROR is better.


> @@ -640,7 +641,8 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
>          else
>          {
>              ereport(FATAL,
> -                    (errmsg("could not locate required checkpoint record"),
> +                    (errcode(ERRCODE_DATA_CORRUPTED),
> +                     errmsg("could not locate required checkpoint record"),
>                       errhint("If you are restoring from a backup, touch \"%s/recovery.signal\" or
\"%s/standby.signal\"and add required recovery options.\n"
 
>                               "If you are not restoring from a backup, try removing the file \"%s/backup_label\".\n"
>                               "Be careful: removing \"%s/backup_label\" will result in a corrupt cluster if restoring
froma backup.",
 

Another aside: Isn't the hint here obsolete since we've removed exclusive
backups? I can't think of any scenario now where removing backup_label would
be correct in a non-exclusive backup.


> @@ -817,7 +820,8 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
>           */
>          switchpoint = tliSwitchPoint(ControlFile->checkPointCopy.ThisTimeLineID, expectedTLEs, NULL);
>          ereport(FATAL,
> -                (errmsg("requested timeline %u is not a child of this server's history",
> +                (errcode(ERRCODE_DATA_CORRUPTED),
> +                 errmsg("requested timeline %u is not a child of this server's history",
>                          recoveryTargetTLI),
>                   errdetail("Latest checkpoint is at %X/%X on timeline %u, but in the history of the requested
timeline,the server forked off from that timeline at %X/%X.",
 
>                             LSN_FORMAT_ARGS(ControlFile->checkPoint),

Hm, this one arguably is not corruption, but we still cannot
continue. ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE or maybe a new error code?

Greetings,

Andres Freund



Re: Add missing error codes to PANIC/FATAL error reports in xlogrecovery

От
Robert Haas
Дата:
On Thu, Nov 30, 2023 at 2:47 PM Andres Freund <andres@anarazel.de> wrote:
> Another aside: Isn't the hint here obsolete since we've removed exclusive
> backups? I can't think of any scenario now where removing backup_label would
> be correct in a non-exclusive backup.

That's an extremely good point.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Add missing error codes to PANIC/FATAL error reports in xlogrecovery

От
Tom Lane
Дата:
Andres Freund <andres@anarazel.de> writes:
> Wondering if we should add a ERRCODE_CLUSTER_CORRUPTED for cases like this. We
> have ERRCODE_DATA_CORRUPTED and ERRCODE_INDEX_CORRUPTED, which make
> ERRCODE_DATA_CORRUPTED feel a bit too specific in this kind of situation?

Maybe.  We didn't officially define DATA_CORRUPTED as referring to
table data, but given the existence of INDEX_CORRUPTED maybe we
should treat it as that.  In any case ...

> Hm, this one arguably is not corruption, but we still cannot
> continue. ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE or maybe a new error code?

... I don't really like turning a whole bunch of error cases into
the same error code without some closer analysis.  I think you
are right that these need a bit more case-by-case thought.

            regards, tom lane



Re: Add missing error codes to PANIC/FATAL error reports in xlogrecovery

От
Andres Freund
Дата:
Hi,

On 2023-11-30 16:02:55 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Wondering if we should add a ERRCODE_CLUSTER_CORRUPTED for cases like this. We
> > have ERRCODE_DATA_CORRUPTED and ERRCODE_INDEX_CORRUPTED, which make
> > ERRCODE_DATA_CORRUPTED feel a bit too specific in this kind of situation?
> 
> Maybe.  We didn't officially define DATA_CORRUPTED as referring to
> table data, but given the existence of INDEX_CORRUPTED maybe we
> should treat it as that.

I'm on the fence about it. Certainly DATA_CORRUPTED would be more appropriate
than INTERNAL_ERROR.


> > Hm, this one arguably is not corruption, but we still cannot
> > continue. ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE or maybe a new error code?
> 
> ... I don't really like turning a whole bunch of error cases into
> the same error code without some closer analysis.

Other than this instance, they all indicate that the cluster is toast in some
way or another. So *_CORRUPTED seems appropriate. And even this instance would
be better off as _CORRUPTED than as INTERNAL_ERROR. There's so many of the
latter that you can't realistically alert on them occurring.

I don't like my idea of ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE much, that's
not something you realistically can alert on, and this error certainly is an
instance of "you're screwed until you manually intervene".

Greetings,

Andres Freund



Re: Add missing error codes to PANIC/FATAL error reports in xlogrecovery

От
Krishnakumar R
Дата:
Hi,

Updated the patch with ERRCODE_CLUSTER_CORRUPTED & kept
ERRCODE_DATA_CORRUPTED when recovery is not consistent.

> > > Hm, this one arguably is not corruption, but we still cannot
> > > continue. ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE or maybe a new error code?

Added a ERRCODE_TIMELINE_INCONSISTENT to be specific about the
scenarios with timeline mismatches. Thoughts ?

>> Another aside: Isn't the hint here obsolete since we've removed exclusive
backups? I can't think of any scenario now where removing backup_label would
be correct in a non-exclusive backup.

Attached another patch which applies on top of the first patch to
remove the obsolete hint.

- KK

Вложения