Обсуждение: Propagate XLogFindNextRecord error to callers

Поиск
Список
Период
Сортировка

Propagate XLogFindNextRecord error to callers

От
Anthonin Bonnefoy
Дата:
Hi,

Currently, XLogFindNextRecord errormsg is ignored and callers will only output a generic 'could not find a valid record' message without details.
Additionally, invalid page header won't go through XLogReadRecord, leaving the error in state->errormsg_buf.

This patch propagates XLogFindNextRecord's error message to the callers and displays it. In case of an invalid page header, the errormsg is filled with errormsg_buf content.

With this patch, pg_waldump will now have the following output when reading a file with an invalid header:
pg_waldump: error: could not find a valid record after D80/5C000000: invalid magic number D116 in WAL segment 0000001400000D8000000017, LSN D80/5C000000, offset 0 

Regards,
Anthonin Bonnefoy
Вложения

Re: Propagate XLogFindNextRecord error to callers

От
Mircea Cadariu
Дата:
Hi Anthonin,

On 12/12/2025 10:39, Anthonin Bonnefoy wrote:
>
> With this patch, pg_waldump will now have the following output when 
> reading a file with an invalid header:
> pg_waldump: error: could not find a valid record after D80/5C000000: 
> invalid magic number D116 in WAL segment 0000001400000D8000000017, LSN 
> D80/5C000000, offset 0

I've picked up the review for your patch.

Attached is a failing test that reproduces the issue. Have I got it 
right? We can consider using it to validate your patch then.

-- 
Thanks,
Mircea Cadariu

Вложения

Re: Propagate XLogFindNextRecord error to callers

От
Anthonin Bonnefoy
Дата:
On Tue, Feb 3, 2026 at 5:05 PM Mircea Cadariu <cadariu.mircea@gmail.com> wrote:
> I've picked up the review for your patch.

Thanks for picking it!

> Attached is a failing test that reproduces the issue. Have I got it
> right? We can consider using it to validate your patch then.

Yeah, that's the gist of it. However, the test you've written will
only work on little endian architectures. Also, I think the xlog page
header size won't have the 4 bytes padding on 32 bits systems.

I've added a similar test in 001_basic.pl, but it relies on copying an
existing WAL file and setting the WAL magic to 0000. This way, the
result will be the same independent of the endianness and memory
padding.

Regards,
Anthonin Bonnefoy

Вложения

Re: Propagate XLogFindNextRecord error to callers

От
Japin Li
Дата:
On Mon, 09 Feb 2026 at 10:21, Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com> wrote:
> On Tue, Feb 3, 2026 at 5:05 PM Mircea Cadariu <cadariu.mircea@gmail.com> wrote:
>> I've picked up the review for your patch.
>
> Thanks for picking it!
>
>> Attached is a failing test that reproduces the issue. Have I got it
>> right? We can consider using it to validate your patch then.
>
> Yeah, that's the gist of it. However, the test you've written will
> only work on little endian architectures. Also, I think the xlog page
> header size won't have the 4 bytes padding on 32 bits systems.
>
> I've added a similar test in 001_basic.pl, but it relies on copying an
> existing WAL file and setting the WAL magic to 0000. This way, the
> result will be the same independent of the endianness and memory
> padding.
>

+    {
+        if (errormsg)
+            pg_fatal("could not find a valid record after %X/%X: %s",
+                     LSN_FORMAT_ARGS(private.startptr), errormsg);
+        else
+            pg_fatal("could not find a valid record after %X/%X",
+                     LSN_FORMAT_ARGS(private.startptr));

For consistency, this should use the %X/%08X format as elsewhere.

> Regards,
> Anthonin Bonnefoy
>
> [2. text/x-diff; v2-0001-Propage-errormsg-to-XLogFindNextRecord-caller.patch]...

--
Regards,
Japin Li
ChengDu WenWu Information Technology Co., Ltd.



Re: Propagate XLogFindNextRecord error to callers

От
Anthonin Bonnefoy
Дата:
On Mon, Feb 9, 2026 at 10:36 AM Japin Li <japinli@hotmail.com> wrote:
> For consistency, this should use the %X/%08X format as elsewhere.

Good catch, I've updated the patch.

Вложения

Re: Propagate XLogFindNextRecord error to callers

От
Mircea Cadariu
Дата:

Hi Anthonin, 

Thanks for the updated patch. 

I have noticed this code added in XLogFindNextRecord in the patch, appears also in XLogNextRecord (line 334).

+	if (state->errormsg_deferred)
+	{
+		if (state->errormsg_buf[0] != '\0')
+			*errormsg = state->errormsg_buf;
+		state->errormsg_deferred = false;
+	}
+

In XLogNextRecord, right before the above code, we do *errormsg = NULL. Should this be done also in XLogFindNextRecord in the patch? 

If so, what about even extracting a helper method which will be called from both places? 

A nit for the commit message: Propage -> Propagate

-- 
Thanks,
Mircea Cadariu

Re: Propagate XLogFindNextRecord error to callers

От
Anthonin Bonnefoy
Дата:
On Mon, Feb 9, 2026 at 1:15 PM Mircea Cadariu <cadariu.mircea@gmail.com> wrote:
> In XLogNextRecord, right before the above code, we do *errormsg = NULL. Should this be done also in
XLogFindNextRecordin the patch? 
>
> If so, what about even extracting a helper method which will be called from both places?

XLogReadRecord may already have consumed errormsg_deferred and set
errormsg. We can't set it to NULL, or that would erase a valid error
message.

A simplification would have been to remove processing the deferred
error processing from XLogNextRecord and leave it to
XLogFindNextRecord, but there are some calls to XLogNextRecord (like
in xlogprefetcher.c), so being able to get the error message when
calling XLogNextRecord is necessary.

> A nit for the commit message: Propage -> Propagate

Fixed

Вложения

Re: Propagate XLogFindNextRecord error to callers

От
Mircea Cadariu
Дата:
On 11/02/2026 10:43, Anthonin Bonnefoy wrote:

> XLogReadRecord may already have consumed errormsg_deferred and set
> errormsg. We can't set it to NULL, or that would erase a valid error
Indeed. Should we place this initialisation at the top of 
XLogFindNextRecord, before any processing? At that point, there's 
nothing to erase.

> +    /*
> +     * we may have reported errors due to invalid WAL header, propagate the
> +     * error message to the caller.
> +     */

You can consider capitalising.

-- 
Thanks,
Mircea Cadariu




Re: Propagate XLogFindNextRecord error to callers

От
Anthonin Bonnefoy
Дата:
On Wed, Feb 11, 2026 at 1:16 PM Mircea Cadariu <cadariu.mircea@gmail.com> wrote:
> Indeed. Should we place this initialisation at the top of
> XLogFindNextRecord, before any processing? At that point, there's
> nothing to erase.

That makes sense, I've added the '*errormsg = NULL' at the top of the function.

> You can consider capitalising.

Done

Вложения

Re: Propagate XLogFindNextRecord error to callers

От
Mircea Cadariu
Дата:

Hi,

On 12/02/2026 09:43, Anthonin Bonnefoy wrote:
Done

Thanks for the updated patch. 

It applies cleanly on HEAD and the tests pass locally, as well as in CI. 

I've set the patch entry to 'Ready for Committer'. 

-- 
Thanks,
Mircea Cadariu

Re: Propagate XLogFindNextRecord error to callers

От
Chao Li
Дата:

> On Feb 12, 2026, at 15:43, Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com> wrote:
>
> On Wed, Feb 11, 2026 at 1:16 PM Mircea Cadariu <cadariu.mircea@gmail.com> wrote:
>> Indeed. Should we place this initialisation at the top of
>> XLogFindNextRecord, before any processing? At that point, there's
>> nothing to erase.
>
> That makes sense, I've added the '*errormsg = NULL' at the top of the function.
>
>> You can consider capitalising.
>
> Done
> <v5-0001-Propagate-errormsg-to-XLogFindNextRecord-caller.patch>

Hi Anthonin,

Thanks for the patch. I agree it’s useful to print a more detailed error message instead of the generic one.

From a design perspective, I’m not sure we need to add a new errormsg parameter to XLogFindNextRecord(). The new
parameterultimately just exposes state->errormsg_buf, so the returned errormsg implicitly depends on the lifetime of
state,and we also need extra handling for cases like errormsg == NULL. 

Instead, perhaps we could add a helper function, say XLogReaderGetLastError(XLogReaderState *state). which internally
pstrdup()sstate->errormsg_buf (after checking errormsg_deferred, etc.). That way the caller owns the returned string
explicitly,and there’s no hidden dependency on the reader state’s lifetime. 

This would also avoid changing the XLogFindNextRecord() signature while making the ownership semantics clearer.

What do you think?

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/







Re: Propagate XLogFindNextRecord error to callers

От
Anthonin Bonnefoy
Дата:
Thanks for the comments!

On Tue, Feb 24, 2026 at 5:00 AM Chao Li <li.evan.chao@gmail.com> wrote:
> From a design perspective, I’m not sure we need to add a new errormsg parameter to XLogFindNextRecord(). The new
parameterultimately just exposes state->errormsg_buf, so the returned errormsg implicitly depends on the lifetime of
state,and we also need extra handling for cases like errormsg == NULL. 
>
> Instead, perhaps we could add a helper function, say XLogReaderGetLastError(XLogReaderState *state). which internally
pstrdup()sstate->errormsg_buf (after checking errormsg_deferred, etc.). That way the caller owns the returned string
explicitly,and there’s no hidden dependency on the reader state’s lifetime. 
>
> This would also avoid changing the XLogFindNextRecord() signature while making the ownership semantics clearer.

One issue I see is that it introduces another way to get the error,
with XLogReadRecord and XLogNextRecord using an errormsg parameter,
and XLogFindNextRecord using the helper function. Maybe the solution
would be to change both XLogReadRecord and XLogNextRecord to use this
new function to stay consistent, but that means changing their
signatures.

Also, I see the errormsg parameter as a way to signal the caller that
"this function can fail, the detailed error will be available here".
With the XLogReaderGetLastError, it becomes the caller's
responsibility to know which function may fill the error message and
check it accordingly.

The error message is likely printed shortly after the function's call,
so I suspect the risk of using the errormsg after its intended
lifetime is low.

You bring up a good point about the errormsg's lifetime, which is
definitely something to mention in the function's comments. I've
updated the patch with the additional comments.

Regards,
Anthonin Bonnefoy

Вложения