Обсуждение: Propagate XLogFindNextRecord error to callers

Поиск

Список

Период

Сортировка

Propagate XLogFindNextRecord error to callers

От

Anthonin Bonnefoy

Дата:

12 декабря 2025 г., 13:39:50

Hi,

Currently, XLogFindNextRecord errormsg is ignored and callers will only output a generic 'could not find a valid record' message without details.

Additionally, invalid page header won't go through XLogReadRecord, leaving the error in state->errormsg_buf.

This patch propagates XLogFindNextRecord's error message to the callers and displays it. In case of an invalid page header, the errormsg is filled with errormsg_buf content.

With this patch, pg_waldump will now have the following output when reading a file with an invalid header:

pg_waldump: error: could not find a valid record after D80/5C000000: invalid magic number D116 in WAL segment 0000001400000D8000000017, LSN D80/5C000000, offset 0

Regards,

Anthonin Bonnefoy

Вложения

v1-0001-Propage-errormsg-to-XLogFindNextRecord-caller.patch

Re: Propagate XLogFindNextRecord error to callers

От

Mircea Cadariu

Дата:

03 февраля, 19:05:40

Hi Anthonin,

On 12/12/2025 10:39, Anthonin Bonnefoy wrote:
>
> With this patch, pg_waldump will now have the following output when 
> reading a file with an invalid header:
> pg_waldump: error: could not find a valid record after D80/5C000000: 
> invalid magic number D116 in WAL segment 0000001400000D8000000017, LSN 
> D80/5C000000, offset 0

I've picked up the review for your patch.

Attached is a failing test that reproduces the issue. Have I got it 
right? We can consider using it to validate your patch then.

-- 
Thanks,
Mircea Cadariu

Вложения

v1-0001-Add-pg_waldump-test-for-invalid-WAL-page-magic-er.patch

Re: Propagate XLogFindNextRecord error to callers

От

Anthonin Bonnefoy

Дата:

09 февраля, 12:21:59

On Tue, Feb 3, 2026 at 5:05 PM Mircea Cadariu <cadariu.mircea@gmail.com> wrote:
> I've picked up the review for your patch.

Thanks for picking it!

> Attached is a failing test that reproduces the issue. Have I got it
> right? We can consider using it to validate your patch then.

Yeah, that's the gist of it. However, the test you've written will
only work on little endian architectures. Also, I think the xlog page
header size won't have the 4 bytes padding on 32 bits systems.

I've added a similar test in 001_basic.pl, but it relies on copying an
existing WAL file and setting the WAL magic to 0000. This way, the
result will be the same independent of the endianness and memory
padding.

Regards,
Anthonin Bonnefoy

Вложения

v2-0001-Propage-errormsg-to-XLogFindNextRecord-caller.patch

Re: Propagate XLogFindNextRecord error to callers

От

Japin Li

Дата:

09 февраля, 12:36:19

On Mon, 09 Feb 2026 at 10:21, Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com> wrote:
> On Tue, Feb 3, 2026 at 5:05 PM Mircea Cadariu <cadariu.mircea@gmail.com> wrote:
>> I've picked up the review for your patch.
>
> Thanks for picking it!
>
>> Attached is a failing test that reproduces the issue. Have I got it
>> right? We can consider using it to validate your patch then.
>
> Yeah, that's the gist of it. However, the test you've written will
> only work on little endian architectures. Also, I think the xlog page
> header size won't have the 4 bytes padding on 32 bits systems.
>
> I've added a similar test in 001_basic.pl, but it relies on copying an
> existing WAL file and setting the WAL magic to 0000. This way, the
> result will be the same independent of the endianness and memory
> padding.
>

+    {
+        if (errormsg)
+            pg_fatal("could not find a valid record after %X/%X: %s",
+                     LSN_FORMAT_ARGS(private.startptr), errormsg);
+        else
+            pg_fatal("could not find a valid record after %X/%X",
+                     LSN_FORMAT_ARGS(private.startptr));

For consistency, this should use the %X/%08X format as elsewhere.

> Regards,
> Anthonin Bonnefoy
>
> [2. text/x-diff; v2-0001-Propage-errormsg-to-XLogFindNextRecord-caller.patch]...

--
Regards,
Japin Li
ChengDu WenWu Information Technology Co., Ltd.

Re: Propagate XLogFindNextRecord error to callers

От

Anthonin Bonnefoy

Дата:

09 февраля, 13:06:30

On Mon, Feb 9, 2026 at 10:36 AM Japin Li <japinli@hotmail.com> wrote:
> For consistency, this should use the %X/%08X format as elsewhere.

Good catch, I've updated the patch.

Вложения

v3-0001-Propage-errormsg-to-XLogFindNextRecord-caller.patch

Re: Propagate XLogFindNextRecord error to callers

От

Mircea Cadariu

Дата:

09 февраля, 15:15:34

Hi Anthonin,

Thanks for the updated patch.

I have noticed this code added in XLogFindNextRecord in the patch, appears also in XLogNextRecord (line 334).

+	if (state->errormsg_deferred)
+	{
+		if (state->errormsg_buf[0] != '\0')
+			*errormsg = state->errormsg_buf;
+		state->errormsg_deferred = false;
+	}
+

In XLogNextRecord, right before the above code, we do *errormsg = NULL. Should this be done also in XLogFindNextRecord in the patch?

If so, what about even extracting a helper method which will be called from both places?

A nit for the commit message: Propage -> Propagate

-- 
Thanks,
Mircea Cadariu

Re: Propagate XLogFindNextRecord error to callers

От

Anthonin Bonnefoy

Дата:

11 февраля, 11:43:51

On Mon, Feb 9, 2026 at 1:15 PM Mircea Cadariu <cadariu.mircea@gmail.com> wrote:
> In XLogNextRecord, right before the above code, we do *errormsg = NULL. Should this be done also in
XLogFindNextRecordin the patch? 
>
> If so, what about even extracting a helper method which will be called from both places?

XLogReadRecord may already have consumed errormsg_deferred and set
errormsg. We can't set it to NULL, or that would erase a valid error
message.

A simplification would have been to remove processing the deferred
error processing from XLogNextRecord and leave it to
XLogFindNextRecord, but there are some calls to XLogNextRecord (like
in xlogprefetcher.c), so being able to get the error message when
calling XLogNextRecord is necessary.

> A nit for the commit message: Propage -> Propagate

Fixed

Вложения

v4-0001-Propagate-errormsg-to-XLogFindNextRecord-caller.patch

Re: Propagate XLogFindNextRecord error to callers

От

Mircea Cadariu

Дата:

11 февраля, 15:15:56

On 11/02/2026 10:43, Anthonin Bonnefoy wrote:

> XLogReadRecord may already have consumed errormsg_deferred and set
> errormsg. We can't set it to NULL, or that would erase a valid error
Indeed. Should we place this initialisation at the top of 
XLogFindNextRecord, before any processing? At that point, there's 
nothing to erase.

> +    /*
> +     * we may have reported errors due to invalid WAL header, propagate the
> +     * error message to the caller.
> +     */

You can consider capitalising.

-- 
Thanks,
Mircea Cadariu

Re: Propagate XLogFindNextRecord error to callers

От

Anthonin Bonnefoy

Дата:

12 февраля, 10:43:03

On Wed, Feb 11, 2026 at 1:16 PM Mircea Cadariu <cadariu.mircea@gmail.com> wrote:
> Indeed. Should we place this initialisation at the top of
> XLogFindNextRecord, before any processing? At that point, there's
> nothing to erase.

That makes sense, I've added the '*errormsg = NULL' at the top of the function.

> You can consider capitalising.

Done

Вложения

v5-0001-Propagate-errormsg-to-XLogFindNextRecord-caller.patch

Re: Propagate XLogFindNextRecord error to callers

От

Mircea Cadariu

Дата:

17 февраля, 22:48:18

Hi,

On 12/02/2026 09:43, Anthonin Bonnefoy wrote:

Done

Thanks for the updated patch.

It applies cleanly on HEAD and the tests pass locally, as well as in CI.

I've set the patch entry to 'Ready for Committer'.

-- 
Thanks,
Mircea Cadariu

Re: Propagate XLogFindNextRecord error to callers

От

Chao Li

Дата:

24 февраля, 07:00:05

> On Feb 12, 2026, at 15:43, Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com> wrote:
>
> On Wed, Feb 11, 2026 at 1:16 PM Mircea Cadariu <cadariu.mircea@gmail.com> wrote:
>> Indeed. Should we place this initialisation at the top of
>> XLogFindNextRecord, before any processing? At that point, there's
>> nothing to erase.
>
> That makes sense, I've added the '*errormsg = NULL' at the top of the function.
>
>> You can consider capitalising.
>
> Done
> <v5-0001-Propagate-errormsg-to-XLogFindNextRecord-caller.patch>

Hi Anthonin,

Thanks for the patch. I agree it’s useful to print a more detailed error message instead of the generic one.

From a design perspective, I’m not sure we need to add a new errormsg parameter to XLogFindNextRecord(). The new
parameterultimately just exposes state->errormsg_buf, so the returned errormsg implicitly depends on the lifetime of
state,and we also need extra handling for cases like errormsg == NULL. 

Instead, perhaps we could add a helper function, say XLogReaderGetLastError(XLogReaderState *state). which internally
pstrdup()sstate->errormsg_buf (after checking errormsg_deferred, etc.). That way the caller owns the returned string
explicitly,and there’s no hidden dependency on the reader state’s lifetime. 

This would also avoid changing the XLogFindNextRecord() signature while making the ownership semantics clearer.

What do you think?

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

Re: Propagate XLogFindNextRecord error to callers

От

Anthonin Bonnefoy

Дата:

26 февраля, 11:19:09

Thanks for the comments!

On Tue, Feb 24, 2026 at 5:00 AM Chao Li <li.evan.chao@gmail.com> wrote:
> From a design perspective, I’m not sure we need to add a new errormsg parameter to XLogFindNextRecord(). The new
parameterultimately just exposes state->errormsg_buf, so the returned errormsg implicitly depends on the lifetime of
state,and we also need extra handling for cases like errormsg == NULL. 
>
> Instead, perhaps we could add a helper function, say XLogReaderGetLastError(XLogReaderState *state). which internally
pstrdup()sstate->errormsg_buf (after checking errormsg_deferred, etc.). That way the caller owns the returned string
explicitly,and there’s no hidden dependency on the reader state’s lifetime. 
>
> This would also avoid changing the XLogFindNextRecord() signature while making the ownership semantics clearer.

One issue I see is that it introduces another way to get the error,
with XLogReadRecord and XLogNextRecord using an errormsg parameter,
and XLogFindNextRecord using the helper function. Maybe the solution
would be to change both XLogReadRecord and XLogNextRecord to use this
new function to stay consistent, but that means changing their
signatures.

Also, I see the errormsg parameter as a way to signal the caller that
"this function can fail, the detailed error will be available here".
With the XLogReaderGetLastError, it becomes the caller's
responsibility to know which function may fill the error message and
check it accordingly.

The error message is likely printed shortly after the function's call,
so I suspect the risk of using the errormsg after its intended
lifetime is low.

You bring up a good point about the errormsg's lifetime, which is
definitely something to mention in the function's comments. I've
updated the patch with the additional comments.

Regards,
Anthonin Bonnefoy

Вложения

v6-0001-Propagate-errormsg-to-XLogFindNextRecord-caller.patch

Re: Propagate XLogFindNextRecord error to callers

От

Fujii Masao

Дата:

17 марта, 12:48:46

On Thu, Feb 26, 2026 at 5:19 PM Anthonin Bonnefoy
<anthonin.bonnefoy@datadoghq.com> wrote:
>
> Thanks for the comments!
>
> On Tue, Feb 24, 2026 at 5:00 AM Chao Li <li.evan.chao@gmail.com> wrote:
> > From a design perspective, I’m not sure we need to add a new errormsg parameter to XLogFindNextRecord(). The new
parameterultimately just exposes state->errormsg_buf, so the returned errormsg implicitly depends on the lifetime of
state,and we also need extra handling for cases like errormsg == NULL. 
> >
> > Instead, perhaps we could add a helper function, say XLogReaderGetLastError(XLogReaderState *state). which
internallypstrdup()s state->errormsg_buf (after checking errormsg_deferred, etc.). That way the caller owns the
returnedstring explicitly, and there’s no hidden dependency on the reader state’s lifetime. 
> >
> > This would also avoid changing the XLogFindNextRecord() signature while making the ownership semantics clearer.
>
> One issue I see is that it introduces another way to get the error,
> with XLogReadRecord and XLogNextRecord using an errormsg parameter,
> and XLogFindNextRecord using the helper function. Maybe the solution
> would be to change both XLogReadRecord and XLogNextRecord to use this
> new function to stay consistent, but that means changing their
> signatures.
>
> Also, I see the errormsg parameter as a way to signal the caller that
> "this function can fail, the detailed error will be available here".
> With the XLogReaderGetLastError, it becomes the caller's
> responsibility to know which function may fill the error message and
> check it accordingly.
>
> The error message is likely printed shortly after the function's call,
> so I suspect the risk of using the errormsg after its intended
> lifetime is low.
>
> You bring up a good point about the errormsg's lifetime, which is
> definitely something to mention in the function's comments. I've
> updated the patch with the additional comments.

Since this patch is marked Ready for Committer, I've started reviewing it.

+ * When set, *errormsg points to an internal buffer that's valid until the next
+ * call to XLogReadRecord.

Could that buffer also be invalidated by other functions that modify
"XLogReaderState *state", such as XLogBeginRead()?


+# Wrong WAL version. We copy an existing wal file and set the
+# page's magic value to 0000.

Would it be better to describe the purpose of this test at the top?
For example:

    # Test that pg_waldump reports a detailed error message when dumping
    # a WAL file with an invalid magic number (0000).
    {
            # The broken WAL file is created by copying a valid WAL file and
            # overwriting its magic number with 0000.
            my $broken_wal_dir = PostgreSQL::Test::Utils::tempdir_short();


+ open($fh, '+<', $broken_wal);
+ close($fh);

Should we add error handling like:

    open(my $fh, '+<', $broken_wal)
        or BAIL_OUT("open($broken_wal) failed: $!");
    close($fh)
        or BAIL_OUT("close failed: $!");


Also, other similar tests seem to call binmode. Is it really unnecessary
in this case?

Regards,

--
Fujii Masao

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Propagate XLogFindNextRecord error to callers

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения