Обсуждение: Fix possible 'unexpected data beyond EOF' on replica restart

Поиск

Список

Период

Сортировка

Fix possible 'unexpected data beyond EOF' on replica restart

От

Anthonin Bonnefoy

Дата:

16 декабря 2025 г., 16:07:58

Hi,

On restart, a replica can fail with an 'unexpected data beyond EOF in block x of relation T/D/R' error. This happened on a PG17.7 and I've been able to reproduce it on PG 18. This can happen under the following circumstances:

- A relation has a size of 400 blocks.
- Blocks 201 to 400 are empty.
- Block 200 has two rows.
- Blocks 101 to 199 are empty.
- A restartpoint is done
- Vacuum truncates the relation to 200 blocks
- A FPW deletes a row in block 200
- A checkpoint is done
- A FPW deletes the last row in block 200
- Vacuum truncates the relation to 100 blocks
- The replica restarts

When the replica restarts:
- The relation on disk is reduced to 100 blocks due to having applied the truncate before restart.
- The first truncate to 200 blocks is replayed. It silently fails in mdtruncate since 'nblocks > curnblk', but the caller isn't aware of that and will still update the cached size to 200 blocks
- The first FPW on block 200 is applied, XLogReadBufferForRead will rely on the cached size and incorrectly assume the page exists in file, and thus won't extend the relation.
- The Checkpoint Online is replayed, calling smgrdestroyall which will discard the cached size.
- The second FPW on block 200 is applied. This time, the detected size is 100 blocks, an extend is attempted. However, the block 200 is already present in the buffer table due to the first FPW. This triggers the 'unexpected data beyond EOF' since the page isn't new.

The issue can be reproduced with the following script:

"""

pgbench -i

# Prepare the relation

psql -c "DELETE FROM pgbench_accounts WHERE aid > 80000 AND aid != ALL('{90000, 90001}');"
psql -c "VACUUM (VERBOSE, INDEX_CLEANUP ON, TRUNCATE OFF) pgbench_accounts;"

# Restartpoint here
psql -c "CHECKPOINT;"
psql -p 5433 -c "CHECKPOINT;"

# First truncate
psql -c "VACUUM (VERBOSE, INDEX_CLEANUP ON, TRUNCATE ON) pgbench_accounts;"

# First FPW deletion
psql -c "DELETE FROM pgbench_accounts WHERE aid = 90001;"

# Second FPW deletion

psql -c "CHECKPOINT;"
psql -c "DELETE FROM pgbench_accounts WHERE aid = 90000;"

# Second truncate
psql -c "VACUUM (VERBOSE, INDEX_CLEANUP ON, TRUNCATE ON) pgbench_accounts;"

# Let some time for replica to replay the truncate
psql -c "SELECT pg_sleep(1);"

# Stop without advancing the restartpoint
kill -9 $(pgrep -f "pg_data_replica")

# Restart should fail with the EOF error
pg_ctl -D pg_data_replica restart

"""

This assumes the replica is running on port 5433 and no hot_standby_feedback (otherwise, tuples will be seen as 'not yet removable'). I've used kill -9 to avoid advancing the restart point, but I've seen the issue happening with a clean shutdown.

The patch fixes the issue by moving smgr_cached_nblocks updates in mdtruncate and only updating the cached value if truncate was successful.

Regards,

Anthonin Bonnefoy

Вложения

v1-0001-Fix-unexpected-data-beyond-EOF-on-replica-restart.patch

Re: Fix possible 'unexpected data beyond EOF' on replica restart

От

Amul Sul

Дата:

17 декабря 2025 г., 10:26:15

On Tue, Dec 16, 2025 at 6:38 PM Anthonin Bonnefoy
<anthonin.bonnefoy@datadoghq.com> wrote:
>
>  [...]
> The patch fixes the issue by moving smgr_cached_nblocks updates in mdtruncate and only updating the cached value if
truncatewas successful. 
>

Thanks for detailed reproducible steps, I can see the reported issue
and proposed patch fixes the same. Patch looks good to me except
following changes in smgrtruncate():

-       /* Make the cached size is invalid if we encounter an error. */
-       reln->smgr_cached_nblocks[forknum[i]] = InvalidBlockNumber;
-
        smgrsw[reln->smgr_which].smgr_truncate(reln, forknum[i],
                                               old_nblocks[i], nblocks[i]);

The deleted code you moved to mdtruncate() should be kept where it
was. We cannot ensure that every extension using this interface will
adhere to that requirement what the comment describes. Furthermore,
an extension's routine might miss updating smgr_cached_nblocks.
To ensure that updates smgr_cached_nblocks properly, we should also
add an assertion after the call, like this:

        /*
         * Ensure that the local smgr_cached_nblocks value is updated.
         */
        Assert(reln->smgr_cached_nblocks[forknum[i]] != InvalidBlockNumber);

Regards,
Amul

Re: Fix possible 'unexpected data beyond EOF' on replica restart

От

Anthonin Bonnefoy

Дата:

17 декабря 2025 г., 11:40:28

On Wed, Dec 17, 2025 at 8:26 AM Amul Sul <sulamul@gmail.com> wrote:

Thanks for detailed reproducible steps, I can see the reported issue
and proposed patch fixes the same.

Thanks for the review!

The deleted code you moved to mdtruncate() should be kept where it
was. We cannot ensure that every extension using this interface will
adhere to that requirement what the comment describes.

Yeah, I've overlooked the case of extensions. I've moved back the InvalidBlockNumber assignment.

Furthermore, an extension's routine might miss updating

smgr_cached_nblocks. To ensure that updates smgr_cached_nblocks

properly, we should also add an assertion after the call, like this:

/*
* Ensure that the local smgr_cached_nblocks value is updated.
*/
Assert(reln->smgr_cached_nblocks[forknum[i]] != InvalidBlockNumber);

Good point. I've added the assertion.

I wonder how critical it is to have an up to date value of smgr_cached_nblocks after smgr_truncate. Leaving InvalidBlockNumber was also an option as the next user will ask the kernel for the real size. There are some functions like DropRelationBuffers which rely only on the cached value, so it's probably safer to keep the same behaviour.

Regards,

Anthonin Bonnefoy

Вложения

v2-0001-Fix-unexpected-data-beyond-EOF-on-replica-restart.patch

Re: Fix possible 'unexpected data beyond EOF' on replica restart

От

Heikki Linnakangas

Дата:

18 декабря 2025 г., 17:18:46

On 17/12/2025 10:40, Anthonin Bonnefoy wrote:
> On Wed, Dec 17, 2025 at 8:26 AM Amul Sul <sulamul@gmail.com 
> <mailto:sulamul@gmail.com>> wrote:
> 
>> The deleted code you moved to mdtruncate() should be kept where it
>> was. We cannot ensure that every extension using this interface will
>> adhere to that requirement what the comment describes. 
> 
> Yeah, I've overlooked the case of extensions. I've moved back the 
> InvalidBlockNumber assignment.

The smgr interface isn't really an extension point, so we don't need to 
worry about that. (I wish it was, but that's a different story [1])

I don't think mdtruncate() should modify smgr_cached_nblocks, we should 
keep that in smgrtruncate(). smgr.c is responsible for all other updates 
of smgr_cached_nblocks, so doing it in mdtruncate() would be a layering 
violation.

I'm thinking that we should do the attached. Untested, and we should 
also add a comment to smgrtruncate() and mdtruncate() to explain how 
they behave if nblocks > curnblk.

I wonder if we should move the whole "if (nblocks > curnblk)" check and 
ereport() from mdtruncate() to smgrtruncate(). That logic doesn't really 
depend on anything specific to md.c. If you'd imagine a different smgr 
implementation, it'd need to just copy-paste that check. It's the 
caller's mistake if it passes nblocks > curnblk, when not in recovery. 
Then again, we do have other places in md.c too that behave differently 
when InRecovery.

What do you think?

> I wonder how critical it is to have an up to date value of 
> smgr_cached_nblocks after smgr_truncate. Leaving InvalidBlockNumber was 
> also an option as the next user will ask the kernel for the real size. 
> There are some functions like DropRelationBuffers which rely only on the 
> cached value, so it's probably safer to keep the same behaviour.

Leaving it invalid should work. But as the comment says, we might as 
well update the cached value since we have the value at hand. It's just 
that we were doing it wrong.

[1] 
https://www.postgresql.org/message-id/CAEze2WgMySu2suO_TLvFyGY3URa4mAx22WeoEicnK%3DPCNWEMrA%40mail.gmail.com

- Heikki

Вложения

v3-Fix-unexpected-data-beyond-EOF-on-replica-restart.patch

Re: Fix possible 'unexpected data beyond EOF' on replica restart

От

Anthonin Bonnefoy

Дата:

18 декабря 2025 г., 19:12:24

On Thu, Dec 18, 2025 at 3:18 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> The smgr interface isn't really an extension point, so we don't need to
> worry about that. (I wish it was, but that's a different story [1])

Thanks for the additional context.

> I don't think mdtruncate() should modify smgr_cached_nblocks, we should
> keep that in smgrtruncate(). smgr.c is responsible for all other updates
> of smgr_cached_nblocks, so doing it in mdtruncate() would be a layering
> violation.
>
> I'm thinking that we should do the attached. Untested, and we should
> also add a comment to smgrtruncate() and mdtruncate() to explain how
> they behave if nblocks > curnblk.

This appears to work. The cached size is correctly updated and my
script doesn't trigger the bug anymore. I've updated the patch with
this approach.

> I wonder if we should move the whole "if (nblocks > curnblk)" check and
> ereport() from mdtruncate() to smgrtruncate(). That logic doesn't really
> depend on anything specific to md.c. If you'd imagine a different smgr
> implementation, it'd need to just copy-paste that check. It's the
> caller's mistake if it passes nblocks > curnblk, when not in recovery.
> Then again, we do have other places in md.c too that behave differently
> when InRecovery.
>
> What do you think?

I imagine we will still want some sanity checks in mdtruncate(), or at
least an Assert to make sure the provided block values are correct.
This also assumes that we will only have one caller to mdtruncate(),
another caller will have to duplicate the check.

And in the context of a backpatch, the current approach has the
benefit of minimising the amount of change, so I am slightly partial
to keeping the check in mdtruncate().

Regards,
Anthonin Bonnefoy

Вложения

v4-0001-Fix-unexpected-data-beyond-EOF-on-replica-restart.patch

Re: Fix possible 'unexpected data beyond EOF' on replica restart

От

Anthonin Bonnefoy

Дата:

12 января, 11:34:21

Here's a new version with a bit of rewording.

Вложения

v5-0001-Fix-unexpected-data-beyond-EOF-on-replica-restart.patch

Re: Fix possible 'unexpected data beyond EOF' on replica restart

От

Heikki Linnakangas

Дата:

15 января, 22:11:54

On 12/01/2026 10:34, Anthonin Bonnefoy wrote:
> Here's a new version with a bit of rewording.

Applied to all branches. Thanks!

I left out this comment change on smgrtruncate():

> @@ -870,6 +870,9 @@ smgrnblocks_cached(SMgrRelation reln, ForkNumber forknum)
>   * be called in a critical section, but the current size must be checked
>   * outside the critical section, and no interrupts or smgr functions relating
>   * to this relation should be called in between.
> + *
> + * If the specified number of blocks is higher than the current size, the
> + * request is ignored when we are InRecovery, otherwise, an error is raised.
>   */
>  void
>  smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks,

I would love to have a comment along those lines, but I think it's 
misleading here, because smgrtruncate() actually does a few things even 
when the requested size is greater than the current size: it 
nevertheless calls DropRelationBuffers(), CacheInvalidateSmgr(), and 
updates the cached sizes in the SmgrRelation. So it seems misleading to 
say the request is ignored.

I wasn't sure how to phrase that, so I just left out that comment 
change. I'm open to proposals on a good wording though.

(The similar comment in mdnblocks() is accurate, mdnblocks() bails out 
very early, so I kept that)

- Heikki

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Fix possible 'unexpected data beyond EOF' on replica restart

Fix possible 'unexpected data beyond EOF' on replica restart

Вложения

Re: Fix possible 'unexpected data beyond EOF' on replica restart

Re: Fix possible 'unexpected data beyond EOF' on replica restart

Вложения

Re: Fix possible 'unexpected data beyond EOF' on replica restart

Вложения

Re: Fix possible 'unexpected data beyond EOF' on replica restart

Вложения

Re: Fix possible 'unexpected data beyond EOF' on replica restart

Вложения

Re: Fix possible 'unexpected data beyond EOF' on replica restart