Обсуждение: pgstattuple "unexpected zero page" for gist and hash indexes

Поиск
Список
Период
Сортировка

pgstattuple "unexpected zero page" for gist and hash indexes

От
Nitin Motiani
Дата:
Hi Hackers,

$SUBJECT happens if we crash just after extending the index. Noah
Misch looked into it and came up with the steps to simulate this by
patching pgstattuple to extend each rel it visits:

====
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -254,6 +254,8 @@ pgstat_relation(Relation rel, FunctionCallInfo fcinfo)
                                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                                 errmsg("cannot access temporary
tables of other sessions")));

+       ReleaseBuffer(ExtendBufferedRel(BMR_REL(rel), MAIN_FORKNUM,
NULL, EB_CLEAR_SIZE_CACHE));
+
        if (RELKIND_HAS_TABLE_AM(rel->rd_rel->relkind) ||
====

Then, in the regression database, this reaches the error for each hash index and
each gist index:

====
CREATE EXTENSION pgstattuple;
CREATE FUNCTION pgstattuple_safe(IN reloid regclass) RETURNS text
LANGUAGE plpgsql AS $$
BEGIN
        RETURN pgstattuple($1);
EXCEPTION WHEN OTHERS THEN
        RETURN sqlerrm;
END
$$;
SELECT * from (
 SELECT pgstattuple_safe(oid), relkind, oid::regclass
 FROM pg_class
 WHERE pg_relation_filenode(oid) <> 0
)
WHERE pgstattuple_safe NOT LIKE '%supported%'
AND pgstattuple_safe NOT LIKE '(%'
ORDER BY 1;
====

There have been prior reports:
2016-09 https://www.postgresql.org/message-id/CAA4eK1%2BzT16bxVpF3J9UsiZxExCqFa6QxzUfazmJthW6dO78ww%40mail.gmail.com
2016-09 https://www.postgresql.org/message-id/CAE9k0PnCaBkMgsDGuuPnPPTrQUc%3Dy9NiQvvsFFQkDNGcjYSajg%40mail.gmail.com
2021-02 docs https://www.postgresql.org/message-id/161280049644.664.3414087059452030397@wrigleys.postgresql.org


For hash indexes, this error comes from pgstat_hash_page() ->
_hash_getbuf_with_strategy() -> _hash_checkpage() for a zero page.
Similarly gistcheckpage() is the cause for gist indexes.

Since after a crash recovery a zero page is normal (as long as not
referenced from the rest of the index), emitting
ERRCODE_INDEX_CORRUPTED is a false alarm.

To fix this, we propose that we remove the checks from
pgstat_hash_page() and pgstat_gist_page(). This is something which
pgstat_btree_page() already does. It uses a lower-level function
ReadBufferExtended() and avoids using _bt_getbuf() which would check
for the "unexpected zero page".

I'm attaching a patch for the above which does the following :

1. Replaces _hash_getbuf_with_strategy() with ReadBufferExtended() in
pgstat_hash_page(). Then for non-PageIsNew() pages, we explicitly
check the PageGetSpecialSize().

2. Similarly gistcheckpage() is removed in pgstat_gist_page(). And for
non-PageIsNew(), we explicitly check the PageGetSpecialSize() before
checking if it's a leaf page.

I confirmed that this was working by running the above mentioned
simulation steps along with the patch.

Note that _hash_getbuf_with_strategy() also checks if blkno is P_NEW.
But that check seems unnecessary here as P_NEW is the same as
InvalidBlockNo and pgstat_hash_page() is called by pgstat_index() for
the block nos starting from HASH_METAPAGE + 1 and remains <
RelationGetNumberOfBlocks().

An alternative might be that we still call the check functions for the
non-PageIsNew() cases. Following were Noah Misch's comments on these
options :

I mildly lean toward not calling these *check*page functions, for two
reasons.  First, pgstattuple() is not a general integrity checker.  It
should try its best to count things, and it
shouldn't go out of its way to report corruption.  Second, pgstattuple doesn't
depend on access-method-specific index integrity.  pgstat_gist_page() is a thin
wrapper around the generic pgstat_index_page().  pgstat_hash_page() does
slightly more, but a page failing the _hash_checkpage() checks still wouldn't
elicit intolerable behavior from pgstat_hash_page().

Let me know what you folks think of this.

Thanks & Regards,
Nitin Motiani
Google

Вложения

Re: pgstattuple "unexpected zero page" for gist and hash indexes

От
Dilip Kumar
Дата:
On Tue, Sep 16, 2025 at 1:06 PM Nitin Motiani <nitinmotiani@google.com> wrote:
>
> For hash indexes, this error comes from pgstat_hash_page() ->
> _hash_getbuf_with_strategy() -> _hash_checkpage() for a zero page.
> Similarly gistcheckpage() is the cause for gist indexes.
>
> Since after a crash recovery a zero page is normal (as long as not
> referenced from the rest of the index), emitting
> ERRCODE_INDEX_CORRUPTED is a false alarm.

I agree this seems like a false alarm, and at least its inconsistent
with the btree.

> To fix this, we propose that we remove the checks from
> pgstat_hash_page() and pgstat_gist_page(). This is something which
> pgstat_btree_page() already does. It uses a lower-level function
> ReadBufferExtended() and avoids using _bt_getbuf() which would check
> for the "unexpected zero page".
>
> I'm attaching a patch for the above which does the following :
>
> 1. Replaces _hash_getbuf_with_strategy() with ReadBufferExtended() in
> pgstat_hash_page(). Then for non-PageIsNew() pages, we explicitly
> check the PageGetSpecialSize().
>
> 2. Similarly gistcheckpage() is removed in pgstat_gist_page(). And for
> non-PageIsNew(), we explicitly check the PageGetSpecialSize() before
> checking if it's a leaf page.
>
> I confirmed that this was working by running the above mentioned
> simulation steps along with the patch.

Yeah that seems fine to me.

> Note that _hash_getbuf_with_strategy() also checks if blkno is P_NEW.
> But that check seems unnecessary here as P_NEW is the same as
> InvalidBlockNo and pgstat_hash_page() is called by pgstat_index() for
> the block nos starting from HASH_METAPAGE + 1 and remains <
> RelationGetNumberOfBlocks().

Right, we don't need that check here.

I observed that pgstat_btree_page() incorporates the count of new
pages into its free space calculation[1]. Does it make sense to do the
same for hash and gist as well as we are leaning towards making these
consistent.

[1]
if (PageIsNew(page))
{
/* fully empty page */
stat->free_space += BLCKSZ;
}

--
Regards,
Dilip Kumar
Google



Re: pgstattuple "unexpected zero page" for gist and hash indexes

От
Nitin Motiani
Дата:
On Mon, Sep 29, 2025 at 9:03 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> I observed that pgstat_btree_page() incorporates the count of new
> pages into its free space calculation[1]. Does it make sense to do the
> same for hash and gist as well as we are leaning towards making these
> consistent.
>
> [1]
> if (PageIsNew(page))
> {
> /* fully empty page */
> stat->free_space += BLCKSZ;
> }
>

Thanks for the feedback. Yes, it makes sense to do the same free space
calculation for the other index types unless there is some special
handling in btree implementation of the free space. From the
pgstattuple code, I don't see any special handling so I have made the
change suggested by you. I'm attaching patch v2 with that change.

Thanks

Вложения

Re: pgstattuple "unexpected zero page" for gist and hash indexes

От
Michael Paquier
Дата:
On Mon, Sep 29, 2025 at 09:03:13AM +0530, Dilip Kumar wrote:
> On Tue, Sep 16, 2025 at 1:06 PM Nitin Motiani <nitinmotiani@google.com> wrote:
>> For hash indexes, this error comes from pgstat_hash_page() ->
>> _hash_getbuf_with_strategy() -> _hash_checkpage() for a zero page.
>> Similarly gistcheckpage() is the cause for gist indexes.
>>
>> Since after a crash recovery a zero page is normal (as long as not
>> referenced from the rest of the index), emitting
>> ERRCODE_INDEX_CORRUPTED is a false alarm.
>
> I agree this seems like a false alarm, and at least its inconsistent
> with the btree.

It also means that if one is doing a join of pg_class with pgstattuple
to gather some diagnostics on a given database, then one is required
to wrap pgstattuple into a custom plpgsql function to catch errors,
like you did, or avoid these indexes in all cases.  This is not
user-friendly in both cases.

On Tue, Sep 30, 2025 at 10:46:11AM +0530, Nitin Motiani wrote:
> On Mon, Sep 29, 2025 at 9:03 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> I observed that pgstat_btree_page() incorporates the count of new
>> pages into its free space calculation[1]. Does it make sense to do the
>> same for hash and gist as well as we are leaning towards making these
>> consistent.
>>
>> [1]
>> if (PageIsNew(page))
>> {
>> /* fully empty page */
>> stat->free_space += BLCKSZ;
>> }
>
> Thanks for the feedback. Yes, it makes sense to do the same free space
> calculation for the other index types unless there is some special
> handling in btree implementation of the free space. From the
> pgstattuple code, I don't see any special handling so I have made the
> change suggested by you. I'm attaching patch v2 with that change.

Including the space for the new page for gist in the size makes sense
here.  gist has protections for empty pages in VACUUM, and can grab
empty pages through the FSM.

For hash, we don't have much of that AFAIK, but your argument with the
timely crash validates the fact that we'd better include this number
in the free size as it is part of the relation anyway.

In short, fine by me for both.  If others have an opinion, feel free.
--
Michael

Вложения

Re: pgstattuple "unexpected zero page" for gist and hash indexes

От
Dilip Kumar
Дата:
On Tue, Sep 30, 2025 at 12:58 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Mon, Sep 29, 2025 at 09:03:13AM +0530, Dilip Kumar wrote:
> > On Tue, Sep 16, 2025 at 1:06 PM Nitin Motiani <nitinmotiani@google.com> wrote:
> >> For hash indexes, this error comes from pgstat_hash_page() ->
> >> _hash_getbuf_with_strategy() -> _hash_checkpage() for a zero page.
> >> Similarly gistcheckpage() is the cause for gist indexes.
> >>
> >> Since after a crash recovery a zero page is normal (as long as not
> >> referenced from the rest of the index), emitting
> >> ERRCODE_INDEX_CORRUPTED is a false alarm.
> >
> > I agree this seems like a false alarm, and at least its inconsistent
> > with the btree.
>
> It also means that if one is doing a join of pg_class with pgstattuple
> to gather some diagnostics on a given database, then one is required
> to wrap pgstattuple into a custom plpgsql function to catch errors,
> like you did, or avoid these indexes in all cases.  This is not
> user-friendly in both cases.

Right, I agree with that.

> On Tue, Sep 30, 2025 at 10:46:11AM +0530, Nitin Motiani wrote:
> > On Mon, Sep 29, 2025 at 9:03 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >> I observed that pgstat_btree_page() incorporates the count of new
> >> pages into its free space calculation[1]. Does it make sense to do the
> >> same for hash and gist as well as we are leaning towards making these
> >> consistent.
> >>
> >> [1]
> >> if (PageIsNew(page))
> >> {
> >> /* fully empty page */
> >> stat->free_space += BLCKSZ;
> >> }
> >
> > Thanks for the feedback. Yes, it makes sense to do the same free space
> > calculation for the other index types unless there is some special
> > handling in btree implementation of the free space. From the
> > pgstattuple code, I don't see any special handling so I have made the
> > change suggested by you. I'm attaching patch v2 with that change.
>
> Including the space for the new page for gist in the size makes sense
> here.  gist has protections for empty pages in VACUUM, and can grab
> empty pages through the FSM.
>
> For hash, we don't have much of that AFAIK, but your argument with the
> timely crash validates the fact that we'd better include this number
> in the free size as it is part of the relation anyway.

Right

> In short, fine by me for both.  If others have an opinion, feel free.

Thanks for your feedback Michael.

--
Regards,
Dilip Kumar
Google



Re: pgstattuple "unexpected zero page" for gist and hash indexes

От
Dilip Kumar
Дата:
On Tue, Sep 30, 2025 at 5:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Sep 30, 2025 at 12:58 PM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Mon, Sep 29, 2025 at 09:03:13AM +0530, Dilip Kumar wrote:
> > > On Tue, Sep 16, 2025 at 1:06 PM Nitin Motiani <nitinmotiani@google.com> wrote:
> > >> For hash indexes, this error comes from pgstat_hash_page() ->
> > >> _hash_getbuf_with_strategy() -> _hash_checkpage() for a zero page.
> > >> Similarly gistcheckpage() is the cause for gist indexes.
> > >>
> > >> Since after a crash recovery a zero page is normal (as long as not
> > >> referenced from the rest of the index), emitting
> > >> ERRCODE_INDEX_CORRUPTED is a false alarm.
> > >
> > > I agree this seems like a false alarm, and at least its inconsistent
> > > with the btree.
> >
> > It also means that if one is doing a join of pg_class with pgstattuple
> > to gather some diagnostics on a given database, then one is required
> > to wrap pgstattuple into a custom plpgsql function to catch errors,
> > like you did, or avoid these indexes in all cases.  This is not
> > user-friendly in both cases.
>
> Right, I agree with that.
>
> > On Tue, Sep 30, 2025 at 10:46:11AM +0530, Nitin Motiani wrote:
> > > On Mon, Sep 29, 2025 at 9:03 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >> I observed that pgstat_btree_page() incorporates the count of new
> > >> pages into its free space calculation[1]. Does it make sense to do the
> > >> same for hash and gist as well as we are leaning towards making these
> > >> consistent.
> > >>
> > >> [1]
> > >> if (PageIsNew(page))
> > >> {
> > >> /* fully empty page */
> > >> stat->free_space += BLCKSZ;
> > >> }

Looking into the V2 below check [1]

My primary concern is the handling of non-new pages where the page's
special size does not match GISTPageOpaqueData (same for
pgstat_hash_page()). I mean isn't that corruption and this should be
reported.  OTOH, pgstat_btree_page() is also not reporting it, in fact
it is assuming that if the page is not new then it can safely read the
page opaque data without validating the size.  My additional goal with
this patch is to standardize the validation logic across all three
functions to prevent these inconsistencies in the future.

[1]
if (PageIsNew(page))
{
/* fully empty page */
stat->free_space += BLCKSZ;
}
else if (PageGetSpecialSize(page) == MAXALIGN(sizeof(GISTPageOpaqueData)) &&
GistPageIsLeaf(page))
{
pgstat_index_page(stat, page, FirstOffsetNumber,
PageGetMaxOffsetNumber(page));
}
else
{
/* root or node or corrupted */
}

--
Regards,
Dilip Kumar
Google



Re: pgstattuple "unexpected zero page" for gist and hash indexes

От
Michael Paquier
Дата:
On Tue, Sep 30, 2025 at 05:31:46PM +0530, Dilip Kumar wrote:
> My primary concern is the handling of non-new pages where the page's
> special size does not match GISTPageOpaqueData (same for
> pgstat_hash_page()). I mean isn't that corruption and this should be
> reported.  OTOH, pgstat_btree_page() is also not reporting it, in fact
> it is assuming that if the page is not new then it can safely read the
> page opaque data without validating the size.  My additional goal with
> this patch is to standardize the validation logic across all three
> functions to prevent these inconsistencies in the future.

pgstattuple's job is to report tuple-level statistics.  I don't really
think that we need to make it more complicated or smarter to detect
corruption cases, because we have other tools that are designed for
that, like amcheck.  True that amcheck does not have support for gist
and hash as checks for index AMs, but it's something that could be
sorted out on its own as a new feature.

From this perspective, v2 seems kind of fine based on its simplicity
and because it makes pgstattuple do a better job in terms of
statistics numbers reported.  I think that I would just remove the "or
corrupted" in the extra comment, though.
--
Michael

Вложения

Re: pgstattuple "unexpected zero page" for gist and hash indexes

От
Nitin Motiani
Дата:
Apologies, I accidentally sent my previous reply only to Michael
instead of hitting 'reply all'. Adding the contents of those messages
in the quoted text.

On Wed, Oct 1, 2025 at 4:45 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Wed, Oct 01, 2025 at 04:17:49PM +0530, Nitin Motiani wrote:
> > Thanks Michael. We can keep the simple change we have in v2 without
> > reporting any corruption. But perhaps we should check for the opaque
> > size mismatch for btree (as it's already done for gist and hash) to
> > keep the code consistent for all three. We can avoid any reporting or
> > further analysis but we can skip the other operations in the case of
> > size mismatch. What are your thoughts on that?
>
> You mean an check on BTPageOpaqueData with a new else branch in
> pgstat_btree_page()?  Yep, let's do that as well.
> --
> Michael

Thanks Michael. I'm attaching v3 with this change.

Вложения

Re: pgstattuple "unexpected zero page" for gist and hash indexes

От
Dilip Kumar
Дата:
On Wed, Oct 1, 2025 at 8:32 PM Nitin Motiani <nitinmotiani@google.com> wrote:
>
> Apologies, I accidentally sent my previous reply only to Michael
> instead of hitting 'reply all'. Adding the contents of those messages
> in the quoted text.
>
> On Wed, Oct 1, 2025 at 4:45 PM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Wed, Oct 01, 2025 at 04:17:49PM +0530, Nitin Motiani wrote:
> > > Thanks Michael. We can keep the simple change we have in v2 without
> > > reporting any corruption. But perhaps we should check for the opaque
> > > size mismatch for btree (as it's already done for gist and hash) to
> > > keep the code consistent for all three. We can avoid any reporting or
> > > further analysis but we can skip the other operations in the case of
> > > size mismatch. What are your thoughts on that?
> >
> > You mean an check on BTPageOpaqueData with a new else branch in
> > pgstat_btree_page()?  Yep, let's do that as well.
> > --
>
> Thanks Michael. I'm attaching v3 with this change.

I think this looks good to me now, lets see what Michael has to say on
this, after that we can mark ready for committer.

--
Regards,
Dilip Kumar
Google



Re: pgstattuple "unexpected zero page" for gist and hash indexes

От
Michael Paquier
Дата:
On Wed, Oct 01, 2025 at 08:53:01PM +0530, Dilip Kumar wrote:
> I think this looks good to me now, lets see what Michael has to say on
> this, after that we can mark ready for committer.

Thanks for the updated patch.  I have tweaked a bit more the code
blocks to be more consistent with each other, and applied the change
down to v13, as there have been numerous complaints on the matter.
--
Michael

Вложения

Re: pgstattuple "unexpected zero page" for gist and hash indexes

От
Nitin Motiani
Дата:
Thank you, Michael!