Обсуждение: pgstattuple "unexpected zero page" for gist and hash indexes
Hi Hackers, $SUBJECT happens if we crash just after extending the index. Noah Misch looked into it and came up with the steps to simulate this by patching pgstattuple to extend each rel it visits: ==== --- a/contrib/pgstattuple/pgstattuple.c +++ b/contrib/pgstattuple/pgstattuple.c @@ -254,6 +254,8 @@ pgstat_relation(Relation rel, FunctionCallInfo fcinfo) (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("cannot access temporary tables of other sessions"))); + ReleaseBuffer(ExtendBufferedRel(BMR_REL(rel), MAIN_FORKNUM, NULL, EB_CLEAR_SIZE_CACHE)); + if (RELKIND_HAS_TABLE_AM(rel->rd_rel->relkind) || ==== Then, in the regression database, this reaches the error for each hash index and each gist index: ==== CREATE EXTENSION pgstattuple; CREATE FUNCTION pgstattuple_safe(IN reloid regclass) RETURNS text LANGUAGE plpgsql AS $$ BEGIN RETURN pgstattuple($1); EXCEPTION WHEN OTHERS THEN RETURN sqlerrm; END $$; SELECT * from ( SELECT pgstattuple_safe(oid), relkind, oid::regclass FROM pg_class WHERE pg_relation_filenode(oid) <> 0 ) WHERE pgstattuple_safe NOT LIKE '%supported%' AND pgstattuple_safe NOT LIKE '(%' ORDER BY 1; ==== There have been prior reports: 2016-09 https://www.postgresql.org/message-id/CAA4eK1%2BzT16bxVpF3J9UsiZxExCqFa6QxzUfazmJthW6dO78ww%40mail.gmail.com 2016-09 https://www.postgresql.org/message-id/CAE9k0PnCaBkMgsDGuuPnPPTrQUc%3Dy9NiQvvsFFQkDNGcjYSajg%40mail.gmail.com 2021-02 docs https://www.postgresql.org/message-id/161280049644.664.3414087059452030397@wrigleys.postgresql.org For hash indexes, this error comes from pgstat_hash_page() -> _hash_getbuf_with_strategy() -> _hash_checkpage() for a zero page. Similarly gistcheckpage() is the cause for gist indexes. Since after a crash recovery a zero page is normal (as long as not referenced from the rest of the index), emitting ERRCODE_INDEX_CORRUPTED is a false alarm. To fix this, we propose that we remove the checks from pgstat_hash_page() and pgstat_gist_page(). This is something which pgstat_btree_page() already does. It uses a lower-level function ReadBufferExtended() and avoids using _bt_getbuf() which would check for the "unexpected zero page". I'm attaching a patch for the above which does the following : 1. Replaces _hash_getbuf_with_strategy() with ReadBufferExtended() in pgstat_hash_page(). Then for non-PageIsNew() pages, we explicitly check the PageGetSpecialSize(). 2. Similarly gistcheckpage() is removed in pgstat_gist_page(). And for non-PageIsNew(), we explicitly check the PageGetSpecialSize() before checking if it's a leaf page. I confirmed that this was working by running the above mentioned simulation steps along with the patch. Note that _hash_getbuf_with_strategy() also checks if blkno is P_NEW. But that check seems unnecessary here as P_NEW is the same as InvalidBlockNo and pgstat_hash_page() is called by pgstat_index() for the block nos starting from HASH_METAPAGE + 1 and remains < RelationGetNumberOfBlocks(). An alternative might be that we still call the check functions for the non-PageIsNew() cases. Following were Noah Misch's comments on these options : I mildly lean toward not calling these *check*page functions, for two reasons. First, pgstattuple() is not a general integrity checker. It should try its best to count things, and it shouldn't go out of its way to report corruption. Second, pgstattuple doesn't depend on access-method-specific index integrity. pgstat_gist_page() is a thin wrapper around the generic pgstat_index_page(). pgstat_hash_page() does slightly more, but a page failing the _hash_checkpage() checks still wouldn't elicit intolerable behavior from pgstat_hash_page(). Let me know what you folks think of this. Thanks & Regards, Nitin Motiani Google
Вложения
On Tue, Sep 16, 2025 at 1:06 PM Nitin Motiani <nitinmotiani@google.com> wrote: > > For hash indexes, this error comes from pgstat_hash_page() -> > _hash_getbuf_with_strategy() -> _hash_checkpage() for a zero page. > Similarly gistcheckpage() is the cause for gist indexes. > > Since after a crash recovery a zero page is normal (as long as not > referenced from the rest of the index), emitting > ERRCODE_INDEX_CORRUPTED is a false alarm. I agree this seems like a false alarm, and at least its inconsistent with the btree. > To fix this, we propose that we remove the checks from > pgstat_hash_page() and pgstat_gist_page(). This is something which > pgstat_btree_page() already does. It uses a lower-level function > ReadBufferExtended() and avoids using _bt_getbuf() which would check > for the "unexpected zero page". > > I'm attaching a patch for the above which does the following : > > 1. Replaces _hash_getbuf_with_strategy() with ReadBufferExtended() in > pgstat_hash_page(). Then for non-PageIsNew() pages, we explicitly > check the PageGetSpecialSize(). > > 2. Similarly gistcheckpage() is removed in pgstat_gist_page(). And for > non-PageIsNew(), we explicitly check the PageGetSpecialSize() before > checking if it's a leaf page. > > I confirmed that this was working by running the above mentioned > simulation steps along with the patch. Yeah that seems fine to me. > Note that _hash_getbuf_with_strategy() also checks if blkno is P_NEW. > But that check seems unnecessary here as P_NEW is the same as > InvalidBlockNo and pgstat_hash_page() is called by pgstat_index() for > the block nos starting from HASH_METAPAGE + 1 and remains < > RelationGetNumberOfBlocks(). Right, we don't need that check here. I observed that pgstat_btree_page() incorporates the count of new pages into its free space calculation[1]. Does it make sense to do the same for hash and gist as well as we are leaning towards making these consistent. [1] if (PageIsNew(page)) { /* fully empty page */ stat->free_space += BLCKSZ; } -- Regards, Dilip Kumar Google
On Mon, Sep 29, 2025 at 9:03 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > I observed that pgstat_btree_page() incorporates the count of new > pages into its free space calculation[1]. Does it make sense to do the > same for hash and gist as well as we are leaning towards making these > consistent. > > [1] > if (PageIsNew(page)) > { > /* fully empty page */ > stat->free_space += BLCKSZ; > } > Thanks for the feedback. Yes, it makes sense to do the same free space calculation for the other index types unless there is some special handling in btree implementation of the free space. From the pgstattuple code, I don't see any special handling so I have made the change suggested by you. I'm attaching patch v2 with that change. Thanks
Вложения
On Mon, Sep 29, 2025 at 09:03:13AM +0530, Dilip Kumar wrote: > On Tue, Sep 16, 2025 at 1:06 PM Nitin Motiani <nitinmotiani@google.com> wrote: >> For hash indexes, this error comes from pgstat_hash_page() -> >> _hash_getbuf_with_strategy() -> _hash_checkpage() for a zero page. >> Similarly gistcheckpage() is the cause for gist indexes. >> >> Since after a crash recovery a zero page is normal (as long as not >> referenced from the rest of the index), emitting >> ERRCODE_INDEX_CORRUPTED is a false alarm. > > I agree this seems like a false alarm, and at least its inconsistent > with the btree. It also means that if one is doing a join of pg_class with pgstattuple to gather some diagnostics on a given database, then one is required to wrap pgstattuple into a custom plpgsql function to catch errors, like you did, or avoid these indexes in all cases. This is not user-friendly in both cases. On Tue, Sep 30, 2025 at 10:46:11AM +0530, Nitin Motiani wrote: > On Mon, Sep 29, 2025 at 9:03 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: >> I observed that pgstat_btree_page() incorporates the count of new >> pages into its free space calculation[1]. Does it make sense to do the >> same for hash and gist as well as we are leaning towards making these >> consistent. >> >> [1] >> if (PageIsNew(page)) >> { >> /* fully empty page */ >> stat->free_space += BLCKSZ; >> } > > Thanks for the feedback. Yes, it makes sense to do the same free space > calculation for the other index types unless there is some special > handling in btree implementation of the free space. From the > pgstattuple code, I don't see any special handling so I have made the > change suggested by you. I'm attaching patch v2 with that change. Including the space for the new page for gist in the size makes sense here. gist has protections for empty pages in VACUUM, and can grab empty pages through the FSM. For hash, we don't have much of that AFAIK, but your argument with the timely crash validates the fact that we'd better include this number in the free size as it is part of the relation anyway. In short, fine by me for both. If others have an opinion, feel free. -- Michael
Вложения
On Tue, Sep 30, 2025 at 12:58 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Mon, Sep 29, 2025 at 09:03:13AM +0530, Dilip Kumar wrote: > > On Tue, Sep 16, 2025 at 1:06 PM Nitin Motiani <nitinmotiani@google.com> wrote: > >> For hash indexes, this error comes from pgstat_hash_page() -> > >> _hash_getbuf_with_strategy() -> _hash_checkpage() for a zero page. > >> Similarly gistcheckpage() is the cause for gist indexes. > >> > >> Since after a crash recovery a zero page is normal (as long as not > >> referenced from the rest of the index), emitting > >> ERRCODE_INDEX_CORRUPTED is a false alarm. > > > > I agree this seems like a false alarm, and at least its inconsistent > > with the btree. > > It also means that if one is doing a join of pg_class with pgstattuple > to gather some diagnostics on a given database, then one is required > to wrap pgstattuple into a custom plpgsql function to catch errors, > like you did, or avoid these indexes in all cases. This is not > user-friendly in both cases. Right, I agree with that. > On Tue, Sep 30, 2025 at 10:46:11AM +0530, Nitin Motiani wrote: > > On Mon, Sep 29, 2025 at 9:03 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > >> I observed that pgstat_btree_page() incorporates the count of new > >> pages into its free space calculation[1]. Does it make sense to do the > >> same for hash and gist as well as we are leaning towards making these > >> consistent. > >> > >> [1] > >> if (PageIsNew(page)) > >> { > >> /* fully empty page */ > >> stat->free_space += BLCKSZ; > >> } > > > > Thanks for the feedback. Yes, it makes sense to do the same free space > > calculation for the other index types unless there is some special > > handling in btree implementation of the free space. From the > > pgstattuple code, I don't see any special handling so I have made the > > change suggested by you. I'm attaching patch v2 with that change. > > Including the space for the new page for gist in the size makes sense > here. gist has protections for empty pages in VACUUM, and can grab > empty pages through the FSM. > > For hash, we don't have much of that AFAIK, but your argument with the > timely crash validates the fact that we'd better include this number > in the free size as it is part of the relation anyway. Right > In short, fine by me for both. If others have an opinion, feel free. Thanks for your feedback Michael. -- Regards, Dilip Kumar Google
On Tue, Sep 30, 2025 at 5:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, Sep 30, 2025 at 12:58 PM Michael Paquier <michael@paquier.xyz> wrote: > > > > On Mon, Sep 29, 2025 at 09:03:13AM +0530, Dilip Kumar wrote: > > > On Tue, Sep 16, 2025 at 1:06 PM Nitin Motiani <nitinmotiani@google.com> wrote: > > >> For hash indexes, this error comes from pgstat_hash_page() -> > > >> _hash_getbuf_with_strategy() -> _hash_checkpage() for a zero page. > > >> Similarly gistcheckpage() is the cause for gist indexes. > > >> > > >> Since after a crash recovery a zero page is normal (as long as not > > >> referenced from the rest of the index), emitting > > >> ERRCODE_INDEX_CORRUPTED is a false alarm. > > > > > > I agree this seems like a false alarm, and at least its inconsistent > > > with the btree. > > > > It also means that if one is doing a join of pg_class with pgstattuple > > to gather some diagnostics on a given database, then one is required > > to wrap pgstattuple into a custom plpgsql function to catch errors, > > like you did, or avoid these indexes in all cases. This is not > > user-friendly in both cases. > > Right, I agree with that. > > > On Tue, Sep 30, 2025 at 10:46:11AM +0530, Nitin Motiani wrote: > > > On Mon, Sep 29, 2025 at 9:03 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > >> I observed that pgstat_btree_page() incorporates the count of new > > >> pages into its free space calculation[1]. Does it make sense to do the > > >> same for hash and gist as well as we are leaning towards making these > > >> consistent. > > >> > > >> [1] > > >> if (PageIsNew(page)) > > >> { > > >> /* fully empty page */ > > >> stat->free_space += BLCKSZ; > > >> } Looking into the V2 below check [1] My primary concern is the handling of non-new pages where the page's special size does not match GISTPageOpaqueData (same for pgstat_hash_page()). I mean isn't that corruption and this should be reported. OTOH, pgstat_btree_page() is also not reporting it, in fact it is assuming that if the page is not new then it can safely read the page opaque data without validating the size. My additional goal with this patch is to standardize the validation logic across all three functions to prevent these inconsistencies in the future. [1] if (PageIsNew(page)) { /* fully empty page */ stat->free_space += BLCKSZ; } else if (PageGetSpecialSize(page) == MAXALIGN(sizeof(GISTPageOpaqueData)) && GistPageIsLeaf(page)) { pgstat_index_page(stat, page, FirstOffsetNumber, PageGetMaxOffsetNumber(page)); } else { /* root or node or corrupted */ } -- Regards, Dilip Kumar Google
On Tue, Sep 30, 2025 at 05:31:46PM +0530, Dilip Kumar wrote: > My primary concern is the handling of non-new pages where the page's > special size does not match GISTPageOpaqueData (same for > pgstat_hash_page()). I mean isn't that corruption and this should be > reported. OTOH, pgstat_btree_page() is also not reporting it, in fact > it is assuming that if the page is not new then it can safely read the > page opaque data without validating the size. My additional goal with > this patch is to standardize the validation logic across all three > functions to prevent these inconsistencies in the future. pgstattuple's job is to report tuple-level statistics. I don't really think that we need to make it more complicated or smarter to detect corruption cases, because we have other tools that are designed for that, like amcheck. True that amcheck does not have support for gist and hash as checks for index AMs, but it's something that could be sorted out on its own as a new feature. From this perspective, v2 seems kind of fine based on its simplicity and because it makes pgstattuple do a better job in terms of statistics numbers reported. I think that I would just remove the "or corrupted" in the extra comment, though. -- Michael
Вложения
Apologies, I accidentally sent my previous reply only to Michael instead of hitting 'reply all'. Adding the contents of those messages in the quoted text. On Wed, Oct 1, 2025 at 4:45 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Wed, Oct 01, 2025 at 04:17:49PM +0530, Nitin Motiani wrote: > > Thanks Michael. We can keep the simple change we have in v2 without > > reporting any corruption. But perhaps we should check for the opaque > > size mismatch for btree (as it's already done for gist and hash) to > > keep the code consistent for all three. We can avoid any reporting or > > further analysis but we can skip the other operations in the case of > > size mismatch. What are your thoughts on that? > > You mean an check on BTPageOpaqueData with a new else branch in > pgstat_btree_page()? Yep, let's do that as well. > -- > Michael Thanks Michael. I'm attaching v3 with this change.
Вложения
On Wed, Oct 1, 2025 at 8:32 PM Nitin Motiani <nitinmotiani@google.com> wrote: > > Apologies, I accidentally sent my previous reply only to Michael > instead of hitting 'reply all'. Adding the contents of those messages > in the quoted text. > > On Wed, Oct 1, 2025 at 4:45 PM Michael Paquier <michael@paquier.xyz> wrote: > > > > On Wed, Oct 01, 2025 at 04:17:49PM +0530, Nitin Motiani wrote: > > > Thanks Michael. We can keep the simple change we have in v2 without > > > reporting any corruption. But perhaps we should check for the opaque > > > size mismatch for btree (as it's already done for gist and hash) to > > > keep the code consistent for all three. We can avoid any reporting or > > > further analysis but we can skip the other operations in the case of > > > size mismatch. What are your thoughts on that? > > > > You mean an check on BTPageOpaqueData with a new else branch in > > pgstat_btree_page()? Yep, let's do that as well. > > -- > > Thanks Michael. I'm attaching v3 with this change. I think this looks good to me now, lets see what Michael has to say on this, after that we can mark ready for committer. -- Regards, Dilip Kumar Google
On Wed, Oct 01, 2025 at 08:53:01PM +0530, Dilip Kumar wrote: > I think this looks good to me now, lets see what Michael has to say on > this, after that we can mark ready for committer. Thanks for the updated patch. I have tweaked a bit more the code blocks to be more consistent with each other, and applied the change down to v13, as there have been numerous complaints on the matter. -- Michael
Вложения
Thank you, Michael!