Обсуждение: Orphan page in _bt_split

Поиск
Список
Период
Сортировка

Orphan page in _bt_split

От
Konstantin Knizhnik
Дата:
Hi hacker,

The function _bt_split creates new (right) page to split into.
The comment says:

     /*
      * Acquire a new right page to split into, now that left page has a new
      * high key.  From here on, it's not okay to throw an error without
      * zeroing rightpage first.  This coding rule ensures that we won't
      * confuse future VACUUM operations, which might otherwise try to 
re-find
      * a downlink to a leftover junk page as the page undergoes deletion.
      *
      * It would be reasonable to start the critical section just after 
the new
      * rightpage buffer is acquired instead; that would allow us to avoid
      * leftover junk pages without bothering to zero rightpage. We do 
it this
      * way because it avoids an unnecessary PANIC when either origpage 
or its
      * existing sibling page are corrupt.
      */

So we should zero this page in case of reporting error to "not confuse 
vacuum.
It is done for all places where this function reports error before 
entering critical section and wal-logging this page.

But before it there is call to _bt_getbuf:

     /*
      * We have to grab the original right sibling (if any) and update 
its prev
      * link.  We are guaranteed that this is deadlock-free, since we couple
      * the locks in the standard order: left to right.
      */
     if (!isrightmost)
     {
         sbuf = _bt_getbuf(rel, oopaque->btpo_next, BT_WRITE);

_bp_getbuf just reads page in a shared buffer. But ReadBuffer actually 
can invoke half of Postgres code (locating buffer, evicting victim, 
writing to SMGR,..., So there are a lot of places where error can be 
thrown (including even such error as statement timeout).
And in this case current transaction will be aborted without zeroing the 
right page.
So vacuum will be confused.

It can be quite easily demonstrated by inserting error here:

     /*
      * We have to grab the original right sibling (if any) and update 
its prev
      * link.  We are guaranteed that this is deadlock-free, since we couple
      * the locks in the standard order: left to right.
      */
     if (!isrightmost)
     {
         sbuf = _bt_getbuf(rel, oopaque->btpo_next, BT_WRITE);
         elog(ERROR, "@@@ Terminated");


And doing something like this:

postgres=# create table t3(pk integer primary key, sk integer, payload 
integer);
CREATE TABLE
postgres=# create index on t3(sk);
CREATE INDEX
postgres=# insert into t3 values 
(generate_series(1,1000000),random()*1000000,0);
ERROR:  @@@ Terminated
postgres=# vacuum t3;
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back 
the current transaction and exit, because another server process exited 
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and 
repeat your command.
server closed the connection unexpectedly
     This probably means the server terminated abnormally
     before or while processing the request.


log:

2025-09-01 07:46:37.459 EEST [89290] ERROR:  @@@ Terminated
2025-09-01 07:46:37.459 EEST [89290] STATEMENT:  insert into t3 values 
(generate_series(1,1000000),random()*1000000,0);
2025-09-01 07:46:42.391 EEST [89406] LOG:  failed to re-find parent key 
in index "t3_sk_idx" for deletion target page 4
2025-09-01 07:46:42.391 EEST [89406] CONTEXT:  while vacuuming index 
"t3_sk_idx" of relation "public.t3"
TRAP: failed Assert("false"), File: "nbtpage.c", Line: 2849, PID: 89406
0   postgres                            0x0000000104bf2f64 
ExceptionalCondition + 216
1   postgres                            0x00000001043ace94 
_bt_lock_subtree_parent + 248
2   postgres                            0x00000001043aaf98 
_bt_mark_page_halfdead + 508
3   postgres                            0x00000001043aaaec _bt_pagedel + 
1068
4   postgres                            0x00000001043b56cc btvacuumpage 
+ 2116
5   postgres                            0x00000001043b4dd4 btvacuumscan 
+ 564


This code is not changed for quite long time so I wonder why nobody 
noticed this error before?

Proposed patch is attached.
It creates copy of right page in the same way as for left (original) page.
And updates the page only in critical section.



Вложения

Re: Orphan page in _bt_split

От
Michael Paquier
Дата:
On Mon, Sep 01, 2025 at 08:03:18AM +0300, Konstantin Knizhnik wrote:
> _bp_getbuf just reads page in a shared buffer. But ReadBuffer actually can
> invoke half of Postgres code (locating buffer, evicting victim, writing to
> SMGR,..., So there are a lot of places where error can be thrown (including
> even such error as statement timeout).
> And in this case current transaction will be aborted without zeroing the
> right page.
> So vacuum will be confused.
>
> It can be quite easily demonstrated by inserting error here:

Nice investigation and report, that I assume you have just guessed
from a read of the code and that there could be plenty of errors that
could happen in this code path.  It indeed looks like some weak
coding assumption introduced in this code path by 9b42e713761a from
2019, going down to v11.

We could have a SQL regression test for this case, just put a
INJECTION_POINT(), then force an ERROR callback to force an incorrect
state.  The test can be made cheap enough.

> This code is not changed for quite long time so I wonder why nobody noticed
> this error before?

I am ready to believe that errors are just hard to reach in this path.

> It creates copy of right page in the same way as for left (original) page.
> And updates the page only in critical section.

Paying the cost of an extra allocation for a temporary page to bypass
the problem entirely surely makes the code safer in the long run, so
as any changes made in this area would never trigger the same problem
ever again.  We do that already for GIN with critical sections, as one
example.

Peter G., as the committer of 9b42e713761a, an opinion?
--
Michael

Вложения

Re: Orphan page in _bt_split

От
Peter Geoghegan
Дата:
On Mon, Sep 1, 2025 at 1:03 AM Konstantin Knizhnik <knizhnik@garret.ru> wrote:
> So we should zero this page in case of reporting error to "not confuse
> vacuum.
> It is done for all places where this function reports error before
> entering critical section and wal-logging this page.

Right.

> _bp_getbuf just reads page in a shared buffer. But ReadBuffer actually
> can invoke half of Postgres code (locating buffer, evicting victim,
> writing to SMGR,..., So there are a lot of places where error can be
> thrown (including even such error as statement timeout).
> And in this case current transaction will be aborted without zeroing the
> right page.
> So vacuum will be confused.

...

> 2025-09-01 07:46:37.459 EEST [89290] ERROR:  @@@ Terminated
> 2025-09-01 07:46:37.459 EEST [89290] STATEMENT:  insert into t3 values
> (generate_series(1,1000000),random()*1000000,0);
> 2025-09-01 07:46:42.391 EEST [89406] LOG:  failed to re-find parent key
> in index "t3_sk_idx" for deletion target page 4
> 2025-09-01 07:46:42.391 EEST [89406] CONTEXT:  while vacuuming index
> "t3_sk_idx" of relation "public.t3"
> TRAP: failed Assert("false"), File: "nbtpage.c", Line: 2849, PID: 89406

> This code is not changed for quite long time so I wonder why nobody
> noticed this error before?

The assertion failure + log output that you've shown is from
_bt_lock_subtree_parent. I'm aware that that error appears from time
to time in the field. When we hit this log message in a non-assert
build, VACUUM should still be able to finish successfully.

My assumption up until now has been that the corruption underlying
these "failed to re-find parent" LOG messages is generally due to
breaking changes to collations (page deletion uses an insertion scan
key to "re-find" a downlink to the page undergoing deletion), and
generic corruption. But it now seems possible that some of them were
due to the problem that you've highlighted.

The problem that you've highlighted happens late within _bt_split,
when the contents of the new right sibling page (everything except its
LSN) has already been written to the page. So I'm fairly confident
that any real world problems would show themselves as "failed to
re-find parent" LOG message from VACUUM (there are no downlinks or
sibling pointers that point to the new right page, so only VACUUM will
ever be able to read the page).

> Proposed patch is attached.
> It creates copy of right page in the same way as for left (original) page.
> And updates the page only in critical section.

I remember that when I worked on what became commit 9b42e71376 back in
2019 (which fixed a similar problem caused by the INCLUDE index
patch), Tom suggested that we do things this way defensively (without
being specifically aware of the _bt_getbuf hazard). That does seem
like the best approach.

I'm a little bit worried about the added overhead of allocating an
extra BLCKSZ buffer for this. I wonder if it would make sense to use
BLCKSZ arrays of char for both of the temp pages.

--
Peter Geoghegan



Re: Orphan page in _bt_split

От
Peter Geoghegan
Дата:
On Mon, Sep 1, 2025 at 1:35 AM Michael Paquier <michael@paquier.xyz> wrote:
> Nice investigation and report, that I assume you have just guessed
> from a read of the code and that there could be plenty of errors that
> could happen in this code path.  It indeed looks like some weak
> coding assumption introduced in this code path by 9b42e713761a from
> 2019, going down to v11.

Commit 9b42e713761a really has nothing to do with this. It fixed a
similar issue that slipped in to Postgres 11. At worst, commit
9b42e713761a neglected to fix this other problem in passing.

This hazard has existing since commit 8fa30f906b, from 2010. That's
the commit that introduced the general idea of making _bt_split zero
its rightpage in order to make it safe to throw an ERROR instead of just
PANICing.

> We could have a SQL regression test for this case, just put a
> INJECTION_POINT(), then force an ERROR callback to force an incorrect
> state.  The test can be made cheap enough.
>
> > This code is not changed for quite long time so I wonder why nobody noticed
> > this error before?
>
> I am ready to believe that errors are just hard to reach in this path.

Why?

There's just no reason to think that we'd ever be able to tie back one
of these LOG messages from VACUUM to the problem within _bt_split.
There's too many other forms of corruption that might result in VACUUM
logging this same error (e.g., breaking changes to a glibc collation).

An important case where this weakness will make life worse for users
is a checksum failure against the existing right sibling page -- since
those are not once off, transient errors (unlike, say, OOMs). Once you
have an index page with a bad checksum, there's a decent chance that
the application will attempt to insert onto the page to the immediate
left of that bad page. That'll trigger a split, sooner or later. Which
in turn triggers the problem that Konstantin reported. It's not going
to make the corruption problem markedly worse, but it's still not
great: there's no telling how many times successive inserters will try
and inevitably fail to split the same page, creating a new junk page
each time.

--
Peter Geoghegan



Re: Orphan page in _bt_split

От
Peter Geoghegan
Дата:
On Mon, Sep 1, 2025 at 3:04 PM Peter Geoghegan <pg@bowt.ie> wrote:
> There's just no reason to think that we'd ever be able to tie back one
> of these LOG messages from VACUUM to the problem within _bt_split.
> There's too many other forms of corruption that might result in VACUUM
> logging this same error (e.g., breaking changes to a glibc collation).

Thinking about this some more, I guess it's generally fairly unlikely
that VACUUM would actually even attempt to delete such a page. The
only reason why it happens with Konstantin's test case is because the
whole inserting transaction aborts, leaving behind many garbage tuples
that VACUUM will remove, leading to an empty page. Without that,
VACUUM won't think to even try to delete a left over junk right
sibling page.

> An important case where this weakness will make life worse for users
> is a checksum failure against the existing right sibling page -- since
> those are not once off, transient errors (unlike, say, OOMs). Once you
> have an index page with a bad checksum, there's a decent chance that
> the application will attempt to insert onto the page to the immediate
> left of that bad page. That'll trigger a split, sooner or later.

Also rethinking this aspect: a checksum failure probably *isn't* going
to make much difference. Since that'll also cause bigger problems for
VACUUM than logging one of these "failed to re-find parent key"
messages.

I still think that we should do something about this. Not sure how I
feel about backpatching just yet.

--
Peter Geoghegan



Re: Orphan page in _bt_split

От
Michael Paquier
Дата:
On Mon, Sep 01, 2025 at 03:04:58PM -0400, Peter Geoghegan wrote:
> This hazard has existing since commit 8fa30f906b, from 2010. That's
> the commit that introduced the general idea of making _bt_split zero
> its rightpage in order to make it safe to throw an ERROR instead of just
> PANICing.

Thanks, you're right, I have fat-fingered this one.  It's much older
than 2019.
--
Michael

Вложения

Re: Orphan page in _bt_split

От
Michael Paquier
Дата:
On Mon, Sep 01, 2025 at 02:35:56PM -0400, Peter Geoghegan wrote:
> I remember that when I worked on what became commit 9b42e71376 back in
> 2019 (which fixed a similar problem caused by the INCLUDE index
> patch), Tom suggested that we do things this way defensively (without
> being specifically aware of the _bt_getbuf hazard). That does seem
> like the best approach.
>
> I'm a little bit worried about the added overhead of allocating an
> extra BLCKSZ buffer for this. I wonder if it would make sense to use
> BLCKSZ arrays of char for both of the temp pages.

Hmm.  Yes, the allocation makes me a bit uneasy.  Or it would be
possible to just use a PGAlignedBlock here?  That's already used in
some code paths for page manipulations, forcing alignment.
--
Michael

Вложения

Re: Orphan page in _bt_split

От
Konstantin Knizhnik
Дата:
On 02/09/2025 7:42 AM, Michael Paquier wrote:
> On Mon, Sep 01, 2025 at 02:35:56PM -0400, Peter Geoghegan wrote:
>> I remember that when I worked on what became commit 9b42e71376 back in
>> 2019 (which fixed a similar problem caused by the INCLUDE index
>> patch), Tom suggested that we do things this way defensively (without
>> being specifically aware of the _bt_getbuf hazard). That does seem
>> like the best approach.
>>
>> I'm a little bit worried about the added overhead of allocating an
>> extra BLCKSZ buffer for this. I wonder if it would make sense to use
>> BLCKSZ arrays of char for both of the temp pages.
> Hmm.  Yes, the allocation makes me a bit uneasy.  Or it would be
> possible to just use a PGAlignedBlock here?  That's already used in
> some code paths for page manipulations, forcing alignment.


Do you suggest to replace `PageGetTempPage` with using local buffers?
Or may be change `PageGetTempPage*` to accept parameter with provided 
local buffer?
But it is used in ~20 places. Change them all?
It seems to be too invasive change for such small fix, but I can do it.

Certainly copying data can be done in more efficient way, but I do not 
thing that palloc here can have any noticeable impact on performance.
Please let me know if I should prepare new page avoiding memory 
allocation or you prefer to do it yourself.





Re: Orphan page in _bt_split

От
Konstantin Knizhnik
Дата:
On 01/09/2025 10:18 PM, Peter Geoghegan wrote:
> On Mon, Sep 1, 2025 at 3:04 PM Peter Geoghegan <pg@bowt.ie> wrote:
>> There's just no reason to think that we'd ever be able to tie back one
>> of these LOG messages from VACUUM to the problem within _bt_split.
>> There's too many other forms of corruption that might result in VACUUM
>> logging this same error (e.g., breaking changes to a glibc collation).
> Thinking about this some more, I guess it's generally fairly unlikely
> that VACUUM would actually even attempt to delete such a page. The
> only reason why it happens with Konstantin's test case is because the
> whole inserting transaction aborts, leaving behind many garbage tuples
> that VACUUM will remove, leading to an empty page. Without that,
> VACUUM won't think to even try to delete a left over junk right
> sibling page.


But sooner or later vacuum will be called for this index and will 
traverse this page, will not it?
There is not other way to reuse this page without deleting it or I am 
missing something?


>
>> An important case where this weakness will make life worse for users
>> is a checksum failure against the existing right sibling page -- since
>> those are not once off, transient errors (unlike, say, OOMs). Once you
>> have an index page with a bad checksum, there's a decent chance that
>> the application will attempt to insert onto the page to the immediate
>> left of that bad page. That'll trigger a split, sooner or later.
> Also rethinking this aspect: a checksum failure probably *isn't* going
> to make much difference. Since that'll also cause bigger problems for
> VACUUM than logging one of these "failed to re-find parent key"
> messages.


But vacuum is not just logging this message. It throws error which means 
that vacuum for this relation will be performed any more.





Re: Orphan page in _bt_split

От
Michael Paquier
Дата:
On Wed, Sep 03, 2025 at 09:32:41AM +0300, Konstantin Knizhnik wrote:
> On 01/09/2025 10:18 PM, Peter Geoghegan wrote:
>> Also rethinking this aspect: a checksum failure probably *isn't* going
>> to make much difference. Since that'll also cause bigger problems for
>> VACUUM than logging one of these "failed to re-find parent key"
>> messages.
>
> But vacuum is not just logging this message. It throws error which means
> that vacuum for this relation will be performed any more.

Yeah.  For a long-running autovacuum job, getting potentially random
in-flight failures is always annoying, more if these jobs deal with
wraparound.
--
Michael

Вложения

Re: Orphan page in _bt_split

От
Peter Geoghegan
Дата:
On Wed, Sep 3, 2025 at 2:32 AM Konstantin Knizhnik <knizhnik@garret.ru> wrote:
> But sooner or later vacuum will be called for this index and will
> traverse this page, will not it?
> There is not other way to reuse this page without deleting it or I am
> missing something?

That's true. But VACUUM won't even attempt to delete it unless it can
also remove all of the index tuples. Which, in general, probably won't
happen (it happened with your test case, but that's probably not
typical).

> But vacuum is not just logging this message. It throws error which means
> that vacuum for this relation will be performed any more.

What error? You showed an assertion failure, but that won't be hit in
release builds.

--
Peter Geoghegan



Re: Orphan page in _bt_split

От
Konstantin Knizhnik
Дата:
On 04/09/2025 3:55 AM, Peter Geoghegan wrote:
> On Wed, Sep 3, 2025 at 2:32 AM Konstantin Knizhnik <knizhnik@garret.ru> wrote:
>> But sooner or later vacuum will be called for this index and will
>> traverse this page, will not it?
>> There is not other way to reuse this page without deleting it or I am
>> missing something?
> That's true. But VACUUM won't even attempt to delete it unless it can
> also remove all of the index tuples. Which, in general, probably won't
> happen (it happened with your test case, but that's probably not
> typical).
>
>> But vacuum is not just logging this message. It throws error which means
>> that vacuum for this relation will be performed any more.
> What error? You showed an assertion failure, but that won't be hit in
> release builds.
>
Sorry, I missed it with another "failed to re-find" error ("... for split").
Yes, this error can happen only for Postgres build with enabled casserts.






Re: Orphan page in _bt_split

От
Peter Geoghegan
Дата:
On Wed, Sep 3, 2025 at 2:25 AM Konstantin Knizhnik <knizhnik@garret.ru> wrote:
> Do you suggest to replace `PageGetTempPage` with using local buffers?
> Or may be change `PageGetTempPage*` to accept parameter with provided
> local buffer?

I think that _bt_split could easily be switched over to using 2 local
PGAlignedBlock variables, removing its use of PageGetTempPage. I don't
think that it is necessary to consider other PageGetTempPage callers.

--
Peter Geoghegan



Re: Orphan page in _bt_split

От
Konstantin Knizhnik
Дата:


On 13/09/2025 10:10 PM, Peter Geoghegan wrote:
On Wed, Sep 3, 2025 at 2:25 AM Konstantin Knizhnik <knizhnik@garret.ru> wrote:
Do you suggest to replace `PageGetTempPage` with using local buffers?
Or may be change `PageGetTempPage*` to accept parameter with provided
local buffer?
I think that _bt_split could easily be switched over to using 2 local
PGAlignedBlock variables, removing its use of PageGetTempPage. I don't
think that it is necessary to consider other PageGetTempPage callers.


Attached please find updated version of the patch which uses PGAlignedBlock instead of PageGetTempPage.  

Вложения

Re: Orphan page in _bt_split

От
Michael Paquier
Дата:
On Sun, Sep 14, 2025 at 04:40:54PM +0300, Konstantin Knizhnik wrote:
> On 13/09/2025 10:10 PM, Peter Geoghegan wrote:
>> I think that _bt_split could easily be switched over to using 2 local
>> PGAlignedBlock variables, removing its use of PageGetTempPage. I don't
>> think that it is necessary to consider other PageGetTempPage callers.

Hmm.  I didn't really agree with this last statement, especially if
these code paths don't need to manipulate Page pointers across the
stack.  So I have taken this opportunity to double-check all the
existing calls of PageGetTempPage() which is an API old enough to vote
(105409746499?), while 44cac9346479 has introduced PGAlignedBlock much
later, seeing if there are some opportunities here:
- dataSplitPageInternal() and entrySplitPage() expect the contents of
the pages to be returned to the caller.  We could use PGAlignedBlock
with pre-prepared pages that don't get allocated, but this requires
the callers of the internal routines to take responsibility for that,
like dataBeginPlaceToPage().  The complexity is not appealing in these
case.
- The same argument applies to the call in ginPlaceToPage(), passing
down the page to fillRoot().
- gistplacetopage(), with a copy of a page filled its special area,
manipulated across other calls.
- An optimization exists for btree_xlog_split() where a temporary page
is not manipulated, but the fact that we are in redo does not really
matter much for the extra allocation, I guess.

At the end, I don't feel excited about switching these cases, where
the gains are not obvious.

> Attached please find updated version of the patch which uses PGAlignedBlock
> instead of PageGetTempPage.

+     * high key. To prevent confision of VACUUM with orphan page, we also
+     * first fill in-mempory copy oif this page.

Three typos in two lines here (not sure about the best wording that
fits with VACUUM, but I like the suggested simplification):
s/in-mempory/in-memory/
s/confision/confusion/
s/oif/if/

Saying that, this patch sounds like a good idea, making these code
paths a bit more reliable for VACUUM, without paying the cost of an
allocation.  At least for HEAD, that's nice to have.  Peter, what do
you think?
--
Michael

Вложения

Re: Orphan page in _bt_split

От
Michael Paquier
Дата:
On Tue, Sep 16, 2025 at 10:26:12AM +0900, Michael Paquier wrote:
> Saying that, this patch sounds like a good idea, making these code
> paths a bit more reliable for VACUUM, without paying the cost of an
> allocation.  At least for HEAD, that's nice to have.  Peter, what do
> you think?

Hearing nothing, I'm planning to have a second look at it and do
something on HEAD.  Feel free if you have any comments.
--
Michael

Вложения

Re: Orphan page in _bt_split

От
Константин Книжник
Дата:

> On 18 Sep 2025, at 7:28 AM, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Tue, Sep 16, 2025 at 10:26:12AM +0900, Michael Paquier wrote:
>> Saying that, this patch sounds like a good idea, making these code
>> paths a bit more reliable for VACUUM, without paying the cost of an
>> allocation.  At least for HEAD, that's nice to have.  Peter, what do
>> you think?
>
> Hearing nothing, I'm planning to have a second look at it and do
> something on HEAD.  Feel free if you have any comments.
> —
> Michael



Attached please find rebased version of the patch with fixed mistypings.



Вложения

Re: Orphan page in _bt_split

От
Michael Paquier
Дата:
On Mon, Sep 22, 2025 at 07:44:18PM +0300, Константин Книжник wrote:
> Attached please find rebased version of the patch with fixed mistypings.

I have looked at v3.

-   leftpage = PageGetTempPage(origpage);
+   leftpage = leftpage_buf.data;
+   memcpy(leftpage, origpage, BLCKSZ);
    _bt_pageinit(leftpage, BufferGetPageSize(buf));

What's the point of copying the contents of origpage to leftpage?
_bt_pageinit() is going to initialize leftpage (plus a few more things
set like the  page LSN, etc.), so the memcpy should not be necessary,
no?

+   rightpage = BufferGetPage(rbuf);
+   memcpy(rightpage, rightpage_buf.data, BLCKSZ);
+   ropaque = BTPageGetOpaque(rightpage);

When we reach this state of the logic, aka at the beginning of the
critical section, the right and left pages are in a correct state,
and your code is OK because we copy the contents of the right page
back into its legitimate place.  What is not OK to me is that the
copy of the "temporary" right page back to "rbuf" is not explained.  I
think that this deserves a comment, especially the part about ropaque
which is set once by this proposal, then manipulated while in the
critical section.
--
Michael

Вложения

Re: Orphan page in _bt_split

От
Konstantin Knizhnik
Дата:
On 25/09/2025 7:35 AM, Michael Paquier wrote:
> On Mon, Sep 22, 2025 at 07:44:18PM +0300, Константин Книжник wrote:
>> Attached please find rebased version of the patch with fixed mistypings.
> I have looked at v3.
>
> -   leftpage = PageGetTempPage(origpage);
> +   leftpage = leftpage_buf.data;
> +   memcpy(leftpage, origpage, BLCKSZ);
>      _bt_pageinit(leftpage, BufferGetPageSize(buf));
>
> What's the point of copying the contents of origpage to leftpage?
> _bt_pageinit() is going to initialize leftpage (plus a few more things
> set like the  page LSN, etc.), so the memcpy should not be necessary,
> no?

Sorry, memcpy is really not needed here.

>
> +   rightpage = BufferGetPage(rbuf);
> +   memcpy(rightpage, rightpage_buf.data, BLCKSZ);
> +   ropaque = BTPageGetOpaque(rightpage);
>
> When we reach this state of the logic, aka at the beginning of the
> critical section, the right and left pages are in a correct state,
> and your code is OK because we copy the contents of the right page
> back into its legitimate place.  What is not OK to me is that the
> copy of the "temporary" right page back to "rbuf" is not explained.  I
> think that this deserves a comment, especially the part about ropaque
> which is set once by this proposal, then manipulated while in the
> critical section.


I added the comment, please check that it is clear and complete.
Updated version of the patch is attached.


Вложения

Re: Orphan page in _bt_split

От
Konstantin Knizhnik
Дата:
On 25/09/2025 7:35 AM, Michael Paquier wrote:
> On Mon, Sep 22, 2025 at 07:44:18PM +0300, Константин Книжник wrote:
>> Attached please find rebased version of the patch with fixed mistypings.
> I have looked at v3.
>
> -   leftpage = PageGetTempPage(origpage);
> +   leftpage = leftpage_buf.data;
> +   memcpy(leftpage, origpage, BLCKSZ);
>      _bt_pageinit(leftpage, BufferGetPageSize(buf));
>
> What's the point of copying the contents of origpage to leftpage?
> _bt_pageinit() is going to initialize leftpage (plus a few more things
> set like the  page LSN, etc.), so the memcpy should not be necessary,
> no?
>
> +   rightpage = BufferGetPage(rbuf);
> +   memcpy(rightpage, rightpage_buf.data, BLCKSZ);
> +   ropaque = BTPageGetOpaque(rightpage);
>
> When we reach this state of the logic, aka at the beginning of the
> critical section, the right and left pages are in a correct state,
> and your code is OK because we copy the contents of the right page
> back into its legitimate place.  What is not OK to me is that the
> copy of the "temporary" right page back to "rbuf" is not explained.  I
> think that this deserves a comment, especially the part about ropaque
> which is set once by this proposal, then manipulated while in the
> critical section.
> --
> Michael


Sorry, I have attached wrong patch.



Вложения

Re: Orphan page in _bt_split

От
Michael Paquier
Дата:
On Thu, Sep 25, 2025 at 09:41:00AM +0300, Konstantin Knizhnik wrote:
> Sorry, I have attached wrong patch.

Thanks, I was confused for a couple of minutes.

+    /*
+     * Now we are in critical section and it is not needed to maintain temporary
+     * copy of right page in the local memory. We can copy it to the destination buffer.
+     * Unlike leftpage, rightpage and ropaque are still needed to complete update
+     * of the page, so we need to correctly reinitialize them.
+     */
+    rightpage = BufferGetPage(rbuf);
+    memcpy(rightpage, rightpage_buf.data, BLCKSZ);
+    ropaque = BTPageGetOpaque(rightpage);

Hmm.  This looks kind of explicit enough to document the purpose.
The wording could be simplified a bit more.  I'll take it from there.
--
Michael

Вложения

Re: Orphan page in _bt_split

От
Michael Paquier
Дата:
On Thu, Sep 25, 2025 at 03:45:21PM +0900, Michael Paquier wrote:
> Hmm.  This looks kind of explicit enough to document the purpose.
> The wording could be simplified a bit more.  I'll take it from there.

Reworded a bit more the comments, and applied on HEAD.  There could be
an argument for back-patching, I guess, but the lack of complaints is
also an indicator to not do so, and VACUUM is able to handle the
post-failure cleanup should an ERROR happen in flight when splitting
a page.
--
Michael

Вложения