Thread: ReadRecentBuffer() doesn't scale well


ReadRecentBuffer() doesn't scale well

From: Andres Freund
Hi,

As mentioned nearby [1], Thomas brought up [2] the idea of using
ReadRecentBuffer() in _bt_getroot().  I couldn't resist and prototyped it.

Unfortunately it scaled way worse at first. This is not an inherent issue, but
due to an implementation choice in ReadRecentBuffer().  Whereas the normal
BufferAlloc() path uses PinBuffer(), ReadRecentBuffer() first does
LockBufHdr(), checks if the buffer ID is the same and then uses
PinBuffer_Locked().

The problem with that is that PinBuffer() takes care not to hold the buffer
header spinlock: it uses compare_exchange to atomically acquire the pin while
guaranteeing that nobody holds the lock.  When holding the buffer header
spinlock, there obviously is the risk of being scheduled out (or even just
not having exclusive access to the cache line).

ReadRecentBuffer() scales worse even if LockBufHdr() is immediately followed
by PinBuffer_Locked(), so it's really just holding the lock that is the issue.
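
To make the difference concrete, here is a self-contained sketch of the
CAS-style pin using C11 atomics. This is not PostgreSQL's actual bufmgr code;
BM_LOCKED and REFCOUNT_ONE are stand-ins for the real state-word layout:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define BM_LOCKED     (1u << 31)   /* stand-in for the header lock bit */
#define REFCOUNT_ONE  1u           /* stand-in for one pin's worth */

/* CAS-style pin, in the spirit of PinBuffer(): loop until the lock bit
 * is clear and our compare-exchange succeeds.  We never *hold* the lock
 * ourselves, so being scheduled out here cannot block other backends. */
static bool
pin_with_cas(_Atomic uint32_t *state)
{
    uint32_t old = atomic_load(state);

    for (;;)
    {
        if (old & BM_LOCKED)
        {
            old = atomic_load(state);   /* someone holds the lock; retry */
            continue;
        }
        if (atomic_compare_exchange_weak(state, &old, old + REFCOUNT_ONE))
            return true;                /* pinned, lock never held by us */
    }
}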


The fairly obvious solution to this is to just use PinBuffer() and unpin
the buffer if its identity changed concurrently. There could be an
unlocked pre-check as well.  However, there's the following comment in
ReadRecentBuffer():
             * It's now safe to pin the buffer.  We can't pin first and ask
             * questions later, because it might confuse code paths like
             * InvalidateBuffer() if we pinned a random non-matching buffer.
             */

But I'm not sure I buy that - there are plenty of other things that can
briefly acquire a buffer pin (e.g. the checkpointer, reclaiming the buffer
for other contents, etc.).
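
For illustration, a minimal sketch of that pin-first approach in terms of the
existing bufmgr.c helpers (using PinBuffer()'s current two-argument signature;
the setup of tag and recent_buffer is elided):

bufHdr = GetBufferDescriptor(recent_buffer - 1);

/* Pin optimistically, without taking the buffer header spinlock. */
if (PinBuffer(bufHdr, NULL))
{
    /* A pinned buffer cannot change identity, so recheck the tag now. */
    if (BufferTagsEqual(&tag, &bufHdr->tag))
        return true;            /* still the buffer we hoped for */
    UnpinBuffer(bufHdr);        /* identity changed; fall through to the
                                 * regular ReadBuffer() path */
}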



Another difference between using PinBuffer() and PinBuffer_Locked() is that
the latter does not adjust a buffer's usagecount.

Leaving the scalability issue aside, isn't it somewhat odd that optimizing a
codepath to use ReadRecentBuffer() instead of ReadBuffer() leads to not
increasing usagecount anymore?


FWIW, once that's fixed, using ReadRecentBuffer() for _bt_getroot(), caching
the root page's buffer id in RelationData, seems a noticeable win. About 7% in
a concurrent, read-only pgbench that utilizes batches of 10. And it should be
easy to get much bigger wins, e.g. with an index nested loop with a relatively
small index on the inner side.


Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/20230627013458.axge7iylw7llyvww%40awork3.anarazel.de
[2] https://twitter.com/MengTangmu/status/1673439083518115840



Re: ReadRecentBuffer() doesn't scale well

From: Thomas Munro
On Tue, Jun 27, 2023 at 2:05 PM Andres Freund <andres@anarazel.de> wrote:
> As mentioned nearby [1], Thomas brought up [2] the idea of using
> ReadRecentBuffer() in _bt_getroot().  I couldn't resist and prototyped it.

Thanks!

> Unfortunately it scaled way worse at first. This is not an inherent issue, but
> due to an implementation choice in ReadRecentBuffer().  Whereas the normal
> BufferAlloc() path uses PinBuffer(), ReadRecentBuffer() first does
> LockBufHdr(), checks if the buffer ID is the same and then uses
> PinBuffer_Locked().
>
> The problem with that is that PinBuffer() takes care not to hold the buffer
> header spinlock: it uses compare_exchange to atomically acquire the pin while
> guaranteeing that nobody holds the lock.  When holding the buffer header
> spinlock, there obviously is the risk of being scheduled out (or even just
> not having exclusive access to the cache line).

Yeah.  Aside from inherent nastiness of user-space spinlocks, this new
use case is also enormously more likely than the one I was focusing on at
the time to contend, and then get into trouble by being preempted, because
btree root pages are about the hottest pages in the universe.

> The fairly obvious solution to this is to just use PinBuffer() and unpin
> the buffer if its identity changed concurrently. There could be an
> unlocked pre-check as well.  However, there's the following comment in
> ReadRecentBuffer():
>                          * It's now safe to pin the buffer.  We can't pin first and ask
>                          * questions later, because it might confuse code paths like
>                          * InvalidateBuffer() if we pinned a random non-matching buffer.
>                          */
>
> But I'm not sure I buy that - there are plenty of other things that can
> briefly acquire a buffer pin (e.g. the checkpointer, reclaiming the buffer
> for other contents, etc.).

I may well have been too cautious with that.  The worst thing I can
think of right now is that InvalidateBuffer() would busy loop (as it
already does in other rare cases) when it sees a pin.

> Another difference between using PinBuffer() and PinBuffer_Locked() is that
> the latter does not adjust a buffer's usagecount.
>
> Leaving the scalability issue aside, isn't it somewhat odd that optimizing a
> codepath to use ReadRecentBuffer() instead of ReadBuffer() leads to not
> increasing usagecount anymore?

Yeah, that is not great.  The simplification you suggest would fix
that too, though I guess it would also bump the usage count of buffers
that don't have the tag we expected; that's obviously rare, though, and
errs on the better side.

> FWIW, once that's fixed, using ReadRecentBuffer() for _bt_getroot(), caching
> the root page's buffer id in RelationData, seems a noticeable win. About 7% in
> a concurrent, read-only pgbench that utilizes batches of 10. And it should be
> easy to get much bigger wins, e.g. with an index nested loop with a relatively
> small index on the inner side.

Wooo, that's better than I was hoping.  Thanks for trying it out!  I
think, for the complexity involved (ie very little), it's a nice
result, and worth considering even though it's also a solid clue that
we could do much better than this with a (yet to be designed)
longer-lived pin scheme.  smgr_targblock could be another
easy-to-cache candidate, ie a place where there is a single
interesting hot page that we're already keeping track of with no
requirement for new backend-local mapping machinery.



Re: ReadRecentBuffer() doesn't scale well

From: Peter Geoghegan
On Mon, Jun 26, 2023 at 8:34 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Yeah.  Aside from inherent nastiness of user-space spinlocks, this new
> use case is also enormously more likely than the one I was focusing on at
> the time to contend, and then get into trouble by being preempted, because
> btree root pages are about the hottest pages in the universe.

They're not just the hottest. They're also among the least likely to
change from one moment to the next. (If that ever failed to hold then
it wouldn't take long for the index to become grotesquely tall.)

--
Peter Geoghegan



Re: ReadRecentBuffer() doesn't scale well

From: Andres Freund
Hi,

On 2023-06-27 15:33:57 +1200, Thomas Munro wrote:
> On Tue, Jun 27, 2023 at 2:05 PM Andres Freund <andres@anarazel.de> wrote:
> > Unfortunately it scaled way worse at first. This is not an inherent issue, but
> > due to an implementation choice in ReadRecentBuffer().  Whereas the normal
> > BufferAlloc() path uses PinBuffer(), ReadRecentBuffer() first does
> > LockBufHdr(), checks if the buffer ID is the same and then uses
> > PinBuffer_Locked().
> >
> > The problem with that is that PinBuffer() takes care not to hold the buffer
> > header spinlock: it uses compare_exchange to atomically acquire the pin
> > while guaranteeing that nobody holds the lock.  When holding the buffer
> > header spinlock, there obviously is the risk of being scheduled out (or
> > even just not having exclusive access to the cache line).
> 
> Yeah.  Aside from inherent nastiness of user-space spinlocks

I've been wondering about making our backoff path use futexes, after some
adaptive spinning.
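
For the sake of illustration, here's a standalone, Linux-only sketch of the
spin-then-futex idea (not a patch; SPINS_BEFORE_SLEEP and the 0/1/2
free/locked/contended encoding are assumptions following the classic futex
mutex recipe, not anything in PostgreSQL):

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdatomic.h>
#include <stdint.h>

#define SPINS_BEFORE_SLEEP 100    /* adaptive-spin budget, made up */

/* 0 = free, 1 = locked, 2 = locked with (possible) waiters */
static void
lock_with_futex(_Atomic uint32_t *f)
{
    /* Spin phase: try to grab the lock without entering the kernel. */
    for (int i = 0; i < SPINS_BEFORE_SLEEP; i++)
    {
        uint32_t expected = 0;
        if (atomic_compare_exchange_strong(f, &expected, 1))
            return;
    }
    /* Sleep phase: mark contended and wait until the word changes. */
    while (atomic_exchange(f, 2) != 0)
        syscall(SYS_futex, f, FUTEX_WAIT, 2, NULL, NULL, 0);
}

static void
unlock_with_futex(_Atomic uint32_t *f)
{
    /* Release must be atomic (xchg) so we can see whether anyone was
     * waiting; wake exactly one sleeper if so. */
    if (atomic_exchange(f, 0) == 2)
        syscall(SYS_futex, f, FUTEX_WAKE, 1, NULL, NULL, 0);
}

Note that the release side needs an atomic exchange to detect waiters, which
is exactly the release-path cost discussed further down-thread.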


> > The fairly obvious solution to this is to just use PinBuffer() and unpin
> > the buffer if its identity changed concurrently. There could be an
> > unlocked pre-check as well.  However, there's the following comment in
> > ReadRecentBuffer():
> >                          * It's now safe to pin the buffer.  We can't pin first and ask
> >                          * questions later, because it might confuse code paths like
> >                          * InvalidateBuffer() if we pinned a random non-matching buffer.
> >                          */
> >
> > But I'm not sure I buy that - there are plenty of other things that can
> > briefly acquire a buffer pin (e.g. the checkpointer, reclaiming the buffer
> > for other contents, etc.).
> 
> I may well have been too cautious with that.  The worst thing I can
> think of right now is that InvalidateBuffer() would busy loop (as it
> already does in other rare cases) when it sees a pin.

Right. Particularly if we were to add a pre-check for the tag to match, that
should be extremely rare.


> > Another difference between using PinBuffer() and PinBuffer_Locked() is that
> > the latter does not adjust a buffer's usagecount.
> >
> > Leaving the scalability issue aside, isn't it somewhat odd that optimizing a
> > codepath to use ReadRecentBuffer() instead of ReadBuffer() leads to not
> > increasing usagecount anymore?
> 
> Yeah, that is not great.  The simplification you suggest would fix
> that too, though I guess it would also bump the usage count of buffers
> that don't have the tag we expected; that's obviously rare, though, and
> errs on the better side.

Yea, I'm not worried about that. If somebody is, we could just add code to
decrement the usagecount again.


> > FWIW, once that's fixed, using ReadRecentBuffer() for _bt_getroot(), caching
> > the root page's buffer id in RelationData, seems a noticeable win. About 7% in
> > a concurrent, read-only pgbench that utilizes batches of 10. And it should be
> > easy to get much bigger wins, e.g. with an index nested loop with a relatively
> > small index on the inner side.
> 
> Wooo, that's better than I was hoping.  Thanks for trying it out!  I
> think, for the complexity involved (ie very little)

I don't really have a concrete thought for where to store the id of the recent
buffer. I just added a new field into some padding in RelationData, but we
might go for something fancier.


> smgr_targblock could be another easy-to-cache candidate, ie a place where
> there is a single interesting hot page that we're already keeping track of
> with no requirement for new backend-local mapping machinery.

I wonder if we should simply add a generic field for such a Buffer to
RelationData, that the AM can use as it desires. For btree that would be the
root page, for heap the target block ...
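
Something along these lines, say; the field name rd_amrecentbuf is invented
here purely to illustrate the shape:

typedef struct RelationData
{
    /* ... all existing fields unchanged ... */

    Buffer      rd_amrecentbuf; /* AM-defined hot-page hint, or
                                 * InvalidBuffer: the root page's buffer
                                 * for btree, the target block's buffer
                                 * for heap, ... */
} RelationData;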


> it's a nice result, and worth considering even though it's also a solid clue
> that we could do much better than this with a (yet to be designed)
> longer-lived pin scheme.

Indeed. PinBuffer() is pretty hot after the change. As is the buffer content
lock.

Particularly for the root page, it'd be really interesting to come up with a
scheme that keeps an offline copy of the root page while also pinning the real
root page. I think we should be able to have a post-check that can figure out
if the copied root page is out of date after searching it, without needing the
content lock.


Greetings,

Andres Freund



Re: ReadRecentBuffer() doesn't scale well

From: Peter Geoghegan
On Mon, Jun 26, 2023 at 9:09 PM Andres Freund <andres@anarazel.de> wrote:
> I think we should be able to have a post-check that can figure out
> if the copied root page is out of date after searching it, without needing the
> content lock.

I'm guessing that you're thinking of doing something with the page LSN?

--
Peter Geoghegan



Re: ReadRecentBuffer() doesn't scale well

From: Thomas Munro
On Tue, Jun 27, 2023 at 4:32 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Jun 26, 2023 at 9:09 PM Andres Freund <andres@anarazel.de> wrote:
> > I think we should be able to have a post-check that can figure out
> > if the copied root page is out of date after searching it, without needing the
> > content lock.
>
> I'm guessing that you're thinking of doing something with the page LSN?

If the goal is to get rid of both pins and content locks, LSN isn't
enough.  A page might be evicted and replaced by another page that has
the same LSN because they were modified by the same record.  Maybe
that's vanishingly rare, but the correct thing would be a counter that
goes up on modification AND eviction.  (FWIW I toyed with variants of
this concept in the context of SLRU -> buffer pool migration, where I
was trying to do zero-lock CLOG lookups; in that case I didn't need
the copy of the page being discussed here due to the data being
atomically readable, but I had the same requirement for a
"did-it-change-under-my-feet?" check).



Re: ReadRecentBuffer() doesn't scale well

From: Peter Geoghegan
On Mon, Jun 26, 2023 at 9:40 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> If the goal is to get rid of both pins and content locks, LSN isn't
> enough.  A page might be evicted and replaced by another page that has
> the same LSN because they were modified by the same record.  Maybe
> that's vanishingly rare, but the correct thing would be a counter that
> goes up on modification AND eviction.

It should be safe to allow searchers to see a version of the root page
that is out of date. The Lehman & Yao design is very permissive about
these things. There aren't any special cases where the general rules
are weakened in some way that might complicate this approach.
Searchers need to check the high key to determine if they need to move
right -- same as always.

More concretely: A root page can be concurrently split when there is
an in-flight index scan that is about to land on it (which becomes the
left half of the split). It doesn't matter if it's a searcher that is
"between" the meta page and the root page. It doesn't matter if a
level was added. This is true even though nothing that you'd usually
think of as an interlock is held "between levels". The root page isn't
really special, except in the obvious way. We can even have two roots
at the same time (the true root, and the fast root).

--
Peter Geoghegan



Re: ReadRecentBuffer() doesn't scale well

From: Thomas Munro
On Tue, Jun 27, 2023 at 4:53 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Jun 26, 2023 at 9:40 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > If the goal is to get rid of both pins and content locks, LSN isn't
> > enough.  A page might be evicted and replaced by another page that has
> > the same LSN because they were modified by the same record.  Maybe
> > that's vanishingly rare, but the correct thing would be counter that
> > goes up on modification AND eviction.
>
> It should be safe to allow searchers to see a version of the root page
> that is out of date. The Lehman & Yao design is very permissive about
> these things. There aren't any special cases where the general rules
> are weakened in some way that might complicate this approach.
> Searchers need to check the high key to determine if they need to move
> right -- same as always.

OK.  I guess I'm talking about a slightly more general version of the
problem, inspired by the stuff I mentioned in parentheses, which would
simply get the wrong answer if the mapping changed, whereas here you'd
use the cached copy in a race case, which should still work for
searches.

So I guess the question for this thread is: do we want to work on
ReadRecentBuffer(), or just take this experiment as evidence of even
more speed-up available and aim for that directly?



Re: ReadRecentBuffer() doesn't scale well

From: Andres Freund
Hi,

On 2023-06-27 16:40:08 +1200, Thomas Munro wrote:
> On Tue, Jun 27, 2023 at 4:32 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > On Mon, Jun 26, 2023 at 9:09 PM Andres Freund <andres@anarazel.de> wrote:
> > > I think we should be able to have a post-check that can figure out
> > > if the copied root page is out of date after searching it, without needing the
> > > content lock.
> >
> > I'm guessing that you're thinking of doing something with the page LSN?

Yes, that seems to be the most obvious approach.


> If the goal is to get rid of both pins and content locks, LSN isn't
> enough.

I was imagining you'd have a long-lived pin. I don't think trying to make it
work without that is particularly promising in this context, where it seems
quite feasible to keep pins around for a while.


> A page might be evicted and replaced by another page that has the same LSN
> because they were modified by the same record.  Maybe that's vanishingly
> rare, but the correct thing would be a counter that goes up on modification
> AND eviction.

I don't think it would need to be a single counter. If we wanted to do
something like this, I think you'd have to have a counter in the buffer desc
that's incremented whenever the page is replaced. Plus the LSN for the page
content change "counter".

Greetings,

Andres Freund



Re: ReadRecentBuffer() doesn't scale well

From: Andres Freund
Hi,

On 2023-06-26 21:53:12 -0700, Peter Geoghegan wrote:
> It should be safe to allow searchers to see a version of the root page
> that is out of date. The Lehman & Yao design is very permissive about
> these things. There aren't any special cases where the general rules
> are weakened in some way that might complicate this approach.
> Searchers need to check the high key to determine if they need to move
> right -- same as always.

Wouldn't we at least need a pin on the root page, or hold a snapshot, to
defend against page deletions?

Greetings,

Andres Freund



Re: ReadRecentBuffer() doesn't scale well

From: Peter Geoghegan
On Mon, Jun 26, 2023 at 11:27 PM Andres Freund <andres@anarazel.de> wrote:
> On 2023-06-26 21:53:12 -0700, Peter Geoghegan wrote:
> > It should be safe to allow searchers to see a version of the root page
> > that is out of date. The Lehman & Yao design is very permissive about
> > these things. There aren't any special cases where the general rules
> > are weakened in some way that might complicate this approach.
> > Searchers need to check the high key to determine if they need to move
> > right -- same as always.
>
> Wouldn't we at least need a pin on the root page, or hold a snapshot, to
> defend against page deletions?

You need to hold a snapshot to prevent concurrent page recycling --
though not page deletion itself (I did say "anything that you'd
usually think of as an interlock"). I'm pretty sure that a concurrent
page deletion is possible, even when you hold a pin on the page.
(Perhaps not, but if not then it's just an accident -- a side-effect
of the interlock that protects against concurrent heap TID recycling.)

You can't delete a rightmost page (on any level). Every root page is a
rightmost page. So the root would have to be split, and then once
again emptied before it could be deleted -- only then would there be a
danger of some backend with a locally cached root page having an
irredeemably bad picture of what's going on with the index. That's
another angle that you could approach the problem from, I suppose.

--
Peter Geoghegan



Re: ReadRecentBuffer() doesn't scale well

From: Ants Aasma
On Tue, 27 Jun 2023 at 07:09, Andres Freund <andres@anarazel.de> wrote:
> On 2023-06-27 15:33:57 +1200, Thomas Munro wrote:
> > On Tue, Jun 27, 2023 at 2:05 PM Andres Freund <andres@anarazel.de> wrote:
> > > Unfortunately it scaled way worse at first. This is not an inherent issue, but
> > > due to an implementation choice in ReadRecentBuffer().  Whereas the normal
> > > BufferAlloc() path uses PinBuffer(), ReadRecentBuffer() first does
> > > LockBufHdr(), checks if the buffer ID is the same and then uses
> > > PinBuffer_Locked().
> > >
> > > The problem with that is that PinBuffer() takes care not to hold the buffer
> > > header spinlock: it uses compare_exchange to atomically acquire the pin
> > > while guaranteeing that nobody holds the lock.  When holding the buffer
> > > header spinlock, there obviously is the risk of being scheduled out (or
> > > even just not having exclusive access to the cache line).
> >
> > Yeah.  Aside from inherent nastiness of user-space spinlocks
>
> I've been wondering about making our backoff path use futexes, after some
> adaptive spinning.

If you want to experiment, here is a rebased version of something I
hacked up a couple of years back on the way to Fosdem Pgday. I didn't
pursue it further because I didn't have a use case where it showed a
significant difference.

--
Ants

Attachments

Re: ReadRecentBuffer() doesn't scale well

From: Andres Freund
Hi,

On 2023-06-27 14:49:48 +0300, Ants Aasma wrote:
> If you want to experiment, here is a rebased version of something I
> hacked up a couple of years back on the way to Fosdem Pgday. I didn't
> pursue it further because I didn't have a use case where it showed a
> significant difference.

Thanks for posting!

Based on past experiments, anything that requires an atomic op during spinlock
release on x86 will be painful :/. I'm not sure there's a realistic way to
avoid that with futexes though :(.

Greetings,

Andres Freund



Re: ReadRecentBuffer() doesn't scale well

From: Ants Aasma
On Tue, 27 Jun 2023 at 18:40, Andres Freund <andres@anarazel.de> wrote:
> On 2023-06-27 14:49:48 +0300, Ants Aasma wrote:
> > If you want to experiment, here is a rebased version of something I
> > hacked up a couple of years back on the way to Fosdem Pgday. I didn't
> > pursue it further because I didn't have a use case where it showed a
> > significant difference.
>
> Thanks for posting!
>
> Based on past experiments, anything that requires an atomic op during spinlock
> release on x86 will be painful :/. I'm not sure there's a realistic way to
> avoid that with futexes though :(.

Do you happen to know if a plain xchg instruction counts as an atomic
for this? I haven't done atomics stuff in a while, so I might be
missing something, but at first glance I think using a plain xchg
would be enough for the releasing side.

-- 
Ants



Re: ReadRecentBuffer() doesn't scale well

From: Andres Freund
Hi,

On 2023-06-27 19:04:31 +0300, Ants Aasma wrote:
> On Tue, 27 Jun 2023 at 18:40, Andres Freund <andres@anarazel.de> wrote:
> > On 2023-06-27 14:49:48 +0300, Ants Aasma wrote:
> > > If you want to experiment, here is a rebased version of something I
> > > hacked up a couple of years back on the way to Fosdem Pgday. I didn't
> > > pursue it further because I didn't have a use case where it showed a
> > > significant difference.
> >
> > Thanks for posting!
> >
> > Based on past experiments, anything that requires an atomic op during spinlock
> > release on x86 will be painful :/. I'm not sure there's a realistic way to
> > avoid that with futexes though :(.
> 
> Do you happen to know if a plain xchg instruction counts as an atomic
> for this? I haven't done atomics stuff in a while, so I might be
> missing something, but at first glance I think using a plain xchg
> would be enough for the releasing side.

It is automatically an atomic op when referencing memory:

Intel SDM 9.1.2.1 Automatic Locking:
"The operations on which the processor automatically follows the LOCK semantics are as follows:
• When executing an XCHG instruction that references memory.
...
"

Theoretically cmpxchg can be used in a non-atomic fashion. I'm not sure that
can be done correctly though, if you want to also store a separate value for
the futex. This stuff is hard to think through :)
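
To make that concrete, a standalone illustration (not PostgreSQL code) of the
two release styles: a plain-store release compiles to an ordinary MOV on x86,
while a futex-style release must atomically learn the old value to detect
waiters, which means XCHG and the implicit LOCK semantics quoted above:

#include <stdatomic.h>
#include <stdint.h>

/* Plain-store release, as in a simple spinlock: compiles to MOV on x86,
 * no LOCK-prefixed instruction on the release path. */
static inline void
release_plain(_Atomic uint32_t *lock)
{
    atomic_store_explicit(lock, 0, memory_order_release);
}

/* Futex-style release: must atomically fetch the old value to learn
 * whether a waiter needs waking, hence XCHG (implicitly locked). */
static inline uint32_t
release_xchg(_Atomic uint32_t *lock)
{
    return atomic_exchange(lock, 0);
}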

Greetings,

Andres Freund



Re: ReadRecentBuffer() doesn't scale well

From: Andres Freund
Hi,

On 2023-06-27 01:10:25 -0700, Peter Geoghegan wrote:
> On Mon, Jun 26, 2023 at 11:27 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2023-06-26 21:53:12 -0700, Peter Geoghegan wrote:
> > > It should be safe to allow searchers to see a version of the root page
> > > that is out of date. The Lehman & Yao design is very permissive about
> > > these things. There aren't any special cases where the general rules
> > > are weakened in some way that might complicate this approach.
> > > Searchers need to check the high key to determine if they need to move
> > > right -- same as always.
> >
> > Wouldn't we at least need a pin on the root page, or hold a snapshot, to
> > defend against page deletions?
>
> You need to hold a snapshot to prevent concurrent page recycling --
> though not page deletion itself (I did say "anything that you'd
> usually think of as an interlock").

I don't think we'd want to have a snapshot for this; that would make it much
less beneficial.


> I'm pretty sure that a concurrent page deletion is possible, even when you
> hold a pin on the page.  (Perhaps not, but if not then it's just an accident
> -- a side-effect of the interlock that protects against concurrent heap TID
> recycling.)

Looks like the pin should prevent the danger, but it wouldn't be very
attractive, due to blocking vacuum...


I've wondered before about a type of pin that just prevents buffer
replacement, but not cleaning up page contents. I think that'd be beneficial
in quite a few places.


> You can't delete a rightmost page (on any level). Every root page is a
> rightmost page. So the root would have to be split, and then once
> again emptied before it could be deleted -- only then would there be a
> danger of some backend with a locally cached root page having an
> irredeemably bad picture of what's going on with the index. That's
> another angle that you could approach the problem from, I suppose.

If we had a way of just preventing the page from being replaced, or reliably
detecting that happening, without blocking btree vacuum, the easiest path
seems to be to use the cached version of the root page, and re-do the work
whenever a) the LSN of the page has changed or b) the buffer has been
replaced. To me that seems like it'd likely be simpler and more general than
relying on being able to step right from any outdated, but not deleted,
version of the page (due to the page deletion issues).

Obviously that'd lead to retries more often - but realistically it's still
going to be vanishingly rare, root pages don't get modified that much once the
index is beyond toy size.


I think a replacement counter in the buffer desc is the easiest way to achieve
that?  We'd have to store the buffer ID, buffer replacement counter and page
LSN in RelationData.  I think the protocol would have to be something like:

1) do search on the copy of the root page

2) get page LSN from the relevant buffer contents - this could be from a
   different relation / block or even an empty page, but will never be an
   invalid memory access, as we don't free shared buffers before shutdown.  If
   the LSN changed since taking the copy, take a new copy of the root page and
   start at 1)

3) check if buffer replacement counter is the same as at the time of the copy,
   if not take a new copy of the root page and start at 1)

4) happiness
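
A code sketch of that protocol follows; CachedRootPage, search_root_copy(),
refresh_root_copy() and read_replacement_counter() are all invented names for
the pieces described above:

typedef struct CachedRootPage
{
    Buffer      buffer;         /* buffer the copy was taken from */
    uint64      replace_count;  /* buffer replacement counter at copy time */
    XLogRecPtr  lsn;            /* page LSN at copy time */
    char        page[BLCKSZ];   /* local copy of the root page */
} CachedRootPage;

for (;;)
{
    /* 1) do the search on the local copy of the root page */
    result = search_root_copy(cached->page, key);

    /* 2) the shared buffer may by now hold a different page, or an empty
     *    one, but never unmapped memory; if its LSN no longer matches the
     *    copy's, take a new copy and retry */
    if (PageGetLSN(BufferGetPage(cached->buffer)) != cached->lsn)
    {
        refresh_root_copy(rel, cached);
        continue;
    }

    /* 3) if the buffer was replaced since the copy was taken, the LSN we
     *    just read belonged to an unrelated page; take a new copy and
     *    retry */
    if (read_replacement_counter(cached->buffer) != cached->replace_count)
    {
        refresh_root_copy(rel, cached);
        continue;
    }

    break;                      /* 4) happiness */
}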


For optimization reasons it might make sense to store the buffer replacement
counter on a separate cacheline from BufferDesc.{state,content_lock}, so
reading the buffer replacement counter doesn't cause cacheline contention with
backends working with state/lock. But that's an implementation detail, and it
might not matter much, because the pressure on state/content_lock would be
reduced drastically.

Greetings,

Andres Freund



Re: ReadRecentBuffer() doesn't scale well

From: Thomas Munro
I (re)discovered why I used the lock-then-pin approach.  In the
comments I mentioned InvalidateBuffer(), but the main problem is in its
caller GetVictimBuffer(), which has various sanity checks about
reference counts that can occasionally fail if you have code randomly
pinning any old buffer.

New idea: use the standard PinBuffer() function, but add a mode that
doesn't pin invalid buffers (with the caveat that you can perhaps get a
false negative due to an unlocked read, but never a false positive; see
commit message).  Otherwise we'd have to duplicate all the same logic
to use cmpxchg for ReadRecentBuffer(), or rethink the assumptions in
that other code.
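
A stripped-down sketch of what that mode might look like inside PinBuffer()'s
existing CAS loop (usagecount and private-refcount bookkeeping omitted; the
parameter name is taken from the description, not necessarily the final API):

static bool
PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy,
          bool skip_if_not_valid)
{
    uint32 old_buf_state = pg_atomic_read_u32(&buf->state);
    uint32 buf_state;

    for (;;)
    {
        if (old_buf_state & BM_LOCKED)
            old_buf_state = WaitBufHdrUnlocked(buf);

        /* New: refuse to pin a buffer we observed as invalid.  The
         * unlocked read can be stale, so this may yield a false
         * negative, but it never pins a buffer seen as invalid. */
        if (skip_if_not_valid && !(old_buf_state & BM_VALID))
            return false;

        buf_state = old_buf_state + BUF_REFCOUNT_ONE;
        if (pg_atomic_compare_exchange_u32(&buf->state, &old_buf_state,
                                           buf_state))
            return (buf_state & BM_VALID) != 0;
    }
}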

As for the lack of a usagecount bump in the back-branches, I think the
options are: teach PinBuffer_Locked() to increment it optionally, or
back-patch whatever we come up with for this.

For the root buffer optimisation, the obvious place for storage seems
to be under rd_amcache.  It was originally invented for the cached
metapage (commit d2896a9ed14) but could accommodate a new struct
holding whatever we want.  Here is a patch to try that out.
BTAMCacheData would also be a natural place to put future things
including a copy of the root page itself, in later work on lock-free
tricks.
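
Roughly like this; the struct name follows the patch description, while the
fields beyond the existing cached metapage data are speculative:

typedef struct BTAMCacheData
{
    BTMetaPageData metad;       /* the metapage data cached today */
    Buffer         rootbuf;     /* recent-buffer hint for the root page */
    /* later: perhaps a copy of the root page itself, for lock-free
     * search tricks */
} BTAMCacheData;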

Experimental patches attached.

Attachments

Re: ReadRecentBuffer() doesn't scale well

From: Andres Freund
Hi,

On 2023-06-29 19:35:30 +1200, Thomas Munro wrote:
> I (re)discovered why I used the lock-then-pin approach.  In the
> comments I mentioned InvalidBuffer(), but the main problem is in its
> caller GetVictimBuffer() which has various sanity checks about
> reference counts that can occasionally fail if you have code randomly
> pinning any old buffer.

You're right. Specifically, non-valid buffers are the issue.


> New idea: use the standard PinBuffer() function, but add a mode that
> doesn't pin invalid buffers (with caveat that you can perhaps get a
> false negative due to unlocked read, but never a false positive; see
> commit message).  Otherwise we'd have to duplicate all the same logic
> to use cmpxchg for ReadRecentBuffer(), or rethink the assumptions in
> that other code.

It might be worth using lock-free code in more places before long, but I agree
with the solution here.


> As for the lack of usage bump in the back-branches, I think the
> options are: teach PinBuffer_Locked() to increment it optionally, or
> back-patch whatever we come up with for this.

Hm, or just leave it as is.


> For the root buffer optimisation, the obvious place for storage seems
> to be under rd_amcache.  It was originally invented for the cached
> metapage (commit d2896a9ed14) but could accommodate a new struct
> holding whatever we want.  Here is a patch to try that out.
> BTAMCacheData would also be a natural place to put future things
> including a copy of the root page itself, in later work on lock-free
> tricks.

I am wondering if we don't want something more generic than stashing this in
rd_amcache. Don't want to end up duplicating relevant code across the uses of
rd_amcache in every AM.


> @@ -663,38 +663,17 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
>      else
>      {
>          bufHdr = GetBufferDescriptor(recent_buffer - 1);
> -        have_private_ref = GetPrivateRefCount(recent_buffer) > 0;
>  
> -        /*
> -         * Do we already have this buffer pinned with a private reference?  If
> -         * so, it must be valid and it is safe to check the tag without
> -         * locking.  If not, we have to lock the header first and then check.
> -         */
> -        if (have_private_ref)
> -            buf_state = pg_atomic_read_u32(&bufHdr->state);
> -        else
> -            buf_state = LockBufHdr(bufHdr);
> -
> -        if ((buf_state & BM_VALID) && BufferTagsEqual(&tag, &bufHdr->tag))
> +        /* Is it still valid and holding the right tag? */
> +        if (PinBuffer(bufHdr, NULL, true))

I do wonder if we should have an unlocked pre-check for a) the buffer being
valid and b) BufferTagsEqual() matching.  With such a pre-check the race for
increasing the usage count of the wrong buffer is quite small; without the
pre-check it doesn't seem that small anymore.

Greetings,

Andres Freund



Re: ReadRecentBuffer() doesn't scale well

From: Thomas Munro
On Fri, Jun 30, 2023 at 3:39 AM Andres Freund <andres@anarazel.de> wrote:
> I am wondering if we don't want something more generic than stashing this in
> rd_amcache. Don't want to end up duplicating relevant code across the uses of
> rd_amcache in every AM.

I suppose we could try to track hot pages automatically.  Ants Aasma
mentioned that he was working on a tiny SIMD-based LRU that could be
useful for something like that, without being too slow.  Just for
fun/experimentation, here's a simple attempt to use a very stupid
stand-in LRU to cache the most recent 8 lookups for each fork of each
relation.  Obviously that will miss every time for many workloads, so
it'd have to be pretty close to free, and this code probably isn't good
enough, but perhaps Ants knows how to sprinkle the right magic fairy
dust on it.  It should automatically discover things like root pages,
the heap target block during repeat inserts etc, and visibility map
pages would stick around, and perhaps a few more pages of btrees that
have a very hot key range (but not pgbench).
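
For concreteness, here is a toy version of such a stand-in, invented to
mirror the description (most-recent-first with move-to-front; a real version
would want the SIMD treatment mentioned above):

#include <string.h>

#define RECENT_SLOTS 8

typedef struct RecentBufferLRU
{
    BlockNumber blocks[RECENT_SLOTS];   /* most recent first */
    Buffer      buffers[RECENT_SLOTS];
} RecentBufferLRU;

static inline Buffer
recent_buffer_lookup(RecentBufferLRU *lru, BlockNumber blkno)
{
    for (int i = 0; i < RECENT_SLOTS; i++)
    {
        if (lru->blocks[i] == blkno)
        {
            Buffer hit = lru->buffers[i];

            /* move-to-front so hot entries survive */
            memmove(&lru->blocks[1], &lru->blocks[0],
                    i * sizeof(BlockNumber));
            memmove(&lru->buffers[1], &lru->buffers[0],
                    i * sizeof(Buffer));
            lru->blocks[0] = blkno;
            lru->buffers[0] = hit;
            return hit;
        }
    }
    return InvalidBuffer;               /* miss: caller does a full lookup */
}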

> I do wonder if we should have an unlocked pre-check for a) the buffer being
> valid and b) BufferTagsEqual() matching.  With such a pre-check the race for
> increasing the usage count of the wrong buffer is quite small; without the
> pre-check it doesn't seem that small anymore.

Yeah, that makes sense.  Done in this version.

Attachments

Re: ReadRecentBuffer() doesn't scale well

From: Andres Freund
Hi,

On 2023-06-30 14:13:11 +1200, Thomas Munro wrote:
> On Fri, Jun 30, 2023 at 3:39 AM Andres Freund <andres@anarazel.de> wrote:
> > I am wondering if we don't want something more generic than stashing this in
> > rd_amcache. Don't want to end up duplicating relevant code across the uses of
> > rd_amcache in every AM.
> 
> I suppose we could try to track hot pages automatically.

I think that could be useful - but as a separate facility. The benefit of
stashing the root page buffer in the relcache is that it's practically free of
overhead and doesn't have complications from how many other intervening
accesses there are etc.

I was more thinking of just making the relevant fields part of RelationData
and delegating the precise use to the individual AMs.


> > I do wonder if we should have an unlocked pre-check for a) the buffer being
> > valid and b) BufferTagsEqual() matching.  With such a pre-check the race for
> > increasing the usage count of the wrong buffer is quite small; without the
> > pre-check it doesn't seem that small anymore.
> 
> Yeah, that makes sense.  Done in this version.

Cool.

Greetings,

Andres Freund



Re: ReadRecentBuffer() doesn't scale well

From: vignesh C
On Fri, 30 Jun 2023 at 07:43, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Fri, Jun 30, 2023 at 3:39 AM Andres Freund <andres@anarazel.de> wrote:
> > I am wondering if we don't want something more generic than stashing this in
> > rd_amcache. Don't want to end up duplicating relevant code across the uses of
> > rd_amcache in every AM.
>
> I suppose we could try to track hot pages automatically.  Ants Aasma
> mentioned that he was working on a tiny SIMD-based LRU that could be
> useful for something like that, without being too slow.  Just for
> fun/experimentation, here's a simple attempt to use a very stupid
> stand-in LRU to cache the most recent 8 lookups for each fork of each
> relation.  Obviously that will miss every time for many workloads, so
> it'd have to be pretty close to free, and this code probably isn't good
> enough, but perhaps Ants knows how to sprinkle the right magic fairy
> dust on it.  It should automatically discover things like root pages,
> the heap target block during repeat inserts etc, and visibility map
> pages would stick around, and perhaps a few more pages of btrees that
> have a very hot key range (but not pgbench).
>
> > I do wonder if we should have an unlocked pre-check for a) the buffer being
> > valid and b) BufferTagsEqual() matching.  With such a pre-check the race for
> > increasing the usage count of the wrong buffer is quite small; without the
> > pre-check it doesn't seem that small anymore.
>
> Yeah, that makes sense.  Done in this version.

I have changed the status of the commitfest entry to "Waiting on Author",
as Andres's comments were not discussed further. Feel free to address the
comments and change the status of the commitfest entry accordingly.

Regards,
Vignesh