Thread: ReadRecentBuffer() doesn't scale well
Hi,

As mentioned nearby [1], Thomas brought up [2] the idea of using
ReadRecentBuffer() in _bt_getroot(). I couldn't resist and prototyped it.

Unfortunately it scaled way worse at first. This is not an inherent issue, but
due to an implementation choice in ReadRecentBuffer(). Whereas the normal
BufferAlloc() path uses PinBuffer(), ReadRecentBuffer() first does
LockBufHdr(), checks if the buffer ID is the same and then uses
PinBuffer_Locked().

The problem with that is that PinBuffer() takes care not to hold the buffer
header spinlock: it uses compare_exchange to atomically acquire the pin while
guaranteeing nobody holds the lock. When holding the buffer header spinlock,
there obviously is the risk of being scheduled out (or even just not having
exclusive access to the cacheline). ReadRecentBuffer() scales worse even if
LockBufHdr() is immediately followed by PinBuffer_Locked(), so it really is
just holding the lock that is the issue.

The fairly obvious solution to this is to just use PinBuffer() and unpin the
buffer if its identity was changed concurrently. There could be an unlocked
pre-check as well. However, there's the following comment in
ReadRecentBuffer():

	 * It's now safe to pin the buffer.  We can't pin first and ask
	 * questions later, because it might confuse code paths like
	 * InvalidateBuffer() if we pinned a random non-matching buffer.
	 */

But I'm not sure I buy that - there are plenty of other things that can briefly
acquire a buffer pin (e.g. checkpointer, reclaiming the buffer for other
contents, etc).

Another difference between using PinBuffer() and PinBuffer_Locked() is that the
latter does not adjust a buffer's usagecount.

Leaving the scalability issue aside, isn't it somewhat odd that optimizing a
codepath to use ReadRecentBuffer() instead of ReadBuffer() leads to not
increasing the usagecount anymore?

FWIW, once that's fixed, using ReadRecentBuffer() for _bt_getroot(), caching
the root page's buffer ID in RelationData, seems a noticeable win: about 7% in
a concurrent, read-only pgbench that uses batches of 10. And it should be easy
to get much bigger wins, e.g. with an index nested loop with a relatively small
index on the inner side.

Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/20230627013458.axge7iylw7llyvww%40awork3.anarazel.de
[2] https://twitter.com/MengTangmu/status/1673439083518115840
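To make the proposed shape concrete, here is a minimal sketch of the pin-first
path (not the actual prototype): the helper name is invented, and resource-owner
bookkeeping, the local-buffer path and statistics are deliberately elided. It
assumes PinBuffer()'s usual semantics, i.e. it always takes the pin and merely
reports whether the buffer was valid.

```c
/*
 * Hypothetical helper (as it would live in bufmgr.c next to
 * ReadRecentBuffer()) illustrating "pin first, ask questions later":
 * unlocked pre-check, then PinBuffer(), then re-check the tag and back out
 * on a mismatch.
 */
static bool
PinRecentBufferIfStillValid(BufferDesc *bufHdr, const BufferTag *tag)
{
	uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
	bool		valid;

	/* Unlocked pre-check: cheaply reject an obviously stale hint. */
	if (!(buf_state & BM_VALID) || !BufferTagsEqual(tag, &bufHdr->tag))
		return false;

	/*
	 * Acquire the pin without touching the buffer header spinlock.  Note
	 * that PinBuffer() pins unconditionally and only reports validity, so
	 * we must unpin on any failure below.
	 */
	valid = PinBuffer(bufHdr, NULL);

	/* Re-check the identity now that the pin prevents replacement. */
	if (!valid || !BufferTagsEqual(tag, &bufHdr->tag))
	{
		UnpinBuffer(bufHdr);
		return false;
	}

	return true;
}
```

Because this goes through PinBuffer(), the usagecount gets bumped just as in
the ReadBuffer() path, which would also address the second point above.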
On Tue, Jun 27, 2023 at 2:05 PM Andres Freund <andres@anarazel.de> wrote:
> As mentioned nearby [1], Thomas brought up [2] the idea of using
> ReadRecentBuffer() in _bt_getroot(). I couldn't resist and prototyped it.

Thanks!

> Unfortunately it scaled way worse at first. This is not an inherent issue,
> but due to an implementation choice in ReadRecentBuffer(). Whereas the normal
> BufferAlloc() path uses PinBuffer(), ReadRecentBuffer() first does
> LockBufHdr(), checks if the buffer ID is the same and then uses
> PinBuffer_Locked().
>
> The problem with that is that PinBuffer() takes care not to hold the buffer
> header spinlock: it uses compare_exchange to atomically acquire the pin while
> guaranteeing nobody holds the lock. When holding the buffer header spinlock,
> there obviously is the risk of being scheduled out (or even just not having
> exclusive access to the cacheline).

Yeah. Aside from the inherent nastiness of user-space spinlocks, this new use
case is also enormously more likely to contend, and then get into trouble by
being preempted, than the use case I was focusing on at the time, because btree
root pages are about the hottest pages in the universe.

> The fairly obvious solution to this is to just use PinBuffer() and unpin the
> buffer if its identity was changed concurrently. There could be an unlocked
> pre-check as well. However, there's the following comment in
> ReadRecentBuffer():
>
>	 * It's now safe to pin the buffer.  We can't pin first and ask
>	 * questions later, because it might confuse code paths like
>	 * InvalidateBuffer() if we pinned a random non-matching buffer.
>	 */
>
> But I'm not sure I buy that - there are plenty of other things that can
> briefly acquire a buffer pin (e.g. checkpointer, reclaiming the buffer for
> other contents, etc).

I may well have been too cautious with that. The worst thing I can think of
right now is that InvalidateBuffer() would busy loop (as it already does in
other rare cases) when it sees a pin.

> Another difference between using PinBuffer() and PinBuffer_Locked() is that
> the latter does not adjust a buffer's usagecount.
>
> Leaving the scalability issue aside, isn't it somewhat odd that optimizing a
> codepath to use ReadRecentBuffer() instead of ReadBuffer() leads to not
> increasing the usagecount anymore?

Yeah, that is not great. The simplification you suggest would fix that too,
though I guess it would also bump the usage count of buffers that don't have
the tag we expected; that's obviously rare and erring on the better side,
though.

> FWIW, once that's fixed, using ReadRecentBuffer() for _bt_getroot(), caching
> the root page's buffer ID in RelationData, seems a noticeable win: about 7%
> in a concurrent, read-only pgbench that uses batches of 10. And it should be
> easy to get much bigger wins, e.g. with an index nested loop with a
> relatively small index on the inner side.

Wooo, that's better than I was hoping. Thanks for trying it out! I think, for
the complexity involved (ie very little), it's a nice result, and worth
considering even though it's also a solid clue that we could do much better
than this with a (yet to be designed) longer-lived pin scheme.

smgr_targblock could be another easy-to-cache candidate, ie a place where there
is a single interesting hot page that we're already keeping track of with no
requirement for new backend-local mapping machinery.
On Mon, Jun 26, 2023 at 8:34 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Yeah. Aside from the inherent nastiness of user-space spinlocks, this new use
> case is also enormously more likely to contend, and then get into trouble by
> being preempted, than the use case I was focusing on at the time, because
> btree root pages are about the hottest pages in the universe.

They're not just the hottest. They're also among the least likely to change
from one moment to the next. (If that ever failed to hold then it wouldn't take
long for the index to become grotesquely tall.)

--
Peter Geoghegan
Hi,

On 2023-06-27 15:33:57 +1200, Thomas Munro wrote:
> On Tue, Jun 27, 2023 at 2:05 PM Andres Freund <andres@anarazel.de> wrote:
> > Unfortunately it scaled way worse at first. This is not an inherent issue,
> > but due to an implementation choice in ReadRecentBuffer(). Whereas the
> > normal BufferAlloc() path uses PinBuffer(), ReadRecentBuffer() first does
> > LockBufHdr(), checks if the buffer ID is the same and then uses
> > PinBuffer_Locked().
> >
> > The problem with that is that PinBuffer() takes care not to hold the buffer
> > header spinlock: it uses compare_exchange to atomically acquire the pin
> > while guaranteeing nobody holds the lock. When holding the buffer header
> > spinlock, there obviously is the risk of being scheduled out (or even just
> > not having exclusive access to the cacheline).
>
> Yeah. Aside from the inherent nastiness of user-space spinlocks

I've been wondering about making our backoff path use futexes, after some
adaptive spinning.

> > The fairly obvious solution to this is to just use PinBuffer() and unpin
> > the buffer if its identity was changed concurrently. There could be an
> > unlocked pre-check as well. However, there's the following comment in
> > ReadRecentBuffer():
> >
> >	 * It's now safe to pin the buffer.  We can't pin first and ask
> >	 * questions later, because it might confuse code paths like
> >	 * InvalidateBuffer() if we pinned a random non-matching buffer.
> >	 */
> >
> > But I'm not sure I buy that - there are plenty of other things that can
> > briefly acquire a buffer pin (e.g. checkpointer, reclaiming the buffer for
> > other contents, etc).
>
> I may well have been too cautious with that. The worst thing I can think of
> right now is that InvalidateBuffer() would busy loop (as it already does in
> other rare cases) when it sees a pin.

Right. Particularly if we were to add a pre-check for the tag to match, that
should be extremely rare.

> > Another difference between using PinBuffer() and PinBuffer_Locked() is
> > that the latter does not adjust a buffer's usagecount.
> >
> > Leaving the scalability issue aside, isn't it somewhat odd that optimizing
> > a codepath to use ReadRecentBuffer() instead of ReadBuffer() leads to not
> > increasing the usagecount anymore?
>
> Yeah, that is not great. The simplification you suggest would fix that too,
> though I guess it would also bump the usage count of buffers that don't have
> the tag we expected; that's obviously rare and erring on the better side,
> though.

Yea, I'm not worried about that. If somebody is, we could just add code to
decrement the usagecount again.

> > FWIW, once that's fixed, using ReadRecentBuffer() for _bt_getroot(),
> > caching the root page's buffer ID in RelationData, seems a noticeable win:
> > about 7% in a concurrent, read-only pgbench that uses batches of 10. And it
> > should be easy to get much bigger wins, e.g. with an index nested loop with
> > a relatively small index on the inner side.
>
> Wooo, that's better than I was hoping. Thanks for trying it out! I think,
> for the complexity involved (ie very little)

I don't really have a concrete thought for where to store the ID of the recent
buffer. I just added a new field into some padding in RelationData, but we
might go for something fancier.

> smgr_targblock could be another easy-to-cache candidate, ie a place where
> there is a single interesting hot page that we're already keeping track of
> with no requirement for new backend-local mapping machinery.

I wonder if we should simply add a generic field for such a Buffer to
RelationData, that the AM can use as it desires. For btree that would be the
root page, for heap the target block ...

> it's a nice result, and worth considering even though it's also a solid clue
> that we could do much better than this with a (yet to be designed)
> longer-lived pin scheme.

Indeed. PinBuffer() is pretty hot after the change. As is the buffer content
lock.

Particularly for the root page, it'd be really interesting to come up with a
scheme that keeps an offline copy of the root page while also pinning the real
root page. I think we should be able to have a post-check that can figure out
if the copied root page is out of date after searching it, without needing the
content lock.

Greetings,

Andres Freund
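As a strawman for what such a generic field and its consumption in
_bt_getroot() could look like (the field name is invented here; the metapage
lookup producing rootblkno, the "is this still the root?" recheck and all error
paths are elided):

```c
/*
 * Sketch only: a hypothetical AM-owned "recent buffer" hint in RelationData
 * and how _bt_getroot() might use it together with ReadRecentBuffer().
 */

/* in struct RelationData (utils/rel.h); name invented for illustration */
Buffer		rd_recentbuf;	/* AM-defined recent-buffer hint, or InvalidBuffer */

/* in _bt_getroot(), once rootblkno is known from the cached metapage */
Buffer		rootbuf = InvalidBuffer;

if (BufferIsValid(rel->rd_recentbuf) &&
	ReadRecentBuffer(rel->rd_locator, MAIN_FORKNUM, rootblkno,
					 rel->rd_recentbuf))
{
	/* Got a pin on the remembered buffer; lock it as _bt_getbuf() would. */
	rootbuf = rel->rd_recentbuf;
	_bt_lockbuf(rel, rootbuf, BT_READ);
	/* real code would still verify the page is the expected root here */
}
else
{
	/* Hint was missing or stale: fall back to the normal path. */
	rootbuf = _bt_getbuf(rel, rootblkno, BT_READ);
	rel->rd_recentbuf = rootbuf;
}
```

The attraction of a plain field is that heap could reuse the same slot for its
target block, with the AM alone deciding what the hint means.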
On Mon, Jun 26, 2023 at 9:09 PM Andres Freund <andres@anarazel.de> wrote:
> I think we should be able to have a post-check that can figure out if the
> copied root page is out of date after searching it, without needing the
> content lock.

I'm guessing that you're thinking of doing something with the page LSN?

--
Peter Geoghegan
On Tue, Jun 27, 2023 at 4:32 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Jun 26, 2023 at 9:09 PM Andres Freund <andres@anarazel.de> wrote:
> > I think we should be able to have a post-check that can figure out if the
> > copied root page is out of date after searching it, without needing the
> > content lock.
>
> I'm guessing that you're thinking of doing something with the page LSN?

If the goal is to get rid of both pins and content locks, the LSN isn't enough.
A page might be evicted and replaced by another page that has the same LSN
because they were modified by the same record. Maybe that's vanishingly rare,
but the correct thing would be a counter that goes up on modification AND
eviction.

(FWIW I toyed with variants of this concept in the context of SLRU -> buffer
pool migration, where I was trying to do zero-lock CLOG lookups; in that case I
didn't need the copy of the page being discussed here due to the data being
atomically readable, but I had the same requirement for a
"did-it-change-under-my-feet?" check.)
On Mon, Jun 26, 2023 at 9:40 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> If the goal is to get rid of both pins and content locks, the LSN isn't
> enough. A page might be evicted and replaced by another page that has the
> same LSN because they were modified by the same record. Maybe that's
> vanishingly rare, but the correct thing would be a counter that goes up on
> modification AND eviction.

It should be safe to allow searchers to see a version of the root page that is
out of date. The Lehman & Yao design is very permissive about these things.
There aren't any special cases where the general rules are weakened in some way
that might complicate this approach. Searchers need to check the high key to
determine if they need to move right -- same as always.

More concretely: a root page can be concurrently split when there is an
in-flight index scan that is about to land on it (which becomes the left half
of the split). It doesn't matter if it's a searcher that is "between" the meta
page and the root page. It doesn't matter if a level was added. This is true
even though nothing that you'd usually think of as an interlock is held
"between levels".

The root page isn't really special, except in the obvious way. We can even have
two roots at the same time (the true root, and the fast root).

--
Peter Geoghegan
On Tue, Jun 27, 2023 at 4:53 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Jun 26, 2023 at 9:40 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > If the goal is to get rid of both pins and content locks, the LSN isn't
> > enough. A page might be evicted and replaced by another page that has the
> > same LSN because they were modified by the same record. Maybe that's
> > vanishingly rare, but the correct thing would be a counter that goes up on
> > modification AND eviction.
>
> It should be safe to allow searchers to see a version of the root page that
> is out of date. The Lehman & Yao design is very permissive about these
> things. There aren't any special cases where the general rules are weakened
> in some way that might complicate this approach. Searchers need to check the
> high key to determine if they need to move right -- same as always.

OK. I guess I'm talking about a slightly more general version of the problem,
inspired by the stuff I mentioned in parentheses, which would simply get the
wrong answer if the mapping changed, whereas here you'd use the cached copy in
a race case, which should still work for searches.

So I guess the question for this thread is: do we want to work on
ReadRecentBuffer(), or just take this experiment as evidence of even more
speed-up available and aim for that directly?
Hi,

On 2023-06-27 16:40:08 +1200, Thomas Munro wrote:
> On Tue, Jun 27, 2023 at 4:32 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > On Mon, Jun 26, 2023 at 9:09 PM Andres Freund <andres@anarazel.de> wrote:
> > > I think we should be able to have a post-check that can figure out if
> > > the copied root page is out of date after searching it, without needing
> > > the content lock.
> >
> > I'm guessing that you're thinking of doing something with the page LSN?

Yes, that seems to be the most obvious approach.

> If the goal is to get rid of both pins and content locks, the LSN isn't
> enough.

I was imagining you'd have a long-lived pin. I don't think trying to make it
work without that is particularly promising in this context, where it seems
quite feasible to keep pins around for a while.

> A page might be evicted and replaced by another page that has the same LSN
> because they were modified by the same record. Maybe that's vanishingly rare,
> but the correct thing would be a counter that goes up on modification AND
> eviction.

I don't think it would need to be a single counter. If we wanted to do
something like this, I think you'd have to have a counter in the buffer desc
that's incremented whenever the page is replaced, plus the LSN for the page
content change "counter".

Greetings,

Andres Freund
Hi,

On 2023-06-26 21:53:12 -0700, Peter Geoghegan wrote:
> It should be safe to allow searchers to see a version of the root page that
> is out of date. The Lehman & Yao design is very permissive about these
> things. There aren't any special cases where the general rules are weakened
> in some way that might complicate this approach. Searchers need to check the
> high key to determine if they need to move right -- same as always.

Wouldn't we at least need a pin on the root page, or hold a snapshot, to defend
against page deletions?

Greetings,

Andres Freund
On Mon, Jun 26, 2023 at 11:27 PM Andres Freund <andres@anarazel.de> wrote:
> On 2023-06-26 21:53:12 -0700, Peter Geoghegan wrote:
> > It should be safe to allow searchers to see a version of the root page
> > that is out of date. The Lehman & Yao design is very permissive about
> > these things. There aren't any special cases where the general rules are
> > weakened in some way that might complicate this approach. Searchers need
> > to check the high key to determine if they need to move right -- same as
> > always.
>
> Wouldn't we at least need a pin on the root page, or hold a snapshot, to
> defend against page deletions?

You need to hold a snapshot to prevent concurrent page recycling -- though not
page deletion itself (I did say "anything that you'd usually think of as an
interlock"). I'm pretty sure that a concurrent page deletion is possible, even
when you hold a pin on the page. (Perhaps not, but if not then it's just an
accident -- a side-effect of the interlock that protects against concurrent
heap TID recycling.)

You can't delete a rightmost page (on any level). Every root page is a
rightmost page. So the root would have to be split, and then once again emptied
before it could be deleted -- only then would there be a danger of some backend
with a locally cached root page having an irredeemably bad picture of what's
going on with the index. That's another angle that you could approach the
problem from, I suppose.

--
Peter Geoghegan
On Tue, 27 Jun 2023 at 07:09, Andres Freund <andres@anarazel.de> wrote:
> On 2023-06-27 15:33:57 +1200, Thomas Munro wrote:
> > On Tue, Jun 27, 2023 at 2:05 PM Andres Freund <andres@anarazel.de> wrote:
> > > Unfortunately it scaled way worse at first. This is not an inherent
> > > issue, but due to an implementation choice in ReadRecentBuffer(). Whereas
> > > the normal BufferAlloc() path uses PinBuffer(), ReadRecentBuffer() first
> > > does LockBufHdr(), checks if the buffer ID is the same and then uses
> > > PinBuffer_Locked().
> > >
> > > The problem with that is that PinBuffer() takes care not to hold the
> > > buffer header spinlock: it uses compare_exchange to atomically acquire
> > > the pin while guaranteeing nobody holds the lock. When holding the buffer
> > > header spinlock, there obviously is the risk of being scheduled out (or
> > > even just not having exclusive access to the cacheline).
> >
> > Yeah. Aside from the inherent nastiness of user-space spinlocks
>
> I've been wondering about making our backoff path use futexes, after some
> adaptive spinning.

If you want to experiment, here is a rebased version of something I hacked up a
couple of years back on the way to Fosdem Pgday. I didn't pursue it further
because I didn't have a use case where it showed a significant difference.

--
Ants
Attachments
Hi,

On 2023-06-27 14:49:48 +0300, Ants Aasma wrote:
> If you want to experiment, here is a rebased version of something I hacked
> up a couple of years back on the way to Fosdem Pgday. I didn't pursue it
> further because I didn't have a use case where it showed a significant
> difference.

Thanks for posting!

Based on past experiments, anything that requires an atomic op during spinlock
release on x86 will be painful :/. I'm not sure there's a realistic way to
avoid that with futexes though :(.

Greetings,

Andres Freund
On Tue, 27 Jun 2023 at 18:40, Andres Freund <andres@anarazel.de> wrote:
> On 2023-06-27 14:49:48 +0300, Ants Aasma wrote:
> > If you want to experiment, here is a rebased version of something I hacked
> > up a couple of years back on the way to Fosdem Pgday. I didn't pursue it
> > further because I didn't have a use case where it showed a significant
> > difference.
>
> Thanks for posting!
>
> Based on past experiments, anything that requires an atomic op during
> spinlock release on x86 will be painful :/. I'm not sure there's a realistic
> way to avoid that with futexes though :(.

Do you happen to know if a plain xchg instruction counts as an atomic for this?
I haven't done atomics stuff in a while, so I might be missing something, but
at first glance I think using a plain xchg would be enough for the releasing
side.

--
Ants
Hi,

On 2023-06-27 19:04:31 +0300, Ants Aasma wrote:
> On Tue, 27 Jun 2023 at 18:40, Andres Freund <andres@anarazel.de> wrote:
> > On 2023-06-27 14:49:48 +0300, Ants Aasma wrote:
> > > If you want to experiment, here is a rebased version of something I
> > > hacked up a couple of years back on the way to Fosdem Pgday. I didn't
> > > pursue it further because I didn't have a use case where it showed a
> > > significant difference.
> >
> > Thanks for posting!
> >
> > Based on past experiments, anything that requires an atomic op during
> > spinlock release on x86 will be painful :/. I'm not sure there's a
> > realistic way to avoid that with futexes though :(.
>
> Do you happen to know if a plain xchg instruction counts as an atomic for
> this? I haven't done atomics stuff in a while, so I might be missing
> something, but at first glance I think using a plain xchg would be enough
> for the releasing side.

It is automatically an atomic op when referencing memory. Intel SDM 9.1.2.1
"Automatic Locking":

  "The operations on which the processor automatically follows the LOCK
  semantics are as follows:

  • When executing an XCHG instruction that references memory.
  ..."

Theoretically cmpxchg can be used in a non-atomic fashion. I'm not sure that
can be done correctly though, if you want to also store a separate value for
the futex. This stuff is hard to think through :)

Greetings,

Andres Freund
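To illustrate why the release side of a futex-based lock still ends up being an
atomic read-modify-write (and why xchg suffices for it), here is a minimal
sketch of the classic three-state futex mutex release, entirely outside of any
PostgreSQL code; it assumes the usual convention of 0 = free, 1 = locked,
2 = locked with waiters, and Linux's raw futex syscall:

```c
#include <stdint.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Release path of a toy futex-backed lock.  The exchange must be atomic
 * (on x86 the compiler emits XCHG, which is implicitly LOCKed) because the
 * releaser needs to learn, in the same step that frees the lock, whether
 * any waiter registered itself and went to sleep in the kernel.
 */
static void
futex_unlock(uint32_t *lock)
{
	/* Atomically store "free" and fetch the previous state. */
	if (__atomic_exchange_n(lock, 0, __ATOMIC_RELEASE) == 2)
	{
		/* At least one waiter is sleeping on the futex word: wake one. */
		syscall(SYS_futex, lock, FUTEX_WAKE, 1, NULL, NULL, 0);
	}
}
```

Which is exactly the point being made above: even the cheapest possible futex
release is an atomic RMW on the lock word, the thing that hurts on a contended
x86 cacheline.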
Hi,

On 2023-06-27 01:10:25 -0700, Peter Geoghegan wrote:
> On Mon, Jun 26, 2023 at 11:27 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2023-06-26 21:53:12 -0700, Peter Geoghegan wrote:
> > > It should be safe to allow searchers to see a version of the root page
> > > that is out of date. The Lehman & Yao design is very permissive about
> > > these things. There aren't any special cases where the general rules are
> > > weakened in some way that might complicate this approach. Searchers need
> > > to check the high key to determine if they need to move right -- same as
> > > always.
> >
> > Wouldn't we at least need a pin on the root page, or hold a snapshot, to
> > defend against page deletions?
>
> You need to hold a snapshot to prevent concurrent page recycling -- though
> not page deletion itself (I did say "anything that you'd usually think of as
> an interlock").

I don't think we'd want to have a snapshot for this; that makes it much less
beneficial.

> I'm pretty sure that a concurrent page deletion is possible, even when you
> hold a pin on the page. (Perhaps not, but if not then it's just an accident
> -- a side-effect of the interlock that protects against concurrent heap TID
> recycling.)

Looks like the pin should prevent the danger, but it wouldn't be very
attractive, due to blocking vacuum... I've wondered before about a type of pin
that just prevents buffer replacement, but not cleaning up page contents. I
think that'd be beneficial in quite a few places.

> You can't delete a rightmost page (on any level). Every root page is a
> rightmost page. So the root would have to be split, and then once again
> emptied before it could be deleted -- only then would there be a danger of
> some backend with a locally cached root page having an irredeemably bad
> picture of what's going on with the index. That's another angle that you
> could approach the problem from, I suppose.

If we had a way of just preventing the page from being replaced, or reliably
detecting that happening, without blocking btree vacuum, the easiest path seems
to be to use the cached version of the root page and re-do the work whenever
a) the LSN of the page has changed or b) the buffer has been replaced. To me
that seems like it'd likely be simpler and more general than relying on being
able to step right from any outdated, but not deleted, version of the page (due
to the page deletion issues). Obviously that'd lead to retries more often - but
realistically it's still going to be vanishingly rare; root pages don't get
modified that much once the index is beyond toy size.

I think a replacement counter in the buffer desc is the easiest way to achieve
that? We'd have to store the buffer ID, buffer replacement counter and page LSN
in RelationData. I think the protocol would have to be something like:

1) Do the search on the copy of the root page.

2) Get the page LSN from the relevant buffer contents - this could be from a
   different relation / block or even an empty page, but will never be an
   invalid memory access, as we don't free shared buffers before shutdown. If
   the LSN changed since taking the copy, take a new copy of the root page and
   start at 1).

3) Check if the buffer replacement counter is the same as at the time of the
   copy; if not, take a new copy of the root page and start at 1).

4) Happiness.

For optimization reasons it might make sense to store the buffer replacement
counter on a separate cacheline from BufferDesc.{state,content_lock}, so
reading the buffer replacement counter doesn't cause cacheline contention with
backends working with state/lock. But that's an implementation detail, and it
might not matter much, because the pressure on state/content_lock would be
reduced drastically.

Greetings,

Andres Freund
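A rough sketch of that validation protocol follows. The replacement counter
does not exist today, so it is passed in here as a hypothetical
pg_atomic_uint64, and every struct and function name is invented for
illustration; step 1 (the search over the local copy) and the question of
torn LSN reads without a content lock are glossed over.

```c
#include "postgres.h"
#include "access/xlogdefs.h"
#include "port/atomics.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"

/* Hypothetical backend-local cache of the root page (e.g. in RelationData). */
typedef struct CachedRootPage
{
	Buffer		buffer;			/* buffer the copy was taken from */
	uint64		replace_count;	/* hypothetical replacement counter at copy time */
	XLogRecPtr	lsn;			/* page LSN at copy time */
	char		page[BLCKSZ];	/* local copy of the root page */
} CachedRootPage;

/* Steps 2)-4): decide whether a search over cached->page can be trusted. */
static bool
cached_root_page_still_fresh(CachedRootPage *cached,
							 pg_atomic_uint64 *replace_count)
{
	/*
	 * 2) Re-read the LSN from the shared buffer.  It may belong to some
	 * unrelated page by now, but it is never an invalid memory access,
	 * since shared buffers are not freed before shutdown.
	 */
	if (PageGetLSN(BufferGetPage(cached->buffer)) != cached->lsn)
		return false;

	pg_read_barrier();

	/* 3) Make sure the buffer still holds the same page as when we copied. */
	if (pg_atomic_read_u64(replace_count) != cached->replace_count)
		return false;

	return true;				/* 4) happiness */
}
```

On a false return the caller would take a fresh copy of the root page and redo
the search, which, as argued above, should be vanishingly rare for a
non-toy-sized index.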
I (re)discovered why I used the lock-then-pin approach. In the comments I
mentioned InvalidateBuffer(), but the main problem is in its caller
GetVictimBuffer(), which has various sanity checks about reference counts that
can occasionally fail if you have code randomly pinning any old buffer.

New idea: use the standard PinBuffer() function, but add a mode that doesn't
pin invalid buffers (with the caveat that you can perhaps get a false negative
due to an unlocked read, but never a false positive; see commit message).
Otherwise we'd have to duplicate all the same logic to use cmpxchg for
ReadRecentBuffer(), or rethink the assumptions in that other code.

As for the lack of usage bump in the back-branches, I think the options are:
teach PinBuffer_Locked() to increment it optionally, or back-patch whatever we
come up with for this.

For the root buffer optimisation, the obvious place for storage seems to be
under rd_amcache. It was originally invented for the cached metapage (commit
d2896a9ed14) but could accommodate a new struct holding whatever we want. Here
is a patch to try that out. BTAMCacheData would also be a natural place to put
future things including a copy of the root page itself, in later work on
lock-free tricks.

Experimental patches attached.
Attachments
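For illustration, the rd_amcache layout being described might look roughly like
this. BTAMCacheData is the name used for the experimental patch, but the fields
shown here are a guess rather than a copy of it:

```c
#include "postgres.h"
#include "access/nbtree.h"
#include "storage/bufmgr.h"

/*
 * Sketch of an extended nbtree rd_amcache payload: today rd_amcache holds a
 * plain BTMetaPageData copy; a wrapper struct could carry the cached
 * metapage plus a ReadRecentBuffer() hint for the root page (and, later,
 * perhaps a copy of the root page itself for lock-free searches).
 */
typedef struct BTAMCacheData
{
	BTMetaPageData metad;		/* cached metapage contents, as today */
	Buffer		rootbuf;		/* last known buffer holding the root page */
} BTAMCacheData;
```

The appeal is that everything hangs off the existing rd_amcache pointer, so the
lifetime and invalidation behaviour presumably stay the same as for the cached
metapage today.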
Hi,

On 2023-06-29 19:35:30 +1200, Thomas Munro wrote:
> I (re)discovered why I used the lock-then-pin approach. In the comments I
> mentioned InvalidateBuffer(), but the main problem is in its caller
> GetVictimBuffer(), which has various sanity checks about reference counts
> that can occasionally fail if you have code randomly pinning any old buffer.

You're right. Specifically, non-valid buffers are the issue.

> New idea: use the standard PinBuffer() function, but add a mode that doesn't
> pin invalid buffers (with the caveat that you can perhaps get a false
> negative due to an unlocked read, but never a false positive; see commit
> message). Otherwise we'd have to duplicate all the same logic to use cmpxchg
> for ReadRecentBuffer(), or rethink the assumptions in that other code.

It might be worth using lock-free code in more places before long, but I agree
with the solution here.

> As for the lack of usage bump in the back-branches, I think the options are:
> teach PinBuffer_Locked() to increment it optionally, or back-patch whatever
> we come up with for this.

Hm, or just leave it as is.

> For the root buffer optimisation, the obvious place for storage seems to be
> under rd_amcache. It was originally invented for the cached metapage (commit
> d2896a9ed14) but could accommodate a new struct holding whatever we want.
> Here is a patch to try that out. BTAMCacheData would also be a natural place
> to put future things including a copy of the root page itself, in later work
> on lock-free tricks.

I am wondering if we don't want something more generic than stashing this in
rd_amcache. Don't want to end up duplicating relevant code across the uses of
rd_amcache in every AM.

> @@ -663,38 +663,17 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
>  	else
>  	{
>  		bufHdr = GetBufferDescriptor(recent_buffer - 1);
> -		have_private_ref = GetPrivateRefCount(recent_buffer) > 0;
> -
> -		/*
> -		 * Do we already have this buffer pinned with a private reference? If
> -		 * so, it must be valid and it is safe to check the tag without
> -		 * locking.  If not, we have to lock the header first and then check.
> -		 */
> -		if (have_private_ref)
> -			buf_state = pg_atomic_read_u32(&bufHdr->state);
> -		else
> -			buf_state = LockBufHdr(bufHdr);
> -
> -		if ((buf_state & BM_VALID) && BufferTagsEqual(&tag, &bufHdr->tag))
> +		/* Is it still valid and holding the right tag? */
> +		if (PinBuffer(bufHdr, NULL, true))

I do wonder if we should have an unlocked pre-check for a) the buffer being
valid and b) BufferTagsEqual() matching. With such a pre-check the race for
increasing the usage count of the wrong buffer is quite small; without the
pre-check it doesn't seem that small anymore.

Greetings,

Andres Freund
On Fri, Jun 30, 2023 at 3:39 AM Andres Freund <andres@anarazel.de> wrote:
> I am wondering if we don't want something more generic than stashing this in
> rd_amcache. Don't want to end up duplicating relevant code across the uses
> of rd_amcache in every AM.

I suppose we could try to track hot pages automatically. Ants Aasma mentioned
that he was working on a tiny SIMD-based LRU that could be useful for something
like that, without being too slow. Just for fun/experimentation, here's a
simple attempt to use a very stupid stand-in LRU to cache the most recent 8
lookups for each fork of each relation. Obviously that will miss every time for
many workloads, so it'd have to be pretty close to free, and this code probably
isn't good enough, but perhaps Ants knows how to sprinkle the right magic fairy
dust on it. It should automatically discover things like root pages, the heap
target block during repeat inserts, etc., and visibility map pages would stick
around, and perhaps a few more pages of btrees that have a very hot key range
(but not pgbench).

> I do wonder if we should have an unlocked pre-check for a) the buffer being
> valid and b) BufferTagsEqual() matching. With such a pre-check the race for
> increasing the usage count of the wrong buffer is quite small; without the
> pre-check it doesn't seem that small anymore.

Yeah, that makes sense. Done in this version.
Attachments
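As a strawman for what such a per-fork stand-in cache could look like (the real
experiment is in the attached patch; everything here, including all names, is
invented for illustration, and callers are assumed to initialize the blocknum
slots to InvalidBlockNumber):

```c
#include "postgres.h"
#include "storage/block.h"
#include "storage/buf.h"

#define RECENT_BUFFER_SLOTS 8

/*
 * Toy per-fork cache of recently used block -> buffer mappings.  A real
 * version would want a cheaper replacement policy (or SIMD-assisted
 * lookups, as suggested above) and a way to age entries out.
 */
typedef struct RecentBufferCache
{
	BlockNumber blocknum[RECENT_BUFFER_SLOTS];
	Buffer		buffer[RECENT_BUFFER_SLOTS];
} RecentBufferCache;

/* Return a remembered buffer hint for blocknum, or InvalidBuffer. */
static inline Buffer
recent_buffer_lookup(RecentBufferCache *cache, BlockNumber blocknum)
{
	for (int i = 0; i < RECENT_BUFFER_SLOTS; i++)
	{
		if (cache->blocknum[i] == blocknum)
			return cache->buffer[i];
	}
	return InvalidBuffer;
}

/* Remember a mapping in the most-recent slot, shifting the rest down. */
static inline void
recent_buffer_remember(RecentBufferCache *cache, BlockNumber blocknum,
					   Buffer buffer)
{
	memmove(&cache->blocknum[1], &cache->blocknum[0],
			sizeof(BlockNumber) * (RECENT_BUFFER_SLOTS - 1));
	memmove(&cache->buffer[1], &cache->buffer[0],
			sizeof(Buffer) * (RECENT_BUFFER_SLOTS - 1));
	cache->blocknum[0] = blocknum;
	cache->buffer[0] = buffer;
}
```

A stale entry is harmless: ReadRecentBuffer() validates the hint, so the worst
case is falling back to the normal lookup path.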
Hi,

On 2023-06-30 14:13:11 +1200, Thomas Munro wrote:
> On Fri, Jun 30, 2023 at 3:39 AM Andres Freund <andres@anarazel.de> wrote:
> > I am wondering if we don't want something more generic than stashing this
> > in rd_amcache. Don't want to end up duplicating relevant code across the
> > uses of rd_amcache in every AM.
>
> I suppose we could try to track hot pages automatically.

I think that could be useful - but as a separate facility. The benefit of
stashing the root page buffer in the relcache is that it's practically free of
overhead and doesn't have complications from how many other intervening
accesses there are, etc.

I was more thinking of just making the relevant fields part of RelationData and
delegating the precise use to the individual AMs.

> > I do wonder if we should have an unlocked pre-check for a) the buffer
> > being valid and b) BufferTagsEqual() matching. With such a pre-check the
> > race for increasing the usage count of the wrong buffer is quite small;
> > without the pre-check it doesn't seem that small anymore.
>
> Yeah, that makes sense. Done in this version.

Cool.

Greetings,

Andres Freund
On Fri, 30 Jun 2023 at 07:43, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Fri, Jun 30, 2023 at 3:39 AM Andres Freund <andres@anarazel.de> wrote:
> > I am wondering if we don't want something more generic than stashing this
> > in rd_amcache. Don't want to end up duplicating relevant code across the
> > uses of rd_amcache in every AM.
>
> I suppose we could try to track hot pages automatically. Ants Aasma
> mentioned that he was working on a tiny SIMD-based LRU that could be useful
> for something like that, without being too slow. Just for
> fun/experimentation, here's a simple attempt to use a very stupid stand-in
> LRU to cache the most recent 8 lookups for each fork of each relation.
> Obviously that will miss every time for many workloads, so it'd have to be
> pretty close to free, and this code probably isn't good enough, but perhaps
> Ants knows how to sprinkle the right magic fairy dust on it. It should
> automatically discover things like root pages, the heap target block during
> repeat inserts, etc., and visibility map pages would stick around, and
> perhaps a few more pages of btrees that have a very hot key range (but not
> pgbench).
>
> > I do wonder if we should have an unlocked pre-check for a) the buffer
> > being valid and b) BufferTagsEqual() matching. With such a pre-check the
> > race for increasing the usage count of the wrong buffer is quite small;
> > without the pre-check it doesn't seem that small anymore.
>
> Yeah, that makes sense. Done in this version.

I have changed the status of the commitfest entry to "Waiting on Author", as
Andres's comments were not discussed further. Feel free to handle the comments
and change the status of the commitfest entry accordingly.

Regards,
Vignesh