Discussion: Buffer locking is special (hints, checksums, AIO writes)
Hi,

I'm working on making bufmgr.c ready for AIO writes. That requires some infrastructure changes that I wanted to discuss... In this email I'm trying to outline the problems and what I currently think we should do.

== Problem 1 - hint bits ==

Currently pages that are being written out are copied if checksums are enabled. The copy is needed because hint bits can be set with just a share lock, and writing out a buffer also only requires a share lock. If we didn't copy the buffer, a hint bit set concurrently with the write could falsify the computed checksum.

Even without checksums, hint bits being set while IO is ongoing is an issue: e.g. btrfs assumes that pages do not change while being written out with direct-io, and corrupts itself if they do [1].

The copies we make to avoid checksum failures are not at all cheap [2], but they are particularly problematic for AIO, where a lot of buffers can undergo IO at the same time. For AIO the issue is that the longer-lived bounce buffers that I had initially prototyped end up using a lot of memory, making it hard to figure out what defaults to set.

In [2] I had worked on an approach to avoid copying pages during writes. It avoided the need to copy buffers by not allowing hint bits to be set while IO is ongoing. It did so by skipping hint bit writes while IO is ongoing (plus a fair bit of infrastructure to make that cheap enough).

In that thread Heikki wondered whether we shouldn't instead go for a more fundamental solution to the problem, like introducing a separate lock level that prevents concurrent modifications but allows share lockers. At the time I was against that, because it seemed like a large project. However, since then I realized there's an architecturally related second issue:

== Problem 2 - AIO writes vs exclusive locks ==

Separate from the hint bit issue, there is a second issue that I didn't have a good answer for: making the acquisition of an exclusive lock concurrency safe in the presence of asynchronous writes.

The problem is that while a buffer is being written out, it obviously has to be share locked. That's true even with AIO. With AIO the share lock is held until the IO is completed. The problem is that if a backend wants to exclusively lock a buffer undergoing AIO, it can't just wait for the content lock as today, it might have to actually reap the IO completion from the operating system. If one just were to wait for the content lock, there's no forward progress guarantee.

The buffer's state "knows" that it's undergoing write IO (BM_VALID and BM_IO_IN_PROGRESS are set). To ensure a forward progress guarantee, an exclusive locker needs to wait for the IO (pgaio_wref_wait(BufferDesc->io_wref)). The problem is that it's surprisingly hard to do so race free:

If a backend A were to just check whether a buffer is undergoing IO before locking it, a backend B could start IO on the buffer between A checking for BM_IO_IN_PROGRESS and acquiring the content lock. We obviously can't just hold the buffer header spinlock across a blocking lwlock acquisition. (A small illustrative sketch of this race follows at the end of this mail.)

There potentially are ways to synchronize the buffer state and the content lock, but it requires deep integration between bufmgr.c and lwlock.c.

== Problem 3 - Cacheline contention ==

This is unrelated to AIO, but might influence the architecture for potential solutions. It might make sense to skip over this section on a quicker read-through.

The problem is that in some workloads the cacheline containing the BufferDesc becomes very hot.
The most common case are workloads with lots of index nested loops, where the root page and some of the other inner pages of the index can become very contended. I've seen cases where running the same workload in four separate copies in the same postgres instance yields ~8x the throughput - and that was on a machine with forty cores; there are machines with many more these days [4].

There are a few main issues:

a) We manipulate the same cache line multiple times.

E.g. btree always first pins the page and then locks the page, which performs two atomic operations on the same cacheline. I've experimented with putting the content lock on a separate cacheline, but that causes regressions in other cases. Leaving nontrivial implementation issues aside, we could release the lock and pin at the same time. With a bit more difficulty we could do the same for pin and lock acquisition. nbtree almost always does the two together, so it'd not be hard to make it benefit from such an optimization.

b) Many operations, like unpinning a buffer, need to use CAS instead of an atomic subtraction.

Atomic add/sub scales a *lot* better than compare-and-swap, as there is no need to retry. With increased contention the number of retries increases further and further.

The reason we need to use CAS in so many places is the following: historically postgres' spinlocks don't use an atomic operation to release the spinlock on x86. When making buffer header locks their own thing, I carried that forward, to avoid performance regressions. Because of that, BufferDesc.state may not be modified while the buffer header spinlock is held - which is incompatible with using atomic-sub. However, since then we converted most of the "hot" uses of buffer header spinlocks into CAS loops (see e.g. PinBuffer()). That makes it feasible to use an atomic operation for the buffer header lock release, which in turn allows e.g. unpinning with an atomic sub.

c) Read accesses to the BufferDesc cause contention

Some code, like nbtree, relies on functions like BufferGetBlockNumber(). Unfortunately that contends with concurrent modifications of the buffer descriptor (like pinning). Potential solutions are to rely less on functions like BufferGetBlockNumber() or to split out the memory for that into a separate (denser?) array.

d) Even after addressing all of the above, there's still a lot of contention

I think the solution here would be something roughly analogous to fastpath locks. If a buffer is very contended, we can mark it as super-pinned & share locked, avoiding any atomic operation on the buffer descriptor itself. Instead the current lock and pincount would be stored in each backend's PGPROC. Obviously evicting or exclusively-locking such a buffer would be a lot more expensive.

I've prototyped it and it helps a *lot*. The reason I mention this here is that this seems impossible to do while using the generic lwlocks for the content lock.

== Solution ? ==

My conclusion from the above is that we ought to:

A) Make Buffer Locks something separate from lwlocks

As part of that, introduce a new lock level in-between share and exclusive locks (provisionally called share-exclusive, but I hate it). The new lock level allows other share lockers, but can only be held by one backend.
This allows changing the rules so that:

1) Share lockers are not allowed to modify buffers anymore
2) Hint bits need the new lock mode (conditionally upgrading the lock in SetHintBits())
3) Write IO needs the new lock level

This addresses 1) from above.

B) Merge BufferDesc.state and the content lock

This allows addressing 2) from above, as we can now atomically check whether IO was concurrently initiated. Obviously BufferDesc.state is not currently wide enough, therefore the buffer state has to be widened to a 64-bit variable.

C) Allow some modifications of BufferDesc.state while holding spinlock

Just naively doing the above two things reduces scalability, as the likelihood of CAS failures increases due to the increased number of modifications of the same atomic variable. However, by allowing unpinning while the buffer header spinlock is held, scalability improves considerably in my tests. Doing so requires changing all uses of LockBufHdr(), but by introducing a static inline helper the complexity can largely be encapsulated.

I've prototyped the above. The current state is pretty rough, but before I spend the non-trivial time to turn it into an understandable sequence of changes, I wanted to get feedback. Does this plan sound reasonable?

The hardest part about this change is that everything kind of depends on each other. The changes are large enough that they clearly can't just be committed at once, but doing them over time risks [temporary] performance regressions.

The order of changes I think makes the most sense is the following:

1) Allow some modifications while holding the buffer header spinlock

2) Reduce buffer pin with just an atomic-sub

   This needs to happen first, otherwise there are performance regressions
   during the later steps.

3) Widen BufferDesc.state to 64 bits

4) Implement buffer locking inside BufferDesc.state

5) Do IO while holding share-exclusive lock and require all buffer
   modifications to at least hold share exclusive lock

6) Wait for AIO when acquiring an exclusive content lock

(some of these will likely have parts of their own, but that's details)

Sane?

DOES ANYBODY HAVE A BETTER NAME THAN SHARE-EXCLUSIVE???!?

Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/CA%2BhUKGKSBaz78Fw3WTF3Q8ArqKCz1GgsTfRFiDPbu-j9OFz-jw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf%40gcnactj4z56m
[3] In some cases it causes slowdowns for checkpointer close to 50%!
[4] https://anarazel.de/talks/2024-10-23-pgconf-eu-numa-vs-postgresql/numa-vs-postgresql.pdf - slide 19
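To make the race in Problem 2 concrete, here is a purely illustrative fragment (not actual PostgreSQL source). The functions it calls - pg_atomic_read_u32(), pgaio_wref_wait(), BufferDescriptorGetContentLock(), LWLockAcquire() - exist in the tree, but the control flow is invented for illustration and is exactly the naive, racy ordering described above:

#include "postgres.h"
#include "storage/aio.h"
#include "storage/buf_internals.h"
#include "storage/lwlock.h"

static void
naive_exclusive_lock(BufferDesc *buf)
{
    uint32      state = pg_atomic_read_u32(&buf->state);

    /* if a write is in flight, reap it, so the IO's share lock gets released */
    if (state & BM_IO_IN_PROGRESS)
        pgaio_wref_wait(&buf->io_wref);

    /*
     * RACE: between the check above and the acquisition below, another
     * backend may set BM_IO_IN_PROGRESS and submit an asynchronous write,
     * with the content lock held in share mode until that IO is reaped.
     * We would then block here with no guarantee that anybody ever reaps
     * the completion - i.e. no forward progress guarantee.
     */
    LWLockAcquire(BufferDescriptorGetContentLock(buf), LW_EXCLUSIVE);
}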
Hello, Andres!

Andres Freund <andres@anarazel.de>:
> As part of that, introduce a new lock level in-between share and exclusive
> locks (provisionally called share-exclusive, but I hate it). The new lock
> level allows other share lockers, but can only be held by one backend.
>
> This allows changing the rules so that:
>
> 1) Share lockers are not allowed to modify buffers anymore
> 2) Hint bits need the new lock mode (conditionally upgrading the lock in SetHintBits())
> 3) Write IO needs the new lock level

IIUC, it may be mapped to the existing locking system:

BUFFER_LOCK_SHARE     ----> AccessShareLock
new lock mode         ----> ExclusiveLock
BUFFER_LOCK_EXCLUSIVE ----> AccessExclusiveLock

So, it all may be named:

BUFFER_LOCK_ACCESS_SHARE
BUFFER_LOCK_EXCLUSIVE
BUFFER_LOCK_ACCESS_EXCLUSIVE

being more consistent with the table-level locking system.

Greetings,
Mikhail.
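For concreteness, rule (2) quoted above - conditionally upgrading the lock in SetHintBits() - might take roughly the following shape. This is a minimal sketch under assumptions: BufferLockUpgradeConditional() and BufferLockDowngrade() are hypothetical names for a buffer-lock API that does not exist yet, MarkBufferDirtyHint() is the existing function, and the real patch may structure this very differently:

#include "postgres.h"
#include "access/htup_details.h"
#include "storage/bufmgr.h"

/* hypothetical, not existing today: try to upgrade share -> share-exclusive */
extern bool BufferLockUpgradeConditional(Buffer buffer);
extern void BufferLockDowngrade(Buffer buffer);

static inline void
SetHintBitsSketch(HeapTupleHeader tuple, Buffer buffer, uint16 infomask)
{
    /* (the existing permanent-relation LSN interlock is elided here) */

    /*
     * We only hold a share lock.  If the self-exclusive level can't be
     * acquired right now - e.g. because the buffer is being written out -
     * just skip setting the hint bit; it's only a hint.
     */
    if (!BufferLockUpgradeConditional(buffer))
        return;

    tuple->t_infomask |= infomask;
    MarkBufferDirtyHint(buffer, true);

    BufferLockDowngrade(buffer);    /* back to plain share mode */
}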
On Fri, Aug 22, 2025 at 3:45 PM Andres Freund <andres@anarazel.de> wrote:
> My conclusion from the above is that we ought to:
>
> A) Make Buffer Locks something separate from lwlocks
> B) Merge BufferDesc.state and the content lock
> C) Allow some modifications of BufferDesc.state while holding spinlock

+1 to (A) and (B). No particular opinion on (C) but if it works well, great.

> The order of changes I think makes the most sense is the following:
>
> 1) Allow some modifications while holding the buffer header spinlock
> 2) Reduce buffer pin with just an atomic-sub
> 3) Widen BufferDesc.state to 64 bits
> 4) Implement buffer locking inside BufferDesc.state
> 5) Do IO while holding share-exclusive lock and require all buffer
>    modifications to at least hold share exclusive lock
> 6) Wait for AIO when acquiring an exclusive content lock

No strong objections. I certainly like getting to (5) and (6) and I think those are in the right order. I'm not sure about the rest. I thought (1) and (2) were the same change after reading your email; and it surprises me a little bit that (2) is separate from (4). But I'm sure you have a much better sense of this than I do.

> DOES ANYBODY HAVE A BETTER NAME THAN SHARE-EXCLUSIVE???!?

AFAIK "share exclusive" or "SX" is standard terminology. While I'm not wholly hostile to the idea of coming up with something else, I don't think our tendency to invent our own way to do everything is one of our better tendencies as a project.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
Hi,

On 2025-08-26 16:21:36 -0400, Robert Haas wrote:
> On Fri, Aug 22, 2025 at 3:45 PM Andres Freund <andres@anarazel.de> wrote:
> > My conclusion from the above is that we ought to:
> >
> > A) Make Buffer Locks something separate from lwlocks
> > B) Merge BufferDesc.state and the content lock
> > C) Allow some modifications of BufferDesc.state while holding spinlock
>
> +1 to (A) and (B). No particular opinion on (C) but if it works well, great.

Without it I see performance regressions, due to the increased rate of CAS failures caused by having more changes to one atomic variable :/

> > The order of changes I think makes the most sense is the following:
> >
> > 1) Allow some modifications while holding the buffer header spinlock
> > 2) Reduce buffer pin with just an atomic-sub
> > 3) Widen BufferDesc.state to 64 bits
> > 4) Implement buffer locking inside BufferDesc.state
> > 5) Do IO while holding share-exclusive lock and require all buffer
> >    modifications to at least hold share exclusive lock
> > 6) Wait for AIO when acquiring an exclusive content lock
>
> No strong objections. I certainly like getting to (5) and (6) and I
> think those are in the right order. I'm not sure about the rest.
> I thought (1) and (2) were the same change after reading your email

They are certainly related. I thought it'd make sense to split them as outlined above, as (1) is relatively verbose on its own, but far more mechanical.

> and it surprises me a little bit that (2) is separate from (4).

Without doing (2) first, I see performance/scalability regressions doing (4). Doing (3) without (2) also hurts...

> > DOES ANYBODY HAVE A BETTER NAME THAN SHARE-EXCLUSIVE???!?
>
> AFAIK "share exclusive" or "SX" is standard terminology. While I'm not
> wholly hostile to the idea of coming up with something else, I don't
> think our tendency to invent our own way to do everything is one of
> our better tendencies as a project.

I guess it bothers me that we'd use share-exclusive to mean the buffer can't be modified, but for real (vs. share, which does allow some modifications). But it's very well possible that there's no meaningfully better name, in which case we certainly shouldn't differ from what's somewhat commonly used.

Greetings,

Andres Freund
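The CAS-failure point is easy to demonstrate outside PostgreSQL. The following standalone C11 sketch (illustrative only, no PostgreSQL code) contrasts the two shapes of an unpin: the compare-and-swap loop has to retry whenever another process changed the state word in between, whereas an unconditional atomic subtraction never retries:

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define REFCOUNT_ONE 1u

/* roughly the shape of a CAS-based unpin: must retry on any concurrent change */
static void
unpin_cas(_Atomic uint32_t *state)
{
    uint32_t    old = atomic_load(state);

    while (!atomic_compare_exchange_weak(state, &old, old - REFCOUNT_ONE))
    {
        /* 'old' was refreshed by the failed CAS; retry - costly under contention */
    }
}

/* the atomic-sub variant: a single unconditional hardware operation */
static void
unpin_sub(_Atomic uint32_t *state)
{
    atomic_fetch_sub(state, REFCOUNT_ONE);
}

int
main(void)
{
    _Atomic uint32_t state = 2 * REFCOUNT_ONE;  /* pretend two pins are held */

    unpin_cas(&state);
    unpin_sub(&state);
    printf("refcount now %u\n", (unsigned) atomic_load(&state));
    return 0;
}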
On Fri, Aug 22, 2025 at 03:44:48PM -0400, Andres Freund wrote:
> I'm working on making bufmgr.c ready for AIO writes.

Nice!

> == Problem 2 - AIO writes vs exclusive locks ==
>
> Separate from the hint bit issue, there is a second issue that I didn't have a
> good answer for: making the acquisition of an exclusive lock concurrency safe
> in the presence of asynchronous writes.
>
> The problem is that while a buffer is being written out, it obviously has to
> be share locked. That's true even with AIO. With AIO the share lock is held
> until the IO is completed. The problem is that if a backend wants to
> exclusively lock a buffer undergoing AIO, it can't just wait for the content
> lock as today, it might have to actually reap the IO completion from the
> operating system. If one just were to wait for the content lock, there's no
> forward progress guarantee.
>
> The buffer's state "knows" that it's undergoing write IO (BM_VALID and
> BM_IO_IN_PROGRESS are set). To ensure a forward progress guarantee, an
> exclusive locker needs to wait for the IO (pgaio_wref_wait(BufferDesc->io_wref)).
> The problem is that it's surprisingly hard to do so race free:
>
> If a backend A were to just check whether a buffer is undergoing IO before
> locking it, a backend B could start IO on the buffer between A checking for
> BM_IO_IN_PROGRESS and acquiring the content lock. We obviously can't just
> hold the buffer header spinlock across a blocking lwlock acquisition.
>
> There potentially are ways to synchronize the buffer state and the content
> lock, but it requires deep integration between bufmgr.c and lwlock.c.

You may have considered and rejected simpler alternatives for (2) before picking the approach you go on to outline. Anything interesting? For example, I imagine these might work with varying degrees of inefficiency:

- Use LWLockConditionalAcquire() with some nonstandard waiting protocol when
  there's a non-I/O lock conflict.

- Take BM_IO_IN_PROGRESS before exclusive-locking, then release it.

> == Problem 3 - Cacheline contention ==

> c) Read accesses to the BufferDesc cause contention
>
> Some code, like nbtree, relies on functions like
> BufferGetBlockNumber(). Unfortunately that contends with concurrent
> modifications of the buffer descriptor (like pinning). Potential solutions
> are to rely less on functions like BufferGetBlockNumber() or to split out
> the memory for that into a separate (denser?) array.

Agreed. BufferGetBlockNumber() could even use a new local (non-shmem) data structure, since the buffer's mapping can't change until we unpin.

> d) Even after addressing all of the above, there's still a lot of contention
>
> I think the solution here would be something roughly analogous to fastpath
> locks. If a buffer is very contended, we can mark it as super-pinned & share
> locked, avoiding any atomic operation on the buffer descriptor itself.
> Instead the current lock and pincount would be stored in each backend's
> PGPROC. Obviously evicting or exclusively-locking such a buffer would be a
> lot more expensive.
>
> I've prototyped it and it helps a *lot*. The reason I mention this here is
> that this seems impossible to do while using the generic lwlocks for the
> content lock.

Nice.

On Tue, Aug 26, 2025 at 05:00:13PM -0400, Andres Freund wrote:
> On 2025-08-26 16:21:36 -0400, Robert Haas wrote:
> > On Fri, Aug 22, 2025 at 3:45 PM Andres Freund <andres@anarazel.de> wrote:
> > > The order of changes I think makes the most sense is the following:

No concerns so far.
I won't claim I can picture all the implications and be sure this is the right thing, but it sounds promising. I like your principle of ordering changes to avoid performance regressions.

> > > DOES ANYBODY HAVE A BETTER NAME THAN SHARE-EXCLUSIVE???!?

I would consider {AccessShare, Exclusive, AccessExclusive}. What the $SUBJECT proposal calls SHARE-EXCLUSIVE would become Exclusive. That has the same conflict matrix as the corresponding heavyweight locks, which seems good. I don't love our mode names, particularly ShareRowExclusive being unsharable. However, learning one special taxonomy is better than learning two.

> > AFAIK "share exclusive" or "SX" is standard terminology.

Can you say more about that? I looked around at https://google.com/search?q=share+exclusive+%22sx%22+lock but didn't find anything well-aligned with the proposal:

https://dev.mysql.com/doc/dev/mysql-server/latest//PAGE_LOCK_ORDER.html looked most relevant, but it doesn't give the big idea.

https://mysqlonarm.github.io/Understanding-InnoDB-rwlock-stats/ is less authoritative but does articulate the big idea, as "Shared-Exclusive (SX): offer write access to the resource with inconsistent read. (relaxed exclusive)." That differs from $SUBJECT semantics, in which SHARE-EXCLUSIVE can't see inconsistent reads.

https://docs.oracle.com/en/database/oracle/oracle-database/19/arpls/DBMS_LOCK.html has the term SX = "sub exclusive". I gather an SX lock on a table lets one do SELECT FOR UPDATE on that table (each row is the "sub" component being locked).

https://man.freebsd.org/cgi/man.cgi?query=sx_slock&sektion=9&format=html uses the term "SX", but it's more like our lwlocks. One acquires S or X, not blends of them.
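As an illustration of the backend-local data structure floated above for BufferGetBlockNumber(): something like the following could work for shared buffers, with the open question being exactly the cost of maintaining it at pin time. All names here are hypothetical, not proposed code:

#include "postgres.h"
#include "storage/block.h"
#include "storage/buf.h"
#include "storage/bufmgr.h"     /* NBuffers */
#include "utils/memutils.h"

/* one slot per shared buffer, backend-local; local buffers are ignored here */
static BlockNumber *LocalBlockNums = NULL;

static void
LocalBlockNumInit(void)
{
    LocalBlockNums = (BlockNumber *)
        MemoryContextAlloc(TopMemoryContext, sizeof(BlockNumber) * NBuffers);
}

/* called wherever a pin is taken and the buffer's tag is already known */
static inline void
LocalBlockNumRemember(Buffer buffer, BlockNumber blkno)
{
    LocalBlockNums[buffer - 1] = blkno;
}

/* only valid while we still hold a pin on 'buffer' */
static inline BlockNumber
LocalBufferGetBlockNumber(Buffer buffer)
{
    return LocalBlockNums[buffer - 1];
}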
On Tue, Aug 26, 2025 at 8:14 PM Noah Misch <noah@leadboat.com> wrote:
> > > AFAIK "share exclusive" or "SX" is standard terminology.
>
> Can you say more about that?

Looks like I was misremembering. I was thinking of Gray & Reuter, Transaction Processing: Concepts and Techniques, 1993. However, opening it up, I find that his vocabulary is slightly different. He offers the following six lock modes: IS, IX, S, SIX, Update, X. "I" means "intent" and acts as a modifier to the letter that follows. Hence, SIX means "a coarse-granularity shared lock with intent to set finer-granularity exclusive locks" (p. 408). His lock manager is hierarchical, so taking a SIX lock on a table means that you are allowed to read all the rows in the table, you are allowed to exclusive-lock individual rows as desired, and nobody else is allowed to exclusive-lock any rows in the table. It is compatible only with IS; that is, it does not preclude other people from share-locking individual rows (which might delay your exclusive locks on those rows). Since we don't have intent-locking in PostgreSQL, I think my brain mentally flattened this hierarchy down to S, X, SX, but that's not what he actually wrote.

His "Update" locks are also somewhat interesting: an update lock is exactly like an exclusive lock except that it permits PAST share-locks. You take an update lock when you currently need a share-lock but anticipate the possibility of needing an exclusive-lock. This is a deadlock avoidance strategy: updaters will take turns, and some of them will ultimately want exclusive locks and others won't, but they can't deadlock against each other as long as they all take "Update" locks initially and don't try to upgrade to that level later. An updater's attempt to upgrade to an exclusive lock can still be delayed by, or deadlock against, share lockers, but those typically won't try to upgrade to higher lock levels later.

If we were to use the existing PostgreSQL naming convention, I think I'd probably argue that the nearest parallel to this level is ShareUpdateExclusive: a self-exclusive lock level that permits ordinary table access to continue while blocking exclusive locks, used for an in-flight maintenance operation. But that's arguable, of course.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
Hi,

On 2025-08-26 17:14:49 -0700, Noah Misch wrote:
> On Fri, Aug 22, 2025 at 03:44:48PM -0400, Andres Freund wrote:
> > == Problem 2 - AIO writes vs exclusive locks ==
> >
> > Separate from the hint bit issue, there is a second issue that I didn't have a
> > good answer for: making the acquisition of an exclusive lock concurrency safe
> > in the presence of asynchronous writes.
> >
> > The problem is that while a buffer is being written out, it obviously has to
> > be share locked. That's true even with AIO. With AIO the share lock is held
> > until the IO is completed. The problem is that if a backend wants to
> > exclusively lock a buffer undergoing AIO, it can't just wait for the content
> > lock as today, it might have to actually reap the IO completion from the
> > operating system. If one just were to wait for the content lock, there's no
> > forward progress guarantee.
> >
> > The buffer's state "knows" that it's undergoing write IO (BM_VALID and
> > BM_IO_IN_PROGRESS are set). To ensure a forward progress guarantee, an
> > exclusive locker needs to wait for the IO (pgaio_wref_wait(BufferDesc->io_wref)).
> > The problem is that it's surprisingly hard to do so race free:
> >
> > If a backend A were to just check whether a buffer is undergoing IO before
> > locking it, a backend B could start IO on the buffer between A checking for
> > BM_IO_IN_PROGRESS and acquiring the content lock. We obviously can't just
> > hold the buffer header spinlock across a blocking lwlock acquisition.
> >
> > There potentially are ways to synchronize the buffer state and the content
> > lock, but it requires deep integration between bufmgr.c and lwlock.c.
>
> You may have considered and rejected simpler alternatives for (2) before
> picking the approach you go on to outline.

I tried a few things...

> Anything interesting?

Not really. The first one you propose is what I looked at for a while:

> For example, I imagine these might work with varying degrees of
> inefficiency:
>
> - Use LWLockConditionalAcquire() with some nonstandard waiting protocol when
>   there's a non-I/O lock conflict.

It's nontrivial to make this race free - the problem is the case where we *do* have to wait for an exclusive content lock. It's possible for the lwlock to be released by the owning backend and for IO to be started, after checking whether IO is in progress (after LWLockConditionalAcquire() had failed).

I came up with a complicated scheme, where setting IO in progress would afterwards wake up all lwlock waiters and all exclusive content lock waits were done with LWLockAcquireOrWait(). I think that was working - but it's also slower and seems really fragile and ugly.

> - Take BM_IO_IN_PROGRESS before exclusive-locking, then release it.

That just seems expensive. We could make it cheaper by doing it only if a LWLockConditionalAcquire() doesn't succeed. But it still seems not great. And it doesn't really help with addressing the "setting hint bits while IO is in progress" part...

> > == Problem 3 - Cacheline contention ==
>
> > c) Read accesses to the BufferDesc cause contention
> >
> > Some code, like nbtree, relies on functions like
> > BufferGetBlockNumber(). Unfortunately that contends with concurrent
> > modifications of the buffer descriptor (like pinning). Potential solutions
> > are to rely less on functions like BufferGetBlockNumber() or to split out
> > the memory for that into a separate (denser?) array.
>
> Agreed.
> BufferGetBlockNumber() could even use a new local (non-shmem) data
> structure, since the buffer's mapping can't change until we unpin.

Hm. I didn't think about a backend local datastructure for that, perhaps because it seems not cheap to maintain (both from a runtime and a space perspective).

If we store the read-only data for buffers separately from the read-write data, we could access that from backends without a lock, since it can't change with the buffer pinned.

One way to do that would be to maintain a back-pointer from the BufferDesc to the BufferLookupEnt, since the latter *already* contains the BufferTag. We probably don't want to add another indirection to the buffer mapping hash table, otherwise we could deduplicate the other way round and just put padding between the modified and read-only part of a buffer desc.

> On Tue, Aug 26, 2025 at 05:00:13PM -0400, Andres Freund wrote:
> > On 2025-08-26 16:21:36 -0400, Robert Haas wrote:
> > > On Fri, Aug 22, 2025 at 3:45 PM Andres Freund <andres@anarazel.de> wrote:
> > > > The order of changes I think makes the most sense is the following:
>
> No concerns so far. I won't claim I can picture all the implications and be
> sure this is the right thing, but it sounds promising. I like your principle
> of ordering changes to avoid performance regressions.

I suspect we'll have to merge this incrementally to stay sane. I don't want to end up with a period of worse performance due to this, as that could make it harder to evaluate other work.

> > > > DOES ANYBODY HAVE A BETTER NAME THAN SHARE-EXCLUSIVE???!?

> I would consider {AccessShare, Exclusive, AccessExclusive}.

One thing I forgot to mention is that with the proposed re-architecture in place, we could subsequently go further and make pinning just be a very lightweight lock level, instead of that being a separate dedicated infrastructure. One nice outgrowth of that would be that acquiring a cleanup lock would just be a real lock acquisition, instead of the dedicated limited machinery we have right now.

Which would leave us with:

- reference (pins today)
- share
- share-exclusive
- exclusive
- cleanup

This doesn't quite seem to map onto the heavyweight lock levels in a sensible way...

> What the $SUBJECT proposal calls SHARE-EXCLUSIVE would become Exclusive.

There are a few hundred references to the lock levels though, seems painful to rename them :(

> That has the same conflict matrix as the corresponding heavyweight locks,
> which seems good.

> I don't love our mode names, particularly ShareRowExclusive being
> unsharable.

I hate them with a passion :). Except for the most basic ones they just don't stay in my head for more than a few hours.

> However, learning one special taxonomy is better than learning two.

But that's fair.

Greetings,

Andres Freund
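One possible shape of the "separate read-only from read-write data" idea, purely as a sketch: the frequently modified state lives in one (padded) array, while the tag - which cannot change while a pin is held - lives in a read-mostly array that BufferGetBlockNumber() could read without contending with concurrent pins. The struct and field names below are illustrative, not the actual BufferDesc layout or any patch:

#include "postgres.h"
#include "port/atomics.h"
#include "storage/aio.h"            /* PgAioWaitRef */
#include "storage/buf_internals.h"  /* BufferTag */

/* hot part: one padded entry per buffer, frequently modified */
typedef struct BufferDescRW
{
    pg_atomic_uint64 state;         /* flags + pin count + lock state (widened) */
    int         wait_backend_pgprocno;
    PgAioWaitRef io_wref;           /* in-flight AIO, if any */
} BufferDescRW;

/* read-mostly part: stable while a pin is held, readable without atomics */
typedef struct BufferDescRO
{
    BufferTag   tag;                /* relation/fork/block of the cached page */
} BufferDescRO;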
On Wed, Aug 27, 2025 at 12:18:27PM -0400, Andres Freund wrote:
> On 2025-08-26 17:14:49 -0700, Noah Misch wrote:
> > On Fri, Aug 22, 2025 at 03:44:48PM -0400, Andres Freund wrote:
> > > == Problem 3 - Cacheline contention ==
> >
> > > c) Read accesses to the BufferDesc cause contention
> > >
> > > Some code, like nbtree, relies on functions like
> > > BufferGetBlockNumber(). Unfortunately that contends with concurrent
> > > modifications of the buffer descriptor (like pinning). Potential solutions
> > > are to rely less on functions like BufferGetBlockNumber() or to split out
> > > the memory for that into a separate (denser?) array.
> >
> > Agreed. BufferGetBlockNumber() could even use a new local (non-shmem) data
> > structure, since the buffer's mapping can't change until we unpin.
>
> Hm. I didn't think about a backend local datastructure for that, perhaps
> because it seems not cheap to maintain (both from a runtime and a space
> perspective).

Yes, paying off the cost of maintaining it could be tricky. It could be the kind of thing where the overhead loses at 10 cores and wins at 40 cores. It could also depend heavily on the workload's concurrent pins per backend.

> If we store the read-only data for buffers separately from the read-write
> data, we could access that from backends without a lock, since it can't change
> with the buffer pinned.

Good point. That alone may be enough of a win.

> One way to do that would be to maintain a back-pointer from the BufferDesc to
> the BufferLookupEnt, since the latter *already* contains the BufferTag. We
> probably don't want to add another indirection to the buffer mapping hash
> table, otherwise we could deduplicate the other way round and just put padding
> between the modified and read-only part of a buffer desc.

I think you're saying clients would save the back-pointer once and dereference it many times, with each dereference of a saved back-pointer avoiding a shmem read of BufferDesc.tag. Is that right?

> > On Tue, Aug 26, 2025 at 05:00:13PM -0400, Andres Freund wrote:
> > > On 2025-08-26 16:21:36 -0400, Robert Haas wrote:
> > > > On Fri, Aug 22, 2025 at 3:45 PM Andres Freund <andres@anarazel.de> wrote:
> > > > > DOES ANYBODY HAVE A BETTER NAME THAN SHARE-EXCLUSIVE???!?
> > > I would consider {AccessShare, Exclusive, AccessExclusive}.
>
> One thing I forgot to mention is that with the proposed re-architecture in
> place, we could subsequently go further and make pinning just be a very
> lightweight lock level, instead of that being a separate dedicated
> infrastructure. One nice outgrowth of that would be that acquiring a
> cleanup lock would just be a real lock acquisition, instead of the dedicated
> limited machinery we have right now.
>
> Which would leave us with:
> - reference (pins today)
> - share
> - share-exclusive
> - exclusive
> - cleanup
>
> This doesn't quite seem to map onto the heavyweight lock levels in a sensible
> way...

Could map it like this:

AccessShare          - pins today
RowShare             - check tuple visibility (BUFFER_LOCK_SHARE today)
Share                - set hint bits
ShareUpdateExclusive - clean/write out (borrowing Robert's idea)
Exclusive            - add tuples, change xmax, etc. (BUFFER_LOCK_EXCLUSIVE today)
AccessExclusive      - cleanup lock or evict the buffer

That has a separate level for hint bits vs. I/O, so multiple backends could set hint bits. I don't know whether the benchmarks would favor maintaining that distinction.

> What the $SUBJECT proposal calls SHARE-EXCLUSIVE would become Exclusive.
>
> There are a few hundred references to the lock levels though, seems painful to
> rename them :(

Yes, especially in comments and extensions. Likely more important than that for the long term, your latest proposal has the advantage of keeping short names for the most-commonly-referenced lock types. (We could keep BUFFER_LOCK_SHARE with the lower layers translating that into RowShare, but that weakens or eliminates the benefit of reducing what readers need to learn.) For what it's worth, 6 PGXN modules reference BUFFER_LOCK_SHARE and/or BUFFER_LOCK_EXCLUSIVE.

Compared to share-exclusive, I think I'd prefer a name that describes the use cases, "set-hints-or-write" (or separate "write" and "set-hints" levels). What do you think of that? I don't know whether that should win vs. names like ShareUpdateExclusive, though.
On Wed, Aug 27, 2025 at 12:18 PM Andres Freund <andres@anarazel.de> wrote:
> Which would leave us with:
> - reference (pins today)
> - share
> - share-exclusive
> - exclusive
> - cleanup
>
> This doesn't quite seem to map onto the heavyweight lock levels in a sensible
> way...

Could do: ACCESS SHARE, SHARE, SHARE UPDATE EXCLUSIVE, EXCLUSIVE, ACCESS EXCLUSIVE. I've always thought that a pin was a lot like an access share lock and a cleanup lock was a lot like an access exclusive lock. But then again, using the same terminology for two different things might be confusing.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
Hi,

On 2025-08-27 12:14:41 -0700, Noah Misch wrote:
> On Wed, Aug 27, 2025 at 12:18:27PM -0400, Andres Freund wrote:
> > One way to do that would be to maintain a back-pointer from the BufferDesc to
> > the BufferLookupEnt, since the latter *already* contains the BufferTag. We
> > probably don't want to add another indirection to the buffer mapping hash
> > table, otherwise we could deduplicate the other way round and just put padding
> > between the modified and read-only part of a buffer desc.
>
> I think you're saying clients would save the back-pointer once and dereference
> it many times, with each dereference of a saved back-pointer avoiding a shmem
> read of BufferDesc.tag. Is that right?

I was thinking that we'd not have BufferDesc.tag, instead storing it solely in BufferLookupEnt. To get the tag of a BufferDesc, you'd have to follow the back-reference every time.

But that's actually why it doesn't work - reading the back-reference pointer would have the same issue as just reading BufferDesc.tag...

> > > On Tue, Aug 26, 2025 at 05:00:13PM -0400, Andres Freund wrote:
> > > > On 2025-08-26 16:21:36 -0400, Robert Haas wrote:
> > > > > On Fri, Aug 22, 2025 at 3:45 PM Andres Freund <andres@anarazel.de> wrote:
> > > > > > DOES ANYBODY HAVE A BETTER NAME THAN SHARE-EXCLUSIVE???!?
> > > > > I would consider {AccessShare, Exclusive, AccessExclusive}.
> >
> > One thing I forgot to mention is that with the proposed re-architecture in
> > place, we could subsequently go further and make pinning just be a very
> > lightweight lock level, instead of that being a separate dedicated
> > infrastructure. One nice outgrowth of that would be that acquiring a
> > cleanup lock would just be a real lock acquisition, instead of the dedicated
> > limited machinery we have right now.
> >
> > Which would leave us with:
> > - reference (pins today)
> > - share
> > - share-exclusive
> > - exclusive
> > - cleanup
> >
> > This doesn't quite seem to map onto the heavyweight lock levels in a sensible
> > way...
>
> Could map it like this:
>
> AccessShare          - pins today
> RowShare             - check tuple visibility (BUFFER_LOCK_SHARE today)
> Share                - set hint bits
> ShareUpdateExclusive - clean/write out (borrowing Robert's idea)
> Exclusive            - add tuples, change xmax, etc. (BUFFER_LOCK_EXCLUSIVE today)
> AccessExclusive      - cleanup lock or evict the buffer

I tend to think that having things like RowShare for buffer locking is confusing enough that the similarity to the heavyweight locks wouldn't actually be a win...

> That has a separate level for hint bits vs. I/O, so multiple backends could
> set hint bits. I don't know whether the benchmarks would favor maintaining
> that distinction.

I don't think it would - I actually found multiple backends setting the same hint bits to *hurt* performance a bit. But what's more important, we don't have the space for it, I think. Every lock level that can be acquired multiple times needs a lock count of 18 bits, and we need to store the buffer state flags (10 bits). There's just not enough space in 64 bits to have three 18-bit counters (54 bits) as well as the flag bits, usage count, etc.

> Compared to share-exclusive, I think I'd prefer a name that describes the use
> cases, "set-hints-or-write" (or separate "write" and "set-hints" levels).

I would too, I just couldn't come up with something that conveys the meanings in a sufficiently concise way :)

> What do you think of that? I don't know whether that should win vs. names
> like ShareUpdateExclusive, though.

I think it'd be a win compared to the heavyweight lock names...
Greetings,

Andres Freund
On Wed, Aug 27, 2025 at 03:29:02PM -0400, Andres Freund wrote:
> On 2025-08-27 12:14:41 -0700, Noah Misch wrote:
> > On Wed, Aug 27, 2025 at 12:18:27PM -0400, Andres Freund wrote:
> > > > On Tue, Aug 26, 2025 at 05:00:13PM -0400, Andres Freund wrote:
> > > > > On 2025-08-26 16:21:36 -0400, Robert Haas wrote:
> > > > > > On Fri, Aug 22, 2025 at 3:45 PM Andres Freund <andres@anarazel.de> wrote:
> > > > > > > DOES ANYBODY HAVE A BETTER NAME THAN SHARE-EXCLUSIVE???!?

> > > Which would leave us with:
> > > - reference (pins today)
> > > - share
> > > - share-exclusive
> > > - exclusive
> > > - cleanup

> > Compared to share-exclusive, I think I'd prefer a name that describes the use
> > cases, "set-hints-or-write" (or separate "write" and "set-hints" levels).

Another name idea is "self-exclusive", to contrast with "exclusive" excluding all of (exclusive, self-exclusive, share). Fortunately, not much code will acquire this lock type. Hence, there's relatively little damage if the name is less obvious than older lock types or if the name changes later.
On Wed, Aug 27, 2025 at 10:03:08AM -0400, Robert Haas wrote:
> If we were to use the existing PostgreSQL naming convention, I think
> I'd probably argue that the nearest parallel to this level is
> ShareUpdateExclusive: a self-exclusive lock level that permits
> ordinary table access to continue while blocking exclusive locks, used
> for an in-flight maintenance operation. But that's arguable, of
> course.

ShareUpdateExclusive is a term that's been used for some time now and relates to knowledge that's quite spread across the tree, so it feels like a natural fit for the use-case described on this thread, as we'd want a self-conflicting lock. share-exclusive did not sound that bad to me, TBH, quite the contrary, when applied to buffer locking for AIO.

"intent" is also a word I've bumped into quite a lot while looking at naming conventions, but it is more related to the fact that a lock is going to be taken, which we don't really have here. So that feels off.

-- 
Michael
Hi,

On 2025-08-22 15:44:48 -0400, Andres Freund wrote:
> The hardest part about this change is that everything kind of depends on each
> other. The changes are large enough that they clearly can't just be committed
> at once, but doing them over time risks [temporary] performance regressions.
>
> The order of changes I think makes the most sense is the following:
>
> 1) Allow some modifications while holding the buffer header spinlock
>
> 2) Reduce buffer pin with just an atomic-sub
>
>    This needs to happen first, otherwise there are performance regressions
>    during the later steps.

Here are the first few cleaned-up patches implementing the above steps, as well as some cleanups. I included a commit from another thread, as it conflicts with these changes, and we really should apply it - and it's arguably required to make the changes viable, as it removes one more use of PinBuffer_Locked().

Another change included is to not return the buffer with the spinlock held from StrategyGetBuffer(), and instead pin the buffer in freelist.c. The reason for that is to remove the most common PinBuffer_Locked() call. By definition PinBuffer_Locked() will become a bit slower due to 0003. But even without 0003, 0002 is faster than master. And the previous approach also just seems pretty unclean. I don't love that it requires the new TrackNewBufferPin(), but I don't really have a better idea.

I invite particular attention to the commit message for 0003 as well as the comment changes in buf_internals.h within. I'm not sure I like the TODOs added in 0003 and removed in 0004, but squashing the changes doesn't really seem better either. (A conceptual sketch of the atomic buffer-header-lock release idea appears after the attachment list below.)

Greetings,

Andres Freund
Attachments
- v3-0001-Improve-ReadRecentBuffer-scalability.patch
- v3-0002-bufmgr-Don-t-lock-buffer-header-in-StrategyGetBuf.patch
- v3-0003-bufmgr-Allow-some-buffer-state-modifications-whil.patch
- v3-0004-bufmgr-Use-atomic-sub-or-for-unpinning-and-markin.patch
- v3-0005-bufmgr-fewer-calls-to-BufferDescriptorGetContentL.patch
- v3-0006-bufmgr-Introduce-FlushUnlockedBuffer.patch
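To illustrate the idea behind 0003/0004 - allowing some buffer state modifications while the buffer header spinlock is held - the key ingredient is that the header lock release itself becomes an atomic read-modify-write, so that concurrent unconditional fetch-sub/fetch-or operations on the same state word cannot be lost. The following is a conceptual sketch only, using existing atomics and buf_internals.h constants; the actual patches use different helpers (e.g. UnlockBufHdrExt()) and handle more state:

#include "postgres.h"
#include "port/atomics.h"
#include "storage/buf_internals.h"

/* release the buffer header lock atomically, returning the state at release */
static inline uint32
UnlockBufHdrAtomic(BufferDesc *desc)
{
    return pg_atomic_fetch_and_u32(&desc->state, ~BM_LOCKED);
}

/*
 * With the release above being atomic, an unpin can be an unconditional
 * subtraction, even if another backend currently holds the header lock -
 * no CAS retry loop needed.
 */
static inline void
UnpinWithSub(BufferDesc *desc)
{
    pg_atomic_fetch_sub_u32(&desc->state, BUF_REFCOUNT_ONE);
}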
Hi,

On 2025-09-15 19:05:37 -0400, Andres Freund wrote:
> On 2025-08-22 15:44:48 -0400, Andres Freund wrote:
> > The hardest part about this change is that everything kind of depends on each
> > other. The changes are large enough that they clearly can't just be committed
> > at once, but doing them over time risks [temporary] performance regressions.
> >
> > The order of changes I think makes the most sense is the following:
> >
> > 1) Allow some modifications while holding the buffer header spinlock
> >
> > 2) Reduce buffer pin with just an atomic-sub
> >
> >    This needs to happen first, otherwise there are performance regressions
> >    during the later steps.
>
> Here are the first few cleaned-up patches implementing the above steps, as
> well as some cleanups. I included a commit from another thread, as it
> conflicts with these changes, and we really should apply it - and it's
> arguably required to make the changes viable, as it removes one more use of
> PinBuffer_Locked().
>
> Another change included is to not return the buffer with the spinlock held
> from StrategyGetBuffer(), and instead pin the buffer in freelist.c. The reason
> for that is to remove the most common PinBuffer_Locked() call. By definition
> PinBuffer_Locked() will become a bit slower due to 0003. But even without 0003,
> 0002 is faster than master. And the previous approach also just seems pretty
> unclean. I don't love that it requires the new TrackNewBufferPin(), but I
> don't really have a better idea.
>
> I invite particular attention to the commit message for 0003 as well as the
> comment changes in buf_internals.h within.

Robert looked at the patches while we were chatting, and I addressed his feedback in this new version. Changes:

- Updated patch description for 0002, giving a lot more background

- Improved BufferDesc comments a fair bit more in 0003

- Reduced the size of 0003 a bit, by using UnlockBufHdrExt() instead of a CAS
  loop in buffer_stage_common() and reordering some things in
  TerminateBufferIO()

- Made 0004 only use the non-looping atomic op in UnpinBufferNoOwner(), not
  MarkBufferDirty(). I realized the latter would take additional complexity to
  make safe (a CAS loop in TerminateBufferIO()). I am somewhat doubtful that
  there are workloads where it matters...

Greetings,

Andres Freund
Attachments
- v4-0001-Improve-ReadRecentBuffer-scalability.patch
- v4-0002-bufmgr-Don-t-lock-buffer-header-in-StrategyGetBuf.patch
- v4-0003-bufmgr-Allow-some-buffer-state-modifications-whil.patch
- v4-0004-bufmgr-Use-atomic-sub-for-unpinning-buffers.patch
- v4-0005-bufmgr-fewer-calls-to-BufferDescriptorGetContentL.patch
- v4-0006-bufmgr-Introduce-FlushUnlockedBuffer.patch