Discussion: RFC: bufmgr locking changes
I've attached a (gzip'ed) patch that makes the following changes to
the buffer manager:
(1) Overhaul locking; whenever the code needs to modify the state
of an individual buffer, do synchronization via a per-buffer
"meta data lock" rather than the global BufMgrLock. For more
info on the motivation for this, see the several recent
-hackers threads on the subject.
(2) Reduce superfluous dirtying of pages: previously,
LockBuffer(buf, LW_EXCLUSIVE) would automatically dirty the
buffer. This behavior has been removed, to reduce the number
of pages that are dirtied. For more info on the motivation for
this, see the previous -hackers thread on the topic.
(3) Simplify the code in a bunch of places
This basically involved rewriting bufmgr.c (which makes the patch a
little illegible; it might be easier to just apply it and read through
the new code). That also means there are still a fair number of bugs
and unresolved issues.
The new locking works as follows:
- modifying the shared buffer table (buf_table.c) or making calls
into the freelist code (freelist.c) still requires holding the
BufMgrLock. There was some talk of trying to add more granular
locking to freelist.c itself: if there is still significant
contention for the BufMgrLock after these changes have been
applied, I'll take a look at doing that
- to modify a BufferDesc's meta data (shared refcount, flags, tag,
etc.), one must hold the buffer's "meta data lock". Also, I
removed the existing "io_in_progress" lock; instead, we now hold
the buffer's meta data lock while doing buffer I/O.
- if a backend holds the BufMgrLock, it will never try to
LWLockAcquire() a per-buffer meta data lock, due to the risk of
deadlock (and the loss in concurrency if we got blocked waiting
while still holding the BufMgrLock). Instead it does a
LWLockConditionalAcquire() and handles an acquisition failure
sanely (see the sketch after this list)
- if a backend holds the meta data lock for a buffer, it CAN
attempt to LWLockAcquire() the BufMgrLock. This is safe from
deadlocks, due to the previous paragraph.
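To make the locking rule concrete, here is a minimal sketch of the
conditional-acquire pattern. This is an illustration, not the patch
itself: `meta_lock` is a stand-in for whatever the patch names the
per-buffer lock, and the failure path is deliberately elided.

    BufferDesc *buf;
    BufferTag   tag;        /* already filled in by the caller */

    LWLockAcquire(BufMgrLock, LW_EXCLUSIVE);
    buf = BufTableLookup(&tag);         /* shared buffer table lookup */

    if (buf != NULL &&
        !LWLockConditionalAcquire(buf->meta_lock, LW_EXCLUSIVE))
    {
        /*
         * The per-buffer lock is busy.  Never sleep on it while holding
         * BufMgrLock: drop the global lock first, then wait and retry.
         * (Handling this failure "sanely" is where the hard part lives.)
         */
        LWLockRelease(BufMgrLock);
        /* ... wait for buf->meta_lock, revalidate the buffer, retry ... */
    }
    else if (buf != NULL)
    {
        /* Got the per-buffer lock without blocking; it is now safe to
         * update the buffer's shared state. */
        buf->refcount++;
        LWLockRelease(buf->meta_lock);
        LWLockRelease(BufMgrLock);
    }
    else
        LWLockRelease(BufMgrLock);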
The code is still a work-in-progress (running `pgbench -s 30 -c 20 -t
1000` bails out pretty quickly, for example), but any comments on
whether this scheme is in-theory correct would be very welcome.
For #2, my understanding of the existing XLOG code is incomplete, so
perhaps someone can tell me if I'm on the right track. I've modified a
few of the XLogInsert() call sites so that they now:
- acquire an exclusive buffer cntx_lock
- modify the content of the page/buffer
- WriteNoReleaseBuffer() to mark the buffer as dirty; since we
still hold the buffer's cntx_lock, a checkpoint process can't
write this page to disk yet. This replaces the implicit "mark
this page as dirty" behavior of LockBuffer()
- do the XLogInsert()
- release the cntx_lock
- ReleaseBuffer() to decrement the buffer's pincount
For example, sequence.c now works like this (I've probably missed a
few places) -- is this correct, and/or is there a better way to do it?
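In code, that ordering would look roughly like the following. This is a
sketch of the sequence described above rather than the actual patch;
the `rdata` setup and the rmgr arguments are placeholders.

    Buffer      buf;        /* pinned by an earlier ReadBuffer() */
    Page        page;
    XLogRecPtr  recptr;
    XLogRecData rdata[1];   /* placeholder WAL record chain */

    LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);     /* cntx_lock; no longer
                                                 * dirties the buffer */
    page = BufferGetPage(buf);

    /* ... modify the page contents ... */

    WriteNoReleaseBuffer(buf);      /* mark dirty; since we still hold the
                                     * cntx_lock, a checkpoint can't write
                                     * the page out from under us */

    recptr = XLogInsert(RM_SEQ_ID, XLOG_SEQ_LOG, rdata);
    PageSetLSN(page, recptr);       /* standard WAL rule: stamp the page */

    LockBuffer(buf, BUFFER_LOCK_UNLOCK);        /* release the cntx_lock */
    ReleaseBuffer(buf);                         /* drop the pin */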
Notes and caveats:
- Remove SetBufferCommitInfoNeedsSave(). AFAICS, this is now
completely equivalent to WriteNoReleaseBuffer(), so I just removed
the former and replaced all the calls to it with calls to the latter.
- Moved the code for flushing a dirty buffer to disk to buf_io.c
- Moved UnpinBuffer() and PinBuffer() to bufmgr.c, from freelist.c
- Removed the following now-unused buffer flag bits: BM_VALID,
BM_DELETED, BM_JUST_DIRTIED, BM_IO_IN_PROGRESS, and shrunk the
'flags' field of the BufferDesc down to 8 bits (from 16)
- Removed the cntxDirty field from the BufferDesc: now that we don't
need to acquire the BufMgrLock to modify the buffer's flags, there's
no reason to keep this around
- Make 'PrivateRefCount' an array of uint16s, rather than longs. This
saves 2 bytes * shared_buffers per backend on 32-bit machines and 6
bytes * shared_buffers per backend on some 64-bit machines. It means
a given backend can only pin a single buffer 65,536 times, but that
should be more than enough. Similarly, made LocalRefCount an array
of uint16s.
I was thinking of adding overflow checking to the refcount increment
code to make sure we fail safely if a backend *does* try to exceed
this number of pins, but I can't imagine a scenario when it would
actually occur, so I haven't bothered. (A sketch of such a check
appears after these notes.)
- Remove the BufferLocks array. AFAICS this wasn't actually necessary.
- A few loose ends in the code still need to be wrapped up (for
example, I need to take another glance at the pin-waiter stuff, and
the error handling still needs some more work), but I think most of
the functionality is there. Areas of concern are denoted by 'XXX'
comments.
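As an aside, the overflow check mentioned in the notes above could be
as simple as the following sketch; the error message and the `b` index
variable are made up for illustration.

    /* Hypothetical guard in the pin path: fail safely rather than
     * letting a uint16 refcount silently wrap around. */
    if (PrivateRefCount[b] == 0xFFFF)
        elog(ERROR, "too many pins on buffer %d held by this backend",
             b + 1);
    PrivateRefCount[b]++;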
Any comments?
-Neil
On Wed, Jan 07, 2004 at 05:37:09PM -0500, Neil Conway wrote:
> - if a backend holds the BufMgrLock, it will never try to
> LWLockAcquire() a per-buffer meta data lock, due to the risk of
> deadlock (and the loss in concurrency if we got blocked waiting
> while still holding the BufMgrLock). Instead it does a
> LWLockConditionalAcquire() and handles an acquisition failure
> sanely
I can only see this working if you always check that you actually got
the right buffer after you lock the meta data. Nothing prevents another
backend from reusing the buffer for something else between the time you
release the BufMgrLock and the time you lock the buffer's meta data.
Note that your PinBuffer() is now always called when you already hold
the meta data lock, which is probably a good idea, or you're going to
make that function more complex than it should be. Just document that
the caller should hold the meta data lock instead of the BufMgrLock
when it's called.
Kurt
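For what it's worth, the recheck and the PinBuffer() convention Kurt
describes might look like this; `meta_lock` and `BufferTagsMatch()` are
stand-in names, not code from the patch.

    /*
     * After blocking for the meta data lock without BufMgrLock held,
     * the buffer may have been recycled for a different page, so
     * revalidate its tag before using it:
     */
    LWLockAcquire(buf->meta_lock, LW_EXCLUSIVE);
    if (!BufferTagsMatch(&buf->tag, &tag))  /* stand-in tag comparison */
    {
        LWLockRelease(buf->meta_lock);
        goto retry;                 /* start over from the table lookup */
    }
    PinBuffer(buf);
    LWLockRelease(buf->meta_lock);

PinBuffer() itself can then stay trivial, because its callers already
hold the lock:

    static void
    PinBuffer(BufferDesc *buf)
    {
        /* Caller must hold buf->meta_lock (not the BufMgrLock). */
        buf->refcount++;
        PrivateRefCount[buf->buf_id]++;
    }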
Neil Conway <neilc@samurai.com> writes:
> - to modify a BufferDesc's meta data (shared refcount, flags, tag,
> etc.), one must hold the buffer's "meta data lock". Also, I
> removed the existing "io_in_progress" lock; instead, we now hold
> the buffer's meta data lock while doing buffer I/O.
The latter is a really bad idea IMHO. The io_in_progress lock can be
held for eons (in CPU terms) and should not be blocking people who
simply want to bump their refcount up and down. In particular, there is
no reason for an in-process write to block people who want read-only
access to the buffer.
It's possible that you could combine the io_in_progress lock with the
cntx_lock, but I doubt locking out access to the metadata during i/o
is a good idea.
> - if a backend holds the BufMgrLock, it will never try to
> LWLockAcquire() a per-buffer meta data lock, due to the risk of
> deadlock (and the loss in concurrency if we got blocked waiting
> while still holding the BufMgrLock). Instead it does a
> LWLockConditionalAcquire() and handles an acquisition failure
> sanely
Hmm, what's "sanely" exactly? It seems to me that figuring out how to
not need to lock in this direction is a critical part of the redesign,
so you can't just handwave here and expect people to understand.
> - Remove SetBufferCommitInfoNeedsSave(). AFAICS, this is now
> completely equivalent to WriteNoReleaseBuffer(), so I just removed
> the former and replaced all the calls to it with calls to the latter.
The reason I've kept the separation was as a form of documentation as to
the reason for each write. Although they currently do the same thing,
that might not always be true. I'd prefer not to eliminate the
distinction from the source code --- though I'd not object if you want
to make SetBufferCommitInfoNeedsSave a macro that invokes the other
routine.
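Concretely, that could be as small as the following macro (Neil adopts
exactly this in his reply further down):

    /* Preserve the self-documenting name at zero runtime cost; both
     * operations currently do the same thing. */
    #define SetBufferCommitInfoNeedsSave(buffer) WriteNoReleaseBuffer(buffer)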
> - Removed the following now-unused buffer flag bits: BM_VALID,
> BM_DELETED, BM_JUST_DIRTIED, BM_IO_IN_PROGRESS, and shrunk the
> 'flags' field of the BufferDesc down to 8 bits (from 16)
I cannot believe that it's workable to remove BM_JUST_DIRTIED. How are
you handling the race conditions that this was for? I'm also
unconvinced that removing BM_IO_IN_PROGRESS is a good idea.
> - Make 'PrivateRefCount' an array of uint16s, rather than longs. This
> saves 2 bytes * shared_buffers per backend on 32-bit machines and 6
> bytes * shared_buffers per backend on some 64-bit machines. It means
> a given backend can only pin a single buffer 65,536 times, but that
> should be more than enough. Similarly, made LocalRefCount an array
> of uint16s.
I think both of these are ill-considered micro-optimization. How do you
know that the pin count can't exceed 64K? Consider recursive plpgsql
functions for a likely counterexample.
> I was thinking of adding overflow checking to the refcount increment
> code to make sure we fail safely if a backend *does* try to exceed
> this number of pins, but I can't imagine a scenario when it would
> actually occur, so I haven't bothered.
Leaving the arrays as longs is a much saner approach IMHO.
> - Remove the BufferLocks array. AFAICS this wasn't actually necessary.
Please put that back. It is there to avoid unnecessary acquisitions of
buffer locks during UnlockBuffers (which is executed during any
transaction abort). Without it, you will be needing to lock every
single buffer during an abort in order to check its flags.
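To illustrate the point, a rough sketch of abort-time cleanup with the
BufferLocks array kept (loosely modeled on the existing UnlockBuffers();
the per-buffer cleanup itself is elided):

    void
    UnlockBuffers(void)
    {
        int         i;

        for (i = 0; i < NBuffers; i++)
        {
            bits8   buflocks = BufferLocks[i];  /* backend-local array:
                                                 * no lock needed to read */

            if (buflocks == 0)
                continue;       /* this backend never locked the buffer */

            /*
             * Only for the (few) buffers this backend actually holds
             * locks on do we pay for acquiring the buffer's lock to
             * clean up its flags (e.g. the pin-count-waiter bit).
             */
            /* ... per-buffer cleanup goes here ... */

            BufferLocks[i] = 0;
        }
    }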
regards, tom lane
(Sorry Tom, I was meaning to reply to you once I'd had a chance to
revise the bufmgr patch; since that seems a fair ways off, I figured it
would be better to respond now.)
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Neil Conway <neilc@samurai.com> writes:
>> we now hold the buffer's meta data lock while doing buffer I/O.
>
> The latter is a really bad idea IMHO. The io_in_progress lock can be
> held for eons (in CPU terms) and should not be blocking people who
> simply want to bump their refcount up and down.
My reasoning was that the contention for the per-buffer meta data lock
should be pretty low. The io_in_progress lock is held when we're
either faulting a page in or flushing a page out. In the first case,
we can't actually use the buffer no matter how we do the locking (its
content is incomplete), so there's no effective loss in
concurrency. In the second case, what kinds of concurrent activity can
we allow on the buffer? (We can allow reads, of course, but I don't
believe we can allow writes.) However, I'll think some more on this;
you (and Jan, who raised this point a while ago via IRC) are probably
correct.
> It's possible that you could combine the io_in_progress lock with the
> cntx_lock
Yeah, that's a possibility.
>> - Remove SetBufferCommitInfoNeedsSave(). AFAICS, this is now
>> completely equivalent to WriteNoReleaseBuffer(), so I just removed
>> the former and replaced all the calls to it with calls to the latter.
>
> The reason I've kept the separation was as a form of documentation as
> to the reason for each write. Although they currently do the same
> thing, that might not always be true. I'd prefer not to eliminate the
> distinction from the source code --- though I'd not object if you
> want to make SetBufferCommitInfoNeedsSave a macro that invokes the
> other routine.
Ok, fair enough -- I've changed SetBufferCommitInfoNeedsSave() to be a
macro for WriteNoReleaseBuffer().
>> - Make 'PrivateRefCount' an array of uint16s, rather than longs. This
>> saves 2 bytes * shared_buffers per backend on 32-bit machines and 6
>> bytes * shared_buffers per backend on some 64-bit machines. It means
>> a given backend can only pin a single buffer 65,536 times, but that
>> should be more than enough. Similarly, made LocalRefCount an array
>> of uint16s.
>
> I think both of these are ill-considered micro-optimization. How do
> you know that the pin count can't exceed 64K? Consider recursive
> plpgsql functions for a likely counterexample.
Fair enough -- I couldn't conceive of an actual scenario in which a
single backend would acquire > 64K pins on a single buffer, but I'll
take your word that it's not so far fetched. However, there is still
room for improvement, IMHO: on a machine with 64-bit longs, we're
plainly allocating 4 bytes more than we need to. Any objection if I
change these to arrays of int32?
> Please put that back. It is there to avoid unnecessary acquisitions
> of buffer locks during UnlockBuffers (which is executed during any
> transaction abort). Without it, you will be needing to lock every
> single buffer during an abort in order to check its flags.
It seems bizarre that we need to iterate through a few thousand array
elements just to do some lock cleanup. I'll take a closer look at
this...
-Neil
Neil Conway <neilc@samurai.com> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> The latter is a really bad idea IMHO. The io_in_progress lock can be
>> held for eons (in CPU terms) and should not be blocking people who
>> simply want to bump their refcount up and down.
> My reasoning was that the contention for the per-buffer meta data lock
> should be pretty low. The io_in_progress lock is held when we're
> either faulting a page in or flushing a page out. In the first case,
> we can't actually use the buffer no matter how we do the locking (its
> content is incomplete), so there's no effective loss in
> concurrency. In the second case, what kinds of concurrent activity can
> we allow on the buffer? (We can allow reads, of course, but I don't
> believe we can allow writes.)
True, there's no win in the read-busy case, but I think you
underestimate the value of the write-busy case. Multiple concurrent
readers are a very important consideration. In Postgres it is possible
for a reader to cause a write to occur (because it sets commit hint
bits, as per the SetBufferCommitInfoNeedsSave() business), and so you
could have a situation like
Reader pins page
Reader examines some tuples
Reader sets a commit bit and dirties page
... Writer starts write ...
Reader examines some more tuples
Reader unpins page
Writer finishes write
If the reader can't unpin until the writer is done, then we will have
foreground readers blocked on the background writer process, which is
exactly what we do not want.
>> I think both of these are ill-considered micro-optimization. How do you
>> know that the pin count can't exceed 64K? Consider recursive plpgsql
>> functions for a likely counterexample.
> Fair enough -- I couldn't conceive of an actual scenario in which
> a single backend would acquire > 64K pins on a single buffer, but I'll
> take your word that it's not so far fetched. However, there is still
> room for improvement, IMHO: on a machine with 64-bit longs, we're
> plainly allocating 4 bytes more than we need to. Any objection if I
> change these to arrays of int32?
That seems like a reasonable compromise.
regards, tom lane