Re: BUG #19006: Assert(BufferIsPinned) in BufferGetBlockNumber() is triggered for forwarded buffer

From: Thomas Munro
Subject: Re: BUG #19006: Assert(BufferIsPinned) in BufferGetBlockNumber() is triggered for forwarded buffer
Msg-id: CA+hUKG+gwez63UpPt1-u2rosh3rW3VeKdve23u-=fi9KWAyz5w@mail.gmail.com
In reply to: Re: BUG #19006: Assert(BufferIsPinned) in BufferGetBlockNumber() is triggered for forwarded buffer (Xuneng Zhou <xunengzhou@gmail.com>)
Responses: Re: BUG #19006: Assert(BufferIsPinned) in BufferGetBlockNumber() is triggered for forwarded buffer
Re: BUG #19006: Assert(BufferIsPinned) in BufferGetBlockNumber() is triggered for forwarded buffer
List: pgsql-bugs
On Fri, Aug 8, 2025 at 3:09 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> Could you please elaborate on it a little bit more? Thanks.

Sure.  It's a bit complicated and involves many moving parts, so let
me try to summarise the read_stream.c fast path, forwarded buffers,
and the motivation and complexities behind them, and then finally the
bug.  This email turned out longer than I feared, so sorry for the
wall of text, but it seemed useful to work through it with
illustrations, both to help more people understand this stuff and
contribute improvements, and to recheck my assumptions at the same
time.

New version with #ifdefs.  I believe the old arrangement qualified
for trivial dead-code elimination in optimized builds, but yeah, it's
still better this way, thanks all.  This is the version I plan to
push if there are no further comments.

Before digging further, a brief recap of read_stream.c and its fast path:

The goal is to be able to perform the equivalent of a stream of
ReadBuffer() calls while minimising system calls, IOPS and I/O stalls,
by looking a variable distance ahead in the block number stream,
combining consecutive blocks into a wider read operation according to
your io_combine_limit setting, and running some number of I/Os
asynchronously up to your {effective,maintenance}_io_concurrency
setting.  It starts off looking just one block into the future, but
adjusts the distance up and down according to the history of demand
and cache misses, as a prediction of future I/O needs.
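
To give a feel for the consumer side, here is a minimal sketch of a
stream user, assuming the public read_stream.h API; the callback, its
private state and the loop body are invented for illustration:

static BlockNumber
my_next_block(ReadStream *stream, void *callback_private_data,
              void *per_buffer_data)
{
    MyScanState *scan = callback_private_data; /* hypothetical state */

    if (scan->next_block >= scan->last_block)
        return InvalidBlockNumber;  /* no more blocks: end the stream */
    return scan->next_block++;
}

...

stream = read_stream_begin_relation(READ_STREAM_FULL, NULL, rel,
                                    MAIN_FORKNUM, my_next_block,
                                    scan, 0 /* no per-buffer data */);
while ((buf = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
{
    /* buf is pinned; use the page, then drop the pin */
    ReleaseBuffer(buf);
}
read_stream_end(stream);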

If the blocks are all cached, the lookahead distance is minimised and
it becomes very similar to a plain loop of ReadBuffer() calls, since
I/O, and thus I/O combining and concurrency, aren't predicted to be
useful.  There is a special fast path mode to squeeze a bit more
performance out of the extreme case of all-cached blocks, for streams
that don't use the facility for attaching optional per-buffer data to
each block.  It uses specialised code in read_stream.c and bufmgr.c
to minimise bookkeeping, branching and data cache misses from queue
maintenance and movement.
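
From memory, the entry test looks roughly like this (a paraphrase
with field names assumed close to ReadStream's, so check
read_stream.c for the authoritative version): the stream must be
completely quiescent, with exactly one pinned buffer and no lookahead
in flight:

/* Rough paraphrase of the fast path entry conditions. */
if (stream->ios_in_progress == 0 &&
    stream->forwarded_buffers == 0 &&
    stream->pinned_buffers == 1 &&
    stream->distance == 1 &&
    stream->pending_read_nblocks == 0 &&
    stream->per_buffer_data_size == 0)
    stream->fast_path = true;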

The bug involves multiple fast path transitions and short reads, so
next let me describe bufmgr.c's interface for combined and
asynchronous I/O and the need for short reads:

read_stream.c works by issuing overlapping pairs of calls to
StartReadBuffers() and WaitReadBuffers(), with stream->ios holding the
queue of IOs with a pending call to WaitReadBuffers().  The interface
of StartReadBuffers() is more complex than ReadBuffer() because it has
to cope with a lot of edge cases.  It takes a relation and fork, a
block number, an input/output nblocks count, and an input/output array
of buffers.  It returns true if you need to call WaitReadBuffers(),
meaning the operation wasn't completed immediately from cache.  The
reason nblocks is input/output is that the operation must sometimes be
"split", also known as a short read.

It splits IOs partly because we want StartReadBuffers() to start
exactly zero or one IOs, and the caller would lose control of the I/O
depth if StartReadBuffers() were capable of starting multiple IOs.  We
also want to avoid some avoidable waits.  Here are some reasons for
splits:

 1. The leading block is BM_VALID, so we choose to give it to you
immediately and not look further (we could look further and return
more than one consecutive BM_VALID block at a time, but this isn't
implemented)
 2. An internal block is BM_VALID, so any non-BM_VALID blocks either
side of it can't be combined into one operation
 3. An internal block is BM_IO_IN_PROGRESS, meaning that another
backend is reading or writing it
 4. An internal or final block is in a different 1GB segment file

(Reason 3 will usually turn into reason 2 eventually, or the other
backend's IO attempt might in theory fail and then it'll be this
backend's turn to try, and there are some slightly complicated rules
about whether we have to wait for that or split the IO here, for
deadlock avoidance and progress guarantees.)
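
To make reason #4 concrete: with the default 8kB block size a
segment holds RELSEG_SIZE = 131072 blocks, so a combined read must be
clamped at each multiple of that (a standalone example; numbers
assume default build options):

#include <stdio.h>

#define RELSEG_SIZE 131072      /* blocks per 1GB segment, default build */

int
main(void)
{
    unsigned start = 131070;    /* read wants blocks 131070..131073 */
    unsigned nblocks = 4;
    unsigned segment_end = (start / RELSEG_SIZE + 1) * RELSEG_SIZE;

    if (start + nblocks > segment_end)
        nblocks = segment_end - start;  /* short read: only 2 blocks */

    printf("read %u blocks starting at %u\n", nblocks, start);
    return 0;
}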

Forwarded buffers are an artefact of the short reads.  Let me
summarize what changed in v18:

In v17, we didn't have real AIO yet, just I/O combining and pseudo-AIO
with POSIX_FADV_WILLNEED, so StartReadBuffers() pinned buffers and
possibly shortened the read for reasons #1 and #2.  This meant that it
pinned at most one extra buffer that terminated the operation, and it
used a bit of a kludge to deal with that: for reason #1 you get the
buffer, and for reason #2, it pretended the block was part of the IO
but just didn't include it in the actual read operation.  This was
simple, but not scalable to later work.  Then WaitReadBuffers() tried
to acquire the BM_IO_IN_PROGRESS flags for all the blocks, and "hid"
all four split reasons by looping: it skipped buffers that had become
BM_VALID in the meantime (someone else read them) and waited for
BM_IO_IN_PROGRESS to become BM_VALID (or not if another backend's read
failed and this backend could now try), and then md.c internally
looped to handle reason #4.  WaitReadBuffers() *had* to acquire
BM_IO_IN_PROGRESS in this second phase when I/O was really synchronous
and StartReadBuffers() was just feeding advice to the kernel: if we
acquired it sooner then other backends might wait for this backend to
reach WaitReadBuffers(), which could deadlock, and we had no
infrastructure to deal with that.  In other words, I/Os were already
split for all the same reasons, but #3 and #4 were handled at the
later WaitReadBuffers() stage and in secret.

In v18, the handling of reasons #3 and #4 moved into
StartReadBuffers().  By definition we have to try to acquire
BM_IO_IN_PROGRESS at that earlier stage now, because the IO really is
in progress, and the deadlock risk is gone, because any backend can
drive an IO started by any other backend to completion using the
associated PgAioHandle.

This increased the pin management problem: StartReadBuffers() can
acquire a lot of pins, up to 128 (io_combine_limit=1MB) and then
discover a reason to split the IO.  So what to do with all those
pins?  A simple answer would be: unpin all the buffers after the
split point.

We didn't want to do that, because read_stream.c is almost certainly
going to call StartReadBuffers() for the rest of the blocks (and
possibly more that it now wants after those ones).  So we wanted a way
to transfer them from one StartReadBuffers() call to the following
one, avoiding an unpin-and-repin dance.  This was quite a difficult
API design question, and this isn't the only design we considered,
but we settled on making the buffer array that you give to
StartReadBuffers() an input/output array for transporting these extra
pinned buffers between calls, while in v17 it was an output-only
array.  In v18, you
have to fill it with InvalidBuffer initially.  If StartReadBuffers()
has pinned buffers but then excluded them from a read that turned out
to be short, it writes them into that array, but it doesn't count
them in the *nblocks value that it gives you.  The term I dreamed up
for that is "forwarded" buffers.  It is the caller's job to pass them
in as input to the next StartReadBuffers() call, or, if for whatever
reason it doesn't want to continue, to unpin them.
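
Spelled out as a sketch of the contract (this is not how
read_stream.c actually lays out its memory, as we'll see next):

Buffer buffers[MAX_IO_COMBINE_LIMIT];

for (int i = 0; i < MAX_IO_COMBINE_LIMIT; i++)
    buffers[i] = InvalidBuffer;     /* means "no forwarded buffer here" */

int nblocks = 8;                    /* in: request, out: granted */
if (StartReadBuffers(&operation, buffers, blocknum, &nblocks, 0))
    WaitReadBuffers(&operation);

/*
 * buffers[0..nblocks-1]: pinned, valid, counted in nblocks.
 * buffers[nblocks..]:    possibly forwarded pins from a short read;
 *                        they must be passed as input to the next
 *                        call, or released with ReleaseBuffer() if
 *                        we stop here.
 */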

Next, let me describe how read_stream.c manages its circular buffer queue:

|<------------- queue_size ------------>|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| | | | | | |0|1|2|3|4|5|6|7|8| | | | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
             ^                 ^
             |                 |
    oldest_buffer_index next_buffer_index

oldest_buffer_index is the consumer end of the queue, and any
associated IO has to be finished before that buffer can be returned
by read_stream_next_buffer().  next_buffer_index is the producer end
of the queue, where IOs are started.  StartReadBuffers() takes a
pointer to an array of buffers sized for the requested read, so
read_stream.c gives it the address of that part of its queue.
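
So the producer side is essentially handing bufmgr.c a window
directly into the circular queue and then advancing its index,
something like this (a fragment with names assumed close to
read_stream.c's):

buffer_index = stream->next_buffer_index;
need_wait = StartReadBuffers(&io->op, &stream->buffers[buffer_index],
                             blocknum, &nblocks, flags);
stream->next_buffer_index = (buffer_index + nblocks) % stream->queue_size;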

This means that if StartReadBuffers() forwards some buffers because
of a split, we don't have to do anything at all, because we advance
next_buffer_index to the location of the start of the next operation,
and that is where the forwarded buffers landed, ready for the next
call to StartReadBuffers():

|<------------- queue_size ------------>|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| | | | | | |0|1|2|3|4|5|6|7|8|9|A| | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
             ^                 ^
             |                 |
    oldest_buffer_index next_buffer_index

Well, not quite nothing: we do count how many there are in
stream->forwarded_buffers.  That's used by the code that limits
lookahead when an unrealistically small buffer pool reports that we
hold too many pins already.  Those are pins that this backend already
holds, so they are not counted as extra pins that StartReadBuffers()
would acquire.

That's a nice way for successive StartReadBuffers() calls to
communicate, without having to copy the forwarded buffers to and from
some other storage between calls.  In exchange for that convenience, we also need
to fill the entries with InvalidBuffer initially so they don't look
like forwarded buffers.  We don't want to do that up front: the queue
might be up to {effective,maintenance}_io_concurrency *
io_combine_limit in size (highest possible value: 128,000), and if
this is a small or perfectly cached scan then we will never use most
of the queue.  So, we:

 * initialize them on the first go around the queue only, tracked in
stream->initialized_buffers
 * invalidate entries consumed by read_stream_next_buffer() in
preparation for reuse
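
Paraphrasing both steps, again with field names assumed close to
read_stream.c's:

/* First lap only: extend the initialized prefix of the queue on demand. */
while (stream->initialized_buffers < buffer_index + nblocks)
    stream->buffers[stream->initialized_buffers++] = InvalidBuffer;

/* Consumer: re-invalidate each entry as its buffer is handed out. */
stream->buffers[oldest_buffer_index] = InvalidBuffer;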

The fast path repeatedly uses the same queue entry instead of
advancing oldest_buffer_index and next_buffer_index, so it doesn't
need to invalidate queue entries, saving an instruction and reducing
cache line pollution.  Now I am getting pretty close to explaining the
bug, but there is one final complication...

Sometimes a StartReadBuffers() call spans past the physical end of the
queue.  It needs to wrap around.  We could have chosen to give
StartReadBuffers() an iov, iov_cnt style interface like preadv() so
that we could use iov_cnt = 2 and locate some buffers at the physical
end of the queue and the rest at the physical start of the queue, but
StartReadBuffers()'s interface was complicated enough already.  The
solution I went with was to allocate extra array entries after the
queue to allow for the maximum possible overflow if beginning a read
into a buffer located in the final queue entry.  The overflowing
buffers are immediately copied to the physical start of the queue,
ready for consumption in their regular circular queue positions.
Note the twin positions for buffer 2 in this example:

|<------------- queue_size ------------>|<--- overflow_size --->|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|2|3|4|5|6|7| | | | | | | | | | | | |0|1|2|3|4|5|6|7| | | | | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
             ^                       ^
             |                       |
      next_buffer_index      oldest_buffer_index

(Perhaps it would be clearer if overflow_size were a member variable
in ReadStream instead of being open coded as io_combine_limit - 1 in a
few places.)
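
The copy itself is roughly a one-liner (paraphrased):

/* Copy buffers that landed past the physical end of the queue to the
 * physical start, leaving twins in both places for now. */
overflow = (buffer_index + nblocks) - stream->queue_size;
if (overflow > 0)
    memmove(&stream->buffers[0],
            &stream->buffers[stream->queue_size],
            sizeof(stream->buffers[0]) * overflow);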

The non-fast-path code invalidates the queue entry holding the buffer
it is returning to the caller of read_stream_next_buffer(), as
required to mark it as "not a forwarded buffer" when it is reused
next time we go around the circular queue.  If the entry is among the
low physical entries that might have a twin in the overflow zone, the
twin must also be invalidated.
The reason we keep both copies for a bit longer is that we don't know
whether the following StartReadBuffers() will also be split, so we
don't know whether it will also overflow, so we allow for any number
of following StartReadBuffers() calls to overflow too OR move to the
physical start of the queue.
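
In code, the consumer-side invalidation has to hit both twins,
roughly (paraphrased):

stream->buffers[oldest_buffer_index] = InvalidBuffer;

/* Entries 0..io_combine_limit-2 can have a twin in the overflow zone. */
if (oldest_buffer_index < stream->io_combine_limit - 1)
    stream->buffers[stream->queue_size + oldest_buffer_index] =
        InvalidBuffer;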

Now I can finally explain the bug :-)

The fast-path doesn't have to invalidate queue entries, because it
spins on the same queue entry repeatedly as an optimisation.
Unfortunately I forgot to consider that there might be a copy of it in
the overflow zone.  If there is a multi-block StartReadBuffers() that
spans into the overflow zone but is split, a continuing
StartReadBuffers(), and then all the requirements for entering fast
path mode are met, the fast path begins to spin on one buffer entry
repeatedly near the physical start of the queue, and it initially
holds a buffer that was copied from the overflow zone.  Nothing goes
wrong yet, but then we leave fast path mode because we need to do I/O,
and begin to cycle around the queue again in the regular non-fast path
code.  Eventually we have another StartReadBuffers() call that spans
into the overflow zone, and bufmgr.c sees a stale buffer number that
we failed to clear out earlier and thinks it's a forwarded buffer.
Forwarded buffers must be pinned and map to blocknum + i; bufmgr.c
asserts that, to avoid corrupting the buffer pool by overwriting a
random buffer, and that assertion is what failed.

I don't want the fast path to do extra steps.  The solution I found
was to clear the overflow entry *before* entering the fast path.  We
already know which buffer entry the fast path is going to use, and we
also know that there is no I/O split needing to be continued as part
of the conditions of entry to the fast path, so the solution is simply
to clear any corresponding overflow entry, very slightly ahead of the
usual time.
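
As a sketch, the fix amounts to doing the overflow-twin invalidation
once, at the moment we decide to enter the fast path (see the actual
patch for the committed form):

if (oldest_buffer_index < stream->io_combine_limit - 1)
    stream->buffers[stream->queue_size + oldest_buffer_index] =
        InvalidBuffer;
stream->fast_path = true;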
