Re: index prefetching
From: Thomas Munro
Subject: Re: index prefetching
Date:
Msg-id: CA+hUKGJOstoEipeKc5naLFjK2TQK5=hKvJ-1pzoQN4U-Mt1ztQ@mail.gmail.com
In reply to: Re: index prefetching (Peter Geoghegan <pg@bowt.ie>)
List: pgsql-hackers
On Thu, Aug 7, 2025 at 8:41 AM Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Aug 5, 2025 at 7:31 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > There must be a similar opportunity for parallel index scans. It has
> > that "seize the scan" concept where parallel workers do one-at-a-time
> > locked linked list leapfrog.
>
> True. More generally, flexibility to reorder work would be useful there.
>
> The structure of parallel B-tree scans is one where each worker
> performs its own "independent" index scan. The workers each only
> return tuples from those leaf pages that they themselves manage to
> read. That isn't particularly efficient, since we'll usually have to
> merge the "independent" index scan tuples together once again using a
> GatherMerge.

Yeah. This all sails close to the stuff I wrote about in the post-mortem of my failed attempt to teach parallel bitmap heapscan not to throw I/O combining opportunities out the window for v18, after Melanie streamified BHS. Basically you have competing goals:

* preserve natural ranges of blocks up to io_combine_limit
* make workers run out of work at the same time
* avoid I/O stalls using lookahead and concurrency

You can't have all three right now: I/O streams are elastic, so allocation decisions made at the producer end don't control work *finishing* time, so we need something new. I wrote about an idea based on work stealing when data runs out. Read streams would work independently, but cooperate at end of stream, avoiding interlocking almost all the time. That was basically a refinement of an earlier "shared" read stream that seems too locky. (Obviously the seize-the-scan block producer is a total lockfest navigating a link-at-a-time data structure, but let's call that a use-case-specific problem.) Other architectures are surely possible too, including admitting that precise streams are not right for that problem, and using something like the PredictBuffer() approach mentioned below for prefetching and then sticking to block-at-a-time work distribution. Or we could go the other way and admit that block-at-a-time is also not ideal -- what if some blocks are 10,000 times more expensive to process than others? -- and do work stealing that degrades ultimately to tuple granularity, a logical extreme position.

> > Basically, the stuff that we can't fix with "precise" I/O
> > streaming as I like to call it, where it might still be interesting to
> > think about opportunities to do fuzzier speculative lookahead. I'll
> > start a new thread.
>
> That sounds interesting. I worry that we won't ever be able to get
> away without some fallback that behaves roughly like OS readahead.

Yeah. I might write about some of these things properly, but here is an unfiltered brain dump of assorted theories of varying crackpottedness, and some illustrative patches that I'm *not* proposing:

* you can make a dumb speculative sequential readahead stream pretty easily, but it's not entirely satisfying: here's one of the toy patches I mentioned in Athens, which shows btree leaf scans (but not parallel ones) doing that, producing nice I/O combining and concurrency that ramps up in the usual way if the scan happens to be sequential (well, I just rebased this and didn't test it, but it should still work); I will describe some other approaches to try to place this in the space of possibilities I'm aware of...
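In outline, that approach amounts to something like this -- an untested sketch with made-up names to show the shape, not the attached patch itself:

#include "postgres.h"

#include "access/nbtree.h"
#include "storage/bufmgr.h"
#include "storage/read_stream.h"

/*
 * Sketch only: speculate that btree leaf pages are laid out
 * sequentially, so the stream can look ahead, combine I/O and ramp up
 * concurrency in the usual way.
 */
typedef struct BTLeafGuessState
{
    BlockNumber next_guess;     /* block we expect to want next */
} BTLeafGuessState;

static BlockNumber
btleaf_guess_next(ReadStream *stream,
                  void *callback_private_data,
                  void *per_buffer_data)
{
    BTLeafGuessState *state = callback_private_data;

    /* Blindly assume the leaf chain continues at the next block. */
    return state->next_guess++;
}

/*
 * Consumer side: after actually reading a leaf page at blkno, check
 * whether the speculation was right, and throw away mis-speculated
 * lookahead if not.  (End-of-chain handling omitted.)
 */
static void
btleaf_check_guess(ReadStream *stream, BTLeafGuessState *state,
                   BlockNumber blkno, Page page)
{
    BTPageOpaque opaque = BTPageGetOpaque(page);

    if (opaque->btpo_next != blkno + 1)
    {
        /* Wrong guess: aim at the true right-link and start over. */
        state->next_guess = opaque->btpo_next;
        read_stream_reset(stream);  /* drop queued-up wrong guesses */
    }
}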
* you could make a stream that pulls leaf pages from higher level internal pages on demand (if, by using a pull approach, you want to avoid the flow control problems that come from trying to choose a batch size up front before you know you'll even need it), or just notice that it looks sequential and install a block range producer, and if that doesn't match the next page pointers by the time you get there then you destroy it and switch strategies, or something

* you could just pretend it's always sequential and reset the stream every time you're wrong, or use some only slightly smarter scheme than that, but it's still hard to know what's going on in cooperating processes...

* you could put sequential extent information in meta blocks or somehow scatter hints around...

* you could instead give up on explicit streams for fuzzy problems, and teach the buffer pool to do the same tricks as the kernel, with a scheme that lets go of the pins and reacquires them later (hopefully cheaply with ReadRecentBuffer(), by leaving a trail of breadcrumbs in SMgrRelation or shared memory, similar to what I already proposed for btree root pages and another related patch that speeds up seq scans, which I plan to repost soon): SMgrRelation could hold the state necessary for the buffer pool to notice when you keep calling ReadBuffer() for sequential blocks and begin to prefetch them speculatively with growing distance heuristics so it doesn't overdo it (roughly as sketched below), but somehow not hold pins on your behalf (this was one of the driving concerns that originally made me think that I needed an explicit stream, as an explicit opt-in and scoping for extra pins, which an AM might not want at certain times, truncation or cleanup or something, who knows)
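A rough sketch of that ignition state and heuristic -- the struct, names and thresholds are all invented, and it hand-waves over the don't-hold-pins problem just mentioned:

#include "postgres.h"

#include "storage/bufmgr.h"
#include "storage/smgr.h"

#define RA_TRIGGER_RUN  4       /* sequential reads before we kick in */
#define RA_MAX_DISTANCE 64      /* cap on speculative lookahead */

/* Invented: this would live in SMgrRelationData (or shared memory). */
typedef struct SMgrReadaheadState
{
    BlockNumber last_block;     /* previous ReadBuffer() block */
    int         run_length;     /* consecutive sequential reads seen */
    int         distance;       /* current speculative distance */
} SMgrReadaheadState;

/*
 * Called from the ReadBuffer() path: notice sequential access and
 * issue speculative prefetches with a distance that grows on success,
 * so we don't overdo it on random workloads.  (A real version would
 * track how far it had already prefetched instead of re-issuing the
 * whole window, and would start real reads rather than just advice.)
 */
static void
maybe_ignite_readahead(SMgrRelation reln, ForkNumber forknum,
                       BlockNumber blocknum, SMgrReadaheadState *ra)
{
    if (blocknum == ra->last_block + 1)
    {
        if (++ra->run_length >= RA_TRIGGER_RUN)
        {
            /* Double the window each time it pays off. */
            ra->distance = Min(Max(ra->distance * 2, 1), RA_MAX_DISTANCE);
            for (int i = 1; i <= ra->distance; i++)
                PrefetchSharedBuffer(reln, forknum, blocknum + i);
        }
    }
    else
    {
        /* Random access: back off completely. */
        ra->run_length = 0;
        ra->distance = 0;
    }
    ra->last_block = blocknum;
}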
* you could steal Linux's BM_READAHEAD concept, where speculatively loaded pages carry a special marker so they can be recognized by later ReadBuffer() calls to encourage more magic readahead, because it's measurably fruitful; this will also be seen by other backends, eg parallel workers working on the same problem, though there is a bit of an interleaving edge problem (you probably want to know whether adjacent pages have the flag in some window, and I have an idea for that that doesn't involve the buffer mapping table); in other words the state tracked in SMgrRelation is only used to ignite readahead, but shared flags in or parallel with the buffer pool apply fuel

* from 30,000 feet, the question is what scope you do the detection at; you can find examples of OSes that only look at one fd for sequential detection and only consider strict next-block access (old Unixen, I suspect maybe Windows but IDK), systems that have a tolerance window (Linux), systems that search a small table of active streams no matter which fds they're coming through (ZFS does this I think, sort of like our own synchronized_scan detector), and systems that use per-page accounting to measure and amplify success; and we have analogies in our architecture as candidate scopes: explicit stream objects, the per-backend SMgrRelation, the proposed system-wide SharedSMgrRelation, the buffer pool itself, and perhaps a per-buffer hint array with relaxed access (this last is something I've experimented with, both as a way to store relaxed navigation information for sequential scans skipping the buffer mapping table and as a place to accumulate prefetch-driving statistics)

* so far that's just talking about sequential heuristics, but we have many clues the kernel doesn't; it's just that they're not always reliable enough for a "precise" read stream, and we might not want to use the approach to speculation I mentioned above, where you have a read stream but as soon as it gives you something you didn't expect you have to give up completely, or reset it and start feeding it again; presumably you could code around that with a fuzzy speculation buffer that tolerates a bit of disorder

* you could make a better version of PrefetchBuffer() for guided prefetching, let's call it PredictBuffer(), that is initially lazy, but if the predictions turn out to be correct it starts looking further ahead in the stream of predictions you made and eventually becomes quite eager, like PrefetchBuffer(), yet just lazy enough to perform I/O combining; note that I'm now talking about "pushing" rather than "pulling" predictions, another central question with implications, and one of the nice things about read streams

* for a totally different line of attack that goes back to precise pull-based streams, you could imagine a read stream that lets you 'peek' at data coming down the pipe as soon as it is ready (ie it's already in cache, or it isn't but the I/O finishes before the stream consumer gets to it), so you can get a head start on a jump requiring I/O in a self-referential data structure like a linked list (with some obvious limitations); here is a toy patch that allows you to install such a callback, which could feed next-block information to the main block number callback (sketched in outline after this list), so now we have three times of interest in the I/O stream: block numbers are pulled into the producer end, valid pages are pushed out to you as soon as possible somewhere in the middle (probably often just a step ahead of the producer) and can feed block numbers back to it, and pages are eventually pulled out of the consumer end for processing; BUT NOTE: this idea is not entirely compatible with the lazy I/O completion draining of io_method=io_uring (or the posix_aio patch I dumped on the list the other day, and the Windows equivalent could plausibly go either way), and works much better with io_method=worker, whose completions are advertised eagerly, so this implementation of the idea is a dead end, if the goal itself is even interesting, not sure

* the same effect could be achieved with chained streams, where the consumer-facing stream is a simple elastic queue of buffers fed by the real I/O stream, with the peeking in between; that would suit those I/O methods much better; it might need a new read_stream_next_buffer_conditional() that calls that WaitReadBuffersWouldStall() function, unless the consumer queue is empty and it has to call read_stream_next_buffer(), which might block; the point being to periodically pump the peeking mechanism

* the peek concept is pretty weak on its own, because it's hard to reach a state where you have enough lookahead window that it can follow a navigational jump in time to save you from a stall, but ... maybe there are streams that contain a lot of either sequential or well cached blocks with occasional jumps to random I/O; if you could somehow combine the advanced vapourware of several of these magic bullet points, then perhaps you can avoid some stalls
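In outline, the peek mechanism could look something like this -- the callback type and all the names are invented here, and it glosses over end-of-stream semantics; it is not the toy patch itself:

#include "postgres.h"

#include "access/nbtree.h"
#include "storage/bufmgr.h"
#include "storage/read_stream.h"

/* Invented API: a callback that fires when a buffer becomes valid. */
typedef void (*ReadStreamPeekCB) (ReadStream *stream,
                                  Buffer buffer,
                                  void *callback_private_data);

typedef struct LeafChainState
{
    BlockNumber pending;        /* next-block info fed by the peek */
} LeafChainState;

/* Peek side: runs as soon as a leaf page's I/O completes. */
static void
leaf_peek(ReadStream *stream, Buffer buffer, void *callback_private_data)
{
    LeafChainState *state = callback_private_data;
    BTPageOpaque opaque = BTPageGetOpaque(BufferGetPage(buffer));

    /* Hand the right-link to the producer before the consumer gets here. */
    state->pending = opaque->btpo_next;
}

/* Producer side: the ordinary block number callback. */
static BlockNumber
leaf_next(ReadStream *stream, void *callback_private_data,
          void *per_buffer_data)
{
    LeafChainState *state = callback_private_data;
    BlockNumber next = state->pending;

    /*
     * In the stock API returning InvalidBlockNumber means end-of-stream,
     * so a real version needs some way to say "not known yet" instead;
     * hand-waving over that here.
     */
    state->pending = InvalidBlockNumber;
    return next;
}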
Please take all of that with an absolutely massive grain of salt, it's just very raw ideas...