Re: Streaming I/O, vectored I/O (WIP)
От | Heikki Linnakangas |
---|---|
Тема | Re: Streaming I/O, vectored I/O (WIP) |
Дата | |
Msg-id | 98f01395-fc60-4d50-89c9-fc3011fa2a45@iki.fi обсуждение исходный текст |
Ответ на | Re: Streaming I/O, vectored I/O (WIP) (Thomas Munro <thomas.munro@gmail.com>) |
Ответы |
Re: Streaming I/O, vectored I/O (WIP)
(Thomas Munro <thomas.munro@gmail.com>)
|
Список | pgsql-hackers |
> I spent most of the past few days trying to regain some lost > performance. Thanks to Andres for some key observations and help! > That began with reports from Bilal and Melanie (possibly related to > things Tomas had seen too, not sure) of regressions in all-cached > workloads, which I already improved a bit with the ABC algorithm that > minimised pinning for this case. That is, if there's no recent I/O so > we reach what I call behaviour A, it should try to do as little magic > as possible. But it turns out that wasn't enough! It is very hard to > beat a tight loop that just does ReadBuffer(), ReleaseBuffer() over > millions of already-cached blocks, if you have to do exactly the same > work AND extra instructions for management. I got a little nerd-sniped by that, and did some micro-benchmarking of my own. I tested essentially this, with small values of 'nblocks' so that all pages are in cache: for (int i = 0; i < niters; i++) { for (BlockNumber blkno = 0; blkno < nblocks; blkno++) { buf = ReadBuffer(rel, blkno); ReleaseBuffer(buf); } } The results look like this (lower is better, test program and script attached): master (213c959a29): 8.0 s streaming-api v13: 9.5 s This test exercises just the ReadBuffer() codepath, to check if there is a regression there. It does not exercise the new streaming APIs. So looks like the streaming API patches add some overhead to the simple non-streaming ReadBuffer() case. This is a highly narrow micro-benchmark, of course, so even though this is a very performance-sensitive codepath, we could perhaps accept a small regression there. In any real workload, you'd at least need to take the buffer lock and read something from the page. But can we do better? Aside from performance, I was never quite happy with the BMR_REL/BMR_SMGR stuff we introduced in PG v16. I like having one common struct like BufferManagerRelation that is used in all the functions, instead of having separate Relation and SMgrRelation variants of every function. But those macros feel a bit hacky and we are not consistently using them in all the functions. Why is there no ReadBuffer() variant that takes a BufferManagerRelation? The attached patch expands the use of BufferManagerRelations. The principle now is that before calling any bufmgr function, you first initialize a BufferManagerRelation struct, and pass that to the function. The initialization is done by the InitBMRForRel() or InitBMRForSmgr() function, which replace the BMR_REL/BMR_SMGR macros. They are full-blown functions now because they do more setup upfront than BMR_REL/BMR_SMGR. For example, InitBMRForRel() always initializes the 'smgr' field, so that you don't need to repeat this pattern in all the other functions: - /* Make sure our bmr's smgr and persistent are populated. */ - if (bmr.smgr == NULL) - { - bmr.smgr = RelationGetSmgr(bmr.rel); - bmr.relpersistence = bmr.rel->rd_rel->relpersistence; - } Initializing the BufferManagerRelation is still pretty cheap, so it's feasible to call it separately for every ReadBuffer() call. But you can also reuse it across calls, if you read multiple pages in a loop, for example. That saves a few cycles. The microbenchmark results with these changes: master (213c959a29): 8.0 s streaming-api v13: 9.5 s bmr-refactor 8.4 s bmr-refactor, InitBMR once 7.7 s The difference between the "bmr-refactor" and "initBMR once" is that in the "initBMR once" test, I modified the benchmark to call InitBMRForRel() just once, outside the loop. So that shows the benefit of reusing the BufferManagerRelation. This refactoring seems to make performance regression smaller, even if you don't take advantage of reusing the BufferManagerRelation. This also moves things around a little in ReadBuffer_common() (now called ReadBufferBMR). Instead of calling StartReadBuffer(), it calls PinBufferForBlock() directly. I tried doing that before the other refactorings, but that alone didn't seem to make much difference. Not sure if it's needed, it's perhaps an orthogonal refactoring, but it's included here nevertheless. What do you think? The first three attached patches are your v13 patches unchanged. The fourth is the micro-benchmark I used. The last patch is the interesting one. PS. To be clear, I'm happy with your v13 streaming patch set as it is. I don't think this BufferManagerRelation refactoring is a show-stopper. -- Heikki Linnakangas Neon (https://neon.tech)
Вложения
В списке pgsql-hackers по дате отправления: