Re: Streaming I/O, vectored I/O (WIP)

Поиск

Список

Период

Сортировка

От	Heikki Linnakangas
Тема	Re: Streaming I/O, vectored I/O (WIP)
Дата	28 марта 2024 г. 20:45:08
Msg-id	98f01395-fc60-4d50-89c9-fc3011fa2a45@iki.fi обсуждение исходный текст
Ответ на	Re: Streaming I/O, vectored I/O (WIP) (Thomas Munro <thomas.munro@gmail.com>)
Ответы	Re: Streaming I/O, vectored I/O (WIP)
Список	pgsql-hackers

Дерево обсуждения

> I spent most of the past few days trying to regain some lost
> performance.  Thanks to Andres for some key observations and help!
> That began with reports from Bilal and Melanie (possibly related to
> things Tomas had seen too, not sure) of regressions in all-cached
> workloads, which I already improved a bit with the ABC algorithm that
> minimised pinning for this case.  That is, if there's no recent I/O so
> we reach what I call behaviour A, it should try to do as little magic
> as possible.  But it turns out that wasn't enough!  It is very hard to
> beat a tight loop that just does ReadBuffer(), ReleaseBuffer() over
> millions of already-cached blocks, if you have to do exactly the same
> work AND extra instructions for management.

I got a little nerd-sniped by that, and did some micro-benchmarking of 
my own. I tested essentially this, with small values of 'nblocks' so 
that all pages are in cache:

    for (int i = 0; i < niters; i++)
    {
        for (BlockNumber blkno = 0; blkno < nblocks; blkno++)
        {
            buf = ReadBuffer(rel, blkno);
            ReleaseBuffer(buf);
        }
    }

The results look like this (lower is better, test program and script 
attached):

master (213c959a29):        8.0 s
streaming-api v13:        9.5 s

This test exercises just the ReadBuffer() codepath, to check if there is 
a regression there. It does not exercise the new streaming APIs.

So looks like the streaming API patches add some overhead to the simple 
non-streaming ReadBuffer() case. This is a highly narrow 
micro-benchmark, of course, so even though this is a very 
performance-sensitive codepath, we could perhaps accept a small 
regression there. In any real workload, you'd at least need to take the 
buffer lock and read something from the page.

But can we do better? Aside from performance, I was never quite happy 
with the BMR_REL/BMR_SMGR stuff we introduced in PG v16. I like having 
one common struct like BufferManagerRelation that is used in all the 
functions, instead of having separate Relation and SMgrRelation variants 
of every function. But those macros feel a bit hacky and we are not 
consistently using them in all the functions. Why is there no 
ReadBuffer() variant that takes a BufferManagerRelation?

The attached patch expands the use of BufferManagerRelations. The 
principle now is that before calling any bufmgr function, you first 
initialize a BufferManagerRelation struct, and pass that to the 
function. The initialization is done by the InitBMRForRel() or 
InitBMRForSmgr() function, which replace the BMR_REL/BMR_SMGR macros. 
They are full-blown functions now because they do more setup upfront 
than BMR_REL/BMR_SMGR. For example, InitBMRForRel() always initializes 
the 'smgr' field, so that you don't need to repeat this pattern in all 
the other functions:

-       /* Make sure our bmr's smgr and persistent are populated. */
-       if (bmr.smgr == NULL)
-       {
-               bmr.smgr = RelationGetSmgr(bmr.rel);
-               bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
-       }

Initializing the BufferManagerRelation is still pretty cheap, so it's 
feasible to call it separately for every ReadBuffer() call. But you can 
also reuse it across calls, if you read multiple pages in a loop, for 
example. That saves a few cycles.

The microbenchmark results with these changes:

master (213c959a29):        8.0 s
streaming-api v13:        9.5 s
bmr-refactor            8.4 s
bmr-refactor, InitBMR once    7.7 s

The difference between the "bmr-refactor" and "initBMR once" is that in 
the "initBMR once" test, I modified the benchmark to call 
InitBMRForRel() just once, outside the loop. So that shows the benefit 
of reusing the BufferManagerRelation. This refactoring seems to make 
performance regression smaller, even if you don't take advantage of 
reusing the BufferManagerRelation.

This also moves things around a little in ReadBuffer_common() (now 
called ReadBufferBMR). Instead of calling StartReadBuffer(), it calls 
PinBufferForBlock() directly. I tried doing that before the other 
refactorings, but that alone didn't seem to make much difference. Not 
sure if it's needed, it's perhaps an orthogonal refactoring, but it's 
included here nevertheless.

What do you think? The first three attached patches are your v13 patches 
unchanged. The fourth is the micro-benchmark I used. The last patch is 
the interesting one.


PS. To be clear, I'm happy with your v13 streaming patch set as it is. I 
don't think this BufferManagerRelation refactoring is a show-stopper.

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Вложения

В списке pgsql-hackers по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: Streaming I/O, vectored I/O (WIP)

Вложения