Re: Streaming I/O, vectored I/O (WIP)

From: Heikki Linnakangas
Subject: Re: Streaming I/O, vectored I/O (WIP)
Date:
Msg-id: 289a1c0e-8444-4009-a8c2-c2d77ced6f07@iki.fi
In reply to: Re: Streaming I/O, vectored I/O (WIP)  (Thomas Munro <thomas.munro@gmail.com>)
Responses: Re: Streaming I/O, vectored I/O (WIP)  (Thomas Munro <thomas.munro@gmail.com>)
List: pgsql-hackers
On 24/03/2024 15:02, Thomas Munro wrote:
> On Wed, Mar 20, 2024 at 4:04 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> Maybe 'pg_streaming_read_next_buffer' or just 'pg_streaming_read_next',
>> for a shorter name.
> 
> Hmm.  The idea of 'buffer' appearing in a couple of names is that
> there are conceptually other kinds of I/O that we might want to
> stream, like raw files or buffers other than the buffer pool, maybe
> even sockets, so this would be part of a family of similar interfaces.
> I think it needs to be clear that this variant gives you buffers.  I'm
> OK with removing "get" but I guess it would be better to keep the
> words in the same order across the three functions?  What about these?
> 
> streaming_read_buffer_begin();
> streaming_read_buffer_next();
> streaming_read_buffer_end();
> 
> Tried like that in this version.  Other ideas would be to make
> "stream" the main noun, buffered_read_stream_begin() or something.
> Ideas welcome.

Works for me, although "streaming_read_buffer" is a pretty long prefix. 
The flags like "STREAMING_READ_MAINTENANCE" probably ought to be 
"STREAMING_READ_BUFFER_MAINTENANCE" as well.

Maybe "buffer_stream_next()"?

> Here are some other changes:
> 
> * I'm fairly happy with the ABC adaptive distance algorithm so far, I
> think, but I spent more time tidying up the way it is implemented.  I
> didn't like the way each 'range' had buffer[MAX_BUFFERS_PER_TRANSFER],
> so I created a new dense array stream->buffers that behaved as a
> second circular queue.
> 
> * The above also made it trivial for MAX_BUFFERS_PER_TRANSFER to
> become the GUC that it always wanted to be: buffer_io_size defaulting
> to 128kB.  Seems like a reasonable thing to have?  Could also
> influence things like bulk write?  (The main problem I have with the
> GUC currently is choosing a category, async resources is wrong....)
> 
> * By analogy, it started to look a bit funny that each range had room
> for a ReadBuffersOperation, and we had enough ranges for
> max_pinned_buffers * 1 block range.  So I booted that out to another
> dense array, of size max_ios.
> 
> * At the same time, Bilal and Andres had been complaining privately
> about 'range' management overheads showing up in perf and creating a
> regression against master on fully cached scans that do nothing else
> (eg pg_prewarm, where we lookup, pin, unpin every page and do no I/O
> and no CPU work with the page, a somewhat extreme case but a
> reasonable way to isolate the management costs); having made the above
> change, it suddenly seemed obvious that I should make the buffers
> array the 'main' circular queue, pointing off to another place for
> information required for dealing with misses.  In this version, there
> are no more range objects.  This feels better and occupies and touches
> less memory.  See pictures below.

+1 for all that. Much better!

> * Various indexes and sizes that couldn't quite fit in uint8_t but
> couldn't possibly exceed a few thousand because they are bounded by
> numbers deriving from range-limited GUCs are now int16_t (while I was
> looking for low hanging opportunities to reduce memory usage...)

Is int16 enough though? It seems so, because:

     max_pinned_buffers = Max(max_ios * 4, buffer_io_size);

and max_ios is constrained by the GUC's maximum MAX_IO_CONCURRENCY, and 
buffer_io_size is constrained by MAX_BUFFER_IO_SIZE == PG_IOV_MAX == 32.

If someone changes those constants, though, int16 might overflow and fail 
in weird ways. I'd suggest being more careful here: explicitly clamp 
max_pinned_buffers at PG_INT16_MAX, or add a static assertion or 
something. (I think it needs to be somewhat less than PG_INT16_MAX, 
because of the extra "overflow buffers" stuff and some other places 
where you do arithmetic.)
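
With the current limits (MAX_IO_CONCURRENCY being 1000) the formula tops 
out at Max(4000, 32) = 4000, comfortably within int16, but nothing 
enforces that. A minimal sketch of the kind of guard I mean, untested, 
with the headroom margin purely illustrative:

    max_pinned_buffers = Max(max_ios * 4, buffer_io_size);

    /*
     * Clamp below PG_INT16_MAX, leaving headroom for the overflow
     * buffers and other arithmetic on these int16 indexes.
     */
    max_pinned_buffers = Min(max_pinned_buffers,
                             PG_INT16_MAX - MAX_BUFFER_IO_SIZE);

    /* And/or fail at compile time if the constants grow later. */
    StaticAssertStmt(MAX_IO_CONCURRENCY * 4 + PG_IOV_MAX < PG_INT16_MAX,
                     "max_pinned_buffers must fit in int16");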

>     /*
>      * We gave a contiguous range of buffer space to StartReadBuffers(), but
>      * we want it to wrap around at max_pinned_buffers.  Move values that
>      * overflowed into the extra space.  At the same time, put -1 in the I/O
>      * slots for the rest of the buffers to indicate no I/O.  They are covered
>      * by the head buffer's I/O, if there is one.  We avoid a % operator.
>      */
>     overflow = (stream->next_buffer_index + nblocks) - stream->max_pinned_buffers;
>     if (overflow > 0)
>     {
>         memmove(&stream->buffers[0],
>                 &stream->buffers[stream->max_pinned_buffers],
>                 sizeof(stream->buffers[0]) * overflow);
>         for (int i = 0; i < overflow; ++i)
>             stream->buffer_io_indexes[i] = -1;
>         for (int i = 1; i < nblocks - overflow; ++i)
>             stream->buffer_io_indexes[stream->next_buffer_index + i] = -1;
>     }
>     else
>     {
>         for (int i = 1; i < nblocks; ++i)
>             stream->buffer_io_indexes[stream->next_buffer_index + i] = -1;
>     }

Instead of clearing buffer_io_indexes here, it might be cheaper/simpler 
to initialize the array to -1 in streaming_read_buffer_begin(), and 
reset buffer_io_indexes[io_index] = -1 in streaming_read_buffer_next(), 
after the WaitReadBuffers() call. In other words, except when an I/O is 
in progress, keep all the elements at -1, even the elements that are not 
currently in use.
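
Roughly like this (an untested sketch, reusing the names from your 
patch):

    /* In streaming_read_buffer_begin(): start with "no I/O" everywhere. */
    for (int i = 0; i < stream->max_pinned_buffers; i++)
        stream->buffer_io_indexes[i] = -1;

    /* In streaming_read_buffer_next(), when consuming the head buffer: */
    int16 io_index = stream->buffer_io_indexes[stream->next_buffer_index];

    if (io_index >= 0)
    {
        WaitReadBuffers(&stream->ios[io_index]);
        /* Keep the invariant: only in-progress I/Os are recorded. */
        stream->buffer_io_indexes[stream->next_buffer_index] = -1;
    }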

Alternatively, you could remember the first buffer that the I/O applies 
to in the 'ios' array. In other words, instead of pointing from buffer 
to the I/O that it depends on, point from the I/O to the buffer that 
depends on it. The last attached patch implements that approach. I'm not 
wedded to it, but it feels a little simpler.
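
In code, the 'ios' entries would then look something like this (struct 
and field names are just illustrative, the attached patch has the real 
thing):

    typedef struct StreamingReadIo
    {
        ReadBuffersOperation op;
        int16       buffer_index;   /* head buffer this I/O applies to */
    } StreamingReadIo;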


>         if (stream->ios[io_index].flags & READ_BUFFERS_ISSUE_ADVICE)
>         {
>             /* Distance ramps up fast (behavior C). */
>             ...
>         }
>         else
>         {
>             /* No advice; move towards full I/O size (behavior B). */
>             ...
>         }

The comment on ReadBuffersOperation says "Declared in public header only 
to allow inclusion in other structs, but contents should not be 
accessed", but here you access the 'flags' field.

You also mentioned that the StartReadBuffers() argument list is too 
long. Perhaps the solution is to redefine ReadBuffersOperation to 
consist of two parts: the first part is filled in by the caller and 
contains the arguments, and the second part is private to bufmgr.c. The 
signature for StartReadBuffers() would then be just:

bool StartReadBuffers(ReadBuffersOperation *operation);

That would make it OK to read the 'flags' field. It would also allow 
reusing the same ReadBuffersOperation struct for multiple I/Os on the 
same relation; you would only need to update the fields that change on 
each operation.
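
As a sketch (the exact field list is illustrative, not from your 
patch):

    typedef struct ReadBuffersOperation
    {
        /* The following fields are filled in by the caller. */
        struct SMgrRelationData *smgr;
        ForkNumber  forknum;
        BufferAccessStrategy strategy;
        Buffer     *buffers;        /* space for 'nblocks' buffers */
        BlockNumber blocknum;
        int16       nblocks;
        int         flags;

        /* Everything below is private to bufmgr.c. */
        int16       io_buffers_len;
    } ReadBuffersOperation;

    bool StartReadBuffers(ReadBuffersOperation *operation);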


In the attached patch set, the first three patches are your v9 with no 
changes. The last patch refactors away 'buffer_io_indexes' like I 
mentioned above. The others are fixes for some other trivial things that 
caught my eye.

-- 
Heikki Linnakangas
Neon (https://neon.tech)
Attachments
