Re: Streaming I/O, vectored I/O (WIP)

Поиск
Список
Период
Сортировка
От Heikki Linnakangas
Тема Re: Streaming I/O, vectored I/O (WIP)
Дата
Msg-id b336fbe8-1ad0-c289-5ed8-bbb74513e5bb@iki.fi
обсуждение исходный текст
Ответ на Streaming I/O, vectored I/O (WIP)  (Thomas Munro <thomas.munro@gmail.com>)
Ответы Re: Streaming I/O, vectored I/O (WIP)  (Thomas Munro <thomas.munro@gmail.com>)
Список pgsql-hackers
On 31/08/2023 07:00, Thomas Munro wrote:
> Currently PostgreSQL reads (and writes) data files 8KB at a time.
> That's because we call ReadBuffer() one block at a time, with no
> opportunity for lower layers to do better than that.  This thread is
> about a model where you say which block you'll want next with a
> callback, and then you pull the buffers out of a "stream".

I love this idea! Makes it a lot easier to perform prefetch, as 
evidenced by the 011-WIP-Use-streaming-reads-in-bitmap-heapscan.patch:

  13 files changed, 289 insertions(+), 637 deletions(-)

I'm a bit disappointed and surprised by 
v1-0009-WIP-Use-streaming-reads-in-vacuum.patch though:

  4 files changed, 244 insertions(+), 78 deletions(-)

The current prefetching logic in vacuumlazy.c is pretty hairy, so I 
hoped that this would simplify it. I didn't look closely at that patch, 
so maybe it's simpler even though it's more code.

> There are more kinds of streaming I/O that would be useful, such as
> raw unbuffered files, and of course writes, and I've attached some
> early incomplete demo code for writes (just for fun), but the main
> idea I want to share in this thread is the idea of replacing lots of
> ReadBuffer() calls with the streaming model.

All this makes sense. Some random comments on the patches:

> +    /* Avoid a slightly more expensive kernel call if there is no benefit. */
> +    if (iovcnt == 1)
> +        returnCode = pg_pread(vfdP->fd,
> +                              iov[0].iov_base,
> +                              iov[0].iov_len,
> +                              offset);
> +    else
> +        returnCode = pg_preadv(vfdP->fd, iov, iovcnt, offset);

How about pushing down this optimization to pg_preadv() itself? 
pg_readv() is currently just a macro if the system provides preadv(), 
but it could be a "static inline" that does the above dance. I think 
that optimization is platform-dependent anyway, pread() might not be any 
faster on some OSs. In particular, if the system doesn't provide 
preadv() and we use the implementation in src/port/preadv.c, it's the 
same kernel call anyway.

> v1-0002-Provide-vectored-variants-of-smgrread-and-smgrwri.patch

No smgrextendv()? I guess none of the patches here needed it.

> /*
>  * Prepare to read a block.  The buffer is pinned.  If this is a 'hit', then
>  * the returned buffer can be used immediately.  Otherwise, a physical read
>  * should be completed with CompleteReadBuffers().  PrepareReadBuffer()
>  * followed by CompleteReadBuffers() is equivalent ot ReadBuffer(), but the
>  * caller has the opportunity to coalesce reads of neighboring blocks into one
>  * CompleteReadBuffers() call.
>  *
>  * *found is set to true for a hit, and false for a miss.
>  *
>  * *allocated is set to true for a miss that allocates a buffer for the first
>  * time.  If there are multiple calls to PrepareReadBuffer() for the same
>  * block before CompleteReadBuffers() or ReadBuffer_common() finishes the
>  * read, then only the first such call will receive *allocated == true, which
>  * the caller might use to issue just one prefetch hint.
>  */
> Buffer
> PrepareReadBuffer(BufferManagerRelation bmr,
>                   ForkNumber forkNum,
>                   BlockNumber blockNum,
>                   BufferAccessStrategy strategy,
>                   bool *found,
>                   bool *allocated)
> 

If you decide you don't want to perform the read, after all, is there a 
way to abort it without calling CompleteReadBuffers()? Looking at the 
later patch that introduces the streaming read API, seems that it 
finishes all the reads, so I suppose we don't need an abort function. 
Does it all get cleaned up correctly on error?

> /*
>  * Convert an array of buffer address into an array of iovec objects, and
>  * return the number that were required.  'iov' must have enough space for up
>  * to PG_IOV_MAX elements.
>  */
> static int
> buffers_to_iov(struct iovec *iov, void **buffers, int nblocks)
  The comment is a bit inaccurate. There's an assertion that If nblocks 
<= PG_IOV_MAX, so while it's true that 'iov' must have enough space for 
up to PG_IOV_MAX elements, that's only because we also assume that 
nblocks <= PG_IOV_MAX.

I don't see anything in the callers (mdreadv() and mdwritev()) to 
prevent them from passing nblocks > PG_IOV_MAX.

in streaming_read.h:

> typedef bool (*PgStreamingReadBufferDetermineNextCB) (PgStreamingRead *pgsr,
>         uintptr_t pgsr_private,
>         void *per_io_private,
>         BufferManagerRelation *bmr,
>         ForkNumber *forkNum,
>         BlockNumber *blockNum,
>         ReadBufferMode *mode);

I was surprised that 'bmr', 'forkNum' and 'mode' are given separately on 
each read. I see that you used that in the WAL replay prefetching, so I 
guess that makes sense.

> extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
> extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_io_private);
> extern void pg_streaming_read_reset(PgStreamingRead *pgsr);
> extern void pg_streaming_read_free(PgStreamingRead *pgsr);

Do we need to expose pg_streaming_read_prefetch()? It's only used in the 
WAL replay prefetching patch, and only after calling 
pg_streaming_read_reset(). Could pg_streaming_read_reset() call 
pg_streaming_read_prefetch() directly? Is there any need to "reset" the 
stream, without also starting prefetching?

In v1-0012-WIP-Use-streaming-reads-in-recovery.patch:

> @@ -1978,6 +1979,9 @@ XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
>   * If the WAL record contains a block reference with the given ID, *rlocator,
>   * *forknum, *blknum and *prefetch_buffer are filled in (if not NULL), and
>   * returns true.  Otherwise returns false.
> + *
> + * If prefetch_buffer is not NULL, the buffer is already pinned, and ownership
> + * of the pin is transferred to the caller.
>   */
>  bool
>  XLogRecGetBlockTagExtended(XLogReaderState *record, uint8 block_id,
> @@ -1998,7 +2002,15 @@ XLogRecGetBlockTagExtended(XLogReaderState *record, uint8 block_id,
>      if (blknum)
>          *blknum = bkpb->blkno;
>      if (prefetch_buffer)
> +    {
>          *prefetch_buffer = bkpb->prefetch_buffer;
> +
> +        /*
> +         * Clear this flag is so that we can assert that redo records take
> +         * ownership of all buffers pinned by xlogprefetcher.c.
> +         */
> +        bkpb->prefetch_buffer = InvalidBuffer;
> +    }
>      return true;
>  }

Could these changes be committed independently of all the other changes?

-- 
Heikki Linnakangas
Neon (https://neon.tech)




В списке pgsql-hackers по дате отправления:

Предыдущее
От: Peter Geoghegan
Дата:
Сообщение: Re: Eager page freeze criteria clarification
Следующее
От: Melanie Plageman
Дата:
Сообщение: Re: Eager page freeze criteria clarification