Discussion: aio/README.md comments


aio/README.md comments

From:
Jeff Davis
Date:
aio/README.md:

* In the section "### IO can be started in critical sections", the
first paragraph seems like it belongs in another section.

* The README generally mixes design goals with implemented
functionality. For instance, we're only using it on the read path
currently, but the README mentions WAL writing several times. We should
probably clarify that a bit.

* "`io_method=sync` does not actually perform AIO but allows to use the
AIO API while performing synchronous IO. This can be useful for
debugging." Sync is still useful for cases where the shared buffers are
a small fraction of system memory, right?

* "Particularly on modern storage..." I assume this is talking about
SSDs, but it could also mean some kind of network block storage. If our
architecture is changing in response to new real-world hardware, we
should briefly try to connect the design choices to assumptions about
hardware, where appropriate.

Regards,
    Jeff Davis




Re: aio/README.md comments

From:
Andres Freund
Date:
Hi,

On 2025-08-29 08:12:36 -0700, Jeff Davis wrote:
> aio/README.md:
>
> * In the section "### IO can be started in critical sections", the
> first paragraph seems like it belongs in another section.

It explains why we eventually want to do WAL IO using AIO, which in turn
requires AIO to be executable in critical sections. So it's intentionally
there, and I think it has to be there in some form - if you have a suggestion
for how to make that clearer...


> * The README generally mixes design goals with implemented
> functionality. For instance, we're only using it on the read path
> currently, but the README mentions WAL writing several times. We should
> probably clarify that a bit.

It mentions WAL writes because they have a huge design impact... What
concretely would you like to clarify?


> * "`io_method=sync` does not actually perform AIO but allows to use the
> AIO API while performing synchronous IO. This can be useful for
> debugging." Sync is still useful for cases where the shared buffers are
> a small fraction of system memory, right?

I don't really see an advantage of sync in those cases either. IO latency can
be really painful if s_b is a small fraction of memory too, and it can be
avoided by doing real readahead.
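To illustrate the readahead point with a toy model (the latencies below are simulated sleeps, not real storage; the numbers and helper names are made up for the sketch): issuing reads ahead of consumption lets the per-block latencies overlap instead of adding up.

```python
import time
from concurrent.futures import ThreadPoolExecutor

READ_LATENCY = 0.01  # simulated per-block storage latency, seconds (assumed)

def read_block(n: int) -> int:
    time.sleep(READ_LATENCY)  # stand-in for one storage roundtrip
    return n

def scan_sync(blocks):
    # One read at a time: every block pays the full latency.
    return [read_block(b) for b in blocks]

def scan_with_readahead(blocks, depth=8):
    # Keep `depth` reads in flight ahead of consumption, so the
    # latencies overlap instead of accumulating.
    with ThreadPoolExecutor(max_workers=depth) as pool:
        return list(pool.map(read_block, blocks))
```

With 16 blocks, the synchronous scan takes roughly 16 x 10 ms, while the readahead scan finishes in about two latency periods.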


> * "Particularly on modern storage..." I assume this is talking about
> SSDs, but it could also mean some kind of network block storage.

As the start of the bullet point says, it's about high throughput
storage. Yes, that most crucially is indeed PCIe connected SSDs, but you can
have pretty darn fast networked storage too. Once you do > 1GB/s of IO, the
cycles for actually copying memory become really relevant.
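As a back-of-envelope illustration of that point (the bandwidth figures below are assumptions chosen for the arithmetic, not measurements):

```python
# Buffered IO adds a kernel page cache -> shared buffers memcpy on top
# of the device transfer. Rough cost of that extra copy at high IO rates:
io_rate_gb_s = 2.0   # sustained IO throughput, GB/s (assumed)
memcpy_gb_s = 10.0   # single-core memcpy bandwidth, GB/s (assumed)

# Fraction of one CPU core burned just copying the incoming data:
core_fraction = io_rate_gb_s / memcpy_gb_s  # 0.2, i.e. ~20% of a core
```

At the ~600 MB/s a SATA link tops out at, the same copy costs only a few percent of a core, which is why this only started to matter with faster storage.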


> If our architecture is changing in response to new real-world hardware, we
> should briefly try to connect the design choices to assumptions about
> hardware, where appropriate.

Could you make a somewhat more concrete suggestion of what you would like to
see mentioned? I tried to keep it a bit technology neutral, because mentioning
specific technologies tends to get more out of date than more general things
like storage that has high throughput - I don't think we'll go back to SATA
devices with ~600MB/s of hard bus limited throughput...


Greetings,

Andres Freund



Re: aio/README.md comments

From:
Jeff Davis
Date:
On Fri, 2025-08-29 at 12:32 -0400, Andres Freund wrote:
> I don't really see an advantage of sync in those cases either.

It seems a bit early to say that it's just there for debugging. But
it's just in a README, so I won't argue the point.

I attached some proposed changes based on my understanding.

Regards,
    Jeff Davis


Attachments

Re: aio/README.md comments

From:
Xuneng Zhou
Date:
Hi,

On Sat, Aug 30, 2025 at 6:24 AM Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Fri, 2025-08-29 at 12:32 -0400, Andres Freund wrote:
> > I don't really see an advantage of sync in those cases either.
>
> It seems a bit early to say that it's just there for debugging. But
> it's just in a README, so I won't argue the point.
>
> I attached some proposed changes based on my understanding.
>
> Regards,
>         Jeff Davis
>

+  These memory copies can become the bottleneck when the
+  underlying storage has high enough throughput, which is common for
+  solid-state drives or fast network block devices.

Would it be helpful to be more specific about the type of solid-state
drive, e.g. PCIe/NVMe SSDs?
SATA SSDs' theoretical ceiling of ~600 MB/s might not be high enough.
The risk of the wording becoming out of date may be less of a concern
now that hardware advancement has slowed down.

Best,
Xuneng



Re: aio/README.md comments

From:
Andres Freund
Date:
Hi,

On 2025-08-29 15:23:48 -0700, Jeff Davis wrote:
> On Fri, 2025-08-29 at 12:32 -0400, Andres Freund wrote:
> > I don't really see an advantage of sync in those cases either.
> 
> It seems a bit early to say that it's just there for debugging. But
> it's just in a README, so I won't argue the point.

There might be some regressions that make io_method=sync beneficial, but short
to medium term, the goal ought to be for all non-ridiculous configurations
(I don't care about AIO performing well with s_b=16) not to regress
meaningfully, and for most things to be the same or better with AIO.

I don't see any reason for io_method=sync to be something we should have for
anything other than debugging medium to long term.

Why do you think different?



> diff --git a/src/backend/storage/aio/README.md b/src/backend/storage/aio/README.md
> index 72ae3b3737d..8fa6bd6e9ca 100644
> --- a/src/backend/storage/aio/README.md
> +++ b/src/backend/storage/aio/README.md
> @@ -4,27 +4,38 @@
>  
>  ### Why Asynchronous IO
>  
> -Until the introduction of asynchronous IO postgres relied on the operating
> -system to hide the cost of synchronous IO from postgres. While this worked
> -surprisingly well in a lot of workloads, it does not do as good a job on
> -prefetching and controlled writeback as we would like.
> -
> -There are important expensive operations like `fdatasync()` where the operating
> -system cannot hide the storage latency. This is particularly important for WAL
> -writes, where the ability to asynchronously issue `fdatasync()` or O_DSYNC
> -writes can yield significantly higher throughput.

I think this second paragraph was important and your rewrite largely removed
it?


> +Postgres depends on IO operations happening asynchronously for reasonable
> +performance: for instance, a sequential scan would be far slower without the
> +benefit of readahead. Historically, Postgres only used synchronous APIs for
> +IO, while assuming that the operating system would use the kernel buffer cache
> +to make those operations asynchronous in most cases (aside from, e.g.,
> +`fdatasync()`).
> +
> +The asynchronous IO APIs described here do not depend on that
> +assumption. Instead, they allow different low-level IO methods, which are
> +given more control and therefore rely less on the kernel's
> +behavior. Currently, only async read operations are supported, but the
> +infrastructure is designed to support async write operations in the future.

The infrastructure supports writes today; it's just that md.c and bufmgr.c
aren't ready to use it yet.


>  ### Why Direct / unbuffered IO
>  
>  The main reasons to want to use Direct IO are:
>  
> -- Lower CPU usage / higher throughput. Particularly on modern storage buffered
> -  writes are bottlenecked by the operating system having to copy data from the
> -  kernel's page cache to postgres buffer pool using the CPU. Whereas direct IO
> -  can often move the data directly between the storage devices and postgres'
> -  buffer cache, using DMA. While that transfer is ongoing, the CPU is free to
> -  perform other work.
> +- Avoid extra memory copies between the kernel buffer cache and Postgres
> +  shared buffers. These memory copies can become the bottleneck when the
> +  underlying storage has high enough throughput, which is common for
> +  solid-state drives or fast network block devices. Instead, direct IO can
> +  often move the data directly between the Postgres buffer cache and the
> +  device by using DMA, leaving the CPU free to perform other work.
>  - Reduced latency - Direct IO can have substantially lower latency than
>    buffered IO, which can be impactful for OLTP workloads bottlenecked by WAL
>    write latency.

I preferred the prior formulation that had the main reasons at the start of
the bullet points.


> @@ -37,11 +48,24 @@ The main reasons *not* to use Direct IO are:
>  
>  - Without AIO, Direct IO is unusably slow for most purposes.
>  - Even with AIO, many parts of postgres need to be modified to perform
> -  explicit prefetching.
> +  explicit prefetching (see read_stream.c).
>  - In situations where shared_buffers cannot be set appropriately large,
>    e.g. because there are many different postgres instances hosted on shared
>    hardware, performance will often be worse than when using buffered IO.

Ok, although perhaps better to refer to the read stream section at the bottom?


> +### Writing WAL
> +
> +Using AIO and Direct IO can reduce the overhead of WAL logging
> +substantially:
> +
> +- AIO allows to start WAL writes eagerly, so they complete before needing to
> +  wait
> +- AIO allows to have multiple WAL flushes in progress at the same time
> +- Direct IO can reduce the number of roundtrips to storage on some OSs
> +  and storage HW (buffered IO and direct IO without O_DSYNC needs to
> +  issue a write and after the write's completion a cache flush,
> +  whereas O\_DIRECT + O\_DSYNC can use a single Force Unit Access
> +  (FUA) write).

>  ## AIO Usage Example
>  
> @@ -196,25 +220,15 @@ processing to the AIO workers).
>  
>  ### IO can be started in critical sections
>  
> -Using AIO for WAL writes can reduce the overhead of WAL logging substantially:
>  
> -- AIO allows to start WAL writes eagerly, so they complete before needing to
> -  wait
> -- AIO allows to have multiple WAL flushes in progress at the same time
> -- AIO makes it more realistic to use O\_DIRECT + O\_DSYNC, which can reduce
> -  the number of roundtrips to storage on some OSs and storage HW (buffered IO
> -  and direct IO without O_DSYNC needs to issue a write and after the write's
> -  completion a cache flush, whereas O\_DIRECT + O\_DSYNC can use a single
> -  Force Unit Access (FUA) write).

Direct IO alone does not reduce the number of roundtrips; the combination of
DIO and O_DSYNC does. I think that got less clear in the rewrite.
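A sketch of the two syscall sequences in question (Python for brevity; the buffered path is executable as written, while the O_DIRECT path is shown only as flags, since actually using it needs aligned buffers and filesystem support):

```python
import os

def wal_flush_buffered(path: str, record: bytes) -> None:
    """Durable flush via buffered IO: two roundtrips to storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, record)  # roundtrip 1: the write itself
        os.fdatasync(fd)      # roundtrip 2: flush the device write cache
    finally:
        os.close(fd)

# With O_DIRECT | O_DSYNC the kernel can instead issue a single Force
# Unit Access (FUA) write, which the device persists without a separate
# cache flush. (O_DIRECT is absent on some platforms, hence the getattr.)
DIO_DSYNC_FLAGS = os.O_WRONLY | os.O_DSYNC | getattr(os, "O_DIRECT", 0)
```

The same single-vs-double-roundtrip distinction holds whether the caller is WAL code or anything else doing durable writes.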

Greetings,

Andres Freund



Re: aio/README.md comments

From:
Jeff Davis
Date:
On Sat, 2025-08-30 at 12:20 -0400, Andres Freund wrote:
> There might be some regressions that make io_method=sync beneficial,
> but short
> to medium term, the goal ought to be to make all non-ridiculous
> configurations
> (I don't care about AIO performing well with s_b=16) to not regress
> meaningfully and for most things to be the same or better with AIO.
>
> I don't see any reason for io_method=sync to be something we should
> have for
> anything other than debugging medium to long term.
>
> Why do you think different?

I don't disagree, but:

(a) It seems inconsistent that the user-facing documentation offers the
"sync" option with no mention that it's a debugging/developer option,
but our internal README says it's only there for debugging.

(b) When AIO gets used for more purposes (e.g. writes), the overall
picture may get more complicated. While I expect the performance to be
much better overall, I wouldn't be surprised if "sync" ends up still
being useful for some purposes.


Regards,
    Jeff Davis