Discussion: Should io_method=worker remain the default?


Should io_method=worker remain the default?

From: Jeff Davis
Date:
Has there already been a discussion about leaving the default as
io_method=worker? There was an Open Item for this, which was closed as
"Won't Fix", but the links don't explain why as far as I can see.

I tested a concurrent scan-heavy workload (see below) where the data
fits in memory, and "worker" seems to be 30% slower than "sync" with
default settings.

I'm not suggesting that AIO overall is slow -- on the contrary, I'm
excited about AIO. But if it regresses in some cases, we should make a
conscious choice about the default and what kind of tuning advice needs
to be offered.

I briefly tried tuning to see if a different io_workers value would
solve the problem, but no luck.

The good news is that io_uring seemed to solve the problem.
Unfortunately, that's platform-specific, so it can't be the default. I
didn't dig in very much, but it seemed to be at least as good as "sync"
mode for this workload.

Regards,
    Jeff Davis



Test summary: 32 connections each perform repeated sequential scans.
Each connection scans a different 1GB partition of the same table. I
used partitioning and a predicate to make it easier to script in
pgbench.


Test details:

Machine:
   AMD Ryzen 9 9950X 16-Core Processor
   64GB RAM
   Local storage, NVMe SSD
   Ubuntu 24.04 (Linux 6.11, liburing 2.5)

   Note: the storage didn't matter much, because the data fits in
memory. To get consistent results, when changing between data
directories for the 17 and 18 tests, I had to drop the filesystem cache
first to make room, then run a few scans to warm it with the data from
the right data directory.

For simplicity I disabled parallel query, but that didn't seem to have
a big effect. Everything else was set to the default.

Setup (checksums enabled):

   => create table t(sid int8, c0 int8, c1 int8, c2 int8, c3 int8,
      c4 int8, c5 int8, c6 int8, c7 int8) partition by range (sid);

   $ (for i in `seq 0 31`; do
         echo "create table t$(printf "%02d" $i) partition of t
               for values from ($i) to ($((i+1)));";
      done) | ./bin/psql postgres
   $ (for i in `seq 0 31`; do
         echo "insert into t$(printf "%02d" $i) select $i, 0, 1, 2, 3,
               4, 5, 6, 7 from generate_series(0, 10000000);";
      done) | ./bin/psql postgres

   => vacuum analyze; checkpoint;

Script count.sql:

  SELECT COUNT(*) FROM t WHERE sid=:client_id;

pgbench:

  ./bin/pgbench --dbname=postgres -M prepared -n -c 32 -T 60 \
     -f count.sql

Results:

PG17:
  tps = 36.209048

PG18 (io_method=sync)
  tps = 34.014890

PG18 (io_method=worker io_workers=3)
  tps = 23.938509

PG18 (io_method=worker io_workers=16)
  tps = 16.734360

PG18 (io_method=io_uring)
  tps = 35.546825
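
To reproduce, the runs can be driven with something like this (a rough
sketch; the pg_ctl path and data directory location are assumptions, and
io_method requires a restart to change):

   $ for opts in "-c io_method=sync" \
                 "-c io_method=worker -c io_workers=3" \
                 "-c io_method=io_uring"; do
        ./bin/pg_ctl -D ./data -o "$opts" -w restart
        ./bin/pgbench --dbname=postgres -M prepared -n -c 32 -T 60 -f count.sql
     done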






Re: Should io_method=worker remain the default?

From: Tomas Vondra
Date:
On 9/3/25 08:47, Jeff Davis wrote:
> 
> Has there already been a discussion about leaving the default as
> io_method=worker? There was an Open Item for this, which was closed as
> "Won't Fix", but the links don't explain why as far as I can see.
> 

There was a discussion in the first thread mentioned in the commit
message, including some test results:

https://www.postgresql.org/message-id/e6db33f3-50de-43d3-9d9f-747c3b376e80%40vondra.me

There were no proposals to change the default (from worker), so the
conclusion was to stick with the current default. It wasn't quite clear
who's expected to make the proposal / final decision, though.

> I tested a concurrent scan-heavy workload (see below) where the data
> fits in memory, and "worker" seems to be 30% slower than "sync" with
> default settings.
> 

I think this could be due to the "IPC overhead" discussed in the index
prefetch thread recently:

https://www.postgresql.org/message-id/1c9302da-c834-4773-a527-1c1a7029c5a3%40vondra.me

> I'm not suggesting that AIO overall is slow -- on the contrary, I'm
> excited about AIO. But if it regresses in some cases, we should make a
> conscious choice about the default and what kind of tuning advice needs
> to be offered.
> 
> I briefly tried tuning to see if a different io_workers value would
> solve the problem, but no luck.
> 

Right, that's what I saw too. It simply boils down to how many signals
a single process can send/receive. I'd bet if you try the signal-echo
thing, it'd be close to the throughput with io_method=worker.

> The good news is that io_uring seemed to solve the problem.
> Unfortunately, that's platform-specific, so it can't be the default. I
> didn't dig in very much, but it seemed to be at least as good as "sync"
> mode for this workload.
> 

AFAICS that matches what we observed in the index prefetch thread.

>
> Test summary: 32 connections each perform repeated sequential scans.
> Each connection scans a different 1GB partition of the same table. I
> used partitioning and a predicate to make it easier to script in
> pgbench.

I'll try to reproduce this, but if it's due to the same IPC overhead,
that would be surprising (for me). In the index case it makes sense,
because the reads are random enough to prevent I/O combining. But for a
sequential workload I'd expect I/O combining to help. Could it be that
it ends up evicting buffers randomly, which (I guess) might interfere
with the combining? What's shared_buffers set to? Have you watched how
large the I/O requests are? iostat, iosnoop or strace would tell you.



regards

-- 
Tomas Vondra




Re: Should io_method=worker remain the default?

From: Jeff Davis
Date:
On Wed, 2025-09-03 at 10:34 +0200, Tomas Vondra wrote:
> I'll try to reproduce this, but if it's due to the same IPC overhead,
> that would be surprising (for me). In the index case it makes sense,
> because the reads are random enough to prevent I/O combining. But for
> a sequential workload I'd expect I/O combining to help. Could it be
> that it ends up evicting buffers randomly, which (I guess) might
> interfere with the combining? What's shared_buffers set to?

I left it as the default (128MB).

>  Have you watched how
> large the I/O requests are? iostat, iosnoop or strace would tell you.

In my test there's not much device IO, because the data is cached. I
used:

  strace -p $io_worker_pid -e trace=preadv

and looked briefly at the output. iov_len seems to range between 8kB
and about 128kB, but there are still a lot of 8kB reads. That's from a
very brief look, and I can try to get more precise numbers, but there
seem to be enough 8kB ones to support your theory.
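
To turn that brief look into a rough histogram, something like this works
(a sketch; it assumes $io_worker_pid is still set and that strace decodes
the iovecs):

   $ timeout 30 strace -p $io_worker_pid -e trace=preadv 2>&1 \
       | grep -o 'iov_len=[0-9]*' | cut -d= -f2 \
       | sort -n | uniq -c | sort -rn | head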

Regards,
    Jeff Davis




Re: Should io_method=worker remain the default?

From: Álvaro Herrera
Date:
On 2025-Sep-02, Jeff Davis wrote:

> The good news is that io_uring seemed to solve the problem.
> Unfortunately, that's platform-specific, so it can't be the default.

Why not?  We have had wal_sync_method with a platform-dependent default
for a very long time.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/



Re: Should io_method=worker remain the default?

From: Andres Freund
Date:
Hi,

On 2025-09-02 23:47:48 -0700, Jeff Davis wrote:
> Has there already been a discussion about leaving the default as
> io_method=worker? There was an Open Item for this, which was closed as
> "Won't Fix", but the links don't explain why as far as I can see.

> I tested a concurrent scan-heavy workload (see below) where the data
> fits in memory, and "worker" seems to be 30% slower than "sync" with
> default settings.
>
> Test summary: 32 connections each perform repeated sequential scans.
> Each connection scans a different 1GB partition of the same table. I
> used partitioning and a predicate to make it easier to script in
> pgbench.

32 parallel seq scans of large relations, with default shared buffers, fully
cached in the OS page cache, seems like a pretty absurd workload. That's not
to say we shouldn't spend some effort to avoid regressions for it, but it also
doesn't seem to be worth focusing all that much on it.  Or is there a
real-world scenario this is actually emulating?


I think the regression is not due to anything inherent to worker, but due to
pressure on AioWorkerSubmissionQueueLock - at least that's what I'm seeing on
an older two-socket machine. It's possible the bottleneck is different on a
newer machine (my newer workstation is busy on another benchmark right now).

*If* we actually care about this workload, we can make
pgaio_worker_submit_internal() acquire that lock conditionally, and perform
the IOs synchronously instead. That seems to help here, sufficiently to make
worker the same as sync - although plenty of contention remains, from "the
worker side", which can't just acquire the lock conditionally.

But I'm really not sure doing > 30GB/s of repeated reads from the page cache
is a particularly useful thing to optimize. I see a lot of unrelated
contention, e.g. on the BufferMappingLock - unsurprising, it's a really
extreme workload...
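
A crude way to see what the backends are waiting on while the benchmark
runs is to sample wait events (just a sketch; the exact wait_event name
reported for this lwlock may differ):

   $ for i in `seq 1 40`; do
        ./bin/psql -X -At postgres -c "select wait_event_type, wait_event
           from pg_stat_activity where state = 'active'
           and wait_event is not null;"
        sleep 0.5
     done | sort | uniq -c | sort -rn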

If I instead just increase s_b, I get 2x the throughput...

Greetings,

Andres Freund



Re: Should io_method=worker remain the default?

From: Andres Freund
Date:
Hi,

On 2025-09-03 17:53:55 +0200, Álvaro Herrera wrote:
> On 2025-Sep-02, Jeff Davis wrote:
> 
> > The good news is that io_uring seemed to solve the problem.
> > Unfortunately, that's platform-specific, so it can't be the default.
> 
> Why not?  We have had wal_sync_method with a platform-dependent default
> for a very long time.

At least for now there are other reasons:

1) io_uring needs a higher ulimit -n, because we need to create an io_uring
   instance for each proc entry, in postmaster.  In the future, we might want
   to increase soft ulimit -n, but we aren't doing that yet.

   In older kernels the number of io_uring memory mappings can also pose a
   performance issue at high connection rates.

2) Sometimes worker is faster than io_uring, specifically when checksums are
   enabled. Workers can offload the checksum computations, io_uring can't.

3) We can't, at build time, know whether io_uring is actually available at
   runtime. io_uring can be disabled with a runtime sysctl.

   We could add a dynamic fallback to worker, but we don't have that yet.
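
For 1) and 3), a quick preflight on Linux looks something like this (a
sketch; the sysctl only exists on newer kernels):

   $ ulimit -n                          # soft open-file limit for the server user
   $ sysctl kernel.io_uring_disabled    # 0 = enabled, 2 = disabled for everyone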


Anyway, the issue with worker in this case isn't inherent, it's just a bit of
lock contention in an extreme workload...

Greetings,

Andres Freund



Re: Should io_method=worker remain the default?

From: Jeff Davis
Date:
On Wed, 2025-09-03 at 11:55 -0400, Andres Freund wrote:
> 32 parallel seq scans of large relations, with default shared
> buffers, fully cached in the OS page cache, seems like a pretty
> absurd workload.

It's the default settings, and users often just keep going with the
defaults as long as it works, giving little thought to any kind of
tuning or optimization until they hit a wall. Fully cached data is
common, as are scan-heavy workloads. Calling it "absurd" is an
exaggeration.

> That's not to say we shouldn't spend some effort to avoid regressions
> for it, but it also doesn't seem to be worth focusing all that much
> on it.

Fair, but we should acknowledge the places where the new defaults do
better vs worse, and provide some guidance on what to look for and how
to tune it. We should also not be in too much of a rush to get rid of
"sync" mode until we have a better idea about where the tradeoffs are.

> Or is there a real-world scenario this is actually emulating?

This test was my first try at reproducing a smaller (but still
noticeable) regression seen on a more realistic benchmark. I'm not 100%
sure whether I reproduced the same effect or a different one, but I
don't think we should dismiss it so quickly.

> *If* we actually care about this workload, we can make
> pgaio_worker_submit_internal() acquire that lock conditionally, and
> perform the IOs synchronously instead.

I like the idea of some kind of fallback for multiple reasons. I
noticed that if I set io_workers=1, and then I SIGSTOP that worker,
then sequential scans make no progress at all until I send SIGCONT. A
fallback to synchronous sounds more robust, and more similar to what we
do with walwriter and bgwriter. (That may be 19 material, though.)
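
The observation is easy to reproduce (a sketch; it assumes a single io
worker whose process title contains "io worker", and the usual pg_ctl
paths):

   $ ./bin/pg_ctl -D ./data -o "-c io_method=worker -c io_workers=1" -w restart
   $ worker_pid=`pgrep -f "io worker"`
   $ kill -STOP $worker_pid    # sequential scans stop making progress
   $ kill -CONT $worker_pid    # ...and resume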

> But I'm really not sure doing > 30GB/s of repeated reads from the
> page cache is a particularly useful thing to optimize.

A long time ago, the expectation was that Postgres might be running on
a machine along with other software, and perhaps many instances of
Postgres on the same machine. In that case, low shared_buffers compared
with the overall system memory makes sense, which would cause a lot of
back-and-forth into shared buffers. That was also the era of magnetic
disks, where such memory copies seemed almost free by comparison --
perhaps we just don't care about that case any more?


> If I instead just increase s_b, I get 2x the throughput...

Increase to what? I tried a number of settings. Obviously >32GB makes
it a non-issue because everything is cached. Values between 128MB and
32GB didn't seem to help, and were in some cases lower, but I didn't
look into why yet. It might have something to do with crowding out the
page cache.

Regards,
    Jeff Davis




Re: Should io_method=worker remain the default?

From: Andres Freund
Date:
Hi,

On 2025-09-03 11:50:05 -0700, Jeff Davis wrote:
> On Wed, 2025-09-03 at 11:55 -0400, Andres Freund wrote:
> > 32 parallel seq scans of large relations, with default shared
> > buffers, fully cached in the OS page cache, seems like a pretty
> > absurd workload.
> 
> It's the default settings, and users often just keep going with the
> defaults as long as it works, giving little thought to any kind of
> tuning or optimization until they hit a wall. Fully cached data is
> common, as are scan-heavy workloads. Calling it "absurd" is an
> exaggeration.

I agree that an unconfigured postgres is a common thing. But I don't think
that means that doing 30GB/s of IO from 32 backends is something that
workloads using an untuned postgres will do.  I cannot come up with any
halfway realistic scenarios where you'd do anywhere near this much cached IO.

As soon as you don't do quite as much IO, the entire bottleneck the test
exercises *completely* vanishes. There's like 20-30 instructions covered by
that lwlock. You need a *lot* of invocations for that to become a bottleneck.




> > That's not to say we shouldn't spend some effort to avoid regressions for
> > it, but it also doesn't seem to be worth focusing all that much on it.
> 
> Fair, but we should acknowledge the places where the new defaults do better
> vs worse, and provide some guidance on what to look for and how to tune it.

If you find a describable realistic workload that regresses, I'm all ears. Both
from the perspective of fixing regressions and documenting how to choose the
correct io_method.  But providing tuning advice for a very extreme workload
that only conceivably exists with an untuned postgres doesn't seem likely to
help anybody.


> We should also not be in too much of a rush to get rid of "sync" mode until
> we have a better idea about where the tradeoffs are.

Nobody is in a rush to do so, from what I can tell? I don't know why you're so
focused on this.



> >   Or is there a real-world scenario this is actually emulating?
> 
> This test was my first try at reproducing a smaller (but still
> noticeable) regression seen on a more realistic benchmark. I'm not 100%
> sure whether I reproduced the same effect or a different one, but I
> don't think we should dismiss it so quickly.

From what I can tell here, the regression that this benchmark observes is
entirely conditional on *extremely* high volumes of IO being issued by a lot
of backends.

What was the workload that hit the smaller regression?


> > *If* we actually care about this workload, we can make
> > pgaio_worker_submit_internal() acquire that lock conditionally, and
> > perform the IOs synchronously instead.
> 
> I like the idea of some kind of fallback for multiple reasons. I
> noticed that if I set io_workers=1, and then I SIGSTOP that worker,
> then sequential scans make no progress at all until I send SIGCONT. A
> fallback to synchronous sounds more robust, and more similar to what we
> do with walwriter and bgwriter. (That may be 19 material, though.)

I don't think what I'm proposing would really change anything in such a
scenario, unless you were unlucky enough to send the SIGSTOP in the very short
window in which the worker held the lwlock.

We already fall back to synchronous IO when the queue towards the IO workers
is full. But I don't see a way to identify the case of "the queue is not full,
but the worker isn't making enough progress".


> > But I'm really not sure doing > 30GB/s of repeated reads from the
> > page cache is a particularly useful thing to optimize.
> 
> A long time ago, the expectation was that Postgres might be running on
> a machine along with other software, and perhaps many instances of
> Postgres on the same machine. In that case, low shared_buffers compared
> with the overall system memory makes sense, which would cause a lot of
> back-and-forth into shared buffers. That was also the era of magnetic
> disks, where such memory copies seemed almost free by comparison --
> perhaps we just don't care about that case any more?

I think we should still care about performing reasonably when data primarily
is cached in the OS page cache. I just don't think the benchmark is a good
example of such workloads - if you need to do data analysis at > 30GB/s,
while the data is also cached, you can configure postgres at
least somewhat reasonably.  Even if you were to somehow analyze this much
data, any realistic workload will actually do more than just count(*), which
means AioWorkerSubmissionQueueLock won't be the bottleneck.


> > If I instead just increase s_b, I get 2x the throughput...
> 
> Increase to what? I tried a number of settings. Obviously >32GB makes
> it a non-issue because everything is cached. Values between 128MB and
> 32GB didn't seem to help, and were in some cases lower, but I didn't
> look into why yet. It might have something to do with crowding out the
> page cache.

I meant above ~32GB.

Greetings,

Andres Freund



Re: Should io_method=worker remain the default?

From: Thomas Munro
Date:
On Thu, Sep 4, 2025 at 6:50 AM Jeff Davis <pgsql@j-davis.com> wrote:
> I like the idea of some kind of fallback for multiple reasons. I
> noticed that if I set io_workers=1, and then I SIGSTOP that worker,
> then sequential scans make no progress at all until I send SIGCONT. A
> fallback to synchronous sounds more robust, and more similar to what we
> do with walwriter and bgwriter. (That may be 19 material, though.)

This seems like a non-problem.  Robustness against SIGSTOP of random
backends is not a project goal or a reasonable goal, is it?  You can
SIGSTOP a backend doing IO in any historical release, possibly
blocking other backends too based on locks etc.

That said, it is quite reasonable to ask why it doesn't just start a
new worker, and that's just code maturity:

* I have a patch basically ready to commit for v19 (CF #5913) that
would automatically add more workers if you did that.  But even then
you could be unlucky and SIGSTOP a worker while it holds the
submission queue lock.
* I also had experimental versions that use a lock-free queue, but it
didn't seem necessary given how hard it is to measure meaningful lock
contention so far; I guess it must be easier on a NUMA system, and one
might wonder about per-NUMA-node queues, but that also feels a bit
questionable because if you had a very high-end system you'd probably
be looking into better tuning, including io_method=io_uring.



Re: Should io_method=worker remain the default?

From: Jeff Davis
Date:
On Thu, 2025-09-04 at 10:22 +1200, Thomas Munro wrote:
> This seems like a non-problem.  Robustness against SIGSTOP of random
> backends is not a project goal or reasonable goal, is it?

I think you misinterpreted -- I wasn't suggesting that sending SIGSTOP
is a use case. I just meant that it showed a strong dependence between
the processes, and that it might be more robust to have some kind of
fallback.

> * I have a patch basically ready to commit for v19 (CF #5913) that
> would automatically add more workers if you did that.  But even then
> you could be unlucky and SIGSTOP a worker while it holds the
> submission queue lock.

Making it adaptable sounds like a nice improvement, especially since we
don't have good guidance in the docs about how to tune it.

> * I also had experimental versions that use a lock free queue, but it
> didn't seem necessary given how hard it is to measure meaningful lock
> contention so far;

Andres suggested that the case I brought up at the top of the thread is
due to lock contention, so a lock-free queue also sounds like a
potential improvement. If the code is working and can be applied to
REL_18_STABLE, then I can try it.

Regards,
    Jeff Davis




Re: Should io_method=worker remain the default?

From: Thomas Munro
Date:
On Thu, Sep 4, 2025 at 10:47 AM Jeff Davis <pgsql@j-davis.com> wrote:
> Andres suggested that the case I brought up at the top of the thread is
> due to lock contention, so a lock free queue also sounds like a
> potential improvement. If the code is working and can be applied to
> REL_18_STABLE, then I can try it.

Not easily, unfortunately, that was a while back and I'd need to find
all the bits and rebase.

There is a preliminary step in CF5913[1]: it splits
AioWorkerSubmissionQueueLock's duties into AioWorkerControlLock for
pool management and AioWorkerSubmissionQueueLock for the queue itself.
I had a couple of different attempts at generic lock-free queues, one
from a paper and one loosely based on FreeBSD's buf_ring.h, but I'm
not at all sure it's actually any good and I found it quite tricky to
integrate with the idle bitmap scheme which so far seems to be
valuable... Hmm, it might be good to try some simpler things first:
splitting AioWorkerSubmissionQueueLock into a producer lock (backends
enqueuing) and a consumer lock (workers dequeuing) must surely help a
lot.   I'll try to come up with some simple patches to evaluate...

[1]
https://www.postgresql.org/message-id/flat/CA%2BhUKGJLe-0E-wZ-is78CEHhjbC%3DihMVCQLoN1dmD-j05s9qRg%40mail.gmail.com#c3eb6c873b7218d5c3d5164244f7245d



Re: Should io_method=worker remain the default?

From: Jeff Davis
Date:
On Wed, 2025-09-03 at 11:55 -0400, Andres Freund wrote:
> I think the regression is not due to anything inherent to worker, but
> due to pressure on AioWorkerSubmissionQueueLock - at least that's
> what I'm seeing on an older two-socket machine. It's possible the
> bottleneck is different on a newer machine (my newer workstation is
> busy on another benchmark right now).

I believe what's happening is that parallelism of the IO completion
work (e.g. checksum verification) is reduced. In worker mode, the
completion work is happening on the io workers (of which there are 3);
while in sync mode the completion work is happening in the backends (of
which there are 32).

There may be lock contention too, but I don't think that's the primary
issue.

I attached a test patch for illustration. It simplifies the code inside
the LWLock to enqueue/dequeue only, and simplifies and reduces the
wakeups by doing pseudo-random wakeups only when enqueuing. Reducing
the wakeups should reduce the number of signals generated without
hurting my case, because the workers are never idle. And reducing the
instructions while holding the LWLock should reduce lock contention.
But the patch barely makes a difference: still around 24tps.

What *does* make a difference is changing io_worker_queue_size. A lower
value of 16 effectively starves the workers of work to do, and I get a
speedup to about 28tps. A higher value of 512 gives the workers more
chance to issue the IOs -- and more responsibility to complete them --
and it drops to 17tps. Furthermore, while the test is running, the io
workers are constantly at 100% (mostly verifying checksums) and the
backends are at 50% (20% when io_worker_queue_size=512).
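
Those CPU figures come from watching the processes; something like the
following gives a rough live picture (a sketch, assuming the workers and
backends are identifiable by their process titles):

   $ watch -n 1 'ps -o pid,pcpu,args -C postgres --sort=-pcpu | head -40'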

As an aside, I'm building with meson using
-Dc_args="-msse4.2 -Wtype-limits -Werror=missing-braces". But I notice
that the meson build doesn't seem to use -funroll-loops or
-ftree-vectorize when building checksums.c. Is that intentional? If
not, perhaps slower checksum calculations explain my results.

Regards,
    Jeff Davis


Attachments

Re: Should io_method=worker remain the default?

From: Jeff Davis
Date:
On Fri, 2025-09-05 at 13:25 -0700, Jeff Davis wrote:
> As an aside, I'm building with meson using
> -Dc_args="-msse4.2 -Wtype-limits -Werror=missing-braces". But I
> notice that the meson build doesn't seem to use -funroll-loops or
> -ftree-vectorize when building checksums.c. Is that intentional? If
> not, perhaps slower checksum calculations explain my results.

Attached a patch. I'm not sure if it's the right way to do things, but
the resulting compiler command seems right to me.

It doesn't affect my tests in this thread much, but it seems best to be
consistent with autoconf.

Regards,
    Jeff Davis


Attachments

Re: Should io_method=worker remain the default?

From: Thomas Munro
Date:
On Sat, Sep 6, 2025 at 8:26 AM Jeff Davis <pgsql@j-davis.com> wrote:
> On Wed, 2025-09-03 at 11:55 -0400, Andres Freund wrote:
> > I think the regression is not due to anything inherent to worker,
> > but due to pressure on AioWorkerSubmissionQueueLock - at least
> > that's what I'm seeing on an older two-socket machine. It's possible
> > the bottleneck is different on a newer machine (my newer workstation
> > is busy on another benchmark right now).
>
> I believe what's happening is that parallelism of the IO completion
> work (e.g. checksum verification) is reduced. In worker mode, the
> completion work is happening on the io workers (of which there are 3);
> while in sync mode the completion work is happening in the backends (of
> which there are 32).

Makes sense.

Some raw thoughts on this topic, and how we got here:  This type of
extreme workload, namely not doing any physical I/O, just copying the
same data from the kernel page cache to the buffer pool over and over
again, is also the workload where io_method=worker can beat
io_method=io_uring (and other native I/O methods I've prototyped),
assuming io_workers is increased to a level that keeps up.  That does
lead to a couple of ideas such as handing completions off to a new
generic task worker pool, even for native I/O methods.  At a guess,
you'd probably only want to do that when extra kernel completion
notifications are drained ahead of the one you need right now, so that
it only kicks in when you're reading ahead further than you strictly
need right now (that's just a guess about the cost of IPC and buffer
header cache line ping-pong effects vs the cost of running the
checksums yourself, not properly investigated).

I also experimented with going the other way: having I/O workers do
*just* the preadv() and then hand completion duties back to the
backend that waits for the IO.  I have patches for that, creating a
per-backend completion queue that I/O methods could opt to use (I
experimented with that model as part of my ongoing struggle with
multi-process PostgreSQL + native AIO, which fundamentally assumes
threads on all other OSes, but it's trivial to enable it for
io_method=worker too).  But that leveled the playing field in the
wrong direction!  This I/O worker property is actually good when tuned
correctly so that you get more rather than less checksum computation
concurrency, so I abandoned that experiment.  Of course if we did
that, in your case here you'd get all 32 backends computing checksums
no matter how many I/O workers are running.

The original motivation for workers processing completions rather than
more directly simulating io_uring (which was the original and only
implementation when I got involved in this) was the sheer simplicity
of it, and this choice didn't matter much before the
checksums-by-default change went in just a little ahead of basic AIO
support.

> There may be lock contention too, but I don't think that's the primary
> issue.

Interesting that it shows up so clearly for Andres but not for you.
I've made a note to try out those lock improvement patches I mentioned
on a dual socket box once I'm done with some other work that is
keeping me busy...

> I attached a test patch for illustration. It simplifies the code inside
> the LWLock to enqueue/dequeue only, and simplifies and reduces the
> wakeups by doing pseudo-random wakeups only when enqueuing. Reducing
> the wakeups should reduce the number of signals generated without
> hurting my case, because the workers are never idle. And reducing the
> instructions while holding the LWLock should reduce lock contention.
> But the patch barely makes a difference: still around 24tps.

BTW There are already a couple of levels of signal suppression: if
workers are not idle then we don't set any latches, and even if we
did, SetLatch() only sends signals when the recipient is actually
waiting, which shouldn't happen when the pool is completely busy.

+    nextWakeupWorker = (nextWakeupWorker + 1) % io_workers;

FWIW, I experimented extensively with wakeup distribution schemes like
that ^ and others, and found that they performed significantly worse
in terms of avg(latency) and stddev(latency) for partially idle worker
pools, as discussed a bit in that auto-tuning thread.  It seems to be
generally better to keep a small number of workers busy.

> What *does* make a difference is changing io_worker_queue_size. A lower
> value of 16 effectively starves the workers of work to do, and I get a
> speedup to about 28tps. A higher value of 512 gives the workers more
> chance to issue the IOs -- and more responsibility to complete them --
> and it drops to 17tps. Furthermore, while the test is running, the io
> workers are constantly at 100% (mostly verifying checksums) and the
> backends are at 50% (20% when io_worker_queue_size=512).

I would value your feedback and this type of analysis on the thread
about automatic tuning for v19.  It should find the setting that keeps
your executor backends 100% busy, and some number of low-numbered I/O
workers 100% busy (only because here you're doing no physical I/O that
would cause them to sleep in D state), but leave some spare worker
capacity to allow for recent variation.  It should also keep the queue
from overflowing and falling back to synchronous I/O.  Keeping the
queue size low is important for latency, because we don't want new
queries to get stuck behind a huge queue: each new IO should always be
processed ASAP if the storage can keep up.  That's kinda why a queue
size of 64 is probably "enough", since 32 workers is the highest pool
size and it doesn't like there to be more than one extra queued IO per
currently running worker (which is debatable; see the discussion on the
simplicity of that policy vs theoretical arguments about some unknown
variables).  Though I do think there are good reasons to make the
queue size user-configurable for developers only, and I'm planning to
propose that and more as a debug_ setting, since it turns out to be
quite useful for testing theories about higher-level rate control
systems, discussed a bit in that thread, and hopefully with some more
patches a bit later...

You might also be interested in the phenomenon discussed in the index
prefetching thread, where concurrent reads of the *same* block from
different backends currently force a wait, which can be avoided with a
patch he posted.  In that thread it was happening due to index scans
trying to stream the same block repeatedly in one stream, but it's
also known to happen in torture tests much like yours where different
backends step on each other's toes; on the other hand you're reporting
100% CPU for backends, so I guess it's not happening much... hmm,
perhaps you disabled synchronous scans, or something about your test
conditions avoids that?

> As an aside, I'm building with meson using
> -Dc_args="-msse4.2 -Wtype-limits -Werror=missing-braces". But I
> notice that the meson build doesn't seem to use -funroll-loops or
> -ftree-vectorize when building checksums.c. Is that intentional? If
> not, perhaps slower checksum calculations explain my results.

Ah, good discovery.



Re: Should io_method=worker remain the default?

From: Thomas Munro
Date:
On Mon, Sep 8, 2025 at 2:39 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> You might also be interested in the phenomenon discussed in the index
> prefetching thread, where concurrent reads of the *same* block from
> different backends currently force a wait, which can be avoided with a
> patch he posted.  In that thread it was happening due to index scans

Sorry, bad edit: I meant a patch that Andres posted.



Re: Should io_method=worker remain the default?

From: Andres Freund
Date:
Hi,

On 2025-09-05 16:07:09 -0700, Jeff Davis wrote:
> On Fri, 2025-09-05 at 13:25 -0700, Jeff Davis wrote:
> > As an aside, I'm building with meson using
> > -Dc_args="-msse4.2 -Wtype-limits -Werror=missing-braces". But I
> > notice that the meson build doesn't seem to use -funroll-loops or
> > -ftree-vectorize when building checksums.c. Is that intentional? If
> > not, perhaps slower checksum calculations explain my results.

Yes, that's an oversight.


> Attached a patch. I'm not sure if it's the right way to do things, but
> the resulting compiler command seems right to me.
> 
> It doesn't affect my tests in this thread much, but it seems best to be
> consistent with autoconf.

Yes, we should do this - the patch looks good to me.

Greetings,

Andres Freund



Re: Should io_method=worker remain the default?

From: Jeff Davis
Date:
On Mon, 2025-09-08 at 14:39 +1200, Thomas Munro wrote:
> Some raw thoughts on this topic, and how we got here:  This type of
> extreme workload, namely not doing any physical I/O, just copying the
> same data from the kernel page cache to the buffer pool over and over
> again,

Isn't that one of the major selling points of AIO? It does "real
readahead" from kernel buffers into PG buffers ahead of the time, so
that the backend doesn't have to do the memcpy and checksum
calculation. The benefit will be even larger when AIO eventually
enables effective Direct IO, so that shared_buffers can be a larger
share of system memory, and we don't need to move back-and-forth
between kernel buffers and PG buffers (and recalculate the checksum).

The only problem right now is that it doesn't (yet) work great when the
concurrency is higher because:

(a) we don't adapt well to saturated workers; and
(b) there's lock contention on the queue if there are more workers

and I believe both of those problems can be solved in 19.

>  is also the workload where io_method=worker can beat
> io_method=io_uring (and other native I/O methods I've prototyped),
> assuming io_workers is increased to a level that keeps up.

Right, when the workers are not saturated, they *increase* the
parallelism because there are more processes doing the work (unless you
run into lock contention).


> didn't matter much before the
> checksums-by-default change went in just a little ahead of basic AIO
> support.

Yeah, the trade-offs are much different when checksums are on vs off.

>
> Interesting that it shows up so clearly for Andres but not for you.

When I increase the io_worker count, then it does seem to be limited by
lock contention (at least I'm seeing evidence with -DLWLOCK_STATS). I
suppose my case is just below some threshold.
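
Collecting that evidence looks roughly like this (a sketch; it assumes a
meson build directory named "build", that the server's stderr is captured
in logfile, and the exact format of the LWLOCK_STATS lines may differ):

   $ meson configure build -Dc_args="-DLWLOCK_STATS"
   $ ninja -C build install
   $ # run the workload, then restart so each backend dumps its lock counters
   $ grep lwlock logfile | grep -vc ' blk 0 '   # rough count of entries that blocked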

>
> BTW There are already a couple of levels of signal suppression: if
> workers are not idle then we don't set any latches, and even if we
> did, SetLatch() only sends signals when the recipient is actually
> waiting, which shouldn't happen when the pool is completely busy.

Oh, good to know.

> +    nextWakeupWorker = (nextWakeupWorker + 1) % io_workers;
>
> FWIW, I experimented extensively with wakeup distribution schemes

My patch was really just to try to test the two hypotheses; I wasn't
proposing it. But I was curious whether a simpler scheme might be just
as good, and it looks like you already considered and rejected it.

>
> I would value your feedback and this type of analysis on the thread
> about automatic tuning for v19.

OK, I will continue the tuning discussion there.

Regarding $SUBJECT: it looks like others are just fine with worker mode
as the default in 18. I have added discussion links to the "no change"
entry in the Open Items list.

I think we'll probably see some of the effects (worker saturation or
lock contention) from my test case appear in real workloads, but
affected users can change to sync mode until we sort these things out
in 19.

Regards,
    Jeff Davis




Re: Should io_method=worker remain the default?

From: Nathan Bossart
Date:
On Mon, Sep 08, 2025 at 01:43:45PM -0400, Andres Freund wrote:
> On 2025-09-05 16:07:09 -0700, Jeff Davis wrote:
>> On Fri, 2025-09-05 at 13:25 -0700, Jeff Davis wrote:
>>> As an aside, I'm building with meson using
>>> -Dc_args="-msse4.2 -Wtype-limits -Werror=missing-braces". But I
>>> notice that the meson build doesn't seem to use -funroll-loops or
>>> -ftree-vectorize when building checksums.c. Is that intentional? If
>>> not, perhaps slower checksum calculations explain my results.
> 
> Yes, that's an oversight.
> 
>> Attached a patch. I'm not sure if it's the right way to do things, but
>> the resulting compiler command seems right to me.
>> 
>> It doesn't affect my tests in this thread much, but it seems best to be
>> consistent with autoconf.
> 
> Yes, we should do this - the patch looks good to me.

I think we need to do something similar for numeric.c.

diff --git a/src/backend/utils/adt/meson.build b/src/backend/utils/adt/meson.build
index dac372c3bea..7d0688ddb87 100644
--- a/src/backend/utils/adt/meson.build
+++ b/src/backend/utils/adt/meson.build
@@ -1,5 +1,14 @@
 # Copyright (c) 2022-2025, PostgreSQL Global Development Group
 
+numeric_backend_lib = static_library('numeric_backend_lib',
+  'numeric.c',
+  dependencies: backend_build_deps,
+  kwargs: internal_lib_args,
+  c_args: vectorize_cflags,
+)
+
+backend_link_with += numeric_backend_lib
+
 backend_sources += files(
   'acl.c',
   'amutils.c',
@@ -61,7 +70,6 @@ backend_sources += files(
   'network_gist.c',
   'network_selfuncs.c',
   'network_spgist.c',
-  'numeric.c',
   'numutils.c',
   'oid.c',
   'oracle_compat.c',

-- 
nathan



Re: Should io_method=worker remain the default?

From: Jeff Davis
Date:
On Mon, 2025-09-08 at 15:14 -0500, Nathan Bossart wrote:
> I think we need to do something similar for numeric.c.

Good catch. Patch LGTM.

Regards,
    Jeff Davis




Re: Should io_method=worker remain the default?

From: Andres Freund
Date:
Hi,

On 2025-09-08 12:08:10 -0700, Jeff Davis wrote:
> On Mon, 2025-09-08 at 14:39 +1200, Thomas Munro wrote:
> > Some raw thoughts on this topic, and how we got here:  This type of
> > extreme workload, namely not doing any physical I/O, just copying the
> > same data from the kernel page cache to the buffer pool over and over
> > again,
> 
> Isn't that one of the major selling points of AIO? It does "real
> readahead" from kernel buffers into PG buffers ahead of time, so
> that the backend doesn't have to do the memcpy and checksum
> calculation.

I don't think accelerating copying from the pagecache into postgres shared
buffers really is a goal of AIO. It can be a nice side-effect, but not more.
I'd say the absolute number one goal of AIO is to reduce the amount of
synchronous waiting for IO in the "foreground".  An important, but
secondary, goal is to offload work from the CPU, via DMA - but that
can fundamentally only work with DIO.


> The only problem right now is that it doesn't (yet) work great when the
> concurrency is higher because:
> 
> (a) we don't adapt well to saturated workers; and
> (b) there's lock contention on the queue if there are more workers

Afaict you've not explained in what even remotely plausible scenario you'd
have > 30GB/s of reads from the page cache. I continue to maintain that
the workload presented does not exist in the real world.

Greetings,

Andres Freund



Re: Should io_method=worker remain the default?

From: Andres Freund
Date:
Hi,

On 2025-09-08 16:45:52 -0400, Andres Freund wrote:
> On 2025-09-08 12:08:10 -0700, Jeff Davis wrote:
> > On Mon, 2025-09-08 at 14:39 +1200, Thomas Munro wrote:
> > > Some raw thoughts on this topic, and how we got here:  This type of
> > > extreme workload, namely not doing any physical I/O, just copying the
> > > same data from the kernel page cache to the buffer pool over and over
> > > again,
> > 
> > Isn't that one of the major selling points of AIO? It does "real
> > readahead" from kernel buffers into PG buffers ahead of time, so
> > that the backend doesn't have to do the memcpy and checksum
> > calculation.
> 
> I don't think accelerating copying from the pagecache into postgres shared
> buffers really is a goal of AIO.

I forgot an addendum: In fact, if there were a sufficiently cheap way to avoid
using AIO when data is in the page cache, I'm fairly sure we'd want to use
that. However, there is not, from what I know (both fincore() and RWF_NOWAIT
are too expensive). The maximum gain from using AIO when the data is already
in the page cache is just not very big, and it can cause slowdowns due to IPC
overhead etc.

Greetings,

Andres Freund



Re: Should io_method=worker remain the default?

From: Jeff Davis
Date:
On Mon, 2025-09-08 at 16:45 -0400, Andres Freund wrote:
> I don't think accelerating copying from the pagecache into postgres
> shared buffers really is a goal of AIO. It can be a nice side-effect,
> but not more.

Interesting. I thought that was part of the design.

> Afaict you've not explained in what even remotely plausible scenario
> you'd have > 30GB/s of reads from the page cache. I continue to
> maintain that the workload presented does not exist in the real
> world.

For now, let's just consider it a test case that stressed a couple
subsystems and found some interesting effects. I think it will be
helpful context for potential improvements in 19. If I find other
interesting cases, I'll post them.

Regards,
    Jeff Davis




Re: Should io_method=worker remain the default?

From: Jeff Davis
Date:
On Mon, 2025-09-08 at 17:02 -0400, Andres Freund wrote:
> I forgot an addendum: In fact, if there were a sufficiently cheap way
> to avoid using AIO when data is in the page cache, I'm fairly sure
> we'd want to use that. However, there is not, from what I know (both
> fincore() and RWF_NOWAIT are too expensive). The maximum gain from
> using AIO when the data is already in the page cache is just not very
> big, and it can cause slowdowns due to IPC overhead etc.

Also interesting. It might be worth mentioning in the docs and/or
README.

Regards,
    Jeff Davis




Re: Should io_method=worker remain the default?

From: Thomas Munro
Date:
On Tue, Sep 9, 2025 at 9:02 AM Andres Freund <andres@anarazel.de> wrote:
> On 2025-09-08 16:45:52 -0400, Andres Freund wrote:
> > I don't think accelerating copying from the pagecache into postgres shared
> > buffers really is a goal of AIO.
>
> I forgot an addendum: In fact, if there were a sufficiently cheap way to avoid
> using AIO when data is in the page cache, I'm fairly sure we'd want to use
> that. However, there is not, from what I know (both fincore() and RWF_NOWAIT
> are too expensive). The maximum gain from using AIO when the data is already
> in the page cache is just not very big, and it can cause slowdowns due to IPC
> overhead etc.

FWIW, I briefly played around with RWF_NOWAIT in pre-AIO streaming
work: I tried preadv2(RWF_NOWAIT) before issuing WILLNEED advice.  I
cooked up some experimental heuristics to do it only when it seemed
likely to pay off*.  I also played with probing to find the frontier
where fadvise had "completed", while writing toy implementations of
some of Melanie's feedback control ideas.  It was awful.  Fun systems
programming puzzles, but it felt like jogging with one's shoelaces
tied together compared to proper AIO interfaces.

Note that io_uring already has vaguely similar behaviour internally:
see IOSQE_ASYNC heuristics in method_io_uring.c and man
io_uring_sqe_set_flags.

In the new AIO world, I therefore assume we'd only be talking about a
potential path that could skip some overheads for
io_method=worker/sync with a primed page cache, and that seems to have
some fundamental issues: (1) AFAIK the plan is to drop io_method=sync
soon, it's only a temporary be-more-like-v17 mitigation in case of
unforeseen problems or in case we decided not to launch with worker by
default, and this thread has (re-)concluded we should stick with
worker, (2) preadv2() is Linux-only and I'm not aware of a similar
"non-blocking file I/O" interface on any other system**, and yet
io_method=worker is primarily intended as a portable fallback for
systems lacking a better native option and (3) even though it should
win for Jeff's test case by skipping workers entirely if an initial
RWF_NOWAIT attempt succeeds, you could presumably change some
parameters and make it lose (number of backends vs number of I/O
workers performing copyout, cf in_flight_before > 4 in io_uring code,
and performing checksums as discussed).

Still, it's interesting to contemplate the two independent points of
variation: concurrency of page cache copyout (IOSQE_ASYNC, magic
number 4, what other other potential native I/O methods do here) and
concurrency of checksum computation (potential for worker pool
handoff).

*One scheme kept stats in a per-relation shm object.  That was
abandoned per the above reasoning, and, digressing a bit here, I'm
currently much more interested in tracking facts about our own buffer
pool contents, to inform streaming decisions and skip the buffer
mapping table in many common cases.  Digressing even further, my first
priority for per-relation shm objects is not even that, it's to
improve the fsync hand-off queue: (1) we probably shouldn't trust
Linux to remember relation sizes until we've fsync'd, and (2) Andres's
asynchronous buffer write project wants a no-throw guarantee when
enqueuing in a critical section.

**Anyone know of one?  This is basically a really ancient and
deliberate Unix design decision to hide I/O asynchrony and buffering
from user space completely, unlike pretty much every other OS, shining
through.  (I've thought about proposing it for FreeBSD as a
programming exercise but I'd rather spend my spare time making its AIO
better.)



Re: Should io_method=worker remain the default?

From: Thomas Munro
Date:
On Tue, Sep 9, 2025 at 1:59 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> (3) even though it should
> win for Jeff's test case by skipping workers entirely if an initial
> RWF_NOWAIT attempt succeeds, you could presumably change some
> parameters and make it lose

... not to mention that with v19 patches even that effect should go
away when it automatically scales the worker pool to the perfect™
size.



Re: Should io_method=worker remain the default?

From: Nathan Bossart
Date:
On Mon, Sep 08, 2025 at 01:35:02PM -0700, Jeff Davis wrote:
> On Mon, 2025-09-08 at 15:14 -0500, Nathan Bossart wrote:
>> I think we need to do something similar for numeric.c.
> 
> Good catch. Patch LGTM.

Thanks for reviewing.

My first instinct was to consider this for back-patching, but I see that
you didn't back-patch commit 9af672b.  Is there any technical reason we
shouldn't be back-patching these types of meson build script changes?

-- 
nathan



Re: Should io_method=worker remain the default?

From: Jeff Davis
Date:
On Tue, 2025-09-09 at 10:39 -0500, Nathan Bossart wrote:
> My first instinct was to consider this for back-patching, but I see
> that
> you didn't back-patch commit 9af672b.  Is there any technical reason
> we
> shouldn't be back-patching these types of meson build script changes?

I think you are right, I can't see any harm in backporting it, and
arguably it's a bug: it would be annoying to compare performance across
different versions with different build flags.

I backported through 16.

Regards,
    Jeff Davis




Re: Should io_method=worker remain the default?

From: Nathan Bossart
Date:
On Tue, Sep 09, 2025 at 04:11:49PM -0700, Jeff Davis wrote:
> On Tue, 2025-09-09 at 10:39 -0500, Nathan Bossart wrote:
>> My first instinct was to consider this for back-patching, but I see that
>> you didn't back-patch commit 9af672b.  Is there any technical reason we
>> shouldn't be back-patching these types of meson build script changes?
> 
> I think you are right, I can't see any harm in backporting it, and
> arguably it's a bug: it would be annoying to compare performance across
> different versions with different build flags.
> 
> I backported through 16.

Cool.  I committed/back-patched the numeric.c fix to v16 as well.

-- 
nathan