Thread: Should io_method=worker remain the default?
Has there already been a discussion about leaving the default as io_method=worker? There was an Open Item for this, which was closed as "Won't Fix", but the links don't explain why as far as I can see.

I tested a concurrent scan-heavy workload (see below) where the data fits in memory, and "worker" seems to be 30% slower than "sync" with default settings.

I'm not suggesting that AIO overall is slow -- on the contrary, I'm excited about AIO. But if it regresses in some cases, we should make a conscious choice about the default and what kind of tuning advice needs to be offered.

I briefly tried tuning to see if a different io_workers value would solve the problem, but no luck.

The good news is that io_uring seemed to solve the problem. Unfortunately, that's platform-specific, so it can't be the default. I didn't dig in very much, but it seemed to be at least as good as "sync" mode for this workload.

Regards,
	Jeff Davis

Test summary: 32 connections each perform repeated sequential scans. Each connection scans a different 1GB partition of the same table. I used partitioning and a predicate to make it easier to script in pgbench.

Test details:

Machine:
  AMD Ryzen 9 9950X 16-Core Processor
  64GB RAM
  Local storage, NVMe SSD
  Ubuntu 24.04 (Linux 6.11, liburing 2.5)

Note: the storage didn't matter much, because the data fits in memory. To get consistent results, when changing between data directories for the 17 and 18 tests, I had to drop the filesystem cache first to make room, then run a few scans to warm it with the data from the right data directory. For simplicity I disabled parallel query, but that didn't seem to have a big effect. Everything else was set to the default.

Setup (checksums enabled):

  => create table t(sid int8, c0 int8, c1 int8, c2 int8, c3 int8, c4 int8, c5 int8, c6 int8, c7 int8) partition by range (sid);
  $ (for i in `seq 0 31`; do echo "create table t$(printf "%02d" $i) partition of t for values from ($i) to ($((i+1)));"; done) | ./bin/psql postgres
  $ (for i in `seq 0 31`; do echo "insert into t$(printf "%02d" $i) select $i, 0, 1, 2, 3, 4, 5, 6, 7 from generate_series(0, 10000000);"; done) | ./bin/psql postgres
  => vacuum analyze; checkpoint;

Script count.sql:

  SELECT COUNT(*) FROM t WHERE sid=:client_id;

pgbench:

  ./bin/pgbench --dbname=postgres -M prepared -n -c 32 -T 60 \
    -f count.sql

Results:

  PG17:                                   tps = 36.209048
  PG18 (io_method=sync):                  tps = 34.014890
  PG18 (io_method=worker io_workers=3):   tps = 23.938509
  PG18 (io_method=worker io_workers=16):  tps = 16.734360
  PG18 (io_method=io_uring):              tps = 35.546825
On 9/3/25 08:47, Jeff Davis wrote:
> Has there already been a discussion about leaving the default as io_method=worker? There was an Open Item for this, which was closed as "Won't Fix", but the links don't explain why as far as I can see.

There was a discussion in the first thread mentioned in the commit message, including some test results:

https://www.postgresql.org/message-id/e6db33f3-50de-43d3-9d9f-747c3b376e80%40vondra.me

There were no proposals to change the default (from worker), so the conclusion was to stick with the current default. It wasn't quite clear who's expected to make the proposal / final decision, though.

> I tested a concurrent scan-heavy workload (see below) where the data fits in memory, and "worker" seems to be 30% slower than "sync" with default settings.

I think this could be due to the "IPC overhead" discussed in the index prefetch thread recently:

https://www.postgresql.org/message-id/1c9302da-c834-4773-a527-1c1a7029c5a3%40vondra.me

> I'm not suggesting that AIO overall is slow -- on the contrary, I'm excited about AIO. But if it regresses in some cases, we should make a conscious choice about the default and what kind of tuning advice needs to be offered.
>
> I briefly tried tuning to see if a different io_workers value would solve the problem, but no luck.

Right, that's what I saw too. It simply boils down to how many signals a single process can send/receive. I'd bet that if you try the signal-echo thing, it'd be close to the throughput with io_method=worker.

> The good news is that io_uring seemed to solve the problem. Unfortunately, that's platform-specific, so it can't be the default. I didn't dig in very much, but it seemed to be at least as good as "sync" mode for this workload.

AFAICS that matches what we observed in the index prefetch thread.

> Test summary: 32 connections each perform repeated sequential scans. Each connection scans a different 1GB partition of the same table. I used partitioning and a predicate to make it easier to script in pgbench.

I'll try to reproduce this, but if it's due to the same IPC overhead, that would be surprising (for me). In the index case it makes sense, because the reads are random enough to prevent I/O combining. But for a sequential workload I'd expect I/O combining to help. Could it be that it ends up evicting buffers randomly, which (I guess) might interfere with the combining? What's shared_buffers set to? Have you watched how large the I/O requests are? iostat, iosnoop or strace would tell you.

regards

-- 
Tomas Vondra
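For readers who want to try the "signal-echo thing" mentioned above, here is a minimal standalone sketch (not from the thread; a hypothetical micro-benchmark) that measures how many signal round trips per second two processes can sustain. That number is a rough ceiling on how fast a single I/O worker can be woken per request, under the assumption that each handoff costs roughly one signal exchange.

    /* sigecho.c: build with "cc -O2 sigecho.c -o sigecho" */
    #include <signal.h>
    #include <stdio.h>
    #include <sys/time.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        sigset_t set;
        sigemptyset(&set);
        sigaddset(&set, SIGUSR1);
        /* block SIGUSR1 before forking so neither side can miss a signal */
        sigprocmask(SIG_BLOCK, &set, NULL);

        pid_t parent = getpid();
        pid_t child = fork();
        if (child == 0)
        {
            /* child: echo every signal straight back to the parent */
            for (;;)
            {
                int sig;
                sigwait(&set, &sig);
                kill(parent, SIGUSR1);
            }
        }

        /* parent: measure round trips per second */
        struct timeval t0, t1;
        long iters = 200000;
        gettimeofday(&t0, NULL);
        for (long i = 0; i < iters; i++)
        {
            int sig;
            kill(child, SIGUSR1);
            sigwait(&set, &sig);
        }
        gettimeofday(&t1, NULL);
        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("%.0f signal round trips/sec\n", iters / secs);

        kill(child, SIGKILL);
        waitpid(child, NULL, 0);
        return 0;
    }

This is only an analogy for the wakeup path; the real worker mechanism uses latches and suppresses signals when workers are already busy, as discussed later in the thread.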
On Wed, 2025-09-03 at 10:34 +0200, Tomas Vondra wrote:
> I'll try to reproduce this, but if it's due to the same IPC overhead, that would be surprising (for me). In the index case it makes sense, because the reads are random enough to prevent I/O combining. But for a sequential workload I'd expect I/O combining to help. Could it be that it ends up evicting buffers randomly, which (I guess) might interfere with the combining? What's shared_buffers set to?

I left it as the default (128MB).

> Have you watched how large the I/O requests are? iostat, iosnoop or strace would tell you.

In my test there's not much device IO, because the data is cached. I used:

  strace -p $io_worker_pid -e trace=preadv

and looked briefly at the output. iov_len seems to range between 8kB and about 128kB, but still a lot of 8kB. That's from a very brief look, I can try to get more precise numbers, but there seem to be enough 8kB ones to support your theory.

Regards,
	Jeff Davis
On 2025-Sep-02, Jeff Davis wrote:
> The good news is that io_uring seemed to solve the problem. Unfortunately, that's platform-specific, so it can't be the default.

Why not? We have had wal_sync_method with a platform-dependent default for a very long time.

-- 
Álvaro Herrera        PostgreSQL Developer — https://www.EnterpriseDB.com/
Hi,

On 2025-09-02 23:47:48 -0700, Jeff Davis wrote:
> Has there already been a discussion about leaving the default as io_method=worker? There was an Open Item for this, which was closed as "Won't Fix", but the links don't explain why as far as I can see. I tested a concurrent scan-heavy workload (see below) where the data fits in memory, and "worker" seems to be 30% slower than "sync" with default settings.
>
> Test summary: 32 connections each perform repeated sequential scans. Each connection scans a different 1GB partition of the same table. I used partitioning and a predicate to make it easier to script in pgbench.

32 parallel seq scans of large relations, with default shared buffers, fully cached in the OS page cache, seems like a pretty absurd workload. That's not to say we shouldn't spend some effort to avoid regressions for it, but it also doesn't seem to be worth focusing all that much on it. Or is there a real-world scenario this is actually emulating?

I think the regression is not due to anything inherent to worker, but due to pressure on AioWorkerSubmissionQueueLock - at least that's what I'm seeing on an older two-socket machine. It's possible the bottleneck is different on a newer machine (my newer workstation is busy on another benchmark rn).

*If* we actually care about this workload, we can make pgaio_worker_submit_internal() acquire that lock conditionally, and perform the IOs synchronously instead. That seems to help here, sufficiently to make worker the same as sync - although plenty of contention remains, from "the worker side", which can't just acquire the lock conditionally.

But I'm really not sure doing > 30GB/s of repeated reads from the page cache is a particularly useful thing to optimize. I see a lot of unrelated contention, e.g. on the BufferMappingLock - unsurprising, it's a really extreme workload...

If I instead just increase s_b, I get 2x the throughput...

Greetings,

Andres Freund
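To illustrate the idea of acquiring the submission-queue lock conditionally and falling back to synchronous I/O, here is a minimal standalone sketch. It uses a pthread mutex rather than PostgreSQL's LWLocks, and all names (submit_io, enqueue_for_worker, etc.) are hypothetical stand-ins, not the actual pgaio code:

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical stand-ins for the real submission queue and I/O routines. */
    typedef struct IoRequest { int dummy; } IoRequest;

    static pthread_mutex_t submission_queue_lock = PTHREAD_MUTEX_INITIALIZER;

    static bool enqueue_for_worker(IoRequest *req) { (void) req; return true; } /* placeholder */
    static void wake_a_worker(void) {}                                          /* placeholder */
    static void perform_io_synchronously(IoRequest *req) { (void) req; }        /* placeholder */

    /*
     * Submit an I/O: prefer handing it to a worker, but if the queue lock is
     * contended, do the I/O in the submitting process instead of waiting.
     */
    static void
    submit_io(IoRequest *req)
    {
        if (pthread_mutex_trylock(&submission_queue_lock) == 0)
        {
            bool queued = enqueue_for_worker(req);

            pthread_mutex_unlock(&submission_queue_lock);

            if (queued)
            {
                wake_a_worker();
                return;
            }
            /* queue full: fall through to the synchronous path */
        }

        /* Lock contended (or queue full): do the I/O ourselves. */
        perform_io_synchronously(req);
    }

    int main(void)
    {
        IoRequest req = {0};
        submit_io(&req);
        puts("submitted");
        return 0;
    }

In the actual server, LWLockConditionalAcquire() would play the role that pthread_mutex_trylock() plays here; and as noted above, the worker (dequeue) side can't use the same trick, so some contention would remain.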
Hi,

On 2025-09-03 17:53:55 +0200, Álvaro Herrera wrote:
> On 2025-Sep-02, Jeff Davis wrote:
> > The good news is that io_uring seemed to solve the problem. Unfortunately, that's platform-specific, so it can't be the default.
>
> Why not? We have had wal_sync_method with a platform-dependent default for a very long time.

At least for now there are other reasons:

1) io_uring needs a higher ulimit -n, because we need to create an io_uring instance for each proc entry, in postmaster. In the future, we might want to increase the soft ulimit -n, but we aren't doing that yet. In older kernels the number of io_uring memory mappings also can pose a performance issue at high connection rates.

2) Sometimes worker is faster than io_uring, specifically when checksums are enabled. Workers can offload the checksum computations, io_uring can't.

3) We can't, at build time, know whether io_uring is actually available at runtime. io_uring can be disabled with a runtime sysctl. We could add a dynamic fallback to worker, but we don't have that yet.

Anyway, the issue with worker in this case isn't inherent, it's just a bit of lock contention in an extreme workload...

Greetings,

Andres Freund
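Point 1) is easy to see from userspace: each io_uring instance consumes a file descriptor (plus kernel mappings), so creating one per backend slot quickly collides with the common soft limit of 1024 descriptors. A minimal standalone sketch, assuming liburing is installed and the instance count (2000) is just an arbitrary stand-in for "one per proc entry":

    /* build with: cc uring_fds.c -luring */
    #include <liburing.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rlimit rl;
        getrlimit(RLIMIT_NOFILE, &rl);
        printf("soft RLIMIT_NOFILE: %llu\n", (unsigned long long) rl.rlim_cur);

        /* hypothetical count standing in for max_connections + workers */
        int ninstances = 2000;
        struct io_uring *rings = calloc(ninstances, sizeof(*rings));

        for (int i = 0; i < ninstances; i++)
        {
            /* each successful init consumes a file descriptor */
            int ret = io_uring_queue_init(64, &rings[i], 0);
            if (ret < 0)
            {
                printf("io_uring_queue_init failed at instance %d: %s\n",
                       i, strerror(-ret));
                break;
            }
        }
        return 0;
    }

On a default-configured system the loop typically fails with EMFILE well before 2000 instances, which is why the server's descriptor limits matter for a default of io_method=io_uring.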
On Wed, 2025-09-03 at 11:55 -0400, Andres Freund wrote:
> 32 parallel seq scans of large relations, with default shared buffers, fully cached in the OS page cache, seems like a pretty absurd workload.

It's the default settings, and users often just keep going with the defaults as long as it works, giving little thought to any kind of tuning or optimization until they hit a wall. Fully cached data is common, as are scan-heavy workloads. Calling it "absurd" is an exaggeration.

> That's not to say we shouldn't spend some effort to avoid regressions for it, but it also doesn't seem to be worth focusing all that much on it.

Fair, but we should acknowledge the places where the new defaults do better vs worse, and provide some guidance on what to look for and how to tune it. We should also not be in too much of a rush to get rid of "sync" mode until we have a better idea about where the tradeoffs are.

> Or is there a real-world scenario this is actually emulating?

This test was my first try at reproducing a smaller (but still noticeable) regression seen on a more realistic benchmark. I'm not 100% sure whether I reproduced the same effect or a different one, but I don't think we should dismiss it so quickly.

> *If* we actually care about this workload, we can make pgaio_worker_submit_internal() acquire that lock conditionally, and perform the IOs synchronously instead.

I like the idea of some kind of fallback for multiple reasons. I noticed that if I set io_workers=1, and then I SIGSTOP that worker, then sequential scans make no progress at all until I send SIGCONT. A fallback to synchronous sounds more robust, and more similar to what we do with walwriter and bgwriter. (That may be 19 material, though.)

> But I'm really not sure doing > 30GB/s of repeated reads from the page cache is a particularly useful thing to optimize.

A long time ago, the expectation was that Postgres might be running on a machine along with other software, and perhaps many instances of Postgres on the same machine. In that case, low shared_buffers compared with the overall system memory makes sense, which would cause a lot of back-and-forth into shared buffers. That was also the era of magnetic disks, where such memory copies seemed almost free by comparison -- perhaps we just don't care about that case any more?

> If I instead just increase s_b, I get 2x the throughput...

Increase to what? I tried a number of settings. Obviously >32GB makes it a non-issue because everything is cached. Values between 128MB and 32GB didn't seem to help, and were in some cases lower, but I didn't look into why yet. It might have something to do with crowding out the page cache.

Regards,
	Jeff Davis
Hi,

On 2025-09-03 11:50:05 -0700, Jeff Davis wrote:
> On Wed, 2025-09-03 at 11:55 -0400, Andres Freund wrote:
> > 32 parallel seq scans of large relations, with default shared buffers, fully cached in the OS page cache, seems like a pretty absurd workload.
>
> It's the default settings, and users often just keep going with the defaults as long as it works, giving little thought to any kind of tuning or optimization until they hit a wall. Fully cached data is common, as are scan-heavy workloads. Calling it "absurd" is an exaggeration.

I agree that an unconfigured postgres is a common thing. But I don't think that means that doing 30GB/s of IO from 32 backends is something that workloads using an untuned postgres will do. I cannot come up with any halfway realistic scenarios where you'd do anywhere near this much cached IO.

As soon as you don't do quite as much IO, the entire bottleneck the test exercises *completely* vanishes. There's like 20-30 instructions covered by that lwlock. You need a *lot* of invocations for that to become a bottleneck.

> > That's not to say we shouldn't spend some effort to avoid regressions for it, but it also doesn't seem to be worth focusing all that much on it.
>
> Fair, but we should acknowledge the places where the new defaults do better vs worse, and provide some guidance on what to look for and how to tune it.

If you find a describable realistic workload that regresses, I'm all ears. Both from the perspective of fixing regressions and documenting how to choose the correct io_method. But providing tuning advice for a very extreme workload that only conceivably exists with an untuned postgres doesn't seem likely to help anybody.

> We should also not be in too much of a rush to get rid of "sync" mode until we have a better idea about where the tradeoffs are.

Nobody is in a rush to do so, from what I can tell? I don't know why you're so focused on this.

> > Or is there a real-world scenario this is actually emulating?
>
> This test was my first try at reproducing a smaller (but still noticeable) regression seen on a more realistic benchmark. I'm not 100% sure whether I reproduced the same effect or a different one, but I don't think we should dismiss it so quickly.

From what I can tell here, the regression that this benchmark observes is entirely conditional on *extremely* high volumes of IO being issued by a lot of backends. What was the workload that hit the smaller regression?

> > *If* we actually care about this workload, we can make pgaio_worker_submit_internal() acquire that lock conditionally, and perform the IOs synchronously instead.
>
> I like the idea of some kind of fallback for multiple reasons. I noticed that if I set io_workers=1, and then I SIGSTOP that worker, then sequential scans make no progress at all until I send SIGCONT. A fallback to synchronous sounds more robust, and more similar to what we do with walwriter and bgwriter. (That may be 19 material, though.)

I don't think what I'm proposing would really change anything in such a scenario, unless you were unlucky enough to send the SIGSTOP in the very short window in which the worker held the lwlock. We already fall back to synchronous IO when the queue towards the IO workers is full. But I don't see a way to identify the case of "the queue is not full, but the worker isn't making enough progress".

> > But I'm really not sure doing > 30GB/s of repeated reads from the page cache is a particularly useful thing to optimize.
>
> A long time ago, the expectation was that Postgres might be running on a machine along with other software, and perhaps many instances of Postgres on the same machine. In that case, low shared_buffers compared with the overall system memory makes sense, which would cause a lot of back-and-forth into shared buffers. That was also the era of magnetic disks, where such memory copies seemed almost free by comparison -- perhaps we just don't care about that case any more?

I think we should still care about performing reasonably when data primarily is cached in the OS page cache. I just don't think the benchmark is a good example of such workloads - if you need to analyze > 30GB/s of data, while the data is also cached, you can configure postgres at least somewhat reasonably. Even if you were to somehow analyze this much data, any realistic workload will actually do more than just count(*), which means AioWorkerSubmissionQueueLock won't be the bottleneck.

> > If I instead just increase s_b, I get 2x the throughput...
>
> Increase to what? I tried a number of settings. Obviously >32GB makes it a non-issue because everything is cached. Values between 128MB and 32GB didn't seem to help, and were in some cases lower, but I didn't look into why yet. It might have something to do with crowding out the page cache.

I meant above ~32GB.

Greetings,

Andres Freund
On Thu, Sep 4, 2025 at 6:50 AM Jeff Davis <pgsql@j-davis.com> wrote:
> I like the idea of some kind of fallback for multiple reasons. I noticed that if I set io_workers=1, and then I SIGSTOP that worker, then sequential scans make no progress at all until I send SIGCONT. A fallback to synchronous sounds more robust, and more similar to what we do with walwriter and bgwriter. (That may be 19 material, though.)

This seems like a non-problem. Robustness against SIGSTOP of random backends is not a project goal or reasonable goal, is it? You can SIGSTOP a backend doing IO in any historical release, possibly blocking other backends too based on locks etc etc. That said, it is quite reasonable to ask why it doesn't just start a new worker, and that's just code maturity:

* I have a patch basically ready to commit for v19 (CF #5913) that would automatically add more workers if you did that. But even then you could be unlucky and SIGSTOP a worker while it holds the submission queue lock.

* I also had experimental versions that use a lock-free queue, but it didn't seem necessary given how hard it is to measure meaningful lock contention so far; I guess it must be easier on a NUMA system, and one might wonder about per-NUMA-node queues, but that also feels a bit questionable because if you had a very high end system you'd probably be looking into better tuning including io_method=io_uring.
On Thu, 2025-09-04 at 10:22 +1200, Thomas Munro wrote:
> This seems like a non-problem. Robustness against SIGSTOP of random backends is not a project goal or reasonable goal, is it?

I think you misinterpreted -- I wasn't suggesting that sending SIGSTOP is a use case. I just meant that it showed a strong dependence between the processes, and that it might be more robust to have some kind of fallback.

> * I have a patch basically ready to commit for v19 (CF #5913) that would automatically add more workers if you did that. But even then you could be unlucky and SIGSTOP a worker while it holds the submission queue lock.

Making it adaptable sounds like a nice improvement, especially since we don't have good guidance in the docs about how to tune it.

> * I also had experimental versions that use a lock-free queue, but it didn't seem necessary given how hard it is to measure meaningful lock contention so far;

Andres suggested that the case I brought up at the top of the thread is due to lock contention, so a lock-free queue also sounds like a potential improvement. If the code is working and can be applied to REL_18_STABLE, then I can try it.

Regards,
	Jeff Davis
On Thu, Sep 4, 2025 at 10:47 AM Jeff Davis <pgsql@j-davis.com> wrote:
> Andres suggested that the case I brought up at the top of the thread is due to lock contention, so a lock-free queue also sounds like a potential improvement. If the code is working and can be applied to REL_18_STABLE, then I can try it.

Not easily, unfortunately, that was a while back and I'd need to find all the bits and rebase. There is a preliminary step in CF #5913 [1]: it splits AioWorkerSubmissionQueueLock's duties into AioWorkerControlLock for pool management and AioWorkerSubmissionQueueLock for the queue itself. I had a couple of different attempts at generic lock-free queues, one from a paper and one loosely based on FreeBSD's buf_ring.h, but I'm not at all sure it's actually any good and I found it quite tricky to integrate with the idle bitmap scheme, which so far seems to be valuable...

Hmm, it might be good to try some simpler things first: splitting AioWorkerSubmissionQueueLock into a producer lock (backends enqueuing) and a consumer lock (workers dequeuing) must surely help a lot. I'll try to come up with some simple patches to evaluate...

[1] https://www.postgresql.org/message-id/flat/CA%2BhUKGJLe-0E-wZ-is78CEHhjbC%3DihMVCQLoN1dmD-j05s9qRg%40mail.gmail.com#c3eb6c873b7218d5c3d5164244f7245d
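For illustration of the producer/consumer split, here is a minimal standalone sketch, a toy pthread version rather than anything from the CF entry, with hypothetical names throughout. Backends take only the producer lock to enqueue at the head; workers take only the consumer lock to dequeue at the tail, so the two sides stop serializing against each other:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define QUEUE_SIZE 64                               /* hypothetical capacity */

    typedef struct { int io_handle; } QueueEntry;       /* stand-in for an AIO handle */

    static QueueEntry ring[QUEUE_SIZE];
    static _Atomic unsigned head;                       /* next slot to fill (producers) */
    static _Atomic unsigned tail;                       /* next slot to drain (consumers) */

    static pthread_mutex_t producer_lock = PTHREAD_MUTEX_INITIALIZER;  /* backends enqueuing */
    static pthread_mutex_t consumer_lock = PTHREAD_MUTEX_INITIALIZER;  /* workers dequeuing */

    /* Backend side: returns false if full (caller would fall back to synchronous IO). */
    static bool
    enqueue(QueueEntry e)
    {
        bool ok = false;

        pthread_mutex_lock(&producer_lock);
        unsigned h = atomic_load(&head);
        if (h - atomic_load(&tail) < QUEUE_SIZE)        /* not full */
        {
            ring[h % QUEUE_SIZE] = e;
            atomic_store(&head, h + 1);                 /* publish after the slot is written */
            ok = true;
        }
        pthread_mutex_unlock(&producer_lock);
        return ok;
    }

    /* Worker side: returns false if empty. */
    static bool
    dequeue(QueueEntry *e)
    {
        bool ok = false;

        pthread_mutex_lock(&consumer_lock);
        unsigned t = atomic_load(&tail);
        if (atomic_load(&head) != t)                    /* not empty */
        {
            *e = ring[t % QUEUE_SIZE];
            atomic_store(&tail, t + 1);                 /* slot may now be reused */
            ok = true;
        }
        pthread_mutex_unlock(&consumer_lock);
        return ok;
    }

    int main(void)
    {
        QueueEntry e = { .io_handle = 42 }, out;
        enqueue(e);
        if (dequeue(&out))
            printf("dequeued handle %d\n", out.io_handle);
        return 0;
    }

The atomic head/tail counters are what let the two sides share the ring without sharing a lock. A real version would also need the worker wakeup path (latches) and the idle-worker bitmap mentioned above, which this sketch deliberately omits.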
On Wed, 2025-09-03 at 11:55 -0400, Andres Freund wrote:
> I think the regression is not due to anything inherent to worker, but due to pressure on AioWorkerSubmissionQueueLock - at least that's what I'm seeing on an older two-socket machine. It's possible the bottleneck is different on a newer machine (my newer workstation is busy on another benchmark rn).

I believe what's happening is that parallelism of the IO completion work (e.g. checksum verification) is reduced. In worker mode, the completion work is happening on the io workers (of which there are 3); while in sync mode the completion work is happening in the backends (of which there are 32). There may be lock contention too, but I don't think that's the primary issue.

I attached a test patch for illustration. It simplifies the code inside the LWLock to enqueue/dequeue only, and simplifies and reduces the wakeups by doing pseudo-random wakeups only when enqueuing. Reducing the wakeups should reduce the number of signals generated without hurting my case, because the workers are never idle. And reducing the instructions while holding the LWLock should reduce lock contention. But the patch barely makes a difference: still around 24tps.

What *does* make a difference is changing io_worker_queue_size. A lower value of 16 effectively starves the workers of work to do, and I get a speedup to about 28tps. A higher value of 512 gives the workers more chance to issue the IOs -- and more responsibility to complete them -- and it drops to 17tps. Furthermore, while the test is running, the io workers are constantly at 100% (mostly verifying checksums) and the backends are at 50% (20% when io_worker_queue_size=512).

As an aside, I'm building with meson using -Dc_args="-msse4.2 -Wtype-limits -Werror=missing-braces". But I notice that the meson build doesn't seem to use -funroll-loops or -ftree-vectorize when building checksums.c. Is that intentional? If not, perhaps slower checksum calculations explain my results.

Regards,
	Jeff Davis
On Fri, 2025-09-05 at 13:25 -0700, Jeff Davis wrote:
> As an aside, I'm building with meson using -Dc_args="-msse4.2 -Wtype-limits -Werror=missing-braces". But I notice that the meson build doesn't seem to use -funroll-loops or -ftree-vectorize when building checksums.c. Is that intentional? If not, perhaps slower checksum calculations explain my results.

Attached a patch. I'm not sure if it's the right way to do things, but the resulting compiler command seems right to me.

It doesn't affect my tests in this thread much, but it seems best to be consistent with autoconf.

Regards,
	Jeff Davis
On Sat, Sep 6, 2025 at 8:26 AM Jeff Davis <pgsql@j-davis.com> wrote:
> On Wed, 2025-09-03 at 11:55 -0400, Andres Freund wrote:
> > I think the regression is not due to anything inherent to worker, but due to pressure on AioWorkerSubmissionQueueLock - at least that's what I'm seeing on an older two-socket machine. It's possible the bottleneck is different on a newer machine (my newer workstation is busy on another benchmark rn).
>
> I believe what's happening is that parallelism of the IO completion work (e.g. checksum verification) is reduced. In worker mode, the completion work is happening on the io workers (of which there are 3); while in sync mode the completion work is happening in the backends (of which there are 32).

Makes sense.

Some raw thoughts on this topic, and how we got here: This type of extreme workload, namely not doing any physical I/O, just copying the same data from the kernel page cache to the buffer pool over and over again, is also the workload where io_method=worker can beat io_method=io_uring (and other native I/O methods I've prototyped), assuming io_workers is increased to a level that keeps up. That does lead to a couple of ideas such as handing completions off to a new generic task worker pool, even for native I/O methods. At a guess, you'd probably only want to do that when extra kernel completion notifications are drained ahead of the one you need right now, so that it only kicks in when you're reading ahead further than you strictly need right now (that's just a guess about the cost of IPC and buffer header cache line ping-pong effects vs the cost of running the checksums yourself, not properly investigated).

I also experimented with going the other way: having I/O workers do *just* the preadv() and then hand completion duties back to the backend that waits for the IO. I have patches for that, creating a per-backend completion queue that I/O methods could opt to use (I experimented with that model as part of my ongoing struggle with multi-process PostgreSQL + native AIO, which fundamentally assumes threads on all other OSes, but it's trivial to enable it for io_method=worker too). But that leveled the playing field in the wrong direction! This I/O worker property is actually good when tuned correctly so that you get more rather than less checksum computation concurrency, so I abandoned that experiment. Of course if we did that, in your case here you'd get all 32 backends computing checksums no matter how many I/O workers are running.

The original motivation for workers processing completions rather than more directly simulating io_uring (which was the original and only implementation when I got involved in this) was the sheer simplicity of it, and this choice didn't matter much before the checksums-by-default change went in just a little ahead of basic AIO support.

> There may be lock contention too, but I don't think that's the primary issue.

Interesting that it shows up so clearly for Andres but not for you. I've made a note to try out those lock improvement patches I mentioned on a dual socket box once I'm done with some other work that is keeping me busy...

> I attached a test patch for illustration. It simplifies the code inside the LWLock to enqueue/dequeue only, and simplifies and reduces the wakeups by doing pseudo-random wakeups only when enqueuing. Reducing the wakeups should reduce the number of signals generated without hurting my case, because the workers are never idle. And reducing the instructions while holding the LWLock should reduce lock contention. But the patch barely makes a difference: still around 24tps.

BTW There are already a couple of levels of signal suppression: if workers are not idle then we don't set any latches, and even if we did, SetLatch() only sends signals when the recipient is actually waiting, which shouldn't happen when the pool is completely busy.

+ nextWakeupWorker = (nextWakeupWorker + 1) % io_workers;

FWIW, I experimented extensively with wakeup distribution schemes like that ^ and others, and found that they performed significantly worse in terms of avg(latency) and stddev(latency) for partially idle worker pools, as discussed a bit in that auto-tuning thread. It seems to be generally better to keep a small number of workers busy.

> What *does* make a difference is changing io_worker_queue_size. A lower value of 16 effectively starves the workers of work to do, and I get a speedup to about 28tps. A higher value of 512 gives the workers more chance to issue the IOs -- and more responsibility to complete them -- and it drops to 17tps. Furthermore, while the test is running, the io workers are constantly at 100% (mostly verifying checksums) and the backends are at 50% (20% when io_worker_queue_size=512).

I would value your feedback and this type of analysis on the thread about automatic tuning for v19. It should find the setting that keeps your executor backends 100% busy, and some number of low-numbered I/O workers 100% busy (only because here you're doing no physical I/O that would cause them to sleep in D state), but leave some spare worker capacity to allow for recent variation. It should also keep the queue from overflowing and falling back to synchronous I/O. Keeping the queue size low is important for latency, because we don't want new queries to get stuck behind a huge queue: each new IO should always be processed ASAP if the storage can keep up. That's kinda why a queue size of 64 is probably "enough", since 32 workers is the highest pool size and it doesn't like there to be more than one extra queued IO per currently running worker (which is debatable, see discussion of the simplicity of that policy vs theoretical arguments about some unknown variables). Though I do think there are good reasons to make the queue size user-configurable for developers only, and I'm planning to propose that and more as a debug_ setting, since it turns out to be quite useful for testing theories about higher level rate control systems, discussed a bit in that thread, and hopefully with some more patches a bit later...

You might also be interested in the phenomenon discussed in the index prefetching thread, where concurrent reads of the *same* block from different backends currently force a wait, which can be avoided with a patch he posted. In that thread it was happening due to index scans trying to stream the same block repeatedly in one stream, but it's also known to happen in torture tests much like yours where different backends step on each other's toes; on the other hand you're reporting 100% CPU for backends, so I guess it's not happening much... hmm, perhaps you disabled synchronized scans, or something about your test conditions avoids that?

> As an aside, I'm building with meson using -Dc_args="-msse4.2 -Wtype-limits -Werror=missing-braces". But I notice that the meson build doesn't seem to use -funroll-loops or -ftree-vectorize when building checksums.c. Is that intentional? If not, perhaps slower checksum calculations explain my results.

Ah, good discovery.
On Mon, Sep 8, 2025 at 2:39 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> You might also be interested in the phenomenon discussed in the index prefetching thread, where concurrent reads of the *same* block from different backends currently force a wait, which can be avoided with a patch he posted. In that thread it was happening due to index scans

Sorry, bad edit: I meant a patch that Andres posted.
Hi,

On 2025-09-05 16:07:09 -0700, Jeff Davis wrote:
> On Fri, 2025-09-05 at 13:25 -0700, Jeff Davis wrote:
> > As an aside, I'm building with meson using -Dc_args="-msse4.2 -Wtype-limits -Werror=missing-braces". But I notice that the meson build doesn't seem to use -funroll-loops or -ftree-vectorize when building checksums.c. Is that intentional? If not, perhaps slower checksum calculations explain my results.

Yes, that's an oversight.

> Attached a patch. I'm not sure if it's the right way to do things, but the resulting compiler command seems right to me.
>
> It doesn't affect my tests in this thread much, but it seems best to be consistent with autoconf.

Yes, we should do this - the patch looks good to me.

Greetings,

Andres Freund
On Mon, 2025-09-08 at 14:39 +1200, Thomas Munro wrote:
> Some raw thoughts on this topic, and how we got here: This type of extreme workload, namely not doing any physical I/O, just copying the same data from the kernel page cache to the buffer pool over and over again,

Isn't that one of the major selling points of AIO? It does "real readahead" from kernel buffers into PG buffers ahead of time, so that the backend doesn't have to do the memcpy and checksum calculation.

The benefit will be even larger when AIO eventually enables effective Direct IO, so that shared_buffers can be a larger share of system memory, and we don't need to move back-and-forth between kernel buffers and PG buffers (and recalculate the checksum).

The only problem right now is that it doesn't (yet) work great when the concurrency is higher, because:

  (a) we don't adapt well to saturated workers; and
  (b) there's lock contention on the queue if there are more workers

and I believe both of those problems can be solved in 19.

> is also the workload where io_method=worker can beat io_method=io_uring (and other native I/O methods I've prototyped), assuming io_workers is increased to a level that keeps up.

Right, when the workers are not saturated, they *increase* the parallelism because there are more processes doing the work (unless you run into lock contention).

> didn't matter much before the checksums-by-default change went in just a little ahead of basic AIO support.

Yeah, the trade-offs are much different when checksums are on vs off.

> Interesting that it shows up so clearly for Andres but not for you.

When I increase the io_worker count, then it does seem to be limited by lock contention (at least I'm seeing evidence with -DLWLOCK_STATS). I suppose my case is just below some threshold.

> BTW There are already a couple of levels of signal suppression: if workers are not idle then we don't set any latches, and even if we did, SetLatch() only sends signals when the recipient is actually waiting, which shouldn't happen when the pool is completely busy.

Oh, good to know.

> + nextWakeupWorker = (nextWakeupWorker + 1) % io_workers;
>
> FWIW, I experimented extensively with wakeup distribution schemes

My patch was really just to try to test the two hypotheses; I wasn't proposing it. But I was curious whether a simpler scheme might be just as good, and it looks like you already considered and rejected it.

> I would value your feedback and this type of analysis on the thread about automatic tuning for v19.

OK, I will continue the tuning discussion there.

Regarding $SUBJECT: it looks like others are just fine with worker mode as the default in 18. I have added discussion links to the "no change" entry in the Open Items list. I think we'll probably see some of the effects (worker saturation or lock contention) from my test case appear in real workloads, but affected users can change to sync mode until we sort these things out in 19.

Regards,
	Jeff Davis
On Mon, Sep 08, 2025 at 01:43:45PM -0400, Andres Freund wrote:
> On 2025-09-05 16:07:09 -0700, Jeff Davis wrote:
>> On Fri, 2025-09-05 at 13:25 -0700, Jeff Davis wrote:
>>> As an aside, I'm building with meson using -Dc_args="-msse4.2 -Wtype-limits -Werror=missing-braces". But I notice that the meson build doesn't seem to use -funroll-loops or -ftree-vectorize when building checksums.c. Is that intentional? If not, perhaps slower checksum calculations explain my results.
>
> Yes, that's an oversight.
>
>> Attached a patch. I'm not sure if it's the right way to do things, but the resulting compiler command seems right to me.
>>
>> It doesn't affect my tests in this thread much, but it seems best to be consistent with autoconf.
>
> Yes, we should do this - the patch looks good to me.

I think we need to do something similar for numeric.c.

diff --git a/src/backend/utils/adt/meson.build b/src/backend/utils/adt/meson.build
index dac372c3bea..7d0688ddb87 100644
--- a/src/backend/utils/adt/meson.build
+++ b/src/backend/utils/adt/meson.build
@@ -1,5 +1,14 @@
 # Copyright (c) 2022-2025, PostgreSQL Global Development Group
 
+numeric_backend_lib = static_library('numeric_backend_lib',
+  'numeric.c',
+  dependencies: backend_build_deps,
+  kwargs: internal_lib_args,
+  c_args: vectorize_cflags,
+)
+
+backend_link_with += numeric_backend_lib
+
 backend_sources += files(
   'acl.c',
   'amutils.c',
@@ -61,7 +70,6 @@ backend_sources += files(
   'network_gist.c',
   'network_selfuncs.c',
   'network_spgist.c',
-  'numeric.c',
   'numutils.c',
   'oid.c',
   'oracle_compat.c',

-- 
nathan
On Mon, 2025-09-08 at 15:14 -0500, Nathan Bossart wrote:
> I think we need to do something similar for numeric.c.

Good catch. Patch LGTM.

Regards,
	Jeff Davis
Hi,

On 2025-09-08 12:08:10 -0700, Jeff Davis wrote:
> On Mon, 2025-09-08 at 14:39 +1200, Thomas Munro wrote:
> > Some raw thoughts on this topic, and how we got here: This type of extreme workload, namely not doing any physical I/O, just copying the same data from the kernel page cache to the buffer pool over and over again,
>
> Isn't that one of the major selling points of AIO? It does "real readahead" from kernel buffers into PG buffers ahead of time, so that the backend doesn't have to do the memcpy and checksum calculation.

I don't think accelerating copying from the pagecache into postgres shared buffers really is a goal of AIO. It can be a nice side-effect, but not more.

I'd say the absolute number one goal of AIO is to reduce the amount of synchronous waiting for IO in the "foreground". An important, but secondary, goal is to offload work from the CPU, via DMA - but that fundamentally only can work with DIO.

> The only problem right now is that it doesn't (yet) work great when the concurrency is higher, because:
>
> (a) we don't adapt well to saturated workers; and
> (b) there's lock contention on the queue if there are more workers

Afaict you've not explained in what even remotely plausible scenario you'd have > 30GB/s of reads from the page cache. I continue to maintain that the workload presented does not exist in the real world.

Greetings,

Andres Freund
Hi,

On 2025-09-08 16:45:52 -0400, Andres Freund wrote:
> On 2025-09-08 12:08:10 -0700, Jeff Davis wrote:
> > On Mon, 2025-09-08 at 14:39 +1200, Thomas Munro wrote:
> > > Some raw thoughts on this topic, and how we got here: This type of extreme workload, namely not doing any physical I/O, just copying the same data from the kernel page cache to the buffer pool over and over again,
> >
> > Isn't that one of the major selling points of AIO? It does "real readahead" from kernel buffers into PG buffers ahead of time, so that the backend doesn't have to do the memcpy and checksum calculation.
>
> I don't think accelerating copying from the pagecache into postgres shared buffers really is a goal of AIO.

I forgot an addendum: In fact, if there were a sufficiently cheap way to avoid using AIO when data is in the page cache, I'm fairly sure we'd want to use that. However, there is not, from what I know (both fincore() and RWF_NOWAIT are too expensive). The maximum gain from using AIO when the data is already in the page cache is just not very big, and it can cause slowdowns due to IPC overhead etc.

Greetings,

Andres Freund
On Mon, 2025-09-08 at 16:45 -0400, Andres Freund wrote:
> I don't think accelerating copying from the pagecache into postgres shared buffers really is a goal of AIO. It can be a nice side-effect, but not more.

Interesting. I thought that was part of the design.

> Afaict you've not explained in what even remotely plausible scenario you'd have > 30GB/s of reads from the page cache. I continue to maintain that the workload presented does not exist in the real world.

For now, let's just consider it a test case that stressed a couple of subsystems and found some interesting effects. I think it will be helpful context for potential improvements in 19. If I find other interesting cases, I'll post them.

Regards,
	Jeff Davis
On Mon, 2025-09-08 at 17:02 -0400, Andres Freund wrote:
> I forgot an addendum: In fact, if there were a sufficiently cheap way to avoid using AIO when data is in the page cache, I'm fairly sure we'd want to use that. However, there is not, from what I know (both fincore() and RWF_NOWAIT are too expensive). The maximum gain from using AIO when the data is already in the page cache is just not very big, and it can cause slowdowns due to IPC overhead etc.

Also interesting. It might be worth mentioning in the docs and/or README.

Regards,
	Jeff Davis
On Tue, Sep 9, 2025 at 9:02 AM Andres Freund <andres@anarazel.de> wrote:
> On 2025-09-08 16:45:52 -0400, Andres Freund wrote:
> > I don't think accelerating copying from the pagecache into postgres shared buffers really is a goal of AIO.
>
> I forgot an addendum: In fact, if there were a sufficiently cheap way to avoid using AIO when data is in the page cache, I'm fairly sure we'd want to use that. However, there is not, from what I know (both fincore() and RWF_NOWAIT are too expensive). The maximum gain from using AIO when the data is already in the page cache is just not very big, and it can cause slowdowns due to IPC overhead etc.

FWIW, I briefly played around with RWF_NOWAIT in pre-AIO streaming work: I tried preadv2(RWF_NOWAIT) before issuing WILLNEED advice. I cooked up some experimental heuristics to do it only when it seemed likely to pay off*. I also played with probing to find the frontier where fadvise had "completed", while writing toy implementations of some of Melanie's feedback control ideas. It was awful. Fun systems programming puzzles, but it felt like jogging with one's shoelaces tied together compared to proper AIO interfaces. Note that io_uring already has vaguely similar behaviour internally: see IOSQE_ASYNC heuristics in method_io_uring.c and man io_uring_sqe_set_flags.

In the new AIO world, I therefore assume we'd only be talking about a potential path that could skip some overheads for io_method=worker/sync with a primed page cache, and that seems to have some fundamental issues: (1) AFAIK the plan is to drop io_method=sync soon, it's only a temporary be-more-like-v17 mitigation in case of unforeseen problems or in case we decided not to launch with worker by default, and this thread has (re-)concluded that we should stick with worker, (2) preadv2() is Linux-only and I'm not aware of a similar "non-blocking file I/O" interface on any other system**, and yet io_method=worker is primarily intended as a portable fallback for systems lacking a better native option, and (3) even though it should win for Jeff's test case by skipping workers entirely if an initial RWF_NOWAIT attempt succeeds, you could presumably change some parameters and make it lose (number of backends vs number of I/O workers performing copyout, cf in_flight_before > 4 in io_uring code, and performing checksums as discussed).

Still, it's interesting to contemplate the two independent points of variation: concurrency of page cache copyout (IOSQE_ASYNC, magic number 4, what other potential native I/O methods do here) and concurrency of checksum computation (potential for worker pool handoff).

*One scheme kept stats in a per-relation shm object. That was abandoned per the above reasoning, and, digressing a bit here, I'm currently much more interested in tracking facts about our own buffer pool contents, to inform streaming decisions and skip the buffer mapping table in many common cases. Digressing even further, my first priority for per-relation shm objects is not even that, it's to improve the fsync hand-off queue: (1) we probably shouldn't trust Linux to remember relation sizes until we've fsync'd, and (2) Andres's asynchronous buffer write project wants a no-throw guarantee when enqueuing in a critical section.

**Anyone know of one? This is basically a really ancient and deliberate Unix design decision to hide I/O asynchrony and buffering from user space completely, unlike pretty much every other OS, shining through. (I've thought about proposing it for FreeBSD as a programming exercise but I'd rather spend my spare time making its AIO better.)
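For concreteness, here is a minimal standalone Linux-only sketch (not from the thread, and not the experimental heuristics described above) of the preadv2(RWF_NOWAIT) probe idea: attempt the read without blocking, and if the data isn't already in the page cache, fall back to a slower path, shown here as a WILLNEED hint plus a normal blocking read:

    /* Linux-only: preadv2() and RWF_NOWAIT need a reasonably recent kernel/glibc. */
    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/uio.h>
    #include <unistd.h>

    #define BLCKSZ 8192   /* stand-in for a PostgreSQL block */

    int main(int argc, char **argv)
    {
        if (argc != 2)
        {
            fprintf(stderr, "usage: %s <file>\n", argv[0]);
            return 1;
        }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0)
        {
            perror("open");
            return 1;
        }

        char buf[BLCKSZ];
        struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };

        /* Try to satisfy the read from the page cache without blocking. */
        ssize_t n = preadv2(fd, &iov, 1, 0, RWF_NOWAIT);
        if (n == (ssize_t) sizeof(buf))
        {
            printf("block was cached, read completed without blocking\n");
        }
        else if (n >= 0 || errno == EAGAIN)
        {
            /*
             * Not fully satisfied from cache: fall back to a slower path.
             * Here we just hint the kernel and do a normal blocking read.
             */
            posix_fadvise(fd, 0, sizeof(buf), POSIX_FADV_WILLNEED);
            n = pread(fd, buf, sizeof(buf), 0);
            printf("fell back to blocking read, got %zd bytes\n", n);
        }
        else
        {
            perror("preadv2");
        }

        close(fd);
        return 0;
    }

As the message above notes, the cost of issuing this probe on every read is exactly what made the approach unattractive compared to proper AIO interfaces.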
On Tue, Sep 9, 2025 at 1:59 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> (3) even though it should win for Jeff's test case by skipping workers entirely if an initial RWF_NOWAIT attempt succeeds, you could presumably change some parameters and make it lose

... not to mention that with v19 patches even that effect should go away when it automatically scales the worker pool to the perfect™ size.
On Mon, Sep 08, 2025 at 01:35:02PM -0700, Jeff Davis wrote:
> On Mon, 2025-09-08 at 15:14 -0500, Nathan Bossart wrote:
>> I think we need to do something similar for numeric.c.
>
> Good catch. Patch LGTM.

Thanks for reviewing. My first instinct was to consider this for back-patching, but I see that you didn't back-patch commit 9af672b. Is there any technical reason we shouldn't be back-patching these types of meson build script changes?

-- 
nathan
On Tue, 2025-09-09 at 10:39 -0500, Nathan Bossart wrote:
> My first instinct was to consider this for back-patching, but I see that you didn't back-patch commit 9af672b. Is there any technical reason we shouldn't be back-patching these types of meson build script changes?

I think you are right, I can't see any harm in backporting it, and arguably it's a bug: it would be annoying to compare performance across different versions with different build flags.

I backported through 16.

Regards,
	Jeff Davis
On Tue, Sep 09, 2025 at 04:11:49PM -0700, Jeff Davis wrote:
> On Tue, 2025-09-09 at 10:39 -0500, Nathan Bossart wrote:
>> My first instinct was to consider this for back-patching, but I see that you didn't back-patch commit 9af672b. Is there any technical reason we shouldn't be back-patching these types of meson build script changes?
>
> I think you are right, I can't see any harm in backporting it, and arguably it's a bug: it would be annoying to compare performance across different versions with different build flags.
>
> I backported through 16.

Cool. I committed/back-patched the numeric.c fix to v16 as well.

-- 
nathan