Discussion: Re: AIO v2.0
On Fri, Sep 6, 2024 at 9:38 PM Andres Freund <andres@anarazel.de> wrote:
> Hi,
> Attached is the next version of the patchset. (..)
Hi Andres,
Thank you for your admirable persistence on this. Please do not take the following as criticism; it is more a set of questions about patchset v2.1, which I finally found a little time to play with:
0. Doesn't io_uring_get_sqe() in pgaio_uring_submit() (v2.1-0011-aio-Add-io_uring-method.patch) need a return value check? Otherwise we will never know that the SQ is full; perhaps at least an Assert() should check it (a rough sketch follows below). (I understand that right now we only allow up to io_uring_queue_init(io_max_concurrency) entries, but what happens if:
a. previous io_uring_submit() failed for some reason and we do not have free space for SQ?
b. (hypothetical) someday someone will try to make PG multithreaded and the code starts using just one big queue, still without checking for io_uring_get_sqe()?
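For illustration, something along these lines is what I have in mind - an untested sketch, with the surrounding variable names guessed rather than taken from the patch:

    struct io_uring_sqe *sqe;

    sqe = io_uring_get_sqe(&context->io_uring_ring);
    if (sqe == NULL)
    {
        /*
         * liburing returns NULL when the submission queue is full. With the
         * current sizing this should be unreachable, so PANIC (or at least
         * Assert) instead of silently losing the IO.
         */
        elog(PANIC, "io_uring submission queue is unexpectedly full");
    }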
1. In [0] you wrote that io_uring consumes a high number of FDs (dangerously close to RLIMIT_NOFILE). I can attest that many customers use extremely high max_connections (4k-5k, with outliers at 10k in the wild too) - such setups won't even start - and I have a doubt about the user-friendliness impact of this. I'm quite certain it's going to be the same as with pgbouncer, where one is forced to tweak the OS (systemd/pam/limits.conf/etc.), though PG is better off because it tries to preallocate and then close() a lot of FDs, which is safer at runtime. IMVHO even if we consume, say, > 30% of the FDs just for io_uring, max_files_per_process loses its spirit a little and PG starts to lose efficiency too, due to frequent open()/close() calls as the fd cache becomes too small. Tomas also complained about this some time ago in [1].
So maybe it would be good to introduce a couple of sanity checks too (even after setting a higher limit); a rough sketch follows below:
- issue FATAL when io_method = io_uring and max_connections is close to getrlimit(RLIMIT_NOFILE)
- issue a WARNING when io_method = io_uring and we wouldn't have even a real 1k FDs free for handling relation FDs (i.e. detect something like getrlimit(RLIMIT_NOFILE) <= max_connections + max_files_per_process)
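Roughly this is what I mean - purely illustrative, the function name and the 1000-FD headroom are made up:

    #include <sys/resource.h>

    /* hypothetical startup-time check, not taken from the patchset */
    static void
    check_fd_headroom(void)
    {
        struct rlimit rlim;

        if (getrlimit(RLIMIT_NOFILE, &rlim) != 0)
            return;

        if ((rlim_t) MaxConnections >= rlim.rlim_cur)
            ereport(FATAL,
                    (errmsg("RLIMIT_NOFILE (%lu) is too low for max_connections = %d with io_method = io_uring",
                            (unsigned long) rlim.rlim_cur, MaxConnections)));

        if ((rlim_t) MaxConnections + max_files_per_process + 1000 > rlim.rlim_cur)
            ereport(WARNING,
                    (errmsg("RLIMIT_NOFILE (%lu) leaves little headroom for relation file descriptors",
                            (unsigned long) rlim.rlim_cur)));
    }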
2. In pgaio_uring_postmaster_child_init_local() there is "io_uring_queue_init(32,...)" - why 32? :) And there is also a separate io_uring_queue_init(io_max_concurrency), which seems to be derived from AioChooseMaxConccurrency() and can go up to 64?
3. I find having two GUCs with nearly identical names (effective_io_concurrency, io_max_concurrency) confusing. It is clear from the io_uring perspective what io_max_concurrency is about, but I bet having effective_io_concurrency in the mix as well is going to be a little confusing for users (it is to me). Maybe the README.md could elaborate a little on the relation between the two? Or do you plan to remove io_max_concurrency and bind it to effective_io_concurrency in the future? To add more fun, there's MAX_IO_CONCURRENCY nearby in v2.1-0014 too, while the earlier-mentioned AioChooseMaxConccurrency() goes up to just 64.
4. While we are at this, shouldn't the patch rename debug_io_direct to simply io_direct so that GUCs are consistent in terms of naming?
5. It appears that pg_stat_io.reads is not refreshed until the query finishes. While running a query for several minutes with this patchset, I got:
now | reads | read_time
-------------------------------+----------+-----------
2024-11-15 12:09:09.151631+00 | 15004271 | 0
[..]
2024-11-15 12:10:25.241175+00 | 15004271 | 0
2024-11-15 12:10:26.241179+00 | 15004271 | 0
2024-11-15 12:10:27.241139+00 | 18250913 | 0
Or is that how it is supposed to work? Also, pg_stat_io.read_time would be somewhat vague with io_uring/worker, so maybe zero is fine here(?). Otherwise we would have to measure the time spent waiting alone, but that would cost extra instructions for calculating IO times...
6. After playing with some basic measurements - which went fine - I wanted to test simple PostGIS, even just with sequential scans, to look for compatibility issues (AFAIR Thomas Munro at PGConf.EU indicated it as a good testing point), but before that I tried to see what TOAST performance alone looks like with AIO+DIO (debug_io_direct=data). One issue I found is that DIO seems to be unusable until somebody teaches TOAST to use read streams - is that correct? (A sketch of what I mean by that follows after the numbers below.) Maybe I'm doing something wrong, but I haven't seen any TOAST <-> read streams topic:
-- 12MB table , 25GB toast
create table t (id bigint, t text storage external);
insert into t select i::bigint as id, repeat(md5(i::text),4000)::text as r from generate_series(1,200000) s(i);
set max_parallel_workers_per_gather=0;
\timing
-- with cold caches: empty s_b, echo 3 > drop_caches
select sum(length(t)) from t;
master 101897.823 ms (01:41.898)
AIO 99758.399 ms (01:39.758)
AIO+DIO 191479.079 ms (03:11.479)
hotpath was detoast_attr() -> toast_fetch_datum() -> heap_fetch_toast_slice() -> systable_getnext_ordered() -> index_getnext_slot() -> index_fetch_heap() -> heapam_index_fetch_tuple() -> ReadBufferExtended -> AIO code.
So on cold caches DIO gets a 2x slowdown; with a clean s_b and so on:
* a normal heap-data seqscan reaches up to 285MB/s
* but TOAST maxes out at 132MB/s when using io_uring+DIO
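For context, by "teach TOAST to use read streams" I mean something shaped like the block-number-callback API that sequential scans already use - a sketch from memory of the PG 17 read stream interface, with the ToastStreamState struct and the integration point into toast_fetch_datum() left entirely hypothetical (TOAST chunks are located via an index, so the real thing would be more involved):

    /* hypothetical callback feeding toast heap blocks to a read stream */
    static BlockNumber
    toast_stream_next_block(ReadStream *stream, void *callback_private_data,
                            void *per_buffer_data)
    {
        ToastStreamState *state = callback_private_data;    /* made-up struct */

        if (state->next_block >= state->last_block)
            return InvalidBlockNumber;
        return state->next_block++;
    }

    ...
    stream = read_stream_begin_relation(READ_STREAM_SEQUENTIAL, NULL,
                                        toastrel, MAIN_FORKNUM,
                                        toast_stream_next_block,
                                        &state, 0);
    while ((buf = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
    {
        /* ... find and copy the chunk tuples on this page ... */
        ReleaseBuffer(buf);
    }
    read_stream_end(stream);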
Not about patch itself, but questions about related stack functionality:
----------------------------------------------------------------------------------------------------
7. Is pg_stat_aios still on the table or not? (The 2021 AIO patchset had it.) Any hints on how to inspect the real I/O calls being requested, to review whether the code is issuing sensible calls? There's no strace for io_uring - do you stick to DEBUG3, or is something like bpftrace / xfsslower the best way to go?
8. Not sure if it helps, but I've somehow managed to hit the supposedly impossible situation you describe in pgaio_uring_submit() "(ret != num_staged_ios)", though I had to push the urings really hard into using futexes, and I may also have made a coding error that contributed [3]. As it stands in the patch from my thread it was not handled: /* FIXME: fix ret != submitted ?! seems like bug?! */ (but I hit that code path pretty often with a 6.10.x kernel). A rough sketch of how a short submit could be handled follows below.
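For what it's worth, one way such a short submit could be handled - an untested sketch, where pgaio_uring_drain_completions() is a made-up placeholder for whatever consumes CQEs in the patch:

    int remaining = num_staged_ios;

    while (remaining > 0)
    {
        int ret = io_uring_submit(&ring);

        if (ret == -EINTR)
            continue;
        if (ret < 0)
            elog(PANIC, "io_uring_submit() failed: %s", strerror(-ret));

        remaining -= ret;

        /*
         * A short submit can happen e.g. when the completion queue is backed
         * up; the unsubmitted SQEs stay queued, so drain completions and
         * retry rather than treating it as a hard error.
         */
        if (remaining > 0)
            pgaio_uring_drain_completions();    /* hypothetical helper */
    }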
9. Please let me know what the current line of thinking about this patchset is: is it intended to be committed for v18? As a debug feature or a non-debug feature? (That is, which of the IO methods should be scrutinized the most because it is going to be the new default - sync or worker?)
10. At this point, does it even make sense to give pwritev2() with RWF_ATOMIC an experimental try? (That flag is already upstream, and XFS is apparently going to cover it in 6.12.x, but I could try with some -rcX.) A sketch of what such a call would look like is below.
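For reference, roughly what I would be testing - an assumption-laden sketch rather than patch code; it needs Linux >= 6.11 uapi headers (hence the fallback define), O_DIRECT, and a filesystem/device that advertises a large enough atomic write unit:

    #define _GNU_SOURCE
    #include <sys/uio.h>
    #include <unistd.h>

    #ifndef RWF_ATOMIC
    #define RWF_ATOMIC 0x00000040
    #endif

    static ssize_t
    atomic_block_write(int fd, const void *buf, size_t len, off_t offset)
    {
        struct iovec iov = { .iov_base = (void *) buf, .iov_len = len };

        /* expected to fail if len/offset don't satisfy the device's atomic write unit */
        return pwritev2(fd, &iov, 1, offset, RWF_ATOMIC);
    }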
-J.
P.S. I hope I did not ask stupid questions or miss anything.
[0] - https://www.postgresql.org/message-id/237y5rabqim2c2v37js53li6i34v2525y2baf32isyexecn4ic%40bqmlx5mrnwuf - "Right now the io_uring mode has each backend's io_uring instance visible to
each other process.(..)"
[1] - https://www.postgresql.org/message-id/flat/510b887e-c0ce-4a0c-a17a-2c6abb8d9a5c%40enterprisedb.com - sentence after: "FWIW there's another bottleneck people may not realize (..)"
[2] - https://www.postgresql.org/message-id/flat/x3f32prdpgalmiieyialqtn53j5uvb2e4c47nvnjetkipq3zyk%40xk7jy7fnua6w#dbedc74f7d19abf40b90f2c348fe1778
[3] - https://www.postgresql.org/message-id/CAKZiRmwrBjCbCJ433wV5zjvwt_OuY7BsVX12MBKiBu%2BeNZDm6g%40mail.gmail.com
Hi,

Sorry for losing track of your message for this long, I saw it just now because I was working on posting a new version.

On 2024-11-18 13:19:58 +0100, Jakub Wartak wrote:
> On Fri, Sep 6, 2024 at 9:38 PM Andres Freund <andres@anarazel.de> wrote:
> Thank You for worth admiring persistence on this. Please do not take it as
> criticism, just more like set of questions regarding the patchset v2.1 that
> I finally got little time to play with:
>
> 0. Doesn't the v2.1-0011-aio-Add-io_uring-method.patch -> in
> pgaio_uring_submit() -> io_uring_get_sqe() need a return value check ?

Yea, it shouldn't ever happen, but it's worth adding a check.

> Otherwise we'll never know that SQ is full in theory, perhaps at least such
> a check should be made with Assert() ? (I understand right now that we
> allow just up to io_uring_queue_init(io_max_concurrency), but what happens
> if:
> a. previous io_uring_submit() failed for some reason and we do not have
> free space for SQ?

We'd have PANICed at that failure :)

> b. (hypothetical) someday someone will try to make PG multithreaded and the
> code starts using just one big queue, still without checking for
> io_uring_get_sqe()?

That'd not make sense - you'd still want to use separate rings, to avoid contention.

> 1. In [0] you wrote that there's this high amount of FDs consumed for
> io_uring (dangerously close to RLIMIT_NOFILE). I can attest that there are
> many customers who are using extremely high max_connections (4k-5k, but
> there outliers with 10k in the wild too) - so they won't even start - and I
> have one doubt on the user-friendliness impact of this. I'm quite certain
> it's going to be the same as with pgbouncer where one is forced to tweak
> OS(systemd/pam/limits.conf/etc), but in PG we are better because PG tries
> to preallocate and then close() a lot of FDs, so that's safer in runtime.
> IMVHO even if we just consume e.g. say > 30% of FDs just for io_uring, the
> max_files_per_process looses it's spirit a little bit and PG is going to
> start loose efficiency too due to frequent open()/close() calls as fd cache
> is too small. Tomas also complained about it some time ago in [1])

My current thoughts around this are that we should generally, independent of io_uring, increase the FD limit ourselves.

In most distros the soft ulimit is set to something like 1024, but the hard limit is much higher. The reason for that is that some applications try to close all fds between 0 and RLIMIT_NOFILE - which takes a long time if RLIMIT_NOFILE is high. By setting only the soft limit to a low value any application needing higher limits can just opt into using more FDs.

On several of my machines the hard limit is 1073741816.

> So maybe it would be good to introduce couple of sanity checks too (even
> after setting higher limit):
> - issue FATAL in case of using io_method = io_ring && max_connections would
> be close to getrusage(RLIMIT_NOFILE)
> - issue warning in case of using io_method = io_ring && we wouldnt have
> even real 1k FDs free for handling relation FDs (detect something bad like:
> getrusage(RLIMIT_NOFILE) <= max_connections + max_files_per_process)

Probably still worth adding something like this, even if we were to do what I am suggesting above.

> 2. In pgaio_uring_postmaster_child_init_local() there
> "io_uring_queue_init(32,...)" - why 32? :) And also there's separate
> io_uring_queue_init(io_max_concurrency) which seems to be derived from
> AioChooseMaxConccurrency() which can go up to 64?

Yea, that's probably not right.

> 3. I find having two GUCs named literally the same
> (effective_io_concurrency, io_max_concurrency). It is clear from IO_URING
> perspective what is io_max_concurrency all about, but I bet having also
> effective_io_concurrency in the mix is going to be a little confusing for
> users (well, it is to me). Maybe that README.md could elaborate a little
> bit on the relation between those two? Or maybe do you plan to remove
> io_max_concurrency and bind it to effective_io_concurrency in future?

io_max_concurrency is a hard maximum that needs to be set at server start, because it requires allocating shared memory. Whereas effective_io_concurrency can be changed on a per-session and per-tablespace basis.

I.e. io_max_concurrency is a hard upper limit for an entire backend, whereas effective_io_concurrency controls how much one scan (or whatever does prefetching) can issue.

> To add more fun , there's MAX_IO_CONCURRENCY nearby in v2.1-0014 too while
> the earlier mentioned AioChooseMaxConccurrency() goes up to just 64

Yea, that should probably be disambiguated.

> 4. While we are at this, shouldn't the patch rename debug_io_direct to
> simply io_direct so that GUCs are consistent in terms of naming?

I used to have a patch like that in the series and it was a pain to rebase... I'm also not sure this is quite enough to make debug_io_direct production ready, even if just considering io_direct=data. Without streaming read use in heap + index VACUUM, RelationCopyStorage() and a few other places the performance consequences of using direct IO can be, um, surprising.

> 5. It appears that pg_stat_io.reads seems to be not refreshed until they
> query seems to be finished. While running a query for minutes with this
> patchset, I've got:
> now | reads | read_time
> -------------------------------+----------+-----------
> 2024-11-15 12:09:09.151631+00 | 15004271 | 0
> [..]
> 2024-11-15 12:10:25.241175+00 | 15004271 | 0
> 2024-11-15 12:10:26.241179+00 | 15004271 | 0
> 2024-11-15 12:10:27.241139+00 | 18250913 | 0
>
> Or is that how it is supposed to work?

Currently the patch has a FIXME to add some IO statistics (I think I raised that somewhere in this thread, too). It's not clear to me what IO time ought to mean. I suspect the least bad answer is what you suggest:

> Also pg_stat_io.read_time would be something vague with io_uring/worker, so
> maybe zero is good here (?). Otherwise we would have to measure time spent
> on waiting alone, but that would force more instructions for calculating io
> times...

I.e. we should track the amount of time spent waiting for IOs. I don't think tracking time in worker or such would make much sense, that'd often end up with reporting more IO time than a query took.

> 6. After playing with some basic measurements - which went fine, I wanted
> to go test simple PostGIS even with sequential scans to see any
> compatibility issues (AFAIR Thomas Munro on PGConfEU indicated as good
> testing point), but before that I've tried to see what's the TOAST
> performance alone with AIO+DIO (debug_io_direct=data).

It's worth noting that with the last posted version you needed to increase effective_io_concurrency to something very high to see sensible performance. That's due to the way read_stream_begin_impl() limited the number of buffers pinned to effective_io_concurrency * 4 - which, due to io_combine_limit, ends up allowing only a single IO in flight in case of sequential blocks until effective_io_concurrency is set to 8 or such.

I've adjusted that to some degree now, but I think that might need a bit more sophistication.

> One issue I have found is that DIO seems to be unusable until somebody will
> teach TOAST to use readstreams, is that correct? Maybe I'm doing something
> wrong, but I haven't seen any TOAST <-> readstreams topic:

Hm, I suspect that a read stream won't help a whole lot in many toast cases. Unless you have particularly long toast datums, the time is going to be dominated by the random accesses, as each toast datum is looked up in a non-predictable way.

Generally, using DIO requires tuning shared buffers much more aggressively than not using DIO, no amount of stream use will change that. Of course we should try to reduce that "downside"...

I'm not sure if the best way to do prefetching of toast chunks would be to rely on more generalized index->table prefetching support, or to have dedicated code.

> -- 12MB table , 25GB toast
> create table t (id bigint, t text storage external);
> insert into t select i::bigint as id, repeat(md5(i::text),4000)::text as r
> from generate_series(1,200000) s(i);
> set max_parallel_workers_per_gather=0;
> \timing
> -- with cold caches: empty s_b, echo 3 > drop_caches
> select sum(length(t)) from t;
> master 101897.823 ms (01:41.898)
> AIO 99758.399 ms (01:39.758)
> AIO+DIO 191479.079 ms (03:11.479)
>
> hotpath was detoast_attr() -> toast_fetch_datum() ->
> heap_fetch_toast_slice() -> systable_getnext_ordered() ->
> index_getnext_slot() -> index_fetch_heap() -> heapam_index_fetch_tuple() ->
> ReadBufferExtended -> AIO code.
>
> The difference is that on cold caches with DIO gets 2x slowdown; with clean
> s_b and so on:
> * getting normal heap data seqscan: up to 285MB/s
> * but TOASTs maxes out at 132MB/s when using io_uring+DIO

I started loading the data to try this out myself :).

> Not about patch itself, but questions about related stack functionality:
> ----------------------------------------------------------------------------------------------------
>
> 7. Is pg_stat_aios still on the table or not ? (AIO 2021 had it). Any hints
> on how to inspect real I/O calls requested to review if the code is issuing
> sensible calls: there's no strace for uring, or do you stick to DEBUG3 or
> perhaps using some bpftrace / xfsslower is the best way to go ?

I think we still want something like it, but I don't think it needs to be in the initial commits.

There are kernel events that you can track using e.g. perf. Particularly useful are io_uring:io_uring_submit_req and io_uring:io_uring_complete.

> 8. Not sure if that helps, but I've managed the somehow to hit the
> impossible situation You describe in pgaio_uring_submit() "(ret !=
> num_staged_ios)", but I had to push urings really hard into using futexes
> and probably I've could made some error in coding too for that too occur
> [3]. As it stands in that patch from my thread, it was not covered: /*
> FIXME: fix ret != submitted ?! seems like bug?! */ (but i had that hit that
> code-path pretty often with 6.10.x kernel)

I think you can hit that if you don't take care to limit the number of IOs being submitted at once or if you're not consuming completions. If the completion queue is full enough the kernel at some point won't allow more IOs to be submitted.

> 9. Please let me know, what's the current up to date line of thinking about
> this patchset: is it intended to be committed as v18 ?

I'd love to get some of it into 18. I don't quite know whether we can make it happen and to what extent.

> As a debug feature or as non-debug feature? (that is which of the IO methods
> should be scrutinized the most as it is going to be the new default - sync
> or worker?)

I'd say initially worker, with a beta 1 or 2 checklist item to revise it.

> 10. At this point, does it even make sense to give a try experimenty try to
> pwritev2() with RWF_ATOMIC? (that thing is already in the open, but XFS is
> going to cover it with 6.12.x apparently, but I could try with some -rcX)

I don't think that's worth doing right now. There's too many dependencies and it's going to be a while till the kernel support for that is widespread enough to matter.

There's also the issue that, to my knowledge, outside of cloud environments there's pretty much no hardware that actually reports power-fail atomicity sizes bigger than a sector.

> p.s. I hope I did not ask stupid questions nor missed anything.

You did not!

Greetings,

Andres Freund
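(To spell out the arithmetic behind the pin limit mentioned above, assuming an io_combine_limit of 32 blocks purely as an example:)

    /* illustration of the old read_stream_begin_impl() heuristic */
    max_pinned_buffers = effective_io_concurrency * 4;
    max_combined_ios   = max_pinned_buffers / io_combine_limit;

    /* effective_io_concurrency = 1  ->   4 pinned buffers -> no full-sized combined IO */
    /* effective_io_concurrency = 8  ->  32 pinned buffers -> exactly 1 combined IO     */
    /* effective_io_concurrency = 64 -> 256 pinned buffers -> up to 8 combined IOs      */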
Andres Freund <andres@anarazel.de> writes:
> My current thoughts around this are that we should generally, independent of
> io_uring, increase the FD limit ourselves.

I'm seriously down on that, because it amounts to an assumption that we own the machine and can appropriate all its resources. If ENFILE weren't a thing, it'd be all right, but that is a thing. We have no business trying to consume resources the DBA didn't tell us we could.

regards, tom lane
Hi,

On 2024-12-19 17:34:29 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > My current thoughts around this are that we should generally, independent of
> > io_uring, increase the FD limit ourselves.
>
> I'm seriously down on that, because it amounts to an assumption that
> we own the machine and can appropriate all its resources. If ENFILE
> weren't a thing, it'd be all right, but that is a thing. We have no
> business trying to consume resources the DBA didn't tell us we could.

Arguably the configuration *did* tell us, by having a higher hard limit...

I'm not saying that we should increase the limit without a bound or without a configuration option, btw.

As I had mentioned, the problem with relying on increasing the soft limit is that it's not generally sensible to do so, because it causes a bunch of binaries to be weirdly slow. Another reason to not increase the soft rlimit is that doing so can break programs relying on select().

But opting into a higher rlimit, while obviously adhering to the hard limit and perhaps some other config knob, seems fine?

Greetings,

Andres Freund
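(For concreteness, a minimal sketch of the kind of opt-in being discussed - raising the soft limit toward the hard limit at startup, capped by some configured value; illustrative only, not patch code:)

    #include <sys/resource.h>

    /* hypothetical: raise the soft RLIMIT_NOFILE up to min(hard limit, wanted) */
    static void
    raise_nofile_limit(rlim_t wanted)
    {
        struct rlimit rlim;

        if (getrlimit(RLIMIT_NOFILE, &rlim) != 0)
            return;

        if (rlim.rlim_cur >= wanted)
            return;                 /* soft limit already high enough */

        rlim.rlim_cur = (wanted < rlim.rlim_max) ? wanted : rlim.rlim_max;
        (void) setrlimit(RLIMIT_NOFILE, &rlim);
    }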
On Fri, 20 Dec 2024 at 01:54, Andres Freund <andres@anarazel.de> wrote:
> Arguably the configuration *did* tell us, by having a higher hard limit...
> <snip>
> But opting into a higher rlimit, while obviously adhering to the hard limit
> and perhaps some other config knob, seems fine?

Yes, totally fine. That's exactly the reasoning why the hard limit is so much larger than the soft limit by default on systems with systemd:

https://0pointer.net/blog/file-descriptor-limits.html
Hi,

On 2024-12-20 18:27:13 +0100, Jelte Fennema-Nio wrote:
> On Fri, 20 Dec 2024 at 01:54, Andres Freund <andres@anarazel.de> wrote:
> > Arguably the configuration *did* tell us, by having a higher hard limit...
> > <snip>
> > But opting into a higher rlimit, while obviously adhering to the hard limit
> > and perhaps some other config knob, seems fine?
>
> Yes, totally fine. That's exactly the reasoning why the hard limit is
> so much larger than the soft limit by default on systems with systemd:
>
> https://0pointer.net/blog/file-descriptor-limits.html

Good link.

This isn't just relevant for using io_uring: There obviously are several people working on threaded postgres. Even if we didn't duplicate fd.c file descriptors between threads (we probably will, at least initially), the client connection FDs alone will mean that we have a lot more FDs open.

Due to the select() issue the soft limit won't be increased beyond 1024, and requiring everyone to add a 'ulimit -n $somehighnumber' before starting postgres on Linux doesn't help anyone.

Greetings,

Andres Freund
Hi,

On 2024-12-19 17:29:12 -0500, Andres Freund wrote:
> > Not about patch itself, but questions about related stack functionality:
> > ----------------------------------------------------------------------------------------------------
> >
> >
> > 7. Is pg_stat_aios still on the table or not ? (AIO 2021 had it). Any hints
> > on how to inspect real I/O calls requested to review if the code is issuing
> > sensible calls: there's no strace for uring, or do you stick to DEBUG3 or
> > perhaps using some bpftrace / xfsslower is the best way to go ?
>
> I think we still want something like it, but I don't think it needs to be in
> the initial commits.

After I got this question from Thomas as well, I started hacking one up.

What information would you like to see?

Here's what I currently have:

┌─[ RECORD 1 ]───┬────────────────────────────────────────────────┐
│ pid            │ 358212                                         │
│ io_id          │ 2050                                           │
│ io_generation  │ 4209                                           │
│ state          │ COMPLETED_SHARED                               │
│ operation      │ read                                           │
│ offset         │ 509083648                                      │
│ length         │ 262144                                         │
│ subject        │ smgr                                           │
│ iovec_data_len │ 32                                             │
│ raw_result     │ 262144                                         │
│ result         │ OK                                             │
│ error_desc     │ (null)                                         │
│ subject_desc   │ blocks 1372864..1372895 in file "base/5/16388" │
│ flag_sync      │ f                                              │
│ flag_localmem  │ f                                              │
│ flag_buffered  │ t                                              │
├─[ RECORD 2 ]───┼────────────────────────────────────────────────┤
│ pid            │ 358212                                         │
│ io_id          │ 2051                                           │
│ io_generation  │ 4199                                           │
│ state          │ IN_FLIGHT                                      │
│ operation      │ read                                           │
│ offset         │ 511967232                                      │
│ length         │ 262144                                         │
│ subject        │ smgr                                           │
│ iovec_data_len │ 32                                             │
│ raw_result     │ (null)                                         │
│ result         │ UNKNOWN                                        │
│ error_desc     │ (null)                                         │
│ subject_desc   │ blocks 1373216..1373247 in file "base/5/16388" │
│ flag_sync      │ f                                              │
│ flag_localmem  │ f                                              │
│ flag_buffered  │ t                                              │

I didn't think that pg_stat_* was quite the right namespace, given that it shows not stats, but the currently ongoing IOs. I am going with pg_aios for now, but I don't particularly like that.

I think we'll want a pg_stat_aio as well, tracking things like:

- how often the queue to IO workers was full
- how many times we submitted IO to the kernel (<= #ios with io_uring)
- how many times we asked the kernel for events (<= #ios with io_uring)
- how many times we had to wait for in-flight IOs before issuing more IOs

Greetings,

Andres Freund
On Mon, Jan 6, 2025 at 5:28 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2024-12-19 17:29:12 -0500, Andres Freund wrote: > > > Not about patch itself, but questions about related stack functionality: > > > ---------------------------------------------------------------------------------------------------- > > > > > > > > > 7. Is pg_stat_aios still on the table or not ? (AIO 2021 had it). Any hints > > > on how to inspect real I/O calls requested to review if the code is issuing > > > sensible calls: there's no strace for uring, or do you stick to DEBUG3 or > > > perhaps using some bpftrace / xfsslower is the best way to go ? > > > > I think we still want something like it, but I don't think it needs to be in > > the initial commits. > > After I got this question from Thomas as well, I started hacking one up. > > What information would you like to see? > > Here's what I currently have: .. > ├─[ RECORD 2 ]───┼────────────────────────────────────────────────┤ > │ pid │ 358212 │ > │ io_id │ 2051 │ > │ io_generation │ 4199 │ > │ state │ IN_FLIGHT │ > │ operation │ read │ > │ offset │ 511967232 │ > │ length │ 262144 │ > │ subject │ smgr │ > │ iovec_data_len │ 32 │ > │ raw_result │ (null) │ > │ result │ UNKNOWN │ > │ error_desc │ (null) │ > │ subject_desc │ blocks 1373216..1373247 in file "base/5/16388" │ > │ flag_sync │ f │ > │ flag_localmem │ f │ > │ flag_buffered │ t │ Cool! It's more than enough for me in future, thanks! > I didn't think that pg_stat_* was quite the right namespace, given that it > shows not stats, but the currently ongoing IOs. I am going with pg_aios for > now, but I don't particularly like that. If you are looking for other proposals: * pg_aios_progress ? (to follow pattern of pg_stat_copy|vaccuum_progress?) * pg_debug_aios ? * pg_debug_io ? > I think we'll want a pg_stat_aio as well, tracking things like: > > - how often the queue to IO workes was full > - how many times we submitted IO to the kernel (<= #ios with io_uring) > - how many times we asked the kernel for events (<= #ios with io_uring) > - how many times we had to wait for in-flight IOs before issuing more IOs If I could dream of one thing that would be 99.9% percentile of IO response times in milliseconds for different classes of I/O traffic (read/write/flush). But it sounds like it would be very similiar to pg_stat_io and potentially would have to be per-tablespace/IO-traffic(subject)-type too. AFAIU pg_stat_io has improper structure to have that there. BTW: before trying to even start to compile that AIO v2.2* and responding to the previous review, what are You looking interested to hear the most about it so that it adds some value ? Any workload specific measurements? just general feedback, functionality gaps? Integrity/data testing with stuff like dm-dust, dm-flakey, dm-delay to try the error handling routines? Some kind of AIO <-> standby/recovery interactions? * - btw, Date: 2025-01-01 04:03:33 - I saw what you did there! so let's officially recognize the 2025 as the year of AIO in PG, as it was 1st message :D -J.
Hi, On 2025-01-08 15:04:39 +0100, Jakub Wartak wrote: > On Mon, Jan 6, 2025 at 5:28 PM Andres Freund <andres@anarazel.de> wrote: > > I didn't think that pg_stat_* was quite the right namespace, given that it > > shows not stats, but the currently ongoing IOs. I am going with pg_aios for > > now, but I don't particularly like that. > > If you are looking for other proposals: > * pg_aios_progress ? (to follow pattern of pg_stat_copy|vaccuum_progress?) > * pg_debug_aios ? > * pg_debug_io ? I think pg_aios is better than those, if not by much. Seems others are ok with that name too. And we easily can evolve it later. > > I think we'll want a pg_stat_aio as well, tracking things like: > > > > - how often the queue to IO workes was full > > - how many times we submitted IO to the kernel (<= #ios with io_uring) > > - how many times we asked the kernel for events (<= #ios with io_uring) > > - how many times we had to wait for in-flight IOs before issuing more IOs > > If I could dream of one thing that would be 99.9% percentile of IO > response times in milliseconds for different classes of I/O traffic > (read/write/flush). But it sounds like it would be very similiar to > pg_stat_io and potentially would have to be > per-tablespace/IO-traffic(subject)-type too. Yea, that's a significant project on its own. It's not that cheap to compute reasonably accurate percentiles and we have no infrastructure for doing so right now. > AFAIU pg_stat_io has improper structure to have that there. Hm, not obvious to me why? It might make the view a bit wide to add it as an additional column, but otherwise I don't see a problem? > BTW: before trying to even start to compile that AIO v2.2* and > responding to the previous review, what are You looking interested to > hear the most about it so that it adds some value? Due to the rather limited "users" of AIO in the patchset, I think most benchmarks aren't expected to show any meaningful gains. However, they shouldn't show any significant regressions either (when not using direct IO). I think trying to find regressions would be a rather valuable thing. I'm tempted to collect a few of the reasonbly-ready read stream conversions into the patchset, to make the potential gains more visible. But I am not sure it's a good investment of time right now. One small regression I do know about, namely scans of large relations that are bigger than shared buffers but do fit in the kernel page cache. The increase of BAS_BULKREAD does cause a small slowdown - but without it we never can do sufficient asynchronous IO. I think the slowdown is small enough to just accept that, but it's worth qualifying that on a few machines. > Any workload specific measurements? just general feedback, functionality > gaps? To see the benefits it'd be interesting to compare: 1) sequential scan performance with data not in shared buffers, using buffered IO 2) same, but using direct IO when testing the patch 3) checkpoint performance In my experiments 1) gains a decent amount of performance in many cases, but nothing overwhelming - sequential scans are easy for the kernel to read ahead. 
I do see very significant gains for 2) - On a system with 10 striped NVMe SSDs that each can do ~3.5 GB/s I measured very parallel sequential scans (I had to use ALTER TABLE to get sufficient numbers of workers): master: ~18 GB/s patch, buffered: ~20 GB/s patch, direct, worker: ~28 GB/s patch, direct, uring: ~35 GB/s This was with io_workers=32, io_max_concurrency=128, effective_io_concurrency=1000 (doesn't need to be that high, but it's what I still have the numbers for). This was without data checksums enabled as otherwise the checksum code becomes a *huge* bottleneck. I also see significant gains with 3). Bigger when using direct IO. One complicating factor measuring 3) is that the first write to a block will often be slower than subsequent writes because the filesystem will need to update some journaled metadata, presenting a bottleneck. Checkpoint performance is also severely limited by data checksum computation if enabled - independent of this patchset. One annoying thing when testing DIO is that right now VACUUM will be rather slow if the data isn't already in s_b, as it isn't yet read-stream-ified. > Integrity/data testing with stuff like dm-dust, dm-flakey, dm-delay > to try the error handling routines? Hm. I don't think that's going to work very well even on master. If the filesystem fails there's not much that PG can do... > Some kind of AIO <-> standby/recovery interactions? I wouldn't expect anything there. I think Thomas somewhere has a patch that read-stream-ifies recovery prefetching, once that's done it would be more interesting. > * - btw, Date: 2025-01-01 04:03:33 - I saw what you did there! so > let's officially recognize the 2025 as the year of AIO in PG, as it > was 1st message :D Hah, that was actually the opposite of what I intended :). I'd hoped to post earlier, but jetlag had caught up with me... Greetings, Andres Freund
On Wed, 8 Jan 2025 at 22:58, Andres Freund <andres@anarazel.de> wrote: > master: ~18 GB/s > patch, buffered: ~20 GB/s > patch, direct, worker: ~28 GB/s > patch, direct, uring: ~35 GB/s > > > This was with io_workers=32, io_max_concurrency=128, > effective_io_concurrency=1000 (doesn't need to be that high, but it's what I > still have the numbers for). > > > This was without data checksums enabled as otherwise the checksum code becomes > a *huge* bottleneck. I'm curious about this because the checksum code should be fast enough to easily handle that throughput. I remember checksum overhead being negligible even when pulling in pages from page cache. Is it just that the calculation is slow, or is it the fact that checksumming needs to bring the page into the CPU cache. Did you notice any hints which might be the case? I don't really have a machine at hand that can do anywhere close to this amount of I/O. I'm asking because if it's the calculation that is slow then it seems like it's time to compile different ISA extension variants of the checksum code and select the best one at runtime. -- Ants Aasma
Hi, On 2025-01-09 10:59:22 +0200, Ants Aasma wrote: > On Wed, 8 Jan 2025 at 22:58, Andres Freund <andres@anarazel.de> wrote: > > master: ~18 GB/s > > patch, buffered: ~20 GB/s > > patch, direct, worker: ~28 GB/s > > patch, direct, uring: ~35 GB/s > > > > > > This was with io_workers=32, io_max_concurrency=128, > > effective_io_concurrency=1000 (doesn't need to be that high, but it's what I > > still have the numbers for). > > > > > > This was without data checksums enabled as otherwise the checksum code becomes > > a *huge* bottleneck. > > I'm curious about this because the checksum code should be fast enough > to easily handle that throughput. It seems to top out at about ~5-6 GB/s on my 2x Xeon Gold 6442Y workstation. But we don't have a good ready-made way of testing that without also doing IO, so it's kinda hard to say. > I remember checksum overhead being negligible even when pulling in pages > from page cache. It's indeed much less of an issue when pulling pages from the page cache, as the copy from the page cache is fairly slow. With direct-IO, where the copy from the page cache isn't the main driver of CPU use anymore, it becomes much clearer. Even with buffered IO it became a bigger issue with 17, due to io_combine_limit. It turns out that lots of tiny syscalls are slow, so the peak throughput that could reach the checksumming code was lower. I created a 21554MB relation and measured the time to do a pg_prewarm() of that relation after evicting all of shared buffers (the first time buffers are touched has a bit different perf characteristics). I am using direct IO and io_uring here, as buffered IO would include the page cache copy cost and worker mode could parallelize the checksum computation on reads. The checksum cost is a bigger issue for writes than reads, but it's harder to quickly generate enough dirty data for a repeatable benchmark. This system can do about 12.5GB/s of read IO. Just to show the effect of the read size on page cache copy performance: config checksums time in ms buffered io_engine=sync io_combine_limit=1 0 6712.153 buffered io_engine=sync io_combine_limit=2 0 5919.215 buffered io_engine=sync io_combine_limit=4 0 5738.496 buffered io_engine=sync io_combine_limit=8 0 5396.415 buffered io_engine=sync io_combine_limit=16 0 5312.803 buffered io_engine=sync io_combine_limit=32 0 5275.389 To see the effect of page cache copy overhead: config checksums time in ms buffered io_engine=io_uring 0 3901.625 direct io_engine=io_uring 0 2075.330 Now to show the effect of checksums (enabled/disabled with pg_checksums): config checksums time in ms buffered io_engine=io_uring 0 3883.127 buffered io_engine=io_uring 1 5880.892 direct io_engine=io_uring 0 2067.142 direct io_engine=io_uring 1 3835.968 So with direct + uring w/o checksums, we can reach 10427 MB/s (close-ish to disk speed), but with checksums we only reach 5620 MB/s. > Is it just that the calculation is slow, or is it the fact that checksumming > needs to bring the page into the CPU cache. Did you notice any hints which > might be the case? I don't think the issue is that checksumming pulls the data into CPU caches 1) This is visible with SELECT that actually uses the data 2) I added prefetching to avoid any meaningful amount of cache misses and it doesn't change the overall timing much 3) It's visible with buffered IO, which has pulled the data into CPU caches already > I don't really have a machine at hand that can do anywhere close to this > amount of I/O. 
It's visible even when pulling from the page cache, if to a somewhat lesser degree. I wonder if it's worth adding a test function that computes checksums of all shared buffers in memory already. That'd allow exercising the checksum code in a realistic context (i.e. buffer locking etc preventing some out-of-order effects, using 8kB chunks etc) without also needing to involve the IO path. > I'm asking because if it's the calculation that is slow then it seems > like it's time to compile different ISA extension variants of the > checksum code and select the best one at runtime. You think it's ISA specific? I don't see a significant effect of compiling with -march=native or not - and that should suffice to make the checksum code built with sufficiently high ISA support, right? FWIW CPU profiles show all the time being spent in the "main checksum calculation" loop: Percent | Source code & Disassembly of postgres for cycles:P (5866 samples, percent: local period) -------------------------------------------------------------------------------------------------------- : : : : 3 Disassembly of section .text: : : 5 00000000009e3670 <pg_checksum_page>: : 6 * calculation isn't affected by the old checksum stored on the page. : 7 * Restore it after, because actually updating the checksum is NOT part of : 8 * the API of this function. : 9 */ : 10 save_checksum = cpage->phdr.pd_checksum; : 11 cpage->phdr.pd_checksum = 0; 0.00 : 9e3670: xor %eax,%eax : 13 CHECKSUM_COMP(sums[j], page->data[i][j]); 0.00 : 9e3672: mov $0x1000193,%r8d : 15 cpage->phdr.pd_checksum = 0; 0.00 : 9e3678: vmovdqa -0x693fa0(%rip),%ymm3 # 34f6e0 <.LC0> 0.05 : 9e3680: vmovdqa -0x6935c8(%rip),%ymm4 # 3500c0 <.LC1> 0.00 : 9e3688: vmovdqa -0x693c10(%rip),%ymm0 # 34fa80 <.LC2> 0.00 : 9e3690: vmovdqa -0x693598(%rip),%ymm1 # 350100 <.LC3> : 20 { 0.00 : 9e3698: mov %esi,%ecx 0.02 : 9e369a: lea 0x2000(%rdi),%rdx : 23 save_checksum = cpage->phdr.pd_checksum; 0.00 : 9e36a1: movzwl 0x8(%rdi),%esi : 25 CHECKSUM_COMP(sums[j], page->data[i][j]); 0.00 : 9e36a5: vpbroadcastd %r8d,%ymm5 : 27 cpage->phdr.pd_checksum = 0; 0.00 : 9e36ab: mov %ax,0x8(%rdi) : 29 for (i = 0; i < (uint32) (BLCKSZ / (sizeof(uint32) * N_SUMS)); i++) 0.14 : 9e36af: mov %rdi,%rax 0.00 : 9e36b2: nopw 0x0(%rax,%rax,1) : 32 CHECKSUM_COMP(sums[j], page->data[i][j]); 15.36 : 9e36b8: vpxord (%rax),%ymm1,%ymm1 4.79 : 9e36be: vmovdqu 0x80(%rax),%ymm2 : 35 for (i = 0; i < (uint32) (BLCKSZ / (sizeof(uint32) * N_SUMS)); i++) 0.07 : 9e36c6: add $0x100,%rax : 37 CHECKSUM_COMP(sums[j], page->data[i][j]); 2.45 : 9e36cc: vpxord -0xe0(%rax),%ymm0,%ymm0 2.85 : 9e36d3: vpmulld %ymm5,%ymm1,%ymm6 0.02 : 9e36d8: vpsrld $0x11,%ymm1,%ymm1 3.17 : 9e36dd: vpternlogd $0x96,%ymm6,%ymm1,%ymm2 2.01 : 9e36e4: vpmulld %ymm5,%ymm0,%ymm6 13.16 : 9e36e9: vpmulld %ymm5,%ymm2,%ymm1 0.03 : 9e36ee: vpsrld $0x11,%ymm2,%ymm2 0.02 : 9e36f3: vpsrld $0x11,%ymm0,%ymm0 2.57 : 9e36f8: vpxord %ymm2,%ymm1,%ymm1 0.89 : 9e36fe: vmovdqu -0x60(%rax),%ymm2 0.12 : 9e3703: vpternlogd $0x96,%ymm6,%ymm0,%ymm2 4.48 : 9e370a: vpmulld %ymm5,%ymm2,%ymm0 0.49 : 9e370f: vpsrld $0x11,%ymm2,%ymm2 0.99 : 9e3714: vpxord %ymm2,%ymm0,%ymm0 11.88 : 9e371a: vpxord -0xc0(%rax),%ymm4,%ymm2 2.80 : 9e3721: vpmulld %ymm5,%ymm2,%ymm6 0.68 : 9e3726: vpsrld $0x11,%ymm2,%ymm4 4.94 : 9e372b: vmovdqu -0x40(%rax),%ymm2 1.45 : 9e3730: vpternlogd $0x96,%ymm6,%ymm4,%ymm2 8.63 : 9e3737: vpmulld %ymm5,%ymm2,%ymm4 0.17 : 9e373c: vpsrld $0x11,%ymm2,%ymm2 1.81 : 9e3741: vpxord %ymm2,%ymm4,%ymm4 0.10 : 9e3747: vpxord -0xa0(%rax),%ymm3,%ymm2 0.70 : 9e374e: vpmulld 
%ymm5,%ymm2,%ymm6 1.65 : 9e3753: vpsrld $0x11,%ymm2,%ymm3 0.03 : 9e3758: vmovdqu -0x20(%rax),%ymm2 0.85 : 9e375d: vpternlogd $0x96,%ymm6,%ymm3,%ymm2 3.73 : 9e3764: vpmulld %ymm5,%ymm2,%ymm3 0.07 : 9e3769: vpsrld $0x11,%ymm2,%ymm2 1.48 : 9e376e: vpxord %ymm2,%ymm3,%ymm3 : 68 for (i = 0; i < (uint32) (BLCKSZ / (sizeof(uint32) * N_SUMS)); i++) 0.02 : 9e3774: cmp %rax,%rdx 2.32 : 9e3777: jne 9e36b8 <pg_checksum_page+0x48> : 71 CHECKSUM_COMP(sums[j], 0); 0.17 : 9e377d: vpmulld %ymm5,%ymm0,%ymm7 0.07 : 9e3782: vpmulld %ymm5,%ymm1,%ymm6 : 74 checksum = pg_checksum_block(cpage); : 75 cpage->phdr.pd_checksum = save_checksum; 0.00 : 9e3787: mov %si,0x8(%rdi) : 77 CHECKSUM_COMP(sums[j], 0); 0.02 : 9e378b: vpsrld $0x11,%ymm0,%ymm0 0.02 : 9e3790: vpsrld $0x11,%ymm1,%ymm1 0.02 : 9e3795: vpsrld $0x11,%ymm4,%ymm2 0.00 : 9e379a: vpxord %ymm0,%ymm7,%ymm7 0.10 : 9e37a0: vpmulld %ymm5,%ymm4,%ymm0 0.00 : 9e37a5: vpxord %ymm1,%ymm6,%ymm6 0.17 : 9e37ab: vpmulld %ymm5,%ymm3,%ymm1 0.19 : 9e37b0: vpmulld %ymm5,%ymm6,%ymm4 0.00 : 9e37b5: vpsrld $0x11,%ymm6,%ymm6 0.02 : 9e37ba: vpxord %ymm2,%ymm0,%ymm0 0.00 : 9e37c0: vpsrld $0x11,%ymm3,%ymm2 0.22 : 9e37c5: vpmulld %ymm5,%ymm7,%ymm3 0.02 : 9e37ca: vpsrld $0x11,%ymm7,%ymm7 0.00 : 9e37cf: vpxord %ymm2,%ymm1,%ymm1 0.03 : 9e37d5: vpsrld $0x11,%ymm0,%ymm2 0.15 : 9e37da: vpmulld %ymm5,%ymm0,%ymm0 : 94 result ^= sums[i]; 0.00 : 9e37df: vpternlogd $0x96,%ymm3,%ymm7,%ymm2 : 96 CHECKSUM_COMP(sums[j], 0); 0.05 : 9e37e6: vpsrld $0x11,%ymm1,%ymm3 0.19 : 9e37eb: vpmulld %ymm5,%ymm1,%ymm1 : 99 result ^= sums[i]; 0.02 : 9e37f0: vpternlogd $0x96,%ymm4,%ymm6,%ymm0 0.10 : 9e37f7: vpxord %ymm1,%ymm0,%ymm0 0.07 : 9e37fd: vpternlogd $0x96,%ymm2,%ymm3,%ymm0 0.15 : 9e3804: vextracti32x4 $0x1,%ymm0,%xmm1 0.03 : 9e380b: vpxord %xmm0,%xmm1,%xmm0 0.14 : 9e3811: vpsrldq $0x8,%xmm0,%xmm1 0.12 : 9e3816: vpxord %xmm1,%xmm0,%xmm0 0.09 : 9e381c: vpsrldq $0x4,%xmm0,%xmm1 0.12 : 9e3821: vpxord %xmm1,%xmm0,%xmm0 0.05 : 9e3827: vmovd %xmm0,%eax : : 111 /* Mix in the block number to detect transposed pages */ : 112 checksum ^= blkno; 0.07 : 9e382b: xor %ecx,%eax : : 115 /* : 116 * Reduce to a uint16 (to fit in the pd_checksum field) with an offset of : 117 * one. That avoids checksums of zero, which seems like a good idea. : 118 */ : 119 return (uint16) ((checksum % 65535) + 1); 0.00 : 9e382d: mov $0x80008001,%ecx 0.03 : 9e3832: mov %eax,%edx 0.27 : 9e3834: imul %rcx,%rdx 0.09 : 9e3838: shr $0x2f,%rdx 0.07 : 9e383c: lea 0x1(%rax,%rdx,1),%eax 0.00 : 9e3840: vzeroupper : 126 } 0.15 : 9e3843: ret I did briefly experiment with changing N_SUMS. 16 is substantially worse, 64 seems to be about the same as 32. Greetings, Andres Freund
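(For reference, the C that the annotated disassembly above corresponds to is roughly the following, paraphrased from src/include/storage/checksum_impl.h; sums[] is seeded from a small constant table before the loop:)

    #define N_SUMS 32
    #define FNV_PRIME 16777619

    /* one mixing round; this is what becomes the vpxord/vpmulld/vpsrld sequence */
    #define CHECKSUM_COMP(checksum, value) \
    do { \
        uint32 __tmp = (checksum) ^ (value); \
        (checksum) = __tmp * FNV_PRIME ^ (__tmp >> 17); \
    } while (0)

    /* the hot loop: N_SUMS independent lanes walking the 8kB page */
    for (i = 0; i < (uint32) (BLCKSZ / (sizeof(uint32) * N_SUMS)); i++)
        for (j = 0; j < N_SUMS; j++)
            CHECKSUM_COMP(sums[j], page->data[i][j]);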
Hi, On 2025-01-09 20:10:24 +0200, Ants Aasma wrote: > On Thu, 9 Jan 2025 at 18:25, Andres Freund <andres@anarazel.de> wrote: > > > I'm curious about this because the checksum code should be fast enough > > > to easily handle that throughput. > > > > It seems to top out at about ~5-6 GB/s on my 2x Xeon Gold 6442Y > > workstation. But we don't have a good ready-made way of testing that without > > also doing IO, so it's kinda hard to say. > > Interesting, I wonder if it's related to Intel increasing vpmulld > latency to 10 already back in Haswell. The Zen 3 I'm testing on has > latency 3 and has twice the throughput. > Attached is a naive and crude benchmark that I used for testing here. > Compiled with: > > gcc -O2 -funroll-loops -ftree-vectorize -march=native \ > -I$(pg_config --includedir-server) \ > bench-checksums.c -o bench-checksums-native > > Just fills up an array of pages and checksums them, first argument is > number of checksums, second is array size. I used 1M checksums and 100 > pages for in cache behavior and 100000 pages for in memory > performance. > > 869.85927ms @ 9.418 GB/s - generic from memory > 772.12252ms @ 10.610 GB/s - generic in cache > 442.61869ms @ 18.508 GB/s - native from memory > 137.07573ms @ 59.763 GB/s - native in cache printf '%16s\t%16s\t%s\n' march mem result; for mem in 100 100000 1000000; do for march in x86-64 x86-64-v2 x86-64-v3 x86-64-v4native; do printf "%16s\t%16s\t" $march $mem; gcc -g -g3 -O2 -funroll-loops -ftree-vectorize -march=$march -I ~/src/postgresql/src/include/-I src/include/ /tmp/bench-checksums.c -o bench-checksums-native && numactl --physcpubind 1--membind 0 ./bench-checksums-native 1000000 $mem;done; done Workstation w/ 2x Xeon Gold 6442Y: march mem result x86-64 100 731.87779ms @ 11.193 GB/s x86-64-v2 100 327.18580ms @ 25.038 GB/s x86-64-v3 100 264.03547ms @ 31.026 GB/s x86-64-v4 100 282.08065ms @ 29.041 GB/s native 100 246.13766ms @ 33.282 GB/s x86-64 100000 842.66827ms @ 9.722 GB/s x86-64-v2 100000 604.52959ms @ 13.551 GB/s x86-64-v3 100000 477.16239ms @ 17.168 GB/s x86-64-v4 100000 476.07039ms @ 17.208 GB/s native 100000 456.08080ms @ 17.962 GB/s x86-64 1000000 845.51132ms @ 9.689 GB/s x86-64-v2 1000000 612.07973ms @ 13.384 GB/s x86-64-v3 1000000 485.23738ms @ 16.882 GB/s x86-64-v4 1000000 483.86411ms @ 16.930 GB/s native 1000000 462.88461ms @ 17.698 GB/s Zen 4 laptop (AMD Ryzen 7 PRO 7840U): march mem result x86-64 100 417.19762ms @ 19.636 GB/s x86-64-v2 100 130.67596ms @ 62.689 GB/s x86-64-v3 100 97.07758ms @ 84.386 GB/s x86-64-v4 100 95.67704ms @ 85.621 GB/s native 100 95.15734ms @ 86.089 GB/s x86-64 100000 431.38370ms @ 18.990 GB/s x86-64-v2 100000 215.74856ms @ 37.970 GB/s x86-64-v3 100000 199.74492ms @ 41.012 GB/s x86-64-v4 100000 186.98300ms @ 43.811 GB/s native 100000 187.68125ms @ 43.648 GB/s x86-64 1000000 433.87893ms @ 18.881 GB/s x86-64-v2 1000000 217.46561ms @ 37.670 GB/s x86-64-v3 1000000 200.40667ms @ 40.877 GB/s x86-64-v4 1000000 187.51978ms @ 43.686 GB/s native 1000000 190.29273ms @ 43.049 GB/s Workstation w/ 2x Xeon Gold 5215: march mem result x86-64 100 780.38881ms @ 10.497 GB/s x86-64-v2 100 389.62005ms @ 21.026 GB/s x86-64-v3 100 323.97294ms @ 25.286 GB/s x86-64-v4 100 274.19493ms @ 29.877 GB/s native 100 283.48674ms @ 28.897 GB/s x86-64 100000 1112.63898ms @ 7.363 GB/s x86-64-v2 100000 831.45641ms @ 9.853 GB/s x86-64-v3 100000 696.20789ms @ 11.767 GB/s x86-64-v4 100000 685.61636ms @ 11.948 GB/s native 100000 689.78023ms @ 11.876 GB/s x86-64 1000000 1128.65580ms @ 7.258 GB/s x86-64-v2 1000000 843.92594ms @ 
9.707 GB/s x86-64-v3 1000000 718.78848ms @ 11.397 GB/s x86-64-v4 1000000 687.68258ms @ 11.912 GB/s native 1000000 705.34731ms @ 11.614 GB/s That's quite the drastic difference between amd and intel. Of course it's also comparing a multi-core server uarch (lower per-core bandwidth, much higher aggregate bandwidth) with a client uarch. The difference between the baseline CPU target and a more modern profile is also rather impressive. Looks like some cpu-capability based dispatch would likely be worth it, even if it didn't matter in my case due to -march=native. I just realized that a) The meson build doesn't use the relevant flags for bufpage.c - it didn't matter in my numbers though because I was building with -O3 and march=native. This clearly ought to be fixed. b) Neither build uses the optimized flags for pg_checksum and pg_upgrade, both of which include checksum_imp.h directly. This probably should be fixed too - perhaps by building the relevant code once as part of fe_utils or such? It probably matters less than it used to - these days -O2 turns on -ftree-loop-vectorize -ftree-slp-vectorize. But loop unrolling isn't enabled. I do see a perf difference at -O2 between using/not using -funroll-loops. Interestingly not at -O3, despite -funroll-loops not actually being enabled by -O3. I think the relevant option that *is* turned on by O3 is -fpeel-loops. Here's a comparison of different flags run the 6442Y printf '%16s\t%32s\t%16s\t%s\n' march flags mem result; for mem in 100 100000; do for march in x86-64 x86-64-v2 x86-64-v3x86-64-v4 native; do for flags in "-O2" "-O2 -funroll-loops" "-O3" "-O3 -funroll-loops"; do printf "%16s\t%32s\t%16s\t""$march" "$flags" "$mem"; gcc $flags -march=$march -I ~/src/postgresql/src/include/ -I src/include/ /tmp/bench-checksums.c-o bench-checksums-native && numactl --physcpubind 3 --membind 0 ./bench-checksums-native 3000000 $mem;done;done;done march flags mem result x86-64 -O2 100 2280.86253ms @ 10.775 GB/s x86-64 -O2 -funroll-loops 100 2195.66942ms @ 11.193 GB/s x86-64 -O3 100 2422.57588ms @ 10.145 GB/s x86-64 -O3 -funroll-loops 100 2243.75826ms @ 10.953 GB/s x86-64-v2 -O2 100 1243.68063ms @ 19.761 GB/s x86-64-v2 -O2 -funroll-loops 100 979.67783ms @ 25.086 GB/s x86-64-v2 -O3 100 988.80296ms @ 24.854 GB/s x86-64-v2 -O3 -funroll-loops 100 991.31632ms @ 24.791 GB/s x86-64-v3 -O2 100 1146.90165ms @ 21.428 GB/s x86-64-v3 -O2 -funroll-loops 100 785.81395ms @ 31.275 GB/s x86-64-v3 -O3 100 800.53627ms @ 30.699 GB/s x86-64-v3 -O3 -funroll-loops 100 790.21230ms @ 31.101 GB/s x86-64-v4 -O2 100 883.82916ms @ 27.806 GB/s x86-64-v4 -O2 -funroll-loops 100 831.55372ms @ 29.554 GB/s x86-64-v4 -O3 100 843.23141ms @ 29.145 GB/s x86-64-v4 -O3 -funroll-loops 100 821.19969ms @ 29.927 GB/s native -O2 100 1197.41357ms @ 20.524 GB/s native -O2 -funroll-loops 100 718.05253ms @ 34.226 GB/s native -O3 100 747.94090ms @ 32.858 GB/s native -O3 -funroll-loops 100 751.52379ms @ 32.702 GB/s x86-64 -O2 100000 2911.47087ms @ 8.441 GB/s x86-64 -O2 -funroll-loops 100000 2525.45504ms @ 9.731 GB/s x86-64 -O3 100000 2497.42016ms @ 9.841 GB/s x86-64 -O3 -funroll-loops 100000 2346.33551ms @ 10.474 GB/s x86-64-v2 -O2 100000 2124.10102ms @ 11.570 GB/s x86-64-v2 -O2 -funroll-loops 100000 1819.09659ms @ 13.510 GB/s x86-64-v2 -O3 100000 1613.45823ms @ 15.232 GB/s x86-64-v2 -O3 -funroll-loops 100000 1607.09245ms @ 15.292 GB/s x86-64-v3 -O2 100000 1972.89390ms @ 12.457 GB/s x86-64-v3 -O2 -funroll-loops 100000 1432.58229ms @ 17.155 GB/s x86-64-v3 -O3 100000 1533.18003ms @ 16.029 GB/s x86-64-v3 -O3 
-funroll-loops 100000 1539.39779ms @ 15.965 GB/s x86-64-v4 -O2 100000 1591.96881ms @ 15.437 GB/s x86-64-v4 -O2 -funroll-loops 100000 1434.91828ms @ 17.127 GB/s x86-64-v4 -O3 100000 1454.30133ms @ 16.899 GB/s x86-64-v4 -O3 -funroll-loops 100000 1429.13733ms @ 17.196 GB/s native -O2 100000 1980.53734ms @ 12.409 GB/s native -O2 -funroll-loops 100000 1373.95337ms @ 17.887 GB/s native -O3 100000 1517.90164ms @ 16.191 GB/s native -O3 -funroll-loops 100000 1508.37021ms @ 16.293 GB/s > > > Is it just that the calculation is slow, or is it the fact that checksumming > > > needs to bring the page into the CPU cache. Did you notice any hints which > > > might be the case? > > > > I don't think the issue is that checksumming pulls the data into CPU caches > > > > 1) This is visible with SELECT that actually uses the data > > > > 2) I added prefetching to avoid any meaningful amount of cache misses and it > > doesn't change the overall timing much > > > > 3) It's visible with buffered IO, which has pulled the data into CPU caches > > already > > I didn't yet check the code, when doing aio completions checksumming > be running on the same core as is going to be using the page? With io_uring normally yes, the exception being that another backend that needs the same page could end up running the completion. With worker mode normally no. Greetings, Andres Freund
On Thu, 9 Jan 2025 at 22:53, Andres Freund <andres@anarazel.de> wrote:

<Edited to highlight interesting numbers>

> Workstation w/ 2x Xeon Gold 6442Y:
>
> march mem result
> native 100 246.13766ms @ 33.282 GB/s
> native 100000 456.08080ms @ 17.962 GB/s
>
> Zen 4 laptop (AMD Ryzen 7 PRO 7840U):
> march mem result
> native 100 95.15734ms @ 86.089 GB/s
> native 100000 187.68125ms @ 43.648 GB/s
>
> Workstation w/ 2x Xeon Gold 5215:
> march mem result
> native 100 283.48674ms @ 28.897 GB/s
> native 100000 689.78023ms @ 11.876 GB/s
>
> That's quite the drastic difference between amd and intel. Of course it's also
> comparing a multi-core server uarch (lower per-core bandwidth, much higher
> aggregate bandwidth) with a client uarch.

In hindsight, building the hash around the mulld primitive was a bad decision, because Intel for whatever reason decided to kill the performance of it:

vpmulld        latency   throughput (values/cycle)
Sandy Bridge   5         4
Alder Lake     10        8
Zen 4          3         16
Zen 5          3         32

Most top-performing hashes these days seem to be built around AES instructions.

But I was curious why there is such a difference in streaming results. Turns out that for whatever reason one core gets access to much less bandwidth on Intel than on AMD. [1]

This made me take another look at your previous prewarm numbers. It looks like performance is suspiciously proportional to the number of copies of data the CPU has to make:

config                        checksums   time in ms   number of copies
buffered io_engine=io_uring   0           3883.127     2
buffered io_engine=io_uring   1           5880.892     3
direct io_engine=io_uring     0           2067.142     1
direct io_engine=io_uring     1           3835.968     2

To me that feels like there is a bandwidth bottleneck in this workload, and checksumming the page when the contents are not looked at just adds to the consumed bandwidth, bringing down performance correspondingly.

This doesn't explain why you observed a slowdown in the select case, but I think that might be due to the per-core bandwidth limitation. Both cases might pull the same amount of data into the cache, but without checksums it is spread out over a longer time, allowing other work to happen concurrently.

[1] https://chipsandcheese.com/p/a-peek-at-sapphire-rapids#%C2%A7bandwidth

> The difference between the baseline CPU target and a more modern profile is
> also rather impressive. Looks like some cpu-capability based dispatch would
> likely be worth it, even if it didn't matter in my case due to -march=native.

Yes, along with using function attributes for the optimization flags to avoid the build system hacks.

--
Ants
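(For illustration, "function attributes" based dispatch could look something like this - a sketch only, with a simplified single-lane loop standing in for pg_checksum_block() and the clone target list picked as an example:)

    #include <stdint.h>
    #include <stddef.h>

    /*
     * GCC/Clang function multi-versioning: an ifunc resolver picks the best
     * clone at load time, avoiding per-file build-system flag juggling.
     */
    __attribute__((target_clones("default", "avx2", "avx512f")))
    static uint32_t
    checksum_block_demo(const uint32_t *data, size_t nwords)
    {
        uint32_t sum = 0;

        for (size_t i = 0; i < nwords; i++)
        {
            uint32_t tmp = sum ^ data[i];
            sum = tmp * 16777619u ^ (tmp >> 17);
        }
        return sum;
    }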