Thread: Re: postmaster uses more CPU in 18 beta1 with io_method=io_uring

Re: postmaster uses more CPU in 18 beta1 with io_method=io_uring

From
Andres Freund
Date:
Hi,

On 2025-06-03 12:24:38 -0700, MARK CALLAGHAN wrote:
> When measuring the time to create a connection, it is ~2.3X longer with
> io_method=io_uring than with io_method=sync (6.9ms vs 3ms), and the
> postmaster process uses ~3.5X more CPU to create connections.

I can reproduce that - the reason for the slowdown is that we create one
io_uring instance for each potential process, and the way we create them
creates one mmap()ed region for each potential process.  That creates extra
overhead, particularly when child processes exit.
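
To see how many mappings a process carries, one can count the lines of its
/proc/<pid>/maps file on Linux, since each line there describes one mapping.
The following is a minimal sketch and not something taken from the thread; the
function name count_memory_mappings is made up for illustration.

```c
#include <stdio.h>

/*
 * Count the memory mappings listed in a /proc/<pid>/maps style file.
 * Each line of such a file describes one mapping of the process.
 * Returns -1 if the file cannot be opened.
 */
static long
count_memory_mappings(const char *maps_path)
{
    FILE *f = fopen(maps_path, "r");
    long  count = 0;
    int   c;
    int   last = '\n';

    if (f == NULL)
        return -1;

    while ((c = fgetc(f)) != EOF)
    {
        if (c == '\n')
            count++;
        last = c;
    }

    /* Count a final line that lacks a trailing newline. */
    if (last != '\n')
        count++;

    fclose(f);
    return count;
}
```

Comparing the result for the postmaster's maps file under io_method=io_uring
and io_method=sync would be one way to check the theory above; the thread
itself does not report such numbers.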


> The reproduction case so far is my usage of the Insert Benchmark on a large
> server with 48 cores. I need to fix the benchmark client -- today it
> creates ~1000 connections/s to run a monitoring query in between every 100
> queries and the extra latency from connection create makes results worse
> for one of the benchmark steps.

Heh, yea - 1000 connections/sec will influence performance regardless of this issue.


> While I can fix the benchmark client to avoid this, I am curious about the
> extra latency in connection create.
> 
> I used "perf record -e cycles -F 333 -g -p $pidof_postmaster -- sleep 30"
> but I have yet to find a big difference from the reports generated with
> that for io_method=io_uring vs =sync. It shows that much time is spent in
> the kernel dealing with the VM (page tables, etc).

I see a lot of additional time spent below
  do_group_exit->do_exit->...->unmap_vmas
which fits the theory that this is due to the number of memory mappings.

There has been a bunch of discussion around this on Mastodon, particularly
below [1], where Jens pointed out that we should use
https://man7.org/linux/man-pages/man3/io_uring_queue_init_mem.3.html to avoid
creating this many memory mappings. That ended in Jens prototyping the
approach [2].

There are a few complications around that though - only newer kernels (>=6.5)
support the caller providing the memory for the mapping and there isn't yet a
good way to figure out how much memory needs to be provided.
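
As a rough illustration of the kernel constraint, a version check could look
like the sketch below. This is an assumption-laden heuristic and not what the
patch does: judging by the messages quoted later in the thread, the patch
probes for io_uring_queue_init_mem() support at startup rather than comparing
version numbers, and distribution kernels may backport features. Both helper
names are hypothetical.

```c
#include <stdbool.h>
#include <stdio.h>
#include <sys/utsname.h>

/*
 * Parse the leading "major.minor" of a kernel release string such as
 * "6.1.55" or "5.14.0-362.el9.x86_64" and compare it with a minimum.
 * Returns false if the string cannot be parsed.
 */
static bool
kernel_release_at_least(const char *release, int want_major, int want_minor)
{
    int major = 0;
    int minor = 0;

    if (sscanf(release, "%d.%d", &major, &minor) != 2)
        return false;

    return major > want_major ||
           (major == want_major && minor >= want_minor);
}

/* Apply the same check to the running kernel, as reported by uname(2). */
static bool
running_kernel_at_least(int want_major, int want_minor)
{
    struct utsname u;

    if (uname(&u) != 0)
        return false;

    return kernel_release_at_least(u.release, want_major, want_minor);
}
```

A call such as running_kernel_at_least(6, 5) only gives a hint; actually
trying to create a ring with caller-provided memory is the more reliable test.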


I think this is a big enough pitfall that, assuming the patch has a sensible
complexity, it's worth fixing in 18. RMT, anyone, what do you
think?

Greetings,

Andres Freund

[1] https://fosstodon.org/@axboe/114630982449670090
[2] https://pastebin.com/7M3C8aFH



Re: postmaster uses more CPU in 18 beta1 with io_method=io_uring

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> I think this is a big enough pitfall that it's, obviously assuming the patch
> has a sensible complexity, worth fixing this in 18. RMT, anyone, what do you
> think?

Let's see the patch ... but yeah, I'd rather not ship 18 like this.

            regards, tom lane



Re: postmaster uses more CPU in 18 beta1 with io_method=io_uring

From
Andres Freund
Date:
Hi,

On 2025-06-05 12:47:52 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > I think this is a big enough pitfall that it's, obviously assuming the patch
> > has a sensible complexity, worth fixing this in 18. RMT, anyone, what do you
> > think?
> 
> Let's see the patch ... but yeah, I'd rather not ship 18 like this.

I've attached a first draft.

I can't make heads or tails of the ordering in configure.ac, so the function
test is probably in the wrong place.

Greetings,

Andres

Attachments

Re: postmaster uses more CPU in 18 beta1 with io_method=io_uring

From
Nathan Bossart
Date:
On Thu, Jun 05, 2025 at 12:47:52PM -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
>> I think this is a big enough pitfall that it's, obviously assuming the patch
>> has a sensible complexity, worth fixing this in 18. RMT, anyone, what do you
>> think?
> 
> Let's see the patch ... but yeah, I'd rather not ship 18 like this.

+1, I see no point in waiting for v19, especially since all of this stuff
is new in v18, anyway.

-- 
nathan



Re: postmaster uses more CPU in 18 beta1 with io_method=io_uring

From
Andres Freund
Date:
Hi,

On 2025-06-05 14:32:10 -0400, Andres Freund wrote:
> On 2025-06-05 12:47:52 -0400, Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > I think this is a big enough pitfall that it's, obviously assuming the patch
> > > has a sensible complexity, worth fixing this in 18. RMT, anyone, what do you
> > > think?
> > 
> > Let's see the patch ... but yeah, I'd rather not ship 18 like this.
> 
> I've attached a first draft.
> 
> I can't make heads or tails of the ordering in configure.ac, so the function
> test is probably in the wrong place.

Any comments on that patch?  I'd hoped for some review comments... Unless I
hear otherwise, I'll just do a bit more polish and push.

Greetings,

Andres



Re: postmaster uses more CPU in 18 beta1 with io_method=io_uring

From
Jim Nasby
Date:
+#if defined(HAVE_LIBURING_QUEUE_INIT_MEM) && defined(IORING_SETUP_NO_MMAP) && 1

Is that && 1 intentional?

Nit:
+ "mmap(%zu) to determine io_uring_queue_init_mem() support has failed: %m",
IMHO that would read better without "has".

+ /* FIXME: This should probably not stay at DEBUG1? */
+ elog(DEBUG1,
+ "can use combined memory mapping for io_uring, each ring needs %d bytes",
+ ret);
Assuming my read that this is only executed at postmaster start is correct, I agree that NOTICE would also be reasonable. Though I'm not sure what a user could actually do with the info...

+ elog(DEBUG1,
+ "can't use combined memory mapping for io_uring, kernel or liburing too old");
OTOH this message would definitely be of interest to users; I'd say it should at least be NOTICE, possibly even WARNING. It'd also be good to have a HINT either explaining the downside or pointing to the docs.

+ * Memory for rings needs to be allocated to the page boundary,
+ * reserve space. Luckily it does not need to be aligned to hugepage
+ * boundaries, even if huge pages are used.
Is "reserve space" left over from something else? AFAICT pgaio_uring_ring_shmem_size() isn't even reserving space...

On Mon, Jun 30, 2025 at 11:28 AM Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2025-06-05 14:32:10 -0400, Andres Freund wrote:
> > On 2025-06-05 12:47:52 -0400, Tom Lane wrote:
> > > Andres Freund <andres@anarazel.de> writes:
> > > > I think this is a big enough pitfall that it's, obviously assuming the patch
> > > > has a sensible complexity, worth fixing this in 18. RMT, anyone, what do you
> > > > think?
> > >
> > > Let's see the patch ... but yeah, I'd rather not ship 18 like this.
> >
> > I've attached a first draft.
> >
> > I can't make heads or tails of the ordering in configure.ac, so the function
> > test is probably in the wrong place.
>
> Any comments on that patch?  I'd hoped for some review comments... Unless I'll
> hear otherwise, I'll just do a bit more polish and push..
>
> Greetings,
>
> Andres


Re: postmaster uses more CPU in 18 beta1 with io_method=io_uring

From
"Burd, Greg"
Date:

> On Jun 30, 2025, at 12:27 PM, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2025-06-05 14:32:10 -0400, Andres Freund wrote:
>> On 2025-06-05 12:47:52 -0400, Tom Lane wrote:
>>> Andres Freund <andres@anarazel.de> writes:
>>>> I think this is a big enough pitfall that it's, obviously assuming the patch
>>>> has a sensible complexity, worth fixing this in 18. RMT, anyone, what do you
>>>> think?
>>>
>>> Let's see the patch ... but yeah, I'd rather not ship 18 like this.
>>
>> I've attached a first draft.
>>
>> I can't make heads or tails of the ordering in configure.ac, so the function
>> test is probably in the wrong place.
>
> Any comments on that patch?  I'd hoped for some review comments... Unless I'll
> hear otherwise, I'll just do a bit more polish and push..

Thanks for doing this work!

I just read through the v1 patch and it looks good.  I have just a few small nit-picky questions:

+ #if defined(HAVE_LIBURING_QUEUE_INIT_MEM) && defined(IORING_SETUP_NO_MMAP) && 1

The '1' looks like cruft, or am I missing something?

+ /* FIXME: This should probably not stay at DEBUG1? */

Worth fixing before pushing?

Also, this returns 'Size' but the function body uses 'size_t'; I assume that's intentional?

+ static Size
+ pgaio_uring_ring_shmem_size(void)

The next, similar, function below this one returns 'size_t'.

Finally, and this may be me missing something everyone else knows is convention.

+ * XXX: We allocate memory for all PgAioUringContext instances and, if

Is there any reason to keep the 'XXX'?  You ask yourself a question in that comment; do you know the answer, or was that
a request to reviewers for feedback? :)

I hope that is helpful.

-greg


>
> Greetings,
>
> Andres



Re: postmaster uses more CPU in 18 beta1 with io_method=io_uring

From
Andres Freund
Date:
Hi,

On 2025-06-30 15:31:14 -0400, Burd, Greg wrote:
> > On Jun 30, 2025, at 12:27 PM, Andres Freund <andres@anarazel.de> wrote:
> > On 2025-06-05 14:32:10 -0400, Andres Freund wrote:
> >> On 2025-06-05 12:47:52 -0400, Tom Lane wrote:
> >>> Andres Freund <andres@anarazel.de> writes:
> >>>> I think this is a big enough pitfall that it's, obviously assuming the patch
> >>>> has a sensible complexity, worth fixing this in 18. RMT, anyone, what do you
> >>>> think?
> >>> 
> >>> Let's see the patch ... but yeah, I'd rather not ship 18 like this.
> >> 
> >> I've attached a first draft.
> >> 
> >> I can't make heads or tails of the ordering in configure.ac, so the function
> >> test is probably in the wrong place.
> > 
> > Any comments on that patch?  I'd hoped for some review comments... Unless I'll
> > hear otherwise, I'll just do a bit more polish and push..
> 
> Thanks for doing this work!
> 
> I just read through the v1 patch and it looks good.  I have just a few small nit-picky questions:
> 
> + #if defined(HAVE_LIBURING_QUEUE_INIT_MEM) && defined(IORING_SETUP_NO_MMAP) && 1
> 
> The '1' looks like cruft, or am I missing something?

It's for making it easy to test both paths when running on a kernel/liburing
combo that's new enough to have support.
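
The trailing "&& 1" is a common preprocessor trick: flipping it to "&& 0"
forces the fallback branch even when the build environment supports the
feature. A minimal sketch with made-up macro names (HYPOTHETICAL_HAVE_FEATURE
and HYPOTHETICAL_KERNEL_FLAG stand in for the real configure and header
symbols, and are defined here only so the example is self-contained):

```c
#include <stdbool.h>

/* Stand-ins for symbols that configure and kernel headers would provide. */
#define HYPOTHETICAL_HAVE_FEATURE 1
#define HYPOTHETICAL_KERNEL_FLAG  1

/* Change the trailing "&& 1" to "&& 0" to exercise the fallback branch. */
#if defined(HYPOTHETICAL_HAVE_FEATURE) && defined(HYPOTHETICAL_KERNEL_FLAG) && 1
#define USE_COMBINED_PATH 1
#else
#define USE_COMBINED_PATH 0
#endif

/* Report which branch was compiled in. */
static bool
uses_combined_path(void)
{
    return USE_COMBINED_PATH != 0;
}
```

The toggle has no effect on behavior while it stays at 1, which is why it
reads as cruft in review once testing is done.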


> + /* FIXME: This should probably not stay at DEBUG1? */
> 
> Worth fixing before pushing?

Yes.  I was just not yet sure what it should be.  I ended up concluding that
it's probably fine to just keep it at DEBUG1...


> Also, this returns 'Size' but in the function uses 'size_t' I assume that's intentional?
> 
> + static Size
> + pgaio_uring_ring_shmem_size(void)
> 
> The next, similar, function below this one returns 'size_t'.

You're right - I wish we would just do a (slightly smarter) version of
s/Size/size_t/...


> Finally, and this may be me missing something everyone else knows is convention.
> 
> + * XXX: We allocate memory for all PgAioUringContext instances and, if
> 
> Is there any reason to keep the 'XXX'?  You ask yourself a question in that
> comment, do you know the answer or was that a request to reviewers for
> feedback? :)

A bit of both :).  I concluded that it's not worth having a separate segment,
there's not enough memory here to matter...


> I hope that is helpful.

Yep!

Greetings,

Andres Freund



Re: postmaster uses more CPU in 18 beta1 with io_method=io_uring

From
Andres Freund
Date:
Hi,

On 2025-06-30 13:57:28 -0500, Jim Nasby wrote:
> +#if defined(HAVE_LIBURING_QUEUE_INIT_MEM) && defined(IORING_SETUP_NO_MMAP) && 1
> 
> Is that && 1 intentional?

It was for testing both branches...


> Nit:
> + "mmap(%zu) to determine io_uring_queue_init_mem() support has failed: %m",
> IMHO that would read better without "has".

Agreed, fixed.


> + /* FIXME: This should probably not stay at DEBUG1? */
> + elog(DEBUG1,
> + "can use combined memory mapping for io_uring, each ring needs %d bytes",
> + ret);
> Assuming my read that this is only executed at postmaster start is correct,
> I agree that NOTICE would also be reasonable. Though I'm not sure what a
> user could actually do with the info...

I was thinking of *lowering* it, given that the user, as you point out, can't
do much with the information.


> + elog(DEBUG1,
> + "can't use combined memory mapping for io_uring, kernel or liburing too old");
> OTOH this message would definitely be of interest to users; I'd say it
> should at least be NOTICE, possibly even WARNING.

I don't think it's worth it - typically the user won't be able to do much,
given that just upgrading the kernel is rarely easily possible.


> It'd also be good to have a HINT either explaining the downside or pointing
> to the docs.

I don't know about that - outside of extreme cases the performance effects
really aren't that meaningful. E.g. compiling with openssl support also has
connection establishment performance overhead, yet we don't document that
anywhere either, even though it's present even with ssl=off.


> + * Memory for rings needs to be allocated to the page boundary,
> + * reserve space. Luckily it does not need to be aligned to hugepage
> + * boundaries, even if huge pages are used.
> Is "reserve space" left over from something else?

No, it's trying to say that this is reserving space for alignment.


> AFAICT pgaio_uring_ring_shmem_size() isn't even reserving space...

That's all it does? It's used for sizing the shared memory allocation...
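
The alignment reservation described above can be sketched as follows. This is
not the body of pgaio_uring_ring_shmem_size(); the helper names, the parameters
and the choice of exactly one extra page are assumptions made for illustration.

```c
#include <stddef.h>

/* Round sz up to the next multiple of align (align must be a power of two). */
static size_t
align_up(size_t sz, size_t align)
{
    return (sz + align - 1) & ~(align - 1);
}

/*
 * Hypothetical sizing helper: every ring gets a page-aligned slot, and one
 * extra page is reserved so that the start of the area can be moved up to
 * the next page boundary inside a larger shared memory allocation.
 */
static size_t
ring_area_size(size_t nrings, size_t bytes_per_ring, size_t page_size)
{
    return nrings * align_up(bytes_per_ring, page_size) + page_size;
}
```

With 4096-byte pages, three rings needing 5000 bytes each would come to
3 * 8192 + 4096 = 28672 bytes under these assumptions.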

Greetings,

Andres Freund



Re: postmaster uses more CPU in 18 beta1 with io_method=io_uring

From
Andres Freund
Date:
Hi,

On 2025-06-30 12:27:10 -0400, Andres Freund wrote:
> On 2025-06-05 14:32:10 -0400, Andres Freund wrote:
> > On 2025-06-05 12:47:52 -0400, Tom Lane wrote:
> > > Andres Freund <andres@anarazel.de> writes:
> > > > I think this is a big enough pitfall that it's, obviously assuming the patch
> > > > has a sensible complexity, worth fixing this in 18. RMT, anyone, what do you
> > > > think?
> > > 
> > > Let's see the patch ... but yeah, I'd rather not ship 18 like this.
> > 
> > I've attached a first draft.
> > 
> > I can't make heads or tails of the ordering in configure.ac, so the function
> > test is probably in the wrong place.
> 
> Any comments on that patch?  I'd hoped for some review comments... Unless I'll
> hear otherwise, I'll just do a bit more polish and push..

After addressing most of Greg's and Jim's feedback, I pushed this. I chose not
to increase the log level as Jim suggested, but if we end up deciding that
that's the way to go, we can easily change that...

Greetings,

Andres



Re: postmaster uses more CPU in 18 beta1 with io_method=io_uring

From
Jakub Wartak
Date:
On Tue, Jul 8, 2025 at 5:22 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2025-06-30 12:27:10 -0400, Andres Freund wrote:
> > On 2025-06-05 14:32:10 -0400, Andres Freund wrote:
> > > On 2025-06-05 12:47:52 -0400, Tom Lane wrote:
> > > > Andres Freund <andres@anarazel.de> writes:
> > > > > I think this is a big enough pitfall that it's, obviously assuming the patch
> > > > > has a sensible complexity, worth fixing this in 18. RMT, anyone, what do you
> > > > > think?
> > > >
> > > > Let's see the patch ... but yeah, I'd rather not ship 18 like this.
> > >
> > > I've attached a first draft.
> > >
> > > I can't make heads or tails of the ordering in configure.ac, so the function
> > > test is probably in the wrong place.
> >
> > Any comments on that patch?  I'd hoped for some review comments... Unless I'll
> > hear otherwise, I'll just do a bit more polish and push..
>
> After addressing most of Greg's and Jim's feedback, I pushed this. I chose not
> to increase the log level as Jim suggested, but if we end up deciding that
> that's the way to go, we can easily change that...
>

Hi Andres,

I'm with Jim, as I've just hit this, though on fork() rather than on exit(), so:

1. Could we s/DEBUG1/INFO/ that debug message level? (for those two:
"cannot use combined memory mapping for io_uring" , and maybe add
"potential slow new connections" there too along the way?)
2. Maybe we could add some wording to the docs about io_method that it
might cause such trouble ?

Just wasted an hour on wondering why $stuff is slow, given:
    max_connections = '20000' # yes, yay..
    io_method = 'io_uring'

I was seeing slow fork()/clone() performance when there were
lots of io_uring fds/instances in the main postmaster:
    $ /usr/pgsql19/bin/pgbench -f select1.sql -c 1000 -j 1 -t 1 -P 1
    [..]
    progress: 39.7 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
    progress: 40.6 s, 1039.9 tps, lat 407.696 ms stddev 291.856, 0 failed
    [..]
    initial connection time = 39632.164 ms
    tps = 1015.608893 (without initial connection time)

So yes, ~40s just to connect to the database. I was using an old
branch from before June (it did not have f54af9f2679d5987b46),
so more or less simulating <= 6.5, as you say. I was limited to 20-30
forks()/1sec according to bpftrace. It goes away with the default
io_method (~800 forks()/1sec). With max_connections = 2k, I got 5s
initial connection times. It looked like it was caused by io_uring, as with
io_uring fork() was slow somewhere in vma_interval_tree_insert_after
<- copy_process <- kernel_clone <- __do_sys_clone <- do_syscall_64
(?). I've tested it on 6.14.17 too, and also on LTS 6.1.x (the
difference being that it takes 65s instead of 40s...). Then I searched
and hit this thread, but 6.1 is the LTS kernel, so plenty of people
are going to hit those regressions with io_uring io_method, won't
they?
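
The pgbench figures above are consistent with the bpftrace observation: 1000
clients (-c 1000) and an initial connection time of 39632.164 ms work out to
roughly 25 connections per second. A small sketch of that arithmetic, with a
made-up helper name:

```c
/*
 * Average connection rate for a run that opened `connections` clients in
 * `elapsed_ms` milliseconds. Returns 0 for a non-positive duration.
 */
static double
connections_per_second(int connections, double elapsed_ms)
{
    if (elapsed_ms <= 0.0)
        return 0.0;

    return connections / (elapsed_ms / 1000.0);
}
```

connections_per_second(1000, 39632.164) gives a value a little above 25,
inside the reported 20-30 forks per second.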

I can try to prepare a patch, please just let me know.

-J.



Re: postmaster uses more CPU in 18 beta1 with io_method=io_uring

From
Robert Treat
Date:
On Tue, Aug 26, 2025 at 9:32 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
> On Tue, Jul 8, 2025 at 5:22 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > On 2025-06-30 12:27:10 -0400, Andres Freund wrote:
> > > On 2025-06-05 14:32:10 -0400, Andres Freund wrote:
> > > > On 2025-06-05 12:47:52 -0400, Tom Lane wrote:
> > > > > Andres Freund <andres@anarazel.de> writes:
> > > > > > I think this is a big enough pitfall that it's, obviously assuming the patch
> > > > > > has a sensible complexity, worth fixing this in 18. RMT, anyone, what do you
> > > > > > think?
> > > > >
> > > > > Let's see the patch ... but yeah, I'd rather not ship 18 like this.
> > > >
> > > > I've attached a first draft.
> > > >
> > > > I can't make heads or tails of the ordering in configure.ac, so the function
> > > > test is probably in the wrong place.
> > >
> > > Any comments on that patch?  I'd hoped for some review comments... Unless I'll
> > > hear otherwise, I'll just do a bit more polish and push..
> >
> > After addressing most of Greg's and Jim's feedback, I pushed this. I chose not
> > to increase the log level as Jim suggested, but if we end up deciding that
> > that's the way to go, we can easily change that...
> >
>
> Hi Andres,
>
> I'm with Jim as I've just hit it but not on exit() but for fork(), so:
>
> 1. Could we s/DEBUG1/INFO/ that debug message level? (for those two:
> "cannot use combined memory mapping for io_uring" , and maybe add
> "potential slow new connections" there too along the way?)
> 2. Maybe we could add some wording to the docs about io_method that it
> might cause such trouble ?
>
> Just wasted an hour on wondering why $stuff is slow, given:
>     max_connections = '20000' # yes, yay..
>     io_method = 'io_uring'
>
> I was getting like slow fork()/clone() performance when there's were
> lots of io_uring fds/instances in the main postmaster:
>     $ /usr/pgsql19/bin/pgbench -f select1.sql -c 1000 -j 1 -t 1 -P 1
>     [..]
>     progress: 39.7 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
>     progress: 40.6 s, 1039.9 tps, lat 407.696 ms stddev 291.856, 0 failed
>     [..]
>     initial connection time = 39632.164 ms
>     tps = 1015.608893 (without initial connection time)
>
> So yes, ~40s to just connect to the database and I was using some old
> branch from back before Jun (it was not having f54af9f2679d5987b46),
> so simulating <= 6.5 as You say more or less. I was limited to 20-30
> forks()/1sec according to bpftrace. It goes away with default
> io_method (~800 forks()/1sec). With max_connections = 2k, I got 5s
> initial connection times. It looked like caused by io_uring, as with
> io_uring fork() was slow somewhere in vma_interval_tree_insert_after
> <- copy_process <- kernel_clone <- __do_sys_clone <- do_syscall_64
> (?). I've tested it on 6.14.17 too, but also on LTS 6.1.x too (well
> the difference is that it takes 65s instead of 40s...). Then searched
> and hit this thread, but 6.1 is the LTS kernel, so plenty of people
> are going to hit those regressions with io_uring io_method, won't
> they?
>
> I can try to prepare a patch, please just let me know.
>

Did anything ever happen with this? I do think it would be helpful to
make some of these pot-holes more user visible / discoverable. I have
a suspicion that we're going to see people using pre-built packages
with io_uring support installed on to older kernels they are still
hanging on to because pg_upgrade was the easiest path, but that they
could either update the kernel or upgrade via logical replication to
get the new functionality if they knew about it.

Robert Treat
https://xzilla.net



Re: postmaster uses more CPU in 18 beta1 with io_method=io_uring

From
Andres Freund
Date:
Hi,

On 2025-09-06 09:12:19 -0400, Robert Treat wrote:
> On Tue, Aug 26, 2025 at 9:32 AM Jakub Wartak
> <jakub.wartak@enterprisedb.com> wrote:
> > On Tue, Jul 8, 2025 at 5:22 AM Andres Freund <andres@anarazel.de> wrote:
> > > On 2025-06-30 12:27:10 -0400, Andres Freund wrote:
> > > After addressing most of Greg's and Jim's feedback, I pushed this. I chose not
> > > to increase the log level as Jim suggested, but if we end up deciding that
> > > that's the way to go, we can easily change that...
> >
> > I'm with Jim as I've just hit it but not on exit() but for fork(), so:
> >
> > 1. Could we s/DEBUG1/INFO/ that debug message level? (for those two:
> > "cannot use combined memory mapping for io_uring" , and maybe add
> > "potential slow new connections" there too along the way?)
> > 2. Maybe we could add some wording to the docs about io_method that it
> > might cause such trouble ?
> >
> > Just wasted an hour on wondering why $stuff is slow, given:
> >     max_connections = '20000' # yes, yay..
> >     io_method = 'io_uring'
> >
> > I was getting like slow fork()/clone() performance when there's were
> > lots of io_uring fds/instances in the main postmaster:
> >     $ /usr/pgsql19/bin/pgbench -f select1.sql -c 1000 -j 1 -t 1 -P 1
> >     [..]
> >     progress: 39.7 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
> >     progress: 40.6 s, 1039.9 tps, lat 407.696 ms stddev 291.856, 0 failed
> >     [..]
> >     initial connection time = 39632.164 ms
> >     tps = 1015.608893 (without initial connection time)
> >
> > So yes, ~40s to just connect to the database and I was using some old
> > branch from back before Jun (it was not having f54af9f2679d5987b46),
> > so simulating <= 6.5 as You say more or less. I was limited to 20-30
> > forks()/1sec according to bpftrace. It goes away with default
> > io_method (~800 forks()/1sec). With max_connections = 2k, I got 5s
> > initial connection times. It looked like caused by io_uring, as with
> > io_uring fork() was slow somewhere in vma_interval_tree_insert_after
> > <- copy_process <- kernel_clone <- __do_sys_clone <- do_syscall_64
> > (?). I've tested it on 6.14.17 too, but also on LTS 6.1.x too (well
> > the difference is that it takes 65s instead of 40s...). Then searched
> > and hit this thread, but 6.1 is the LTS kernel, so plenty of people
> > are going to hit those regressions with io_uring io_method, won't
> > they?

I doubt it, but who knows.


> > I can try to prepare a patch, please just let me know.

Yes, please do.


> Did anything ever happen with this?

No.  I missed the email. So thanks for the reminder.


> I do think it would be helpful to make some of these pot-holes more user
> visible / discoverable.

> I have a suspicion that we're going to see people using pre-built packages
> with io_uring support installed on to older kernels they are still hanging
> on to because pg_upgrade was the easiest path, but that they could either
> update the kernel or upgrade via logical replication to get the new
> functionality if they knew about it.

If they just upgrade in-place, they won't use io_uring. And they won't simply
use io_uring with this large max_connections without also tuning the file
descriptor limits...

Greetings,

Andres Freund



Re: postmaster uses more CPU in 18 beta1 with io_method=io_uring

From
Jakub Wartak
Date:
Hi Andres / Robert,

On Mon, Sep 8, 2025 at 5:55 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2025-09-06 09:12:19 -0400, Robert Treat wrote:
[..]
> > > [..], but 6.1 is the LTS kernel, so plenty of people
> > > are going to hit those regressions with io_uring io_method, won't
> > > they?
>
> I doubt it, but who knows.

RHEL 8.x won't have it (RH KB [1] says "RHEL 8.x: The addition to
RHEL8 was being tracked in private Bug 1881561 - Add io_uring support.
Unfortunately, it has been decided that io_uring support will not be
enabled in RHEL8.").

RHEL 9.x seems to be all based on 5.14.x (so much below 6.5.x) and
states that uring is in Tech Preview there and is disabled, but it can
be enabled via sysctl. Hard to tell what they will backpatch into
5.14.x there. So if anywhere, I would speculate it would be RHEL9 (?),
therefore 5.14.x (+their custom back patches).

> > > I can try to prepare a patch, please just let me know.
>
> Yes, please do.

Attached.

> If they just upgrade in-place, they won't use io_uring. And they won't simply
> use io_uring with this large max_connections without also tuning the file
> descriptor limits...

Business as usual, just another obstacle...

-J.

Attachments

Re: postmaster uses more CPU in 18 beta1 with io_method=io_uring

From
Andres Freund
Date:
Hi,


> From ad7c856e964b614507a06342c2acbf10bfa4855c Mon Sep 17 00:00:00 2001
> From: Jakub Wartak <jakub.wartak@enterprisedb.com>
> Date: Tue, 9 Sep 2025 14:30:48 +0200
> Subject: [PATCH v1] aio: warn user if combined io_uring memory mappings are
>  unavailable
> 
> In f54af9f2 we have added solution to avoid connection and disconnection hit
> caused by io_uring managing large number of memory mappings. Unfortunately
> it is available only on more modern Linux kernels (6.5) therefore notify user
> in visible way if this optimization is not available.
> 
> Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
> Reviewed-by:
> Discussion: https://postgr.es/m/CAFbpF8OA44_UG+RYJcWH9WjF7E3GA6gka3gvH6nsrSnEe9H0NA@mail.gmail.com
> ---
>  doc/src/sgml/config.sgml                  |  6 ++++++
>  src/backend/storage/aio/method_io_uring.c | 14 ++++++++++----
>  2 files changed, 16 insertions(+), 4 deletions(-)
> 
> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
> index 2a3685f474a..9d541999dc1 100644
> --- a/doc/src/sgml/config.sgml
> +++ b/doc/src/sgml/config.sgml
> @@ -2784,6 +2784,12 @@ include_dir 'conf.d'
>          <para>
>           This parameter can only be set at server start.
>          </para>
> +        <para>
> +         Note that for optimum performance with <literal>io_uring</literal>
> +         Linux kernel version >= 6.5 is recommended, as it provides way to
> +         reduce the number of additional memory mappings which may negatively
> +         affect the efficiency of establishing and terminating connections.
> +        </para>
>         </listitem>
>        </varlistentry>

This seems too low-level for end user docs, while not explaining that the
impact is due to a high max_connections value, rather than a large number of
actually established connections. How about something like

    Note that for optimal performance with <literal>io_uring</literal> Linux
    kernel version >= 6.5 is recommended.  Older Linux versions, high values
    of <xref linkend="guc-max-connections"/> will slow down connection
    establishment and termination.
    

> diff --git a/src/backend/storage/aio/method_io_uring.c b/src/backend/storage/aio/method_io_uring.c
> index bb06da63a8e..5cd839df2f3 100644
> --- a/src/backend/storage/aio/method_io_uring.c
> +++ b/src/backend/storage/aio/method_io_uring.c
> @@ -207,8 +207,11 @@ pgaio_uring_check_capabilities(void)
>               * pgaio_uring_shmem_init().
>               */
>              errno = -ret;
> -            elog(DEBUG1,
> -                 "cannot use combined memory mapping for io_uring, ring creation failed: %m");
> +            ereport(WARNING,
> +                    (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +                     errmsg("cannot use combined memory mapping for io_uring, ring creation failed: %m"),
> +                     errdetail("Connection and disconnection rates and efficiency may be degraded."),
> +                     errhint("Ensure that you are running kernel >= 6.5")));

To me this seems too verbose, particularly because the majority of users
encountering it have zero chance to address the issue. And it's not like most
real world workloads are particularly affected: if you run with
max_connections=20k and have 100 connections/second, you'll have a *lot* of
other problems.

Here's the full log of a start with the fallback branch forced:

2025-09-21 12:20:49.666 EDT [4090828][postmaster][:0][] WARNING:  cannot use combined memory mapping for io_uring, ring creation failed: Unknown error -8192
2025-09-21 12:20:49.666 EDT [4090828][postmaster][:0][] DETAIL:  Connection and disconnection rates and efficiency may be degraded.
2025-09-21 12:20:49.666 EDT [4090828][postmaster][:0][] HINT:  Ensure that you are running kernel >= 6.5
2025-09-21 12:20:49.708 EDT [4090828][postmaster][:0][] LOG:  starting PostgreSQL 19devel on x86_64-linux, compiled by gcc-15.2.0, 64-bit
2025-09-21 12:20:49.708 EDT [4090828][postmaster][:0][] LOG:  listening on IPv6 address "::1", port 5440
2025-09-21 12:20:49.708 EDT [4090828][postmaster][:0][] LOG:  listening on IPv4 address "127.0.0.1", port 5440
2025-09-21 12:20:49.708 EDT [4090828][postmaster][:0][] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5440"
2025-09-21 12:20:49.712 EDT [4090831][startup][:0][] LOG:  database system was shut down at 2025-09-21 12:20:42 EDT
2025-09-21 12:20:49.717 EDT [4090828][postmaster][:0][] LOG:  database system is ready to accept connections

Close to half the lines are the new warning.

Greetings,

Andres Freund



Re: postmaster uses more CPU in 18 beta1 with io_method=io_uring

From
Jakub Wartak
Date:
Hi Andres,

On Sun, Sep 21, 2025 at 6:29 PM Andres Freund <andres@anarazel.de> wrote:
[..]
> This seems too low-level for end user docs, while not explaining that the
> impact is due to a high max_connections value, rather than a large number of
> actually established connections. How about something like
>
>     Note that for optimal performance with <literal>io_uring</literal> Linux
>     kernel version >= 6.5 is recommended.  Older Linux versions, high values
>     of <xref linkend="guc-max-connections"/> will slow down connection
>     establishment and termination.

Agreed, attached v2. Just one nitpick -- wouldn't '>> On << older
Linux versions ' sound better there?

[..v1 patch]
>
> To me this seems too verbose, particularly because the majority of users
> encountering it have zero chance to address the issue. And it's not like most
> real world workloads are particularly affected, if you run with
> max_connections=20k and have 100/connections second, you'll have a *lot* of
> other problems.

> Here's the full log of a start with the fallback branch forced:
[..]
> Close to half the lines are the new warning.

I see two paths forward:

1. Either we make it shorter, though I do not know whether a multi-sentence
error message is against project policy. Feel free to
adjust as necessary; I'm not strongly attached to the exact wording,
just to hinting people.
2. Or we could emit the warning only under certain criteria, like
if(max_connections>1000) for example. However, Mark (OP) reported it
even for the value of 100, so it seems we should always warn about it
(and it deteriorated 3x for him @ 1000 max_connections), so
it's like opening a new can of worms (to establish a proper
threshold).
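
Option 2 could be sketched as below. The function name, the threshold
parameter and the idea of passing the probe result in as a flag are all
assumptions for illustration; the thread does not settle on a threshold, and
this is not code from the attached patch.

```c
#include <stdbool.h>

/*
 * Hypothetical policy check: warn only when the combined memory mapping is
 * unavailable and max_connections exceeds the chosen threshold.
 */
static bool
should_warn_about_ring_mappings(bool combined_mapping_available,
                                int max_connections,
                                int threshold)
{
    if (combined_mapping_available)
        return false;

    return max_connections > threshold;
}
```

With a threshold of 1000 this would stay silent for max_connections = 100,
which is exactly the case that makes picking a threshold contentious.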

Anyway attached v2 generates:

2025-09-22 09:56:21.123 CEST [12144] WARNING:  io_uring combined memory mapping creation failed: Unknown error -8192. Upgrade kernel to 6.5+ for improved performance
2025-09-22 09:56:21.179 CEST [12144] LOG:  starting PostgreSQL 19devel on x86_64-linux, compiled by clang-16.0.6, 64-bit
2025-09-22 09:56:21.180 CEST [12144] LOG:  listening on IPv6 address "::1", port 1236
2025-09-22 09:56:21.180 CEST [12144] LOG:  listening on IPv4 address "127.0.0.1", port 1236
2025-09-22 09:56:21.185 CEST [12144] LOG:  listening on Unix socket "/tmp/.s.PGSQL.1236"
2025-09-22 09:56:21.197 CEST [12147] LOG:  database system was shut down at 2025-09-22 09:55:44 CEST
2025-09-22 09:56:21.207 CEST [12144] LOG:  database system is ready to accept connections

BTW: on RHEL/derivatives it was possible to push people in certain
critical conditions into using kernel-lt/kernel-ml (though those come from
the ELRepo repository), so it's not that they have no room for maneuver.

-J.

Attachments