Re: Adding basic NUMA awareness
| From | Jakub Wartak |
|---|---|
| Subject | Re: Adding basic NUMA awareness |
| Date | |
| Msg-id | CAKZiRmzQ=jzWSz7NjHggyqpnMkZUaeO00t7rUovs+zvi_YY48w@mail.gmail.com |
| In reply to | Re: Adding basic NUMA awareness (Tomas Vondra <tomas@vondra.me>) |
| List | pgsql-hackers |
On Fri, Oct 31, 2025 at 12:57 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> Hi,
>
> here's a significantly reworked version of this patch series.
>
> I had a couple discussions about these patches at pgconf.eu last week,[..]
I've just had a quick look at this and, oh my, I've started getting
into this partitioned clocksweep, and that's ambitious! Yes, this
sequencing of patches makes it much more understandable. Anyway, I've
spotted some things, attempted to fix some, and have some basic
questions too (so small baby steps; all of this was on 4 sockets / 4
NUMA nodes with huge pages on). The 000X below refers to a
question/issue/bug in the specific patchset file:
0001: you mention 'debug_numa = buffers' in the commit message, but
there's nothing like that in this patch? It only comes with 0006.
0002: dunno, but wouldn't it make some educational/debugging sense to
add a debug function returning the clocksweep partition index
(calculate_partition_index) for a backend? (so that we know which
partition we are working on right now)
0003: those two elog(INFO, "rebalance skipped: ...") calls should be
at DEBUG2+ IMHO (they are way too verbose during runs)
0006a: Needs update - s/patches later in the patch series/patches
earlier in the patch series/
0006b: IMHO, longer term we should hide some of the complexity of
those calls behind src/port NUMA shims (pg_numa_sched_cpu()?)
0006c: after GUC commit fce7c73fba4e5, apply complains with:
error: patch failed: src/backend/utils/misc/guc_parameters.dat:906
error: src/backend/utils/misc/guc_parameters.dat: patch does not apply
0007a: pg_buffercache_pgproc returns pgproc_ptr and fastpath_ptr as
bigint and not hex? I wanted to adjust that to TEXTOID, but then
thought it would be simpler to use to_hex() -- see 0009 attached.
0007b: pg_buffercache_pgproc -- nitpick, but maybe it would be better
called pg_shm_pgproc?
0007c: with check_numa='buffers,procs' it throws 'mbind: Invalid
argument' during start:
2025-11-04 10:02:27.055 CET [58464] DEBUG: NUMA:
pgproc_init_partition procs 0x7f8d30400000 endptr 0x7f8d30800000
num_procs 2523 node 0
2025-11-04 10:02:27.057 CET [58464] DEBUG: NUMA:
pgproc_init_partition procs 0x7f8d30800000 endptr 0x7f8d30c00000
num_procs 2523 node 1
2025-11-04 10:02:27.059 CET [58464] DEBUG: NUMA:
pgproc_init_partition procs 0x7f8d30c00000 endptr 0x7f8d31000000
num_procs 2523 node 2
2025-11-04 10:02:27.061 CET [58464] DEBUG: NUMA:
pgproc_init_partition procs 0x7f8d31000000 endptr 0x7f8d31400000
num_procs 2523 node 3
2025-11-04 10:02:27.062 CET [58464] DEBUG: NUMA:
pgproc_init_partition procs 0x7f8d31400000 endptr 0x7f8d31407cb0
num_procs 38 node -1
mbind: Invalid argument
mbind: Invalid argument
mbind: Invalid argument
mbind: Invalid argument
0007d: so we probably need numa_warn()/numa_error() wrappers (these
were initially part of the NUMA observability patches but got removed
along the way); I'm attaching 0008. With that you'll get something a
little more up to our standards:
2025-11-04 10:27:07.140 CET [59696] DEBUG:
fastpath_parititon_init node = 3, ptr = 0x7f4f4d400000, endptr =
0x7f4f4d4b1660
2025-11-04 10:27:07.140 CET [59696] WARNING: libnuma: ERROR: mbind
0007e: the elog DEBUG says it's pg_proc_init_partition, but it's
actually pgproc_partition_init() ;)
0007f: the "mbind: Invalid argument" issue itself: with the below addition:
+elog(DEBUG1, "NUMA: fastpath_partition_init ptr %p endptr %p
num_procs %d node %d", ptr, endptr, num_procs, node);
showed this:
2025-11-04 11:30:51.089 CET [61841] DEBUG: NUMA:
fastpath_partition_init ptr 0x7f39eea00000 endptr 0x7f39eeab1660
num_procs 2523 node 0
2025-11-04 11:30:51.089 CET [61841] WARNING: libnuma: ERROR: mbind
2025-11-04 11:30:51.089 CET [61841] DEBUG: NUMA:
fastpath_partition_init ptr 0x7f39eec00000 endptr 0x7f39eecb1660
num_procs 2523 node 1
2025-11-04 11:30:51.089 CET [61841] WARNING: libnuma: ERROR: mbind
2025-11-04 11:30:51.089 CET [61841] DEBUG: NUMA:
fastpath_partition_init ptr 0x7f39eee00000 endptr 0x7f39eeeb1660
num_procs 2523 node 2
2025-11-04 11:30:51.089 CET [61841] WARNING: libnuma: ERROR: mbind
[..]
Meanwhile each mapping is a full hugepage in size (e.g. 0x7f39eec00000 − 0x7f39eea00000 = 2MB):
$ grep --color 7f39ee[ace] /proc/61841/smaps
7f39ee800000-7f39eea00000 rw-s 87de00000 00:11 122710
/anon_hugepage (deleted)
7f39eea00000-7f39eec00000 rw-s 87e000000 00:11 122710
/anon_hugepage (deleted)
7f39eec00000-7f39eee00000 rw-s 87e200000 00:11 122710
/anon_hugepage (deleted)
7f39eee00000-7f39ef000000 rw-s 87e400000 00:11 122710
/anon_hugepage (deleted)
but mbind() was called for just 0x7f39eeab1660 − 0x7f39eea00000 =
0xB1660 = 726624 bytes. If I blindly adjust endptr in
fastpath_partition_init() to "char *endptr = ptr + 2*1024*1024;"
(the huge page size), it doesn't complain anymore and I get success:
2025-11-04 12:08:30.147 CET [62352] DEBUG: NUMA:
fastpath_partition_init ptr 0x7f7bf7000000 endptr 0x7f7bf7200000
num_procs 2523 node 0
2025-11-04 12:08:30.147 CET [62352] DEBUG: NUMA:
fastpath_partition_init ptr 0x7f7bf7200000 endptr 0x7f7bf7400000
num_procs 2523 node 1
2025-11-04 12:08:30.147 CET [62352] DEBUG: NUMA:
fastpath_partition_init ptr 0x7f7bf7400000 endptr 0x7f7bf7600000
num_procs 2523 node 2
2025-11-04 12:08:30.147 CET [62352] DEBUG: NUMA:
fastpath_partition_init ptr 0x7f7bf7600000 endptr 0x7f7bf7800000
num_procs 2523 node 3
2025-11-04 12:08:30.147 CET [62352] DEBUG: NUMA:
fastpath_partition_init ptr 0x7f7bf7800000 endptr 0x7f7bf7a00000
num_procs 38 node -1
2025-11-04 12:08:30.239 CET [62352] LOG: starting PostgreSQL
19devel on x86_64-linux, compiled by gcc-12.2.0, 64-bit
0006d: I got one SIGBUS during a call to select
pg_buffercache_numa_pages(); and it looks like the memory accessed is
simply not mapped? (bug)
Program received signal SIGBUS, Bus error.
pg_buffercache_numa_pages (fcinfo=0x561a97e8e680) at
../contrib/pg_buffercache/pg_buffercache_pages.c:386
386 pg_numa_touch_mem_if_required(ptr);
(gdb) print ptr
$1 = 0x7f4ed0200000 <error: Cannot access memory at address 0x7f4ed0200000>
(gdb) where
#0 pg_buffercache_numa_pages (fcinfo=0x561a97e8e680) at
../contrib/pg_buffercache/pg_buffercache_pages.c:386
#1 0x0000561a672a0efe in ExecMakeFunctionResultSet
(fcache=0x561a97e8e5d0, econtext=econtext@entry=0x561a97e8dab8,
argContext=0x561a97ec62a0, isNull=0x561a97e8e578,
isDone=isDone@entry=0x561a97e8e5c0) at
../src/backend/executor/execSRF.c:624
[..]
The postmaster still had the shm attached (visible via smaps), and if
you compare 0x7f4ed0200000 closely against the sorted smaps:
7f4921400000-7f4b21400000 rw-s 252600000 00:11 151111
/anon_hugepage (deleted)
7f4b21400000-7f4d21400000 rw-s 452600000 00:11 151111
/anon_hugepage (deleted)
7f4d21400000-7f4f21400000 rw-s 652600000 00:11 151111
/anon_hugepage (deleted)
7f4f21400000-7f4f4bc00000 rw-s 852600000 00:11 151111
/anon_hugepage (deleted)
7f4f4bc00000-7f4f4c000000 rw-s 87ce00000 00:11 151111
/anon_hugepage (deleted)
it's NOT there at all (there's no mmap region starting with
0x7f4e). It looks like pg_buffercache_numa_pages() is not aware of
these new mmap()ed regions and instead does a simple loop over all
NBuffers with "for (char *ptr = startptr; ptr < endptr; ptr +=
os_page_size)"?
0006e:
    I'm seeking confirmation, but this is the issue we discussed at
PGConf.EU related to the lack of Mems_allowed detection, right? E.g.
$ numactl --membind="0,1" --cpunodebind="0,1"
/usr/pgsql19/bin/pg_ctl -D /path start
still shows 4 NUMA nodes used. The current patches use
numa_num_configured_nodes(), but its documentation says 'This count
includes any nodes that are currently DISABLED'. So I was wondering if
I could help by migrating towards numa_num_task_nodes() /
numa_get_mems_allowed()? Is it the same as what you wrote earlier to
Alexy?
> But that's not what you proposed here, clearly. You're saying we should
> find which NUMA nodes the process is allowed to run, and use those.
> Instead of just using all *configured* nodes. And I agree with that.
So, are you already on it?
> There are a couple unsolved issues, though. While running the tests, I
> ran into a bunch of weird issues. I saw two types of failures:
> 1) Bad address
> 2) Operation canceled
I did run (with io_uring) a short test (< 10 min with -c 128) and
didn't get those. Could you please share specific tips/workload for
reproducing this?
That's all for today, I hope it helps a little.
-J.
Attachments