Discussion: Adding basic NUMA awareness


Adding basic NUMA awareness

From:
Tomas Vondra
Date:
Hi,

This is a WIP version of a patch series I'm working on, adding some
basic NUMA awareness for a couple of parts of our shared memory (shared
buffers, etc.). It's based on the experimental patches Andres spoke
about at pgconf.eu 2024 [1], and while it's improved and polished in
various ways, it's still experimental.

But there's a recent thread aiming to do something similar [2], so it
seems better to share this now so that we can discuss both approaches.
This patch set is a bit more ambitious, handling NUMA in a way that
allows smarter optimizations later, so I'm posting it in a separate
thread.

The series is split into patches addressing different parts of the
shared memory, starting (unsurprisingly) from shared buffers, then
buffer freelists and ProcArray. There are a couple of additional parts,
but those are smaller and address miscellaneous stuff.

Each patch has a numa_* GUC, intended to enable/disable that part. This
is meant to make development easier, not as a final interface - I'm not
sure how exactly that should look. It's possible some combinations of
GUCs won't work, etc.

Each patch should have a commit message explaining the intent and
implementation, and then also detailed comments explaining various
challenges and open questions.

But let me go over the basics, and discuss some of the design choices
and open questions that need solving.


1) v1-0001-NUMA-interleaving-buffers.patch

This is the main thing when people think about NUMA - making sure the
shared buffers are allocated evenly on all the nodes, not just on a
single node (which can happen easily with warmup). The regular memory
interleaving would address this, but it also has some disadvantages.

Firstly, it's oblivious to the contents of the shared memory segment,
and we may not want to interleave everything. It's also oblivious to
the alignment of the items (a buffer can easily end up "split" across
multiple NUMA nodes), and to the relationship between different parts
(e.g. there's a BufferBlock and a related BufferDescriptor, and those
might again end up on different nodes).

So the patch handles this by explicitly mapping chunks of shared buffers
to different nodes - a bit like interleaving, but in larger chunks.
Ideally each node gets (1/N) of shared buffers, as a contiguous chunk.

It's a bit more complicated, because the patch distributes both the
blocks and the descriptors, in the same way. So a buffer and its
descriptor always end up on the same NUMA node. This is one of the
reasons why we need to map larger chunks, because NUMA works on page
granularity, and the descriptors are tiny - many fit on a memory page.

There's a secondary benefit of explicitly assigning buffers to nodes
using this simple scheme - it allows quickly determining the node ID
given a buffer ID. This is helpful later, when building the freelists.

The patch is fairly simple. Most of the complexity is about picking the
chunk size, and aligning the arrays (so that it nicely aligns with
memory pages).
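
To make this concrete, here's a tiny standalone sketch of how such a
chunked mapping could work (the function names, the round-robin scheme
and the 64B descriptor size are illustrative assumptions, not the
patch's actual code): a chunk has to cover a whole number of memory
pages in both the block array and the descriptor array, and the node of
a buffer then follows directly from its ID.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch of the chunked buffer-to-node mapping described
 * above; names and the exact scheme are assumptions, not patch code. */

static size_t
gcd(size_t a, size_t b)
{
    while (b != 0)
    {
        size_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Smallest number of buffers whose 8kB blocks AND (say) 64B descriptors
 * both fill a whole number of memory pages, so that neither a block nor
 * a descriptor ever straddles a NUMA node boundary. */
static size_t
buffers_per_chunk(size_t page_size, size_t blk_size, size_t desc_size)
{
    size_t nblk = page_size / gcd(page_size, blk_size);
    size_t ndesc = page_size / gcd(page_size, desc_size);

    return nblk / gcd(nblk, ndesc) * ndesc;     /* lcm(nblk, ndesc) */
}

/* With chunks assigned round-robin, the node follows from the ID. */
static int
buffer_get_node(size_t buffer_id, size_t chunk_buffers, int num_nodes)
{
    return (int) ((buffer_id / chunk_buffers) % num_nodes);
}
```

For example, with 2MB huge pages, 8kB blocks and 64B descriptors this
yields chunks of 32768 buffers (256MB of blocks per chunk).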

The patch has a GUC "numa_buffers_interleave", with "off" by default.


2) v1-0002-NUMA-localalloc.patch

This simply sets "localalloc" when initializing a backend, so that all
memory allocated later is local, not interleaved. Initially this was
necessary because the patch set the allocation policy to interleaving
before initializing shared memory, and we didn't want to interleave the
private memory. But that's no longer the case - the explicit mapping to
nodes does not have this issue. I'm keeping the patch for convenience,
it allows experimenting with numactl etc.

The patch has a GUC "numa_localalloc", with "off" by default.


3) v1-0003-freelist-Don-t-track-tail-of-a-freelist.patch

Minor optimization. Andres noticed we're tracking the tail of buffer
freelist, without using it. So the patch removes that.


4) v1-0004-NUMA-partition-buffer-freelist.patch

Right now we have a single freelist, and in busy instances that can be
quite contended. What's worse, the freelist may thrash between different
CPUs, NUMA nodes, etc. So the idea is to have multiple freelists on
subsets of buffers. The patch implements multiple strategies for how the
list can be split (configured using the "numa_partition_freelist" GUC),
for experimenting:

* node - One list per NUMA node. This is the most natural option,
because we now know which buffer is on which node, so we can ensure a
list for a node only has buffers from that node.

* cpu - One list per CPU. Pretty simple, each CPU gets its own list.

* pid - Similar to "cpu", but the processes are mapped to lists based on
PID, not CPU ID.

* none - no partitioning, a single freelist
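
The strategies boil down to a simple mapping from a backend to a
freelist index. A hypothetical sketch (the enum and the function are
made up for illustration, not the patch's interface):

```c
#include <assert.h>

/* Hypothetical encoding of the four partitioning strategies above. */
typedef enum
{
    FREELIST_NONE,      /* single shared freelist */
    FREELIST_NODE,      /* one list per NUMA node */
    FREELIST_CPU,       /* one list per CPU */
    FREELIST_PID        /* like "cpu", but keyed by PID */
} FreelistStrategy;

/* Which freelist should this backend draw buffers from? */
static int
freelist_index(FreelistStrategy strategy,
               int my_node, int my_cpu, int my_pid, int num_cpus)
{
    switch (strategy)
    {
        case FREELIST_NODE:
            return my_node;
        case FREELIST_CPU:
            return my_cpu;
        case FREELIST_PID:
            return my_pid % num_cpus;
        case FREELIST_NONE:
        default:
            return 0;
    }
}
```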

Ultimately, I think we'll want to go with "node", simply because it
aligns with the buffer interleaving. But there are improvements needed.

The main challenge is that with multiple smaller lists, a process can't
really use the whole shared buffers. A single backend will only use
part of the memory. The more lists there are, the worse this effect is.
This is also why I think we won't use the other partitioning options,
because there are going to be more CPUs than NUMA nodes.

Obviously, this needs solving even with NUMA nodes - we need to allow a
single backend to utilize the whole shared buffers if needed. There
should be a way to "steal" buffers from other freelists (if the
"regular" freelist is empty), but the patch does not implement this.
Shouldn't be hard, I think.
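
For illustration, such stealing might look roughly like this
hypothetical sketch (treating each freelist as a simple stack; these
are not the patch's data structures, which would also need locking):

```c
#include <assert.h>

#define NO_BUFFER (-1)

/* Toy freelist: a stack of buffer IDs. Hypothetical, for illustration. */
typedef struct
{
    int *buffers;
    int nbuffers;
} Freelist;

static int
freelist_pop(Freelist *list)
{
    if (list->nbuffers == 0)
        return NO_BUFFER;
    return list->buffers[--list->nbuffers];
}

/* Try the local partition first; if empty, steal from the others. */
static int
get_free_buffer(Freelist *lists, int num_lists, int my_list)
{
    int buf = freelist_pop(&lists[my_list]);

    for (int i = 0; buf == NO_BUFFER && i < num_lists; i++)
        if (i != my_list)
            buf = freelist_pop(&lists[i]);

    return buf;
}
```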

The other missing part is the clocksweep - there's still just a single
instance of the clocksweep, feeding buffers to all the freelists. That's
clearly a problem, because the clocksweep returns buffers from all NUMA
nodes. The clocksweep really needs to be partitioned the same way as the
freelists, with each partition operating on a subset of buffers (from
the right NUMA node).

I do have a separate experimental patch doing something like that, I
need to make it part of this branch.


5) v1-0005-NUMA-interleave-PGPROC-entries.patch

Another area that seems like it might benefit from NUMA is PGPROC, so I
gave it a try. It turned out somewhat challenging. Similarly to buffers,
we have two pieces that need to be located in a coordinated way - PGPROC
entries and fast-path arrays. But we can't use the same approach as for
buffers/descriptors, because:

(a) Neither of those pieces aligns with memory page size (PGPROC is
~900B, fast-path arrays are variable length).

(b) We could pad PGPROC entries e.g. to 1KB, but that'd still require a
rather high max_connections before we'd use multiple huge pages.

The fast-path arrays are less of a problem, because those tend to be
larger, and are accessed through pointers, so we can just adjust that.

So what I did instead is split the whole PGPROC array into one array
per NUMA node, plus one array for auxiliary processes and 2PC xacts. So
with 4 NUMA nodes there are 5 separate arrays, for example. Each array
is a multiple of memory pages, so we may waste some of the memory. But
that's simply how NUMA works - page granularity.

This however makes one particular thing harder - in a couple of places
we accessed PGPROC entries through PROC_HDR->allProcs, which was pretty
much just one large array. And GetNumberFromPGProc() relied on array
arithmetic to determine the procnumber. With the array partitioned, this
can't work the same way.

But there's a simple solution - if we turn allProcs into an array of
*pointers* to PGPROC arrays, there's no issue. All the places need a
pointer anyway. And then we need an explicit procnumber field in PGPROC,
instead of calculating it.
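
A minimal sketch of that layout change (struct and helper names are
simplified for illustration; the real PGPROC has many more fields):

```c
#include <assert.h>

/* Simplified stand-in for PGPROC: the procnumber becomes an explicit
 * field, because pointer arithmetic on a single array no longer works
 * once the array is partitioned per NUMA node. */
typedef struct PGPROC
{
    int procnumber;
} PGPROC;

/* Build allProcs as an array of *pointers* into the per-node arrays,
 * assigning procnumbers sequentially across all partitions. */
static void
init_all_procs(PGPROC **allProcs, PGPROC **node_arrays,
               const int *node_counts, int num_arrays)
{
    int procno = 0;

    for (int i = 0; i < num_arrays; i++)
        for (int j = 0; j < node_counts[i]; j++)
        {
            node_arrays[i][j].procnumber = procno;
            allProcs[procno++] = &node_arrays[i][j];
        }
}

/* GetNumberFromPGProc() equivalent: just read the field. */
static int
get_number_from_pgproc(const PGPROC *proc)
{
    return proc->procnumber;
}
```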

There's a chance this has a negative impact on code that accesses PGPROC
very often, but so far I haven't seen such cases. If you can come up
with such examples, I'd like to see them.

There's another detail - when obtaining a PGPROC entry in InitProcess(),
we try to get an entry from the same NUMA node. And only if that doesn't
work, we grab the first one from the list (there's still just one PGPROC
freelist, I haven't split that - maybe we should?).

This has a GUC "numa_procs_interleave", again "off" by default. It's not
quite correct, though, because the partitioning always happens - the GUC
only affects the PGPROC lookup. (In a way, this may be a bit broken.)


6) v1-0006-NUMA-pin-backends-to-NUMA-nodes.patch

This is an experimental patch that simply pins each new process to the
NUMA node obtained from the freelist.

Driven by GUC "numa_procs_pin" (default: off).


Summary
-------

So this is what I have at the moment. I've tried to organize the patches
in the order of importance, but that's just my guess. It's entirely
possible there's something I missed, some other order might make more
sense, etc.

There's also the question of how this relates to other patches affecting
shared memory - I think the most relevant one is the "shared buffers
online resize" by Ashutosh, simply because it touches the shared memory.

I don't think the splitting would actually make things simpler, but it
might make them more flexible - in particular, it'd allow us to enable
huge pages only for some regions (like shared buffers), and keep the
small pages e.g. for PGPROC. So that'd be good.

But there'd also need to be some logic to "rework" how shared buffers
get mapped to NUMA nodes after resizing. It'd be silly to start with
memory on 4 nodes (25% each), resize shared buffers to 50% and end up
with memory only on 2 of the nodes (because the other 2 nodes were
originally assigned the upper half of shared buffers).

I don't have a clear idea how this would be done, but I guess it'd
require a bit of code invoked sometime after the resize. It'd already
need to rebuild the freelists in some way, I guess.

The other thing I haven't thought about very much is determining on
which CPUs/nodes the instance is allowed to run. I assume we'd start by
simply inheriting/determining that at startup through libnuma, not
through some custom PG configuration (which the patch [2] proposed to
do).


regards


[1] https://www.youtube.com/watch?v=V75KpACdl6E

[2]
https://www.postgresql.org/message-id/CAKZiRmw6i1W1AwXxa-Asrn8wrVcVH3TO715g_MCoowTS9rkGyw%40mail.gmail.com

-- 
Tomas Vondra

Attachments

Re: Adding basic NUMA awareness

From:
Ashutosh Bapat
Date:
On Wed, Jul 2, 2025 at 12:37 AM Tomas Vondra <tomas@vondra.me> wrote:
>
>
> 3) v1-0003-freelist-Don-t-track-tail-of-a-freelist.patch
>
> Minor optimization. Andres noticed we're tracking the tail of buffer
> freelist, without using it. So the patch removes that.
>

The patches for resizing buffers use the lastFreeBuffer to add new
buffers to the end of the free list when expanding it. But we could just
as well add them at the beginning of the free list.

This patch seems almost independent of the rest of the patches. Do you
need it in the rest of the patches? I understand that those patches
don't need to worry about maintaining lastFreeBuffer after this patch.
Is there any other effect?

If we are going to do this, let's do it earlier so that buffer
resizing patches can be adjusted.

>
> There's also the question how this is related to other patches affecting
> shared memory - I think the most relevant one is the "shared buffers
> online resize" by Ashutosh, simply because it touches the shared memory.

I have added Dmitry to this thread since he has written most of the
shared memory handling code.

>
> I don't think the splitting would actually make things simpler, but it
> might make them more flexible - in particular, it'd allow us to enable
> huge pages only for some regions (like shared buffers), and keep the
> small pages e.g. for PGPROC. So that'd be good.

The resizing patches split the shared buffer related structures into
separate memory segments. I think that itself will help enabling huge
pages for some regions. Would that help in your case?

>
> But there'd also need to be some logic to "rework" how shared buffers
> get mapped to NUMA nodes after resizing. It'd be silly to start with
> memory on 4 nodes (25% each), resize shared buffers to 50% and end up
> with memory only on 2 of the nodes (because the other 2 nodes were
> originally assigned the upper half of shared buffers).
>
> I don't have a clear idea how this would be done, but I guess it'd
> require a bit of code invoked sometime after the resize. It'd already
> need to rebuild the freelists in some way, I guess.

Yes, there's code to build the free list. I think we will need code to
remap the buffers and buffer descriptor.

--
Best Wishes,
Ashutosh Bapat



Re: Adding basic NUMA awareness

From:
Tomas Vondra
Date:

On 7/2/25 13:37, Ashutosh Bapat wrote:
> On Wed, Jul 2, 2025 at 12:37 AM Tomas Vondra <tomas@vondra.me> wrote:
>>
>>
>> 3) v1-0003-freelist-Don-t-track-tail-of-a-freelist.patch
>>
>> Minor optimization. Andres noticed we're tracking the tail of buffer
>> freelist, without using it. So the patch removes that.
>>
> 
> The patches for resizing buffers use the lastFreeBuffer to add new
> buffers to the end of free list when expanding it. But we could as
> well add it at the beginning of the free list.
> 
> This patch seems almost independent of the rest of the patches. Do you
> need it in the rest of the patches? I understand that those patches
> don't need to worry about maintaining lastFreeBuffer after this patch.
> Is there any other effect?
> 
> If we are going to do this, let's do it earlier so that buffer
> resizing patches can be adjusted.
> 

My patches don't particularly rely on this bit, they would work even
with lastFreeBuffer. I believe Andres simply noticed the current code
does not use lastFreeBuffer, it just maintains it, so he removed that as
an optimization. I don't know how significant the improvement is, but if
it's measurable we could just do that independently of our patches.

>>
>> There's also the question how this is related to other patches affecting
>> shared memory - I think the most relevant one is the "shared buffers
>> online resize" by Ashutosh, simply because it touches the shared memory.
> 
> I have added Dmitry to this thread since he has written most of the
> shared memory handling code.
> 

Thanks.

>>
>> I don't think the splitting would actually make things simpler, but
>> it might make them more flexible - in particular, it'd allow us to
>> enable huge pages only for some regions (like shared buffers), and
>> keep the small pages e.g. for PGPROC. So that'd be good.
> 
> The resizing patches split the shared buffer related structures into
> separate memory segments. I think that itself will help enabling huge
> pages for some regions. Would that help in your case?
> 

Indirectly. My patch can work just fine with a single segment, but being
able to enable huge pages only for some of the segments seems better.

>>
>> But there'd also need to be some logic to "rework" how shared buffers
>> get mapped to NUMA nodes after resizing. It'd be silly to start with
>> memory on 4 nodes (25% each), resize shared buffers to 50% and end up
>> with memory only on 2 of the nodes (because the other 2 nodes were
>> originally assigned the upper half of shared buffers).
>>
>> I don't have a clear idea how this would be done, but I guess it'd
>> require a bit of code invoked sometime after the resize. It'd already
>> need to rebuild the freelists in some way, I guess.
> 
> Yes, there's code to build the free list. I think we will need code to
> remap the buffers and buffer descriptor.
> 

Right. The good thing is that's just "advisory" information, it doesn't
break anything if it's temporarily out of sync. We don't need to "stop"
everything to remap the buffers to other nodes, or anything like that.
Or at least I think so.

It's one thing to "flip" the target mapping (determining which node a
buffer should be on), and another to actually migrate the buffers. The
first part can be done instantaneously, the second part can happen in
the background over a longer time period.

I'm not sure how you're rebuilding the freelist. Presumably it can
contain buffers that are no longer valid (after shrinking). How is that
handled to not break anything? I think the NUMA variant would do exactly
the same thing, except that there's multiple lists.

regards

-- 
Tomas Vondra




Re: Adding basic NUMA awareness

From:
Ashutosh Bapat
Date:
On Wed, Jul 2, 2025 at 6:06 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> I'm not sure how you're rebuilding the freelist. Presumably it can
> contain buffers that are no longer valid (after shrinking). How is that
> handled to not break anything? I think the NUMA variant would do exactly
> the same thing, except that there's multiple lists.

Before shrinking the buffers, we walk the free list, removing any
buffers that are going to be removed. When expanding, we link the new
buffers in order and then add them to the already existing free list.
The 0005 patch in [1] has the code for this.

[1] https://www.postgresql.org/message-id/my4hukmejato53ef465ev7lk3sqiqvneh7436rz64wmtc7rbfj%40hmuxsf2ngov2
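
For illustration only, trimming a linked freelist before a shrink might
look like this sketch (buffer IDs linked through a next[] array, loosely
mimicking freeNext; not the actual 0005 patch code):

```c
#include <assert.h>

#define FREENEXT_END (-1)

/* Hypothetical sketch: unlink every buffer at or above the new
 * NBuffers from a singly linked freelist, returning the new head. */
static int
trim_freelist(int *next, int head, int new_nbuffers)
{
    /* skip removed buffers at the head of the list */
    while (head != FREENEXT_END && head >= new_nbuffers)
        head = next[head];

    /* unlink removed buffers from the rest of the list */
    for (int cur = head; cur != FREENEXT_END; cur = next[cur])
        while (next[cur] != FREENEXT_END && next[cur] >= new_nbuffers)
            next[cur] = next[next[cur]];

    return head;
}
```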

--
Best Wishes,
Ashutosh Bapat



Re: Adding basic NUMA awareness

From:
Dmitry Dolgov
Date:
> On Wed, Jul 02, 2025 at 05:07:28PM +0530, Ashutosh Bapat wrote:
> > There's also the question how this is related to other patches affecting
> > shared memory - I think the most relevant one is the "shared buffers
> > online resize" by Ashutosh, simply because it touches the shared memory.
>
> I have added Dmitry to this thread since he has written most of the
> shared memory handling code.

Thanks! I like the idea behind this patch series. I haven't read it in
detail yet, but I can imagine both patches (interleaving and online
resizing) could benefit from each other. In online resizing we've
introduced the possibility of using multiple shared mappings for
different types of data; maybe it would be convenient to use the same
interface to create separate mappings for different NUMA nodes as well.
Using a separate shared mapping per NUMA node would also make resizing
easier, since it would be more straightforward to fit an increased
segment into NUMA boundaries.

> > I don't think the splitting would actually make things simpler, but
> > it might make them more flexible - in particular, it'd allow us to
> > enable huge pages only for some regions (like shared buffers), and
> > keep the small pages e.g. for PGPROC. So that'd be good.
>
> The resizing patches split the shared buffer related structures into
> separate memory segments. I think that itself will help enabling huge
> pages for some regions. Would that help in your case?

Right, separate segments would allow mixing and matching huge pages with
pages of regular size. It's not implemented in the latest version of the
online resizing patch, purely to reduce complexity and maintain the same
invariant (everything either uses huge pages or not) -- but we could do
it the other way around as well.



Re: Adding basic NUMA awareness

From:
Jakub Wartak
Date:
On Tue, Jul 1, 2025 at 9:07 PM Tomas Vondra <tomas@vondra.me> wrote:

Hi!

> 1) v1-0001-NUMA-interleaving-buffers.patch
[..]
> It's a bit more complicated, because the patch distributes both the
> blocks and the descriptors, in the same way. So a buffer and its
> descriptor always end up on the same NUMA node. This is one of the
> reasons why we need to map larger chunks, because NUMA works on page
> granularity, and the descriptors are tiny - many fit on a memory page.

Oh, now I get it! OK, let's stick to this one.

> I don't think the splitting would actually make things simpler, but it
> might make them more flexible - in particular, it'd allow us to enable
> huge pages only for some regions (like shared buffers), and keep the
> small pages e.g. for PGPROC. So that'd be good.

You have made the assumption that this is good, but small pages (4KB)
are not hugetlb, and are *swappable* (Transparent HPs are swappable too;
manually allocated ones, as with mmap(MAP_HUGETLB), are not) [1]. The
most frequent problem I see these days is OOMs, and it makes me believe
that making certain critical parts of shared memory swappable just to
get page-size granularity is possibly throwing the baby out with the
bathwater. I'm thinking about bad situations like: some wrong settings
of vm.swappiness that people keep (or distros keep?), and the general
inability of PG to restrain from allocating more memory in some cases.

> The other thing I haven't thought about very much is determining on
> which CPUs/nodes the instance is allowed to run. I assume we'd start by
> simply inherit/determine that at the start through libnuma, not through
> some custom PG configuration (which the patch [2] proposed to do).

0. I think that we could do better; some counter-arguments to
no-configuration-at-all:

a. As Robert & Bertrand already put it there after review: let's say I
want to run just on NUMA node #2; here I would need to override
systemd's ExecStart= to include that numactl (not elegant?). I could
also use `CPUAffinity=1,3,5,7..` but that's all, and it is even less
friendly. It also probably requires root to edit/reload systemd, while
having a GUC for this like in my proposal makes it smoother (I think?).

b. Wouldn't it be better if that stayed as a drop-in rather than always
on? What if there's a problem - how do you disable those internal
optimizations if they do harm in some cases? (Or let's say I want to
play with MPOL_INTERLEAVE_WEIGHTED?) So at least a boolean
numa_buffers_interleave would be nice.

c. What if I want my standby (walreceiver+startup/recovery) to run with
NUMA affinity to get better performance? (I'm not going to hack around
the systemd script every time, but I could imagine changing numa=X,Y,Z
after restart/before promotion.)

d. Now, if I were forced for some reason to do that numactl(1) voodoo
and use the above-mentioned overrides, and PG wouldn't be having a GUC
(let's say I would use `numactl --weighted-interleave=0,1`), then:

> 2) v1-0002-NUMA-localalloc.patch
> This simply sets "localalloc" when initializing a backend, so that all
> memory allocated later is local, not interleaved. Initially this was
> necessary because the patch set the allocation policy to interleaving
> before initializing shared memory, and we didn't want to interleave the
> private memory. But that's no longer the case - the explicit mapping to
> nodes does not have this issue. I'm keeping the patch for convenience,
> it allows experimenting with numactl etc.

... is not accurate anymore, so would we still require having that in
(still with a GUC)? Thoughts? I can add my part into your patches if
you want.

Way too quick review and some very fast benchmark probes; I've
concentrated only on v1-0001 and v1-0005 (the efficiency of buffermgmt
would be too new a topic for me), but let's start:

1. Normal pgbench -S (still with just s_b@4GB), done many tries,
consistent benefit for the patch with around a +8..10% boost on a
generic run:

   numa_buffers_interleave=off  numa_pgproc_interleave=on (due to the
always-on "if"), s_b just on 1 NUMA node (might happen)
        latency average = 0.373 ms
        latency stddev = 0.237 ms
        initial connection time = 45.899 ms
        tps = 160242.147877 (without initial connection time)

   numa_buffers_interleave=on   numa_pgproc_interleave=on
        latency average = 0.345 ms
        latency stddev = 0.373 ms
        initial connection time = 44.485 ms
        tps = 177564.686094 (without initial connection time)

2. Tested it the same way as I did for mine (problem #2 from Andres's
presentation): 4s32c128t, s_b=4GB (on 128GB), prewarm test (with
seqconcurrscans.pgb as earlier)
   default/numa_buffers_interleave=off
        latency average = 1375.478 ms
        latency stddev = 1141.423 ms
        initial connection time = 46.104 ms
        tps = 45.868075 (without initial connection time)

   numa_buffers_interleave=on
        latency average = 838.128 ms
        latency stddev = 498.787 ms
        initial connection time = 43.437 ms
        tps = 75.413894 (without initial connection time)

    and I've repeated the same test (identical conditions) with my
patch, which got me slightly more juice:
        latency average = 727.717 ms
        latency stddev = 410.767 ms
        initial connection time = 45.119 ms
        tps = 86.844161 (without initial connection time)

    (But mine didn't get that boost from normal pgbench as per #1
pgbench -S -- my numa='all' stays @ 160k TPS just as with
numa_buffers_interleave=off), so this idea is clearly better.
    So should I close https://commitfest.postgresql.org/patch/5703/
and you'll open a new one, or should I just edit #5703, alter it, and
add this thread too?

3. The patch is not calling interleave on the PQ shmem; do we want to
add that as some next item, like v1-0007? The question is whether OS
interleaving makes sense there. I believe it does; please see my thread
(NUMA_pq_cpu_pinning_results.txt). The issue is that PQ workers are
spawned by the postmaster and may end up on different NUMA nodes
randomly, so OS-interleaving that memory actually reduces jitter there
(AKA bandwidth-over-latency). My thinking is that one cannot expect a
static/forced CPU-to-just-one-NUMA-node assignment for a backend and
its PQ workers, because it is impossible to always have CPU power
available in that NUMA node, so it might be useful to interleave that
shared memory too (as a separate patch item?).

4. In BufferManagerShmemInit() you call numa_num_configured_nodes()
(also in v1-0005). My worry is whether we should put some
known-limitations docs in from the start and mention that if the VM is
greatly resized and new NUMA nodes appear, they might not be used until
restart?

5. In v1-0001, pg_numa_interleave_memory()

+                * XXX no return value, to make this fail on error, has to use
+                * numa_set_strict

Yes, my patch has those numa_error() and numa_warn() handlers too in
pg_numa. Feel free to use them for better UX.

+                * XXX Should we still touch the memory first, like
with numa_move_pages,
+                * or is that not necessary?

It's not necessary to touch the memory after numa_tonode_memory()
(wrapper around numa_interleave_memory()); if it is going to be used
anyway, it will be correctly placed, to the best of my knowledge.

6. diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c

Accidental indents (also fails to apply)

7. We miss the pg_numa_* shims, but for sure that's for later, also to
avoid those Linux-specific #ifdef USE_LIBNUMA and so on?

8. v1-0005 2x + /* if (numa_procs_interleave) */

   Ha! It's a TRAP! I've uncommented it because I wanted to try it out
without it (just by setting the GUC off), but "MyProc->sema" is NULL:

    2025-07-04 12:31:08.103 CEST [28754] LOG:  starting PostgreSQL
19devel on x86_64-linux, compiled by gcc-12.2.0, 64-bit
    [..]
    2025-07-04 12:31:08.109 CEST [28754] LOG:  io worker (PID 28755)
was terminated by signal 11: Segmentation fault
    2025-07-04 12:31:08.109 CEST [28754] LOG:  terminating any other
active server processes
    2025-07-04 12:31:08.114 CEST [28754] LOG:  shutting down because
"restart_after_crash" is off
    2025-07-04 12:31:08.116 CEST [28754] LOG:  database system is shut down

    [New LWP 28755]
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
    Core was generated by `postgres: io worker                     '.
    Program terminated with signal SIGSEGV, Segmentation fault.
    #0  __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0)
at ./nptl/sem_waitcommon.c:136
    136     ./nptl/sem_waitcommon.c: No such file or directory.
    (gdb) where
    #0  __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0)
at ./nptl/sem_waitcommon.c:136
    #1  __new_sem_trywait (sem=sem@entry=0x0) at ./nptl/sem_wait.c:81
    #2  0x00005561918e0cac in PGSemaphoreReset (sema=0x0) at
../src/backend/port/posix_sema.c:302
    #3  0x0000556191970553 in InitAuxiliaryProcess () at
../src/backend/storage/lmgr/proc.c:992
    #4  0x00005561918e51a2 in AuxiliaryProcessMainCommon () at
../src/backend/postmaster/auxprocess.c:65
    #5  0x0000556191940676 in IoWorkerMain (startup_data=<optimized
out>, startup_data_len=<optimized out>) at
../src/backend/storage/aio/method_worker.c:393
    #6  0x00005561918e8163 in postmaster_child_launch
(child_type=child_type@entry=B_IO_WORKER, child_slot=20086,
startup_data=startup_data@entry=0x0,
        startup_data_len=startup_data_len@entry=0,
client_sock=client_sock@entry=0x0) at
../src/backend/postmaster/launch_backend.c:290
    #7  0x00005561918ea09a in StartChildProcess
(type=type@entry=B_IO_WORKER) at
../src/backend/postmaster/postmaster.c:3973
    #8  0x00005561918ea308 in maybe_adjust_io_workers () at
../src/backend/postmaster/postmaster.c:4404
    [..]
    (gdb) print *MyProc->sem
    Cannot access memory at address 0x0

9. v1-0006: is this just a thought or a serious candidate? I can imagine
it could easily blow up, with some backends somehow requesting CPUs only
from one NUMA node while the second node sits idle. Isn't it better to
just leave CPU scheduling, well, to the CPU scheduler? The problem is
that there are tools showing overall CPU usage, even mpstat(1) per CPU,
but no tools for per-NUMA-node CPU util%, so it would be hard for
someone to realize that this is happening.

-J.

[1] - https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt



Re: Adding basic NUMA awareness

From:
Tomas Vondra
Date:
On 7/4/25 13:05, Jakub Wartak wrote:
> On Tue, Jul 1, 2025 at 9:07 PM Tomas Vondra <tomas@vondra.me> wrote:
> 
> Hi!
> 
>> 1) v1-0001-NUMA-interleaving-buffers.patch
> [..]
>> It's a bit more complicated, because the patch distributes both the
>> blocks and the descriptors, in the same way. So a buffer and its
>> descriptor always end up on the same NUMA node. This is one of the
>> reasons why we need to map larger chunks, because NUMA works on page
>> granularity, and the descriptors are tiny - many fit on a memory page.
> 
> Oh, now I get it! OK, let's stick to this one.
> 
>> I don't think the splitting would actually make things simpler, but
>> it might make them more flexible - in particular, it'd allow us to
>> enable huge pages only for some regions (like shared buffers), and
>> keep the small pages e.g. for PGPROC. So that'd be good.
> 
> You have made the assumption that this is good, but small pages (4KB)
> are not hugetlb, and are *swappable* (Transparent HPs are swappable
> too; manually allocated ones, as with mmap(MAP_HUGETLB), are not) [1].
> The most frequent problem I see these days is OOMs, and it makes me
> believe that making certain critical parts of shared memory swappable
> just to get page-size granularity is possibly throwing the baby out
> with the bathwater. I'm thinking about bad situations like: some wrong
> settings of vm.swappiness that people keep (or distros keep?), and the
> general inability of PG to restrain from allocating more memory in
> some cases.
> 

I haven't observed such issues myself, or maybe I didn't realize it's
happening. Maybe it happens, but it'd be good to see some data showing
that, or a reproducer of some sort. But let's say it's real.

I don't think we should use huge pages merely to ensure something is not
swapped out. The "not swappable" is more of a limitation of huge pages,
not an advantage. You can't just choose to make them swappable.

Wouldn't it be better to keep using 4KB pages, but lock the memory using
mlock/mlockall?
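
For reference, a minimal sketch of that mlock() approach on a regular
(4KB-page) allocation - subject to RLIMIT_MEMLOCK and CAP_IPC_LOCK, and
with error handling simplified:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

/* Pin a regular allocation so it cannot be swapped out, without using
 * hugetlb. May fail with a low RLIMIT_MEMLOCK or without CAP_IPC_LOCK. */
static int
lock_region(void *addr, size_t len)
{
    if (mlock(addr, len) != 0)
    {
        perror("mlock");
        return -1;
    }
    return 0;
}
```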


>> The other thing I haven't thought about very much is determining on
>> which CPUs/nodes the instance is allowed to run. I assume we'd start
>> by simply inheriting/determining that at startup through libnuma, not
>> through some custom PG configuration (which the patch [2] proposed to
>> do).
> 
> 0. I think that we could do better, some counter arguments to
> no-configuration-at-all:
> 
> a. as Robert & Bertrand already put it there after review: let's say I
> want just to run on NUMA #2 node, so here I would need to override
> systemd's script ExecStart= to include that numactl (not elegant?). I
> could also use `CPUAffinity=1,3,5,7..` but that's all, and it is even
> less friendly. Also it probably requires root to edit/reload systemd,
> while having GUC for this like in my proposal makes it more smooth (I
> think?)
> 
> b. wouldn't it be better if that stayed as drop-in rather than always
> on? What if there's a problem, how do you disable those internal
> optimizations if they do harm in some cases?  (or let's say I want to
> play with MPOL_INTERLEAVE_WEIGHTED?). So at least boolean
> numa_buffers_interleave would be nice?
> 
> c. What if I want my standby (walreceiver+startup/recovery) to run
> with NUMA affinity to get better performance (I'm not going to hack
> around systemd script every time, but I could imagine changing
> numa=X,Y,Z after restart/before promotion)
> 
> d. Now if I would be forced for some reason to do that numactl(1)
> voodoo, and use the those above mentioned overrides and PG wouldn't be
> having GUC (let's say I would use `numactl
> --weighted-interleave=0,1`), then:
> 

I'm not against doing something like this, but I don't plan to do that
in V1. I don't have a clear idea what configurability is actually
needed, so it's likely I'd do the interface wrong.

>> 2) v1-0002-NUMA-localalloc.patch
>> This simply sets "localalloc" when initializing a backend, so that all
>> memory allocated later is local, not interleaved. Initially this was
>> necessary because the patch set the allocation policy to interleaving
>> before initializing shared memory, and we didn't want to interleave the
>> private memory. But that's no longer the case - the explicit mapping to
>> nodes does not have this issue. I'm keeping the patch for convenience,
>> it allows experimenting with numactl etc.
> 
> .. .is not accurate anymore and we would require to have that in
> (still with GUC) ?
> Thoughts? I can add that mine part into Your's patches if you want.
> 

I'm sorry, I don't understand what's the question :-(

> Way too quick review and some very fast benchmark probes, I've
> concentrated only on v1-0001 and v1-0005 (efficiency of buffermgmt
> would be too new topic for me), but let's start:
> 
> 1. normal pgbench -S (still with just s_b@4GB), done many tries,
> consistent benefit for the patch with like +8..10% boost on generic
> run:
> 
>    numa_buffers_interleave=off  numa_pgproc_interleave=on(due that
> always on "if"), s_b just on 1 NUMA node (might happen)
>         latency average = 0.373 ms
>         latency stddev = 0.237 ms
>         initial connection time = 45.899 ms
>         tps = 160242.147877 (without initial connection time)
> 
>    numa_buffers_interleave=on   numa_pgproc_interleave=on
>         latency average = 0.345 ms
>         latency stddev = 0.373 ms
>         initial connection time = 44.485 ms
>         tps = 177564.686094 (without initial connection time)
> 
> 2. Tested it the same way as I did for mine(problem#2 from Andres's
> presentation): 4s32c128t, s_b=4GB (on 128GB), prewarm test (with
> seqconcurrscans.pgb as earlier)
>    default/numa_buffers_interleave=off
>         latency average = 1375.478 ms
>         latency stddev = 1141.423 ms
>         initial connection time = 46.104 ms
>         tps = 45.868075 (without initial connection time)
> 
>    numa_buffers_interleave=on
>         latency average = 838.128 ms
>         latency stddev = 498.787 ms
>         initial connection time = 43.437 ms
>         tps = 75.413894 (without initial connection time)
> 
> and I've repeated the same test (identical conditions) with my
> patch, got me slightly more juice:
>         latency average = 727.717 ms
>         latency stddev = 410.767 ms
>         initial connection time = 45.119 ms
>         tps = 86.844161 (without initial connection time)
> 
>     (but mine didn't get that boost from normal pgbench as per #1
> pgbench -S -- my numa='all' stays @ 160k TPS just as
> numa_buffers_interleave=off), so this idea is clearly better.

Good, thanks for the testing. I should have done something like this
when I posted my patches, but I forgot about that (and the email felt
too long anyway).

But this actually brings an interesting question. What exactly should we
expect / demand from these patches? In my mind it's primarily about
predictability and stability of results.

For example, the results should not depend on how the database was
warmed up - was it done by a single backend or many backends? Was it
restarted, or what? I could probably warm up the system very carefully to
ensure it's balanced. The patches mean I don't need to be that careful.


>     So should I close https://commitfest.postgresql.org/patch/5703/
> and you'll open a new one or should I just edit the #5703 and alter it
> and add this thread too?
> 

Good question. It's probably best to close the original entry as
"withdrawn" and I'll add a new entry. Sounds OK?

> 3. Patch is not calling interleave on PQ shmem, do we want to add that
> in as some next item like v1-0007? Question is whether OS interleaving
> makes sense there ? I believe it does there, please see my thread
> (NUMA_pq_cpu_pinning_results.txt), the issue is that PQ workers are
> being spawned by postmaster and may end up on different NUMA nodes
> randomly, so actually OS-interleaving that memory reduces jitter there
> (AKA bandwidth-over-latency). My thinking is that one cannot expect
> static/forced CPU-to-just-one-NUMA-node assignment for backend and
> it's PQ workers, because it is impossible have always available CPU
> power there in that NUMA node, so it might be useful to interleave
> that shared mem there too (as separate patch item?)
> 

Excellent question. I haven't thought about this at all. I agree it
probably makes sense to interleave this memory, in some way. I don't
know what's the perfect scheme, though.

wild idea: Would it make sense to pin the workers to the same NUMA node
as the leader? And allocate all memory only from that node?

> 4. In BufferManagerShmemInit() you call numa_num_configured_nodes()
> (also in v1-0005). My worry is: should we perhaps put some
> known-limitations docs (?) in from the start, and mention that
> if the VM is greatly resized and new NUMA nodes appear, they might
> not be used until restart?
> 

Yes, this is one thing I need some feedback on. The patches mostly
assume there are no disabled nodes, that the set of allowed nodes does
not change, etc. I think for V1 that's a reasonable limitation.

But let's say we want to relax this a bit. How do we learn about the
change, after a node/CPU gets disabled? For some parts it's not that
difficult (e.g. we can "remap" buffers/descriptors) in the background.
But for other parts that's not practical. E.g. we can't rework how the
PGPROC gets split.

But while discussing this with Andres yesterday, he had an interesting
suggestion - to always use e.g. 8 or 16 partitions, then partition this
by NUMA node. So we'd have 16 partitions, and with 4 nodes the 0-3 would
go to node 0, 4-7 would go to node 1, etc. The advantage is that if a
node gets disabled, we can rebuild just this small "mapping" and not the
16 partitions. And the partitioning may be helpful even without NUMA.

Still have to figure out the details, but seems it might help.

> 5. In v1-0001, pg_numa_interleave_memory()
> 
> +                * XXX no return value, to make this fail on error, has to use
> +                * numa_set_strict
> 
> Yes, my patch has those numa_error() and numa_warn() handlers too in
> pg_numa. Feel free to use it for better UX.
> 
> +                * XXX Should we still touch the memory first, like
> with numa_move_pages,
> +                * or is that not necessary?
> 
> It's not necessary to touch after numa_tonode_memory() (wrapper around
> numa_interleave_memory()), if it is going to be used anyway it will be
> correctly placed to best of my knowledge.
> 
> 6. diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
> 
> Accidental indents (also fails to apply)
> 
> 7. We miss the pg_numa_* shims, but for sure that's for later and also
> avoid those Linux specific #ifdef USE_LIBNUMA and so on?
> 

Right, we need to add those. Or actually, we need to think about how
we'd do this for non-NUMA systems. I wonder if we even want to just
build everything the "old way" (without the partitions, etc.).

But per the earlier comment, the partitioning seems beneficial even on
non-NUMA systems, so maybe the shims are good enough.

> 8. v1-0005 2x + /* if (numa_procs_interleave) */
> 
>    Ha! it's a TRAP! I've uncommented it because I wanted to try it out
> without it (just by setting GUC off) , but "MyProc->sema" is NULL :
> 
>     2025-07-04 12:31:08.103 CEST [28754] LOG:  starting PostgreSQL
> 19devel on x86_64-linux, compiled by gcc-12.2.0, 64-bit
>     [..]
>     2025-07-04 12:31:08.109 CEST [28754] LOG:  io worker (PID 28755)
> was terminated by signal 11: Segmentation fault
>     2025-07-04 12:31:08.109 CEST [28754] LOG:  terminating any other
> active server processes
>     2025-07-04 12:31:08.114 CEST [28754] LOG:  shutting down because
> "restart_after_crash" is off
>     2025-07-04 12:31:08.116 CEST [28754] LOG:  database system is shut down
> 
>     [New LWP 28755]
>     [Thread debugging using libthread_db enabled]
>     Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
>     Core was generated by `postgres: io worker                     '.
>     Program terminated with signal SIGSEGV, Segmentation fault.
>     #0  __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0)
> at ./nptl/sem_waitcommon.c:136
>     136     ./nptl/sem_waitcommon.c: No such file or directory.
>     (gdb) where
>     #0  __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0)
> at ./nptl/sem_waitcommon.c:136
>     #1  __new_sem_trywait (sem=sem@entry=0x0) at ./nptl/sem_wait.c:81
>     #2  0x00005561918e0cac in PGSemaphoreReset (sema=0x0) at
> ../src/backend/port/posix_sema.c:302
>     #3  0x0000556191970553 in InitAuxiliaryProcess () at
> ../src/backend/storage/lmgr/proc.c:992
>     #4  0x00005561918e51a2 in AuxiliaryProcessMainCommon () at
> ../src/backend/postmaster/auxprocess.c:65
>     #5  0x0000556191940676 in IoWorkerMain (startup_data=<optimized
> out>, startup_data_len=<optimized out>) at
> ../src/backend/storage/aio/method_worker.c:393
>     #6  0x00005561918e8163 in postmaster_child_launch
> (child_type=child_type@entry=B_IO_WORKER, child_slot=20086,
> startup_data=startup_data@entry=0x0,
>         startup_data_len=startup_data_len@entry=0,
> client_sock=client_sock@entry=0x0) at
> ../src/backend/postmaster/launch_backend.c:290
>     #7  0x00005561918ea09a in StartChildProcess
> (type=type@entry=B_IO_WORKER) at
> ../src/backend/postmaster/postmaster.c:3973
>     #8  0x00005561918ea308 in maybe_adjust_io_workers () at
> ../src/backend/postmaster/postmaster.c:4404
>     [..]
>     (gdb) print *MyProc->sem
>     Cannot access memory at address 0x0
> 

Yeah, good catch. I'll look into that next week.

> 9. v1-0006: is this just a thought or serious candidate? I can imagine
> it can easily blow-up with some backends somehow requesting CPUs only
> from one NUMA node, while the second node being idle. Isn't it better
> just to leave CPU scheduling, well, to the CPU scheduler? The problem
> is that you have tools showing overall CPU usage, even mpstat(1) per
> CPU , but no tools for per-NUMA node CPU util%, so it would be hard
> for someone to realize that this is happening.
> 

Mostly experimental, for benchmarking etc. I agree we may not want to
mess with the task scheduling too much.


Thanks for the feedback!

regards

-- 
Tomas Vondra




Hi Tomas,


I haven't yet had time to fully read all the work and proposals around
NUMA and related features, but I hope to catch up over the summer.

However, I think it's important to share some thoughts before it's too
late, as you might find them relevant to the NUMA management code.


> 6) v1-0006-NUMA-pin-backends-to-NUMA-nodes.patch
>
> This is an experimental patch, that simply pins the new process to the
> NUMA node obtained from the freelist.
>
> Driven by GUC "numa_procs_pin" (default: off).


In my work on more careful PostgreSQL resource management, I've come to
the conclusion that we should avoid pushing policy too deeply into the
PostgreSQL core itself. Therefore, I'm quite skeptical about integrating
NUMA-specific management directly into core PostgreSQL in such a way.


We are working on a PROFILE and PROFILE MANAGER specification to provide
PostgreSQL with only the APIs and hooks needed so that extensions can
manage whatever they want externally.

The basic syntax (not meant to be discussed here, and even the names
might change) is roughly as follows, just to illustrate the intent:


CREATE PROFILE MANAGER manager_name [IF NOT EXISTS]
   [ HANDLER handler_function | NO HANDLER ]
   [ VALIDATOR validator_function | NO VALIDATOR ]
   [ OPTIONS ( option 'value' [, ... ] ) ]

CREATE PROFILE profile_name
   [IF NOT EXISTS]
   USING profile_manager
     SET key = value [, key = value]...
   [USING profile_manager
     SET key = value [, key = value]...]
   [...];

CREATE PROFILE MAPPING
   [IF NOT EXISTS]
   FOR PROFILE profile_name
   [MATCH [ ALL | ANY ] (
     [ROLE role_name],
     [BACKEND TYPE backend_type],
     [DATABASE database_name],
     [APPLICATION appname]
   )];

## PROFILE RESOLUTION ORDER

1. ALTER ROLE IN DATABASE
2. ALTER ROLE
3. ALTER DATABASE
4. First matching PROFILE MAPPING (global or specific)
5. No profile (fallback)

As currently designed, this approach allows quite a lot of flexibility:

* pg_psi is used to ensure the spec is suitable for a cgroup profile
manager (moving PIDs as needed; NUMA and cgroups could work well
together, see e.g. this Linux kernel summary:
https://blogs.oracle.com/linux/post/numa-balancing )

* Someone else could implement support for Windows or BSD specifics.

* Others might use it to integrate PostgreSQL's own resources (e.g.,
"areas" of shared buffers) into policies.

Hope this perspective is helpful.

Best regards,
--
Cédric Villemain +33 6 20 30 22 52
https://www.Data-Bene.io
PostgreSQL Support, Expertise, Training, R&D





On 7/5/25 09:09, Cédric Villemain wrote:
> Hi Tomas,
> 
> 
> I haven't yet had time to fully read all the work and proposals around
> NUMA and related features, but I hope to catch up over the summer.
> 
> However, I think it's important to share some thoughts before it's too
> late, as you might find them relevant to the NUMA management code.
> 
> 
>> 6) v1-0006-NUMA-pin-backends-to-NUMA-nodes.patch
>>
>> This is an experimental patch, that simply pins the new process to the
>> NUMA node obtained from the freelist.
>>
>> Driven by GUC "numa_procs_pin" (default: off).
> 
> 
> In my work on more careful PostgreSQL resource management, I've come to
> the conclusion that we should avoid pushing policy too deeply into the
> PostgreSQL core itself. Therefore, I'm quite skeptical about integrating
> NUMA-specific management directly into core PostgreSQL in such a way.
> 
> 
> We are working on a PROFILE and PROFILE MANAGER specification to provide
> PostgreSQL with only the APIs and hooks needed so that extensions can
> manage whatever they want externally.
> 
> The basic syntax (not meant to be discussed here, and even the names
> might change) is roughly as follows, just to illustrate the intent:
> 
> 
> CREATE PROFILE MANAGER manager_name [IF NOT EXISTS]
>   [ HANDLER handler_function | NO HANDLER ]
>   [ VALIDATOR validator_function | NO VALIDATOR ]
>   [ OPTIONS ( option 'value' [, ... ] ) ]
> 
> CREATE PROFILE profile_name
>   [IF NOT EXISTS]
>   USING profile_manager
>     SET key = value [, key = value]...
>   [USING profile_manager
>     SET key = value [, key = value]...]
>   [...];
> 
> CREATE PROFILE MAPPING
>   [IF NOT EXISTS]
>   FOR PROFILE profile_name
>   [MATCH [ ALL | ANY ] (
>     [ROLE role_name],
>     [BACKEND TYPE backend_type],
>     [DATABASE database_name],
>     [APPLICATION appname]
>   )];
> 
> ## PROFILE RESOLUTION ORDER
> 
> 1. ALTER ROLE IN DATABASE
> 2. ALTER ROLE
> 3. ALTER DATABASE
> 4. First matching PROFILE MAPPING (global or specific)
> 5. No profile (fallback)
> 
> As currently designed, this approach allows quite a lot of flexibility:
> 
> * pg_psi is used to ensure the spec is suitable for a cgroup profile
> manager (moving PIDs as needed; NUMA and cgroups could work well
> together, see e.g. this Linux kernel summary: https://blogs.oracle.com/
> linux/post/numa-balancing )
> 
> * Someone else could implement support for Windows or BSD specifics.
> 
> * Others might use it to integrate PostgreSQL's own resources (e.g.,
> "areas" of shared buffers) into policies.
> 
> Hope this perspective is helpful.

Can you explain how you want to manage this by an extension defined at
the SQL level, when most of this stuff has to be done when setting up
shared memory, which is waaaay before we have any access to catalogs?


regards

-- 
Tomas Vondra




Re: Adding basic NUMA awareness

From
Jakub Wartak
Date:
Hi Tomas, some more thoughts after the weekend:

On Fri, Jul 4, 2025 at 8:12 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> On 7/4/25 13:05, Jakub Wartak wrote:
> > On Tue, Jul 1, 2025 at 9:07 PM Tomas Vondra <tomas@vondra.me> wrote:
> >
> > Hi!
> >
> >> 1) v1-0001-NUMA-interleaving-buffers.patch
> > [..]
> >> It's a bit more complicated, because the patch distributes both the
> >> blocks and descriptors, in the same way. So a buffer and its descriptor
> >> always end up on the same NUMA node. This is one of the reasons why we need
> >> to map larger chunks, because NUMA works on page granularity, and the
> >> descriptors are tiny - many fit on a memory page.
> >
> > Oh, now I get it! OK, let's stick to this one.
> >
> >> I don't think the splitting would actually make some things simpler, or
> >> maybe more flexible - in particular, it'd allow us to enable huge pages
> >> only for some regions (like shared buffers), and keep the small pages
> >> e.g. for PGPROC. So that'd be good.
> >
> > You have made assumption that this is good, but small pages (4KB) are
> > not hugetlb, and are *swappable* (Transparent HP are swappable too,
> > manually allocated ones as with mmap(MMAP_HUGETLB) are not)[1]. The
> > most frequent problem I see these days are OOMs, and it makes me
> > believe that making certain critical parts of shared memory being
> > swappable just to make pagesize granular is possibly throwing the baby
> > out with the bathwater. I'm thinking about bad situations like: some
> > wrong settings of vm.swappiness that people keep (or distros keep?) and
> > general inability of PG to restrain from allocating more memory in
> > some cases.
> >
>
> I haven't observed such issues myself, or maybe I didn't realize it's
> happening. Maybe it happens, but it'd be good to see some data showing
> that, or a reproducer of some sort. But let's say it's real.
>
> I don't think we should use huge pages merely to ensure something is not
> swapped out. The "not swappable" is more of a limitation of huge pages,
> not an advantage. You can't just choose to make them swappable.
>
> Wouldn't it be better to keep using 4KB pages, but lock the memory using
> mlock/mlockall?

In my book, not being swappable is a win (it's hard for me to imagine
when it could be beneficial to swap out parts of s_b).

I was trying to think this through, and came up with the following:

Anyway, mlock() probably sounds like the right tool, but e.g. Rocky 8.10
by default has max locked memory (ulimit -l) as low as 64kB due to
systemd's DefaultLimitMEMLOCK, while Debian/Ubuntu set higher values.
I wasn't expecting that - those are bizarrely low values. I think we
would need something like (10000*900)/1024/1024 or more, but with each
PGPROC on a separate page that would be even way more?

Another thing with 4kB pages: there's a big assumption made now that
once we arrive in InitProcess() we won't ever change NUMA node, so we
stick to the PGPROC from where we started (based on getcpu(2)). Let's
assume the CPU scheduler reassigned us to a different node; we have this
4kB page ready for PGPROC in theory, and this means we would need to
rely on NUMA autobalancing doing its job to migrate that 4kB page from
node to node (to get local accesses instead of remote ones). The
questions in my head are now:
- we initially asked for those PGPROC pages to be localized on a certain
node (they have a policy), so they won't autobalance? We would need to
call getcpu() again somewhere, notice the difference, and un-localize
(clear the NUMA/mbind() policy on) the PGPROC page?
- mlock() as above says stick to the physical RAM page (?), so it won't move?
- how long after switching the active CPU<->node would the kernel's
autobalancing migrate that page? I mean, do we execute enough reads on
this page?

BTW: to make this practical, what's the most one-liner/trivial way to
exercise/stress PGPROC?

> >> The other thing I haven't thought about very much is determining on
> >> which CPUs/nodes the instance is allowed to run. I assume we'd start by
> >> simply inherit/determine that at the start through libnuma, not through
> >> some custom PG configuration (which the patch [2] proposed to do).
> >
> > 0. I think that we could do better, some counter arguments to
> > no-configuration-at-all:
> >
> > a. as Robert & Bertrand already put it there after review: let's say I
> > want just to run on NUMA #2 node, so here I would need to override
> > systemd's script ExecStart= to include that numactl (not elegant?). I
> > could also use `CPUAffinity=1,3,5,7..` but that's all, and it is even
> > less friendly. Also it probably requires root to edit/reload systemd,
> > while having GUC for this like in my proposal makes it more smooth (I
> > think?)
> >
> > b. wouldn't it be better if that stayed as drop-in rather than always
> > on? What if there's a problem, how do you disable those internal
> > optimizations if they do harm in some cases?  (or let's say I want to
> > play with MPOL_INTERLEAVE_WEIGHTED?). So at least boolean
> > numa_buffers_interleave would be nice?
> >
> > c. What if I want my standby (walreceiver+startup/recovery) to run
> > with NUMA affinity to get better performance (I'm not going to hack
> > around systemd script every time, but I could imagine changing
> > numa=X,Y,Z after restart/before promotion)
> >
> > d. Now if I would be forced for some reason to do that numactl(1)
> > voodoo, and use the those above mentioned overrides and PG wouldn't be
> > having GUC (let's say I would use `numactl
> > --weighted-interleave=0,1`), then:
> >
>
> I'm not against doing something like this, but I don't plan to do that
> in V1. I don't have a clear idea what configurability is actually
> needed, so it's likely I'd do the interface wrong.
>
> >> 2) v1-0002-NUMA-localalloc.patch
> >> This simply sets "localalloc" when initializing a backend, so that all
> >> memory allocated later is local, not interleaved. Initially this was
> >> necessary because the patch set the allocation policy to interleaving
> >> before initializing shared memory, and we didn't want to interleave the
> >> private memory. But that's no longer the case - the explicit mapping to
> >> nodes does not have this issue. I'm keeping the patch for convenience,
> >> it allows experimenting with numactl etc.
> >
> > .. .is not accurate anymore and we would require to have that in
> > (still with GUC) ?
> > Thoughts? I can add that mine part into Your's patches if you want.
> >
>
> I'm sorry, I don't understand what's the question :-(

That patch reference above was a chain of thought from step "d". What I
had in mind was that you cannot remove v1-0002-NUMA-localalloc.patch
from the scope if you force people to use numactl by not having enough
configurability on the PG side. That is: if someone has to use systemd +
numactl --interleave/--weighted-interleave, they will also need a way to
use numa_localalloc=on (to override the new user policy default,
otherwise local memory allocations are also going to be interleaved, and
we are back to square one). Which brings me to the point that instead of
this toggle, we should include the configuration properly from the start
(it's not that hard, apparently).

> > Way too quick review and some very fast benchmark probes, I've
> > concentrated only on v1-0001 and v1-0005 (efficiency of buffermgmt
> > would be too new topic for me), but let's start:
> >
> > 1. normal pgbench -S (still with just s_b@4GB), done many tries,
> > consistent benefit for the patch with like +8..10% boost on generic
> > run:
> >
[.. removed numbers]
>
> But this actually brings an interesting question. What exactly should we
> expect / demand from these patches? In my mind it'd primarily about
> predictability and stability of results.
>
> For example, the results should not depend on how was the database
> warmed up - was it done by a single backend or many backends? Was it
> restarted, or what? I could probably warmup the system very carefully to
> ensure it's balanced. The patches mean I don't need to be that careful.

Well, pretty much the same here. I was after minimizing "stddev" (to
have better predictability of results, especially across restarts) and
increasing available bandwidth [which is pretty much related]. Without
our NUMA work, PG can just put that s_b on any random node, or spill
randomly from one node to another (depending on the size of the
allocation request).

> >     So should I close https://commitfest.postgresql.org/patch/5703/
> > and you'll open a new one or should I just edit the #5703 and alter it
> > and add this thread too?
> >
>
> Good question. It's probably best to close the original entry as
> "withdrawn" and I'll add a new entry. Sounds OK?

Sure thing, marked it as `Returned with feedback`, this approach seems
to be much more advanced.

> > 3. Patch is not calling interleave on PQ shmem, do we want to add that
> > in as some next item like v1-0007? Question is whether OS interleaving
> > makes sense there ? I believe it does there, please see my thread
> > (NUMA_pq_cpu_pinning_results.txt), the issue is that PQ workers are
> > being spawned by postmaster and may end up on different NUMA nodes
> > randomly, so actually OS-interleaving that memory reduces jitter there
> > (AKA bandwidth-over-latency). My thinking is that one cannot expect
> > static/forced CPU-to-just-one-NUMA-node assignment for backend and
> > it's PQ workers, because it is impossible have always available CPU
> > power there in that NUMA node, so it might be useful to interleave
> > that shared mem there too (as separate patch item?)
> >
>
> Excellent question. I haven't thought about this at all. I agree it
> probably makes sense to interleave this memory, in some way. I don't
> know what's the perfect scheme, though.
>
> wild idea: Would it make sense to pin the workers to the same NUMA node
> as the leader? And allocate all memory only from that node?

I'm trying to convey exactly the opposite message, or at least that it
might depend on configuration. Please see
https://www.postgresql.org/message-id/CAKZiRmxYMPbQ4WiyJWh%3DVuw_Ny%2BhLGH9_9FaacKRJvzZ-smm%2Bw%40mail.gmail.com
(btw, it should read there that I don't intend to spend a lot of time on
PQ), but anyway: I think we should NOT pin the PQ workers to the same
node, as you do not know if there's CPU capacity left there (same story
as with v1-0006 here).

I'm just proposing quick OS-based interleaving of PQ shm if using all
nodes, literally:

@@ -334,6 +336,13 @@ dsm_impl_posix(dsm_op op, dsm_handle handle, Size request_size,
     }
     *mapped_address = address;
     *mapped_size = request_size;
+
+    /* We interleave memory only at creation time. */
+    if (op == DSM_OP_CREATE && numa->setting > NUMA_OFF) {
+        elog(DEBUG1, "interleaving shm mem @ %p size=%zu",
+             *mapped_address, *mapped_size);
+        pg_numa_interleave_memptr(*mapped_address, *mapped_size, numa->nodes);
+    }
+

Because then, if memory is interleaved, you probably get less variance
in memory access. But also from that previous thread:

"So if anything:
- latency-wise: it would be best to place leader+all PQ workers close
to s_b, provided s_b fits NUMA shared/huge page memory there and you
won't need more CPU than there's on that NUMA node... (assuming e.g.
hosting 4 DBs on 4-sockets each on it's own, it would be best to pin
everything including shm, but PQ workers too)
- capacity/TPS-wise or s_b > NUMA: just interleave to maximize
bandwidth and get uniform CPU performance out of this"

So the wild idea was: maybe PQ shm interleaving should depend on the
NUMA configuration (if interleaving across all nodes, then interleave
normally, but if the configuration is set to just 1 NUMA node, it
automatically binds there -- there was '@' support for that in my patch).

> > 4. In BufferManagerShmemInit() you call numa_num_configured_nodes()
> > (also in v1-0005). My worry is: should we perhaps put some
> > known-limitations docs (?) in from the start, and mention that
> > if the VM is greatly resized and new NUMA nodes appear, they might
> > not be used until restart?
> >
>
> Yes, this is one thing I need some feedback on. The patches mostly
> assume there are no disabled nodes, that the set of allowed nodes does
> not change, etc. I think for V1 that's a reasonable limitation.

Sure!

> But let's say we want to relax this a bit. How do we learn about the
> change, after a node/CPU gets disabled? For some parts it's not that
> difficult (e.g. we can "remap" buffers/descriptors) in the background.
> But for other parts that's not practical. E.g. we can't rework how the
> PGPROC gets split.
>
> But while discussing this with Andres yesterday, he had an interesting
> suggestion - to always use e.g. 8 or 16 partitions, then partition this
> by NUMA node. So we'd have 16 partitions, and with 4 nodes the 0-3 would
> go to node 0, 4-7 would go to node 1, etc. The advantage is that if a
> node gets disabled, we can rebuild just this small "mapping" and not the
> 16 partitions. And the partitioning may be helpful even without NUMA.
>
> Still have to figure out the details, but seems it might help.

Right, no idea how the shared_memory remapping patch will work
(how/when the s_b change will be executed), but we could somehow arrange
for the number of NUMA nodes to be rechecked during SIGHUP (?) and then
just do a simple check whether old_numa_num_configured_nodes ==
new_numa_num_configured_nodes still holds.

Anyway, I think it's way too advanced for now, don't you think? (like
CPU ballooning [s_b itself] is rare, and NUMA ballooning seems to be
super-wild-rare).

As for the rest, I forgot to include this too: getcpu() - this really
needs a portable pg_getcpu() wrapper.

-J.



>> * Others might use it to integrate PostgreSQL's own resources (e.g.,
>> "areas" of shared buffers) into policies.
>>
>> Hope this perspective is helpful.
>
> Can you explain how you want to manage this by an extension defined at
> the SQL level, when most of this stuff has to be done when setting up
> shared memory, which is waaaay before we have any access to catalogs?

I should have said module instead. I didn't follow carefully, but at
some point there was discussion about resizing shared buffers "on-line".
Anyway, it was just to give a few examples; maybe this one is to be
considered later (I'm focused on cgroup/psi, and precisely on
reassigning PIDs as needed).


--
Cédric Villemain +33 6 20 30 22 52
https://www.Data-Bene.io
PostgreSQL Support, Expertise, Training, R&D




On 7/7/25 16:51, Cédric Villemain wrote:
>>> * Others might use it to integrate PostgreSQL's own resources (e.g.,
>>> "areas" of shared buffers) into policies.
>>>
>>> Hope this perspective is helpful.
>>
>> Can you explain how you want to manage this by an extension defined at
>> the SQL level, when most of this stuff has to be done when setting up
>> shared memory, which is waaaay before we have any access to catalogs?
> 
> I should have said module instead. I didn't follow carefully, but at
> some point there was discussion about resizing shared buffers "on-line".
> Anyway, it was just to give a few examples; maybe this one is to be
> considered later (I'm focused on cgroup/psi, and precisely on
> reassigning PIDs as needed).
> 

I don't know. I have a hard time imagining what exactly the
policies / profiles would do to respond to changes in the system
utilization. And why should that interfere with this patch ...

The main thing the patch series aims to implement is partitioning different
pieces of shared memory (buffers, freelists, ...) to work better for
NUMA. I don't think there are that many ways to do this, and I doubt it
makes sense to make this easily customizable from external modules of
any kind. I can imagine providing some API allowing the instance to be
isolated on selected NUMA nodes, but that's about it.

Yes, there's some relation to the online resizing of shared buffers, in
which case we need to "refresh" some of the information. But AFAICS it's
not very extensive (on top of what already needs to happen after the
resize), and it'd happen within the boundaries of the partitioning
scheme. There's not that much flexibility.

The last bit (pinning backends to a NUMA node) is experimental, and
mostly intended for easier evaluation of the earlier parts (e.g. to
limit the noise when processes get moved to a CPU from a different NUMA
node, and so on).


regards

-- 
Tomas Vondra




Hi,

On 2025-07-05 07:09:00 +0000, Cédric Villemain wrote:
> In my work on more careful PostgreSQL resource management, I've come to the
> conclusion that we should avoid pushing policy too deeply into the
> PostgreSQL core itself. Therefore, I'm quite skeptical about integrating
> NUMA-specific management directly into core PostgreSQL in such a way.

I think it's actually the opposite - whenever we pushed stuff like this
outside of core it has hurt postgres substantially. Not having replication in
core was a huge mistake. Not having HA management in core is probably the
biggest current adoption hurdle for postgres.

To deal better with NUMA we need to improve memory placement and various
algorithms, in an interrelated way - that's pretty much impossible to do
outside of core.

Greetings,

Andres Freund



> On 7/7/25 16:51, Cédric Villemain wrote:
>>>> * Others might use it to integrate PostgreSQL's own resources (e.g.,
>>>> "areas" of shared buffers) into policies.
>>>>
>>>> Hope this perspective is helpful.
>>>
>>> Can you explain how you want to manage this by an extension defined at
>>> the SQL level, when most of this stuff has to be done when setting up
>>> shared memory, which is waaaay before we have any access to catalogs?
>>
>> I should have said module instead, I didn't follow carefully but at some
>> point there were discussion about shared buffers resized "on-line".
>> Anyway, it was just to give some few examples, maybe this one is to be
>> considered later (I'm focused on cgroup/psi, and precisely reassigning
>> PIDs as needed).
>>
>
> I don't know. I have a hard time imagining what exactly would the
> policies / profiles do exactly to respond to changes in the system
> utilization. And why should that interfere with this patch ...
>
> The main thing patch series aims to implement is partitioning different
> pieces of shared memory (buffers, freelists, ...) to better work for
> NUMA. I don't think there's that many ways to do this, and I doubt it
> makes sense to make this easily customizable from external modules of
> any kind. I can imagine providing some API allowing to isolate the
> instance on selected NUMA nodes, but that's about it.
>
> Yes, there's some relation to the online resizing of shared buffers, in
> which case we need to "refresh" some of the information. But AFAICS it's
> not very extensive (on top of what already needs to happen after the
> resize), and it'd happen within the boundaries of the partitioning
> scheme. There's not that much flexibility.
>
> The last bit (pinning backends to a NUMA node) is experimental, and
> mostly intended for easier evaluation of the earlier parts (e.g. to
> limit the noise when processes get moved to a CPU from a different NUMA
> node, and so on).

Maybe the backend pinning can be done by replacing your patch on proc.c
with a call to an external profile manager doing exactly the same thing?

Similar to:
pmroutine = GetPmRoutineForInitProcess();
if (pmroutine != NULL &&
    pmroutine->init_process != NULL)
    pmroutine->init_process(MyProc);

...

pmroutine = GetPmRoutineForInitAuxilliary();
if (pmroutine != NULL &&
    pmroutine->init_auxilliary != NULL)
    pmroutine->init_auxilliary(MyProc);

Added in a few rare places, this should cover most if not all of the
requirements around process placement (process_shared_preload_libraries()
is called earlier in process creation, I believe).

--
Cédric Villemain +33 6 20 30 22 52
https://www.Data-Bene.io
PostgreSQL Support, Expertise, Training, R&D




Hi Andres,

> Hi,
> 
> On 2025-07-05 07:09:00 +0000, Cédric Villemain wrote:
>> In my work on more careful PostgreSQL resource management, I've come to the
>> conclusion that we should avoid pushing policy too deeply into the
>> PostgreSQL core itself. Therefore, I'm quite skeptical about integrating
>> NUMA-specific management directly into core PostgreSQL in such a way.
> 
> I think it's actually the opposite - whenever we pushed stuff like this
> outside of core it has hurt postgres substantially. Not having replication in
> core was a huge mistake. Not having HA management in core is probably the
> biggest current adoption hurdle for postgres.
> 
> To deal better with NUMA we need to improve memory placement and various
> algorithms, in an interrelated way - that's pretty much impossible to do
> outside of core.

Except for the backend pinning, which is easy to achieve, thus my comment
on the related patch.
I'm not claiming NUMA memory and the rest should be managed outside of
core (though I haven't read the other patches yet).

-- 
Cédric Villemain +33 6 20 30 22 52
https://www.Data-Bene.io
PostgreSQL Support, Expertise, Training, R&D









>> On 7/7/25 16:51, Cédric Villemain wrote:
>>>>> * Others might use it to integrate PostgreSQL's own resources (e.g.,
>>>>> "areas" of shared buffers) into policies.
>>>>>
>>>>> Hope this perspective is helpful.
>>>>
>>>> Can you explain how you want to manage this by an extension defined at
>>>> the SQL level, when most of this stuff has to be done when setting up
>>>> shared memory, which is waaaay before we have any access to catalogs?
>>>
>>> I should have said module instead, I didn't follow carefully but at some
>>> point there were discussion about shared buffers resized "on-line".
>>> Anyway, it was just to give some few examples, maybe this one is to be
>>> considered later (I'm focused on cgroup/psi, and precisely reassigning
>>> PIDs as needed).
>>>
>>
>> I don't know. I have a hard time imagining what exactly would the
>> policies / profiles do exactly to respond to changes in the system
>> utilization. And why should that interfere with this patch ...
>>
>> The main thing patch series aims to implement is partitioning different
>> pieces of shared memory (buffers, freelists, ...) to better work for
>> NUMA. I don't think there's that many ways to do this, and I doubt it
>> makes sense to make this easily customizable from external modules of
>> any kind. I can imagine providing some API allowing to isolate the
>> instance on selected NUMA nodes, but that's about it.
>>
>> Yes, there's some relation to the online resizing of shared buffers, in
>> which case we need to "refresh" some of the information. But AFAICS it's
>> not very extensive (on top of what already needs to happen after the
>> resize), and it'd happen within the boundaries of the partitioning
>> scheme. There's not that much flexibility.
>>
>> The last bit (pinning backends to a NUMA node) is experimental, and
>> mostly intended for easier evaluation of the earlier parts (e.g. to
>> limit the noise when processes get moved to a CPU from a different NUMA
>> node, and so on).
> 
> The backend pinning can be done by replacing your patch on proc.c to 
> call an external profile manager doing exactly the same thing maybe ?
> 
> Similar to:
> pmroutine = GetPmRoutineForInitProcess();
> if (pmroutine != NULL &&
>      pmroutine->init_process != NULL)
>      pmroutine->init_process(MyProc);
> 
> ...
> 
> pmroutine = GetPmRoutineForInitAuxilliary();
> if (pmroutine != NULL &&
>      pmroutine->init_auxilliary != NULL)
>      pmroutine->init_auxilliary(MyProc);
> 
> Added on some rare places should cover most if not all the requirement 
> around process placement (process_shared_preload_libraries() is called 
> earlier in the process creation I believe).
> 

After a first read I think this works for patches 002 and 005. For this 
last one, InitProcGlobal() may set things up as you do, but then expose 
the choice a bit later, basically in the places where you added the if 
condition on the GUC numa_procs_interleave.


-- 
Cédric Villemain +33 6 20 30 22 52
https://www.Data-Bene.io
PostgreSQL Support, Expertise, Training, R&D




Re: Adding basic NUMA awareness

From
Andres Freund
Date:
Hi,

On 2025-07-04 13:05:05 +0200, Jakub Wartak wrote:
> On Tue, Jul 1, 2025 at 9:07 PM Tomas Vondra <tomas@vondra.me> wrote:
> > I don't think the splitting would actually make some things simpler, or
> > maybe more flexible - in particular, it'd allow us to enable huge pages
> > only for some regions (like shared buffers), and keep the small pages
> > e.g. for PGPROC. So that'd be good.
> 
> You have made assumption that this is good, but small pages (4KB) are
> not hugetlb, and are *swappable* (Transparent HP are swappable too,
> manually allocated ones as with mmap(MMAP_HUGETLB) are not)[1]. The
> most frequent problem I see these days are OOMs, and it makes me
> believe that making certain critical parts of shared memory being
> swappable just to make pagesize granular is possibly throwing the baby
> out with the bathwater. I'm thinking about bad situations like: some
> wrong settings of vm.swapiness that people keep (or distros keep?) and
> general inability of PG to restrain from allocating more memory in
> some cases.

The reason it would be advantageous to put something like the procarray onto
smaller pages is that otherwise the entire procarray (unless particularly
large) ends up on a single NUMA node, increasing the latency for backends on
every other numa node and increasing memory traffic on that node.

Greetings,

Andres Freund



Re: Adding basic NUMA awareness

From
Tomas Vondra
Date:
On 7/8/25 05:04, Andres Freund wrote:
> Hi,
> 
> On 2025-07-04 13:05:05 +0200, Jakub Wartak wrote:
>> On Tue, Jul 1, 2025 at 9:07 PM Tomas Vondra <tomas@vondra.me> wrote:
>>> I don't think the splitting would actually make some things simpler, or
>>> maybe more flexible - in particular, it'd allow us to enable huge pages
>>> only for some regions (like shared buffers), and keep the small pages
>>> e.g. for PGPROC. So that'd be good.
>>
>> You have made assumption that this is good, but small pages (4KB) are
>> not hugetlb, and are *swappable* (Transparent HP are swappable too,
>> manually allocated ones as with mmap(MMAP_HUGETLB) are not)[1]. The
>> most frequent problem I see these days are OOMs, and it makes me
>> believe that making certain critical parts of shared memory being
>> swappable just to make pagesize granular is possibly throwing the baby
>> out with the bathwater. I'm thinking about bad situations like: some
>> wrong settings of vm.swapiness that people keep (or distros keep?) and
>> general inability of PG to restrain from allocating more memory in
>> some cases.
> 
> The reason it would be advantageous to put something like the procarray onto
> smaller pages is that otherwise the entire procarray (unless particularly
> large) ends up on a single NUMA node, increasing the latency for backends on
> every other numa node and increasing memory traffic on that node.
> 

That's why the patch series splits the procarray into multiple pieces,
so that it can be properly distributed on multiple NUMA nodes even with
huge pages. It requires adjusting a couple places accessing the entries,
but it surprised me how limited the impact was.

If we could selectively use 4KB pages for parts of the shared memory,
maybe this wouldn't be necessary. But it's not too annoying.

The thing I'm not sure about is how much this actually helps with the
traffic between nodes. Sure, if we pick a PGPROC from the same node, and
the task does not get moved, it'll be local traffic. But if the task
moves, there'll be cross-node traffic. I don't have any estimates of how
often this happens, e.g. for older tasks.

regards

-- 
Tomas Vondra




On 7/8/25 03:55, Cédric Villemain wrote:
> Hi Andres,
> 
>> Hi,
>>
>> On 2025-07-05 07:09:00 +0000, Cédric Villemain wrote:
>>> In my work on more careful PostgreSQL resource management, I've come
>>> to the
>>> conclusion that we should avoid pushing policy too deeply into the
>>> PostgreSQL core itself. Therefore, I'm quite skeptical about integrating
>>> NUMA-specific management directly into core PostgreSQL in such a way.
>>
>> I think it's actually the opposite - whenever we pushed stuff like this
>> outside of core it has hurt postgres substantially. Not having
>> replication in
>> core was a huge mistake. Not having HA management in core is probably the
>> biggest current adoption hurdle for postgres.
>>
>> To deal better with NUMA we need to improve memory placement and various
>> algorithms, in an interrelated way - that's pretty much impossible to do
>> outside of core.
> 
> Except the backend pinning which is easy to achieve, thus my comment on
> the related patch.
> I'm not claiming NUMA memory and all should be managed outside of core
> (though I didn't read other patches yet).
> 

But an "optimal backend placement" seems to very much depend on where we
placed the various pieces of shared memory. Which the external module
will have trouble following, I suspect.

I still don't have any idea what exactly the external module would do,
or how it would decide where to place the backend. Can you describe some
use case with an example?

Assuming we want to actually pin tasks from within Postgres, what I
think might work is allowing modules to "advise" on where to place the
task. But the decision would still be done by core.


regards

-- 
Tomas Vondra




Re: Adding basic NUMA awareness

From
Andres Freund
Date:
Hi,

On 2025-07-08 14:27:12 +0200, Tomas Vondra wrote:
> On 7/8/25 05:04, Andres Freund wrote:
> > On 2025-07-04 13:05:05 +0200, Jakub Wartak wrote:
> > The reason it would be advantageous to put something like the procarray onto
> > smaller pages is that otherwise the entire procarray (unless particularly
> > large) ends up on a single NUMA node, increasing the latency for backends on
> > every other numa node and increasing memory traffic on that node.
> > 
> 
> That's why the patch series splits the procarray into multiple pieces,
> so that it can be properly distributed on multiple NUMA nodes even with
> huge pages. It requires adjusting a couple places accessing the entries,
> but it surprised me how limited the impact was.

Sure, you can do that, but it does mean that iterations over the procarray now
have an added level of indirection...


> The thing I'm not sure about is how much this actually helps with the
> traffic between node. Sure, if we pick a PGPROC from the same node, and
> the task does not get moved, it'll be local traffic. But if the task
> moves, there'll be traffic. I don't have any estimates how often this
> happens, e.g. for older tasks.

I think the most important bit is to not put everything onto one numa node,
otherwise the increased memory contention is more likely to hurt latency
for *everyone*.

Greetings,

Andres Freund








> On 7/8/25 03:55, Cédric Villemain wrote:
>> Hi Andres,
>>
>>> Hi,
>>>
>>> On 2025-07-05 07:09:00 +0000, Cédric Villemain wrote:
>>>> In my work on more careful PostgreSQL resource management, I've come
>>>> to the
>>>> conclusion that we should avoid pushing policy too deeply into the
>>>> PostgreSQL core itself. Therefore, I'm quite skeptical about integrating
>>>> NUMA-specific management directly into core PostgreSQL in such a way.
>>>
>>> I think it's actually the opposite - whenever we pushed stuff like this
>>> outside of core it has hurt postgres substantially. Not having
>>> replication in
>>> core was a huge mistake. Not having HA management in core is probably the
>>> biggest current adoption hurdle for postgres.
>>>
>>> To deal better with NUMA we need to improve memory placement and various
>>> algorithms, in an interrelated way - that's pretty much impossible to do
>>> outside of core.
>>
>> Except the backend pinning which is easy to achieve, thus my comment on
>> the related patch.
>> I'm not claiming NUMA memory and all should be managed outside of core
>> (though I didn't read other patches yet).
>>
> 
> But an "optimal backend placement" seems to very much depend on where we
> placed the various pieces of shared memory. Which the external module
> will have trouble following, I suspect.
> 
> I still don't have any idea what exactly would the external module do,
> how would it decide where to place the backend. Can you describe some
> use case with an example?
> 
> Assuming we want to actually pin tasks from within Postgres, what I
> think might work is allowing modules to "advise" on where to place the
> task. But the decision would still be done by core.

Possibly exactly what you're doing in proc.c when managing the allocation 
of processes, but not hardcoded in PostgreSQL (patches 02, 05 and 06 are 
good candidates). I didn't get the impression that they require information 
not available to a process executing code from a module.

Parts of your code where you assign/define policy could live in one or 
more relevant routines of a "numa profile manager", like an 
initProcessRoutine(), registered in a pmroutine struct:

pmroutine = GetPmRoutineForInitProcess();
if (pmroutine != NULL &&
     pmroutine->init_process != NULL)
     pmroutine->init_process(MyProc);

This way it's easier to manage alternative policies, and also to adjust 
when the hardware and the Linux kernel change.


-- 
Cédric Villemain +33 6 20 30 22 52
https://www.Data-Bene.io
PostgreSQL Support, Expertise, Training, R&D





On 7/8/25 18:06, Cédric Villemain wrote:
> 
> 
> 
> 
> 
> 
>> On 7/8/25 03:55, Cédric Villemain wrote:
>>> Hi Andres,
>>>
>>>> Hi,
>>>>
>>>> On 2025-07-05 07:09:00 +0000, Cédric Villemain wrote:
>>>>> In my work on more careful PostgreSQL resource management, I've come
>>>>> to the
>>>>> conclusion that we should avoid pushing policy too deeply into the
>>>>> PostgreSQL core itself. Therefore, I'm quite skeptical about
>>>>> integrating
>>>>> NUMA-specific management directly into core PostgreSQL in such a way.
>>>>
>>>> I think it's actually the opposite - whenever we pushed stuff like this
>>>> outside of core it has hurt postgres substantially. Not having
>>>> replication in
>>>> core was a huge mistake. Not having HA management in core is
>>>> probably the
>>>> biggest current adoption hurdle for postgres.
>>>>
>>>> To deal better with NUMA we need to improve memory placement and
>>>> various
>>>> algorithms, in an interrelated way - that's pretty much impossible
>>>> to do
>>>> outside of core.
>>>
>>> Except the backend pinning which is easy to achieve, thus my comment on
>>> the related patch.
>>> I'm not claiming NUMA memory and all should be managed outside of core
>>> (though I didn't read other patches yet).
>>>
>>
>> But an "optimal backend placement" seems to very much depend on where we
>> placed the various pieces of shared memory. Which the external module
>> will have trouble following, I suspect.
>>
>> I still don't have any idea what exactly would the external module do,
>> how would it decide where to place the backend. Can you describe some
>> use case with an example?
>>
>> Assuming we want to actually pin tasks from within Postgres, what I
>> think might work is allowing modules to "advise" on where to place the
>> task. But the decision would still be done by core.
> 
> Possibly exactly what you're doing in proc.c when managing allocation of
> process, but not hardcoded in postgresql (patches 02, 05 and 06 are good
> candidates), I didn't get that they require information not available to
> any process executing code from a module.
> 

Well, it needs to understand how some other stuff (especially PGPROC
entries) is distributed between nodes. I'm not sure how much of this
internal information we want to expose outside core ...

> Parts of your code where you assign/define policy could be in one or
> more relevant routines of a "numa profile manager", like in an
> initProcessRoutine(), and registered in pmroutine struct:
> 
> pmroutine = GetPmRoutineForInitProcess();
> if (pmroutine != NULL &&
>     pmroutine->init_process != NULL)
>     pmroutine->init_process(MyProc);
> 
> This way it's easier to manage alternative policies, and also to be able
> to adjust when hardware and linux kernel changes.
> 

I'm not against making this extensible, in some way. But I still
struggle to imagine a reasonable alternative policy, where the external
module gets the same information and ends up with a different decision.

So what would the alternate policy look like? What use case would the
module be supporting?


regards

-- 
Tomas Vondra




> On 7/8/25 18:06, Cédric Villemain wrote:
>>
>>
>>
>>
>>
>>
>>> On 7/8/25 03:55, Cédric Villemain wrote:
>>>> Hi Andres,
>>>>
>>>>> Hi,
>>>>>
>>>>> On 2025-07-05 07:09:00 +0000, Cédric Villemain wrote:
>>>>>> In my work on more careful PostgreSQL resource management, I've come
>>>>>> to the
>>>>>> conclusion that we should avoid pushing policy too deeply into the
>>>>>> PostgreSQL core itself. Therefore, I'm quite skeptical about
>>>>>> integrating
>>>>>> NUMA-specific management directly into core PostgreSQL in such a way.
>>>>>
>>>>> I think it's actually the opposite - whenever we pushed stuff like this
>>>>> outside of core it has hurt postgres substantially. Not having
>>>>> replication in
>>>>> core was a huge mistake. Not having HA management in core is
>>>>> probably the
>>>>> biggest current adoption hurdle for postgres.
>>>>>
>>>>> To deal better with NUMA we need to improve memory placement and
>>>>> various
>>>>> algorithms, in an interrelated way - that's pretty much impossible
>>>>> to do
>>>>> outside of core.
>>>>
>>>> Except the backend pinning which is easy to achieve, thus my comment on
>>>> the related patch.
>>>> I'm not claiming NUMA memory and all should be managed outside of core
>>>> (though I didn't read other patches yet).
>>>>
>>>
>>> But an "optimal backend placement" seems to very much depend on where we
>>> placed the various pieces of shared memory. Which the external module
>>> will have trouble following, I suspect.
>>>
>>> I still don't have any idea what exactly would the external module do,
>>> how would it decide where to place the backend. Can you describe some
>>> use case with an example?
>>>
>>> Assuming we want to actually pin tasks from within Postgres, what I
>>> think might work is allowing modules to "advise" on where to place the
>>> task. But the decision would still be done by core.
>>
>> Possibly exactly what you're doing in proc.c when managing allocation of
>> process, but not hardcoded in postgresql (patches 02, 05 and 06 are good
>> candidates), I didn't get that they require information not available to
>> any process executing code from a module.
>>
>
> Well, it needs to understand how some other stuff (especially PGPROC
> entries) is distributed between nodes. I'm not sure how much of this
> internal information we want to expose outside core ...
>
>> Parts of your code where you assign/define policy could be in one or
>> more relevant routines of a "numa profile manager", like in an
>> initProcessRoutine(), and registered in pmroutine struct:
>>
>> pmroutine = GetPmRoutineForInitProcess();
>> if (pmroutine != NULL &&
>>      pmroutine->init_process != NULL)
>>      pmroutine->init_process(MyProc);
>>
>> This way it's easier to manage alternative policies, and also to be able
>> to adjust when hardware and linux kernel changes.
>>
>
> I'm not against making this extensible, in some way. But I still
> struggle to imagine a reasonable alternative policy, where the external
> module gets the same information and ends up with a different decision.
>
> So what would the alternate policy look like? What use case would the
> module be supporting?


That's the whole point: there are very distinct usages of PostgreSQL in
the field. And maybe not all of them will require the policy defined by
PostgreSQL core.

May I ask the reverse: what prevents external modules from making those
decisions? There are already a lot of areas where external code can take
over PostgreSQL processing, as Neon does.

There is some very early processing for memory setup that I can see as
a current blocker, and here I'd prefer a more compliant NUMA API as
proposed by Jakub, so it's possible to arrange things based on workload,
hardware configuration or other matters. Reworking to get distinct
segments and so on as you do is great, and a combination of both
approaches is probably of great interest. There is also the weighted
interleave being discussed, and probably much more to come in this
area in Linux.

Some points about possible distinct policies were already raised. I am
precisely claiming that it is hard to come up with one good policy with
limited setup options, thus the requirement to keep this flexible enough
(hooks, an API, 100 GUCs?).

There is an EPYC story here also: given that the NUMA setup can vary
depending on the BIOS configuration, the associated NUMA policy must
probably take that into account (L3 can be either a real cache or 4 extra
"local" NUMA nodes, with highly distinct access costs from a RAM module).
Does that change how PostgreSQL will place memory and processes? Is it
important or of interest?


--
Cédric Villemain +33 6 20 30 22 52
https://www.Data-Bene.io
PostgreSQL Support, Expertise, Training, R&D




Hi,

On Wed, Jul 09, 2025 at 06:40:00AM +0000, Cédric Villemain wrote:
> > On 7/8/25 18:06, Cédric Villemain wrote:
> > I'm not against making this extensible, in some way. But I still
> > struggle to imagine a reasonable alternative policy, where the external
> > module gets the same information and ends up with a different decision.
> > 
> > So what would the alternate policy look like? What use case would the
> > module be supporting?
> 
> 
> That's the whole point: there are very distinct usages of PostgreSQL in the
> field. And maybe not all of them will require the policy defined by
> PostgreSQL core.
> 
> May I ask the reverse: what prevent external modules from taking those
> decisions ? There are already a lot of area where external code can take
> over PostgreSQL processing, like Neon is doing.
> 
> There are some very early processing for memory setup that I can see as a
> current blocker, and here I'd refer a more compliant NUMA api as proposed by
> Jakub so it's possible to arrange based on workload, hardware configuration
> or other matters. Reworking to get distinct segment and all as you do is
> great, and combo of both approach probably of great interest.

I think that Tomas's approach helps to have more "predictable" performance
expectations, I mean more consistent over time, fewer "surprises".

While your approach (and Jakub's) could help to get performance gains
depending on a "known" context (so it is less generic).

So, probably having both could make sense but I think that they serve different
purposes.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Adding basic NUMA awareness

From
Jakub Wartak
Date:
On Tue, Jul 8, 2025 at 2:56 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2025-07-08 14:27:12 +0200, Tomas Vondra wrote:
> > On 7/8/25 05:04, Andres Freund wrote:
> > > On 2025-07-04 13:05:05 +0200, Jakub Wartak wrote:
> > > The reason it would be advantageous to put something like the procarray onto
> > > smaller pages is that otherwise the entire procarray (unless particularly
> > > large) ends up on a single NUMA node, increasing the latency for backends on
> > > every other numa node and increasing memory traffic on that node.
> > >

Sure thing, I fully understand the motivation and the underlying reason
(without claiming that I understand the exact memory access patterns
involving procarray/PGPROC/etc. and the hotspots involved on the PG side).
Any pgbench one-liner to really easily stress PGPROC or the procarray?

> > That's why the patch series splits the procarray into multiple pieces,
> > so that it can be properly distributed on multiple NUMA nodes even with
> > huge pages. It requires adjusting a couple places accessing the entries,
> > but it surprised me how limited the impact was.

Yes, and we are discussing whether it is worth going to smaller pages
for such use cases (e.g. 4kB pages without hugetlb, with 2MB huge pages,
or, even more wasteful, 1GB hugetlb if we don't request 2MB for some
small structs; btw, we have the ability to select MAP_HUGE_2MB vs
MAP_HUGE_1GB). I'm thinking about two problems:
- 4kB pages are swappable, and mlock() potentially (?) disarms NUMA
autobalancing
- using libnuma often leads to MPOL_BIND, which disarms NUMA
autobalancing, BUT apparently there are set_mempolicy(2)/mbind(2), and
since the 5.12+ kernel they can take an additional flag,
MPOL_F_NUMA_BALANCING(!), so this looks like it has the potential to move
memory anyway (if way too many tasks are relocated, so would be the
memory?). It is available only in recent libnuma as
numa_set_membind_balancing(3), but sadly there's no way via libnuma to
do mbind(MPOL_F_NUMA_BALANCING) for a specific addr only? I mean it
would have to be something like MPOL_F_NUMA_BALANCING | MPOL_PREFERRED
(select one node from many for each node while still allowing
balancing?), but in [1][2] (2024) it is stated that "It's not
legitimate (yet) to use MPOL_PREFERRED + MPOL_F_NUMA_BALANCING.", though
maybe things have improved since then.

Something like:
PGPROC/procarray 2MB page for node#1 - mbind(addr1,
MPOL_F_NUMA_BALANCING | MPOL_PREFERRED, [0,1]);
PGPROC/procarray 2MB page for node#2 - mbind(addr2,
MPOL_F_NUMA_BALANCING | MPOL_PREFERRED, [1,0]);

> Sure, you can do that, but it does mean that iterations over the procarray now
> have an added level of indirection...

So the most efficient would be the old way (no indirection) vs the
NUMA way? Can this be done without #ifdefs at all?

> > The thing I'm not sure about is how much this actually helps with the
> > traffic between node. Sure, if we pick a PGPROC from the same node, and
> > the task does not get moved, it'll be local traffic. But if the task
> > moves, there'll be traffic.

With MPOL_F_NUMA_BALANCING, that should "auto-tune" in the worst case?

> > I don't have any estimates how often this happens, e.g. for older tasks.

We could measure: kernel 6.16+ has a per-PID numa_task_migrated in
/proc/{PID}/sched, but I assume we would have to throw backends >>
VCPUs at it, to simulate reality and do some "waves" between different
activity periods of certain pools. (I can imagine a worst-case scenario:
a) pgbench "a" opens $VCPU connections, all idle, with a script to sleep
for a while
b) pgbench "b" opens some $VCPU new connections to some other DB, all
active from the start (tpcbb or readonly)
c) manually pin CPUs using taskset for each PID, all from "b", to a
specific NUMA node #2 -- just to simulate an unfortunate app working on
every 2nd conn
d) pgbench "a" starts working and hits CPU imbalance -- e.g. NUMA node
#1 is idle, #2 is full, the CPU scheduler starts putting "a" backends on
CPUs from #1, and we should notice PIDs being migrated)

> I think the most important bit is to not put everything onto one numa node,
> otherwise the chance of increased latency for *everyone* due to the increased
> memory contention is more likely to hurt.

-J.

p.s. I hope I wrote this in an understandable way, because I had many
interruptions, so if anything is unclear please let me know.

[1] - https://lkml.org/lkml/2024/7/3/352
[2] - https://lkml.rescloud.iu.edu/2402.2/03227.html



Re: Adding basic NUMA awareness

From
Andres Freund
Date:
Hi,

On 2025-07-02 14:36:31 +0200, Tomas Vondra wrote:
> On 7/2/25 13:37, Ashutosh Bapat wrote:
> > On Wed, Jul 2, 2025 at 12:37 AM Tomas Vondra <tomas@vondra.me> wrote:
> >>
> >>
> >> 3) v1-0003-freelist-Don-t-track-tail-of-a-freelist.patch
> >>
> >> Minor optimization. Andres noticed we're tracking the tail of buffer
> >> freelist, without using it. So the patch removes that.
> >>
> >
> > The patches for resizing buffers use the lastFreeBuffer to add new
> > buffers to the end of free list when expanding it. But we could as
> > well add it at the beginning of the free list.

Yea, I don't see any point in adding buffers to the tail instead of to the
front. We probably want more recently used buffers at the front, since they
(and the associated BufferDesc) are more likely to be in a CPU cache.


> > This patch seems almost independent of the rest of the patches. Do you
> > need it in the rest of the patches? I understand that those patches
> > don't need to worry about maintaining lastFreeBuffer after this patch.
> > Is there any other effect?
> >
> > If we are going to do this, let's do it earlier so that buffer
> > resizing patches can be adjusted.
> >
>
> My patches don't particularly rely on this bit, it would work even with
> lastFreeBuffer. I believe Andres simply noticed the current code does
> not use lastFreeBuffer, it just maintains it, so he removed that as an
> optimization.

Optimization / simplification.  When building multiple freelists it was
harder to maintain the tail pointer, and since it was never used...

+1 to just applying that part.


> I don't know how significant the improvement is, but if it's measurable we
> could just do that independently of our patches.

I doubt it's really an improvement in any realistic scenario, but it's also
not a regression in any way, since it's never used...


FWIW, I've started to wonder if we shouldn't just get rid of the freelist
entirely. While clocksweep is perhaps minutely slower in a single thread than
the freelist, clock sweep scales *considerably* better [1]. As it's rather
rare to be bottlenecked on clock sweep speed for a single thread (rather than
IO or memory copy overhead), I think it's worth favoring clock sweep.

Also needing to switch between getting buffers from the freelist and the sweep
makes the code more expensive.  I think just having the buffer in the sweep,
with a refcount / usagecount of zero would suffice.

That seems particularly advantageous if we invest energy in making the clock
sweep deal well with NUMA systems, because we don't need to have both a NUMA
aware freelist and a NUMA aware clock sweep.

Greetings,

Andres Freund

[1]

A single pg_prewarm of a large relation shows a difference between using the
freelist and not that's around the noise level, whereas 40 parallel
pg_prewarms of separate relations is over 5x faster when disabling the
freelist.

For the test:
- I modified pg_buffercache_evict_* to put buffers onto the freelist

- Ensured all of shared buffers is allocated by querying
  pg_shmem_allocations_numa, as otherwise the workload is dominated by the
  kernel zeroing out buffers

- used shared_buffers bigger than the data

- data for single threaded is 9.7GB, data for the parallel case is 40
  relations of 610MB each.

- in the single threaded case I pinned postgres to a single core, to make sure
  core-to-core variation doesn't play a role

- single threaded case

  c=1 && psql -Xq -c "select pg_buffercache_evict_all()" \
    -c 'SELECT numa_node, sum(size), count(*) FROM pg_shmem_allocations_numa WHERE size != 0 GROUP BY numa_node;' \
    && pgbench -n -P1 -c$c -j$c -f <(echo "SELECT pg_prewarm('copytest_large');") -t1

  concurrent case:

  c=40 && psql -Xq -c "select pg_buffercache_evict_all()" \
    -c 'SELECT numa_node, sum(size), count(*) FROM pg_shmem_allocations_numa WHERE size != 0 GROUP BY numa_node;' \
    && pgbench -n -P1 -c$c -j$c -f <(echo "SELECT pg_prewarm('copytest_:client_id');") -t1



Re: Adding basic NUMA awareness

From
Greg Burd
Date:

On Jul 9 2025, at 12:35 pm, Andres Freund <andres@anarazel.de> wrote:

> FWIW, I've started to wonder if we shouldn't just get rid of the freelist
> entirely. While clocksweep is perhaps minutely slower in a single
> thread than
> the freelist, clock sweep scales *considerably* better [1]. As it's rather
> rare to be bottlenecked on clock sweep speed for a single thread
> (rather than
> IO or memory copy overhead), I think it's worth favoring clock sweep.

Hey Andres, thanks for spending time on this.  I've worked before on
freelist implementations (last one in LMDB) and I think you're onto
something.  I think it's an innovative idea and that the speed
difference will either be lost in the noise or potentially entirely
mitigated by avoiding duplicate work.

> Also needing to switch between getting buffers from the freelist and
> the sweep
> makes the code more expensive.  I think just having the buffer in the sweep,
> with a refcount / usagecount of zero would suffice.

If you're not already coding this, I'll jump in. :)
 
> That seems particularly advantageous if we invest energy in making the clock
> sweep deal well with NUMA systems, because we don't need to have both a NUMA
> aware freelist and a NUMA aware clock sweep.

100% agree here, very clever approach adapting clock sweep to a NUMA world.

best.

-greg

> 
> Greetings,
> 
> Andres Freund



Re: Adding basic NUMA awareness

From
Andres Freund
Date:
Hi,

On 2025-07-09 12:04:00 +0200, Jakub Wartak wrote:
> On Tue, Jul 8, 2025 at 2:56 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2025-07-08 14:27:12 +0200, Tomas Vondra wrote:
> > > On 7/8/25 05:04, Andres Freund wrote:
> > > > On 2025-07-04 13:05:05 +0200, Jakub Wartak wrote:
> > > > The reason it would be advantageous to put something like the procarray onto
> > > > smaller pages is that otherwise the entire procarray (unless particularly
> > > > large) ends up on a single NUMA node, increasing the latency for backends on
> > > > every other numa node and increasing memory traffic on that node.
> > > >
> 
> Sure thing, I fully understand the motivation and underlying reason
> (without claiming that I understand the exact memory access patterns
> that involve procarray/PGPROC/etc and hotspots involved from PG side).
> Any single-liner pgbench help for how to really easily stress the
> PGPROC or procarray?

Unfortunately it's probably going to be slightly more complicated workloads
that show the effect - the very simplest cases don't iterate through the
procarray itself anymore.


> > > That's why the patch series splits the procarray into multiple pieces,
> > > so that it can be properly distributed on multiple NUMA nodes even with
> > > huge pages. It requires adjusting a couple places accessing the entries,
> > > but it surprised me how limited the impact was.
> 
> Yes, and we are discussing whether it is worth getting into smaller pages
> for such use cases (e.g. 4kB pages without hugetlb instead of 2MB huge
> pages, or, even more wasteful, 1GB hugetlb if we don't request 2MB for
> some small structs; btw, we have the ability to select MAP_HUGE_2MB vs
> MAP_HUGE_1GB). I'm thinking about two problems:
> - 4kB pages are swappable, and mlock() potentially (?) disarms NUMA autobalancing

I'm not really bought into this being a problem.  If your system has enough
pressure to swap out the PGPROC array, you're so hosed that this won't make a
difference.


> - using libnuma often leads to MPOL_BIND which disarms NUMA
> autobalancing, BUT apparently there are set_mempolicy(2)/mbind(2) and
> since 5.12+ kernel they can take additional flag
> MPOL_F_NUMA_BALANCING(!), so this looks like it has potential to move
> memory anyway (if way too many tasks are relocated, so would be
> memory?). It is available only in recent libnuma as
> numa_set_membind_balancing(3), but sadly there's no way via libnuma to
> do mbind(MPOL_F_NUMA_BALANCING) for a specific addr only? I mean it
> would have to be something like MPOL_F_NUMA_BALANCING | MPOL_PREFERRED?
> (select one node from many for each node while still allowing
> balancing?), but in [1][2] (2024) it is stated that "It's not
> legitimate (yet) to use MPOL_PREFERRED + MPOL_F_NUMA_BALANCING.", but
> maybe stuff has been improved since then.
> 
> Something like:
> PGPROC/procarray 2MB page for node#1 - mbind(addr1,
> MPOL_F_NUMA_BALANCING | MPOL_PREFERRED, [0,1]);
> PGPROC/procarray 2MB page for node#2 - mbind(addr2,
> MPOL_F_NUMA_BALANCING | MPOL_PREFERRED, [1,0]);

I'm rather doubtful that it's a good idea to combine numa awareness with numa
balancing. Numa balancing adds latency and makes it much more expensive for
userspace to act in a numa aware way, since it needs to regularly update its
knowledge about where memory resides.


> > Sure, you can do that, but it does mean that iterations over the procarray now
> > have an added level of indirection...
> 
> So the most efficient would be the old way (no indirections) versus the
> NUMA way? Can this be done without #ifdefs at all?

If we used 4k pages for the procarray we would just have ~4 procs on one page;
if that range were marked as interleaved, it'd probably suffice.


> > > The thing I'm not sure about is how much this actually helps with the
> > > traffic between node. Sure, if we pick a PGPROC from the same node, and
> > > the task does not get moved, it'll be local traffic. But if the task
> > > moves, there'll be traffic.
> 
> With MPOL_F_NUMA_BALANCING, that should "auto-tune" in the worst case?

I doubt that NUMA balancing is going to help a whole lot here, there are too
many procs on one page for that to be helpful.  One thing that might be worth
doing is to *increase* the size of PGPROC by moving other pieces of data that
are keyed by ProcNumber into PGPROC.

I think the main thing to avoid is the case where all of PGPROC, buffer
mapping table, ... resides on one NUMA node (e.g. because it's the one
postmaster was scheduled on), as the increased memory traffic will lead to
queries on that node being slower than the other node.

Greetings,

Andres Freund



Re: Adding basic NUMA awareness

From
Andres Freund
Date:
Hi,

On 2025-07-09 12:55:51 -0400, Greg Burd wrote:
> On Jul 9 2025, at 12:35 pm, Andres Freund <andres@anarazel.de> wrote:
>
> > FWIW, I've started to wonder if we shouldn't just get rid of the freelist
> > entirely. While clocksweep is perhaps minutely slower in a single
> > thread than
> > the freelist, clock sweep scales *considerably* better [1]. As it's rather
> > rare to be bottlenecked on clock sweep speed for a single thread
> (rather than
> > IO or memory copy overhead), I think it's worth favoring clock sweep.
>
> Hey Andres, thanks for spending time on this.  I've worked before on
> freelist implementations (last one in LMDB) and I think you're onto
> something.  I think it's an innovative idea and that the speed
> difference will either be lost in the noise or potentially entirely
> mitigated by avoiding duplicate work.

Agreed. FWIW, just using clock sweep actually makes things like DROP TABLE
perform better because it doesn't need to maintain the freelist anymore...


> > Also needing to switch between getting buffers from the freelist and
> > the sweep
> > makes the code more expensive.  I think just having the buffer in the sweep,
> > with a refcount / usagecount of zero would suffice.
>
> If you're not already coding this, I'll jump in. :)

My experimental patch is literally a four character addition ;), namely adding
"0 &&" to the relevant code in StrategyGetBuffer().

Obviously a real patch would need to do some more work than that.  Feel free
to take on that project, I am not planning on tackling that in near term.


There's other things around this that could use some attention. It's not hard
to see clock sweep be a bottleneck in concurrent workloads - partially due to
the shared maintenance of the clock hand. A NUMAed clock sweep would address
that. However, we also maintain StrategyControl->numBufferAllocs, which is a
significant contention point and would not necessarily be removed by a
NUMAification of the clock sweep.

Greetings,

Andres Freund



Hi,

On 2025-07-08 16:06:00 +0000, Cédric Villemain wrote:
> > Assuming we want to actually pin tasks from within Postgres, what I
> > think might work is allowing modules to "advise" on where to place the
> > task. But the decision would still be done by core.
> 
> Possibly exactly what you're doing in proc.c when managing allocation of
> process, but not hardcoded in postgresql (patches 02, 05 and 06 are good
> candidates), I didn't get that they require information not available to any
> process executing code from a module.

> Parts of your code where you assign/define policy could be in one or more
> relevant routines of a "numa profile manager", like in an
> initProcessRoutine(), and registered in pmroutine struct:
> 
> pmroutine = GetPmRoutineForInitProcess();
> if (pmroutine != NULL &&
>     pmroutine->init_process != NULL)
>     pmroutine->init_process(MyProc);
> 
> This way it's easier to manage alternative policies, and also to be able to
> adjust when hardware and linux kernel changes.

I am doubtful this makes sense - as you can see patch 05 needs to change a
fair bit of core code to make this work, there's no way we can delegate much
of that to an extension.

But even if it's doable, I think it's *very* premature to focus on such
extensibility at this point - we need to get the basics into a mergeable
state, if you then want to argue for adding extensibility, we can do that at
this stage.  Trying to design this for extensibility from the get go, where
that extensibility is very unlikely to be used widely, seems rather likely to
just tank this entire project without getting us anything in return.

Greetings,

Andres Freund



Re: Adding basic NUMA awareness

From
Andres Freund
Date:
Hi,

Thanks for working on this!  I think it's an area we have long neglected...


On 2025-07-01 21:07:00 +0200, Tomas Vondra wrote:
> Each patch has a numa_ GUC, intended to enable/disable that part. This
> is meant to make development easier, not as a final interface. I'm not
> sure how exactly that should look. It's possible some combinations of
> GUCs won't work, etc.

Wonder if some of it might be worth putting into a multi-valued GUC (like
debug_io_direct).



> 1) v1-0001-NUMA-interleaving-buffers.patch
>
> This is the main thing when people think about NUMA - making sure the
> shared buffers are allocated evenly on all the nodes, not just on a
> single node (which can happen easily with warmup). The regular memory
> interleaving would address this, but it also has some disadvantages.
>
> Firstly, it's oblivious to the contents of the shared memory segment,
> and we may not want to interleave everything. It's also oblivious to
> alignment of the items (a buffer can easily end up "split" on multiple
> NUMA nodes), or relationship between different parts (e.g. there's a
> BufferBlock and a related BufferDescriptor, and those might again end up
> on different nodes).

Two more disadvantages:

With OS interleaving postgres doesn't (not easily at least) know about what
maps to what, which means postgres can't do stuff like numa aware buffer
replacement.

With OS interleaving the interleaving is "too fine grained", with pages being
mapped at each page boundary, making it less likely for things like one
strategy ringbuffer to reside on a single numa node.


I wonder if we should *increase* the size of shared_buffers whenever huge
pages are in use and there's padding space due to the huge page
boundaries. Pretty pointless to waste that memory if we can instead use it for
the buffer pool.  Not that big a deal with 2MB huge pages, but with 1GB huge
pages...



> 4) v1-0004-NUMA-partition-buffer-freelist.patch
>
> Right now we have a single freelist, and in busy instances that can be
> quite contended. What's worse, the freelist may trash between different
> CPUs, NUMA nodes, etc. So the idea is to have multiple freelists on
> subsets of buffers. The patch implements multiple strategies how the
> list can be split (configured using "numa_partition_freelist" GUC), for
> experimenting:
>
> * node - One list per NUMA node. This is the most natural option,
> because we now know which buffer is on which node, so we can ensure a
> list for a node only has buffers from that list.
>
> * cpu - One list per CPU. Pretty simple, each CPU gets its own list.
>
> * pid - Similar to "cpu", but the processes are mapped to lists based on
> PID, not CPU ID.
>
> * none - nothing, single freelist
>
> Ultimately, I think we'll want to go with "node", simply because it
> aligns with the buffer interleaving. But there are improvements needed.

I think we might eventually want something more granular than just "node" -
the freelist (and the clock sweep) can become a contention point even within
one NUMA node.  I'm imagining something like an array of freelists/clocksweep
states, where the current numa node selects a subset of the array and the cpu
is used to choose the entry within that list.

But we can do that later, that should be a fairly simple extension of what
you're doing.



> The other missing part is clocksweep - there's still just a single
> instance of clocksweep, feeding buffers to all the freelists. But that's
> clearly a problem, because the clocksweep returns buffers from all NUMA
> nodes. The clocksweep really needs to be partitioned the same way as a
> freelists, and each partition will operate on a subset of buffers (from
> the right NUMA node).
>
> I do have a separate experimental patch doing something like that, I
> need to make it part of this branch.

I'm really curious about that patch, as I wrote elsewhere in this thread, I
think we should just get rid of the freelist altogether.  Even if we don't do
so, in a steady state system the clock sweep is commonly much more important
than the freelist...


> 5) v1-0005-NUMA-interleave-PGPROC-entries.patch
>
> Another area that seems like it might benefit from NUMA is PGPROC, so I
> gave it a try. It turned out somewhat challenging. Similarly to buffers
> we have two pieces that need to be located in a coordinated way - PGPROC
> entries and fast-path arrays. But we can't use the same approach as for
> buffers/descriptors, because
>
> (a) Neither of those pieces aligns with memory page size (PGPROC is
> ~900B, fast-path arrays are variable length).

> (b) We could pad PGPROC entries e.g. to 1KB, but that'd still require
> rather high max_connections before we use multiple huge pages.

We should probably pad them regardless? Right now sizeof(PGPROC) happens to be
multiple of 64 (i.e. the most common cache line size), but that hasn't always
been the case, and isn't the case on systems with 128-byte cachelines like
common ARMv8 systems.  And having one cacheline hold one backend's fast-path
state and another backend's xmin doesn't sound like a recipe for good
performance.

Seems like we should also do some reordering of the contents within PGPROC. We
e.g. have very frequently changing data (->waitStatus, ->lwWaiting) in
the same cacheline as almost immutable data (->pid, ->pgxactoff,
->databaseId).


> So what I did instead is splitting the whole PGPROC array into one array
> per NUMA node, and one array for auxiliary processes and 2PC xacts. So
> with 4 NUMA nodes there are 5 separate arrays, for example. Each array
> is a multiple of memory pages, so we may waste some of the memory. But
> that's simply how NUMA works - page granularity.

Theoretically we could use the "padding" memory at the end of each NUMA node's
PGPROC array for the 2PC entries, for those we presumably don't care about
locality.  Not sure it's worth the complexity though.


For a while I thought I had a better solution: Given that we're going to waste
all the "padding" memory, why not just oversize the PGPROC array so that it
spans the required number of NUMA nodes?

The problem is that that would cause ProcNumbers to get much larger, and we
do have other arrays that are keyed by ProcNumber. Which probably makes this
not so great an idea.


> This however makes one particular thing harder - in a couple places we
> accessed PGPROC entries through PROC_HDR->allProcs, which was pretty
> much just one large array. And GetNumberFromPGProc() relied on array
> arithmetics to determine procnumber. With the array partitioned, this
> can't work the same way.
>
> But there's a simple solution - if we turn allProcs into an array of
> *pointers* to PGPROC arrays, there's no issue. All the places need a
> pointer anyway. And then we need an explicit procnumber field in PGPROC,
> instead of calculating it.
>
> There's a chance this has a negative impact on code that accesses PGPROC
> very often, but so far I haven't seen such cases. But if you can come up
> with such examples, I'd like to see those.

I'd not be surprised if there were overhead, adding a level of indirection to
things like ProcArrayGroupClearXid(), GetVirtualXIDsDelayingChkpt(),
SignalVirtualTransaction() probably won't be free.

BUT: For at least some of these a better answer might be to add additional
"dense" arrays like we have for xids etc, so they don't need to trawl through
PGPROCs.


> There's another detail - when obtaining a PGPROC entry in InitProcess(),
> we try to get an entry from the same NUMA node. And only if that doesn't
> work, we grab the first one from the list (there's still just one PGPROC
> freelist, I haven't split that - maybe we should?).

I guess it might be worth partitioning the freelist; iterating through a few
thousand links just to discover that there's no free proc on the current numa
node, while holding a spinlock, doesn't sound great. Even if it's likely
rarely a huge issue compared to other costs.


> The other thing I haven't thought about very much is determining on
> which CPUs/nodes the instance is allowed to run. I assume we'd start by
> simply inherit/determine that at the start through libnuma, not through
> some custom PG configuration (which the patch [2] proposed to do).

That seems like the right thing to me.


One thing that this patchset afaict doesn't address so far is that there is a
fair bit of other important shared memory that this patch doesn't set up
intelligently e.g. the buffer mapping table itself (but there are loads of
other cases). Because we touch a lot of that memory during startup, most of it
will be allocated on whatever NUMA node postmaster was scheduled on.  I suspect
that the best we can do for parts of shared memory where we don't have
explicit NUMA awareness is to default to an interleave policy.


> From 9712e50d6d15c18ea2c5fcf457972486b0d4ef53 Mon Sep 17 00:00:00 2001
> From: Tomas Vondra <tomas@vondra.me>
> Date: Tue, 6 May 2025 21:12:21 +0200
> Subject: [PATCH v1 1/6] NUMA: interleaving buffers
>
> Ensure shared buffers are allocated from all NUMA nodes, in a balanced
> way, instead of just using the node where Postgres initially starts, or
> where the kernel decides to migrate the page, etc. With pre-warming
> performed by a single backend, this can easily result in severely
> unbalanced memory distribution (with most from a single NUMA node).
>
> The kernel would eventually move some of the memory to other nodes
> (thanks to zone_reclaim), but that tends to take a long time. So this
> patch improves predictability, reduces the time needed for warmup
> during benchmarking, etc.  It's less dependent on what the CPU
> scheduler does, etc.

FWIW, I don't think zone_reclaim_mode will commonly do that? Even if enabled,
which I don't think it is anymore by default.  At least huge pages can't be
reclaimed by the kernel, but even when not using huge pages, I think the only
scenario where that would happen is if shared_buffers were swapped out.

Numa balancing might eventually "fix" such an imbalance though.


> diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
> index ed1dc488a42..2ad34624c49 100644
> --- a/src/backend/storage/buffer/buf_init.c
> +++ b/src/backend/storage/buffer/buf_init.c
> @@ -14,9 +14,17 @@
>   */
>  #include "postgres.h"
>
> +#ifdef USE_LIBNUMA
> +#include <numa.h>
> +#include <numaif.h>
> +#endif
> +

I wonder how much of this we should try to put into port/pg_numa.c. Having
direct calls to libnuma code all over the backend will make it rather hard to
add numa awareness for hypothetical platforms not using libnuma compatible
interfaces.


> +/* number of buffers allocated on the same NUMA node */
> +static int64 numa_chunk_buffers = -1;

Given that NBuffers is a 32bit quantity, this probably doesn't need to be
64bit...  Anyway, I'm not going to review on that level going forward, the
patch is probably in too early a state for that.



> @@ -71,18 +92,80 @@ BufferManagerShmemInit(void)
>                  foundDescs,
>                  foundIOCV,
>                  foundBufCkpt;
> +    Size        mem_page_size;
> +    Size        buffer_align;
> +
> +    /*
> +     * XXX A bit weird. Do we need to worry about postmaster? Could this even
> +     * run outside postmaster? I don't think so.

It can run in single user mode - but that shouldn't prevent us from using
pg_get_shmem_pagesize().


> +     * XXX Another issue is we may get different values than when sizing
> +     * the memory, because at that point we didn't know if we get huge pages,
> +     * so we assumed we will. Shouldn't cause crashes, but we might allocate
> +     * shared memory and then not use some of it (because of the alignment
> +     * that we don't actually need). Not sure about better way, good for now.
> +     */

Ugh, not seeing a great way to deal with that either.


> +     * XXX Maybe with (mem_page_size > PG_IO_ALIGN_SIZE), we don't need to
> +     * align to mem_page_size? Especially for very large huge pages (e.g. 1GB)
> +     * that doesn't seem quite worth it. Maybe we should simply align to
> +     * BLCKSZ, so that buffers don't get split? Still, we might interfere with
> +     * other stuff stored in shared memory that we want to allocate on a
> +     * particular NUMA node (e.g. ProcArray).
> +     *
> +     * XXX Maybe with "too large" huge pages we should just not do this, or
> +     * maybe do this only for sufficiently large areas (e.g. shared buffers,
> +     * but not ProcArray).

I think that's right - there's no point in using 1GB pages for anything other
than shared_buffers, we should allocate shared_buffers separately.




> +/*
> + * Determine the size of memory page.
> + *
> + * XXX This is a bit tricky, because the result depends at which point we call
> + * this. Before the allocation we don't know if we succeed in allocating huge
> + * pages - but we have to size everything for the chance that we will. And then
> + * if the huge pages fail (with 'huge_pages=try'), we'll use the regular memory
> + * pages. But at that point we can't adjust the sizing.
> + *
> + * XXX Maybe with huge_pages=try we should do the sizing twice - first with
> + * huge pages, and if that fails, then without them. But not for this patch.
> + * Up to this point there was no such dependency on huge pages.

Doing it twice sounds somewhat nasty - but perhaps we could just have the
shmem size infrastructure compute two different numbers, one for use with huge
pages and one without?



> +static int64
> +choose_chunk_buffers(int NBuffers, Size mem_page_size, int num_nodes)
> +{
> +    int64        num_items;
> +    int64        max_items;
> +
> +    /* make sure the chunks will align nicely */
> +    Assert(BLCKSZ % sizeof(BufferDescPadded) == 0);
> +    Assert(mem_page_size % sizeof(BufferDescPadded) == 0);
> +    Assert(((BLCKSZ % mem_page_size) == 0) || ((mem_page_size % BLCKSZ) == 0));
> +
> +    /*
> +     * The minimum number of items to fill a memory page with descriptors and
> +     * blocks. The NUMA allocates memory in pages, and we need to do that for
> +     * both buffers and descriptors.
> +     *
> +     * In practice the BLCKSZ doesn't really matter, because it's much larger
> +     * than BufferDescPadded, so the result is determined by the buffer descriptors.
> +     * But it's clearer this way.
> +     */
> +    num_items = Max(mem_page_size / sizeof(BufferDescPadded),
> +                    mem_page_size / BLCKSZ);
> +
> +    /*
> +     * We shouldn't use chunks larger than NBuffers/num_nodes, because with
> +     * larger chunks the last NUMA node would end up with much less memory (or
> +     * no memory at all).
> +     */
> +    max_items = (NBuffers / num_nodes);
> +
> +    /*
> +     * Did we already exceed the maximum desirable chunk size? That is, will
> +     * the last node get less than one whole chunk (or no memory at all)?
> +     */
> +    if (num_items > max_items)
> +        elog(WARNING, "choose_chunk_buffers: chunk items exceeds max (%ld > %ld)",
> +             num_items, max_items);
> +
> +    /* grow the chunk size until we hit the max limit. */
> +    while (2 * num_items <= max_items)
> +        num_items *= 2;

Something around this logic leads to a fair bit of imbalance - I started postgres with
huge_page_size=1GB, shared_buffers=4GB on a 2 node system and that results in

postgres[4188255][1]=# SELECT * FROM pg_shmem_allocations_numa WHERE name in ('Buffer Blocks', 'Buffer Descriptors');
┌────────────────────┬───────────┬────────────┐
│        name        │ numa_node │    size    │
├────────────────────┼───────────┼────────────┤
│ Buffer Blocks      │         0 │ 5368709120 │
│ Buffer Blocks      │         1 │ 1073741824 │
│ Buffer Descriptors │         0 │ 1073741824 │
│ Buffer Descriptors │         1 │ 1073741824 │
└────────────────────┴───────────┴────────────┘
(4 rows)


With shared_buffers=8GB postgres failed to start, even though 16 1GB huge
pages are available, as 18GB were requested.

After increasing the limit, the top allocations were as follows:
postgres[4189384][1]=# SELECT * FROM pg_shmem_allocations ORDER BY allocated_size DESC LIMIT 5;
┌──────────────────────┬─────────────┬────────────┬────────────────┐
│         name         │     off     │    size    │ allocated_size │
├──────────────────────┼─────────────┼────────────┼────────────────┤
│ Buffer Blocks        │  1192223104 │ 9663676416 │     9663676416 │
│ PGPROC structures    │ 10970279808 │ 3221733342 │     3221733376 │
│ Fast-Path Lock Array │ 14192013184 │ 3221396544 │     3221396608 │
│ Buffer Descriptors   │    51372416 │ 1140850688 │     1140850688 │
│ (null)               │ 17468590976 │  785020032 │      785020032 │
└──────────────────────┴─────────────┴────────────┴────────────────┘

With a fair bit of imbalance:
postgres[4189384][1]=# SELECT * FROM pg_shmem_allocations_numa WHERE name in ('Buffer Blocks', 'Buffer Descriptors');
┌────────────────────┬───────────┬────────────┐
│        name        │ numa_node │    size    │
├────────────────────┼───────────┼────────────┤
│ Buffer Blocks      │         0 │ 8589934592 │
│ Buffer Blocks      │         1 │ 2147483648 │
│ Buffer Descriptors │         0 │          0 │
│ Buffer Descriptors │         1 │ 2147483648 │
└────────────────────┴───────────┴────────────┘
(4 rows)

Note that the buffer descriptors are all on node 1.


> +/*
> + * Calculate the NUMA node for a given buffer.
> + */
> +int
> +BufferGetNode(Buffer buffer)
> +{
> +    /* not NUMA interleaving */
> +    if (numa_chunk_buffers == -1)
> +        return -1;
> +
> +    return (buffer / numa_chunk_buffers) % numa_nodes;
> +}

FWIW, this is likely rather expensive - when not a compile time constant,
divisions and modulo can take a fair number of cycles.



> +/*
> + * pg_numa_interleave_memory
> + *        move memory to different NUMA nodes in larger chunks
> + *
> + * startptr - start of the region (should be aligned to page size)
> + * endptr - end of the region (doesn't need to be aligned)
> + * mem_page_size - size of the memory page
> + * chunk_size - size of the chunk to move to a single node (should be a
> + *              multiple of page size)
> + * num_nodes - number of nodes to allocate memory to
> + *
> + * XXX Maybe this should use numa_tonode_memory and numa_police_memory instead?
> + * That might be more efficient than numa_move_pages, as it works on larger
> + * chunks of memory, not individual system pages, I think.
> + *
> + * XXX The "interleave" name is not quite accurate, I guess.
> + */
> +static void
> +pg_numa_interleave_memory(char *startptr, char *endptr,
> +                          Size mem_page_size, Size chunk_size,
> +                          int num_nodes)
> +{

Seems like this should be in pg_numa.c?




> diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
> index 69b6a877dc9..c07de903f76 100644
> --- a/src/bin/pgbench/pgbench.c
> +++ b/src/bin/pgbench/pgbench.c

I assume those changes weren't intentionally part of this patch...


> From 6505848ac8359c8c76dfbffc7150b6601ab07601 Mon Sep 17 00:00:00 2001
> From: Tomas Vondra <tomas@vondra.me>
> Date: Thu, 22 May 2025 18:38:41 +0200
> Subject: [PATCH v1 4/6] NUMA: partition buffer freelist
>
> Instead of a single buffer freelist, partition into multiple smaller
> lists, to reduce lock contention, and to spread the buffers over all
> NUMA nodes more evenly.
>
> There are four strategies, specified by GUC numa_partition_freelist
>
> * none - single long freelist, should work just like now
>
> * node - one freelist per NUMA node, with only buffers from that node
>
> * cpu - one freelist per CPU
>
> * pid - freelist determined by PID (same number of freelists as 'cpu')
>
> When allocating a buffer, it's taken from the correct freelist (e.g.
> same NUMA node).
>
> Note: This is (probably) more important than partitioning ProcArray.

>
> +/*
> + * Represents one freelist partition.
> + */
> +typedef struct BufferStrategyFreelist
> +{
> +    /* Spinlock: protects the values below */
> +    slock_t        freelist_lock;
> +
> +    /*
> +     * XXX Not sure why this needs to be aligned like this. Need to ask
> +     * Andres.
> +     */
> +    int            firstFreeBuffer __attribute__((aligned(64)));    /* Head of list of
> +                                                                 * unused buffers */
> +
> +    /* Number of buffers consumed from this list. */
> +    uint64        consumed;
> +}            BufferStrategyFreelist;

I think this might be a leftover from measuring performance of a *non*
partitioned freelist.  I saw unnecessary contention between

  BufferStrategyControl->{nextVictimBuffer,buffer_strategy_lock,numBufferAllocs}

and was testing what effect the simplest avoidance scheme has.

I don't think this should be part of this patchset.



>
>  /*
>   * The shared freelist control information.
> @@ -39,8 +66,6 @@ typedef struct
>       */
>      pg_atomic_uint32 nextVictimBuffer;
>
> -    int            firstFreeBuffer;    /* Head of list of unused buffers */
> -
>      /*
>       * Statistics.  These counters should be wide enough that they can't
>       * overflow during a single bgwriter cycle.
> @@ -51,13 +76,27 @@ typedef struct
>      /*
>       * Bgworker process to be notified upon activity or -1 if none. See
>       * StrategyNotifyBgWriter.
> +     *
> +     * XXX Not sure why this needs to be aligned like this. Need to ask
> +     * Andres. Also, shouldn't the alignment be specified after, like for
> +     * "consumed"?
>       */
> -    int            bgwprocno;
> +    int            __attribute__((aligned(64))) bgwprocno;
> +
> +    BufferStrategyFreelist freelists[FLEXIBLE_ARRAY_MEMBER];
>  } BufferStrategyControl;

Here the reason was that it's silly to put almost-readonly data (like
bgwprocno) onto the same cacheline as very frequently modified data like
->numBufferAllocs.  That causes unnecessary cache misses in many
StrategyGetBuffer() calls, as another backend's StrategyGetBuffer() will
always have modified ->numBufferAllocs and either ->buffer_strategy_lock or
->nextVictimBuffer.


>
> +static BufferStrategyFreelist *
> +ChooseFreeList(void)
> +{
> +    unsigned    cpu;
> +    unsigned    node;
> +    int            rc;
> +
> +    int            freelist_idx;
> +
> +    /* freelist not partitioned, return the first (and only) freelist */
> +    if (numa_partition_freelist == FREELIST_PARTITION_NONE)
> +        return &StrategyControl->freelists[0];
> +
> +    /*
> +     * freelist is partitioned, so determine the CPU/NUMA node, and pick a
> +     * list based on that.
> +     */
> +    rc = getcpu(&cpu, &node);
> +    if (rc != 0)
> +        elog(ERROR, "getcpu failed: %m");

Probably should put this into somewhere abstracted away...


> +    /*
> +     * Pick the freelist, based on CPU, NUMA node or process PID. This matches
> +     * how we built the freelists above.
> +     *
> +     * XXX Can we rely on some of the values (especially strategy_nnodes) to
> +     * be a power-of-2? Then we could replace the modulo with a mask, which is
> +     * likely more efficient.
> +     */
> +    switch (numa_partition_freelist)
> +    {
> +        case FREELIST_PARTITION_CPU:
> +            freelist_idx = cpu % strategy_ncpus;

As mentioned earlier, modulo is rather expensive for something executed so
frequently...

> +            break;
> +
> +        case FREELIST_PARTITION_NODE:
> +            freelist_idx = node % strategy_nnodes;
> +            break;

Here we shouldn't need modulo, right?


> +
> +        case FREELIST_PARTITION_PID:
> +            freelist_idx = MyProcPid % strategy_ncpus;
> +            break;
> +
> +        default:
> +            elog(ERROR, "unknown freelist partitioning value");
> +    }
> +
> +    return &StrategyControl->freelists[freelist_idx];
> +}


>      /* size of lookup hash table ... see comment in StrategyInitialize */
>      size = add_size(size, BufTableShmemSize(NBuffers + NUM_BUFFER_PARTITIONS));
>
>      /* size of the shared replacement strategy control block */
> -    size = add_size(size, MAXALIGN(sizeof(BufferStrategyControl)));
> +    size = add_size(size, MAXALIGN(offsetof(BufferStrategyControl, freelists)));
> +
> +    /*
> +     * Allocate one freelist per CPU. We might use per-node freelists, but the
> +     * assumption is the number of NUMA nodes is less than the number of CPUs.
> +     *
> +     * FIXME This assumes that we have more CPUs than NUMA nodes, which seems
> +     * like a safe assumption. But maybe we should calculate how many elements
> +     * we actually need, depending on the GUC? Not a huge amount of memory.

FWIW, I don't think that's a safe assumption anymore. With CXL we can get a)
PCIe attached memory and b) remote memory as a separate NUMA nodes, and that
very well could end up as more NUMA nodes than cores.


Ugh, -ETOOLONG. Gotta schedule some other things...


Greetings,

Andres Freund



Re: Adding basic NUMA awareness

From
Jakub Wartak
Date:
On Wed, Jul 9, 2025 at 7:13 PM Andres Freund <andres@anarazel.de> wrote:

> > Yes, and we are discussing if it is worth getting into smaller pages
> > for such usecases (e.g. 4kB ones without hugetlb with 2MB hugepages or
> > what more even more waste 1GB hugetlb if we dont request 2MB for some
> > small structs: btw, we have ability to select MAP_HUGE_2MB vs
> > MAP_HUGE_1GB). I'm thinking about two problems:
> > - 4kB are swappable and mlock() potentially (?) disarms NUMA autobalacning
>
> I'm not really bought into this being a problem.  If your system has enough
> pressure to swap out the PGPROC array, you're so hosed that this won't make a
> difference.

OK, I need to bend here, yet still part of me believes that the
situation where we have huge pages (for 'Buffer Blocks') and yet some
smaller but far more critical structs are more likely to be swapped
out due to the pressure of some backend-gone-wild random mallocs() is
unhealthy (especially since the OS might prefer swapping based on the
per-node rather than the global picture).

> I'm rather doubtful that it's a good idea to combine numa awareness with numa
> balancing. Numa balancing adds latency and makes it much more expensive for
> userspace to act in a numa aware way, since it needs to regularly update its
> knowledge about where memory resides.

Well, the problem is that backends often come and go to random CPUs
(migrated++ on very high backend counts and workloads that are
non-uniform in terms of backend-CPU usage), but autobalancing doesn't
need to be on or off for everything. It could be enabled for a certain
memory region only, without affecting the app in any way (well, other
than those minor page faults).

> If we used 4k pages for the procarray we would just have ~4 procs on one page,
> if that range were marked as interleaved, it'd probably suffice.

OK, this sounds like the best and simplest proposal to me, yet the
patch doesn't do OS-based interleaving for those today. Gonna try that
mlock() sooner or later... ;)

-J.



Re: Adding basic NUMA awareness

From
Jakub Wartak
Date:
On Wed, Jul 9, 2025 at 9:42 PM Andres Freund <andres@anarazel.de> wrote:

> On 2025-07-01 21:07:00 +0200, Tomas Vondra wrote:
> > Each patch has a numa_ GUC, intended to enable/disable that part. This
> > is meant to make development easier, not as a final interface. I'm not
> > sure how exactly that should look. It's possible some combinations of
> > GUCs won't work, etc.
>
> Wonder if some of it might be worth putting into a multi-valued GUC (like
> debug_io_direct).

Long-term or for experimentation? Also please see below as it is related:

[..]

> FWIW, I don't think that's a safe assumption anymore. With CXL we can get a)
> PCIe attached memory and b) remote memory as a separate NUMA nodes, and that
> very well could end up as more NUMA nodes than cores.

In my earlier, apparently way too naive approach, I tried to handle
this CXL scenario, but I'm afraid it cannot be done without further
configuration; please see the review/use cases [1] and [2]

-J.

[1]
https://www.postgresql.org/message-id/attachment/178119/v4-0001-Add-capability-to-interleave-shared-memory-across.patch
- just see sgml/GUC and we have numa_parse_nodestring(3)
[2] https://www.postgresql.org/message-id/aAKPMrX1Uq6quKJy%40ip-10-97-1-34.eu-west-3.compute.internal



Re: Adding basic NUMA awareness

From
"Burd, Greg"
Date:

> On Jul 9, 2025, at 1:23 PM, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2025-07-09 12:55:51 -0400, Greg Burd wrote:
>> On Jul 9 2025, at 12:35 pm, Andres Freund <andres@anarazel.de> wrote:
>>
>>> FWIW, I've started to wonder if we shouldn't just get rid of the freelist
>>> entirely. While clocksweep is perhaps minutely slower in a single
>>> thread than
>>> the freelist, clock sweep scales *considerably* better [1]. As it's rather
>>> rare to be bottlenecked on clock sweep speed for a single thread
>>> (rather then
>>> IO or memory copy overhead), I think it's worth favoring clock sweep.
>>
>> Hey Andres, thanks for spending time on this.  I've worked before on
>> freelist implementations (last one in LMDB) and I think you're onto
>> something.  I think it's an innovative idea and that the speed
>> difference will either be lost in the noise or potentially entirely
>> mitigated by avoiding duplicate work.
>
> Agreed. FWIW, just using clock sweep actually makes things like DROP TABLE
> perform better because it doesn't need to maintain the freelist anymore...
>
>
>>> Also needing to switch between getting buffers from the freelist and
>>> the sweep
>>> makes the code more expensive.  I think just having the buffer in the sweep,
>>> with a refcount / usagecount of zero would suffice.
>>
>> If you're not already coding this, I'll jump in. :)
>
> My experimental patch is literally a four character addition ;), namely adding
> "0 &&" to the relevant code in StrategyGetBuffer().
>
> Obviously a real patch would need to do some more work than that.  Feel free
> to take on that project, I am not planning on tackling that in near term.
>

I started on this last night, making good progress.  Thanks for the inspiration.  I'll create a new thread to track the
work and cross-reference when I have something reasonable to show (hopefully later today).

> There's other things around this that could use some attention. It's not hard
> to see clock sweep be a bottleneck in concurrent workloads - partially due to
> the shared maintenance of the clock hand. A NUMAed clock sweep would address
> that.

Working on it.  Other than NUMA-fying the clock sweep, there is a function have_free_buffer() that might be a tad tricky
to re-implement efficiently and/or make NUMA aware.  Or maybe I can remove that too?  It is used in autoprewarm.c and
possibly other extensions, but nowhere else in core.

> However, we also maintain StrategyControl->numBufferAllocs, which is a
> significant contention point and would not necessarily be removed by a
> NUMAification of the clock sweep.

Yep, I noted this counter and its potential for contention too.  Fortunately, it seems it is only used so that "bgwriter
can estimate the rate of buffer consumption", which to me opens the door to a less accurate partitioned counter, perhaps
something lock-free (no mutex/CAS) that is bucketed and then combined when read.

A quick look at bufmgr.c indicates that recent_alloc (which is StrategyControl->numBufferAllocs) is used to track a
"moving average" and other voodoo there I've yet to fully grok.  Any thoughts on this approximate-count approach?

Also, what are your thoughts on updating the algorithm to CLOCK-Pro [1] while I'm there?  I guess I'd have to try it
out, measure it a lot, and see if there are any material benefits.  Maybe I'll keep that for a future patch, or at least
layer it... back to work!

> Greetings,
>
> Andres Freund

best.

-greg

[1]
https://www.usenix.org/legacy/publications/library/proceedings/usenix05/tech/general/full_papers/jiang/jiang_html/html.html




Re: Adding basic NUMA awareness

From
Bertrand Drouvot
Date:
Hi,

On Wed, Jul 09, 2025 at 03:42:26PM -0400, Andres Freund wrote:
> Hi,
> 
> Thanks for working on this!

Indeed, thanks!

> On 2025-07-01 21:07:00 +0200, Tomas Vondra wrote:
> > 1) v1-0001-NUMA-interleaving-buffers.patch
> >
> > This is the main thing when people think about NUMA - making sure the
> > shared buffers are allocated evenly on all the nodes, not just on a
> > single node (which can happen easily with warmup). The regular memory
> > interleaving would address this, but it also has some disadvantages.
> >
> > Firstly, it's oblivious to the contents of the shared memory segment,
> > and we may not want to interleave everything. It's also oblivious to
> > alignment of the items (a buffer can easily end up "split" on multiple
> > NUMA nodes), or relationship between different parts (e.g. there's a
> > BufferBlock and a related BufferDescriptor, and those might again end up
> > on different nodes).
> 
> Two more disadvantages:
> 
> With OS interleaving postgres doesn't (not easily at least) know about what
> maps to what, which means postgres can't do stuff like numa aware buffer
> replacement.
> 
> With OS interleaving the interleaving is "too fine grained", with pages being
> mapped at each page boundary, making it less likely for things like one
> strategy ringbuffer to reside on a single numa node.

> > There's a secondary benefit of explicitly assigning buffers to nodes,
> > using this simple scheme - it allows quickly determining the node ID
> > given a buffer ID. This is helpful later, when building freelist.

I do think this is a big advantage as compared to the OS interleaving.

> I wonder if we should *increase* the size of shared_buffers whenever huge
> pages are in use and there's padding space due to the huge page
> boundaries. Pretty pointless to waste that memory if we can instead use it for
> the buffer pool.  Not that big a deal with 2MB huge pages, but with 1GB huge
> pages...

I think that makes sense, except maybe for operations that need to scan
the whole buffer pool (i.e. related to BUF_DROP_FULL_SCAN_THRESHOLD)?

> > 5) v1-0005-NUMA-interleave-PGPROC-entries.patch
> >
> > Another area that seems like it might benefit from NUMA is PGPROC, so I
> > gave it a try. It turned out somewhat challenging. Similarly to buffers
> > we have two pieces that need to be located in a coordinated way - PGPROC
> > entries and fast-path arrays. But we can't use the same approach as for
> > buffers/descriptors, because
> >
> > (a) Neither of those pieces aligns with memory page size (PGPROC is
> > ~900B, fast-path arrays are variable length).
> 
> > (b) We could pad PGPROC entries e.g. to 1KB, but that'd still require
> > rather high max_connections before we use multiple huge pages.
> 
> Right now sizeof(PGPROC) happens to be multiple of 64 (i.e. the most common
> cache line size)

Oh right, it's currently 832 bytes and the patch extends that to 840 bytes.

With a bit of reordering:

diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 5cb1632718e..2ed2f94202a 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -194,8 +194,6 @@ struct PGPROC
                                                                 * vacuum must not remove tuples deleted by
                                                                 * xid >= xmin ! */

-       int                     procnumber;             /* index in ProcGlobal->allProcs */
-
        int                     pid;                    /* Backend's process ID; 0 if prepared xact */

        int                     pgxactoff;              /* offset into various ProcGlobal->arrays with
@@ -243,6 +241,7 @@ struct PGPROC

        /* Support for condition variables. */
        proclist_node cvWaitLink;       /* position in CV wait list */
+       int                     procnumber;             /* index in ProcGlobal->allProcs */

        /* Info about lock the process is currently waiting for, if any. */
        /* waitLock and waitProcLock are NULL if not currently waiting. */
@@ -268,6 +267,7 @@ struct PGPROC
         */
        XLogRecPtr      waitLSN;                /* waiting for this LSN or higher */
        int                     syncRepState;   /* wait state for sync rep */
+       int                     numa_node;
        dlist_node      syncRepLinks;   /* list link if process is in syncrep queue */

        /*
@@ -321,9 +321,6 @@ struct PGPROC
        PGPROC     *lockGroupLeader;    /* lock group leader, if I'm a member */
        dlist_head      lockGroupMembers;       /* list of members, if I'm a leader */
        dlist_node      lockGroupLink;  /* my member link, if I'm a member */
-
-       /* NUMA node */
-       int                     numa_node;
 };

That could be back to 832 (the order does not make sense logically speaking
though).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




On 7/9/25 08:40, Cédric Villemain wrote:
>> On 7/8/25 18:06, Cédric Villemain wrote:
>>>
>>>
>>>
>>>
>>>
>>>
>>>> On 7/8/25 03:55, Cédric Villemain wrote:
>>>>> Hi Andres,
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> On 2025-07-05 07:09:00 +0000, Cédric Villemain wrote:
>>>>>>> In my work on more careful PostgreSQL resource management, I've come
>>>>>>> to the
>>>>>>> conclusion that we should avoid pushing policy too deeply into the
>>>>>>> PostgreSQL core itself. Therefore, I'm quite skeptical about
>>>>>>> integrating
>>>>>>> NUMA-specific management directly into core PostgreSQL in such a
>>>>>>> way.
>>>>>>
>>>>>> I think it's actually the opposite - whenever we pushed stuff like
>>>>>> this
>>>>>> outside of core it has hurt postgres substantially. Not having
>>>>>> replication in
>>>>>> core was a huge mistake. Not having HA management in core is
>>>>>> probably the
>>>>>> biggest current adoption hurdle for postgres.
>>>>>>
>>>>>> To deal better with NUMA we need to improve memory placement and
>>>>>> various
>>>>>> algorithms, in an interrelated way - that's pretty much impossible
>>>>>> to do
>>>>>> outside of core.
>>>>>
>>>>> Except the backend pinning which is easy to achieve, thus my
>>>>> comment on
>>>>> the related patch.
>>>>> I'm not claiming NUMA memory and all should be managed outside of core
>>>>> (though I didn't read other patches yet).
>>>>>
>>>>
>>>> But an "optimal backend placement" seems to very much depend on
>>>> where we
>>>> placed the various pieces of shared memory. Which the external module
>>>> will have trouble following, I suspect.
>>>>
>>>> I still don't have any idea what exactly would the external module do,
>>>> how would it decide where to place the backend. Can you describe some
>>>> use case with an example?
>>>>
>>>> Assuming we want to actually pin tasks from within Postgres, what I
>>>> think might work is allowing modules to "advise" on where to place the
>>>> task. But the decision would still be done by core.
>>>
>>> Possibly exactly what you're doing in proc.c when managing allocation of
>>> process, but not hardcoded in postgresql (patches 02, 05 and 06 are good
>>> candidates), I didn't get that they require information not available to
>>> any process executing code from a module.
>>>
>>
>> Well, it needs to understand how some other stuff (especially PGPROC
>> entries) is distributed between nodes. I'm not sure how much of this
>> internal information we want to expose outside core ...
>>
>>> Parts of your code where you assign/define policy could be in one or
>>> more relevant routines of a "numa profile manager", like in an
>>> initProcessRoutine(), and registered in pmroutine struct:
>>>
>>> pmroutine = GetPmRoutineForInitProcess();
>>> if (pmroutine != NULL &&
>>>      pmroutine->init_process != NULL)
>>>      pmroutine->init_process(MyProc);
>>>
>>> This way it's easier to manage alternative policies, and also to be able
>>> to adjust when hardware and linux kernel changes.
>>>
>>
>> I'm not against making this extensible, in some way. But I still
>> struggle to imagine a reasonable alternative policy, where the external
>> module gets the same information and ends up with a different decision.
>>
>> So what would the alternate policy look like? What use case would the
>> module be supporting?
> 
> 
> That's the whole point: there are very distinct usages of PostgreSQL in
> the field. And maybe not all of them will require the policy defined by
> PostgreSQL core.
> 
> May I ask the reverse: what prevents external modules from taking those
> decisions?  There are already a lot of areas where external code can take
> over PostgreSQL processing, like Neon does.
> 

The complexity of making everything extensible in an arbitrary way. To
make it extensible in a useful way, we need to have a reasonably clear
idea of what aspects need to be extensible, and what the goal is.

> There is some very early processing for memory setup that I can see as
> a current blocker, and here I'd prefer a more compliant NUMA API, as
> proposed by Jakub, so it's possible to arrange things based on workload,
> hardware configuration or other matters. Reworking to get distinct
> segments and all, as you do, is great, and a combo of both approaches is
> probably of great interest. There is also the weighted interleave being
> discussed, and probably much more to come in this area in Linux.
> 
> I think some points were raised already about possible distinct policies.
> I am precisely claiming that it is hard to come up with one good policy
> with limited setup options, hence the requirement to keep this flexible
> enough (hooks, an API, 100 GUCs?).
> 

I'm sorry, I don't want to sound too negative, but "I want arbitrary
extensibility" is not a very useful feedback. I've asked you to give
some examples of policies that'd customize some of the NUMA stuff.

> There is an EPYC story here also: given that the NUMA setup can vary
> depending on the BIOS configuration, the associated NUMA policy must
> probably take that into account (L3 can be either a real cache or 4
> extra "local" NUMA nodes, with highly distinct access costs from a RAM
> module). Does that change how PostgreSQL will place memory and
> processes? Is it important or of interest?
> 

So how exactly would the policy handle this? Right now we're entirely
oblivious to L3, or on-CPU caches in general. We don't even consider the
size of L3 when sizing hash tables in a hashjoin etc.


regards

-- 
Tomas Vondra




Re: Adding basic NUMA awareness

From
Tomas Vondra
Date:

On 7/9/25 19:23, Andres Freund wrote:
> Hi,
> 
> On 2025-07-09 12:55:51 -0400, Greg Burd wrote:
>> On Jul 9 2025, at 12:35 pm, Andres Freund <andres@anarazel.de> wrote:
>>
>>> FWIW, I've started to wonder if we shouldn't just get rid of the freelist
>>> entirely. While clocksweep is perhaps minutely slower in a single
>>> thread than
>>> the freelist, clock sweep scales *considerably* better [1]. As it's rather
>>> rare to be bottlenecked on clock sweep speed for a single thread
>>> (rather than
>>> IO or memory copy overhead), I think it's worth favoring clock sweep.
>>
>> Hey Andres, thanks for spending time on this.  I've worked before on
>> freelist implementations (last one in LMDB) and I think you're onto
>> something.  I think it's an innovative idea and that the speed
>> difference will either be lost in the noise or potentially entirely
>> mitigated by avoiding duplicate work.
> 
> Agreed. FWIW, just using clock sweep actually makes things like DROP TABLE
> perform better because it doesn't need to maintain the freelist anymore...
> 
> 
>>> Also needing to switch between getting buffers from the freelist and
>>> the sweep
>>> makes the code more expensive.  I think just having the buffer in the sweep,
>>> with a refcount / usagecount of zero would suffice.
>>
>> If you're not already coding this, I'll jump in. :)
> 
> My experimental patch is literally a four character addition ;), namely adding
> "0 &&" to the relevant code in StrategyGetBuffer().
> 
> Obviously a real patch would need to do some more work than that.  Feel free
> to take on that project, I am not planning on tackling that in near term.
> 
> 
> There's other things around this that could use some attention. It's not hard
> to see clock sweep be a bottleneck in concurrent workloads - partially due to
> the shared maintenance of the clock hand. A NUMAed clock sweep would address
> that. However, we also maintain StrategyControl->numBufferAllocs, which is a
> significant contention point and would not necessarily be removed by a
> NUMAification of the clock sweep.
> 

Wouldn't it make sense to partition numBufferAllocs too, though? I
don't remember if my hacky experimental NUMA-partitioning patch did
that or if I just thought about doing it, but why wouldn't that be enough?

Places that need the "total" count would have to sum the counters, but
it seemed to me most of the places would be fine with the "local" count
for that partition. If we also make sure to "sync" the clocksweeps so as
to not work on just a single partition, that might be enough ...

regards

-- 
Tomas Vondra




Re: Adding basic NUMA awareness

From
Andres Freund
Date:
Hi,

On 2025-07-10 17:31:45 +0200, Tomas Vondra wrote:
> On 7/9/25 19:23, Andres Freund wrote:
> > There's other things around this that could use some attention. It's not hard
> > to see clock sweep be a bottleneck in concurrent workloads - partially due to
> > the shared maintenance of the clock hand. A NUMAed clock sweep would address
> > that. However, we also maintain StrategyControl->numBufferAllocs, which is a
> > significant contention point and would not necessarily be removed by a
> > NUMAification of the clock sweep.
> > 
> 
> Wouldn't it make sense to partition the numBufferAllocs too, though? I
> don't remember if my hacky experimental patch NUMA-partitioning did that
> or I just thought about doing that, but why wouldn't that be enough?

It could be solved together with partitioning, yes - that's what I was trying
to reference with the emphasized bit in "would *not necessarily* be removed by
a NUMAification of the clock sweep".

Greetings,

Andres Freund



Re: Adding basic NUMA awareness

From
Andres Freund
Date:
Hi,

On 2025-07-10 14:17:21 +0000, Bertrand Drouvot wrote:
> On Wed, Jul 09, 2025 at 03:42:26PM -0400, Andres Freund wrote:
> > I wonder if we should *increase* the size of shared_buffers whenever huge
> > pages are in use and there's padding space due to the huge page
> > boundaries. Pretty pointless to waste that memory if we can instead use it for
> > the buffer pool.  Not that big a deal with 2MB huge pages, but with 1GB huge
> > pages...
> 
> I think that makes sense, except maybe for operations that need to scan
> the whole buffer pool (i.e. related to BUF_DROP_FULL_SCAN_THRESHOLD)?

I don't think the increases here are big enough for that to matter, unless
perhaps you're using 1GB huge pages. But if you're concerned about dropping
tables very fast (i.e. you're running schema change heavy regression tests),
you're not going to use 1GB huge pages.



> > > 5) v1-0005-NUMA-interleave-PGPROC-entries.patch
> > >
> > > Another area that seems like it might benefit from NUMA is PGPROC, so I
> > > gave it a try. It turned out somewhat challenging. Similarly to buffers
> > > we have two pieces that need to be located in a coordinated way - PGPROC
> > > entries and fast-path arrays. But we can't use the same approach as for
> > > buffers/descriptors, because
> > >
> > > (a) Neither of those pieces aligns with memory page size (PGPROC is
> > > ~900B, fast-path arrays are variable length).
> > 
> > > (b) We could pad PGPROC entries e.g. to 1KB, but that'd still require
> > > rather high max_connections before we use multiple huge pages.
> > 
> > Right now sizeof(PGPROC) happens to be multiple of 64 (i.e. the most common
> > cache line size)
> 
> Oh right, it's currently 832 bytes and the patch extends that to 840 bytes.

I don't think the patch itself is the problem - it really is just happenstance
that it's a multiple of the line size right now. And it's not on common Armv8
platforms...


> With a bit of reordering:
> 
> That could be back to 832 (the order does not make sense logically speaking
> though).

I don't think shrinking the size in a one-off way just to keep the
"accidental" size-is-multiple-of-64 property is promising. It'll just get
broken again.  I think we should:

a) pad the size of PGPROC to a cache line (or even to a subsequent power of 2,
   to make array access cheaper, right now that involves actual
   multiplications rather than shifts or indexed `lea` instructions).

   That's probably just a pg_attribute_aligned

b) Reorder PGPROC to separate frequently modified from almost-read-only data,
   to increase cache hit ratio.

Greetings,

Andres Freund



Re: Adding basic NUMA awareness

From
"Burd, Greg"
Date:

> On Jul 10, 2025, at 8:13 AM, Burd, Greg <greg@burd.me> wrote:
>
>
>> On Jul 9, 2025, at 1:23 PM, Andres Freund <andres@anarazel.de> wrote:
>>
>> Hi,
>>
>> On 2025-07-09 12:55:51 -0400, Greg Burd wrote:
>>> On Jul 9 2025, at 12:35 pm, Andres Freund <andres@anarazel.de> wrote:
>>>
>>>> FWIW, I've started to wonder if we shouldn't just get rid of the freelist
>>>> entirely. While clocksweep is perhaps minutely slower in a single
>>>> thread than
>>>> the freelist, clock sweep scales *considerably* better [1]. As it's rather
>>>> rare to be bottlenecked on clock sweep speed for a single thread
>>>> (rather than
>>>> IO or memory copy overhead), I think it's worth favoring clock sweep.
>>>
>>> Hey Andres, thanks for spending time on this.  I've worked before on
>>> freelist implementations (last one in LMDB) and I think you're onto
>>> something.  I think it's an innovative idea and that the speed
>>> difference will either be lost in the noise or potentially entirely
>>> mitigated by avoiding duplicate work.
>>
>> Agreed. FWIW, just using clock sweep actually makes things like DROP TABLE
>> perform better because it doesn't need to maintain the freelist anymore...
>>
>>
>>>> Also needing to switch between getting buffers from the freelist and
>>>> the sweep
>>>> makes the code more expensive.  I think just having the buffer in the sweep,
>>>> with a refcount / usagecount of zero would suffice.
>>>
>>> If you're not already coding this, I'll jump in. :)
>>
>> My experimental patch is literally a four character addition ;), namely adding
>> "0 &&" to the relevant code in StrategyGetBuffer().
>>
>> Obviously a real patch would need to do some more work than that.  Feel free
>> to take on that project, I am not planning on tackling that in near term.
>>
>
> I started on this last night, making good progress.  Thanks for the inspiration.  I'll create a new thread to track
> the work and cross-reference when I have something reasonable to show (hopefully later today).
>
>> There's other things around this that could use some attention. It's not hard
>> to see clock sweep be a bottleneck in concurrent workloads - partially due to
>> the shared maintenance of the clock hand. A NUMAed clock sweep would address
>> that.
>
> Working on it.

For archival purposes, and to tie up loose ends, I'll link from here to a new thread I just started that proposes the
removal of the freelist and the buffer_strategy_lock [1].

That patch set doesn't address any NUMA-related tasks directly, but it should remove some pain when working in that
direction by removing code that requires partitioning and locking and...

best.

-greg

[1] https://postgr.es/m/E2D6FCDC-BE98-4F95-B45E-699C3E17BA10@burd.me




Re: Adding basic NUMA awareness

From
Tomas Vondra
Date:
Hi,

Here's a v2 of the patch series, with a couple changes:

* I simplified the freelist partitioning by keeping only the "node"
partitioning (so the cpu/pid strategies are gone). Those were meant for
experimenting, but they made the code more complicated, so I ditched
them.


* I changed the freelist partitioning scheme a little bit, based on the
discussion in this thread. Instead of having a single "partition" per
NUMA node, there's not a minimum number of partitions (set to 4). So
even if your system is not NUMA, you'll have 4 of them. If you have 2
nodes, you'll still have 4, and each node will get 2. With 3 nodes we
get 6 partitions (we need 2 per node, and we want to keep the number
equal to keep things simple). Once the number of nodes exceeds 4, the
heuristic switches to one partition per node.

I'm aware there's a discussion about maybe simply removing freelists
entirely. If that happens, this becomes mostly irrelevant, of course.

The code should also make sure the freelists "agree" with how the
earlier patch mapped the buffers to NUMA nodes, i.e. the freelist should
only contain buffers from the "correct" NUMA node, etc. I haven't paid
much attention to this - I believe it should work for "nice" values of
shared buffers (when it evenly divides between nodes). But I'm sure it's
possible to confuse that (won't cause crashes, but inefficiency).


* There's now a patch partitioning clocksweep, using the same scheme as
the freelists. I came to the conclusion it doesn't make much sense to
partition these things differently - I can't think of a reason why that
would be advantageous, and it makes it easier to reason about.

The clocksweep partitioning is somewhat harder, because it affects
BgBufferSync() and related code. With the partitioning we now have
multiple "clock hands" for different ranges of buffers, and the clock
sweep needs to consider that. I modified BgBufferSync to simply loop
through the ClockSweep partitions, and do a small cleanup for each.

It does work, as in "it doesn't crash". But this part definitely needs
review to make sure I got the changes to the "pacing" right.


* This new freelist/clocksweep partitioning scheme is however harder to
disable. I now realize the GUC may not quite do the trick, and there
isn't even a GUC for the clocksweep. I need to think about this, but I'm
not even sure how feasible it'd be to have two separate GUCs (because of
how these two pieces are intertwined). For now, if you want to test
without the partitioning, you need to skip the patch.


I did some quick perf testing on my old xeon machine (2 NUMA nodes), and
the results are encouraging. For a read-only pgbench (2x shared buffers,
within RAM), I saw an increase from 1.1M tps to 1.3M. Not crazy, but not
bad considering the patch is more about consistency than raw throughput.

For a read-write pgbench I however saw some strange drops/increases of
throughput. I suspect this might be due to some thinko in the clocksweep
partitioning, but I'll need to take a closer look.


regards

-- 
Tomas Vondra

Attachments

Re: Adding basic NUMA awareness

From
Tomas Vondra
Date:
On 7/4/25 20:12, Tomas Vondra wrote:
> On 7/4/25 13:05, Jakub Wartak wrote:
>> ...
>>
>> 8. v1-0005 2x + /* if (numa_procs_interleave) */
>>
>>    Ha! it's a TRAP! I've uncommented it because I wanted to try it out
>> without it (just by setting GUC off) , but "MyProc->sema" is NULL :
>>
>>     2025-07-04 12:31:08.103 CEST [28754] LOG:  starting PostgreSQL
>> 19devel on x86_64-linux, compiled by gcc-12.2.0, 64-bit
>>     [..]
>>     2025-07-04 12:31:08.109 CEST [28754] LOG:  io worker (PID 28755)
>> was terminated by signal 11: Segmentation fault
>>     2025-07-04 12:31:08.109 CEST [28754] LOG:  terminating any other
>> active server processes
>>     2025-07-04 12:31:08.114 CEST [28754] LOG:  shutting down because
>> "restart_after_crash" is off
>>     2025-07-04 12:31:08.116 CEST [28754] LOG:  database system is shut down
>>
>>     [New LWP 28755]
>>     [Thread debugging using libthread_db enabled]
>>     Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
>>     Core was generated by `postgres: io worker                     '.
>>     Program terminated with signal SIGSEGV, Segmentation fault.
>>     #0  __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0)
>> at ./nptl/sem_waitcommon.c:136
>>     136     ./nptl/sem_waitcommon.c: No such file or directory.
>>     (gdb) where
>>     #0  __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0)
>> at ./nptl/sem_waitcommon.c:136
>>     #1  __new_sem_trywait (sem=sem@entry=0x0) at ./nptl/sem_wait.c:81
>>     #2  0x00005561918e0cac in PGSemaphoreReset (sema=0x0) at
>> ../src/backend/port/posix_sema.c:302
>>     #3  0x0000556191970553 in InitAuxiliaryProcess () at
>> ../src/backend/storage/lmgr/proc.c:992
>>     #4  0x00005561918e51a2 in AuxiliaryProcessMainCommon () at
>> ../src/backend/postmaster/auxprocess.c:65
>>     #5  0x0000556191940676 in IoWorkerMain (startup_data=<optimized
>> out>, startup_data_len=<optimized out>) at
>> ../src/backend/storage/aio/method_worker.c:393
>>     #6  0x00005561918e8163 in postmaster_child_launch
>> (child_type=child_type@entry=B_IO_WORKER, child_slot=20086,
>> startup_data=startup_data@entry=0x0,
>>         startup_data_len=startup_data_len@entry=0,
>> client_sock=client_sock@entry=0x0) at
>> ../src/backend/postmaster/launch_backend.c:290
>>     #7  0x00005561918ea09a in StartChildProcess
>> (type=type@entry=B_IO_WORKER) at
>> ../src/backend/postmaster/postmaster.c:3973
>>     #8  0x00005561918ea308 in maybe_adjust_io_workers () at
>> ../src/backend/postmaster/postmaster.c:4404
>>     [..]
>>     (gdb) print *MyProc->sem
>>     Cannot access memory at address 0x0
>>
> 
> Yeah, good catch. I'll look into that next week.
> 

I've been unable to reproduce this issue, but I'm not sure what settings
you actually used for this instance. Can you give me more details how to
reproduce this?

regards

-- 
Tomas Vondra




Re: Adding basic NUMA awareness

From
Andres Freund
Date:
Hi,

On 2025-07-17 23:11:16 +0200, Tomas Vondra wrote:
> Here's a v2 of the patch series, with a couple changes:

Not a deep look at the code, just a quick reply.


> * I changed the freelist partitioning scheme a little bit, based on the
> discussion in this thread. Instead of having a single "partition" per
> NUMA node, there's not a minimum number of partitions (set to 4). So

I assume s/not/now/?


> * There's now a patch partitioning clocksweep, using the same scheme as
> the freelists.

Nice!


> I came to the conclusion it doesn't make much sense to partition these
> things differently - I can't think of a reason why that would be
> advantageous, and it makes it easier to reason about.

Agreed.


> The clocksweep partitioning is somewhat harder, because it affects
> BgBufferSync() and related code. With the partitioning we now have
> multiple "clock hands" for different ranges of buffers, and the clock
> sweep needs to consider that. I modified BgBufferSync to simply loop
> through the ClockSweep partitions, and do a small cleanup for each.

That probably makes sense for now. It might need a bit of a larger adjustment
at some point, but ...



> * This new freelist/clocksweep partitioning scheme is however harder to
> disable. I now realize the GUC may not quite do the trick, and there
> isn't even a GUC for the clocksweep. I need to think about this, but I'm
> not even sure how feasible it'd be to have two separate GUCs (because of
> how these two pieces are intertwined). For now, if you want to test
> without the partitioning, you need to skip the patch.

I think it's totally fair to enable/disable them at the same time. They're so
closely related, that I don't think it really makes sense to measure them
separately.


> I did some quick perf testing on my old xeon machine (2 NUMA nodes), and
> the results are encouraging. For a read-only pgbench (2x shared buffers,
> within RAM), I saw an increase from 1.1M tps to 1.3M. Not crazy, but not
> bad considering the patch is more about consistency than raw throughput.

Personally I think a 1.18x improvement on a relatively small NUMA machine is
really rather awesome.


> For a read-write pgbench I however saw some strange drops/increases of
> throughput. I suspect this might be due to some thinko in the clocksweep
> partitioning, but I'll need to take a closer look.

Was that with pinning etc enabled or not?



> From c4d51ab87b92f9900e37d42cf74980e87b648a56 Mon Sep 17 00:00:00 2001
> From: Tomas Vondra <tomas@vondra.me>
> Date: Sun, 8 Jun 2025 18:53:12 +0200
> Subject: [PATCH v2 5/7] NUMA: clockweep partitioning
>


> @@ -475,13 +525,17 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
>      /*
>       * Nothing on the freelist, so run the "clock sweep" algorithm
>       *
> -     * XXX Should we also make this NUMA-aware, to only access buffers from
> -     * the same NUMA node? That'd probably mean we need to make the clock
> -     * sweep NUMA-aware, perhaps by having multiple clock sweeps, each for a
> -     * subset of buffers. But that also means each process could "sweep" only
> -     * a fraction of buffers, even if the other buffers are better candidates
> -     * for eviction. Would that also mean we'd have multiple bgwriters, one
> -     * for each node, or would one bgwriter handle all of that?
> +     * XXX Note that ClockSweepTick() is NUMA-aware, i.e. it only looks at
> +     * buffers from a single partition, aligned with the NUMA node. That
> +     * means it only accesses buffers from the same NUMA node.
> +     *
> +     * XXX That also means each process "sweeps" only a fraction of buffers,
> +     * even if the other buffers are better candidates for eviction. Maybe
> +     * there should be some logic to "steal" buffers from other freelists
> +     * or other nodes?

I think we *definitely* need "stealing" from other clock sweeps, whenever
there's a meaningful imbalance between the different sweeps.

I don't think we need to be overly precise about it, a small imbalance won't
have that much of an effect. But clearly it doesn't make sense to say that one
backend can only fill buffers in the current partition, that'd lead to massive
performance issues in a lot of workloads.

The hardest thing probably is to make the logic for when to check foreign
clock sweeps cheap enough.

One way would be to do it whenever a sweep wraps around, that'd probably
amortize the cost sufficiently, and I don't think it'd be too imprecise, as
we'd have processed that set of buffers in a row without partitioning as
well. But it'd probably be too coarse when determining for how long to use a
foreign sweep instance. But we probably could address that by rechecking the
balance more frequently when using a foreign partition.

Another way would be to have bgwriter manage this. Whenever it detects that
one ring is too far ahead, it could set an "avoid this partition" bit, which
would trigger backends that natively use that partition to switch to foreign
partitions that don't currently have that bit set.  I suspect there's a
problem with that approach though, I worry that the amount of time that
bgwriter spends in BgBufferSync() may sometimes be too long, leading to too
much imbalance.

Greetings,

Andres Freund



Re: Adding basic NUMA awareness

From
Tomas Vondra
Date:
On 7/18/25 18:46, Andres Freund wrote:
> Hi,
> 
> On 2025-07-17 23:11:16 +0200, Tomas Vondra wrote:
>> Here's a v2 of the patch series, with a couple changes:
> 
> Not a deep look at the code, just a quick reply.
> 
> 
>> * I changed the freelist partitioning scheme a little bit, based on the
>> discussion in this thread. Instead of having a single "partition" per
>> NUMA node, there's not a minimum number of partitions (set to 4). So
> 
> I assume s/not/now/?
> 

Yes.

> 
>> * There's now a patch partitioning clocksweep, using the same scheme as
>> the freelists.
> 
> Nice!
> 
> 
>> I came to the conclusion it doesn't make much sense to partition these
>> things differently - I can't think of a reason why that would be
>> advantageous, and it makes it easier to reason about.
> 
> Agreed.
> 
> 
>> The clocksweep partitioning is somewhat harder, because it affects
>> BgBufferSync() and related code. With the partitioning we now have
>> multiple "clock hands" for different ranges of buffers, and the clock
>> sweep needs to consider that. I modified BgBufferSync to simply loop
>> through the ClockSweep partitions, and do a small cleanup for each.
> 
> That probably makes sense for now. It might need a bit of a larger
adjustment at some point, but ...
> 

I couldn't think of anything fundamentally better that's not too complex.
I suspect we might want to use multiple bgwriters in the future, and
this scheme seems to be reasonably well suited for that too.

I'm also thinking about having some sort of "unified" partitioning
scheme for all the places partitioning shared buffers. Right now each of
the places does it on its own, i.e. buff_init, freelist and clocksweep
all have their own code splitting NBuffers into partitions. And those
splits should align - what would be the benefit if they didn't? So
having three variants of the same code seems a bit pointless.

I think buff_init should build a common definition of buffer partitions,
and the remaining parts should use that as the source of truth ...

> 
>> * This new freelist/clocksweep partitioning scheme is however harder to
>> disable. I now realize the GUC may not quite do the trick, and there
>> isn't even a GUC for the clocksweep. I need to think about this, but I'm
>> not even sure how feasible it'd be to have two separate GUCs (because of
>> how these two pieces are intertwined). For now, if you want to test
>> without the partitioning, you need to skip the patch.
> 
> I think it's totally fair to enable/disable them at the same time. They're so
> closely related, that I don't think it really makes sense to measure them
> separately.
> 

Yeah, that's a fair point.

> 
>> I did some quick perf testing on my old xeon machine (2 NUMA nodes), and
>> the results are encouraging. For a read-only pgbench (2x shared buffers,
>> within RAM), I saw an increase from 1.1M tps to 1.3M. Not crazy, but not
>> bad considering the patch is more about consistency than raw throughput.
> 
> Personally I think a 1.18x improvement on a relatively small NUMA machine is
> really rather awesome.
> 

True, but I want to stress that it's just one quick (& simple) test. Much
more testing is needed before I can make any reliable claims.

> 
>> For a read-write pgbench I however saw some strange drops/increases of
>> throughput. I suspect this might be due to some thinko in the clocksweep
>> partitioning, but I'll need to take a closer look.
> 
> Was that with pinning etc enabled or not?
> 

IIRC it was with everything enabled, except for numa_procs_pin (which
pins backend to NUMA node). I found that to actually harm performance in
some of the tests (even just read-only ones), resulting in uneven usage
of cores and lower throughput.

> 
> 
>> From c4d51ab87b92f9900e37d42cf74980e87b648a56 Mon Sep 17 00:00:00 2001
>> From: Tomas Vondra <tomas@vondra.me>
>> Date: Sun, 8 Jun 2025 18:53:12 +0200
>> Subject: [PATCH v2 5/7] NUMA: clockweep partitioning
>>
> 
> 
>> @@ -475,13 +525,17 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
>>      /*
>>       * Nothing on the freelist, so run the "clock sweep" algorithm
>>       *
>> -     * XXX Should we also make this NUMA-aware, to only access buffers from
>> -     * the same NUMA node? That'd probably mean we need to make the clock
>> -     * sweep NUMA-aware, perhaps by having multiple clock sweeps, each for a
>> -     * subset of buffers. But that also means each process could "sweep" only
>> -     * a fraction of buffers, even if the other buffers are better candidates
>> -     * for eviction. Would that also mean we'd have multiple bgwriters, one
>> -     * for each node, or would one bgwriter handle all of that?
>> +     * XXX Note that ClockSweepTick() is NUMA-aware, i.e. it only looks at
>> +     * buffers from a single partition, aligned with the NUMA node. That
>> +     * means it only accesses buffers from the same NUMA node.
>> +     *
>> +     * XXX That also means each process "sweeps" only a fraction of buffers,
>> +     * even if the other buffers are better candidates for eviction. Maybe
>> +     * there should be some logic to "steal" buffers from other freelists
>> +     * or other nodes?
> 
> I think we *definitely* need "stealing" from other clock sweeps, whenever
> there's a meaningful imbalance between the different sweeps.
> 
> I don't think we need to be overly precise about it, a small imbalance won't
> have that much of an effect. But clearly it doesn't make sense to say that one
> backend can only fill buffers in the current partition, that'd lead to massive
> performance issues in a lot of workloads.
> 

Agreed.

> The hardest thing probably is to make the logic for when to check foreign
> clock sweeps cheap enough.
> 
> One way would be to do it whenever a sweep wraps around, that'd probably
> amortize the cost sufficiently, and I don't think it'd be too imprecise, as
> we'd have processed that set of buffers in a row without partitioning as
> well. But it'd probably be too coarse when determining for how long to use a
> foreign sweep instance. But we probably could address that by rechecking the
> balance more frequently when using a foreign partition.
> 

What do you mean by "it"? What would happen after a sweep wraps around?

> Another way would be to have bgwriter manage this. Whenever it detects that
> one ring is too far ahead, it could set a "avoid this partition" bit, which
> would trigger backends that natively use that partition to switch to foreign
> partitions that don't currently have that bit set.  I suspect there's a
> problem with that approach though, I worry that the amount of time that
> bgwriter spends in BgBufferSync() may sometimes be too long, leading to too
> much imbalance.
> 

I'm afraid having hard "avoid" flags would lead to sudden and unexpected
changes in performance as we enable/disable partitions. I think a good
solution should "smooth it out" somehow, e.g. by not having a true/false
flag, but having some sort of "preference" factor with values between
(0.0, 1.0) which says how much we should use that partition.

I was imagining something like this:

Say we know the number of buffers allocated for each partition (in the
last round), and we (or rather the BgBufferSync) calculate:

    coefficient = 1.0 - (nallocated_partition / nallocated)

and then use that to "correct" which partition to allocate buffers from.
Or maybe just watch how far from the "fair share" we were in the last
interval, and gradually increase/decrease the "partition preference"
which would say how often we need to "steal" from other partitions.

E.g. we find nallocated_partition is 2x the fair share, i.e.

   nallocated_partition / (nallocated / nparts) = 2.0

Then we say 25% of the time look at some other partition, to "cut" the
imbalance in half. And then repeat that in the next cycle, etc.

So a process would look at its "home partition" by default, but it'd
"roll a dice" first, and if above the calculated probability it'd pick
some other partition instead (this would need to be done so that it gets
balanced overall).

If the bgwriter interval is too long, maybe the recalculation could be
triggered regularly after any of the clocksweeps wraps around, or after
some number of allocations, or something like that.


regards

-- 
Tomas Vondra




Re: Adding basic NUMA awareness

From
Andres Freund
Date:
Hi,

On 2025-07-18 22:48:00 +0200, Tomas Vondra wrote:
> On 7/18/25 18:46, Andres Freund wrote:
> >> For a read-write pgbench I however saw some strange drops/increases of
> >> throughput. I suspect this might be due to some thinko in the clocksweep
> >> partitioning, but I'll need to take a closer look.
> > 
> > Was that with pinning etc enabled or not?
> > 
> 
> IIRC it was with everything enabled, except for numa_procs_pin (which
> pins backend to NUMA node). I found that to actually harm performance in
> some of the tests (even just read-only ones), resulting in uneven usage
> of cores and lower throughput.

FWIW, I really doubt that something like numa_procs_pin is viable outside of
very narrow niches until we have a *lot* more infrastructure in place. Like PG
would need to be threaded, we'd need a separation between thread and
connection and an executor that'd allow us to switch from working on one query
to working on another query.


> > The hardest thing probably is to make the logic for when to check foreign
> > clock sweeps cheap enough.
> > 
> > One way would be to do it whenever a sweep wraps around, that'd probably
> > amortize the cost sufficiently, and I don't think it'd be too imprecise, as
> > we'd have processed that set of buffers in a row without partitioning as
> > well. But it'd probably be too coarse when determining for how long to use a
> > foreign sweep instance. But we probably could address that by rechecking the
> > balance more frequently when using a foreign partition.
> > 
> 
> What do you mean by "it"?

it := Considering switching back from using a "foreign" clock sweep instance
whenever the sweep wraps around.


> What would happen after a sweep wraps around?

The scenario I'm worried about is this:

1) a bunch of backends read buffers on numa node A, using the local clock
   sweep instance

2) due to all of that activity, the clock sweep advances much faster than the
   clock sweep for numa node B

3) the clock sweep on A wraps around, we discover the imbalance, and all the
   backend switch to scanning on numa node B, moving that clock sweep ahead
   much more aggressively

4) clock sweep on B wraps around, there's imbalance the other way round now,
   so they all switch back to A



> > Another way would be to have bgwriter manage this. Whenever it detects that
> > one ring is too far ahead, it could set a "avoid this partition" bit, which
> > would trigger backends that natively use that partition to switch to foreign
> > partitions that don't currently have that bit set.  I suspect there's a
> > problem with that approach though, I worry that the amount of time that
> > bgwriter spends in BgBufferSync() may sometimes be too long, leading to too
> > much imbalance.
> > 
> 
> I'm afraid having hard "avoid" flags would lead to sudden and unexpected
> changes in performance as we enable/disable partitions. I think a good
> solution should "smooth it out" somehow, e.g. by not having a true/false
> flag, but having some sort of "preference" factor with values between
> (0.0, 1.0) which says how much we should use that partition.

Yea, I think that's a fair worry.


> I was imagining something like this:
> 
> Say we know the number of buffers allocated for each partition (in the
> last round), and we (or rather the BgBufferSync) calculate:
> 
>     coefficient = 1.0 - (nallocated_partition / nallocated)
> 
> and then use that to "correct" which partition to allocate buffers from.
> Or maybe just watch how far from the "fair share" we were in the last
> interval, and gradually increase/decrease the "partition preference"
> which would say how often we need to "steal" from other partitions.
> 
> E.g. we find nallocated_partition is 2x the fair share, i.e.
> 
>    nallocated_partition / (nallocated / nparts) = 2.0
> 
> Then we say 25% of the time look at some other partition, to "cut" the
> imbalance in half. And then repeat that in the next cycle, etc.
> 
> So a process would look at its "home partition" by default, but it'd
> "roll a dice" first, and if above the calculated probability it'd pick
> some other partition instead (this would need to be done so that it gets
> balanced overall).

That does sound reasonable.


> If the bgwriter interval is too long, maybe the recalculation could be
> triggered regularly after any of the clocksweeps wraps around, or after
> some number of allocations, or something like that.

I'm pretty sure the bgwriter doesn't run often enough, or predictably
frequently enough, for that.

Greetings,

Andres Freund



Re: Adding basic NUMA awareness

From
Jakub Wartak
Date:
On Thu, Jul 17, 2025 at 11:15 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> On 7/4/25 20:12, Tomas Vondra wrote:
> > On 7/4/25 13:05, Jakub Wartak wrote:
> >> ...
> >>
> >> 8. v1-0005 2x + /* if (numa_procs_interleave) */
> >>
> >>    Ha! it's a TRAP! I've uncommented it because I wanted to try it out
> >> without it (just by setting GUC off) , but "MyProc->sema" is NULL :
> >>
> >>     2025-07-04 12:31:08.103 CEST [28754] LOG:  starting PostgreSQL
> >> 19devel on x86_64-linux, compiled by gcc-12.2.0, 64-bit
> >>     [..]
> >>     2025-07-04 12:31:08.109 CEST [28754] LOG:  io worker (PID 28755)
> >> was terminated by signal 11: Segmentation fault
> >>     2025-07-04 12:31:08.109 CEST [28754] LOG:  terminating any other
> >> active server processes
> >>     2025-07-04 12:31:08.114 CEST [28754] LOG:  shutting down because
> >> "restart_after_crash" is off
> >>     2025-07-04 12:31:08.116 CEST [28754] LOG:  database system is shut down
> >>
> >>     [New LWP 28755]
> >>     [Thread debugging using libthread_db enabled]
> >>     Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> >>     Core was generated by `postgres: io worker                     '.
> >>     Program terminated with signal SIGSEGV, Segmentation fault.
> >>     #0  __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0)
> >> at ./nptl/sem_waitcommon.c:136
> >>     136     ./nptl/sem_waitcommon.c: No such file or directory.
> >>     (gdb) where
> >>     #0  __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0)
> >> at ./nptl/sem_waitcommon.c:136
> >>     #1  __new_sem_trywait (sem=sem@entry=0x0) at ./nptl/sem_wait.c:81
> >>     #2  0x00005561918e0cac in PGSemaphoreReset (sema=0x0) at
> >> ../src/backend/port/posix_sema.c:302
> >>     #3  0x0000556191970553 in InitAuxiliaryProcess () at
> >> ../src/backend/storage/lmgr/proc.c:992
> >>     #4  0x00005561918e51a2 in AuxiliaryProcessMainCommon () at
> >> ../src/backend/postmaster/auxprocess.c:65
> >>     #5  0x0000556191940676 in IoWorkerMain (startup_data=<optimized
> >> out>, startup_data_len=<optimized out>) at
> >> ../src/backend/storage/aio/method_worker.c:393
> >>     #6  0x00005561918e8163 in postmaster_child_launch
> >> (child_type=child_type@entry=B_IO_WORKER, child_slot=20086,
> >> startup_data=startup_data@entry=0x0,
> >>         startup_data_len=startup_data_len@entry=0,
> >> client_sock=client_sock@entry=0x0) at
> >> ../src/backend/postmaster/launch_backend.c:290
> >>     #7  0x00005561918ea09a in StartChildProcess
> >> (type=type@entry=B_IO_WORKER) at
> >> ../src/backend/postmaster/postmaster.c:3973
> >>     #8  0x00005561918ea308 in maybe_adjust_io_workers () at
> >> ../src/backend/postmaster/postmaster.c:4404
> >>     [..]
> >>     (gdb) print *MyProc->sem
> >>     Cannot access memory at address 0x0
> >>
> >
> > Yeah, good catch. I'll look into that next week.
> >
>
> I've been unable to reproduce this issue, but I'm not sure what settings
> you actually used for this instance. Can you give me more details how to
> reproduce this?

Better late than never - feel free to partially ignore me, I've missed
that it's a known issue as per the FIXME there, but I would just rip
out that commented-out `if (numa_procs_interleave)` from
FastPathLockShmemSize() and PGProcShmemSize(), unless you want to save
those memory pages of course (in the no-NUMA case). If you do want to
save those pages, I think we have a problem:

For complete picture, steps:

1. patch -p1 < v2-0001-NUMA-interleaving-buffers.patch
2. patch -p1 < v2-0006-NUMA-interleave-PGPROC-entries.patch

BTW the accidental pgbench indent is still there (part of the v2-0001 patch):
14 out of 14 hunks FAILED -- saving rejects to file
src/bin/pgbench/pgbench.c.rej

3. As I'm just applying 0001 and 0006, I've got two simple rejects,
but fixed them (due to not applying the missing numa_ freelist
patches). That was intentional on my part, because I wanted to play
with just those two.

4. Then I uncommented those two "if (numa_procs_interleave)" checks
guarding the optional shm memory initialization - add_size() and so on
(the ones with the XXX comment above saying they cause bootstrap issues)

5. initdb with numa_procs_interleave=on, huge_pages = on (!), start, it is ok

6. restart with numa_procs_interleave=off, which gets every bg worker
crashing, e.g.:

(gdb) where
#0  __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0) at
./nptl/sem_waitcommon.c:136
#1  __new_sem_trywait (sem=sem@entry=0x0) at ./nptl/sem_wait.c:81
#2  0x0000563e2d6e4d5c in PGSemaphoreReset (sema=0x0) at
../src/backend/port/posix_sema.c:302
#3  0x0000563e2d774d93 in InitAuxiliaryProcess () at
../src/backend/storage/lmgr/proc.c:995
#4  0x0000563e2d6e9252 in AuxiliaryProcessMainCommon () at
../src/backend/postmaster/auxprocess.c:65
#5  0x0000563e2d6eb683 in CheckpointerMain (startup_data=<optimized
out>, startup_data_len=<optimized out>) at
../src/backend/postmaster/checkpointer.c:190
#6  0x0000563e2d6ec363 in postmaster_child_launch
(child_type=child_type@entry=B_CHECKPOINTER, child_slot=249,
startup_data=startup_data@entry=0x0,
    startup_data_len=startup_data_len@entry=0,
client_sock=client_sock@entry=0x0) at
../src/backend/postmaster/launch_backend.c:290
#7  0x0000563e2d6ee29a in StartChildProcess
(type=type@entry=B_CHECKPOINTER) at
../src/backend/postmaster/postmaster.c:3973
#8  0x0000563e2d6f17a6 in PostmasterMain (argc=argc@entry=3,
argv=argv@entry=0x563e377cc0e0) at
../src/backend/postmaster/postmaster.c:1386
#9  0x0000563e2d4948fc in main (argc=3, argv=0x563e377cc0e0) at
../src/backend/main/main.c:231

notice sema=0x0, because:
#3  0x000056050928cd93 in InitAuxiliaryProcess () at
../src/backend/storage/lmgr/proc.c:995
995             PGSemaphoreReset(MyProc->sem);
(gdb) print MyProc
$1 = (PGPROC *) 0x7f09a0c013b0
(gdb) print MyProc->sem
$2 = (PGSemaphore) 0x0

or with printfs:

2025-07-25 11:17:23.683 CEST [21772] LOG:  in InitProcGlobal
PGPROC=0x7f9de827b880 requestSize=148770
// after proc && ptr manipulation:
2025-07-25 11:17:23.683 CEST [21772] LOG:  in InitProcGlobal
PGPROC=0x7f9de827bdf0 requestSize=148770 procs=0x7f9de827b880
ptr=0x7f9de827bdf0
[..initialization of aux PGPROCs i=0.., still from InitProcGlobal(),
each gets a proper sem allocated, as one would expect:]
[..for i loop:]
2025-07-25 11:17:23.689 CEST [21772] LOG:  i=136 ,
proc=0x7f9de8600000, proc->sem=0x7f9da4e04438
2025-07-25 11:17:23.689 CEST [21772] LOG:  i=137 ,
proc=0x7f9de8600348, proc->sem=0x7f9da4e044b8
2025-07-25 11:17:23.689 CEST [21772] LOG:  i=138 ,
proc=0x7f9de8600690, proc->sem=0x7f9da4e04538
[..but then in the children codepaths, out of the blue in
InitAuxiliaryProcess, the whole MyProc looks like it was memset to
zeros:]
2025-07-25 11:17:23.693 CEST [21784] LOG:  auxiliary process using
MyProc=0x7f9de8600000 auxproc=0x7f9de8600000 proctype=0
MyProcPid=21784 MyProc->sem=(nil)

The above got PGPROC slot i=136 with addr 0x7f9de8600000, and later
that auxiliary process is launched, but somehow something NULLified
->sem there (according to gdb, everything is zero there)

7. Original patch v2-0006 (with commented out 2x if
numa_procs_interleave), behaves OK, so in my case here with 1x NUMA
node that gives add_size(.., 1+1 * 2MB)=4MB

2025-07-25 11:38:54.131 CEST [23939] LOG:  in InitProcGlobal
PGPROC=0x7f25cbe7b880 requestSize=4343074
2025-07-25 11:38:54.132 CEST [23939] LOG:  in InitProcGlobal
PGPROC=0x7f25cbe7bdf0 requestSize=4343074 procs=0x7f25cbe7b880
ptr=0x7f25cbe7bdf0

So something is apparently zeroing out all those MyProc structures on
startup (probably due to some wrong alignment somewhere?). I was
thinking about trapping this single i=136 PGPROC at 0x7f9de8600000 via
mprotect() to see what is resetting it, but oh well, mprotect() works
only on whole pages...

-J.



Re: Adding basic NUMA awareness

From
Tomas Vondra
Date:

On 7/25/25 12:27, Jakub Wartak wrote:
> On Thu, Jul 17, 2025 at 11:15 PM Tomas Vondra <tomas@vondra.me> wrote:
>>
>> On 7/4/25 20:12, Tomas Vondra wrote:
>>> On 7/4/25 13:05, Jakub Wartak wrote:
>>>> ...
>>>>
>>>> 8. v1-0005 2x + /* if (numa_procs_interleave) */
>>>>
>>>>    Ha! it's a TRAP! I've uncommented it because I wanted to try it out
>>>> without it (just by setting GUC off) , but "MyProc->sema" is NULL :
>>>>
>>>>     2025-07-04 12:31:08.103 CEST [28754] LOG:  starting PostgreSQL
>>>> 19devel on x86_64-linux, compiled by gcc-12.2.0, 64-bit
>>>>     [..]
>>>>     2025-07-04 12:31:08.109 CEST [28754] LOG:  io worker (PID 28755)
>>>> was terminated by signal 11: Segmentation fault
>>>>     2025-07-04 12:31:08.109 CEST [28754] LOG:  terminating any other
>>>> active server processes
>>>>     2025-07-04 12:31:08.114 CEST [28754] LOG:  shutting down because
>>>> "restart_after_crash" is off
>>>>     2025-07-04 12:31:08.116 CEST [28754] LOG:  database system is shut down
>>>>
>>>>     [New LWP 28755]
>>>>     [Thread debugging using libthread_db enabled]
>>>>     Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
>>>>     Core was generated by `postgres: io worker                     '.
>>>>     Program terminated with signal SIGSEGV, Segmentation fault.
>>>>     #0  __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0)
>>>> at ./nptl/sem_waitcommon.c:136
>>>>     136     ./nptl/sem_waitcommon.c: No such file or directory.
>>>>     (gdb) where
>>>>     #0  __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0)
>>>> at ./nptl/sem_waitcommon.c:136
>>>>     #1  __new_sem_trywait (sem=sem@entry=0x0) at ./nptl/sem_wait.c:81
>>>>     #2  0x00005561918e0cac in PGSemaphoreReset (sema=0x0) at
>>>> ../src/backend/port/posix_sema.c:302
>>>>     #3  0x0000556191970553 in InitAuxiliaryProcess () at
>>>> ../src/backend/storage/lmgr/proc.c:992
>>>>     #4  0x00005561918e51a2 in AuxiliaryProcessMainCommon () at
>>>> ../src/backend/postmaster/auxprocess.c:65
>>>>     #5  0x0000556191940676 in IoWorkerMain (startup_data=<optimized
>>>> out>, startup_data_len=<optimized out>) at
>>>> ../src/backend/storage/aio/method_worker.c:393
>>>>     #6  0x00005561918e8163 in postmaster_child_launch
>>>> (child_type=child_type@entry=B_IO_WORKER, child_slot=20086,
>>>> startup_data=startup_data@entry=0x0,
>>>>         startup_data_len=startup_data_len@entry=0,
>>>> client_sock=client_sock@entry=0x0) at
>>>> ../src/backend/postmaster/launch_backend.c:290
>>>>     #7  0x00005561918ea09a in StartChildProcess
>>>> (type=type@entry=B_IO_WORKER) at
>>>> ../src/backend/postmaster/postmaster.c:3973
>>>>     #8  0x00005561918ea308 in maybe_adjust_io_workers () at
>>>> ../src/backend/postmaster/postmaster.c:4404
>>>>     [..]
>>>>     (gdb) print *MyProc->sem
>>>>     Cannot access memory at address 0x0
>>>>
>>>
>>> Yeah, good catch. I'll look into that next week.
>>>
>>
>> I've been unable to reproduce this issue, but I'm not sure what settings
>> you actually used for this instance. Can you give me more details how to
>> reproduce this?
> 
> Better late than never, well feel free to partially ignore me, i've
> missed that it is known issue as per FIXME there, but I would just rip
> out that commented out `if(numa_proc_interleave)` from
> FastPathLockShmemSize() and PGProcShmemSize() unless you want to save
> those memory pages of course (in case of no-NUMA). If you do want to
> save those pages I think we have problem:
> 
> For complete picture, steps:
> 
> 1. patch -p1 < v2-0001-NUMA-interleaving-buffers.patch
> 2. patch -p1 < v2-0006-NUMA-interleave-PGPROC-entries.patch
> 
> BTW the accidental pgbench indent is still there (part of the v2-0001 patch)
> 14 out of 14 hunks FAILED -- saving rejects to file
> src/bin/pgbench/pgbench.c.rej
> 
> 3. As I'm just applying 0001 and 0006, I've got two simple rejects,
> but fixed it (due to not applying missing numa_ freelist patches).
> That's intentional on my part, because I wanted to play just with
> those two.
> 
> 4. Then I uncomment those two "if (numa_procs_interleave)" related for
> optional memory shm initialization - add_size() and so on (that have
> XXX comment above that it is causing bootstrap issues)
> 

Ah, I didn't realize you uncommented these "if" conditions. In that case
the crash is not very surprising, because the actual initialization in
InitProcGlobal ignores the GUCs and just assumes it's enabled. But
without the extra padding that likely messes up something, or something
allocated later overwrites some of the memory.

I need to clean this up, to actually consider the GUC properly.

FWIW I do have a new patch version that I plan to share in a day or two,
once I get some numbers. It didn't change this particular part, though,
it's more about the buffers/freelists/clocksweep. I'll work on PGPROC
next, I think.


regards

-- 
Tomas Vondra




Re: Adding basic NUMA awareness

From
Tomas Vondra
Date:
Hi,

Here's a somewhat cleaned up v3 of this patch series, with various
improvements and a lot of cleanup. Still WIP, but I hope it resolves the
various crashes reported for v2, but it still requires --with-libnuma
(it won't build without it).

I'm aware there's an ongoing discussion about removing the freelists,
and changing the clocksweep in some way. If that happens, the relevant
parts of this series will need some adjustment, of course. I haven't
looked into that yet, I plan to review those patches soon.


main changes in v3
------------------

1) I've introduced "registry" of the buffer partitions (imagine a small
array of structs), serving as a source of truth for places that need
info about the partitions (range of buffers, ...).

With v2 there was no "shared definition" - the shared buffers, freelist
and clocksweep did their own thing. But per the discussion it doesn't
really make much sense to partition buffers in different ways.

So in v3 the 0001 patch defines the partitions, records them in shared
memory (in a small array), and the later parts just reuse this.

I also added a pg_buffercache_partitions() listing the partitions, with
first/last buffer, etc. The freelist/clocksweep patches add additional
information.


2) The PGPROC part introduces a similar registry, even though there are
no other patches building on this. But it seemed useful to have a clear
place recording this info.

There's also a view pg_buffercache_pgproc. The pg_buffercache location
is a bit bogus - it has nothing to do with buffers, but it was good
enough for now.


3) The PGPROC partitioning is reworked and should fix the crash with the
GUC set to "off".


4) This still doesn't do anything about "balancing" the clocksweep. I
have some ideas how to do that, I'll work on that next.


simple benchmark
----------------

I did a simple benchmark, measuring pgbench throughput with scale still
fitting into RAM, but much larger (~2x) than shared buffers. See the
attached test script, testing builds with more and more of the patches.

I'm attaching results from two different machines (the "usual" 2P xeon
and also a much larger cloud instance with EPYC/Genoa) - both the raw
CSV files, with average tps and percentiles, and PDFs. The PDFs also
have a comparison either to the "preceding" build (right side), or to
master (below the table).

There are results for the three "pgbench pinning" strategies, and the
pinning can have a pretty significant impact (colocated generally
performs much better than either "none" or "random").

For the "bigger" machine (with 176 cores) the incremental results look
like this (for pinning=none, i.e. regular pgbench):


      mode   s_b buffers localal no-tail freelist sweep pgproc pinning
  ====================================================================
  prepared  16GB     99%    101%    100%     103%  111%    99%    102%
            32GB     98%    102%     99%     103%  107%   101%    112%
             8GB     97%    102%    100%     102%  101%   101%    106%
  --------------------------------------------------------------------
    simple  16GB    100%    100%     99%     105%  108%    99%    108%
            32GB     98%    101%    100%     103%  100%   101%     97%
             8GB    100%    100%    101%      99%  100%   104%    104%

The way I read this is that the first three patches have almost no
impact on throughput. Then freelist partitioning and (especially)
clocksweep partitioning can help quite a bit. pgproc is again close to
~0%, and PGPROC pinning can help again (but this part is merely
experimental).

For the xeon the differences (in either direction) are much smaller, so
I'm not going to post it here. It's in the PDF, though.

I think this looks reasonable. The way I see this patch series is not
about improving peak throughput, but more about reducing imbalance and
making the behavior more consistent.

The results are more a confirmation there's not some sort of massive
overhead, somewhere. But I'll get to this in a minute.

To quantify this kind of improvement, I think we'll need tests that
intentionally cause (or try to) imbalance. If you have ideas for such
tests, let me know.


overhead of partitioning calculation
------------------------------------

Regarding the "overhead", while the results look mostly OK, I think
we'll need to rethink the partitioning scheme - particularly how the
partition size is calculated. The current scheme has to use %, which can
be somewhat expensive.

The 0001 patch calculates a "chunk size", which is the smallest number
of buffers it can "assign" to a NUMA node. This depends on how many
buffer descriptors fit onto a single memory page, and it's either 512KB
(with 4KB pages), or 256MB (with 2MB huge pages). And then each NUMA
node gets multiple chunks, to cover shared_buffers/num_nodes. But this
can be an arbitrary number - it minimizes the imbalance, but it also
forces the use of % and / in the formulas.
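For concreteness, the chunk sizes quoted above follow directly from how
many buffer descriptors fit on one memory page. A quick sketch,
assuming 8KB buffers and 64-byte descriptors (the real values come from
BLCKSZ and sizeof(BufferDesc)):

```c
/* A chunk covers the buffers whose descriptors fill one memory page.
 * Assumed sizes: 8KB buffers (BLCKSZ), 64-byte (padded) descriptors. */
static long
chunk_bytes(long mem_page_size, long desc_size, long buffer_size)
{
    return (mem_page_size / desc_size) * buffer_size;
}
```

With 4KB pages that's 64 descriptors per page, i.e. 512KB of buffers;
with 2MB huge pages it's 32768 descriptors, i.e. 256MB.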

AFAIK if we required the partitions to be 2^k multiples of the chunk
size, we could switch to using shifts and masking. Which is supposed to
be much faster. But I haven't measured this, and the cost is that some
of the nodes could get much less memory. Maybe that's fine.
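A minimal sketch of the difference, with a made-up 2^16-buffer
partition size (the patch would derive the shift from the chosen size):

```c
#include <stdint.h>

/* Made-up partition size for illustration: 2^16 buffers. */
#define PART_SHIFT 16
#define PART_SIZE  (UINT32_C(1) << PART_SHIFT)
#define PART_MASK  (PART_SIZE - 1)

/* generic scheme: division and modulo */
static uint32_t part_div(uint32_t buf) { return buf / PART_SIZE; }
static uint32_t off_mod(uint32_t buf)  { return buf % PART_SIZE; }

/* power-of-two scheme: shift and mask, typically much cheaper */
static uint32_t part_shr(uint32_t buf) { return buf >> PART_SHIFT; }
static uint32_t off_and(uint32_t buf)  { return buf & PART_MASK; }
```

(For a compile-time power-of-two divisor the compiler does this
rewrite itself; the cost matters here because the chunk-multiple
partition size is only known at runtime.)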


reserving number of huge pages
------------------------------

The other thing I realized is that partitioning buffers with huge pages
is quite tricky, and can easily lead to SIGBUS when accessing the memory
later. The crashes I saw happen like this:

1) figure # of pages needed (using shared_memory_size_in_huge_pages)

   This can be 16828 for shared_buffers=32GB.

2) make sure there's enough huge pages

   echo 16828 > /proc/sys/vm/nr_hugepages

3) start postgres - everything seems to work just fine

4) query pg_buffercache_numa - triggers SIGBUS accessing memory for a
valid buffer (usually ~2GB from the end)

It took me ages to realize what's happening, but it's very simple. The
nr_hugepages is a global limit, but it's also translated into limits for
each NUMA node. So when you write 16828 to it, in a 4-node system each
node gets 1/4 of that. See

  $ numastat -cm

Then we do the mmap(), and everything looks great, because there really
are enough huge pages and the system can allocate memory from any NUMA
node it needs.

And then we come around, and do the numa_tonode_memory(). And that's
where the issues start, because AFAIK this does not check the per-node
limit of huge pages in any way. It just appears to work. And then later,
when we finally touch the buffer, it tries to actually allocate the
memory on the node, and realizes there's not enough huge pages. And
triggers the SIGBUS.
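To spot the mismatch before it bites, one can compare the global pool
with the per-node pools (a hypothetical check using the standard Linux
sysfs counters, shown here for 2MB pages):

```shell
# Global pool - this is what shared_memory_size_in_huge_pages sizes.
echo "global: $(cat /proc/sys/vm/nr_hugepages 2>/dev/null || echo n/a)"

# Per-node pools - numa_tonode_memory() can only be satisfied from
# the target node's own pool, hence the SIGBUS when it runs dry.
for f in /sys/devices/system/node/node*/hugepages/hugepages-2048kB/nr_hugepages; do
    if [ -e "$f" ]; then
        echo "$f: $(cat "$f")"
    fi
done
```

If those per-node sysfs files are writable on the system, reserving
extra pages on specific nodes may be a more targeted alternative to
inflating the global count.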

You may ask why the per-node limit is too low. We still need just
shared_memory_size_in_huge_pages, right? And if we were partitioning the
whole memory segment, that'd be true. But we only do that for shared
buffers, and there's a lot of other shared memory - it could be 1-2GB or
so, depending on the configuration.

And this gets placed on one of the nodes, and it counts against the
limit on that particular node. And so it doesn't have enough huge pages
to back the partition of shared buffers.

The only way around this I found is by inflating the number of huge
pages, significantly above the shared_memory_size_in_huge_pages value.
Just to make sure the nodes get enough huge pages.

I don't know what to do about this. It's quite annoying. If we only used
huge pages for the partitioned parts, this wouldn't be a problem.

I also realize this can be used to make sure the memory is balanced on
NUMA systems. Because if you set nr_hugepages, the kernel will ensure
the shared memory is distributed on all the nodes.

It won't have the benefits of "coordinating" the buffers and buffer
descriptors, and so on. But it will be balanced.


regards

-- 
Tomas Vondra

Attachments

Re: Adding basic NUMA awareness

From
Jakub Wartak
Date:
On Mon, Jul 28, 2025 at 4:22 PM Tomas Vondra <tomas@vondra.me> wrote:

Hi Tomas,

just a quick look here:

> 2) The PGPROC part introduces a similar registry, [..]
>
> There's also a view pg_buffercache_pgproc. The pg_buffercache location
> is a bit bogus - it has nothing to do with buffers, but it was good
> enough for now.

If you are looking for better names: pg_shmem_pgproc_numa would sound
like a more natural name.

> 3) The PGPROC partitioning is reworked and should fix the crash with the
> GUC set to "off".

Thanks!

> simple benchmark
> ----------------
[..]
> There's results for the three "pgbench pinning" strategies, and that can
> have pretty significant impact (colocated generally performs much better
> than either "none" or "random").

Hint: in the real world, network cards are usually located in a PCI
slot assigned to a certain node (so traffic flows from/to there), so it
would probably make sense to put pgbench outside this machine, removing
this "variable" and the need for the pgbench --pin-cpus in the script.
Under optimal conditions, the most optimized layout would probably be
two cards in separate PCI slots, each for a different node, with LACP
between them and an xmit_hash_policy allowing traffic distribution over
both cards -- usually there's not just a single IP/MAC talking to/from
such a server, so that would be real-world affinity (or lack of it).

Also, the classic pgbench workload seems to be a poor fit for testing
this (at least the v3-0001 buffers); there I would propose sticking to
lots of big (~s_b-sized) full table seq scans to put stress on shared
memory. Classic pgbench is usually not enough to put serious bandwidth
on the interconnect, by my measurements.

> For the "bigger" machine (wiuth 176 cores) the incremental results look
> like this (for pinning=none, i.e. regular pgbench):
>
>
>       mode   s_b buffers localal no-tail freelist sweep pgproc pinning
>   ====================================================================
>   prepared  16GB     99%    101%    100%     103%  111%    99%    102%
>             32GB     98%    102%     99%     103%  107%   101%    112%
>              8GB     97%    102%    100%     102%  101%   101%    106%
>   --------------------------------------------------------------------
>     simple  16GB    100%    100%     99%     105%  108%    99%    108%
>             32GB     98%    101%    100%     103%  100%   101%     97%
>              8GB    100%    100%    101%      99%  100%   104%    104%
>
> The way I read this is that the first three patches have about no impact
> on throughput. Then freelist partitioning and (especially) clocksweep
> partitioning can help quite a bit. pgproc is again close to ~0%, and
> PGPROC pinning can help again (but this part is merely experimental).

Isn't the "pinning" column representing just numa_procs_pin=on ?
(shouldn't it be tested with numa_procs_interleave = on?)

[..]
> To quantify this kind of improvement, I think we'll need tests that
> intentionally cause (or try to) imbalance. If you have ideas for such
> tests, let me know.

Some ideas:
1. concurrent seq scans hitting s_b-sized table
2. one single giant PX-enabled seq scan with $VCPU workers (stresses
the importance of interleaving dynamic shm for workers)
3. select txid_current() with -M prepared?

> reserving number of huge pages
> ------------------------------
[..]
> It took me ages to realize what's happening, but it's very simple. The
> nr_hugepages is a global limit, but it's also translated into limits for
> each NUMA node. So when you write 16828 to it, in a 4-node system each
> node gets 1/4 of that. See
>
>   $ numastat -cm
>
> Then we do the mmap(), and everything looks great, because there really
> is enough huge pages and the system can allocate memory from any NUMA
> node it needs.

Yup, similar story as with OOMs, just per-zone/node.

> And then we come around, and do the numa_tonode_memory(). And that's
> where the issues start, because AFAIK this does not check the per-node
> limit of huge pages in any way. It just appears to work. And then later,
> when we finally touch the buffer, it tries to actually allocate the
> memory on the node, and realizes there's not enough huge pages. And
> triggers the SIGBUS.

I think that's why the options for a strict NUMA allocation policy
exist, and I had the option to use it in my patches (anyway, with one
big call to numa_interleave_memory() for everything it was much simpler
and just not micromanaging things). Good reads are numa(3), but e.g.
mbind(2) underneath will tell you that `Before Linux 5.7,
MPOL_MF_STRICT was ignored on huge page mappings.` (I was on 6.14.x,
but it could be happening for you too if you start using it). Anyway,
numa_set_strict() is just a wrapper around setting this exact flag.

Anyway, remember that volatile pg_numa_touch_mem_if_required()? Maybe
that should always be called in your patch series to pre-populate
everything during startup, so that others testing this will get the
proper guaranteed layout, even without issuing any pg_buffercache
calls.

> The only way around this I found is by inflating the number of huge
> pages, significantly above the shared_memory_size_in_huge_pages value.
> Just to make sure the nodes get enough huge pages.
>
> I don't know what to do about this. It's quite annoying. If we only used
> huge pages for the partitioned parts, this wouldn't be a problem.

Meh, sacrificing a couple of huge pages (worst-case 1GB ?) just to get
NUMA affinity, seems like a logical trade-off, doesn't it?
But postgres -C shared_memory_size_in_huge_pages  still works OK to
establish the exact count for vm.nr_hugepages, right?

Regards,
-J.



Re: Adding basic NUMA awareness

From
Tomas Vondra
Date:
On 7/30/25 10:29, Jakub Wartak wrote:
> On Mon, Jul 28, 2025 at 4:22 PM Tomas Vondra <tomas@vondra.me> wrote:
> 
> Hi Tomas,
> 
> just a quick look here:
> 
>> 2) The PGPROC part introduces a similar registry, [..]
>>
>> There's also a view pg_buffercache_pgproc. The pg_buffercache location
>> is a bit bogus - it has nothing to do with buffers, but it was good
>> enough for now.
> 
> If you are looking for better names: pg_shmem_pgproc_numa would sound
> like a more natural name.
> 
>> 3) The PGPROC partitioning is reworked and should fix the crash with the
>> GUC set to "off".
> 
> Thanks!
> 
>> simple benchmark
>> ----------------
> [..]
>> There's results for the three "pgbench pinning" strategies, and that can
>> have pretty significant impact (colocated generally performs much better
>> than either "none" or "random").
> 
> Hint: real world is that network cards are usually located on some PCI
> slot that is assigned to certain node (so traffic is flowing from/to
> there), so probably it would make some sense to put pgbench outside
> this machine and remove this as "variable" anyway and remove the need
> for that pgbench --pin-cpus in script. In optimal conditions: most
> optimized layout would be probably to have 2 cards on separate PCI
> slots, each for different node and some LACP between those, with
> xmit_hash_policy allowing traffic distribution on both of those cards
> -- usually there's not just single IP/MAC out there talking to/from
> such server, so that would be real-world (or lack of) affinity.
> 

The pgbench pinning certainly reduces some of the noise / overhead you
get when using multiple machines. I use it to "isolate" patches, and
make the effects more visible.

> Also classic pgbench workload, seems to be poor fit for testing it out
> (at least v3-0001 buffers), there I would propose sticking to just
> lots of big (~s_b size) full table seq scans to put stress on shared
> memory. Classic pgbench is usually not there enough to put serious
> bandwidth on the interconnect by my measurements.
> 

Yes, that's possible. The simple pgbench workload is a bit of a "worst
case" for the NUMA patches, in that it can benefit less from the
improvements, and it's also fairly sensitive to regressions.

I plan to do more tests with other types of workloads, like the one
doing a lot of large sequential scans, etc.

>> For the "bigger" machine (wiuth 176 cores) the incremental results look
>> like this (for pinning=none, i.e. regular pgbench):
>>
>>
>>       mode   s_b buffers localal no-tail freelist sweep pgproc pinning
>>   ====================================================================
>>   prepared  16GB     99%    101%    100%     103%  111%    99%    102%
>>             32GB     98%    102%     99%     103%  107%   101%    112%
>>              8GB     97%    102%    100%     102%  101%   101%    106%
>>   --------------------------------------------------------------------
>>     simple  16GB    100%    100%     99%     105%  108%    99%    108%
>>             32GB     98%    101%    100%     103%  100%   101%     97%
>>              8GB    100%    100%    101%      99%  100%   104%    104%
>>
>> The way I read this is that the first three patches have about no impact
>> on throughput. Then freelist partitioning and (especially) clocksweep
>> partitioning can help quite a bit. pgproc is again close to ~0%, and
>> PGPROC pinning can help again (but this part is merely experimental).
> 
> Isn't the "pinning" column representing just numa_procs_pin=on ?
> (shouldn't it be tested with numa_procs_interleave = on?)
> 

Maybe I don't understand the question, but the last column (pinning)
compares two builds.

1) Build with all the patches up to "pgproc interleaving" (and all of
the GUCs set to "on").

2) Build with all the patches from (1), and "pinning" too (again, all
GUCs set to "on).

Or do I misunderstand the question?

> [..]
>> To quantify this kind of improvement, I think we'll need tests that
>> intentionally cause (or try to) imbalance. If you have ideas for such
>> tests, let me know.
> 
> Some ideas:
> 1. concurrent seq scans hitting s_b-sized table
> 2. one single giant PX-enabled seq scan with $VCPU workers (stresses
> the importance of interleaving dynamic shm for workers)
> 3. select txid_current() with -M prepared?
> 

Thanks. I think we'll try something like (1), but it'll need to be a bit
more elaborate, because scans on tables larger than 1/4 shared buffers
use a small circular buffer.

>> reserving number of huge pages
>> ------------------------------
> [..]
>> It took me ages to realize what's happening, but it's very simple. The
>> nr_hugepages is a global limit, but it's also translated into limits for
>> each NUMA node. So when you write 16828 to it, in a 4-node system each
>> node gets 1/4 of that. See
>>
>>   $ numastat -cm
>>
>> Then we do the mmap(), and everything looks great, because there really
>> is enough huge pages and the system can allocate memory from any NUMA
>> node it needs.
> 
> Yup, similiar story as with OOMs just for per-zone/node.
> 
>> And then we come around, and do the numa_tonode_memory(). And that's
>> where the issues start, because AFAIK this does not check the per-node
>> limit of huge pages in any way. It just appears to work. And then later,
>> when we finally touch the buffer, it tries to actually allocate the
>> memory on the node, and realizes there's not enough huge pages. And
>> triggers the SIGBUS.
> 
> I think that's why options for strict policy numa allocation exist and
> I had the option to use it in my patches (anyway with one big call to
> numa_interleave_memory() for everything it was much more simpler and
> just not micromanaging things). Good reads are numa(3) but e.g.
> mbind(2) underneath will tell you that e.g. `Before Linux 5.7.
> MPOL_MF_STRICT was ignored on huge page mappings.` (I was on 6.14.x,
> but it could be happening for you too if you start using it). Anyway,
> numa_set_strict() is just wrapper around setting this exact flag
> 
> Anyway remember that volatile pg_numa_touch_mem_if_required()? - maybe
> that should be always called in your patch series to pre-populate
> everything during startup, so that others testing will get proper
> guaranteed layout, even without issuing any pg_buffercache calls.
> 

I think I tried using numa_set_strict, but it didn't change the behavior
(i.e. the numa_tonode_memory didn't error out).

>> The only way around this I found is by inflating the number of huge
>> pages, significantly above the shared_memory_size_in_huge_pages value.
>> Just to make sure the nodes get enough huge pages.
>>
>> I don't know what to do about this. It's quite annoying. If we only used
>> huge pages for the partitioned parts, this wouldn't be a problem.
> 
> Meh, sacrificing a couple of huge pages (worst-case 1GB ?) just to get
> NUMA affinity, seems like a logical trade-off, doesn't it?
> But postgres -C shared_memory_size_in_huge_pages  still works OK to
> establish the exact count for vm.nr_hugepages, right?
> 

Well, yes and no. It tells you the exact number of huge pages, but it
does not tell you how much you need to inflate it to account for the
non-shared buffer part that may get allocated on a random node.


regards

-- 
Tomas Vondra




Re: Adding basic NUMA awareness

From
Tomas Vondra
Date:

Hi,

Here's an updated version of the patch series. The main improvement is
the new 0006 patch, adding "adaptive balancing" of allocations. I'll
also share some results from a workload doing a lot of allocations.


adaptive balancing of allocations
---------------------------------

Imagine each backend only allocates buffers from the partition on the
same NUMA node. E.g. you have 4 NUMA nodes (i.e. 4 partitions), and a
backend only allocates buffers from "home" partition (on the same NUMA
node). This is what the earlier patch versions did, and with many
backends that's mostly fine (assuming the backends get spread over all
the NUMA nodes).

But if there's only few backends doing the allocations, this can result
in very inefficient use of shared buffers - a single backend would be
limited to 25% of buffers, even if the rest is unused.

There needs to be some way to "redirect" excess allocations to other
partitions, so that the partitions are utilized about the same. This is
what the 0006 patch aims to do (I kept it separate, but it should
probably get merged into the "clocksweep partitioning" in the end).

The balancing is fairly simple:

(1) It tracks the number of allocations "requested" from each partition.

(2) In regular intervals (by bgwriter) calculate the "fair share" per
partition, and determine what fraction of "requests" to handle from the
partition itself, and how many to redirect to other partitions.

(3) Calculate coefficients to drive this for each partition.


I emphasize that (1) talks about "requests", not the actual allocations.
Some of the requests could have been redirected to different partitions,
and be counted as allocations there. We want to balance allocations, but
the balancing relies on the request counts.

To give you a simple example - imagine there are 2 partitions with this
number of allocation requests:

  P1: 900,000 requests
  P2: 100,000 requests

This means the "fair share" is 500,000 allocations, so P1 needs to
redirect some requests to P2. And we end with these weights:

  P1: [ 55,  45]
  P2: [  0, 100]

Assuming the workload does not shift in some dramatic way, this should
result in both partitions handling ~500k allocations.

It's not hard to extend this algorithm to more partitions. For more
details see StrategySyncBalance(), which recalculates this.
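The recalculation can be sketched roughly like this (an illustrative
simplification, not the actual StrategySyncBalance()): each partition
keeps up to its fair share of requests, and spreads the excess over
partitions with spare capacity, proportionally to their deficit.
Weights are percentages:

```c
#define NPART 2     /* two partitions, matching the example above */

/* w[i][j] = percentage of partition i's requests handled by j */
static void
compute_weights(const long req[NPART], double w[NPART][NPART])
{
    long total = 0, fair, total_deficit = 0;
    long deficit[NPART];

    for (int i = 0; i < NPART; i++)
        total += req[i];
    fair = total / NPART;

    for (int i = 0; i < NPART; i++)
    {
        deficit[i] = (req[i] < fair) ? fair - req[i] : 0;
        total_deficit += deficit[i];
    }

    for (int i = 0; i < NPART; i++)
    {
        long excess = (req[i] > fair) ? req[i] - fair : 0;

        for (int j = 0; j < NPART; j++)
            w[i][j] = 0.0;

        if (req[i] == 0)
        {
            w[i][i] = 100.0;    /* nothing to redirect */
            continue;
        }

        /* keep the fair share locally */
        w[i][i] = 100.0 * (req[i] - excess) / req[i];

        /* spread the excess proportionally to the others' deficits */
        if (excess > 0 && total_deficit > 0)
            for (int j = 0; j < NPART; j++)
                if (deficit[j] > 0)
                    w[i][j] = 100.0 * excess * deficit[j]
                            / ((double) total_deficit * req[i]);
    }
}
```

For the 900k/100k example this yields roughly the [55, 45] / [0, 100]
weights above; the averaging with the previous weights (the hysteresis
mentioned below) is left out of this sketch.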

There are a couple open questions, like:

* The algorithm combines the old/new weights by averaging, to add a bit
of hysteresis. Right now it's a simple average with 0.5 weight, to
dampen sudden changes. I think it works fine (in the long run), but I'm
open to suggestions how to do this better.

* There's probably additional things we should consider when deciding
where to redirect the allocations. For example, we may have multiple
partitions per NUMA node, in which case it's better to redirect to that
node as many allocations as possible. The current patch ignores this.

* The partitions may have slightly different sizes, but the balancing
ignores that for now. This is not very difficult to address.


clocksweep benchmark
--------------------

I ran a simple benchmark focused on allocation-heavy workloads, namely
large concurrent sequential scans. The attached scripts generate a
number of 1GB tables, and then run concurrent sequential scans with
shared buffers set to 60%, 75%, 90% and 110% of the total dataset size.

I did this for master, and with the NUMA patches applied (and the GUCs
set to 'on'). I also tried with the number of partitions increased to
16 (so a NUMA node got multiple partitions).

There are results from three machines:

1) ryzen - small non-NUMA system, mostly to see if there's regressions

2) xeon - older 2-node NUMA system

3) hb176 - big EPYC system with 176 cores / 4 NUMA nodes


The script records detailed TPS stats (e.g. percentiles), I'm attaching
CSV files with complete results, and some PDFs with charts summarizing
that (I'll get to that in a minute).

For the EPYC, the average tps for the three builds looks like this:


    clients  |  master      numa   numa-16    |    numa     numa-16
   ----------|--------------------------------|---------------------
          8  |      20        27        26    |    133%        129%
         16  |      23        39        45    |    170%        193%
         24  |      23        48        58    |    211%        252%
         32  |      21        57        68    |    268%        321%
         40  |      21        56        76    |    265%        363%
         48  |      22        59        82    |    270%        375%
         56  |      22        66        88    |    296%        397%
         64  |      23        62        93    |    277%        411%
         72  |      24        68        95    |    277%        389%
         80  |      24        72        95    |    295%        391%
         88  |      25        71        98    |    283%        392%
         96  |      26        74        97    |    282%        369%
        104  |      26        74        97    |    282%        367%
        112  |      27        77        95    |    287%        355%
        120  |      28        77        92    |    279%        335%
        128  |      27        75        89    |    277%        328%

That's not bad - the clocksweep partitioning increases the throughput
2-3x. Having 16 partitions (instead of 4) helps a bit more, to 3-4x.

This is for shared buffers set to 60% of the dataset, which depends on
the number of clients / tables. With 64 clients/tables, there's 64GB of
data, and shared buffers are set to ~39GB.

The results for 75% and 90% follow the same pattern. For 110% there's
much less impact - there are no allocations, so this has to be thanks to
the other NUMA patches.

The charts in the attached PDFs add a bit more detail, with various
percentiles (of per-second throughput). The bands are roughly quartiles:
5-25%, 25-50%, 50-75%, 75-95%. The thick middle line is the median.

There are only charts for 60%, 90% and 110% shared buffers, to fit it
all on a single page. The 75% case is not very different.

For ryzen there's little difference. Not surprising, it's not a NUMA
system. So this is a positive result, as there's no regression.

For xeon the patches help a little bit. Again, not surprising. It's a
fairly old system (~2016), and the differences between NUMA nodes are
not that significant.

For epyc (hb176), the differences are pretty massive.



regards

-- 
Tomas Vondra
Attachments

Re: Adding basic NUMA awareness

From
Tomas Vondra
Date:
Hi!

Here's a slightly improved version of the patch series.

The main improvement is related to rebalancing partitions of different
sizes (which can happen because the sizes have to be a multiple of some
minimal "chunk" determined by memory page size etc.). Part 0009 deals
with that by adjusting the allocations by partition size. It works OK,
but it's also true it matters less as the shared_buffers size increases
(as the relative difference between large/small partition gets smaller).


The other improvements are related to the pg_buffercache_partitions
view, showing the weights and (running) totals of allocations.

I plan to take a break from this patch series for a while, so this would
be a good time to take a look, do a review, run some tests etc. ;-)


One detail about the balancing I forgot to mention in my last message is
how the patch "distributes" allocations to match the balancing weights.
Consider the example weights from that message:

  P1: [ 55,  45]
  P2: [  0, 100]

Imagine a backend located on P1 requests allocation of a buffer. The
weights say 55% buffers should be allocated from P1, and 45% should be
redirected to P2. One way to achieve that would be generating a random
number in [1, 100], and if it's [1,55] then P1, otherwise P2.

The patch does a much simpler thing - it treats the weight as a
"budget", i.e. the number of buffers to allocate before proceeding to
the "next" partition. So it allocates 55 buffers from P1, then 45
buffers from P2, and then goes back to P1 in a round-robin way. The
advantage is it can do without a PRNG.
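A minimal sketch of that budget scheme (hypothetical names, not the
actual patch code):

```c
#include <assert.h>

#define MAX_PARTITIONS 8

/*
 * Each weight is the number of buffers to allocate from that partition
 * before moving on to the next one, round-robin.  No PRNG needed.
 */
typedef struct AllocBudget
{
    int     nparts;
    int     weights[MAX_PARTITIONS];
    int     cur;            /* partition currently being drawn from */
    int     remaining;      /* budget left in the current partition */
} AllocBudget;

static void
budget_init(AllocBudget *b, const int *weights, int nparts)
{
    assert(nparts > 0 && nparts <= MAX_PARTITIONS);
    b->nparts = nparts;
    for (int i = 0; i < nparts; i++)
        b->weights[i] = weights[i];
    b->cur = 0;
    b->remaining = weights[0];
}

/* Returns the partition to allocate the next buffer from. */
static int
choose_partition(AllocBudget *b)
{
    /* skip partitions with exhausted/zero budget (some weight must be > 0) */
    while (b->remaining == 0)
    {
        b->cur = (b->cur + 1) % b->nparts;
        b->remaining = b->weights[b->cur];
    }
    b->remaining--;
    return b->cur;
}
```

With weights {55, 45}, the first 55 calls return partition 0, the next
45 return partition 1, and then it wraps around to partition 0 again.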


There are two things I'm not entirely sure about:

1) memory model - I'm not quite sure the current code ensures updates to
the weights are properly "communicated" to the other processes. That is,
if the bgwriter recalculates the weights, will the other backends see
the new weights right away? Using stale weights won't cause "failures",
the consequence is just a bit of imbalance. But it shouldn't stay like
that for too long, so maybe it'd be good to add some memory barriers or
something like that.
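One conventional pattern for this, sketched here with C11 atomics
rather than PostgreSQL's pg_atomic API (names hypothetical): publish a
version counter with release semantics after updating the weights, and
load it with acquire semantics before reading them.

```c
#include <stdatomic.h>

#define NUM_WEIGHTS 2

typedef struct BalanceWeights
{
    int          weights[NUM_WEIGHTS];
    atomic_uint  version;
} BalanceWeights;

/*
 * Writer (bgwriter): store the new weights, then bump the version with
 * a release fence, so a reader that sees the new version is guaranteed
 * to also see the new weights.
 */
static void
publish_weights(BalanceWeights *bw, const int *new_weights)
{
    for (int i = 0; i < NUM_WEIGHTS; i++)
        bw->weights[i] = new_weights[i];
    atomic_fetch_add_explicit(&bw->version, 1, memory_order_release);
}

/*
 * Reader (backend): acquire-load the version, then read the weights.
 * A reader racing with an update may still see a mix of old and new
 * weights, but per the discussion above that only means a bit of
 * temporary imbalance, not a correctness failure.
 */
static unsigned
read_weights(BalanceWeights *bw, int *out)
{
    unsigned v = atomic_load_explicit(&bw->version, memory_order_acquire);

    for (int i = 0; i < NUM_WEIGHTS; i++)
        out[i] = bw->weights[i];
    return v;
}
```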

2) I'm a bit unsure what "NUMA nodes" actually means. The patch mostly
assumes each core / piece of RAM is assigned to a particular NUMA node.
For the buffer partitioning the patch mostly cares about memory, as it
"locates" the buffers on different NUMA nodes. Which works mostly OK
(ignoring the issues with huge pages described in previous message).

But it also cares about the cores (and the node for each core), because
it uses that to pick the right partition for a backend. And here the
situation is less clear, because the CPUs don't need to be assigned to a
particular node, even on a NUMA system. Consider the rpi5 NUMA layout:

$ numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3
node 0 size: 992 MB
node 0 free: 274 MB
node 1 cpus: 0 1 2 3
node 1 size: 1019 MB
node 1 free: 327 MB
node 2 cpus: 0 1 2 3
node 2 size: 1019 MB
node 2 free: 321 MB
node 3 cpus: 0 1 2 3
node 3 size: 955 MB
node 3 free: 251 MB
node 4 cpus: 0 1 2 3
node 4 size: 1019 MB
node 4 free: 332 MB
node 5 cpus: 0 1 2 3
node 5 size: 1019 MB
node 5 free: 342 MB
node 6 cpus: 0 1 2 3
node 6 size: 1019 MB
node 6 free: 352 MB
node 7 cpus: 0 1 2 3
node 7 size: 1014 MB
node 7 free: 339 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  10  10  10  10  10  10  10
  1:  10  10  10  10  10  10  10  10
  2:  10  10  10  10  10  10  10  10
  3:  10  10  10  10  10  10  10  10
  4:  10  10  10  10  10  10  10  10
  5:  10  10  10  10  10  10  10  10
  6:  10  10  10  10  10  10  10  10
  7:  10  10  10  10  10  10  10  10

This says there are 8 NUMA nodes, each with ~1GB of RAM. But the 4 cores
are not assigned to particular nodes - each core is mapped to all 8 NUMA
nodes. I'm not sure what to do about this (or how getcpu() or libnuma
handle this). And can the situation be even more complicated?


regards

-- 
Tomas Vondra

Attachments

Re: Adding basic NUMA awareness

From
Tomas Vondra
Date:
On 8/7/25 11:24, Tomas Vondra wrote:
> Hi!
> 
> Here's a slightly improved version of the patch series.
> 

Ah, I made a mistake when generating the patches. The 0001 and 0002
patches are not part of the NUMA stuff, it's just something related to
benchmarking (addressing unrelated bottlenecks etc.). The actual NUMA
patches start with 0003.

Also, 0007, 0008 and 0009 should ultimately be collapsed into a single
patch. It's all about the clocksweep partitioning, I only kept those
separate to make it easier to see the changes and review.


regards

-- 
Tomas Vondra




Re: Adding basic NUMA awareness

From
Andres Freund
Date:
Hi,

On 2025-08-07 11:24:18 +0200, Tomas Vondra wrote:
> 2) I'm a bit unsure what "NUMA nodes" actually means. The patch mostly
> assumes each core / piece of RAM is assigned to a particular NUMA node.

There are systems in which some NUMA nodes do *not* contain any CPUs. E.g. if
you attach memory via a CXL/PCIe add-in card, rather than via the CPUs memory
controller. In that case numactl -H (and obviously also the libnuma APIs) will
report that the numa node is not associated with any CPU.

I don't currently have live access to such a system, but this PR piece happens
to have numactl -H output:
https://lenovopress.lenovo.com/lp2184-implementing-cxl-memory-on-linux-on-thinksystem-v4-servers
> numactl -H
> available: 4 nodes (0-3)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
> 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 96
> 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115
> 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133
> 134 135 136 137 138 139 140 141 142 143
> node 0 size: 1031904 MB
> node 0 free: 1025554 MB
> node 1 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67
> 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91
> 92 93 94 95 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158
> 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176
> 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
> node 1 size: 1032105 MB
> node 1 free: 1024244 MB
> node 2 cpus:
> node 2 size: 262144 MB
> node 2 free: 262143 MB
> node 3 cpus:
> node 3 size: 262144 MB
> node 3 free: 262142 MB
> node distances:
> node   0   1   2   3
>   0:  10  21  14  24
>   1:  21  10  24  14
>   2:  14  24  10  26
>   3:  24  14  26  10

Note that node 2 & 3 don't have associated CPUs (and higher access costs).

I don't think this is common enough to worry about from a performance POV, but
we probably shouldn't crash if we encounter it...


> But it also cares about the cores (and the node for each core), because
> it uses that to pick the right partition for a backend. And here the
> situation is less clear, because the CPUs don't need to be assigned to a
> particular node, even on a NUMA system. Consider the rpi5 NUMA layout:
>
> $ numactl --hardware
> available: 8 nodes (0-7)
> node 0 cpus: 0 1 2 3
> node 0 size: 992 MB
> node 0 free: 274 MB
> node 1 cpus: 0 1 2 3
> node 1 size: 1019 MB
> node 1 free: 327 MB
> ...
> node   0   1   2   3   4   5   6   7
>   0:  10  10  10  10  10  10  10  10
>   1:  10  10  10  10  10  10  10  10
>   2:  10  10  10  10  10  10  10  10
>   3:  10  10  10  10  10  10  10  10
>   4:  10  10  10  10  10  10  10  10
>   5:  10  10  10  10  10  10  10  10
>   6:  10  10  10  10  10  10  10  10
>   7:  10  10  10  10  10  10  10  10
> This says there are 8 NUMA nodes, each with ~1GB of RAM. But the 4 cores
> are not assigned to particular nodes - each core is mapped to all 8 NUMA
> nodes.

FWIW, you can get a different version of this with AMD Epyc too, if "L3 LLC as
NUMA" is enabled.


> I'm not sure what to do about this (or how getcpu() or libnuma handle this).

I don't immediately see any libnuma functions that would care?

I also am somewhat curious about what getcpu() returns for the current node...

Greetings,

Andres Freund



Re: Adding basic NUMA awareness

From
Tomas Vondra
Date:

On 8/9/25 02:25, Andres Freund wrote:
> Hi,
> 
> On 2025-08-07 11:24:18 +0200, Tomas Vondra wrote:
>> 2) I'm a bit unsure what "NUMA nodes" actually means. The patch mostly
>> assumes each core / piece of RAM is assigned to a particular NUMA node.
> 
> There are systems in which some NUMA nodes do *not* contain any CPUs. E.g. if
> you attach memory via a CXL/PCIe add-in card, rather than via the CPUs memory
> controller. In that case numactl -H (and obviously also the libnuma APIs) will
> report that the numa node is not associated with any CPU.
> 
> I don't currently have live access to such a system, but this PR piece happens
> to have numactl -H output:
> https://lenovopress.lenovo.com/lp2184-implementing-cxl-memory-on-linux-on-thinksystem-v4-servers
>> numactl -H
>> available: 4 nodes (0-3)
>> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
>> 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 96
>> 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115
>> 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133
>> 134 135 136 137 138 139 140 141 142 143
>> node 0 size: 1031904 MB
>> node 0 free: 1025554 MB
>> node 1 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67
>> 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91
>> 92 93 94 95 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158
>> 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176
>> 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
>> node 1 size: 1032105 MB
>> node 1 free: 1024244 MB
>> node 2 cpus:
>> node 2 size: 262144 MB
>> node 2 free: 262143 MB
>> node 3 cpus:
>> node 3 size: 262144 MB
>> node 3 free: 262142 MB
>> node distances:
>> node   0   1   2   3
>>   0:  10  21  14  24
>>   1:  21  10  24  14
>>   2:  14  24  10  26
>>   3:  24  14  26  10
> 
> Note that node 2 & 3 don't have associated CPUs (and higher access costs).
> 
> I don't think this is common enough to worry about from a performance POV, but
> we probably shouldn't crash if we encounter it...
> 

Right. I don't think the current patch would crash - I can't test it,
but I don't see why it would crash. In the worst case it'd end up with
partitions that are not ideal. The question is more what would an ideal
partitioning for buffers and PGPROC look like. Any opinions?

For PGPROC, it's simple - it doesn't make sense to allocate partitions
for nodes without CPUs.

For buffers, a node without CPUs probably does not matter that much -
it does not mean we should avoid putting buffers on it. After all, CXL
will never have any CPUs (at least I think that's the case), and not
using it for shared buffers would be a bit strange. Although, it could
still be used for page cache.

Maybe it should be "tiered" a bit more? The patch differentiates only
between partitions on "my" NUMA node vs. every other partition. Maybe it
should have more layers?

That'd make the "balancing" harder. But I wanted to make it a smarter
when handling cases with multiple partitions per NUMA node.

> 
>> But it also cares about the cores (and the node for each core), because
>> it uses that to pick the right partition for a backend. And here the
>> situation is less clear, because the CPUs don't need to be assigned to a
>> particular node, even on a NUMA system. Consider the rpi5 NUMA layout:
>>
>> $ numactl --hardware
>> available: 8 nodes (0-7)
>> node 0 cpus: 0 1 2 3
>> node 0 size: 992 MB
>> node 0 free: 274 MB
>> node 1 cpus: 0 1 2 3
>> node 1 size: 1019 MB
>> node 1 free: 327 MB
>> ...
>> node   0   1   2   3   4   5   6   7
>>   0:  10  10  10  10  10  10  10  10
>>   1:  10  10  10  10  10  10  10  10
>>   2:  10  10  10  10  10  10  10  10
>>   3:  10  10  10  10  10  10  10  10
>>   4:  10  10  10  10  10  10  10  10
>>   5:  10  10  10  10  10  10  10  10
>>   6:  10  10  10  10  10  10  10  10
>>   7:  10  10  10  10  10  10  10  10
>> This says there are 8 NUMA nodes, each with ~1GB of RAM. But the 4 cores
>> are not assigned to particular nodes - each core is mapped to all 8 NUMA
>> nodes.
> 
> FWIW, you can get a different version of this with AMD Epyc too, if "L3 LLC as
> NUMA" is enabled.
> 
> 
>> I'm not sure what to do about this (or how getcpu() or libnuma handle this).
> 
> I don't immediately see any libnuma functions that would care?
> 

Not sure what "care" means here. I don't think it's necessarily broken,
it's more about the APIs not making the situation very clear (or
convenient).

How do you determine nodes for a CPU, for example? The closest thing I
see is numa_node_of_cpu(), but that only returns a single node. Or how
would you determine the number of nodes with CPUs (so that we create
PGPROC partitions only for those)? I suppose that requires literally
walking all the nodes.

> I also am somewhat curious about what getcpu() returns for the current node...
> 

It seems it only returns node 0. The cpu changes, but the node does not.

regards

-- 
Tomas Vondra




Re: Adding basic NUMA awareness

From
Andres Freund
Date:
Hi,

On 2025-08-12 13:04:07 +0200, Tomas Vondra wrote:
> Right. I don't think the current patch would crash - I can't test it,
> but I don't see why it would crash. In the worst case it'd end up with
> partitions that are not ideal. The question is more what would an ideal
> partitioning for buffers and PGPROC look like. Any opinions?
> 
> For PGPROC, it's simple - it doesn't make sense to allocate partitions
> for nodes without CPUs.
> 
> For buffers, it probably does not really matter if a node does not have
> any CPUs. If a node does not have any CPUs, that does not mean we should
> not put any buffers on it. After all, CXL will never have any CPUs (at
> least I think that's the case), and not using it for shared buffers
> would be a bit strange. Although, it could still be used for page cache.

For CXL memory to be really usable, I think we'd need nontrivial additional
work. CXL memory has considerably higher latency and lower throughput. You'd
*never* want things like BufferDescs or such on such nodes. And even the
buffered data itself, you'd want to make sure that frequently used data,
e.g. inner index pages, never end up on it.

Which leads to:

> Maybe it should be "tiered" a bit more?

Yes, for proper CXL support, we'd need a component that explicitly demotes and
promotes pages from "real" memory to CXL memory and the other way round. The
demotion is relatively easy, you'd probably just do it whenever you'd
otherwise throw out a victim buffer. When to promote back is harder...


> The patch differentiates only between partitions on "my" NUMA node vs. every
> other partition. Maybe it should have more layers?

Given the relative unavailability of CXL memory systems, I think just not
crashing is good enough for now...


> >> I'm not sure what to do about this (or how getcpu() or libnuma handle this).
> > 
> > I don't immediately see any libnuma functions that would care?
> > 
> 
> Not sure what "care" means here. I don't think it's necessarily broken,
> it's more about the APIs not making the situation very clear (or
> convenient).

What I mean is that I was looking through the libnuma functions and didn't see
any that would be affected by having multiple "local" NUMA nodes. But:


> How do you determine nodes for a CPU, for example? The closest thing I
> see is numa_node_of_cpu(), but that only returns a single node. Or how
> would you determine the number of nodes with CPUs (so that we create
> PGPROC partitions only for those)? I suppose that requires literally
> walking all the nodes.

I didn't think of numa_node_of_cpu().

As long as numa_node_of_cpu() returns *something* I think it may be good
enough. Nobody uses an RPi for high-throughput postgres workloads with a lot
of memory. Slightly sub-optimal mappings should really not matter.

I'm kinda wondering if we should deal with such fake numa systems by detecting
them and disabling our numa support.

Greetings,

Andres Freund



Re: Adding basic NUMA awareness

From
Tomas Vondra
Date:

On 8/12/25 16:24, Andres Freund wrote:
> Hi,
> 
> On 2025-08-12 13:04:07 +0200, Tomas Vondra wrote:
>> Right. I don't think the current patch would crash - I can't test it,
>> but I don't see why it would crash. In the worst case it'd end up with
>> partitions that are not ideal. The question is more what would an ideal
>> partitioning for buffers and PGPROC look like. Any opinions?
>>
>> For PGPROC, it's simple - it doesn't make sense to allocate partitions
>> for nodes without CPUs.
>>
>> For buffers, it probably does not really matter if a node does not have
>> any CPUs. If a node does not have any CPUs, that does not mean we should
>> not put any buffers on it. After all, CXL will never have any CPUs (at
>> least I think that's the case), and not using it for shared buffers
>> would be a bit strange. Although, it could still be used for page cache.
> 
> For CXL memory to be really usable, I think we'd need nontrivial additional
> work. CXL memory has considerably higher latency and lower throughput. You'd
> *never* want things like BufferDescs or such on such nodes. And even the
> buffered data itself, you'd want to make sure that frequently used data,
> e.g. inner index pages, never end up on it.
> 

OK, let's keep that out of scope for these patches and assume we're
dealing only with local memory. CXL could still be used by the OS for
page cache, or whatever.

What does that mean for the patch, though. Does it need a way to
configure which nodes to use? I argued to leave this to the OS/numactl,
and we'd just use whatever is made available to Postgres. But maybe
we'll need something within Postgres after all?

FWIW there's work needed to actually inherit NUMA info from the OS.
Right now the patches just use all NUMA nodes, indexed 0 ... (N-1) etc.
I like the "registry" concept I used for buffer/PGPROC partitions, it
made the patches much simpler. Maybe we should use something like that
for NUMA info too. That is, at startup build a record of the NUMA
layout, and use that as the source of truth everywhere (instead of
calling libnuma from all those places).

> Which leads to:
> 
>> Maybe it should be "tiered" a bit more?
> 
> Yes, for proper CXL support, we'd need a component that explicitly demotes and
> promotes pages from "real" memory to CXL memory and the other way round. The
> demotion is relatively easy, you'd probably just do it whenever you'd
> otherwise throw out a victim buffer. When to promote back is harder...
> 

Sounds very much like page cache (but that only works for buffered I/O).

> 
>> The patch differentiates only between partitions on "my" NUMA node vs. every
>> other partition. Maybe it should have more layers?
> 
> Given the relative unavailability of CXL memory systems, I think just not
> crashing is good enough for now...
> 

The lowest of bars ;-)

> 
>>>> I'm not sure what to do about this (or how getcpu() or libnuma handle this).
>>>
>>> I don't immediately see any libnuma functions that would care?
>>>
>>
>> Not sure what "care" means here. I don't think it's necessarily broken,
>> it's more about the APIs not making the situation very clear (or
>> convenient).
> 
> What I mean is that I was looking through the libnuma functions and didn't see
> any that would be affected by having multiple "local" NUMA nodes. But:
> 

My question is a bit of a "reverse" to this. That is, how do we even
find (with libnuma) there are multiple local nodes?

> 
>> How do you determine nodes for a CPU, for example? The closest thing I
>> see is numa_node_of_cpu(), but that only returns a single node. Or how
>> would you determine the number of nodes with CPUs (so that we create
>> PGPROC partitions only for those)? I suppose that requires literally
>> walking all the nodes.
> 
> I didn't think of numa_node_of_cpu().
> 

Yeah. I think most of the libnuma API is designed for each CPU belonging
to a single NUMA node. I suppose we'd need to use numa_node_to_cpus() to
build this kind of information ourselves.
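That walk could look roughly like this. To keep the sketch
self-contained, node_to_cpus() is a stub standing in for libnuma's
numa_node_to_cpus(), and the fake layout (CPUs 2n and 2n+1 on node n)
is purely illustrative:

```c
#include <stdbool.h>

#define MAX_CPUS   8
#define MAX_NODES  4

/*
 * Stand-in for numa_node_to_cpus(): fills "mask" with the CPUs that
 * belong to "node".  With libnuma this would query the real topology
 * via a numa bitmask instead.
 */
static void
node_to_cpus(int node, bool *mask)
{
    for (int cpu = 0; cpu < MAX_CPUS; cpu++)
        mask[cpu] = (cpu / 2 == node);
}

/*
 * Walk all nodes and record, for each CPU, the first node claiming it.
 * Also count how many nodes have at least one CPU (so that PGPROC
 * partitions could be created only for those).  Returns that count.
 */
static int
build_cpu_node_map(int *cpu_to_node, int ncpus, int nnodes)
{
    int     nodes_with_cpus = 0;

    for (int cpu = 0; cpu < ncpus; cpu++)
        cpu_to_node[cpu] = -1;

    for (int node = 0; node < nnodes; node++)
    {
        bool    mask[MAX_CPUS];
        bool    has_cpu = false;

        node_to_cpus(node, mask);
        for (int cpu = 0; cpu < ncpus; cpu++)
        {
            if (mask[cpu])
            {
                has_cpu = true;
                if (cpu_to_node[cpu] == -1)
                    cpu_to_node[cpu] = node;
            }
        }
        if (has_cpu)
            nodes_with_cpus++;
    }
    return nodes_with_cpus;
}
```

On a layout like the rpi5 one, where every node claims every CPU, this
would map each CPU to node 0 and report all nodes as "having CPUs" -
which is exactly the ambiguity discussed above.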

> As long as numa_node_of_cpu() returns *something* I think it may be good
> enough. Nobody uses an RPi for high-throughput postgres workloads with a lot
> of memory. Slightly sub-optimal mappings should really not matter.
> 

I'm not really concerned about rpi, or the performance on it. I only use
it as an example of system with "weird" NUMA layout.

> I'm kinda wondering if we should deal with such fake numa systems by detecting
> them and disabling our numa support.
> 

That'd be an option too, if we can identify such systems. We could do
that while building the "NUMA registry" I mentioned earlier.


regards

-- 
Tomas Vondra




Re: Adding basic NUMA awareness

From
Andres Freund
Date:
Hi,

On 2025-08-07 11:24:18 +0200, Tomas Vondra wrote:
> The patch does a much simpler thing - treat the weight as a "budget",
> i.e. number of buffers to allocate before proceeding to the "next"
> partition. So it allocates 55 buffers from P1, then 45 buffers from P2,
> and then goes back to P1 in a round-robin way. The advantage is it can
> do away without a PRNG.

I think that's a good plan.


A few comments about the clock sweep patch:

- It'd be easier to review if BgBufferSync() weren't basically re-indented
  wholesale. Maybe you could instead move the relevant code to a helper
  function that's called by BgBufferSync() for each clock?

- I think choosing a clock sweep partition in every tick would likely show up
  in workloads that do a lot of buffer replacement, particularly if buffers
  in the workload often have a high usagecount (and thus more ticks are used).
  Given that your balancing approach "sticks" with a partition for a while,
  could we perhaps only choose the partition after exhausting that budget?

- I don't really understand what

> +    /*
> +     * Buffers that should have been allocated in this partition (but might
> +     * have been redirected to keep allocations balanced).
> +     */
> +    pg_atomic_uint32 numRequestedAllocs;
> +

  is intended for.

  Adding yet another atomic increment for every clock sweep tick seems rather
  expensive...


- I wonder if the balancing budgets being relatively low will be good
  enough. It's not too hard to imagine that this frequent "partition choosing"
  will be bad in buffer access heavy workloads. But it's probably the right
  approach until we've measured it being a problem.


- It'd be interesting to do some very simple evaluation like a single
  pg_prewarm() of a relation that's close to the size of shared buffers and
  verify that we don't end up evicting newly read in buffers.  I think your
  approach should work, but verifying that...

  I wonder if we could make some of this into tests somehow. It's pretty easy
  to break this kind of thing and not notice, as everything just continues to
  work, just a tad slower.

Greetings,

Andres Freund



Re: Adding basic NUMA awareness

From
Tomas Vondra
Date:
On 8/13/25 17:16, Andres Freund wrote:
> Hi,
> 
> On 2025-08-07 11:24:18 +0200, Tomas Vondra wrote:
>> The patch does a much simpler thing - treat the weight as a "budget",
>> i.e. number of buffers to allocate before proceeding to the "next"
>> partition. So it allocates 55 buffers from P1, then 45 buffers from P2,
>> and then goes back to P1 in a round-robin way. The advantage is it can
>> do away without a PRNG.
> 
> I think that's a good plan.
> 
> 
> A few comments about the clock sweep patch:
> 
> - It'd be easier to review if BgBufferSync() weren't basically re-indented
>   wholesale. Maybe you could instead move the relevant code to a helper
>   function that's called by BgBufferSync() for each clock?
> 

True, I'll rework it like that.

> - I think choosing a clock sweep partition in every tick would likely show up
>   in workloads that do a lot of buffer replacement, particularly if buffers
>   in the workload often have a high usagecount (and thus more ticks are used).
>   Given that your balancing approach "sticks" with a partition for a while,
>   could we perhaps only choose the partition after exhausting that budget?
> 

That should be possible, yes. By "exhausting budget" you mean going
through all the partitions, right?

> - I don't really understand what
> 
>> +    /*
>> +     * Buffers that should have been allocated in this partition (but might
>> +     * have been redirected to keep allocations balanced).
>> +     */
>> +    pg_atomic_uint32 numRequestedAllocs;
>> +
> 
>   is intended for.
> 
>   Adding yet another atomic increment for every clock sweep tick seems rather
>   expensive...
> 

For the balancing (to calculate the budgets), we need to know the number
of allocation requests for each partition, before some of the requests
got redirected to other partitions. We can't use the number of "actual"
allocations. But it seems useful to have both - one to calculate the
budgets, the other to monitor how balanced the result is.

I haven't seen the extra atomic in profiles, even on workloads that do a
lot of buffer allocations (e.g. seqscan with datasets > shared buffers).
But if that happens, I think there are ways to mitigate that.

> 
> - I wonder if the balancing budgets being relatively low will be good
>   enough. It's not too hard to imagine that this frequent "partition choosing"
>   will be bad in buffer access heavy workloads. But it's probably the right
>   approach until we've measured it being a problem.
> 

I don't follow. How would making the budgets higher change any of this?

Anyway, I think choosing the partitions less frequently - e.g. only
after consuming budget for the current partition, or going "full cycle",
would make this a non-issue.

> 
> - It'd be interesting to do some very simple evaluation like a single
>   pg_prewarm() of a relation that's close to the size of shared buffers and
>   verify that we don't end up evicting newly read in buffers.  I think your
>   approach should work, but verifying that...
> 

Will try.

>   I wonder if we could make some of this into tests somehow. It's pretty easy
>   to break this kind of thing and not notice, as everything just continues to
>   work, just a tad slower.
> 

Do you mean a test that'd be a part of make check, or a standalone test?
AFAICS any meaningful test would need to be fairly expensive, so
probably not a good fit for make check.


regards

-- 
Tomas Vondra




Re: Adding basic NUMA awareness

From
Tomas Vondra
Date:
Hi,

Here's a fresh version of the NUMA patch series. There's a number of
substantial improvements:

1) Rebase to current master, particularly on top of 2c7894052759 which
removed the freelist. The patch that partitioned the freelist is gone.

2) A bunch of fixes, and it now passes CI workflows on github, and all
other testing I did. Of course, more testing is needed.

3) It builds with/without libnuma support, and so on.

4) The separate GUCs were replaced by a single list GUC, similar to what
we do for debug_io_direct. The GUC is called debug_numa, and accepts
"buffers" and "procs", to partition those two areas.

I'm considering adding a "clock-sweep" option in a future patch. I
didn't do that here, because the buffer partitioning is already enabled
by "buffers", and the clock-sweep just builds on that (partitioning the
same way, pretty much).

5) It also works with EXEC_BACKEND, but this turned out to be a bit more
challenging than expected. The trouble is some of the parameters (e.g.
memory page size) are used both in the "size" and "init" phases, and we
need to be extra careful to not make something "inconsistent".

For example, we may get confused about the memory page size. The "size"
happens before allocation, and at that point we don't know if we succeed
in getting enough huge pages. When "init" happens, we already know that,
so our "memory page size" could be different. We must be careful, e.g.
to not need more memory than we requested.

This is a general problem, but EXEC_BACKEND makes it a bit trickier. In
the regular fork() case we can simply set some global variables in
"size" and use them later in "init", but that doesn't work for
EXEC_BACKEND.

The current approach simply does the calculations twice, in a way that
should end with the same results. But I'm not 100% happy with it, and I
suspect it just confirms we should store the results in shmem memory
(which is what I called "NUMA registry" before). That's still a TODO.

6) BufferDescPadded is 64B everywhere. Originally, the padding was
applied only on 64-bit platforms, and on 32-bit systems the struct was
left at 60B. But that is not compatible with buffer partitioning, which
relies on the memory page size being a multiple of the BufferDescPadded
size.

Perhaps it could be relaxed (so that a BufferDesc might span two memory
pages), but this seems like the cleanest solution. I don't expect it to
make any measurable difference.

7) I've moved some of the code to BgBufferSyncPartition, to make review
easier (without all the indentation changes).

8) I've realized some of the TAP tests occasionally fail with

    ERROR: no unpinned buffers

and I think I know why. Some of the tests set shared_buffers to a very
low value - like 1MB or even 128kB, and StrategyGetBuffer() may search
only a single partition (but not always). We may run out of unpinned
buffers in that one partition.

This apparently happens more easily on rpi5, due to the weird NUMA
layout (there are 8 nodes with memory, but getcpu() reports node 0 for
all cores).

I suspect the correct fix is to ensure StrategyGetBuffer() scans all
partitions, if there are no unpinned buffers in the current one. On
realistic setups this shouldn't happen very often, I think.

The other issue I just realized is that StrategyGetBuffer() recalculates
the partition index over and over, which seems unnecessary (and possibly
expensive, due to the modulo). And it also does too many loops, because
it used NBuffers instead of the partition size. I'll fix those later.
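Both fixes can be sketched together; this is an invented simplification
(no locking, no usage counts), not the actual StrategyGetBuffer() code:

```c
#include <stddef.h>

/* Hypothetical sketch: compute the partition index once, bound the sweep
 * by the partition size rather than NBuffers, and fall through to the
 * next partition round-robin when the current one has no unpinned
 * buffers.  Names are illustrative. */
#define NUM_PARTITIONS 4

static int
find_unpinned_buffer(const int *pinned, int nbuffers, int start_partition)
{
    int per_part = nbuffers / NUM_PARTITIONS;

    for (int p = 0; p < NUM_PARTITIONS; p++)
    {
        /* round-robin over partitions, starting from start_partition */
        int part = (start_partition + p) % NUM_PARTITIONS;
        int first = part * per_part;

        /* scan at most one partition's worth of buffers, not NBuffers */
        for (int i = 0; i < per_part; i++)
        {
            int buf = first + i;

            if (!pinned[buf])
                return buf;
        }
    }
    return -1;                  /* no unpinned buffers anywhere */
}
```

Only after all partitions have been visited does the search give up, which
is what avoids the "no unpinned buffers" failure with tiny shared_buffers.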

9) I'm keeping the clock-sweep balancing patches separate for now. In the
end all the clock-sweep patches should be merged, but it keeps the
changes easier to review this way, I think.

10) There are not many changes in the PGPROC partitioning patch. I merely
fixed issues that broke it on EXEC_BACKEND, and did some smaller tweaks.

11) I'm not including the experimental patches to pin backends to CPUs
(or nodes), and so on. It's clear those are unlikely to go in, so it'd
be just a distraction.

12) What's the right / portable way to determine the current CPU for a
process? The clock-sweep / PGPROC patches need this to pick the right
partition, but it's not clear to me which API is the most portable. In
the end I used sched_getcpu(), and then numa_node_of_cpu(), but maybe
there's a better way.
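On Linux, the combination used in the patches could look like this minimal
sketch; sched_getcpu() is glibc/Linux-specific, and numa_node_of_cpu()
from libnuma is only mentioned in a comment to keep the example free of
the -lnuma dependency:

```c
#define _GNU_SOURCE
#include <sched.h>              /* sched_getcpu(), Linux/glibc */

/* Which CPU are we running on right now?  The result is only a hint: the
 * scheduler may migrate the process immediately after the call, which is
 * fine for picking a "preferred" partition.  To get the NUMA node, the
 * patch then calls numa_node_of_cpu(cpu) from libnuma (not shown here). */
static int
my_current_cpu(void)
{
    int cpu = sched_getcpu();

    /* sched_getcpu() can fail (returns -1) on old kernels; fall back to 0 */
    return (cpu >= 0) ? cpu : 0;
}
```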


regards

-- 
Tomas Vondra

Attachments

Re: Adding basic NUMA awareness

From
Tomas Vondra
Date:
On 9/11/25 10:32, Tomas Vondra wrote:
> ...
> 
> For example, we may get confused about the memory page size. The "size"
> happens before allocation, and at that point we don't know if we succeed
> in getting enough huge pages. When "init" happens, we already know that,
> so our "memory page size" could be different. We must be careful, e.g.
> to not need more memory than we requested.

I forgot to mention the other issue with huge pages on NUMA. I already
reported [1] it's trivial to crash with a SIGBUS, because

(1) huge pages get reserved on all NUMA nodes (evenly)

(2) the decision whether to use huge pages is done by mmap(), which only
needs to check if there are enough huge pages in total

(3) numa_tonode_memory is called later, and does not verify if the
target node has enough free pages (I'm not sure it should / can)

(4) we only partition (and locate to NUMA nodes) some of the memory, and
the rest (which is much smaller, but still sizeable) is likely causing
"imbalance" - it gets placed on one (random) node, and it then does not
have enough space for the stuff we explicitly placed there

(5) then at some point we try accessing one of the shared buffers, that
triggers page fault, tries to get a huge page on the NUMA node, realizes
there are no free huge pages, and crashes with SIGBUS

It clearly is not an option to just let it crash, but I still don't have
a great idea how to address it. The only idea I have is to manually
interleave the whole shared memory (when using huge pages), page by
page, so that this imbalance does not happen.

But it's harder than it looks, because we don't necessarily partition
everything evenly. For example, one node can get a smaller chunk of
shared buffers, because we try to partition buffers and buffer
descriptors in a "nice" way. The PGPROC stuff is also not distributed
quite evenly (e.g. aux/2pc entries are not mapped to any node).

A different approach would be to calculate how many per-node huge pages
we'll need for the stuff we partition explicitly (buffers and PGPROC),
add the rest of the memory that can get placed on any node, and then
reserve on every node that "maximum" number of pages. But that's
annoyingly wasteful, because every other node will end up with unusable
memory.
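The "reserve the maximum" approach amounts to simple arithmetic; a sketch
with invented numbers, assuming 2MB huge pages:

```c
/* Sketch of the "reserve the maximum on every node" approach: take the
 * largest explicitly-placed per-node chunk, assume all of the remaining
 * (unpartitioned) memory could land on that same node, and reserve that
 * many huge pages on every node.  2MB huge pages assumed; all numbers
 * illustrative. */
#define HUGE_PAGE_SIZE (2L * 1024 * 1024)

static long
pages_needed(long bytes)
{
    /* round up to whole huge pages */
    return (bytes + HUGE_PAGE_SIZE - 1) / HUGE_PAGE_SIZE;
}

static long
per_node_reservation(const long *node_bytes, int nnodes, long rest_bytes)
{
    long max_bytes = 0;

    for (int i = 0; i < nnodes; i++)
        if (node_bytes[i] > max_bytes)
            max_bytes = node_bytes[i];

    /* worst case: the whole unpartitioned chunk lands on the biggest node */
    return pages_needed(max_bytes + rest_bytes);
}
```

Every node reserves the worst-case count even though only one node can
actually receive the unpartitioned chunk; the difference on each of the
other nodes is the wasted memory.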


regards

[1]
https://www.postgresql.org/message-id/71a46484-053c-4b81-ba32-ddac050a8b5d%40vondra.me

-- 
Tomas Vondra




Re: Adding basic NUMA awareness

From
Tomas Vondra
Date:
On 9/11/25 10:32, Tomas Vondra wrote:
> ...
>
> 8) I've realized some of the TAP tests occasionally fail with
> 
>     ERROR: no unpinned buffers
> 
> and I think I know why. Some of the tests set shared_buffers to a very
> low value - like 1MB or even 128kB, and StrategyGetBuffer() may search
> only a single partition (but not always). We may run out of unpinned
> buffers in that one partition.
> 
> This apparently happens more easily on rpi5, due to the weird NUMA
> layout (there are 8 nodes with memory, but getcpu() reports node 0 for
> all cores).
> 
> I suspect the correct fix is to ensure StrategyGetBuffer() scans all
> partitions, if there are no unpinned buffers in the current one. On
> realistic setups this shouldn't happen very often, I think.
> 
> The other issue I just realized is that StrategyGetBuffer() recalculates
> the partition index over and over, which seems unnecessary (and possibly
> expensive, due to the modulo). And it also does too many loops, because
> it used NBuffers instead of the partition size. I'll fix those later.

Here's a version fixing this issue (in the 0006 part). It modifies
StrategyGetBuffer() to walk through all the partitions, in a round-robin
manner. The way it steps to the next partition is a bit ugly, but it
works and I'll think about a better way.

I haven't done anything about the other issue (the one with huge pages
reserved on NUMA nodes, and SIGBUS).

regards

-- 
Tomas Vondra
Attachments